Recent comments posted to this site:

comment 3

After a bug fix, it's now possible to make a sameas remote that is private to the local repository.

git-annex initremote bar --sameas=foo --private type=...

While not ephemeral as such, if you git remote remove bar, the only trace left of it will probably be in .git/annex/journal-private/remote.log, plus any creds that got cached for it. It would be possible to have a command that removes the remote and also clears that.

If that is close enough to ephemeral, then we could think about the second part, extending the external special remote protocol with REDIRECT-REMOTE.

That is similar to Special remote redirect to URL, and a few comments over there go in a similar direction, in particular the discussion of CLAIMURL. If TRANSFER-RETRIEVE-URL and TRANSFER-CHECKPRESENT-URL supported CLAIMURL, then the ephemeral special remote could claim some type of url, and those urls could be used rather than REDIRECT-REMOTE.
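
To make the CLAIMURL idea concrete, here is a minimal sketch of how a url from TRANSFER-RETRIEVE-URL could be dispatched to whichever remote claims it, with the web special remote as fallback. The remote names, url schemes, and the `claims` stand-in are all invented for illustration; the real CLAIMURL check is a protocol round-trip to each external special remote program.

```python
# Hypothetical sketch: routing a redirect url to the remote that claims it.
# All remote names and url schemes here are made up for illustration.

def claims(remote, url):
    """Stand-in for the CLAIMURL round-trip to an external special remote."""
    return url.startswith(remote["scheme"])

def dispatch_retrieve(remotes, url):
    """Return the first remote that claims the url, or fall back to 'web'."""
    for remote in remotes:
        if claims(remote, url):
            return remote["name"]
    return "web"

remotes = [
    {"name": "ephemeral-cache", "scheme": "mycache://"},
    {"name": "archive", "scheme": "myarchive://"},
]
print(dispatch_retrieve(remotes, "mycache://key/SHA256E-s0--abc"))
# → ephemeral-cache
print(dispatch_retrieve(remotes, "https://example.com/file"))
# → web (ordinary urls fall through to the web special remote)
```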

That would not cover TRANSFER STORE and REMOVE though. And it probably doesn't make sense to extend those to urls generally. (There are too many ways to store to or remove from an url; not everything is WebDAV..)

I don't know if it is really elegant to drag urls into this anyway. The user may be left making up an url scheme for something that does not involve urls at all.

Comment by joey
Re: download URL broker

Some of this strikes me as perhaps coming at Ephemeral special remotes from a different direction?

Re the inflation of the git-annex branch when using sameas, I fixed a bug (sameas private) and you'll be able to use git-annex initremote --sameas=foo --private to keep the configuration of the new sameas remote out of the git-annex branch.

So, it seems to me that your broker, if it knows of several different urls that can be used to access myplace, can be configured at initremote time with the set of urls to use. And you can initialize multiple instances of the broker, each configured to use a different set of urls, with --sameas --private.

Comment by joey
Re: CLAIMURL

CLAIMURL is not currently used for TRANSFER-RETRIEVE-URL. (It's also not quite accurate to say that the web special remote is used.)

Supporting that would mean that, each time a remote replies with TRANSFER-RETRIEVE-URL, git-annex would need to query each other remote in turn to see if they claim the url. That could mean starting up a lot of external special remote programs (when not running yet) and doing a roundtrip through them, so latency might start to become a problem.

Also, there would be the possibility of loops between 2 or more remotes. Eg, remote A replies with TRANSFER-RETRIEVE-URL with an url that remote B CLAIMURLs, only to then reply with TRANSFER-RETRIEVE-URL, with an url that remote A CLAIMURLs.
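
The loop scenario above can be modelled in a few lines. This toy sketch (all names invented) follows a chain of url redirects and refuses to revisit a url it has already seen, which is one way such a cycle between remotes could be detected:

```python
# Toy model of the redirect loop: remote A redirects to a url claimed by
# remote B, which redirects back to a url claimed by A. Tracking urls
# already seen (plus a hard limit) catches the cycle.

def resolve(redirects, url, limit=10):
    """Follow TRANSFER-RETRIEVE-URL style redirects, refusing to loop."""
    seen = set()
    while url in redirects:
        if url in seen or len(seen) >= limit:
            raise RuntimeError("redirect loop between remotes")
        seen.add(url)
        url = redirects[url]
    return url

redirects = {"a://key": "b://key", "b://key": "a://key"}
try:
    resolve(redirects, "a://key")
except RuntimeError as e:
    print(e)  # → redirect loop between remotes
```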

Comment by joey
Re: multiple URLS for a key

TRANSFER-RETRIEVE-URL was designed as a redirect, so it only redirects to one place. And git-annex won't try again to retrieve from the same remote if the url fails to download.

I could imagine extending TRANSFER-RETRIEVE-URL to have a list of urls. But I can also imagine needing to extend it with HTTP headers to use for the url, and these things conflict, given the simple line and word based protocol.

I think that sameas remotes that use other urls might be a solution. When running eg git-annex get without specifying a remote, it will keep trying different remotes until one succeeds.
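
A rough model of that retry behaviour, with the remote list and downloaders invented for illustration (real costs come from remote configuration, not hardcoded numbers): git-annex tries remotes in ascending cost order until one of them succeeds.

```python
# Rough model of `git annex get` without --from: try remotes in cost
# order until one succeeds. Names, costs and downloaders are illustrative.

def get(remotes):
    """Try each remote in ascending cost order; return the first success."""
    for remote in sorted(remotes, key=lambda r: r["cost"]):
        if remote["download"]():
            return remote["name"]
    return None

remotes = [
    {"name": "slow-mirror", "cost": 200, "download": lambda: True},
    {"name": "fast-sameas", "cost": 100, "download": lambda: False},
]
print(get(remotes))  # → slow-mirror (fast-sameas is tried first, but fails)
```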

Comment by joey
Re: CHECKPRESENT

Yes, CHECKPRESENT still needs the special remote to do HTTP.

I do think that was an oversight. The original todo mentioned "taking advantage of the testing and security hardening of the git-annex implementation" and if a special remote is read-only, CHECKPRESENT may be the only time it needs to do HTTP.

A protocol extension for this would look like:

EXTENSIONS CHECKPRESENT-URL
CHECKPRESENT-URL Key Url
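
A special remote supporting this proposed extension could then answer CHECKPRESENT without doing any HTTP itself. The sketch below is hypothetical: the base url and key-to-url mapping are invented, and a real remote would speak the full protocol over stdin/stdout rather than one function call.

```python
# Hypothetical sketch of a special remote answering CHECKPRESENT with the
# proposed CHECKPRESENT-URL reply, letting git-annex do the presence
# check over HTTP itself. The url layout is invented for illustration.

def handle(line, base="https://example.com/annex/"):
    """Answer one protocol request line with one reply line."""
    parts = line.split(" ", 1)
    if parts[0] == "CHECKPRESENT":
        key = parts[1]
        # Redirect the presence check to a url instead of doing HTTP here.
        return "CHECKPRESENT-URL {} {}{}".format(key, base, key)
    return "UNSUPPORTED-REQUEST"

print(handle("CHECKPRESENT SHA256E-s0--abc"))
# → CHECKPRESENT-URL SHA256E-s0--abc https://example.com/annex/SHA256E-s0--abc
```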

Would it impact the usage of such a special remote, if it were configured with sameas=otherremote? Would both remote implementations need to implement CHECKPRESENT (consistently), or would one (in this case otherremote) be enough?

git-annex won't try to use the otherremote when it's been asked to use the sameas remote.

If one implemented CHECKPRESENT and the other always replied with "CHECKPRESENT-UNKNOWN", then a command like git-annex fsck --fast --from, when used with the former remote, would be able to verify that the content is present; when used with the latter remote, it would error out.

So you could perhaps get away with not implementing that. For a readonly remote, fsck is I think the only thing that uses CHECKPRESENT on a user-specified remote. It's more used on remotes that can be written to.

Comment by joey
Functionality gaps?

I looked into adopting this new feature for a special remote implementation. Four questions arose:

  1. In order to implement CHECKPRESENT, it appears that a special remote still needs to implement the logic for the equivalent of an HTTP HEAD request. From my POV this limits the utility of a git-annex based download, because significant logic still needs to be implemented in the special remote itself. Would it impact the usage of such a special remote, if it were configured with sameas=otherremote? Would both remote implementations need to implement CHECKPRESENT (consistently), or would one (in this case otherremote) be enough?

  2. I am uncertain about the signaling in case of multiple possible URL targets for a key, and an eventual download failure regarding one URL communicated via TRANSFER-RETRIEVE-URL. I believe that, when git-annex fails to download from a reported URL, it can only send another TRANSFER-RETRIEVE request to the special remote (possibly after going to the next remote first). This would mean that the special remote either needs to maintain state on which URL has been reported before, or it needs to implement the capacity to test for availability (essentially the topic of Q1), and can never report more than one URL. Is this correct?
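
To illustrate the kind of state alluded to in (2), here is a hedged sketch (all names and urls invented) of a special remote remembering which url it last handed out per key, so each retried TRANSFER RETRIEVE gets the next candidate:

```python
# Sketch of the per-key state a special remote would need to hand out a
# different url on each retried TRANSFER RETRIEVE, since the protocol
# only carries one url per reply. Everything here is illustrative.
import itertools

class UrlRotator:
    """Remember which url was last reported for each key, and cycle."""
    def __init__(self, urls_for_key):
        self.cycles = {k: itertools.cycle(v) for k, v in urls_for_key.items()}

    def next_url(self, key):
        return next(self.cycles[key])

rot = UrlRotator({"KEY1": ["https://a.example/KEY1", "https://b.example/KEY1"]})
print(rot.next_url("KEY1"))  # → https://a.example/KEY1
print(rot.next_url("KEY1"))  # → https://b.example/KEY1
```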

  3. What is the logic git-annex uses to act on a URL communicated via TRANSFER-RETRIEVE-URL? Would it match it against all available special remotes via CLAIMURL, or hand it straight to web (and only that)?

  4. I am wondering, if it would be possible and sensible, to use this feature for implementing a download URL "broker"? A use case would be an informed selection of a download URL from a set of URLs associated with a key. This is similar to the urlinclude/exclude feature of the web special remote, but (depending on Q3) is relevant also to other special remotes acting as downloader implementations.

Elaborating on (4) a bit more: My thinking is focused on the optimal long-term accessibility of keys -- across infrastructure transitions and different concurrent environments. From my POV git-annex provides me with the following options for making myplace as a special remote optimally work across space and time.

  • via sameas=myplace, I can have multiple special remotes point to myplace. In each environment I can use the additional remotes (by name) to optimally access myplace. The decision-making process is independent of git-annex. However, the possible access options need to be encoded in the git-annex branch to make this work. This creates a problem of inflation of that space for repositories that are used in many different contexts (think public research data that wants to capitalize on the decentralized nature of git-annex).

  • via enableremote I can swap out the type and parameterization of myplace entirely. However, unlike with initremote there is no --private, so this is more geared toward the use case of "previous access method is no longer available", rather than a temporary optimization.

  • when key access is (temporarily) preferred via URLs, I could generate a temporary web special remote via initremote --private and a urlinclude pattern.

In all cases, I cannot simply run git annex get; instead I need to identify a specific remote, which may need to be created first, or set a low cost for it.

I'd be glad to be pointed at omissions in this assessment. Thanks!

Comment by mih
"Gather inotify events"

The assistant has some very tricky, and probably also fragile code that gathers related inotify events. That would need to be factored out for this.

Comment by joey
comment 2

Looked in more detail into fixing this by moving the ignore check to after a set of files has been gathered and fed through git ls-files. Unfortunately that will be complicated significantly by the fact that, after the ignore check it currently does things like re-writing symlinks to annex objects when the link target needs updating. There is a chicken and egg problem here, because the type of Change that gets queued depends on parts of that same code having run.

BTW: Another way this same bug can manifest is that an annex object is added to a submodule, and the assistant updates its symlink to point out of the submodule, to the wrong annex objects directory.

There is some very delicate timing going on in Assistant.Threads.Committer in order to gather Changes that happen close together in time. Which makes me think that even a simple approach of running git ls-files once per changed file, before the ignore check, might throw the timing off enough to be a problem. As well as being murder on the CPU when eg, a lot of files have been moved around.
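
The "gather Changes that happen close together in time" behaviour can be sketched abstractly. This toy model (not the actual Assistant.Threads.Committer logic) batches events separated by less than a quiet period into one commit-sized group; timestamps stand in for real inotify delivery times:

```python
# Toy model of gathering changes close together in time: events arriving
# within a quiet period are batched together, a gap starts a new batch.

def batch(events, quiet=1.0):
    """Group (timestamp, path) events; a gap >= quiet starts a new batch."""
    batches = []
    current = []
    last = None
    for t, path in events:
        if last is not None and t - last >= quiet:
            batches.append(current)
            current = []
        current.append(path)
        last = t
    if current:
        batches.append(current)
    return batches

events = [(0.0, "a"), (0.2, "b"), (5.0, "c")]
print(batch(events))  # → [['a', 'b'], ['c']]
```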

Note that implementing "replace assistant with assist" would fix this bug, since git-annex assist does use git ls-files. Not that implementing that would be any easier than just fixing this bug. But fixing this bug moves the assistant in the direction of that todo one way or the other.

Comment by joey
comment 3

Congrats I guess, that's the first LLM-generated patch to git-annex, and it seems approximately correct.

It was unambiguously helpful to get the hint that Remote/Git.hs:485 was the location of the bug. That probably saved 10 minutes of my time.

But, I probably would have found it easier to fix this on my own, without seeing that patch, than it was to fix it given that patch. I had to do a considerable amount of thinking about whether the patch was correct, or just confidently incorrect in a different manner than a human-generated patch would be. (Not helped, certainly, by this being an area of the code with no type system guardrails helping it be correct.)

For one thing, I wondered, why does it use isUnescapedInURIComponent rather than isUnescapedInURI? The latter handles '/' correctly without needing a special case.

Being faced with an LLM-generated patch also meant that I needed to consider what its license is. I was faced with needing to clean-room my own version, which is a bit difficult given how short the patch is (while probably still long enough to be copyrightable).

But, it turns out that git-annex already contains essentially the same code in Remote/S3.hs, in genericPublicUrl:

        baseurl Posix.</> escapeURIString skipescape p
 where
        -- Don't need to escape '/' because the bucket object
        -- is not necessarily a single url component.
        -- But do want to escape eg '+' and ' '
        skipescape '/' = True
        skipescape c = isUnescapedInURIComponent c

This code was presumably in the LLM's training set, and certainly appeared to be available to it for context, so its mirroring of this could simply be a case of Garbage In, Garbage Out.

Note that "skipescape" is a much better name than the LLM-generated "escchar" which behaves backwards from what its name suggests.

Why did I use isUnescapedInURIComponent in that, and isUnescapedInURI in Remote/WebDav/DavLocation.hs? I doubt there was a good reason for either choice, but a full analysis did find a reason to prefer the isUnescapedInURIComponent approach: it handles a path containing '[' or ']'.
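
For readers outside Haskell, the skipescape behaviour has a close analogue in Python's urllib.parse.quote, whose default safe="/" likewise leaves '/' alone while percent-encoding '+', ' ', '[' and ']' (the path here is made up):

```python
# Python analogue of the Haskell skipescape above: percent-encode a path
# while leaving '/' alone. quote's default safe="/" matches that behaviour.
from urllib.parse import quote

print(quote("dir with space/file+[1].txt"))
# → dir%20with%20space/file%2B%5B1%5D.txt
```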

So, in 8fd9b67ed82ca0f39796a8d59431d42a7eb84957, I've factored out a general purpose function, and fixed this bug by using it.

Comment by joey