Recent comments posted to this site:

comment 2

I may have actually come up with a solution. Instead of creating a second remote, I was able to make my ~/.ssh/config dynamic based on the results of a dig command: https://fmartingr.com/blog/2022/08/12/using-ssh-config-match-to-connect-to-a-host-using-multiple-ip-or-hostnames/
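
The gist of that approach, with hypothetical host names and addresses: an exec criterion in a Match block runs dig, and ssh uses the first value obtained for each option, so the LAN block must set every option it wants to override:

```
# ~/.ssh/config sketch: use the LAN address when the internal name resolves
Match host nas exec "dig +short nas.lan | grep -q ."
    HostName nas.lan
    Port 22

# otherwise fall back to reaching the same host over the internet
Match host nas
    HostName nas.example.com
    Port 2222
```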

Thanks for going with me on this journey!

Comment by xentac
comment 1
Trying to set up and test this, I just realized these aren't special remotes. They're proper git (git-annex) remotes on my nas. I'm trying to figure out what setting I need to change to mark them as sameas and if cost will work in that case as well.
Comment by xentac
comment 3

After a bug fix, it's now possible to make a sameas remote that is private to the local repository.

git-annex initremote bar --sameas=foo --private type=...

While not ephemeral as such, if you git remote remove bar, the only trace left of it will probably be in .git/annex/journal-private/remote.log, and possibly any creds that got cached for it. It would be possible to have a command that removes the remote, and also clears that.
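
Until such a command exists, the by-hand equivalent would be roughly this (a sketch only; the journal path is the one mentioned above, and removing the whole file would also discard any other private remote configuration):

```shell
git remote remove bar
# blunt: this journal may hold the private configuration of other
# remotes too, which is why a dedicated cleanup command would be safer
rm -f .git/annex/journal-private/remote.log
```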

If that is close enough to ephemeral, then we could think about the second part, extending the external special remote protocol with REDIRECT-REMOTE.

That is similar to Special remote redirect to URL. And a few comments over there go in a similar direction, in particular the discussion of CLAIMURL. If TRANSFER-RETRIEVE-URL and TRANSFER-CHECKPRESENT-URL supported CLAIMURL, then, if the ephemeral special remote had some type of url that it claimed, those could be used rather than REDIRECT-REMOTE.

That would not cover TRANSFER STORE and REMOVE though. And it probably doesn't make sense to extend those to urls generally. (There are too many ways to store to or remove an url; not everything is WebDAV.)

I don't know if it is really elegant to drag urls into this anyway. The user may be left making up an url scheme for something that does not involve urls at all.

Comment by joey
Re: download URL broker

Some of this strikes me as perhaps coming at Ephemeral special remotes from a different direction?

Re the inflation of the git-annex branch when using sameas, I fixed a bug (sameas private) and you'll be able to use git-annex initremote --sameas=foo --private to keep the configuration of the new sameas remote out of the git-annex branch.

So, it seems to me that your broker, if it knows of several different urls that can be used to access myplace, can be configured at initremote time which set of urls to use. And you can initialize multiple instances of the broker, each configured to use a different set of urls, with --sameas --private.
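
Sketched with made-up names (the broker externaltype and its urlset parameter are hypothetical, for illustration only):

```shell
# one private sameas instance of the hypothetical broker per url set;
# neither instance's configuration lands in the git-annex branch
git-annex initremote myplace-lan --sameas=myplace --private \
    type=external externaltype=broker urlset=lan
git-annex initremote myplace-www --sameas=myplace --private \
    type=external externaltype=broker urlset=www
```

Which instance to use can then be decided per environment, eg by name or by cost.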

Comment by joey
Re: CLAIMURL

CLAIMURL is not currently used for TRANSFER-RETRIEVE-URL. (It's also not quite accurate to say that the web special remote is used.)

Supporting that would mean that, each time a remote replies with TRANSFER-RETRIEVE-URL, git-annex would need to query each other remote in turn to see if it claims the url. That could mean starting up a lot of external special remote programs (when they are not already running) and doing a roundtrip through each, so latency might start to become a problem.

Also, there would be the possibility of loops between 2 or more remotes. Eg, remote A replies with TRANSFER-RETRIEVE-URL giving an url that remote B CLAIMURLs, and B in turn replies with TRANSFER-RETRIEVE-URL giving an url that remote A CLAIMURLs.
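
Schematically, with made-up url schemes, such a loop would look like:

```
git-annex -> A:  TRANSFER RETRIEVE Key File
A -> git-annex:  TRANSFER-RETRIEVE-URL Key b://object
git-annex -> B:  CLAIMURL b://object
B -> git-annex:  CLAIMURL-SUCCESS
git-annex -> B:  TRANSFER RETRIEVE Key File
B -> git-annex:  TRANSFER-RETRIEVE-URL Key a://object   (A claims this: loop)
```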

Comment by joey
Re: multiple URLS for a key

TRANSFER-RETRIEVE-URL was designed as a redirect, so it only redirects to one place. And git-annex won't try again to retrieve from the same remote if the url fails to download.

I could imagine extending TRANSFER-RETRIEVE-URL to carry a list of urls. But I can also imagine needing to extend it with HTTP headers to use for the url, and those extensions conflict, given the simple line- and word-based protocol.

I think that sameas remotes that use other urls might be a solution. When running eg git-annex get without specifying a remote, it will keep trying different remotes until one succeeds.

Comment by joey
Re: CHECKPRESENT

Yes, CHECKPRESENT still needs the special remote to do HTTP itself.

I do think that was an oversight. The original todo mentioned "taking advantage of the testing and security hardening of the git-annex implementation" and if a special remote is read-only, CHECKPRESENT may be the only time it needs to do HTTP.

A protocol extension for this would look like:

EXTENSIONS CHECKPRESENT-URL
CHECKPRESENT-URL Key Url
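
As a sketch of what the remote side could look like under this proposed (not yet existing) extension: the reply logic maps a CHECKPRESENT request to an url for git-annex to check, instead of doing the HTTP HEAD itself. The base url and key below are made up.

```shell
#!/bin/sh
# hypothetical base url where this remote stores keys
BASE="https://example.com/annex"

# reply logic for the proposed CHECKPRESENT-URL extension:
# delegate presence checking to git-annex by handing it an url
reply() {
    case "$1" in
        CHECKPRESENT)
            echo "CHECKPRESENT-URL $2 $BASE/$2"
            ;;
        *)
            echo "UNSUPPORTED-REQUEST"
            ;;
    esac
}

reply CHECKPRESENT SHA256E-s3--0beec7b
# -> CHECKPRESENT-URL SHA256E-s3--0beec7b https://example.com/annex/SHA256E-s3--0beec7b
```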

> Would it impact the usage of such a special remote, if it would be configured with sameas=otherremote? Would both remote implementations need to implement CHECKPRESENT (consistently), or would one (in this case otherremote) be enough?

git-annex won't try to use the otherremote when it's been asked to use the sameas remote.

If one implemented CHECKPRESENT and the other always replied with "CHECKPRESENT-UNKNOWN", then a command like git-annex fsck --fast --from when used with the former remote would be able to verify that the content is present, and when used with the latter remote it would error out.

So you could perhaps get away with not implementing that. For a readonly remote, fsck is I think the only thing that uses CHECKPRESENT on a user-specified remote. It's more used on remotes that can be written to.

Comment by joey
Functionality gaps?

I looked into adopting this new feature for a special remote implementation. Four questions arose:

  1. In order to implement CHECKPRESENT, it appears that a special remote still needs to implement the logic for the equivalent of an HTTP HEAD request. From my POV this limits the utility of a git-annex based download, because significant logic still needs to be implemented in the special remote itself. Would it impact the usage of such a special remote, if it would be configured with sameas=otherremote? Would both remote implementations need to implement CHECKPRESENT (consistently), or would one (in this case otherremote) be enough?

  2. I am uncertain about the signaling in the case of multiple possible URL targets for a key, when a download from one URL communicated via TRANSFER-RETRIEVE-URL eventually fails. I believe that, when git-annex fails to download from a reported URL, it can only send another TRANSFER-RETRIEVE request to the special remote (possibly going to the next remote first). This would mean that the special remote either needs to maintain state about which URL it has reported before, or it would need to implement the capacity to test for availability (essentially the topic of Q1), and can never report more than one URL. Is this correct?

  3. What is the logic git-annex uses to act on a URL communicated via TRANSFER-RETRIEVE-URL? Would it match it against all available special remotes via CLAIMURL, or pass it straight to the web special remote (and only that)?

  4. I am wondering if it would be possible and sensible to use this feature for implementing a download URL "broker". A use case would be an informed selection of a download URL from a set of URLs associated with a key. This is similar to the urlinclude/urlexclude feature of the web special remote, but (depending on Q3) is relevant also to other special remotes acting as downloader implementations.

Elaborating on (4) a bit more: My thinking is focused on the optimal long-term accessibility of keys -- across infrastructure transitions and different concurrent environments. From my POV git-annex provides me with the following options for making myplace, as a special remote, work optimally across space and time.

  • via sameas=myplace, I can have multiple special remotes point to myplace. In each environment I can use the additional remotes (by name) to optimally access myplace. The decision-making process is independent of git-annex. However, the possible access options need to be encoded in the annex branch to make this work. This creates a problem of inflation of this space in the case of repositories that are used in many different contexts (think public (research) data that want to capitalize on the decentralized nature of git-annex).

  • via enableremote I can swap out the type and parameterization of myplace entirely. However, unlike with initremote there is no --private, so this is more geared toward the use case of "previous access method is no longer available", rather than a temporary optimization.

  • when key access is (temporarily) preferred via URLs, I could generate a temporary web special remote via initremote --private and a urlinclude pattern.

In all cases, I cannot simply run git annex get; I need to identify a specific remote, which may first need to be created, or set a low cost for it.
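
For the last point, the cost of a remote can at least be adjusted purely locally (remote name hypothetical):

```shell
# lower cost = tried earlier by eg git-annex get; this is local
# git config only and does not touch the git-annex branch
git config remote.myplace-lan.annex-cost 100.0
```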

I'd be glad to be pointed at omissions in this assessment. Thanks!

Comment by mih
"Gather inotify events"

The assistant has some very tricky, and probably also fragile code that gathers related inotify events. That would need to be factored out for this.

Comment by joey