The external special remote protocol allows the following responses to TRANSFER RETRIEVE {key} {file}:
    TRANSFER-SUCCESS RETRIEVE {key}
    TRANSFER-FAILURE RETRIEVE {key} {message}
I propose a third response:

    TRANSFER-REDIRECT-URL RETRIEVE {key} {url}
This will permit the following use cases:
1) Make a request against an authentication server that provides a short-lived access token to the same or a different server. The authentication server does not need to relay the data. (See the sketch below.)

2) Deterministically calculate a remote URL (or local path) without reimplementing HTTP fetch logic, taking advantage of the testing and security hardening of the git-annex implementation.
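As a rough sketch of the first use case (the host name and token parameter are made up), the exchange might look like this:

    TRANSFER RETRIEVE {key} {file}
        <- sent by git-annex
    TRANSFER-REDIRECT-URL RETRIEVE {key} https://storage.example.org/{key}?token=SHORT-LIVED
        <- reply from the remote, after it has asked its authentication server for a short-lived token

git-annex would then download from that url itself, store the result in {file}, and report success or failure as it does today.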
This seems like a good design to me. It will need a protocol extension to indicate when a git-annex version supports it.
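The external protocol already has an EXTENSIONS exchange after VERSION that could carry this; a minimal sketch, using a hypothetical keyword for this feature:

    EXTENSIONS INFO ASYNC TRANSFERREDIRECT
        <- git-annex lists the extensions it supports; TRANSFERREDIRECT is a hypothetical name
    EXTENSIONS TRANSFERREDIRECT
        <- the remote replies with the extensions it will use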
It occurred to me that when git-annex p2phttp is used and is proxying to a special remote that uses this feature, it would be possible to forward the redirect to the http client, so the server would not need to download the object itself. A neat potential optimisation, although implementing it would cut across several things in a way I'm unsure how to do cleanly.
That did make me wonder though, if the redirect url would always be safe to share with the client, without granting the client any abilities beyond a one-time download. And I think that's too big an assumption to make for this optimisation. Someone could choose to redirect to an url containing eg, http basic auth, which would be fine when using it all locally, but not in this proxy situation. So there would need to be an additional configuration to enable the proxy optimisation.
One problem with this design is that there may be HTTP headers that are needed for authorization, rather than putting authentication in the url.
I think we may have talked about this at the hackfest, and came down on the side of simplicity, supporting only an url. Can't quite remember.
It might also be possible to redirect to an url when storing an object. There it is more likely that a custom http verb would be needed, rather than PUT.
I think that protocol design should leave these possibilities open to be implemented later. So, I'm going with this:
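(The shape below is inferred from the rest of this thread, which refers to the response as TRANSFER-RETRIEVE-URL; the exact format may differ.)

    TRANSFER-RETRIEVE-URL {key} {url}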
Which leaves open the possibility for later things like:
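(Also inferred, not actual protocol: for example passing request headers, or a redirect when storing an object, as discussed above.)

    TRANSFER-RETRIEVE-URL {key} {url} {header} ...
    TRANSFER-STORE-URL {key} {url}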
TRANSFEREXPORT, in the "simple export interface", also uses TRANSFER-SUCCESS/TRANSFER-FAILURE, and should also support this extension.
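A sketch of what that might look like in the export interface (the url is made up, and the response name follows the rest of this thread):

    EXPORT path/in/tree/file.dat
    TRANSFEREXPORT RETRIEVE {key} {file}
    TRANSFER-RETRIEVE-URL {key} https://example.org/exports/path/in/tree/file.dat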
I looked into adopting this new feature for a special remote implementation. Four questions arose:
1) In order to implement CHECKPRESENT it appears that a special remote still needs to implement the logic for the equivalent of an HTTP HEAD request. From my POV this limits the utility of a git-annex based download, because significant logic still needs to be implemented in the special remote itself. Would it impact the usage of such a special remote if it were configured with sameas=otherremote? Would both remote implementations need to implement CHECKPRESENT (consistently), or would one (in this case otherremote) be enough?

2) I am uncertain about the signaling in case of multiple possible URL targets for a key, and an eventual download failure for one URL communicated via TRANSFER-RETRIEVE-URL. I believe that, when git-annex fails to download from a reported URL, it can only send another TRANSFER RETRIEVE request to the special remote (possibly going to the next remote first). This would mean that the special remote either needs to maintain state on which URL has been reported before, or it would need to implement the capacity to test for availability (essentially the topic of Q1), and can never report more than one URL. Is this correct?
3) What is the logic git-annex uses to act on a URL communicated via TRANSFER-RETRIEVE-URL? Would it match it against all available special remotes via CLAIMURL, or give it straight to web (and only that)?

4) I am wondering whether it would be possible and sensible to use this feature for implementing a download URL "broker". A use case would be an informed selection of a download URL from a set of URLs associated with a key. This is similar to the urlinclude/urlexclude feature of the web special remote, but (depending on Q3) is also relevant to other special remotes acting as downloader implementations.

Elaborating on (4) a bit more: My thinking is focused on the optimal long-term accessibility of keys -- across infrastructure transitions and different concurrent environments. From my POV git-annex provides me with the following options for making myplace as a special remote work optimally across space and time:

- Via sameas=myplace, I can have multiple special remotes point to myplace. In each environment I can use the additional remotes (by name) to access myplace optimally. The decision making process is independent of git-annex. However, the possible access options need to be encoded in the annex branch to make this work. This creates a problem of inflation of this space for repositories that are used in many different contexts (think public (research) data that want to capitalize on the decentralized nature of git-annex).

- Via enableremote I can swap out the type and parameterization of myplace entirely. However, unlike with initremote there is no --private, so this is more geared toward the use case of "the previous access method is no longer available", rather than a temporary optimization.

- When key access is (temporarily) preferred via URLs, I could generate a temporary web special remote via initremote --private and a urlinclude pattern (see the sketch below).

In all cases, I cannot simply run git annex get, but I need to identify a specific remote that may need to be created first, or set a low cost for it.

I'd be glad to be pointed at omissions in this assessment. Thanks!
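For the third option, something along these lines should work, assuming initremote --private can be combined with the web special remote's urlinclude config (remote name, url pattern, and cost are made up, and the exact parameters may differ):

    # a private, temporary variant of the web special remote, limited to one mirror
    git annex initremote --private --sameas=web fastmirror urlinclude='https://fast.example.org/*'
    # give it a low cost so a plain "git annex get" tries it first
    git config remote.fastmirror.annex-cost 50
    git annex get path/to/file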
Yes CHECKPRESENT still needs the special remote to do HTTP.
I do think that was an oversight. The original todo mentioned "taking advantage of the testing and security hardening of the git-annex implementation" and if a special remote is read-only, CHECKPRESENT may be the only time it needs to do HTTP.
A protocol extension for this would look like:
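One plausible shape (the message name here is hypothetical) would be a CHECKPRESENT response that hands git-annex a url to check with an HTTP HEAD request, instead of answering directly:

    CHECKPRESENT {key}
    CHECKPRESENT-URL {key} {url}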
git-annex won't try to use the otherremote when it's been asked to use the sameas remote.
If one implemented CHECKPRESENT and the other always replied with "CHECKPRESENT-UNKNOWN", then a command like git-annex fsck --fast --from, when used with the former remote, would be able to verify that the content is present; when used with the latter remote, it would error out.

So you could perhaps get away with not implementing that. For a readonly remote, fsck is I think the only thing that uses CHECKPRESENT on a user-specified remote. It's more used on remotes that can be written to.
TRANSFER-RETRIEVE-URL was designed as a redirect, so it only redirects to one place. And git-annex won't try again to retrieve from the same remote if the url fails to download.
I could imagine extending TRANSFER-RETRIEVE-URL to have a list of urls. But I can also imagine needing to extend it with HTTP headers to use for the url, and these things conflict, given the simple line and word based protocol.
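For example (hypothetical forms, not actual protocol), both extensions would put a variable-length list at the end of the line:

    TRANSFER-RETRIEVE-URL {key} {url1} {url2} {url3}
    TRANSFER-RETRIEVE-URL {key} {url} {header1} {header2}

and in a line- and word-based protocol there is no unambiguous way to tell where a list of urls ends and a list of headers begins.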
I think that sameas remotes that use other urls might be a solution. Running eg git-annex get without specifying a remote, it will keep trying different remotes until one succeeds.

CLAIMURL is not currently used for TRANSFER-RETRIEVE-URL. (It's also not quite accurate to say that the web special remote is used.) Supporting that would mean that, each time a remote replies with TRANSFER-RETRIEVE-URL, git-annex would need to query each other remote in turn to see if they claim the url. That could mean starting up a lot of external special remote programs (when not running yet) and doing a roundtrip through them, so latency might start to become a problem.
Also, there would be the possibility of loops between 2 or more remotes. Eg, remote A replies with TRANSFER-RETRIEVE-URL with an url that remote B CLAIMURLs, only to then reply with TRANSFER-RETRIEVE-URL, with an url that remote A CLAIMURLs.
Some of this strikes me as perhaps coming at Ephemeral special remotes from a different direction?
Re the inflation of the git-annex branch when using sameas, I fixed a bug (sameas private) and you'll be able to use git-annex initremote --sameas=foo --private to keep the configuration of the new sameas remote out of the git-annex branch.

So, it seems to me that your broker, if it knows of several different urls that can be used to access myplace, can be configured at initremote time which set of urls to use. And you can initialize multiple instances of the broker, each configured to use a different set of urls, with --sameas --private.
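Concretely, something like this (the broker remote and its urlset parameter are hypothetical):

    # one underlying broker remote, plus per-environment instances kept out of the git-annex branch
    git annex initremote myplace type=external externaltype=broker encryption=none
    git annex initremote --private --sameas=myplace myplace-campus urlset=campus
    git annex initremote --private --sameas=myplace myplace-cloud urlset=cloud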