Add a `remote.<name>.annex-speculate-can-get` config setting for non-special remotes, with the meaning "speculate that the remote knows how to get the key". It's like `remote.<name>.annex-speculate-present`, except git-annex would first try `git-annex get` in the remote before looking in its annex.

Then one can make a quick clone of the current repo and, instead of re-configuring all its remotes in the new clone, just configure the origin to be a `speculate-can-get` remote.
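As a sketch of the intended workflow (the `annex-speculate-can-get` setting is only proposed here and does not exist yet, so the setting name and behavior below are assumptions):

```shell
# Make a quick clone; its only configured remote is origin.
git clone /path/to/repo quickclone
cd quickclone

# Proposed (hypothetical) setting: ask origin to "git annex get"
# a key from *its* remotes when origin's own annex lacks it.
git config remote.origin.annex-speculate-can-get true

# With that set, a plain get would work without re-adding all of
# the original repo's remotes in this clone.
git annex get somefile
```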
This would also be useful when you have unconnected but related repos, and want to occasionally share files between them without merging their histories.
This would involve an extension to the P2P protocol to ask a remote to git-annex get a key from its remotes.
But, I'd worry this could be abused. Imagine for example that you have published a sanitized dataset by cloning the complete dataset and getting only the files you wish to publish, and then exposed that over the P2P protocol with a locked-down ssh key. Such a new feature would make this previously secure setup exploitable to expose the unsanitized data.
In such a scenario, `GIT_ANNEX_SHELL_READONLY` might be set, and could be used to avoid the unwanted behavior. But consider: the repo might be publishing the sanitized dataset while also accepting uploads of derived data from the people who have been given ssh keys to use it, and so not have readonly set.

A DoS attack seems even more likely, where you've only gotten a subset of files into a particular clone to avoid using up too much disk space, and then this feature is used to get many more files than you want there. This could happen without a trust boundary as well. Of course, git-annex repos with the assistant running and a bad preferred content configuration can similarly download too much data, but that takes an explicit configuration. This would change a scenario where `git annex get --from remote` had just failed into one where it suddenly ran the remote out of disk.
There's also the problem that it could take the remote arbitrarily long to perform the get. Would it need to send back progress information, and how would that indirect download progress be presented to the user? Consider that there could be a chain of several transfers. If it were possible to stream the file back to the requestor as the remote received it, the progress display would work as-is, but many file retrievals are not streamable.
How about limiting this to just local non-special remotes, i.e. git clones of the repo, and not ones accessible over ssh? And requiring the origin repo to have an explicitly set config setting, like `annex.allow-speculate-can-get-from-this-repo`, before it can be used that way.
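The opt-in side might then look like this (again, the setting name is the hypothetical one suggested above, not an existing git-annex config):

```shell
# In the origin repo: explicitly allow local clones to trigger
# "git annex get" on its behalf (hypothetical setting from this
# proposal; without it, the speculate-can-get request is refused).
git config annex.allow-speculate-can-get-from-this-repo true
```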
I was thinking of something much simpler / less powerful than what you're describing, but it would address the real use cases I have.
git-annex already has several security settings that can expose data or enable attacks if used badly, but require enough explicit configuration that people who use them likely know what they're doing.
Would the file then also be present in the `speculate-can-get` repo, after getting the file from it? In my usage scenarios, `annex.hardlink` would also be on, so only one copy of the file would exist.

One could implement this as an external special remote that takes the path to the `speculate-can-get` clone as a config param; `speculate-present` is then set to true for this external special remote.
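A minimal sketch of such an external special remote, following git-annex's external special remote protocol. The `directory` config name, the use of `git annex contentlocation` to locate the fetched content, and the read-only behavior are illustrative choices under this proposal, not a finished implementation:

```shell
#!/bin/sh
# Sketch: external special remote wrapping a local git-annex clone.
# On RETRIEVE it runs "git annex get" in the clone, then hardlinks
# (or copies) the content out.  A real script would end by calling
# speculate_remote; it is left as a function here.
speculate_remote () {
    echo VERSION 1
    while read -r cmd a b c; do
        case "$cmd" in
            INITREMOTE)
                echo INITREMOTE-SUCCESS ;;
            PREPARE)
                # Ask git-annex for the configured path to the clone.
                echo GETCONFIG directory
                read -r _ clone
                if [ -d "$clone" ]; then
                    echo PREPARE-SUCCESS
                else
                    echo PREPARE-FAILURE "no such directory: $clone"
                fi ;;
            TRANSFER)
                # a=STORE|RETRIEVE, b=key, c=destination file
                if [ "$a" = RETRIEVE ] &&
                   git -C "$clone" annex get --key "$b" >/dev/null 2>&1
                then
                    src=$(git -C "$clone" annex contentlocation "$b")
                    case "$src" in /*) ;; *) src="$clone/$src" ;; esac
                    # Hardlink when possible, so only one copy exists.
                    ln -f "$src" "$c" 2>/dev/null || cp "$src" "$c"
                    echo TRANSFER-SUCCESS "$a" "$b"
                else
                    echo TRANSFER-FAILURE "$a" "$b" "cannot $a"
                fi ;;
            CHECKPRESENT)
                # The clone may be able to get any key, so presence is
                # genuinely unknown; speculate-present papers over this.
                echo CHECKPRESENT-UNKNOWN "$a" "presence not tracked" ;;
            REMOVE)
                echo REMOVE-FAILURE "$a" "read-only remote" ;;
            *)
                echo UNSUPPORTED-REQUEST ;;
        esac
    done
}
```

The remote would be set up with `git annex initremote` using `type=external` and the path to the clone as its `directory` parameter, with `remote.<name>.annex-speculate-present` enabled afterwards so git-annex tries it without a location log entry.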