Add a `remote.<name>.annex-speculate-can-get` config setting for non-special remotes, with the meaning "speculate that the remote knows how to get the key". It's like `remote.<name>.annex-speculate-present`, except git-annex would first try `git-annex get` in the remote before looking in its annex.

Then one can make a quick clone of the current repo and, instead of re-configuring all its remotes in the new clone, just configure the origin to be a `speculate-can-get` remote.
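As a sketch of the intended workflow (the `annex-speculate-can-get` setting is only proposed here and does not exist yet, so the setting name and behavior below are assumptions):

```shell
# Make a quick clone; its only configured remote is origin.
git clone /path/to/repo quickclone
cd quickclone

# Proposed (hypothetical) setting: ask origin to "git annex get"
# a key from *its* remotes when origin's own annex lacks it.
git config remote.origin.annex-speculate-can-get true

# With that set, a plain get would work without re-adding all of
# the original repo's remotes in this clone.
git annex get somefile
```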
This would also be useful when you have unconnected but related repos, and want to occasionally share files between them without merging their histories.
This would involve an extension to the P2P protocol to ask a remote to git-annex get a key from its remotes.
But, I'd worry this could be abused. Imagine for example that you have published a sanitized dataset by cloning the complete dataset and getting only the files you wish to publish, and then exposed that over the P2P protocol with a locked-down ssh key. Such a new feature would make this previously secure setup exploitable to expose the unsanitized data.
In such a scenario, `GIT_ANNEX_SHELL_READONLY` might be set, and could be used to avoid the unwanted behavior. But consider: the repo might be publishing the sanitized dataset while also accepting uploads of derived data from the people who have been given ssh keys to use it, and so not have readonly set.

A DoS attack seems even more likely, where you've only gotten a subset of files into a particular clone to avoid using up too much disk space, and then this feature is used to get many more files than you want there. This could happen without a trust boundary as well. Of course, git-annex repos with the assistant running and a bad preferred content configuration can similarly download too much data, but that takes an explicit configuration. This would change a scenario where `git annex get --from remote` had just failed into one where it suddenly ran the remote out of disk.
There's also the problem that it could take the remote arbitrarily long to perform the get. Would it need to send back progress information, and how would that indirect download progress be presented to the user? Consider that there could be a chain of several transfers. If it were possible to stream the file back to the requestor as the remote received it, the progress display would work as-is, but many file retrievals are not streamable.
How about limiting this to just local non-special remotes, i.e. git clones of the repo, and not ones accessible over ssh? And requiring the origin repo to have an explicitly set config setting, like `annex.allow-speculate-can-get-from-this-repo`, before it can be used that way.
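The opt-in side might then look like this (again, the setting name is the hypothetical one suggested above, not an existing git-annex config):

```shell
# In the origin repo: explicitly allow local clones to trigger
# "git annex get" on its behalf (hypothetical setting from this
# proposal; without it, the speculate-can-get request is refused).
git config annex.allow-speculate-can-get-from-this-repo true
```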
I was thinking of something much simpler / less powerful than what you're describing, but it would address the real use cases I have.
git-annex already has several security settings that can expose data or enable attacks if used badly, but require enough explicit configuration that people who use them likely know what they're doing.
Would the file then also be present in the `speculate-can-get` repo, after getting the file from it? In my usage scenarios, `annex.hardlink` would also be on, so only one copy of the file would exist.

One could implement this as an external special remote that takes the path to the `speculate-can-get` clone as a config param; `speculate-present` is then set to true for this external special remote.
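A minimal sketch of such an external special remote, following git-annex's external special remote protocol. The `directory` config name, the use of `git annex contentlocation` to locate the fetched content, and the read-only behavior are illustrative choices under this proposal, not a finished implementation:

```shell
#!/bin/sh
# Sketch: external special remote wrapping a local git-annex clone.
# On RETRIEVE it runs "git annex get" in the clone, then hardlinks
# (or copies) the content out.  A real script would end by calling
# speculate_remote; it is left as a function here.
speculate_remote () {
    echo VERSION 1
    while read -r cmd a b c; do
        case "$cmd" in
            INITREMOTE)
                echo INITREMOTE-SUCCESS ;;
            PREPARE)
                # Ask git-annex for the configured path to the clone.
                echo GETCONFIG directory
                read -r _ clone
                if [ -d "$clone" ]; then
                    echo PREPARE-SUCCESS
                else
                    echo PREPARE-FAILURE "no such directory: $clone"
                fi ;;
            TRANSFER)
                # a=STORE|RETRIEVE, b=key, c=destination file
                if [ "$a" = RETRIEVE ] &&
                   git -C "$clone" annex get --key "$b" >/dev/null 2>&1
                then
                    src=$(git -C "$clone" annex contentlocation "$b")
                    case "$src" in /*) ;; *) src="$clone/$src" ;; esac
                    # Hardlink when possible, so only one copy exists.
                    ln -f "$src" "$c" 2>/dev/null || cp "$src" "$c"
                    echo TRANSFER-SUCCESS "$a" "$b"
                else
                    echo TRANSFER-FAILURE "$a" "$b" "cannot $a"
                fi ;;
            CHECKPRESENT)
                # The clone may be able to get any key, so presence is
                # genuinely unknown; speculate-present papers over this.
                echo CHECKPRESENT-UNKNOWN "$a" "presence not tracked" ;;
            REMOVE)
                echo REMOVE-FAILURE "$a" "read-only remote" ;;
            *)
                echo UNSUPPORTED-REQUEST ;;
        esac
    done
}
```

The remote would be set up with `git annex initremote` using `type=external` and the path to the clone as its `directory` parameter, with `remote.<name>.annex-speculate-present` enabled afterwards so git-annex tries it without a location log entry.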