We are
- creating git-annex repositories by minting annex keys based on metadata (size, filename extension, checksum), which we share on https://github.com/dandisets/ and https://github.com/dandizarrs
- backing them up with a tandem of annex get followed by annex move --to=backup commands.

As a result, we first temporarily populate those datasets fully (they could be TBs in size), and every annexed key gets two commits adjusting its state in the git-annex branch: first to announce local availability, and then to record that it moved to another remote.
Ideally I would have loved to be able to just say something like
git annex copy --from=someremote --to=backup .
or, even better, just something like
git annex copy --auto-get --to=backup .
and ensure that
- we get the file before moving it to backup
- we do not bother recording in git-annex that the file ever existed locally

NB: maybe some git annex sync invocation would do that magic?
See transitive transfers for past discussion of this kind of feature (and why supporting streaming would complicate git-annex significantly for typically little gain).
The extra commits to the git-annex branch can be avoided by setting annex.alwayscommit=false.
I wonder to what extent alternating getting a file with sending it on (and dropping it) would meet the use case? Are some files so large that a single file cannot fit in local storage at all?
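The trade-offs discussed so far (per-key branch commits, populating the whole tree locally vs alternating per file, and batching via annex.alwayscommit) can be sketched with a toy cost model. This is my own illustration, not git-annex code; the function name and return values are hypothetical.

```python
# Toy model (not git-annex internals) of the cost of a backup pass over
# an annexed tree, per the workflow described in this thread.

def backup_pass(num_keys: int, alternate: bool, batch_commits: bool):
    """Return (branch_commits, peak_local_keys) for one backup pass.

    alternate=False models `git annex get .` followed by
    `git annex move --to=backup .` over the whole tree, so every key
    is present locally at once before anything is moved.
    alternate=True models getting and moving one key at a time.
    batch_commits=True models annex.alwayscommit=false, where the
    journalled location-log updates are committed once at the end;
    otherwise each key costs two commits (one for `get`, one for
    `move --to=backup`).
    """
    peak_local = 1 if alternate else num_keys
    commits = 1 if batch_commits else 2 * num_keys
    return commits, peak_local
```

For example, for 1000 keys the straightforward workflow makes 2000 git-annex branch commits while holding all 1000 keys locally, whereas alternating with batched commits needs a single commit and at most one locally-present key at a time.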
Ok, I think this should be implemented, with alternating behavior.
As for disk speed, hopefully the files mostly fit in cache...
When there is already a local copy of a file, there are two different ways that git-annex move --from foo --to bar could behave: it could move the local copy along to bar as well (the first choice), or leave the pre-existing local copy alone (the second choice). Also, when there is a local copy, but foo does not have a copy, there are two possible behaviors: move the local copy to bar anyway, or do nothing.

Picking the first choice makes the behavior always be the same as git-annex move --from foo; git-annex move --to bar, except using less disk space. Picking the second choice in both situations makes this a move that behaves much as if the local repository is not involved at all (besides temporarily storing the content as necessary for the transfers).
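The difference between the two choices can be sketched as a toy Python model (my own illustration, not git-annex code), where "local", "foo", and "bar" name the repositories that hold a copy of a key's content:

```python
def move_from_to(locs, choice):
    """Model `git annex move --from foo --to bar` for one key.

    locs: the subset of {"local", "foo", "bar"} currently holding the
    content.  Returns the set of locations after the move.
    """
    after = set(locs)
    if choice == 1:
        # First choice: same as `move --from foo; move --to bar`.
        if "foo" in after:
            after.discard("foo")
            after.add("local")
        if "local" in after:
            after.discard("local")
            after.add("bar")
    else:
        # Second choice: the local repo is only a conduit.  A
        # pre-existing local copy is left alone, and if foo has no
        # copy, nothing happens at all.
        if "foo" in after:
            after.discard("foo")
            after.add("bar")
    return after

def move_to_bar(locs):
    """Model a plain `git annex move --to bar`."""
    after = set(locs)
    if "local" in after:
        after.discard("local")
        after.add("bar")
    return after
```

In this model, the second choice followed by a plain move --to bar ends in the same state as the first choice for every starting state, matching the equivalence noted later in the thread.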
If git-annex copy also gets the feature, the choice made for move should probably also affect it. Following the first path would result in git-annex copy --from foo --to bar being equivalent to git-annex copy --from foo; git-annex copy --to bar. This would not be a useful new feature, because the local repo would end up containing all the content. Following the second path would mean that, when foo has a copy but there is no local copy, it would copy from foo, copy to bar, and drop the intermediate local copy. And when there is a local copy, but foo does not have a copy, it would do nothing.
Which is the better choice? The second is more complicated, but it adds a genuine new capability, rather than only a disk space optimisation.
If the second is implemented but the user wants the first behavior instead, they can run git-annex move --from foo --to bar followed by git-annex move --to bar, and the result will be the same.

The second will be more complicated to implement, since it will sometimes need to use a temp file to store the object content (when the local repo does not contain a copy of the content).
I've started on an implementation of this, in the fromto branch.

Downloading to a local temp file has some complications which make me want to avoid it if possible. For one thing, these temp files would have to somehow get cleaned up after an interrupted move. For another, two concurrent move processes from different remotes to different remotes would need to either use separate temp files (wasting disk space) or locking so that only one uses the temp file at a time. The existing code in Annex.Transfer would have to be parameterized with the temp file to use, but then the transfer log/lock files that are used by that code would be problematic. So perhaps that Annex.Transfer code could not be reused, but then it would need to independently deal with resuming, locking, and stall detection.
So, I'm considering downloading --from the remote as usual, populating the local annex with the content, sending that --to the remote, and then dropping the local copy. That has its own complications, but they seem mostly smaller. Although there are two small races that I have not been able to resolve yet, which could cause git-annex move --from --to, when run concurrently with a git-annex get type process, to leave the local copy not present at the end (see a46c385aec2584419330c5dbb571c19ceb92f6fb). That would be surprising behavior, but also unlikely to happen. (And perhaps not too surprising, since running git-annex move --to concurrently with git-annex get can of course result in the local copy not being present at the end.)

The latter approach also has the problem that, when the file is unlocked, the unlocked file would get populated after downloading the content, which would be unnecessary work.
I did implement this using the second path described in comment #5. It does download the content to the local annex, and then, after sending it to the destination remote, it will drop the local content (unless it was already present). So running it concurrently with git-annex get may leave local files present or not, depending on which process gets to a file first.
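The implemented behavior as described above can be summarized in a minimal sketch (an illustration of the semantics, not the actual Haskell code; races with a concurrent get are ignored here):

```python
def implemented_move(locs):
    """Model the implemented `git annex move --from foo --to bar`:
    download into the local annex if needed, send to bar, then drop
    the intermediate local copy unless it was already present.

    locs: subset of {"local", "foo", "bar"} holding the content.
    """
    after = set(locs)
    already_present = "local" in after
    if "foo" in after:
        after.add("local")           # download --from foo into the annex
        after.add("bar")             # send --to bar
        after.discard("foo")         # move semantics: foo's copy is removed
        if not already_present:
            after.discard("local")   # drop the intermediate local copy
    # When foo has no copy, nothing happens (second-path semantics).
    return after
```

Note that a pre-existing local copy survives the move, so only keys that were absent locally pass through transiently.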