We are
- creating git-annex repositories by minting annex keys based on metadata (size, filename extension, checksum), which we share on https://github.com/dandisets/ and https://github.com/dandizarrs
- backing them up with a tandem of annex get followed by annex move --to=backup commands.

As a result, we first temporarily populate those datasets fully (they could be TBs in size), and every annexed key gets two commits adjusting its state in the git-annex branch: first to announce local availability, and then to record that it moved to another remote.
Ideally I would have loved to be able to just say something like
git annex copy --from=someremote --to=backup .
or, even better, just something like
git annex copy --auto-get --to=backup .
and ensure that
- we get the file before moving it to backup
- we do not bother recording in git-annex that the file ever existed locally

NB: maybe some git annex sync invocation would do that magic?
See transitive transfers for past discussion of this kind of feature (and why supporting streaming would complicate git-annex significantly for typically little gain).
The extra commits to the git-annex branch can be avoided by setting annex.alwayscommit=false.
I wonder to what extent alternating getting a file with sending it on (and dropping it) would meet the use case? Are some files so large that a single file cannot fit in local storage at all?
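The trade-offs discussed so far (per-key branch commits, populating the whole tree locally vs alternating per file, and batching via annex.alwayscommit) can be sketched with a toy cost model. This is my own illustration, not git-annex code; the function name and return values are hypothetical.

```python
# Toy model (not git-annex internals) of the cost of a backup pass over
# an annexed tree, per the workflow described in this thread.

def backup_pass(num_keys: int, alternate: bool, batch_commits: bool):
    """Return (branch_commits, peak_local_keys) for one backup pass.

    alternate=False models `git annex get .` followed by
    `git annex move --to=backup .` over the whole tree, so every key
    is present locally at once before anything is moved.
    alternate=True models getting and moving one key at a time.
    batch_commits=True models annex.alwayscommit=false, where the
    journalled location-log updates are committed once at the end;
    otherwise each key costs two commits (one for `get`, one for
    `move --to=backup`).
    """
    peak_local = 1 if alternate else num_keys
    commits = 1 if batch_commits else 2 * num_keys
    return commits, peak_local
```

For example, for 1000 keys the straightforward workflow makes 2000 git-annex branch commits while holding all 1000 keys locally, whereas alternating with batched commits needs a single commit and at most one locally-present key at a time.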
Ok, I think this should be implemented, with alternating behavior.
As for disk speed, hopefully the files mostly fit in cache...
When there is already a local copy of a file, there are two different ways that git-annex move --from foo --to bar could behave: it could move the local copy along to bar as well (the first choice), or leave the pre-existing local copy alone (the second choice). Also, when there is a local copy, but foo does not have a copy, there are two possible behaviors: move the local copy to bar anyway, or do nothing.

Picking the first choice makes the behavior always be the same as git-annex move --from foo; git-annex move --to bar, except using less disk space. Picking the second choice in both situations makes this a move that behaves much as if the local repository is not involved at all (besides temporarily storing the content as necessary for the transfers).
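The difference between the two choices can be sketched as a toy Python model (my own illustration, not git-annex code), where "local", "foo", and "bar" name the repositories that hold a copy of a key's content:

```python
def move_from_to(locs, choice):
    """Model `git annex move --from foo --to bar` for one key.

    locs: the subset of {"local", "foo", "bar"} currently holding the
    content.  Returns the set of locations after the move.
    """
    after = set(locs)
    if choice == 1:
        # First choice: same as `move --from foo; move --to bar`.
        if "foo" in after:
            after.discard("foo")
            after.add("local")
        if "local" in after:
            after.discard("local")
            after.add("bar")
    else:
        # Second choice: the local repo is only a conduit.  A
        # pre-existing local copy is left alone, and if foo has no
        # copy, nothing happens at all.
        if "foo" in after:
            after.discard("foo")
            after.add("bar")
    return after

def move_to_bar(locs):
    """Model a plain `git annex move --to bar`."""
    after = set(locs)
    if "local" in after:
        after.discard("local")
        after.add("bar")
    return after
```

In this model, the second choice followed by a plain move --to bar ends in the same state as the first choice for every starting state, matching the equivalence noted later in the thread.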
If git-annex copy also gets the feature, the choice made for move should probably also affect it. Following the first path would result in git-annex copy --from foo --to bar being equivalent to git-annex copy --from foo; git-annex copy --to bar. This would not be a useful new feature, because the local repo would end up containing all the content. Following the second path would mean that, when foo has a copy but there is no local copy, it would copy from foo, copy to bar, and drop the intermediate local copy. And when there is a local copy, but foo does not have a copy, it would do nothing.
Which is the better choice? The second is more complicated, but it adds a genuine new capability, rather than only a disk space optimisation.
If the second is implemented but the user wants the first behavior instead, they can run git-annex move --from foo --to bar followed by git-annex move --to bar, and the result will be the same.

The second will be more complicated to implement, since it will sometimes need to use a temp file to store the object content (when the local repo does not contain a copy of the content).
I've started on an implementation of this, in the fromto branch.

Downloading to a local temp file has some complications which make me want to avoid it if possible. For one thing, these temp files would have to somehow get cleaned up after an interrupted move. For another, two concurrent move processes from different remotes to different remotes would need to either use separate temp files (wasting disk space) or locking so that only one uses the temp file at a time. The existing code in Annex.Transfer would have to be parameterized with the temp file to use, but then the transfer log/lock files that are used by that code would be problematic. So perhaps that Annex.Transfer code could not be reused, but then it would need to independently deal with resuming, locking, and stall detection.
So, I'm considering downloading --from the remote as usual, populating the local annex with the content, sending that --to the remote, and then dropping the local copy. That has its own complications, but they seem mostly smaller. Although there are two small races that I have not been able to resolve yet, which could cause git-annex move --from --to, when run concurrently with a git-annex get type process, to leave the local copy not present at the end (see a46c385aec2584419330c5dbb571c19ceb92f6fb). That would be surprising behavior, but also unlikely to happen. (And perhaps not too surprising, since running git-annex move --to concurrently with git-annex get can of course result in the local copy not being present at the end.)

The latter approach also has the problem that, when the file is unlocked, the unlocked file would get populated after downloading the content, which would be unnecessary work.
I did implement this using the second path described in comment #5. It does download the content to the local annex, and then, after sending it to the destination remote, it will drop the local content (unless it was already present). So running it concurrently with git-annex get may leave local files present or not, depending on which process gets to a file first.
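The implemented behavior as described above can be summarized in a minimal sketch (an illustration of the semantics, not the actual Haskell code; races with a concurrent get are ignored here):

```python
def implemented_move(locs):
    """Model the implemented `git annex move --from foo --to bar`:
    download into the local annex if needed, send to bar, then drop
    the intermediate local copy unless it was already present.

    locs: subset of {"local", "foo", "bar"} holding the content.
    """
    after = set(locs)
    already_present = "local" in after
    if "foo" in after:
        after.add("local")           # download --from foo into the annex
        after.add("bar")             # send --to bar
        after.discard("foo")         # move semantics: foo's copy is removed
        if not already_present:
            after.discard("local")   # drop the intermediate local copy
    # When foo has no copy, nothing happens (second-path semantics).
    return after
```

Note that a pre-existing local copy survives the move, so only keys that were absent locally pass through transiently.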