Hi!
I'm wondering how clever git-annex is at parallelizing different remotes with different bandwidth. For example, here I have an origin remote that is on the local LAN (so it's fast) and an offsite-annex remote that's, well, offsite, so it's much slower. If I run `git annex sync --content -J2`, it looks like it indiscriminately starts copying files over to either host without too much thought:
```
$ git annex sync --content -J2
[...]
copy 2019/01/29/montreal/DSCF8148.JPG (to origin...) ok
copy 2019/01/29/montreal/DSCF8148.JPG (checking offsite-annex...) (to offsite-annex...) ok
copy 2019/01/29/montreal/DSCF8147.RAF (checking offsite-annex...) (to offsite-annex...)
copy 2019/01/29/montreal/DSCF8148.RAF (to origin...) ok
copy 2019/01/29/montreal/DSCF8148.RAF (checking offsite-annex...) (to offsite-annex...)
```
The interactivity of this doesn't show well here, but what happens is this, in order:

1. a first JPG is copied to origin and offsite-annex in parallel (good)
2. the origin (local) JPG transfer completes, and a RAF transfer starts to offsite-annex (not so great: a better strategy would be to continue copying files to the local remote, as it's fast and its bandwidth is now unused)
3. the offsite-annex JPG transfer completes, and a RAF transfer starts to origin (good)
4. that transfer completes, and the same file is now copied to offsite (again not so great: the local remote is now unused)
What I think git-annex should be doing is try, as much as possible, to saturate the different network links represented by the different remotes. In my case, files should be transferred on the local LAN as soon as possible: a single thread should stay busy with that origin remote as long as files are missing there, while the other thread can slowly trickle files to offsite-annex. Only when origin is full should both threads work on the offsite one. A simple heuristic would be: "is there a thread already busy with that remote? if yes, see if another remote that is not busy needs a file transferred".
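Roughly, in sketch form (illustrative Python only; the remote names, the file list and the copy step are placeholders I made up, not git-annex internals):

```python
from queue import Queue, Empty
from threading import Thread

# files each remote is still missing (made-up example data)
missing = {
    "origin":        Queue(),  # fast, on the LAN
    "offsite-annex": Queue(),  # slow, offsite
}
for f in ["DSCF8148.JPG", "DSCF8147.RAF", "DSCF8148.RAF"]:
    missing["origin"].put(f)
    missing["offsite-annex"].put(f)

def copy(f, remote):
    print("copy", f, "(to", remote + "...) ok")  # stand-in for the real transfer

def worker(preferred_order):
    # take work from the first remote in preferred_order that still has files
    # queued, so the fast remote is drained before this thread helps the slow one
    while True:
        for remote in preferred_order:
            try:
                f = missing[remote].get_nowait()
            except Empty:
                continue
            copy(f, remote)
            break
        else:
            return  # nothing left to transfer anywhere

# one worker prefers the fast remote, the other the slow one, so the LAN link
# stays saturated as long as origin is missing files
threads = [
    Thread(target=worker, args=(["origin", "offsite-annex"],)),
    Thread(target=worker, args=(["offsite-annex", "origin"],)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```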
Am I analysing this correctly? Is this a bug, or a feature request? -- anarcat
If there are several remotes that it can use, and they all have the same cost, then yes, `git-annex get` will spread the load among them and not use higher-cost remotes. So will `git-annex sync` when getting files from remotes.

There is not currently any similar smart thing done when sending files to multiple remotes (or dropping from multiple remotes). And it's kind of hard to see an efficient way to improve it.
The simplest way would be to loop over remotes ordered by cost and then inner loop over files, rather than the current method of looping over files with an inner loop over remotes. But in a large tree with many remotes, that has to traverse the tree multiple times, which would slow down the overall sync.
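Sketched out (a toy illustration in Python, not git-annex code; all the names here are made up), the two orders look like this:

```python
remotes_by_cost = ["origin", "offsite-annex"]        # sorted cheapest-first
have = {"origin": set(), "offsite-annex": set()}     # what each remote already has

def walk_tree():
    # stands in for a full (and possibly slow) traversal of the tree
    return ["a.jpg", "a.raf", "b.jpg"]

def send(f, remote):
    have[remote].add(f)                              # stand-in for a real upload

# Variant 1: outer loop over remotes ordered by cost. The cheap remote is fully
# populated before the expensive one is touched, but walk_tree() runs once per remote.
for remote in remotes_by_cost:
    for f in walk_tree():
        if f not in have[remote]:
            send(f, remote)

# Variant 2 (current method): outer loop over files, inner loop over remotes.
# Only one tree walk, but each file waits on the slowest remote before the
# next file is considered.
for f in walk_tree():
    for remote in remotes_by_cost:
        if f not in have[remote]:
            send(f, remote)
```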
If instead there's one thread per remote, then the slowest remote will fall behind the others, and there will need to be a queue of the files that still need to be sent to it -- and that queue could grow to use a lot of memory when the tree is large. There would need to be some bound on how far behind a thread can get before git-annex stops adding more files and waits for it to catch up.
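That approach would look roughly like this (again just an illustrative sketch with made-up names, not how git-annex is actually structured):

```python
from queue import Queue
from threading import Thread

BOUND = 100  # how far the slowest remote may fall behind before feeding pauses

def send(f, remote):
    pass  # stand-in for the real upload

def uploader(remote, q):
    while True:
        f = q.get()
        if f is None:          # sentinel: no more work for this remote
            break
        send(f, remote)

def feeder(files, queues):
    # single tree walk; put() blocks once a remote's queue is full, which is
    # the "stop adding more files and wait" bound described above
    for f in files:
        for q in queues.values():
            q.put(f)
    for q in queues.values():
        q.put(None)

queues = {r: Queue(maxsize=BOUND) for r in ("origin", "offsite-annex")}
workers = [Thread(target=uploader, args=(r, q)) for r, q in queues.items()]
for w in workers:
    w.start()
feeder(["a.jpg", "a.raf", "b.jpg"], queues)
for w in workers:
    w.join()
```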
So I guess the answer here is to use "cost" to prioritize "LAN-local" repositories? Then we hit "assistant does not always use repo cost info when queueing downloads", but at least it will work in the general case...
I think that, in my case, it means doing:
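(something along these lines, presumably -- assuming the usual default costs of 100 for local and 200 for network remotes; the value 150 here is just an example)

```
$ git config remote.origin.annex-cost 150
# 150 is an example value picked between the two defaults
```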
so that it's somewhere between local repositories (100) and remote ones (200). Would that solve my issue here? I don't have many files to transfer right now, so I can't really test this until I import new photos, but I'll give it a shot!
It would certainly be nice if git-annex was a little more clever with this - it could, for example, have a gray zone between "remote" and "local"... but I guess that's what `annex-cost-command` is for... Thanks!
That would make it do efficient parallelization of downloads, but not of the uploads, which you showed being bottlenecked on the slowest remote.
ah. i somehow missed that... i was assuming a symmetry between the process of getting and sending files; after all, it's similar: there's a list of files to move around, and we iterate over them the same way.

cost doesn't apply to uploads? if so, that would seem like a fair feature to add...
one thing I would definitely like to see parallelized is CPU and network. right now `git annex get` will: [...] serially. If parallelism (`-J2`) is enabled, the following happens, assuming files are roughly the same size: [...]

This is not much of an improvement... We can get away with maximizing the bandwidth usage if file transfers are somewhat interleaved (because of size differences), but the above degenerate case actually happens quite often. The alternative (`-J3` or more) might just download more files in parallel, which is not optimal.

So could we at least batch the checksum jobs separately from downloads? This would already be an improvement and would maximize resource usage while at the same time reducing total transfer time.
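As a rough sketch of the kind of pipelining I mean (illustrative Python only; the download and checksum functions are stand-ins, not git-annex internals):

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def download(f):
    return b"contents of " + f.encode()      # stand-in for the network transfer

def checksum(data):
    return hashlib.sha256(data).hexdigest()  # stand-in for annex verification

files = ["DSCF8147.RAF", "DSCF8148.RAF", "DSCF8148.JPG"]

# one pool for the network, one for the CPU: while one file is being
# checksummed, the next one is already downloading
with ThreadPoolExecutor(max_workers=1) as net, ThreadPoolExecutor(max_workers=1) as cpu:
    downloads = [net.submit(download, f) for f in files]
    checks = [cpu.submit(checksum, d.result()) for d in downloads]
    for f, c in zip(files, checks):
        print(f, c.result())
```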
Thanks!
i suppose i could also use `-c annex.verify=false` to work around that problem... but that's kind of obscure, really.

git-annex does separately parallelize checksums, since version 7.20190626.
`git-annex sync --content` uploads to the lowest-cost remotes first, but it still generally has to upload to the higher-cost remotes too, unless preferred content has been set up to prevent it.
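For example, a hypothetical preferred content expression (the remote name and the glob are only placeholders) that would stop a remote from wanting everything:

```
$ git annex wanted offsite-annex "include=*.jpg"
# placeholder expression: only JPEGs would be wanted by (and so uploaded to) offsite-annex
```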