Hi!
I'm wondering how clever git-annex is at parallelizing different remotes with different bandwidth. For example, here I have an origin remote that is on the local LAN (so it's fast) and an offsite-annex remote that's, well, offsite, so it's much slower. If I run `git annex sync --content -J2`, it looks like it indiscriminately starts copying files over to either host without too much thought:
```
$ git annex sync --content -J2
[...]
copy 2019/01/29/montreal/DSCF8148.JPG (to origin...) ok
copy 2019/01/29/montreal/DSCF8148.JPG (checking offsite-annex...) (to offsite-annex...) ok
copy 2019/01/29/montreal/DSCF8147.RAF (checking offsite-annex...) (to offsite-annex...)
copy 2019/01/29/montreal/DSCF8148.RAF (to origin...) ok
copy 2019/01/29/montreal/DSCF8148.RAF (checking offsite-annex...) (to offsite-annex...)
```
The interactivity of this doesn't show well here, but what happens is this, in order:

1. a first JPG is copied to origin and offsite-annex in parallel (good)
2. the origin (local) JPG transfer completes, and a RAF transfer starts to offsite-annex (not so great: a better strategy would be to continue copying files to the local remote, as it's fast and its bandwidth is now unused)
3. the offsite-annex JPG transfer completes, and a RAF transfer starts to origin (good)
4. that transfer completes, and the same file is now copied to offsite (again not so great: the local remote is now unused)
What I think git-annex should be doing is try, as much as possible, to saturate the different network links represented by the different remotes. In my case, files should be transferred on the local LAN as soon as possible: a single thread should stay busy with that origin remote as long as files are missing there, while the other thread can slowly trickle files to offsite-annex. Only when origin is full should both threads work on the offsite one. A simple heuristic would be: "is there a thread already busy with that remote? if yes, see if another remote that is not busy needs a file transferred".
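Roughly, in sketch form (illustrative Python only; the remote names, the file list and the copy step are placeholders I made up, not git-annex internals):

```python
from queue import Queue, Empty
from threading import Thread

# files each remote is still missing (made-up example data)
missing = {
    "origin":        Queue(),  # fast, on the LAN
    "offsite-annex": Queue(),  # slow, offsite
}
for f in ["DSCF8148.JPG", "DSCF8147.RAF", "DSCF8148.RAF"]:
    missing["origin"].put(f)
    missing["offsite-annex"].put(f)

def copy(f, remote):
    print("copy", f, "(to", remote + "...) ok")  # stand-in for the real transfer

def worker(preferred_order):
    # take work from the first remote in preferred_order that still has files
    # queued, so the fast remote is drained before this thread helps the slow one
    while True:
        for remote in preferred_order:
            try:
                f = missing[remote].get_nowait()
            except Empty:
                continue
            copy(f, remote)
            break
        else:
            return  # nothing left to transfer anywhere

# one worker prefers the fast remote, the other the slow one, so the LAN link
# stays saturated as long as origin is missing files
threads = [
    Thread(target=worker, args=(["origin", "offsite-annex"],)),
    Thread(target=worker, args=(["offsite-annex", "origin"],)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```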
Am I analysing this correctly? Is this a bug, or a feature request? -- anarcat
If there are several remotes that it can use, and they all have the same cost, then yes, `git-annex get` will spread the load among them and not use higher-cost remotes. So will `git-annex sync` when getting files from remotes.

There is not currently any similar smart thing done when sending files to multiple remotes (or dropping from multiple remotes). And it's kind of hard to see an efficient way to improve it.
The simplest way would be to loop over remotes ordered by cost and then inner loop over files, rather than the current method of looping over files with an inner loop over remotes. But in a large tree with many remotes, that has to traverse the tree multiple times, which would slow down the overall sync.
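Sketched out (a toy illustration in Python, not git-annex code; all the names here are made up), the two orders look like this:

```python
remotes_by_cost = ["origin", "offsite-annex"]        # sorted cheapest-first
have = {"origin": set(), "offsite-annex": set()}     # what each remote already has

def walk_tree():
    # stands in for a full (and possibly slow) traversal of the tree
    return ["a.jpg", "a.raf", "b.jpg"]

def send(f, remote):
    have[remote].add(f)                              # stand-in for a real upload

# Variant 1: outer loop over remotes ordered by cost. The cheap remote is fully
# populated before the expensive one is touched, but walk_tree() runs once per remote.
for remote in remotes_by_cost:
    for f in walk_tree():
        if f not in have[remote]:
            send(f, remote)

# Variant 2 (current method): outer loop over files, inner loop over remotes.
# Only one tree walk, but each file waits on the slowest remote before the
# next file is considered.
for f in walk_tree():
    for remote in remotes_by_cost:
        if f not in have[remote]:
            send(f, remote)
```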
If instead there's one thread per remote, then the slowest remote will fall behind the others, and there will need to be a queue of the files that still need to be sent to it -- and that queue could grow to use a lot of memory when the tree is large. There would need to be some bound on how far behind a thread can get before git-annex stops adding more files and waits for it to catch up.
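That approach would look roughly like this (again just an illustrative sketch with made-up names, not how git-annex is actually structured):

```python
from queue import Queue
from threading import Thread

BOUND = 100  # how far the slowest remote may fall behind before feeding pauses

def send(f, remote):
    pass  # stand-in for the real upload

def uploader(remote, q):
    while True:
        f = q.get()
        if f is None:          # sentinel: no more work for this remote
            break
        send(f, remote)

def feeder(files, queues):
    # single tree walk; put() blocks once a remote's queue is full, which is
    # the "stop adding more files and wait" bound described above
    for f in files:
        for q in queues.values():
            q.put(f)
    for q in queues.values():
        q.put(None)

queues = {r: Queue(maxsize=BOUND) for r in ("origin", "offsite-annex")}
workers = [Thread(target=uploader, args=(r, q)) for r, q in queues.items()]
for w in workers:
    w.start()
feeder(["a.jpg", "a.raf", "b.jpg"], queues)
for w in workers:
    w.join()
```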
So I guess the answer here is to use "cost" to prioritize "LAN-local" repositories? Then we hit "assistant does not always use repo cost info when queueing downloads", but at least it will work in the general case...
I think that, in my case, it means doing:
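(something along these lines, presumably -- assuming the usual default costs of 100 for local and 200 for network remotes; the value 150 here is just an example)

```
$ git config remote.origin.annex-cost 150
# 150 is an example value picked between the two defaults
```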
so that it's somewhere between local repositories (100) and remote ones (200). Would that solve my issue here? I don't have many files to transfer right now, so I can't really test this until I import new photos, but I'll give it a shot!
It would certainly be nice if git-annex was a little more clever with this - it could, for example, have a gray zone between "remote" and "local"... but I guess that's what `annex-cost-command` is for... Thanks!
That would make it do efficient parallelization of downloads, but not of the uploads, which you showed being bottlenecked on the slowest remote.
ah. i somehow missed that... i was assuming a symmetry between the process of getting and sending files; after all, it's similar: there's a list of files to move around, and we iterate over them the same way.

cost doesn't apply to uploads? if so, that would seem like a fair feature to add...
one thing I would definitely like to see parallelized is CPU and network. right now `git annex get` will: [...] serially. If parallelism (`-J2`) is enabled, the following happens, assuming files are roughly the same size: [...]

This is not much of an improvement... We can get away with maximizing the bandwidth usage if file transfers are somewhat interleaved (because of size differences), but the above degenerate case actually happens quite often. The alternative (`-J3` or more) might just download more files in parallel, which is not optimal.

So could we at least batch the checksum jobs separately from downloads? This would already be an improvement and would maximize resource usage while at the same time reducing total transfer time.
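As a rough sketch of the kind of pipelining I mean (illustrative Python only; the download and checksum functions are stand-ins, not git-annex internals):

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib

def download(f):
    return b"contents of " + f.encode()      # stand-in for the network transfer

def checksum(data):
    return hashlib.sha256(data).hexdigest()  # stand-in for annex verification

files = ["DSCF8147.RAF", "DSCF8148.RAF", "DSCF8148.JPG"]

# one pool for the network, one for the CPU: while one file is being
# checksummed, the next one is already downloading
with ThreadPoolExecutor(max_workers=1) as net, ThreadPoolExecutor(max_workers=1) as cpu:
    downloads = [net.submit(download, f) for f in files]
    checks = [cpu.submit(checksum, d.result()) for d in downloads]
    for f, c in zip(files, checks):
        print(f, c.result())
```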
Thanks!
i suppose i could also use `-c annex.verify=false` to work around that problem... but that's kind of obscure, really.

git-annex does separately parallelize checksums, since version 7.20190626.
`git-annex sync --content` uploads to the lowest-cost remotes first, but it still generally has to upload to the higher-cost remotes too, unless preferred content has been set up to prevent it.
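For example, a hypothetical preferred content expression (the remote name and the glob are only placeholders) that would stop a remote from wanting everything:

```
$ git annex wanted offsite-annex "include=*.jpg"
# placeholder expression: only JPEGs would be wanted by (and so uploaded to) offsite-annex
```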