forum/does git-annex parallelize different remotes? | git-annex | http://git-annex.branchable.com/forum/does_git-annex_parallelize_different_remotes__63__/ | git-annex | ikiwiki | 2020-06-17T01:18:32Z
comment 1 | http://git-annex.branchable.com/forum/does_git-annex_parallelize_different_remotes__63__/comment_1_764f43b51a075a8c94b433eda0cff946/ | joey | 2019-02-07T19:54:52Z | 2019-02-07T19:29:55Z
<p>If there are several remotes that it can use, and they all have the same
cost, then yes, <code>git-annex get</code> will spread the load among them and not
use higher-cost remotes.
So will <code>git-annex sync</code> when getting files from remotes.</p>
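<p>As a rough illustration of that behaviour (not git-annex's actual code): the sketch below keeps only the lowest-cost remotes and hands files to them in turn. The round-robin policy, the function name <code>pick_remotes</code> and the remote names are all assumptions made for the example.</p>

```python
from itertools import cycle


def pick_remotes(remotes, files):
    """Assign each file to one of the cheapest remotes, in rotation.

    remotes: dict mapping a (hypothetical) remote name to its cost;
    lower cost is preferred. Round-robin is an assumed policy.
    """
    lowest = min(remotes.values())
    # Only remotes tied for the lowest cost are candidates.
    cheapest = sorted(name for name, cost in remotes.items() if cost == lowest)
    rotation = cycle(cheapest)
    return {f: next(rotation) for f in files}


remotes = {"laptop": 100, "nas": 100, "server": 200}
plan = pick_remotes(remotes, ["a.jpg", "b.jpg", "c.jpg", "d.jpg"])
```

<p>Here the two cost-100 remotes share the files and the cost-200 remote is never used.</p>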
<p>There is not currently any similar smart thing done when sending files to
multiple remotes (or dropping from multiple remotes).
And it's kind of hard to see an efficient way to improve it.</p>
<p>The simplest way would be to loop over remotes ordered by cost and
then inner loop over files, rather than the current method of looping over
files with an inner loop over remotes. But in a large tree with many remotes,
that has to traverse the tree multiple times, which would slow down the
overall sync.</p>
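<p>To make the trade-off concrete, here is a toy comparison of the two loop orders. It is only a sketch: the tree is a plain list, the "send" is just a recorded pair, and the function names are made up for the example.</p>

```python
def files_outer(tree, remotes):
    """Loop over files, with an inner loop over remotes: one tree pass."""
    traversals = 1
    by_cost = sorted(remotes, key=remotes.get)
    sends = [(f, r) for f in tree for r in by_cost]
    return traversals, sends


def remotes_outer(tree, remotes):
    """Loop over remotes by cost, with an inner loop over files.

    The tree is walked once per remote.
    """
    traversals = 0
    sends = []
    for r in sorted(remotes, key=remotes.get):
        traversals += 1  # a fresh walk of the whole tree for this remote
        for f in tree:
            sends.append((f, r))
    return traversals, sends


tree = ["a", "b", "c"]
remotes = {"nas": 100, "server": 200}
one_pass, sends_by_file = files_outer(tree, remotes)
many_passes, sends_by_remote = remotes_outer(tree, remotes)
```

<p>Both orders perform the same transfers, but the second one walks the tree once per remote, which is the cost that matters in a large tree.</p>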
<p>If instead there's one thread per remote, then the slowest remote will
fall behind the others, and there will need to be a queue of the files
that still need to be sent to it -- and that queue could grow to use a lot
of memory when the tree is large. There would need to be some bound
on how far behind a thread can get before the tree traversal stops queueing more files and waits
for that thread to catch up.</p>
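<p>A minimal sketch of that idea, assuming one worker thread per remote and a bounded queue in front of each. The names (<code>sync_with_bounded_queues</code>, <code>max_backlog</code>) and the fake send functions are hypothetical; this is not how git-annex is implemented.</p>

```python
import queue
import threading
import time


def sync_with_bounded_queues(files, remotes, max_backlog=2):
    """remotes: dict mapping a remote name to a (hypothetical) send function."""
    queues = {name: queue.Queue(maxsize=max_backlog) for name in remotes}
    sent = {name: [] for name in remotes}

    def worker(name):
        while True:
            f = queues[name].get()
            if f is None:  # sentinel: no more files for this remote
                break
            remotes[name](f)
            sent[name].append(f)

    threads = [threading.Thread(target=worker, args=(n,)) for n in remotes]
    for t in threads:
        t.start()
    for f in files:  # a single traversal of the tree
        for q in queues.values():
            # put() blocks once this remote is max_backlog files behind,
            # so memory use stays bounded.
            q.put(f)
    for q in queues.values():
        q.put(None)
    for t in threads:
        t.join()
    return sent


files = ["a", "b", "c", "d", "e"]
sent = sync_with_bounded_queues(
    files,
    {"fast": lambda f: None, "slow": lambda f: time.sleep(0.01)},
)
```

<p>The catch is visible in the sketch: once the slow remote's queue is full, the traversal blocks, so the fast remote ends up waiting on the slow one again, just with a small head start.</p>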
use cost then? | http://git-annex.branchable.com/forum/does_git-annex_parallelize_different_remotes__63__/comment_2_dff46561e15ba5f386dc9cecb9c88e20/ | anarcat | 2019-02-07T20:19:13Z | 2019-02-07T20:19:13Z
<p>So I guess the answer here is to use "cost" to prioritize "LAN-local" repositories? Then we hit <a href="http://git-annex.branchable.com/bugs/assistant_does_not_always_use_repo_cost_info_when_queueing_downloads/">assistant does not always use repo cost info when queueing downloads</a> but at least it will work in the general case...</p>
<p>I think that, in my case, it means doing:</p>
<pre><code>git config remote.origin.annex-cost 150
</code></pre>
<p>... so that it's somewhere between local repositories (100) and remote (200). Would that solve my issue here? I don't have many files to transfer right now so I can't really test this until I import new photos, but I'll give it a shot! <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":)" /></p>
<p>It would certainly be nice if git-annex was a little more clever with this - it could, for example, have a gray zone between "remote" and "local"... but I guess that's what the <code>annex-cost-command</code> is for...</p>
<p>Thanks!</p>
comment 3 | http://git-annex.branchable.com/forum/does_git-annex_parallelize_different_remotes__63__/comment_3_47269e2c40c5ce27ca197ddfd5ac85c3/ | joey | 2019-02-07T20:30:16Z | 2019-02-07T20:29:19Z
<p>That would make it do efficient parallelization of downloads, but not of
the uploads that you showed it being bottlenecked on the slowest remote.</p>
not uploads | http://git-annex.branchable.com/forum/does_git-annex_parallelize_different_remotes__63__/comment_4_b9444184ab653718fca9df5c2068ad3c/ | anarcat | 2019-02-07T20:59:13Z | 2019-02-07T20:59:13Z
<p>ah. i somehow missed that... i was assuming a symmetry between the process of getting and sending files, after all it's similar: there's a list of files to move around, and we iterate over them the same way.</p>
<p>cost doesn't apply to uploads? if so that would seem like a fair feature to add... <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":)" /></p>
parallelizing checksum and get | http://git-annex.branchable.com/forum/does_git-annex_parallelize_different_remotes__63__/comment_5_5098db1fad3290cba49ea1c1163cc168/ | anarcat | 2019-03-07T18:21:22Z | 2019-03-07T18:21:22Z
<p>one thing I would definitely like to see parallelized is CPU and network. right now <code>git annex get</code> will:</p>
<ol>
<li>download file A</li>
<li>checksum file A</li>
<li>download file B</li>
<li>checksum file B</li>
</ol>
<p>... serially. If parallelism (<code>-J2</code>) is enabled, the following happens, assuming files are roughly the same size:</p>
<ol>
<li>download file A and B</li>
<li>checksum file A and B</li>
</ol>
<p>This is not much of an improvement... We can get away with maximizing the bandwidth usage <em>if</em> file transfers are somewhat interleaved (because of size differences) but the above degenerate case actually happens quite often. The alternative (<code>-J3</code> or more) might just download more files in parallel, which is not optimal.</p>
<p>So could we at least batch the checksum jobs separately from downloads? This would already be an improvement and maximize resource usage while at the same time reducing total transfer time.</p>
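<p>To sketch what I mean by batching checksums separately: the toy pipeline below uses one pool for downloads and another for checksums, so file B can download while file A is being hashed. The <code>fetch</code> function is a stand-in for the network transfer and every name here is made up for illustration; it says nothing about how git-annex schedules its jobs.</p>

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def fetch(name):
    """Stand-in for a network download (hypothetical)."""
    return name.encode() * 1000


def checksum(data):
    return hashlib.sha256(data).hexdigest()


def get_all(names, download_jobs=1, checksum_jobs=1):
    """Run downloads and checksums in separate worker pools."""
    with ThreadPoolExecutor(download_jobs) as net, \
            ThreadPoolExecutor(checksum_jobs) as cpu:
        downloads = [net.submit(fetch, n) for n in names]
        # Hand each finished download to the checksum pool; the download
        # pool carries on with the next file in the meantime.
        sums = [cpu.submit(checksum, d.result()) for d in downloads]
        return {n: s.result() for n, s in zip(names, sums)}


result = get_all(["A", "B"])
```

<p>With one job in each pool, the network and the CPU are both kept busy instead of taking turns.</p>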
<p>Thanks! <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":)" /></p>
or -c annex.verify=false | http://git-annex.branchable.com/forum/does_git-annex_parallelize_different_remotes__63__/comment_6_ac6b01d2ff9dfe6060be5956d5adb632/ | anarcat | 2019-03-07T18:23:02Z | 2019-03-07T18:23:02Z
oh... i guess i can use <code>-c annex.verify=false</code> to work around that problem as well... but that's kind of obscure, really. <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":)" />
comment 7 | http://git-annex.branchable.com/forum/does_git-annex_parallelize_different_remotes__63__/comment_7_101543247268097f7535660efe0d39d8/ | joey | 2020-06-17T01:18:32Z | 2019-09-18T17:02:27Z
<p>git-annex does separately parallelize checksums, since version
7.20190626.</p>
<blockquote><p>cost doesn't apply to uploads? if so that would seem like a fair feature
to add... <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":)" /></p></blockquote>
<p><code>git-annex sync --content</code> uploads to the lowest-cost remotes first,
but it generally still has to upload to the higher-cost remotes too,
unless preferred content has been set up to prevent it.</p>
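<p>Roughly, and only as an illustration: the upload order can be pictured as the remotes sorted by cost, minus any remote whose preferred content setting does not want the file. The <code>wants</code> predicate below is a hypothetical stand-in for a real preferred content expression, and the remote names are invented.</p>

```python
def upload_order(remotes, wants):
    """Return remotes to upload to, cheapest first.

    remotes: dict mapping a remote name to its cost.
    wants: hypothetical predicate standing in for preferred content.
    """
    return [name for name in sorted(remotes, key=remotes.get) if wants(name)]


remotes = {"server": 200, "nas": 100, "origin": 150}
everything = upload_order(remotes, lambda name: True)
without_server = upload_order(remotes, lambda name: name != "server")
```

<p>Without a preferred content restriction every remote still gets the upload, just in cost order.</p>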