todo/make copy --fast faster | yoh | http://git-annex.branchable.com/todo/make_copy_--fast__faster/ | git-annex | ikiwiki | 2017-10-30T20:04:48Z
comment 1 | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_1_24a9ca652007a18f18b368232cf549da/ | joey | 2016-05-21T13:53:55Z | 2016-05-21T13:53:15Z
--to or --from? The latter is faster due to locality.
comment 2 | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_2_0c67f467d730a0966b43171de0382c42/ | EbvxpTI_xP9Aod7Mg4cwGhgjrCrdM5s- [me.yahoo.com/a] | 2016-05-25T01:09:56Z | 2016-05-25T01:09:56Z
<p>"to remote host", so it was "--to". git-annex is already aware that those files are in that remote (see below).</p>
<div class="highlight-sh"><pre class="hl">$<span class="hl opt">></span> git annex copy <span class="hl kwb">--to</span><span class="hl opt">=</span>datalad-public <span class="hl kwb">--fast</span> .
git annex copy <span class="hl kwb">--to</span><span class="hl opt">=</span>datalad-public <span class="hl kwb">--fast</span> . <span class="hl num">7.33</span>s user <span class="hl num">0.91</span>s system <span class="hl num">55</span><span class="hl opt">%</span> cpu <span class="hl num">14.772</span> total
$<span class="hl opt">></span> git annex info
repository mode<span class="hl opt">:</span> indirect
trusted repositories<span class="hl opt">:</span> <span class="hl num">0</span>
semitrusted repositories<span class="hl opt">:</span> <span class="hl num">5</span>
<span class="hl num">00000000</span><span class="hl kwb">-0000-0000-0000-000000000001 --</span> web
<span class="hl num">00000000</span><span class="hl kwb">-0000-0000-0000-000000000002 --</span> bittorrent
<span class="hl num">123</span>c73e5-a8dc-4cff-8ffc-679c7ea67f94 <span class="hl kwb">--</span> yoh@smaug<span class="hl opt">:/</span>mnt<span class="hl opt">/</span>datasets<span class="hl opt">/</span>datalad<span class="hl opt">/</span>crawl<span class="hl opt">/</span>neurovault <span class="hl opt">[</span>here<span class="hl opt">]</span>
<span class="hl num">48</span>c1556f-6241-45de-9497-338d437fcb62 <span class="hl kwb">--</span> yoh@falkor<span class="hl opt">:/</span>srv<span class="hl opt">/</span>datasets.datalad.org<span class="hl opt">/</span>www<span class="hl opt">/</span>neurovault<span class="hl opt">/</span>snapshots <span class="hl opt">[</span>datalad-public<span class="hl opt">]</span>
af2785da-2538-4346-a6f6-f2f30fc3f025 <span class="hl kwb">--</span> <span class="hl opt">[</span>datalad-archives<span class="hl opt">]</span>
untrusted repositories<span class="hl opt">:</span> <span class="hl num">0</span>
transfers <span class="hl kwa">in</span> progress<span class="hl opt">:</span> none
available <span class="hl kwb">local</span> disk space<span class="hl opt">:</span> <span class="hl num">31.42</span> terabytes <span class="hl opt">(+</span><span class="hl num">1</span> megabyte reserved<span class="hl opt">)</span>
<span class="hl kwb">local</span> annex keys<span class="hl opt">:</span> <span class="hl num">6615</span>
<span class="hl kwb">local</span> annex size<span class="hl opt">:</span> <span class="hl num">12.77</span> gigabytes
annexed files <span class="hl kwa">in</span> working tree<span class="hl opt">:</span> <span class="hl num">6628</span>
size of annexed files <span class="hl kwa">in</span> working tree<span class="hl opt">:</span> <span class="hl num">6.31</span> gigabytes
bloom filter size<span class="hl opt">:</span> <span class="hl num">32</span> mebibytes <span class="hl opt">(</span><span class="hl num">1.3</span><span class="hl opt">%</span> full<span class="hl opt">)</span>
backend usage<span class="hl opt">:</span>
SHA256E<span class="hl opt">:</span> <span class="hl num">6628</span>
$<span class="hl opt">></span> git annex <span class="hl kwc">whereis</span> | <span class="hl kwc">head</span> <span class="hl kwb">-30</span>
<span class="hl kwc">whereis</span> <span class="hl num">1003</span><span class="hl opt">/</span><span class="hl num">13873</span>.nii.gz <span class="hl opt">(</span><span class="hl num">3</span> copies<span class="hl opt">)</span>
<span class="hl num">123</span>c73e5-a8dc-4cff-8ffc-679c7ea67f94 <span class="hl kwb">--</span> yoh@smaug<span class="hl opt">:/</span>mnt<span class="hl opt">/</span>datasets<span class="hl opt">/</span>datalad<span class="hl opt">/</span>crawl<span class="hl opt">/</span>neurovault <span class="hl opt">[</span>here<span class="hl opt">]</span>
<span class="hl num">48</span>c1556f-6241-45de-9497-338d437fcb62 <span class="hl kwb">--</span> yoh@falkor<span class="hl opt">:/</span>srv<span class="hl opt">/</span>datasets.datalad.org<span class="hl opt">/</span>www<span class="hl opt">/</span>neurovault<span class="hl opt">/</span>snapshots <span class="hl opt">[</span>datalad-public<span class="hl opt">]</span>
af2785da-2538-4346-a6f6-f2f30fc3f025 <span class="hl kwb">--</span> <span class="hl opt">[</span>datalad-archives<span class="hl opt">]</span>
datalad-archives<span class="hl opt">:</span> dl<span class="hl opt">+</span>archive<span class="hl opt">:</span>SHA256E-s6460020224--710cc05117e2290e2f793271d11e26452cdc111121e09a937dbf5a34b3cc0107.tar<span class="hl opt">/</span>neurovault_snapshot<span class="hl opt">/</span><span class="hl num">1003</span><span class="hl opt">/</span><span class="hl num">13873</span>.nii.gz<span class="hl slc">#size=23262</span>
ok
<span class="hl kwc">whereis</span> <span class="hl num">1003</span><span class="hl opt">/</span><span class="hl num">13874</span>.nii.gz <span class="hl opt">(</span><span class="hl num">3</span> copies<span class="hl opt">)</span>
<span class="hl num">123</span>c73e5-a8dc-4cff-8ffc-679c7ea67f94 <span class="hl kwb">--</span> yoh@smaug<span class="hl opt">:/</span>mnt<span class="hl opt">/</span>datasets<span class="hl opt">/</span>datalad<span class="hl opt">/</span>crawl<span class="hl opt">/</span>neurovault <span class="hl opt">[</span>here<span class="hl opt">]</span>
<span class="hl num">48</span>c1556f-6241-45de-9497-338d437fcb62 <span class="hl kwb">--</span> yoh@falkor<span class="hl opt">:/</span>srv<span class="hl opt">/</span>datasets.datalad.org<span class="hl opt">/</span>www<span class="hl opt">/</span>neurovault<span class="hl opt">/</span>snapshots <span class="hl opt">[</span>datalad-public<span class="hl opt">]</span>
af2785da-2538-4346-a6f6-f2f30fc3f025 <span class="hl kwb">--</span> <span class="hl opt">[</span>datalad-archives<span class="hl opt">]</span>
...
<span class="hl opt">></span> git annex copy <span class="hl kwb">--to</span><span class="hl opt">=</span>datalad-public .
copy <span class="hl num">1003</span><span class="hl opt">/</span><span class="hl num">13873</span>.nii.gz <span class="hl opt">(</span>checking datalad-public...<span class="hl opt">)</span> yoh@datasets.datalad.org<span class="hl str">'s password:</span>
</pre></div>
comment 3 | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_3_5cd9e6b5d6d015120b5852bd212314aa/ | joey | 2016-05-31T16:17:57Z | 2016-05-31T16:02:01Z
<p>copy --to has to query the git-annex branch to see if the file is on the remote.
So it has worse locality than copy --from, which can simply stat the local
file to see if it's present.</p>
<p>Whatever inefficiencies git-annex has here are well swamped by the overhead
of git querying the branch.</p>
<p>When the remote has most of the files already, <code>git annex copy --to remote</code> is
similar to <code>git annex find --not --in remote</code>.</p>
<p>Here I ran that under /usr/bin/time, and it looks like git-annex
ran for 89 seconds out of the 260 second runtime. So at least 65% of the total
runtime is spent by git querying the branch.</p>
<pre><code>89.26user 6.92system 4:20.80elapsed 36%CPU (0avgtext+0avgdata 75584maxresident)k
516432inputs+0outputs (0major+31156minor)pagefaults 0swaps
</code></pre>
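<p>The arithmetic behind that estimate can be reproduced as a quick check. This is only a sketch of the reasoning, with names made up for illustration; it treats all wall-clock time not spent in git-annex's user CPU as time waiting on git, which is the same simplification the estimate makes.</p>
<pre><code>-- Figures taken from the /usr/bin/time output above.
userSecs, sysSecs, elapsedSecs :: Double
userSecs    = 89.26
sysSecs     = 6.92
elapsedSecs = 4 * 60 + 20.80

-- Percentage of wall-clock time the process spent on the CPU.
cpuPercent :: Double
cpuPercent = 100 * (userSecs + sysSecs) / elapsedSecs

-- Percentage of wall-clock time not accounted for by git-annex's user CPU.
waitingPercent :: Double
waitingPercent = 100 * (1 - userSecs / elapsedSecs)
</code></pre>
<p>Evaluating these gives roughly 36.9 and 65.8, consistent with the 36%CPU that time reports and with the "at least 65%" figure.</p>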
comment 4 | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_4_3ac10a07c74e5debafc9ae574d26c955/ | joey | 2016-08-03T17:49:11Z | 2016-08-03T16:02:46Z
<p>--failed can now be used to retry only failed transfers. So that will be a
lot faster in that specific case.</p>
<p>Leaving this bug open for the general wishlist that copy --fast be somehow
a lot faster than it is at finding things that need to be copied.</p>
also CPU (on git and git-annex processes) doesn't go to 100% | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_5_eb7008151a59e35c7850df3a86cf3587/ | EbvxpTI_xP9Aod7Mg4cwGhgjrCrdM5s- [me.yahoo.com/a] | 2016-09-08T16:32:08Z | 2016-09-08T16:32:08Z
<p>CPU use seems to hover around 50% for each of the git and git-annex processes. It is probably overkill, but maybe it is easy in Haskell (so just throwing the idea around): what if the communication were done asynchronously? git-annex would not wait for git to respond, but would work through its own queue of results already returned by git, submitting new queries as soon as the previous one comes out of the --batch process. That might keep both processes busy at 100%.</p>
<p>Another idea: could 'annex find' perhaps get a -J flag, so that it starts multiple git ls-files querying processes?</p>
<p>Or are both ideas too overengineered/not tractable? <img src="http://git-annex.branchable.com/smileys/smile4.png" alt=";)" /></p>
comment 6 | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_6_3a08de49e9661f9df5bab272e170461a/ | EbvxpTI_xP9Aod7Mg4cwGhgjrCrdM5s- [me.yahoo.com/a] | 2016-09-09T12:47:30Z | 2016-09-09T12:47:30Z
<p>Ha, a wild idea: instead of git ls-files git-annex | git cat-file, you could be much better off using "git archive" to dump the content of all the files under the git-annex branch!</p>
<div class="highlight-sh"><pre class="hl">$<span class="hl opt">></span> GIT_TRACE_PACKET<span class="hl opt">=</span>true GIT_TRACE_PERFORMANCE<span class="hl opt">=</span>true git annex <span class="hl kwc">find</span> <span class="hl kwb">--not --in</span> here <span class="hl opt">>/</span>dev<span class="hl opt">/</span>null
<span class="hl num">08</span><span class="hl opt">:</span><span class="hl num">46</span><span class="hl opt">:</span><span class="hl num">11.246625</span> trace.c<span class="hl opt">:</span><span class="hl num">420</span> performance<span class="hl opt">:</span> <span class="hl num">0.000291504</span> s<span class="hl opt">:</span> git <span class="hl kwb">command</span><span class="hl opt">:</span> <span class="hl str">'/usr/lib/git-annex.linux/shimmed/git/git'</span> <span class="hl str">'config'</span> <span class="hl str">'--null'</span> <span class="hl str">'--list'</span>
<span class="hl num">08</span><span class="hl opt">:</span><span class="hl num">46</span><span class="hl opt">:</span><span class="hl num">11.267559</span> trace.c<span class="hl opt">:</span><span class="hl num">420</span> performance<span class="hl opt">:</span> <span class="hl num">0.000466198</span> s<span class="hl opt">:</span> git <span class="hl kwb">command</span><span class="hl opt">:</span> <span class="hl str">'/usr/lib/git-annex.linux/shimmed/git/git'</span> <span class="hl str">'--git-dir=.git'</span> <span class="hl str">'--work-tree=.'</span> <span class="hl str">'--literal-pathspecs'</span> <span class="hl str">'show-ref'</span> <span class="hl str">'git-annex'</span>
<span class="hl num">08</span><span class="hl opt">:</span><span class="hl num">46</span><span class="hl opt">:</span><span class="hl num">11.271522</span> trace.c<span class="hl opt">:</span><span class="hl num">420</span> performance<span class="hl opt">:</span> <span class="hl num">0.000434572</span> s<span class="hl opt">:</span> git <span class="hl kwb">command</span><span class="hl opt">:</span> <span class="hl str">'/usr/lib/git-annex.linux/shimmed/git/git'</span> <span class="hl str">'--git-dir=.git'</span> <span class="hl str">'--work-tree=.'</span> <span class="hl str">'--literal-pathspecs'</span> <span class="hl str">'show-ref'</span> <span class="hl str">'--hash'</span> <span class="hl str">'refs/heads/git-annex'</span>
<span class="hl num">08</span><span class="hl opt">:</span><span class="hl num">46</span><span class="hl opt">:</span><span class="hl num">22.647051</span> trace.c<span class="hl opt">:</span><span class="hl num">420</span> performance<span class="hl opt">:</span> <span class="hl num">11.387079176</span> s<span class="hl opt">:</span> git <span class="hl kwb">command</span><span class="hl opt">:</span> <span class="hl str">'/usr/lib/git-annex.linux/shimmed/git/git'</span> <span class="hl str">'--git-dir=.git'</span> <span class="hl str">'--work-tree=.'</span> <span class="hl str">'--literal-pathspecs'</span> <span class="hl str">'ls-files'</span> <span class="hl str">'--cached'</span> <span class="hl str">'-z'</span> <span class="hl str">'--'</span>
<span class="hl num">08</span><span class="hl opt">:</span><span class="hl num">46</span><span class="hl opt">:</span><span class="hl num">23.616005</span> trace.c<span class="hl opt">:</span><span class="hl num">420</span> performance<span class="hl opt">:</span> <span class="hl num">12.339791892</span> s<span class="hl opt">:</span> git <span class="hl kwb">command</span><span class="hl opt">:</span> <span class="hl str">'/usr/lib/git-annex.linux/shimmed/git/git'</span> <span class="hl str">'--git-dir=.git'</span> <span class="hl str">'--work-tree=.'</span> <span class="hl str">'--literal-pathspecs'</span> <span class="hl str">'cat-file'</span> <span class="hl str">'--batch'</span>
<span class="hl num">08</span><span class="hl opt">:</span><span class="hl num">46</span><span class="hl opt">:</span><span class="hl num">23.616052</span> trace.c<span class="hl opt">:</span><span class="hl num">420</span> performance<span class="hl opt">:</span> <span class="hl num">12.391364205</span> s<span class="hl opt">:</span> git <span class="hl kwb">command</span><span class="hl opt">:</span> <span class="hl str">'git'</span> <span class="hl str">'annex'</span> <span class="hl str">'find'</span> <span class="hl str">'--not'</span> <span class="hl str">'--in'</span> <span class="hl str">'here'</span>
$<span class="hl opt">></span> git ls-tree <span class="hl kwb">-r --name-only</span> git-annex | <span class="hl kwc">sed</span> <span class="hl kwb">-e</span> <span class="hl str">"s/^/git-annex:/g"</span> | <span class="hl kwa">time</span> git <span class="hl kwb">--git-dir</span><span class="hl opt">=</span>.git cat-file <span class="hl kwb">--buffer --batch</span> <span class="hl opt">></span>| <span class="hl opt">/</span>tmp<span class="hl opt">/</span><span class="hl num">111</span>
git <span class="hl kwb">--git-dir</span><span class="hl opt">=</span>.git cat-file <span class="hl kwb">--buffer --batch</span> <span class="hl opt">></span>| <span class="hl opt">/</span>tmp<span class="hl opt">/</span><span class="hl num">111 7.80</span>s user <span class="hl num">0.40</span>s system <span class="hl num">99</span><span class="hl opt">%</span> cpu <span class="hl num">8.214</span> total
$<span class="hl opt">></span> <span class="hl kwa">time</span> git archive git-annex <span class="hl opt">> /</span>dev<span class="hl opt">/</span>null
git archive git-annex <span class="hl opt">> /</span>dev<span class="hl opt">/</span>null <span class="hl num">0.20</span>s user <span class="hl num">0.00</span>s system <span class="hl num">97</span><span class="hl opt">%</span> cpu <span class="hl num">0.212</span> total
</pre></div>
<p>About 40 times faster (if we disregard the time to parse/split the tar, but that should not be too much, I think).</p>
comment 7 | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_7_3f52b6e19035d3c891356c6d98035987/ | joey | 2016-09-14T16:19:45Z | 2016-09-14T15:28:23Z
<p>First, note that git-annex 6.20160619 sped up the git-annex
command startup time significantly. Please be sure to use a current
version in benchmarks, and state the version.</p>
<p><code>git archive</code> (and <code>git cat-file --batch --batch-all-objects</code>) are just
reading packs and loose objects in disk order and dumping out the contents.
<code>git cat-file --batch</code> has to look up objects in the pack index files, seek
in the pack, etc. It's not a fair comparison.</p>
<p>Note that <code>git annex find</code>, when used without options like --in or --copies,
does not need to read anything from <code>git cat-file</code> at all. The
<code>GIT_TRACE_PERFORMANCE</code> you show is misleading; it's just showing how long
the git command is left running, idle.</p>
<p><code>git annex find</code>'s overhead should be purely traversing the filesystem tree
and checking what symlinks point to files. You can write programs that do
the same thing without using git at all (or only <code>git ls-files</code>), and
compare them to git-annex's time; that would be a fairer comparison.
Ideally, <code>git annex find</code> would be entirely system call bound and would use
very little CPU itself.</p>
<p>By contrast, <code>git annex copy</code> makes significant use of <code>git cat-file --batch</code>,
since it needs to look up location log information to see if the
--to/--from remote has the files.</p>
<p><code>git annex copy -J</code> already parallelizes the parts of the code that look at
the location log. This includes spinning up a separate <code>git cat-file --batch</code>
process for each thread, so they won't contend on such queries. So I
would expect that to make it faster, even leaving aside the speed benefits
of doing the actual copies in parallel.</p>
<p>My feeling is that the best way to speed these up is going to be in one
of these classes:</p>
<ul>
<li><p>It's possible that <code>git cat-file --batch</code> is somehow slower than it needs
to be. Perhaps it's not doing good caching between queries or has
inefficient serialization/bad stdio buffering. It might just be the case
that using something like libgit2 instead would be faster.
(Due to libgit2's poor interface stability, it would have to be an
optional build flag.)</p></li>
<li><p>Many small optimisations to the code. The use of Strings throughout
git-annex could well be a source of systematic small inefficiencies,
and using ByteString might eliminate those. (But this would be a huge job.)
(The <code>git cat-file --batch</code> communication is already done using
bytestrings.)</p></li>
<li><p>A completely lateral move. For example, if git-annex kept its own
database recording which files are present, then <code>git annex find</code>
could do a simple database query and not need to chase all the symlinks.
But such a database needs to somehow be kept in sync or reconciled
with the git index, which is not an easy thing.</p></li>
</ul>
profiling | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_8_c1f99493f5e5c362d5c39f048280b11b/ | joey | 2016-09-26T20:53:50Z | 2016-09-26T19:20:36Z
<p>Built git-annex with profiling, using <code>stack build --profile</code></p>
<p>(For reproducibility, running git-annex in a clone of the git-annex repo
https://github.com/RichiH/conference_proceedings with rev
2797a49023fc24aff6fcaec55421572e1eddcfa2 checked out. It has 9496 annexed
objects.)</p>
<p>Profiling <code>git-annex find +RTS -p</code>:</p>
<pre><code>total time  = 3.53 secs (3530 ticks @ 1000 us, 1 processor)
total alloc = 3,772,700,720 bytes (excludes profiling overheads)

COST CENTRE                MODULE                  %time  %alloc
spanList                   Data.List.Utils          32.6    37.7
startswith                 Data.List.Utils          14.3     8.1
md5                        Data.Hash.MD5            12.4    18.2
join                       Data.List.Utils           6.9    13.7
catchIO                    Utility.Exception         5.9     6.0
catches                    Control.Monad.Catch       5.0     2.8
inAnnex'.checkindirect     Annex.Content             4.6     1.8
readish                    Utility.PartialPrelude    3.0     1.4
isAnnexLink                Annex.Link                2.6     4.0
split                      Data.List.Utils           1.5     0.8
keyPath                    Annex.Locations           1.2     1.7
</code></pre>
<p>This is interesting!</p>
<p>Fully 40% of CPU time and allocations are in list (really String) processing,
and the details of the profiling report show that <code>spanList</code> and <code>startswith</code>
and <code>join</code> are all coming from calls to <code>replace</code> in <code>keyFile</code> and <code>fileKey</code>.
Both functions nest several calls to replace, so perhaps that could be unwound
into a single pass and/or a ByteString used to do it more efficiently.</p>
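<p>A single-pass version could look roughly like the following. This is a minimal sketch and not git-annex's actual code: the names <code>escapeKey</code> and <code>unescapeKey</code> and the escape table are assumptions made for illustration, and a ByteString variant would follow the same shape.</p>
<pre><code>-- Hypothetical escape table; the real keyFile/fileKey rules may differ.
escapeChar :: Char -> String
escapeChar '&' = "&a"
escapeChar '%' = "&s"
escapeChar ':' = "&c"
escapeChar '/' = "%"
escapeChar c   = [c]

-- One traversal of the input, instead of one traversal per nested replace.
escapeKey :: String -> String
escapeKey = concatMap escapeChar

-- Inverse of escapeKey, also done in a single pass.
unescapeKey :: String -> String
unescapeKey ('&' : 'a' : rest) = '&' : unescapeKey rest
unescapeKey ('&' : 's' : rest) = '%' : unescapeKey rest
unescapeKey ('&' : 'c' : rest) = ':' : unescapeKey rest
unescapeKey ('%' : rest)       = '/' : unescapeKey rest
unescapeKey (c : rest)         = c : unescapeKey rest
unescapeKey []                 = []
</code></pre>
<p>Each input character is inspected once, so no intermediate strings are built between substitutions, which is where the nested <code>replace</code> calls spend their allocations.</p>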
<p>12% of run time is spent calculating the md5 hashes for the hash
directories for .git/annex/objects. Data.Hash.MD5 is from missingh, and
it is probably a quite unoptimised version. Switching to the version
in cryptonite would probably speed it up a lot.</p>
more profiling | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_9_f4d802a28b79905da0cb24af6cb65b0a/ | joey | 2016-09-26T20:53:50Z | 2016-09-26T19:59:43Z
<p>Instead of profiling <code>git annex copy --to remote</code>, I profiled <code>git annex
find --not --in web</code>, which needs to do the same kind of location log lookup.</p>
<pre><code>total time  = 12.41 secs (12413 ticks @ 1000 us, 1 processor)
total alloc = 8,645,057,104 bytes (excludes profiling overheads)

COST CENTRE                MODULE                  %time  %alloc
adjustGitEnv               Git.Env                  21.4    37.0
catchIO                    Utility.Exception        13.2     2.8
spanList                   Data.List.Utils          12.6    17.9
parsePOSIXTime             Logs.TimeStamp            6.1     5.0
catObjectDetails.receive   Git.CatFile               5.9     2.1
startswith                 Data.List.Utils           5.7     3.8
md5                        Data.Hash.MD5             5.1     7.9
join                       Data.List.Utils           2.4     6.0
readFileStrictAnyEncoding  Utility.Misc              2.2     0.5
</code></pre>
<p>The adjustGitEnv overhead is a surprise! It seems it is getting called once
per file, and allocating a new copy of the environment each time. Call stack:
withIndex calls withIndexFile calls addGitEnv calls adjustGitEnv.
Looks like simply making gitEnv be cached at startup would avoid most of
the adjustGitEnv slowdown.</p>
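<p>The caching idea can be sketched as follows. This is an illustration under assumptions, not the actual fix: <code>Env</code>, <code>setVar</code>, <code>loadBaseEnv</code> and <code>envWithIndex</code> are hypothetical names, and it assumes the per-file adjustment amounts to setting <code>GIT_INDEX_FILE</code> on top of the process environment.</p>
<pre><code>import System.Environment (getEnvironment)

type Env = [(String, String)]

-- Pure update: set one variable in an environment that was already fetched.
setVar :: String -> String -> Env -> Env
setVar k v env = (k, v) : filter ((/= k) . fst) env

-- Read the process environment once, at startup, and keep the result.
loadBaseEnv :: IO Env
loadBaseEnv = getEnvironment

-- Per-file work reuses the cached copy instead of re-reading the environment.
-- GIT_INDEX_FILE is the variable git uses to locate an alternate index file.
envWithIndex :: Env -> FilePath -> Env
envWithIndex base indexFile = setVar "GIT_INDEX_FILE" indexFile base
</code></pre>
<p>With this shape, <code>loadBaseEnv</code> runs once and every later call only pays for the pure <code>setVar</code>.</p>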
<p>(The catchIO overhead is a false reading; the detailed profile shows
that all its time and allocations are inherited. getAnnexLinkTarget
is running catchIO in the expensive case, so readSymbolicLink is
the actual expensive bit.)</p>
<p>The parsePOSIXTime comes from reading location logs. It's implemented
using a generic Data.Time.Format.parseTime, which uses a format string
"%s%Qs". A custom parser that splits into seconds and picoseconds
and simply reads both numbers might be more efficient.</p>
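<p>Such a custom parser might look like the sketch below. It is hedged: <code>parseTimeStamp</code> is a hypothetical name, and the input shape (digits, an optional fractional part, then a literal <code>s</code>) is inferred from the "%s%Qs" format string rather than taken from the real code.</p>
<pre><code>import Data.Char (isDigit)
import Data.Ratio ((%))
import Data.Time.Clock.POSIX (POSIXTime)

-- Parse "SECONDS", optionally followed by ".FRACTION", then a literal 's'.
parseTimeStamp :: String -> Maybe POSIXTime
parseTimeStamp str =
  case span isDigit str of
    ("", _)      -> Nothing
    (secs, rest) -> fmap (fromInteger (read secs) +) (parseRest rest)
  where
    -- No fractional part at all.
    parseRest "s" = Just 0
    -- Fractional part: read the digits as an integer and scale them down.
    parseRest ('.' : more) =
      case span isDigit more of
        (ds@(_ : _), "s") -> Just (fromRational (read ds % (10 ^ length ds)))
        _                 -> Nothing
    parseRest _ = Nothing
</code></pre>
<p>Both numbers are read as plain integers, so no format string has to be interpreted for each log line.</p>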
<p>catObjectDetails.receive is implemented using mostly String and could
probably be sped up by being converted to use ByteString.</p>
comment 10 | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_10_1af4ac0d37c876912678522895c1656b/ | joey | 2016-09-29T21:00:02Z | 2016-09-29T18:33:33Z
<ul>
<li>Optimised key2file and file2key. 18% scanning time speedup.</li>
<li>Optimised adjustGitEnv. 50% git-annex branch query speedup</li>
<li>Optimised parsePOSIXTime. 10% git-annex branch query speedup</li>
<li>Tried making catObjectDetails.receive use ByteString for parsing,
but that did not seem to speed it up significantly.
So its parsing is already fairly optimal, it's just that a
lot of data passes through it when querying the git-annex
branch.</li>
</ul>
<p>After all that, profiling <code>git-annex find</code>:</p>
<pre><code> Thu Sep 29 16:51 2016 Time and Allocation Profiling Report (Final)
git-annex.1 +RTS -p -RTS find
total time  = 1.73 secs (1730 ticks @ 1000 us, 1 processor)
total alloc = 1,812,406,632 bytes (excludes profiling overheads)

COST CENTRE                MODULE                  %time  %alloc
md5                        Data.Hash.MD5            28.0    37.9
catchIO                    Utility.Exception        10.2    12.5
inAnnex'.checkindirect     Annex.Content             9.9     3.7
catches                    Control.Monad.Catch       8.7     5.7
readish                    Utility.PartialPrelude    5.7     3.0
isAnnexLink                Annex.Link                5.0     8.4
keyFile                    Annex.Locations           4.2     5.8
spanList                   Data.List.Utils           4.0     6.3
startswith                 Data.List.Utils           2.0     1.3
</code></pre>
<p>And <code>git-annex find --not --in web</code>:</p>
<pre><code> Thu Sep 29 16:35 2016 Time and Allocation Profiling Report (Final)
git-annex +RTS -p -RTS find --not --in web
total time  = 5.24 secs (5238 ticks @ 1000 us, 1 processor)
total alloc = 3,293,314,472 bytes (excludes profiling overheads)

COST CENTRE                MODULE                  %time  %alloc
catObjectDetails.receive   Git.CatFile              12.9     5.5
md5                        Data.Hash.MD5            10.6    20.8
readish                    Utility.PartialPrelude    7.3     8.2
catchIO                    Utility.Exception         6.7     7.3
spanList                   Data.List.Utils           4.1     7.4
readFileStrictAnyEncoding  Utility.Misc              3.5     1.3
catches                    Control.Monad.Catch       3.3     3.2
</code></pre>
<p>So, quite a large speedup overall!</p>
<p>This leaves md5 still unoptimised at 10-28% of CPU use. I looked at switching
it to cryptohash's implementation, but it would require quite a lot of
bit-banging math to pull the used values out of the ByteString containing
the md5sum.</p>
comment 11 | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_11_1ca8d9765e6e3a18ae09df74bc390a00/ | joey | 2017-05-16T05:07:48Z | 2017-05-15T21:56:52Z
<p>Switched from MissingH to cryptonite for md5. It did move md5 out of the top CPU spot but
the overall runtime didn't change much. Memory allocations did go down by a
good amount.</p>
<p>Updated profiles:</p>
<pre><code>git-annex +RTS -p -RTS find
total time  = 1.63 secs (1629 ticks @ 1000 us, 1 processor)
total alloc = 1,496,336,496 bytes (excludes profiling overheads)

COST CENTRE             MODULE                      SRC                                             %time  %alloc
catchIO                 Utility.Exception           Utility/Exception.hs:79:1-17                     14.1    15.1
inAnnex'.checkindirect  Annex.Content               Annex/Content.hs:(108,9)-(119,39)                10.6     4.8
catches                 Control.Monad.Catch         src/Control/Monad/Catch.hs:(432,1)-(436,76)       8.6     6.9
spanList                Data.List.Utils             src/Data/List/Utils.hs:(150,1)-(155,36)           6.7    11.1
isAnnexLink             Annex.Link                  Annex/Link.hs:35:1-85                             5.0    10.2
keyFile                 Annex.Locations             Annex/Locations.hs:(456,1)-(462,19)               5.0     7.0
readish                 Utility.PartialPrelude      Utility/PartialPrelude.hs:(48,1)-(50,20)          3.8     2.0
startswith              Data.List.Utils             src/Data/List/Utils.hs:103:1-23                   3.6     2.3
splitc                  Utility.Misc                Utility/Misc.hs:(52,1)-(54,25)                    3.4     6.5
s2w8                    Data.Bits.Utils             src/Data/Bits/Utils.hs:65:1-15                    2.6     6.4
keyPath                 Annex.Locations             Annex/Locations.hs:(492,1)-(494,23)               2.5     4.4
fileKey.unesc           Annex.Locations             Annex/Locations.hs:(469,9)-(474,39)               2.0     3.5
copyAndFreeze           Data.ByteArray.Methods      Data/ByteArray/Methods.hs:(224,1)-(227,21)        1.8     0.5

git-annex +RTS -p -RTS find --not --in web
total time  = 5.33 secs (5327 ticks @ 1000 us, 1 processor)
total alloc = 2,908,489,000 bytes (excludes profiling overheads)

COST CENTRE             MODULE                      SRC                                             %time  %alloc
catObjectDetails.\      Git.CatFile                 Git/CatFile.hs:(80,72)-(88,97)                    7.8     2.8
catchIO                 Utility.Exception           Utility/Exception.hs:79:1-17                      7.6     8.3
spanList                Data.List.Utils             src/Data/List/Utils.hs:(150,1)-(155,36)           5.8     9.1
readish                 Utility.PartialPrelude      Utility/PartialPrelude.hs:(48,1)-(50,20)          4.5     4.0
parseResp               Git.CatFile                 Git/CatFile.hs:(113,1)-(124,28)                   4.4     2.9
readFileStrict          Utility.Misc                Utility/Misc.hs:33:1-59                           3.7     1.6
catches                 Control.Monad.Catch         src/Control/Monad/Catch.hs:(432,1)-(436,76)       3.1     3.6
encodeW8                Utility.FileSystemEncoding  Utility/FileSystemEncoding.hs:(131,1)-(133,70)    3.1     2.3
</code></pre>
comment 12 | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_12_14856a2886cf73d1bee57ef9fad01661/ | joey | 2017-10-30T19:12:36Z | 2017-10-30T18:48:21Z
<p>There's of course always the possibility of more speed improvements, but I'm
wondering if this has already been addressed sufficiently to close it?</p>
thanks | http://git-annex.branchable.com/todo/make_copy_--fast__faster/comment_13_0f8e2127cea96c4f9609fa7599b1eec9/ | yarikoptic | 2017-10-30T20:04:48Z | 2017-10-30T20:04:48Z
<p>I do think that things became smoother/faster since then. I guess we could consider this one closed for now, and I will keep in mind that --from mode is faster.</p>
<p>Cheers,</p>