I was trying to copy the files which had failed to copy (3 out of 6,000) to a remote host after a `copy -J4`. That run succeeded. But on subsequent runs, apparently even with `copy --fast`, it takes annex 10 seconds to realize there is nothing to copy. `git ls-files`, which annex calls, returns its list immediately, so it is really some parsing of/access to the data under the git-annex branch which takes a while. I think we had a similar discussion before but I couldn't find it. So I wondered whether to whine again to see if some optimization is possible to make `--fast` copies faster, especially whenever there is nothing to copy.
closing as yoh is happy; done. Note that I copied benchmarking-related comments to the profiling page for future reference. --Joey
"to remote host " so it was "--to". annex is already aware of having those files in that remote (see below).
copy --to has to query the git-annex branch to see if the file is on the remote. So it has worse locality than copy --from, which can simply stat the local file to see if it's present.
Whatever inefficiencies git-annex has here are well swamped by the overhead of git querying the branch.
When the remote has most of the files already, `git annex copy --to remote` is similar to `git annex find --not --in remote`.

Here I ran that under /usr/bin/time, and it looks like git-annex ran for 89 seconds out of the 260 second total runtime. So at least 65% of the total runtime is spent by git querying the branch.
--failed can now be used to retry only failed transfers. So that will be a lot faster in that specific case.
Leaving this bug open for the general wishlist that copy --fast be somehow a lot faster than it is at finding things that need to be copied.
CPU usage seems to wobble around 50% for each of the git and git-annex processes... It would probably be overkill, but maybe it is easy in Haskell (so just throwing the idea around): if the communication were done in an async fashion, git-annex wouldn't wait for git to respond, but would process its own queue of results already returned from git while submitting new queries as soon as previous answers come out of the --batch process. That might make both processes busy at 100%. (A rough sketch of the idea is below.)
Another idea -- could maybe `annex find` get a -J flag, thus starting multiple `git ls-files` querying processes?
Or are both ideas too overengineered/not tractable?
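Not git-annex code, just a minimal standalone sketch of the pipelining idea, assuming a `git cat-file --batch` process reading objects from the git-annex branch: one thread keeps submitting object names while another consumes the replies, so neither side sits idle waiting on the other. The single query used here is only an example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: pipeline queries to `git cat-file --batch` so the writer and
-- reader run concurrently instead of strictly request/response.
import Control.Concurrent.Async (concurrently_)
import qualified Data.ByteString.Char8 as B
import System.IO
import System.Process

main :: IO ()
main = do
  (Just hin, Just hout, _, _) <- createProcess
      (proc "git" ["cat-file", "--batch"])
          { std_in = CreatePipe, std_out = CreatePipe }
  hSetBuffering hin LineBuffering
  -- Example query; a real run would stream thousands of
  -- "git-annex:xxx/yyy/KEY.log" style object names here.
  let queries = ["git-annex:uuid.log"]
  concurrently_
      (mapM_ (B.hPutStrLn hin) queries >> hClose hin)   -- keep feeding queries
      (mapM_ (const (readReply hout)) queries)          -- consume replies as they arrive

-- Read one reply: a "<sha> <type> <size>" header line, then the object body.
readReply :: Handle -> IO ()
readReply h = do
  header <- B.hGetLine h
  case B.words header of
    [_, _, sizeStr] | Just (size, _) <- B.readInt sizeStr -> do
        body <- B.hGet h (size + 1)   -- object content plus trailing newline
        B.putStr body
    _ -> B.putStrLn ("unexpected reply: " <> header)
```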
Ha -- a wild idea: instead of `git ls-files git-annex | git cat-file` you could be much better off using `git archive` to dump the content of all the files under the git-annex branch!

About 40x faster (if we disregard the time to parse/split the tar, but that should not be too much, I think).
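To make that concrete, here is a rough sketch (not git-annex code) of reading the whole branch in one streaming pass via `git archive`, assuming the `tar` package for parsing the stream; error handling and the actual location log parsing are omitted.

```haskell
-- Sketch: dump every file on the git-annex branch in one pass with
-- `git archive`, instead of one `git cat-file` round trip per file.
import qualified Codec.Archive.Tar as Tar
import qualified Data.ByteString.Lazy as L
import System.Process

main :: IO ()
main = do
  (_, Just hout, _, _) <- createProcess
      (proc "git" ["archive", "git-annex"]) { std_out = CreatePipe }
  raw <- L.hGetContents hout
  -- Just list entry names here; a real consumer would parse the
  -- location log content of each entry.
  let entries = Tar.foldEntries (:) [] (const []) (Tar.read raw)
  mapM_ (putStrLn . Tar.entryPath) entries
  putStrLn (show (length entries) ++ " files on the git-annex branch")
```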
First, note that git-annex 6.20160619 sped up the git-annex command startup time significantly. Please be sure to use a current version in benchmarks, and state the version.
`git archive` (and `git cat-file --batch --batch-all-objects`) are just reading packs and loose objects in disk order and dumping out the contents. `git cat-file --batch` has to look up objects in the pack index files, seek in the pack, etc. It's not a fair comparison.

Note that `git annex find`, when used without options like --in or --copies, does not need to read anything from `git cat-file` at all. The GIT_TRACE_PERFORMANCE you show is misleading; it's just showing how long the git command is left running, idle. `git annex find`'s overhead should be purely traversing the filesystem tree and checking what symlinks point to files. You can write programs that do the same thing without using git at all (or only `git ls-files`), and compare them to git-annex's time; that would be a fairer comparison. Ideally, `git annex find` would be entirely system call bound and would use very little CPU itself.

By contrast, `git annex copy` makes significant use of `git cat-file --batch`, since it needs to look up location log information to see if the --to/--from remote has the files. `git annex copy -J` already parallelizes the parts of the code that look at the location log, including spinning up a separate `git cat-file --batch` process for each thread, so they won't contend on such queries. So I would expect that to make it faster, even leaving aside the speed benefits of doing the actual copies in parallel.

My feeling is that the best way to speed these up is going to be in one of these classes:
* It's possible that `git cat-file --batch` is somehow slower than it needs to be. Perhaps it's not doing good caching between queries, or has inefficient serialization/bad stdio buffering. It might just be the case that using something like libgit2 instead would be faster. (Due to libgit2's poor interface stability, it would have to be an optional build flag.)
* Many small optimisations to the code. The use of Strings throughout git-annex could well be a source of systematic small inefficiencies, and using ByteString might eliminate those. (But this would be a huge job.) (The `git cat-file --batch` communication is already done using bytestrings.)
* A completely lateral move. For example, if git-annex kept its own database recording which files are present, then `git annex find` could do a simple database query and not need to chase all the symlinks. But such a database needs to somehow be kept in sync or reconciled with the git index; it's not an easy thing.

Built git-annex with profiling, using `stack build --profile`.
(For reproducibility, running git-annex in a clone of https://github.com/RichiH/conference_proceedings with rev 2797a49023fc24aff6fcaec55421572e1eddcfa2 checked out. It has 9496 annexed objects.)
Profiling `git-annex find +RTS -p`:

This is interesting!
Fully 40% of CPU time and allocations are in list (really String) processing, and the details of the profiling report show that `spanList`, `startsWith`, and `join` are all coming from calls to `replace` in `keyFile` and `fileKey`. Both functions nest several calls to replace, so perhaps that could be unwound into a single pass and/or done more efficiently with a ByteString. (A rough sketch of a single-pass version is below.)

12% of run time is spent calculating the md5 hashes for the hash directories for .git/annex/objects. Data.Hash.MD5 is from missingh, and it is probably a quite unoptimised version. Switching to the version in cryptonite would probably speed it up a lot.
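For what it's worth, here is a hedged sketch of what a single-pass rewrite could look like. The escape pairs below are only illustrative; the real rules live in keyFile/fileKey and would have to be copied exactly.

```haskell
-- Sketch: do the key<->file escaping in one pass over the string instead
-- of several nested `replace` passes. Escape pairs here are examples,
-- not necessarily git-annex's exact ones.
escapeKeyName :: String -> String
escapeKeyName = concatMap esc
  where
    esc '/' = "%"     -- example: '/' cannot appear in a file name component
    esc '%' = "&s"
    esc '&' = "&a"
    esc c   = [c]

unescapeKeyName :: String -> String
unescapeKeyName [] = []
unescapeKeyName ('%':rest)     = '/' : unescapeKeyName rest
unescapeKeyName ('&':'s':rest) = '%' : unescapeKeyName rest
unescapeKeyName ('&':'a':rest) = '&' : unescapeKeyName rest
unescapeKeyName (c:rest)       = c   : unescapeKeyName rest
```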
Instead of profiling `git annex copy --to remote`, I profiled `git annex find --not --in web`, which needs to do the same kind of location log lookup.

The adjustGitEnv overhead is a surprise! It seems it is getting called once per file, and allocating a new copy of the environment each time. Call stack: withIndex calls withIndexFile calls addGitEnv calls adjustGitEnv. Looks like simply making gitEnv be cached at startup would avoid most of the adjustGitEnv slowdown.
(The catchIO overhead is a false reading; the detailed profile shows that all its time and allocations are inherited. getAnnexLinkTarget is running catchIO in the expensive case, so readSymbolicLink is the actual expensive bit.)
The parsePOSIXTime overhead comes from reading location logs. It's implemented using the generic Data.Time.Format.parseTime with the format string "%s%Qs". A custom parser that splits the string into seconds and picoseconds and simply reads both numbers might be more efficient.
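A hedged sketch of what such a custom parser might look like; the trailing "s" handling and the function name are assumptions, and the real code would want something more careful.

```haskell
-- Sketch: parse "<seconds>.<fraction>s" style timestamps directly instead
-- of going through the generic Data.Time.Format parser.
import Data.Time.Clock.POSIX (POSIXTime)
import Text.Read (readMaybe)

parseLogTime :: String -> Maybe POSIXTime
parseLogTime str = case break (== '.') (takeWhile (/= 's') str) of
    (secs, "")       -> fromInteger <$> readMaybe secs
    (secs, '.':frac) -> do
        s <- readMaybe secs
        f <- readMaybe frac
        let denom = 10 ^ length frac :: Integer
        return (fromInteger s + fromInteger f / fromInteger denom)
    _                -> Nothing
```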
catObjectDetails.receive is implemented using mostly String and could probably be sped up by being converted to use ByteString.
After all that, profiling `git-annex find`:

And `git-annex find --not --in web`:

So, quite a large speedup overall!
This leaves md5 still unoptimised at 10-28% of CPU use. I looked at switching it to cryptohash's implementation, but it would require quite a lot of bit-banging math to pull the used values out of the ByteString containing the md5sum.
Switched from MissingH to cryptonite for md5. It did move md5 out of the top CPU spot but the overall runtime didn't change much. Memory allocations did go down by a good amount.
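For illustration, a sketch of what the cryptonite-based md5 and the "bit-banging" to pull numeric values back out of the digest might look like; the byte order and word size chosen here are assumptions, not necessarily what the hash directory code needs.

```haskell
-- Sketch: md5 via cryptonite, then combine the digest bytes back into
-- 32-bit words (little-endian here, purely for illustration).
import Crypto.Hash (Digest, MD5, hash)
import qualified Data.ByteArray as BA
import qualified Data.ByteString.Char8 as B
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)

md5words :: String -> [Word32]
md5words s = pack32 (BA.unpack digest)
  where
    digest = hash (B.pack s) :: Digest MD5
    -- combine each group of 4 bytes into one word
    pack32 (a:b:c:d:rest) = word a b c d : pack32 rest
    pack32 _              = []
    word a b c d = fromIntegral a
               .|. (fromIntegral b `shiftL` 8)
               .|. (fromIntegral c `shiftL` 16)
               .|. (fromIntegral d `shiftL` 24)
```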
Updated profiles:
There's of course always the possibility of more speed improvements, but I'm wondering if this has already been addressed sufficiently to close it?
I do think that things have become smoother/faster since then. I guess we could consider this one closed for now, and I will keep in mind that the --from mode is faster.
Cheers,