I was trying to copy the files which had failed to copy (3 out of 6,000) to a remote host after a `copy -J4`. That run succeeded. But on subsequent runs, apparently even with `copy --fast`, it takes annex 10 seconds to realize there is nothing to copy. `git ls-files`, which annex calls, returns its list immediately, so it is really some parsing of/access to the data under the git-annex branch which takes a while. I think we had a similar discussion before but I couldn't find it. So I wondered whether to whine again to see if some optimization is possible to make `--fast` copies faster, especially whenever there is nothing to copy.
closing as yoh is happy; done. Note that I copied benchmarking-related comments to the profiling page for future reference. --Joey
"to remote host " so it was "--to". annex is already aware of having those files in that remote (see below).
copy --to has to query the git-annex branch to see if the file is on the remote. So it has worse locality than copy --from, which can simply stat the local file to see if it's present.
Whatever inefficiencies git-annex has here are well swamped by the overhead of git querying the branch.
When the remote has most of the files already, `git annex copy --to remote` is similar to `git annex find --not --in remote`.

Here I ran that under /usr/bin/time, and it looks like git-annex ran for 89 seconds out of the 260 second total runtime. So at least 65% of the total runtime is spent by git querying the branch.
--failed can now be used to retry only failed transfers. So that will be a lot faster in that specific case.
Leaving this bug open for the general wishlist that copy --fast be somehow a lot faster than it is at finding things that need to be copied.
CPU usage seems to wobble around 50% for each of the git and git-annex processes... It would probably be overkill, but maybe it is easy in Haskell (so just throwing the idea around): if the communication were done in an async fashion, git-annex wouldn't wait for git to respond, but would process its own queue of results already returned from git while submitting new queries as soon as previous answers come out of the --batch process. That might make both processes busy at 100%. (A rough sketch of the idea is below.)
Another idea -- could maybe `annex find` get a -J flag, thus starting multiple `git ls-files` querying processes?
Or are both ideas too overengineered/not tractable?
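Not git-annex code, just a minimal standalone sketch of the pipelining idea, assuming a `git cat-file --batch` process reading objects from the git-annex branch: one thread keeps submitting object names while another consumes the replies, so neither side sits idle waiting on the other. The single query used here is only an example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: pipeline queries to `git cat-file --batch` so the writer and
-- reader run concurrently instead of strictly request/response.
import Control.Concurrent.Async (concurrently_)
import qualified Data.ByteString.Char8 as B
import System.IO
import System.Process

main :: IO ()
main = do
  (Just hin, Just hout, _, _) <- createProcess
      (proc "git" ["cat-file", "--batch"])
          { std_in = CreatePipe, std_out = CreatePipe }
  hSetBuffering hin LineBuffering
  -- Example query; a real run would stream thousands of
  -- "git-annex:xxx/yyy/KEY.log" style object names here.
  let queries = ["git-annex:uuid.log"]
  concurrently_
      (mapM_ (B.hPutStrLn hin) queries >> hClose hin)   -- keep feeding queries
      (mapM_ (const (readReply hout)) queries)          -- consume replies as they arrive

-- Read one reply: a "<sha> <type> <size>" header line, then the object body.
readReply :: Handle -> IO ()
readReply h = do
  header <- B.hGetLine h
  case B.words header of
    [_, _, sizeStr] | Just (size, _) <- B.readInt sizeStr -> do
        body <- B.hGet h (size + 1)   -- object content plus trailing newline
        B.putStr body
    _ -> B.putStrLn ("unexpected reply: " <> header)
```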
Ha -- a wild idea: instead of `git ls-files git-annex | git cat-file` you could be much better off using `git archive` to dump the content of all the files under the git-annex branch!

About 40x faster (if we disregard the time to parse/split the tar, but that should not be too much, I think).
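To make that concrete, here is a rough sketch (not git-annex code) of reading the whole branch in one streaming pass via `git archive`, assuming the `tar` package for parsing the stream; error handling and the actual location log parsing are omitted.

```haskell
-- Sketch: dump every file on the git-annex branch in one pass with
-- `git archive`, instead of one `git cat-file` round trip per file.
import qualified Codec.Archive.Tar as Tar
import qualified Data.ByteString.Lazy as L
import System.Process

main :: IO ()
main = do
  (_, Just hout, _, _) <- createProcess
      (proc "git" ["archive", "git-annex"]) { std_out = CreatePipe }
  raw <- L.hGetContents hout
  -- Just list entry names here; a real consumer would parse the
  -- location log content of each entry.
  let entries = Tar.foldEntries (:) [] (const []) (Tar.read raw)
  mapM_ (putStrLn . Tar.entryPath) entries
  putStrLn (show (length entries) ++ " files on the git-annex branch")
```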
First, note that git-annex 6.20160619 sped up the git-annex command startup time significantly. Please be sure to use a current version in benchmarks, and state the version.
`git archive` (and `git cat-file --batch --batch-all-objects`) are just reading packs and loose objects in disk order and dumping out the contents. `git cat-file --batch` has to look up objects in the pack index files, seek in the pack, etc. It's not a fair comparison.

Note that `git annex find`, when used without options like --in or --copies, does not need to read anything from `git cat-file` at all. The GIT_TRACE_PERFORMANCE you show is misleading; it's just showing how long the git command is left running, idle. `git annex find`'s overhead should be purely traversing the filesystem tree and checking what symlinks point to files. You can write programs that do the same thing without using git at all (or only `git ls-files`), and compare them to git-annex's time; that would be a fairer comparison. Ideally, `git annex find` would be entirely system call bound and would use very little CPU itself.

By contrast, `git annex copy` makes significant use of `git cat-file --batch`, since it needs to look up location log information to see if the --to/--from remote has the files. `git annex copy -J` already parallelizes the parts of the code that look at the location log, including spinning up a separate `git cat-file --batch` process for each thread, so they won't contend on such queries. So I would expect that to make it faster, even leaving aside the speed benefits of doing the actual copies in parallel.

My feeling is that the best way to speed these up is going to be in one of these classes:
* It's possible that `git cat-file --batch` is somehow slower than it needs to be. Perhaps it's not doing good caching between queries, or has inefficient serialization/bad stdio buffering. It might just be the case that using something like libgit2 instead would be faster. (Due to libgit2's poor interface stability, it would have to be an optional build flag.)
* Many small optimisations to the code. The use of Strings throughout git-annex could well be a source of systematic small inefficiencies, and using ByteString might eliminate those. (But this would be a huge job.) (The `git cat-file --batch` communication is already done using bytestrings.)
* A completely lateral move. For example, if git-annex kept its own database recording which files are present, then `git annex find` could do a simple database query and not need to chase all the symlinks. But such a database needs to somehow be kept in sync or reconciled with the git index; it's not an easy thing.

Built git-annex with profiling, using `stack build --profile`.
(For reproducibility, running git-annex in a clone of https://github.com/RichiH/conference_proceedings with rev 2797a49023fc24aff6fcaec55421572e1eddcfa2 checked out. It has 9496 annexed objects.)
Profiling `git-annex find +RTS -p`:

This is interesting!
Fully 40% of CPU time and allocations are in list (really String) processing, and the details of the profiling report show that `spanList`, `startsWith`, and `join` are all coming from calls to `replace` in `keyFile` and `fileKey`. Both functions nest several calls to replace, so perhaps that could be unwound into a single pass and/or done more efficiently with a ByteString. (A rough sketch of a single-pass version is below.)

12% of run time is spent calculating the md5 hashes for the hash directories for .git/annex/objects. Data.Hash.MD5 is from missingh, and it is probably a quite unoptimised version. Switching to the version in cryptonite would probably speed it up a lot.
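For what it's worth, here is a hedged sketch of what a single-pass rewrite could look like. The escape pairs below are only illustrative; the real rules live in keyFile/fileKey and would have to be copied exactly.

```haskell
-- Sketch: do the key<->file escaping in one pass over the string instead
-- of several nested `replace` passes. Escape pairs here are examples,
-- not necessarily git-annex's exact ones.
escapeKeyName :: String -> String
escapeKeyName = concatMap esc
  where
    esc '/' = "%"     -- example: '/' cannot appear in a file name component
    esc '%' = "&s"
    esc '&' = "&a"
    esc c   = [c]

unescapeKeyName :: String -> String
unescapeKeyName [] = []
unescapeKeyName ('%':rest)     = '/' : unescapeKeyName rest
unescapeKeyName ('&':'s':rest) = '%' : unescapeKeyName rest
unescapeKeyName ('&':'a':rest) = '&' : unescapeKeyName rest
unescapeKeyName (c:rest)       = c   : unescapeKeyName rest
```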
Instead of profiling `git annex copy --to remote`, I profiled `git annex find --not --in web`, which needs to do the same kind of location log lookup.

The adjustGitEnv overhead is a surprise! It seems it is getting called once per file, and allocating a new copy of the environment each time. Call stack: withIndex calls withIndexFile calls addGitEnv calls adjustGitEnv. Looks like simply making gitEnv be cached at startup would avoid most of the adjustGitEnv slowdown.
(The catchIO overhead is a false reading; the detailed profile shows that all its time and allocations are inherited. getAnnexLinkTarget is running catchIO in the expensive case, so readSymbolicLink is the actual expensive bit.)
The parsePOSIXTime overhead comes from reading location logs. It's implemented using the generic Data.Time.Format.parseTime with the format string "%s%Qs". A custom parser that splits the string into seconds and picoseconds and simply reads both numbers might be more efficient.
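A hedged sketch of what such a custom parser might look like; the trailing "s" handling and the function name are assumptions, and the real code would want something more careful.

```haskell
-- Sketch: parse "<seconds>.<fraction>s" style timestamps directly instead
-- of going through the generic Data.Time.Format parser.
import Data.Time.Clock.POSIX (POSIXTime)
import Text.Read (readMaybe)

parseLogTime :: String -> Maybe POSIXTime
parseLogTime str = case break (== '.') (takeWhile (/= 's') str) of
    (secs, "")       -> fromInteger <$> readMaybe secs
    (secs, '.':frac) -> do
        s <- readMaybe secs
        f <- readMaybe frac
        let denom = 10 ^ length frac :: Integer
        return (fromInteger s + fromInteger f / fromInteger denom)
    _                -> Nothing
```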
catObjectDetails.receive is implemented using mostly String and could probably be sped up by being converted to use ByteString.
After all that, profiling `git-annex find`:

And `git-annex find --not --in web`:

So, quite a large speedup overall!
This leaves md5 still unoptimised at 10-28% of CPU use. I looked at switching it to cryptohash's implementation, but it would require quite a lot of bit-banging math to pull the used values out of the ByteString containing the md5sum.
Switched from MissingH to cryptonite for md5. It did move md5 out of the top CPU spot but the overall runtime didn't change much. Memory allocations did go down by a good amount.
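For illustration, a sketch of what the cryptonite-based md5 and the "bit-banging" to pull numeric values back out of the digest might look like; the byte order and word size chosen here are assumptions, not necessarily what the hash directory code needs.

```haskell
-- Sketch: md5 via cryptonite, then combine the digest bytes back into
-- 32-bit words (little-endian here, purely for illustration).
import Crypto.Hash (Digest, MD5, hash)
import qualified Data.ByteArray as BA
import qualified Data.ByteString.Char8 as B
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)

md5words :: String -> [Word32]
md5words s = pack32 (BA.unpack digest)
  where
    digest = hash (B.pack s) :: Digest MD5
    -- combine each group of 4 bytes into one word
    pack32 (a:b:c:d:rest) = word a b c d : pack32 rest
    pack32 _              = []
    word a b c d = fromIntegral a
               .|. (fromIntegral b `shiftL` 8)
               .|. (fromIntegral c `shiftL` 16)
               .|. (fromIntegral d `shiftL` 24)
```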
Updated profiles:
There's of course always the possibility of more speed improvements, but I'm wondering if this has already been addressed sufficiently to close it?
I do think that things have become smoother/faster since then. I guess we could consider this one closed for now, and I will keep in mind that the --from mode is faster.
Cheers,