Does git annex find (& friends) batch queries to the location log?

Hi,

Some time ago I asked here about possible improvements to git annex copy --fast --to, since it was painfully slow on moderately large repos.

I have now found a way to make it much faster for my particular use case by accessing some annex internals. In the process I realized that commands like git annex find --in=repo may not batch their queries to the location log. This is based on the following timings, taken on a new repo (only a few commits) with about 30k files.

> time git annex find --in=skynet > /dev/null

real    0m55.838s
user    0m30.000s
sys     0m1.583s


> time git ls-tree -r git-annex | cut -d ' ' -f 3 | cut -f 1 | git cat-file --batch > /dev/null

real    0m0.334s
user    0m0.517s
sys     0m0.030s

Those numbers are from Linux (with an already warm file cache) and an ext4 filesystem on an SSD.

The second command above feeds a list of objects to a single git cat-file process, which writes them all to stdout, preceding each object's contents with a header line naming the object. It is a trivial matter to parse this output and use it for whatever annex needs.
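To illustrate what "parsing this output" could look like, here is a minimal Python sketch. It assumes the documented git cat-file --batch output format (a header line of the form "<object> <type> <size>", then the contents, then a newline, or "<object> missing" for unknown objects). The function name parse_cat_file_batch is made up for this example; it is not how git-annex does it.

```python
def parse_cat_file_batch(data: bytes):
    """Yield (object_name, object_type, content) tuples from the raw
    stdout of `git cat-file --batch`.

    Objects that git reports as missing are yielded as (name, None, None).
    """
    pos = 0
    while pos < len(data):
        # Header line: "<object> SP <type> SP <size>" or "<object> missing"
        eol = data.index(b"\n", pos)
        header = data[pos:eol].decode()
        pos = eol + 1
        if header.endswith(" missing"):
            yield header[: -len(" missing")], None, None
            continue
        name, obj_type, size = header.split(" ")
        size = int(size)
        # The contents are exactly <size> bytes, followed by one newline.
        yield name, obj_type, data[pos:pos + size]
        pos += size + 1
```

The input could be captured, for instance, with subprocess.run(["git", "cat-file", "--batch"], input=object_names, capture_output=True).stdout, where object_names is a newline-separated list of object names as bytes.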

Above I ran git ls-tree on the git-annex branch for simplicity, but we could just as well run an ls-tree | ... | git cat-file pipeline on HEAD to get the keys for the annexed files matching some path, and then feed those keys to a cat-file on the git-annex branch. This would still be an order of magnitude faster than what annex currently seems to do.

I'm assuming the bottleneck is that annex does not batch the cat-file calls, since the rest of the logic needed for a find should be fast. Is that right?

Now, if the queries to the location log for copy --to and find could be batched this way, the performance of several useful operations, like checking how many annexed files we are missing, would be vastly improved. Hell, I could even put that number on the command line prompt!
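As a rough sketch of the "how many annexed files are we missing" count, the Python below works on location log contents that have already been fetched in one batch. It assumes each log line has the form "<timestamp>s <status> <uuid>", with status 1 meaning present, and that the newest line per repository wins. That format is an assumption on my part and should be checked against the git-annex internals documentation. The function names and the UUIDs in the usage note are hypothetical.

```python
def present_uuids(log_text: str) -> set:
    """Return the set of repository UUIDs that a location log marks as
    currently holding the content.

    ASSUMPTION: each line is "<timestamp>s <status> <uuid>", status "1"
    means present, and the line with the newest timestamp wins per UUID.
    """
    latest = {}
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or unrecognized lines
        stamp, status, uuid = fields[0], fields[1], fields[2]
        when = float(stamp.rstrip("s"))
        if uuid not in latest or when >= latest[uuid][0]:
            latest[uuid] = (when, status)
    return {uuid for uuid, (_, status) in latest.items() if status == "1"}


def count_missing(logs, uuid: str) -> int:
    """Count the location logs (one per key) that do not list `uuid`
    as present."""
    return sum(1 for text in logs if uuid not in present_uuids(text))
```

For example, count_missing(["1.0s 1 aaa\n2.0s 0 aaa\n", "3.0s 1 aaa\n"], "aaa") counts one missing key, because the first log's newest entry for aaa says the content was dropped.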

I'm not yet very fluent in Haskell, but I'm willing to help if this is something that makes sense and can be done.

comment 1

git-annex uses cat-file --batch, yes. You can verify this with --debug, or you can read Annex/CatFile.hs and Git/CatFile.hs.

git-annex has to ensure that the git-annex branch is up-to-date and that any info synced into the repository is merged into it. This can require several calls to git log. Your command does not do that. git-annex find also runs git ls-files --cached, which has to examine the state of the index and of files on disk, in order to only show files that are in the working tree. Your command also omits that.

Comment by joeyh.name — Wed Nov 6 22:21:33 2013

comment 2

Ok, then I don't understand where annex spends its time. git annex find takes 55 seconds, versus less than a second for a batched query on all the keys in the location log. Checking that branches are in sync and traversing the working dir shouldn't account for the extra 54 seconds! At least not on a recently synced repo with an up-to-date index and a clean working dir.

git-annex has to ensure that the git-annex branch is up-to-date and that any info synced into the repository is merged into it. This can require several calls to git log

Ok, I understand that. That check should typically be fast though, shouldn't it? On a repo that has just been synced, it doesn't need to go very far back in the log.

git-annex find also runs git ls-files --cached, which has to examine the state of the index and of files on disk, in order to only show files that are in the working tree

I understand that too. For my particular use case, I know I run git annex copy when the repo is in sync and the working dir has no uncommitted changes. So I use HEAD to retrieve the keys for the files in the working tree. I do something like this:

> time git ls-tree -r HEAD | grep -e '^120000' | cut -d ' ' -f 3 | cut -f 1 | git cat-file --batch > /dev/null

real    0m0.178s
user    0m0.277s
sys     0m0.037s

That plus some fast parsing of the output gets the list of keys for the files in HEAD in less than a second. Where do the 54 extra seconds hide, then?
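A minimal Python sketch of that parsing step follows. It assumes the default git ls-tree -r output format ("<mode> <type> <object>", a tab, then the path) and that an annexed file's symlink target ends in .../annex/objects/.../KEY/KEY, so the key is the last path component. The helper names are made up for illustration.

```python
import posixpath


def annexed_symlinks(ls_tree_output: str):
    """Return (blob_sha, path) for every symlink entry (mode 120000)
    in the output of `git ls-tree -r <tree>`."""
    entries = []
    for line in ls_tree_output.splitlines():
        # Each line is "<mode> <type> <object>\t<path>".
        meta, _, path = line.partition("\t")
        fields = meta.split(" ")
        if len(fields) == 3 and fields[0] == "120000":
            entries.append((fields[2], path))
    return entries


def key_from_target(target: str):
    """Return the annex key for a symlink target, or None when the
    target does not point into annex/objects.

    ASSUMPTION: the key is the final path component of the target.
    """
    if "annex/objects/" not in target:
        return None
    return posixpath.basename(target)
```

The blob names returned by annexed_symlinks are what gets piped to git cat-file --batch; each blob's content is the symlink target, which key_from_target then reduces to a key. Note that without -z, git ls-tree quotes paths containing unusual characters, which this sketch does not undo.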

Mm... how does annex retrieve the keys for files in the working tree? Does it follow the actual symlinks on the filesystem? I can believe that following 30k symlinks may be slow (although not 55 seconds slow).

Sorry for being so insistent on this... It is just that I do think the same can be done much faster, and such an improvement in performance would be very interesting, not only for me.

Comment by Abdó — Wed Nov 6 23:14:03 2013

comment 3

It's hard to say until some profiling has actually been done. Comparing apples and oranges, or perhaps better to say, orange blossoms and oranges, is not very useful.

Comment by joeyh.name — Thu Nov 7 18:12:09 2013

comment 4

Ok, thanks.

I understand that annex is more sophisticated than what I'm proposing. I'm also not comparing apples to oranges. What I'm saying is this: all the logic, checking and git calls that I thought annex needs to do take much, much less time than those 55 seconds. So either I'm missing something big (about 98% of the execution time), or annex is doing something very inefficient. That's all I'm saying, and I was hoping that you could clarify that, since you already know the code.

Also, I'm sorry for being so annoying. If annex were written in C, I would either have understood those 55 seconds or sent you a patch making annex faster. Its being in Haskell means that it will take me considerably more time and effort. But if I manage to improve things, and if it is ok with you, I'll send you patches... eventually.

Comment by Abdó — Thu Nov 7 18:40:27 2013
