added branches makes 'git annex unused' slow

Creating additional branches in history seems to slow down the 'git annex unused' command quadratically, even if the location of the branches should be irrelevant as far as unused data goes.

This was tested on:

$ git annex version
git-annex version: 3.20130216
local repository version: 3
default repository version: 3
supported repository versions: 3
upgrade supported from repository versions: 0 1 2

What steps will reproduce the problem?

$ mkdir a
$ cd a
$ git init
$ git annex init
$ i=0 ; while test $i -lt 1000; do dd if=/dev/urandom of=$i.img bs=1M count=1; i=$(($i+1)); done
$ git annex add .
$ git commit -m"foo"
$ git rm 1*
$ git commit -m"bar"
$ git log --oneline --decorate
ffcca3a (HEAD, master) bar
3e7793d foo
$ time -p git annex unused
unused . (checking for unused data...) (checking master...)
(...)
real 0.76
user 0.40
sys 0.06
git commit --allow-empty -m"baz"
$ git log --oneline --decorate
4390c32 (HEAD, master) baz
ffcca3a bar
3e7793d foo
$ time -p git annex unused
unused . (checking for unused data...) (checking master...)
(...)
real 0.75
user 0.38
sys 0.07
$ git branch boo HEAD^
$ time -p git annex unused
unused . (checking for unused data...) (checking boo...) (checking master...)
(...)
real 1.29
user 0.62
sys 0.08
arand@mas:~/tmp/more/a(master)$ git branch beeboo HEAD^
4390c32 (HEAD, master) baz
ffcca3a (boo, beeboo) bar
3e7793d foo
arand@mas:~/tmp/more/a(master)$ time -p git annex unused
unused . (checking for unused data...) (checking beeboo...) (checking master...)
(...)
real 2.50
user 1.12
sys 0.14
$ git branch -d boo beeboo
$ git log --oneline --decorate
4390c32 (HEAD, master) baz
ffcca3a bar
3e7793d foo
$ time -p git annex unused
unused . (checking for unused data...) (checking master...)
(...)
real 0.77
user 0.42
sys 0.04

What is the expected output? What do you see instead?

I would expect the time to be the same in all the above cases.

What version of git-annex are you using? On what operating system?

$ git annex version
git-annex version: 3.20130216

On current Debian sid/experimental

Done, thanks to guilhem. We ended up using a different algorythm which is faster yet, basically it now does a diff-index between the index and each branch for its second stage bloom filter. Speedup is 30x with 0 (or 1?) branch, and then massive for each additional branch. --Joey

RSS Atom

comment 1

git annex unused finds content that is not used by the working tree or by any branch is unused. For the working tree, it can look at the symlinks on disk, which is the fastest option.

For branches, it has to first use git ls-tree to find the files on the branch, and then use git cat-file to look up the key used by each file. It does this as fast as it can (eg, it runs a single git cat-file --batch, rather than one process per file). Still, this is pulling potentially a lot of data out of git, and it gets pretty slow.

I've spent a lot of time optimising this as much as is possible with these constraints. One nice one is that, if it finds no unused keys after checking the working tree, it can stop, rather than checking any branches. Your example avoids this optimisation.

Another optimisation is to only check each git ref once, even if multiple branches refer to it. You can see this optmisation firing in your transcript, when it only shows it's checking one branch of the two identical branches you've made.

Indeed, if you go on and add 100 identical branches, you'll find it runs in just about the same time it ran with 2 branches. (There's a little overhead in getting the list of branches and throwing out the duplicates, but that's all.)

What then explains your numbers? Well, I have no idea. I cannot replicate them; I tend to see about the same amount of time taken with two duplicate branches as with one branch. I suspect you just didn't get statistically valid results, which playing around with time at the command line often doesn't, due to caching, other active processes, etc.

Comment by joeyh.name — Tue Feb 26 20:29:05 2013

Remove comment

comment 2

Hmm, indeed, after further testing it seems like the increased time due to the duplicate branch seems to have been a random quirk, bleh

But shouldn't it theoretically be possible to optimize out much of the overhead of multiple very-similar (though not identical) branches though?

I've experimented around with ls-tree and cat-file in a bash script[1], and in this primitive implementation the running time seems to be considerably lower (~0.3s vs ~4s), with much less overhead for extra very-similar branches (~0.7s vs ~37s)

Am I missing some key element that's the reason for the time taken by git annex unused?

[1] primitive annex unused script: https://gitorious.org/arand-scripts/arand-scripts/blobs/master/annex-funused

timing script: https://gitorious.org/arand-scripts/arand-scripts/blobs/master/annex-testunused

Comment by arand — Wed Feb 27 15:53:37 2013

Remove comment

comment 3

So, if I've understood it correctly (please correct me if that's not the case )

Currently git-annex unused goes through this process

Look through all files in the index and find those which are git-annex keys (git ls-tree + git cat-file)
Look through all files the current ref and find those which are git-annex keys (git ls-tree + git cat-file)
For each ref in the repo
- Look through all files and find those which are git-annex keys (git ls-tree + git cat-file)
Then at the end
- Compare this list of keys with what is stored in .git/annex/objects
- Print out any objects which does not match a key.

If that's the case, it means if that if you have multiple refs, even is they only differ by single empty commits, git-annex will end up doing a cat-file for the same file multiple times (one per ref), which is expensive.

Would it be possible to change the algorithm for git-annex unused into instead something like:

For the index, HEAD, and all refs
- Create a list all files and remove those which are duplicates based on their sha1 hash (git ls-tree | uniq)
Then Look through this reduced list to find those which are git-annex keys (git cat-file)
Then check as before

Unless this bypasses some safety or case I've overlooked, I think it should be possible to speed up git-annex unused quite a bit.

Comment by arand — Sat Aug 10 17:00:21 2013

Remove comment

comment 4

I think that could work. It would probably tend to use more memory than the current method, but only a small constant multiplier more. And unused is already the one command that necessarily needs to hold information about the whole repository in memory.

Note that git cat-file is only needed when dealing with branches other than the current working tree. In that special case, it can, and AFAIK does have the optimisation of looking directly at the symlink target instead. Your method may turn out to be both slower and use more memory in that case. It may make sense to special case handling of that case and keep the current code paths there (most of the necessary code for it is used by other stuff anyway).

Comment by joeyh.name — Sun Aug 11 12:48:47 2013

Remove comment

comment 5

I've compared my bash/coreutils implementation mentioned above annex-funused with git annex unused in various situations, and from what I've seen annex-funused is pretty much always faster.

In the case of no unused files they seem to be about the same.

In all other cases there is a very considerable difference, for example, in my current main annex I get:

$ time git annex unused                                                                                                      
unused . (checking for unused data...) (checking master...) (checking synced/master...) (checking barracuda160G/master...) ok

real    5m13.830s
user    2m0.444s
sys     0m28.344s

whereas

$ time annex-funused                                                                                                             
 == WARNING ==
This program should NOT be trusted to reliably find unused files in the
git annex.


real    0m1.569s
user    0m2.024s
sys     0m0.184s

I tried to check memory usage via /usr/bin/time -v as well, and that showed (re-running in the same annex as above)

annex-funused

Maximum resident set size (kbytes): 13560

git annex unused

Maximum resident set size (kbytes): 29120

I've also written a comparison script annex-testunused (needs annex-funused in $PATH) which creates an annex with a bunch of unused files and compares the running time for both versions:

$ annex-testunused
Initialized empty Git repository in /tmp/tmp.fmsAvsPTcd/.git/
init  ok
(Recording state in git...)
###
* b2840d7 (HEAD, master) delete ~1100 files
* c4a1e3a add 3000 files
* bc19777 (git-annex) update
* b3e6539 update
* bec2c8f branch created
annex unused
real 0m4.154s
real 0m2.029s
real 0m2.044s
annex funused
real 0m0.923s
real 0m0.933s
real 0m0.905s
Initialized empty Git repository in /tmp/tmp.7qFoCRWzB3/.git/
init  ok
(Recording state in git...)
###
* a5ff392 (HEAD, master) empty
* cca4810 (1) delete ~1100 files
* 587c406 add 3000 files
* de0afeb (git-annex) update
* 37b7881 update
* 1735062 branch created
annex unused
real 0m3.499s
real 0m3.443s
real 0m3.435s
annex funused
real 0m0.956s
real 0m0.956s
real 0m0.874s
Initialized empty Git repository in /tmp/tmp.L5fjdAgnFv/.git/
init  ok
(Recording state in git...)
###
* 94463a0 (HEAD, master) empty
* e115619 (10) empty
* 20686d4 (9) empty
* 2e01a3f (8) empty
* 043289d (7) empty
* 6a52966 (6) empty
* 0dc866d (5) empty
* 35db331 (4) empty
* 48504bc (3) empty
* e25cac7 (2) empty
* 655d026 (1) delete ~1100 files
* 91a07d1 add 3000 files
* 3c9ac62 (git-annex) update
* c5736e0 update
* 862d5b8 branch created
annex unused
real 0m16.242s
real 0m16.277s
real 0m16.246s
annex funused
real 0m0.960s
real 0m0.960s
real 0m0.927s

So, unless I've missed something fundamental (I keep thinking I might have...), it seems to be very consistently faster, and scale ok where git annex unused scales rather poorly.

Comment by arand — Sun Aug 11 20:43:22 2013

Remove comment

comment 6

The memory usage is probably lower because sort and comm and bash's <(command) all have particularly well tuned memory usage with 37 years of history behind them. Particularly GNU sort will transparently use a temp file rather than storing too much data in memory, and does rather sophisticated stuff to make that work efficiently. It's rather harder to get that kind of behavior when not using the unix tools and instead using stock programming language primatives like sort() and hashes.

I still suspect that git cat-file is slower than a direct readlink(2) of the symlink, when that can be done.

Comment by joeyh.name — Mon Aug 12 04:14:18 2013

Remove comment

comment 7

Hmm, probably, do you think this way of getting everything and then doing the filtering could work reasonably in git-annex? Assuming "git cat-file --batch" is the main bottleneck, reducing the amount of runs for it would still likely be an improvement?

Comment by arand — Mon Aug 12 18:07:44 2013

Remove comment

Add a comment