Creating additional branches in history seems to slow down the 'git annex unused' command quadratically, even though the location of the branches should be irrelevant as far as unused data goes.
This was tested on:
$ git annex version
git-annex version: 3.20130216
local repository version: 3
default repository version: 3
supported repository versions: 3
upgrade supported from repository versions: 0 1 2
What steps will reproduce the problem?
$ mkdir a
$ cd a
$ git init
$ git annex init
$ i=0 ; while test $i -lt 1000; do dd if=/dev/urandom of=$i.img bs=1M count=1; i=$(($i+1)); done
$ git annex add .
$ git commit -m"foo"
$ git rm 1*
$ git commit -m"bar"
$ git log --oneline --decorate
ffcca3a (HEAD, master) bar
3e7793d foo
$ time -p git annex unused
unused . (checking for unused data...) (checking master...)
(...)
real 0.76
user 0.40
sys 0.06
$ git commit --allow-empty -m"baz"
$ git log --oneline --decorate
4390c32 (HEAD, master) baz
ffcca3a bar
3e7793d foo
$ time -p git annex unused
unused . (checking for unused data...) (checking master...)
(...)
real 0.75
user 0.38
sys 0.07
$ git branch boo HEAD^
$ time -p git annex unused
unused . (checking for unused data...) (checking boo...) (checking master...)
(...)
real 1.29
user 0.62
sys 0.08
$ git branch beeboo HEAD^
$ git log --oneline --decorate
4390c32 (HEAD, master) baz
ffcca3a (boo, beeboo) bar
3e7793d foo
$ time -p git annex unused
unused . (checking for unused data...) (checking beeboo...) (checking master...)
(...)
real 2.50
user 1.12
sys 0.14
$ git branch -d boo beeboo
$ git log --oneline --decorate
4390c32 (HEAD, master) baz
ffcca3a bar
3e7793d foo
$ time -p git annex unused
unused . (checking for unused data...) (checking master...)
(...)
real 0.77
user 0.42
sys 0.04
What is the expected output? What do you see instead?
I would expect the time to be the same in all the above cases.
What version of git-annex are you using? On what operating system?
$ git annex version
git-annex version: 3.20130216
On current Debian sid/experimental
Done, thanks to guilhem. We ended up using a different algorithm which is faster yet: basically it now does a diff-index between the index and each branch for its second-stage bloom filter. Speedup is 30x with 0 (or 1?) branches, and then massive for each additional branch. --Joey
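Roughly, that idea can be sketched in plain git commands (a hedged approximation, not the actual Haskell implementation; the branch name below is made up):

# Sketch only: git diff-index reports just the paths that differ between
# the index and a branch, so only those need a fresh key lookup.
branch=somebranch                       # assumed branch name, for illustration
git diff-index --cached --name-only "$branch" |
  sed "s|^|$branch:|" |                 # rewrite each path as "<branch>:<path>"
  git cat-file --batch                  # single batch lookup; paths absent on the branch print as "missing"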
git annex unused finds content that is not used by the working tree or by any branch. For the working tree, it can look at the symlinks on disk, which is the fastest option. For branches, it has to first use git ls-tree to find the files on the branch, and then use git cat-file to look up the key used by each file. It does this as fast as it can (eg, it runs a single git cat-file --batch, rather than one process per file). Still, this is pulling potentially a lot of data out of git, and it gets pretty slow.
I've spent a lot of time optimising this as much as is possible with these constraints. One nice optimisation is that, if it finds no unused keys after checking the working tree, it can stop, rather than checking any branches. Your example avoids this optimisation.
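In shell terms, the per-branch work described above amounts to roughly the following (an approximation of the behaviour, not the code git-annex actually runs):

# For each branch: list every file it contains, then resolve all of them
# through a single cat-file --batch; for annexed files the blob content is
# the symlink target, which ends in the key.
for branch in $(git for-each-ref --format='%(refname:short)' refs/heads); do
    git ls-tree -r --name-only "$branch" |
      sed "s|^|$branch:|" |
      git cat-file --batch
done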
Another optimisation is to only check each git ref once, even if multiple branches refer to it. You can see this optimisation firing in your transcript, when it only shows it's checking one of the two identical branches you've made.
Indeed, if you go on and add 100 identical branches, you'll find it runs in just about the same time it ran with 2 branches. (There's a little overhead in getting the list of branches and throwing out the duplicates, but that's all.)
What then explains your numbers? Well, I have no idea. I cannot replicate them; I tend to see about the same amount of time taken with two duplicate branches as with one branch. I suspect you just didn't get statistically valid results, which playing around with time at the command line often doesn't give, due to caching, other active processes, etc.
Hmm, indeed, after further testing it seems like the increased time due to the duplicate branch was a random quirk, bleh.
But shouldn't it theoretically be possible to optimize out much of the overhead of multiple very-similar (though not identical) branches?
I've experimented with ls-tree and cat-file in a bash script[1], and in this primitive implementation the running time seems to be considerably lower (~0.3s vs ~4s), with much less overhead for extra very-similar branches (~0.7s vs ~37s).
Am I missing some key element that's the reason for the time taken by git annex unused?
[1] primitive annex unused script: https://gitorious.org/arand-scripts/arand-scripts/blobs/master/annex-funused
timing script: https://gitorious.org/arand-scripts/arand-scripts/blobs/master/annex-testunused
So, if I've understood it correctly (please correct me if that's not the case)
Currently git-annex unused goes through roughly this process:
1. Check the working tree by looking at the annex symlinks on disk.
2. For each branch, run git ls-tree to list its files and then git cat-file to look up the key of every one of those files.
3. Compare the keys found against the content actually present in the annex.
If that's the case, it means that if you have multiple refs, even if they only differ by single empty commits, git-annex will end up doing a cat-file for the same file multiple times (once per ref), which is expensive.
Would it be possible to change the algorithm for git-annex unused into instead something like:
1. Gather the file lists of the working tree and of every branch, and deduplicate them first.
2. Do the git cat-file key lookup only once per unique file, in a single batch.
3. Compare that one combined set of keys against the present content to find what is unused.
Unless this bypasses some safety or case I've overlooked, I think it should be possible to speed up git-annex unused quite a bit.
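A minimal sketch of that alternative, assuming the usual symlink layout under .git/annex/objects (the linked annex-funused script is the authoritative version; everything below is just an approximation):

# Collect every key referenced by any branch, resolving each unique blob once.
referenced=$(
    git for-each-ref --format='%(refname)' refs/heads |
    while read -r ref; do
        git ls-tree -r "$ref" | awk '{print $3}'    # blob sha of every file on the ref
    done | sort -u |                                # identical files across refs collapse here
    git cat-file --batch |                          # symlink targets look like .../annex/objects/xx/yy/KEY/KEY
    grep -o 'annex/objects/.*' | awk -F/ '{print $NF}' | sort -u
)
# Collect every key actually present in the object store.
present=$(find .git/annex/objects -type f -printf '%f\n' | sort -u)
# Keys that are present but never referenced are unused.
comm -13 <(echo "$referenced") <(echo "$present")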
I think that could work. It would probably tend to use more memory than the current method, but only a small constant multiplier more. And unused is already the one command that necessarily needs to hold information about the whole repository in memory.
Note that git cat-file is only needed when dealing with branches other than the current working tree. In that special case, it can, and AFAIK does, have the optimisation of looking directly at the symlink target instead. Your method may turn out to be both slower and use more memory in that case. It may make sense to special-case it and keep the current code paths there (most of the necessary code for it is used by other stuff anyway).
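For the working tree that direct route could look something like this sketch (assuming an indirect repository where annexed files are symlinks into .git/annex/objects):

# Read the keys used by the working tree straight from the symlinks,
# with no git cat-file involved at all.
find . -path ./.git -prune -o -type l -print |
  xargs -r readlink |
  grep 'annex/objects' |
  awk -F/ '{print $NF}' | sort -u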
I've compared my bash/coreutils implementation mentioned above, annex-funused, with git annex unused in various situations, and from what I've seen annex-funused is pretty much always faster. In the case of no unused files they seem to be about the same.
In all other cases there is a very considerable difference; for example, in my current main annex I get:
whereas
I tried to check memory usage via /usr/bin/time -v as well, and that showed (re-running in the same annex as above):
annex-funused
git annex unused
I've also written a comparison script annex-testunused (needs annex-funused in $PATH) which creates an annex with a bunch of unused files and compares the running time for both versions:
So, unless I've missed something fundamental (I keep thinking I might have...), it seems to be very consistently faster, and to scale ok where git annex unused scales rather poorly.
The memory usage is probably lower because sort and comm and bash's <(command) all have particularly well tuned memory usage with 37 years of history behind them. In particular, GNU sort will transparently use a temp file rather than storing too much data in memory, and does rather sophisticated stuff to make that work efficiently. It's rather harder to get that kind of behavior when not using the unix tools and instead using stock programming language primitives like sort() and hashes.
I still suspect that git cat-file is slower than a direct readlink(2) of the symlink, when that can be done.
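For what it's worth, that GNU sort behaviour can also be steered explicitly; the flags below are real, the file names are just examples:

# -S caps the in-memory buffer, -T picks where the spill files go;
# beyond -S, sort merges temporary files instead of growing in RAM.
sort -u -S 64M -T /var/tmp referenced-keys.txt > referenced-keys.sorted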