Last Thursday I implemented git-annex filter-process, which you can try enabling to make commands like git add and git checkout faster when they operate on a lot of files.
git config filter.annex.process 'git-annex filter-process'
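To check whether the setting is active, or to go back to the old smudge/clean interface, plain git config is enough. A minimal sketch; the function name annex_filter_process_status is made up for illustration and is not part of git-annex:

```shell
#!/bin/sh
# Hypothetical helper: report whether filter.annex.process is configured.
# "git config --get" exits non-zero when the key is not set.
annex_filter_process_status() {
    if value=$(git config --get filter.annex.process); then
        echo "enabled: $value"
    else
        echo "not set"
    fi
}

# To return to the old smudge/clean interface, remove the setting again:
#   git config --unset filter.annex.process
```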
On Friday, I benchmarked it
and was not surprised to find that it's slower in some cases than the
old smudge/clean filter interface, and faster in other cases. Still, good
to see actual numbers (see 054c803f8d7cc43eb01fdf6141ab6572373c7d60).
The surprising good news is that it only seems to make git add around 10% slower when adding a large file (to the annex presumably), although I know I can speed that up eventually.
Today, I used the benchmark results to build a cost model into git-annex, so it knows when it would be faster to have filter.annex.process set or unset, and temporarily unsets it when that seems best. It can only do that when it's restaging pointer files, but that was the main problem with setting filter.annex.process really.
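To illustrate the idea of such a cost model, here is a rough sketch. It is not the code in git-annex: the function name and both constants are invented for the example. It assumes the old per-file interface mostly pays a process startup cost for each file, while the long-running filter starts once but has all file content piped through it.

```shell
#!/bin/sh
# Hypothetical cost model. All constants are illustrative assumptions,
# not values measured by or taken from git-annex.

# Assumed cost of starting one smudge/clean filter process, in milliseconds.
PER_FILE_STARTUP_MS=50
# Assumed rate at which content is piped through filter-process,
# in bytes per millisecond.
PIPE_BYTES_PER_MS=100000

# filter_process_faster NFILES TOTAL_BYTES
# Exits 0 when the long-running filter is estimated to be cheaper.
filter_process_faster() {
    nfiles=$1
    total_bytes=$2
    # Old interface: one process start per file.
    smudge_cost=$((nfiles * PER_FILE_STARTUP_MS))
    # Long-running filter: one start, plus piping all the content.
    process_cost=$((PER_FILE_STARTUP_MS + total_bytes / PIPE_BYTES_PER_MS))
    [ "$process_cost" -lt "$smudge_cost" ]
}

# Many small files: the estimate favours filter-process.
if filter_process_faster 1000 1000000; then
    echo "keep filter.annex.process set"
fi

# A few huge files: the estimate favours temporarily unsetting it.
if ! filter_process_faster 13 870000000000; then
    echo "temporarily unset filter.annex.process"
fi
```

With these made-up constants, the first case prints "keep filter.annex.process set" and the second prints "temporarily unset filter.annex.process". Real thresholds would have to come from benchmarks.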
So I'm fairly close to wanting to enable it by default. But I will probably just wait until v9 happens and do it then. Hopefully some people will try it out in the meantime and perhaps I can refine the cost model.
This work was sponsored by Jake Vosloo, Graham Spencer, and Dr. Land Raider on Patreon.
My first impression of commit a0758bdd1002e798f62353efa725ac2972589b96 with the cost model is quite positive. I'm the one with multigigabyte annexed files in an otherwise rather small (by number of files) repo, so I'm affected by the limitations of the filter-process method, which pipes all the content of annexed files from git to git-annex. Compared to commit 837025b14f523f9180f82d0cced1e53a8a9b94de, which frankly was unusable for me in this particular repo with filter.annex.process set, the new version behaves rather nicely: a simple test of time git checkout git-annex followed by time git checkout 'adjusted/master(hidemissing-unlocked)' turns out to be faster than with an unoptimised version (=8.20211028) without the long-running filter-process functionality. Obviously, it's only the first stage, i.e. checking out the git-annex branch, that becomes faster by over 50 percent, but I'll take any improvement in my daily git-annex operations.

The timings I got are as follows.
git checkout git-annex
  without filter-process: 103s
  filter-process enabled: 36s
  filter-process enabled: 37s

git checkout 'adjusted/master(hidemissing-unlocked)'
  without filter-process: 49s
  filter-process enabled: 57 minutes (I had dropped a few files, in reality this would've taken even longer)
  filter-process enabled: 43s

This repo is on Windows (with annex.thin set) and locally has only 13 annexed files on this very drive, but the files cover some 870 gigabytes worth of system backup images, so individual files are definitely on the larger side for git-annex.
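A comparison like the one above can be repeated with a small helper. This is only a sketch: elapsed_seconds is a made-up name, and it assumes a date command that supports +%s (GNU, BSD and busybox all do).

```shell
#!/bin/sh
# elapsed_seconds CMD [ARGS...]
# Runs the command, discards its output, and prints the wall-clock
# time it took in whole seconds.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example usage with the two checkouts timed above:
#   elapsed_seconds git checkout git-annex
#   elapsed_seconds git checkout 'adjusted/master(hidemissing-unlocked)'
```

Running both checkouts once with filter.annex.process set and once with it unset gives the same kind of table as above.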
Thanks @jkniiv, that's good to hear. Those are exactly the results I would have hoped for in a repository like yours. Speeding up the checkout of the annexed files in your case will probably need improvements to git. But the crucial thing is it's not gotten worse, and the other checkout improved a lot.
I would be curious how you find git add's performance now, if you ever use that to add large annexed files.