The --hide-missing and --unlock-present adjusted branches depend on whether the content of a file is present. So after a get or drop, the branch needs to be updated to reflect the change.
To avoid the user needing to do it manually, the annex.adjustedbranchrefresh config automates it now. That is not yet enabled by default, because it can be slow.
Can it be done efficiently enough to become default? And how frequently can it update without being too slow? After every file or after some number of files, or after processing all files in a (sub-)tree, or only once at the end of a command?
The config was designed as a numeric scale, with the hope that it might later become a simple boolean if it gets fast enough. Since, in git config, 0 is false and any value greater than 0 is true, the old numeric values would still work after such a change.
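For example, enabling it in its current form looks like this (1 being the lowest value that turns it on; what the larger values on the scale should mean is part of the open question above):

    git config annex.adjustedbranchrefresh 1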
Investigation of the obvious things that make it slow follows:
efficient branch adjustment
updateAdjustedBranch re-adjusts the whole branch. That is inefficient for a branch with a large number of files in it. We need a way to incrementally adjust only part of the branch.
(Or maybe that's not needed, because it may be acceptable for this to be slow; slow is still better than manual, probably.)
Git.Tree.adjustTree recursively lists the tree and applies the adjustment to each item in it. What's needed is a way to adjust only the subtree containing a file, and then use Git.Tree.graftTree to graft that into the existing tree. Seems quite doable!
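A rough sketch of that graft in terms of git plumbing (the real code would go through Git.Tree; the paths, the use of awk, and the single-file example are purely illustrative), assuming foo/bar/1 was just dropped from a --hide-missing adjusted branch, so its tree entry simply disappears:

    # new foo/bar tree, with the entry for the dropped file filtered out
    newbar=$(git ls-tree HEAD:foo/bar | awk -F'\t' '$2 != "1"' | git mktree)

    # new foo tree, identical except that bar now points at $newbar;
    # every other subtree of foo is reused unchanged, by sha
    newfoo=$(git ls-tree HEAD:foo | awk -v new="$newbar" '
        BEGIN { FS = OFS = "\t" }
        { split($1, a, " "); if ($2 == "bar") $1 = a[1] " " a[2] " " new; print }
    ' | git mktree)

    # the root tree gets the same treatment, swapping in $newfoo for foo,
    # and only then is a new commit made with git commit-tree

Only the trees on the path from the root down to the changed file get rewritten; everything else in the branch is untouched.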
graftTree does use getTree, so it buffers the whole tree object in memory; adjustTree avoids doing that. I think it's probably ok though; git probably also buffers tree objects in memory, and only the tree objects down to the graft point need to be buffered.
Oh, also, it could cache a few tree objects to speed this up more. Eg, after dropping foo/bar/1, keep foo and foo/bar's tree objects buffered, and reuse them when dropping foo/bar/2.
efficient checkout
updateAdjustedBranch checks out the new version of the branch.
git checkout needs to diff the old and new tree, and while git is fast, it's not magic. Consider if, after every file dropped, we checked out a new branch that did not contain that file. Dropping can be quite fast, thousands of files per second. How much would those git checkouts slow it down?
I benchmarked to find out. (On an SSD)
    for x in $(seq 1 10000); do echo $x > $x; git add $x; git commit -m add; done
    time while git checkout --quiet HEAD^; do : ; done

    real    5m59.489s
    user    4m34.775s
    sys     3m39.665s
Seems like git-annex could do the equivalent of the checkout more quickly, since it knows which part of the branch changed. Eg, git rm the file, and update the sha1 that HEAD points to.
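For example, something roughly equivalent could be done with plumbing after a single file is dropped (the path and branch name are illustrative, and this ignores locking and error handling):

    # take the now-hidden file out of the index and working tree
    git update-index --force-remove -- foo/bar/1
    rm -f -- foo/bar/1

    # commit the updated index and move the adjusted branch to it,
    # skipping the full old-tree/new-tree diff that git checkout does
    tree=$(git write-tree)
    commit=$(git commit-tree -p HEAD -m adjust "$tree")
    git update-ref refs/heads/adjusted/master "$commit"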
dangling git objects
Incremental adjusting would generate new trees each time, and a new commit. If it's done a lot, those will pile up. They get deleted by gc, but normally would only be deleted after some time. Perhaps git-annex could delete them itself. (At least if they remain loose git objects; deleting them once they reach a pack file seems hard.)
(The tree object buffer mentioned above suggests the approach of, when an object in the buffer gets replaced with another object for the same tree position, delete the old object from .git/objects.)
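A sketch of what that cleanup could look like, assuming the superseded tree is still a loose object and nothing else references it (the sha below is a placeholder):

    # sha1 of the tree object that was just replaced (placeholder value)
    old=0123456789abcdef0123456789abcdef01234567
    objdir=$(git rev-parse --git-path objects)
    # loose objects live at objects/<first 2 hex chars>/<remaining chars>
    rm -f "$objdir/${old:0:2}/${old:2}"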
A git repo containing 10000 commits, starting with 10000 files and ending with 0, deleting 1 file per commit, weighs in at 99mb. That's with short filenames, more typical filenames would add to it. (It packs down to 9 mb, but that's just extra work for objects that will get gc'ed away eventually.)
slowness in writing git objects
git-annex uses git commit-tree to create new commits on the adjusted branch. That is not batched, so running it once per file may get slow.
And to write trees, it uses git mktree --batch. But a new mktree process is started each time by Git.Tree.adjustTree (and other things). Making that a single long-running process would probably speed it up.
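For illustration, a single mktree --batch process can write any number of trees; a blank line ends one tree and the sha of each tree is printed as it is written. A toy example, using two single-file trees that reference the empty blob:

    # write the empty blob so the toy trees reference a real object
    blob=$(git hash-object -w --stdin </dev/null)
    # two trees fed to one process; the blank line separates them,
    # and mktree prints one tree sha per tree
    printf '100644 blob %s\ta\n\n100644 blob %s\tb\n' "$blob" "$blob" \
        | git mktree --batch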
queue flush slowness
It uses Annex.Queue.flush to avoid a problem when dropping files: dropping depopulates the pointer file, which git sees as modified until it is restaged, and that prevents checking out the version of the branch where the pointer file is a symlink. There must be a less expensive way to handle that.
(All the extra "(recording state in git...)" messages when dropping are due to it doing that too.)
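One possibility (just a sketch, not what git-annex currently does): restage only the pointer file that the drop changed, rather than flushing the whole queue:

    # re-stage the single pointer file that dropping just rewrote, so git
    # no longer sees it as modified and the branch checkout is not blocked
    # (path is illustrative)
    git update-index -- foo/bar/1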