## motivations
If the total space available for annex objects in a repository is recorded on the git-annex branch (probably by the user running a command, or perhaps automatically), then the git-annex branch can be examined to tell how much free space a remote has available: the recorded maximum minus the size of the objects the location logs say it currently contains.
One use case is just to display it in `git-annex info`. But a more compelling use case is balanced preferred content, which needs a way to tell when an object is too large to store on a repository, so that it can be redirected to be stored on another repository in the same group.
This was actually a fairly common feature request early on in git-annex and I probably should have thought about it more back then!
## implementation
`git-annex info` has recently started summing up the sizes of repositories from the location logs, and is well optimised. In my big repository, that takes 8.54 seconds of its total runtime.
Since info already knows the repo sizes, just adding a `git-annex maxsize here 200gb` type of command would let it display the free space of all repos that had a maxsize recorded, essentially for free.
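The arithmetic itself is trivial. A minimal sketch (with hypothetical names, not git-annex's actual types) of deriving free space from recorded maxsizes and the per-repo sizes already summed from the location logs:

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical stand-ins for git-annex's repository UUIDs and byte counts.
type UUID = String
type Bytes = Integer

-- | Free space of each repo that has a recorded maxsize, given the
-- per-repo sizes that git-annex info already sums from location logs.
freeSpace :: M.Map UUID Bytes -> M.Map UUID Bytes -> M.Map UUID Bytes
freeSpace maxsizes sizes =
    M.mapWithKey (\u maxsz -> maxsz - M.findWithDefault 0 u sizes) maxsizes

-- e.g. freeSpace (M.fromList [("repo1", 200 * 1000^3)])
--                (M.fromList [("repo1", 150 * 1000^3)])
--      == M.fromList [("repo1", 50 * 1000^3)]
```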
But 8 seconds is rather a long time to block a `git-annex push` type command, which would need that information if any remote's preferred content expression used the free space.
Would it be possible to update incrementally from the previous git-annex branch to the current one? That's essentially what `git-annex log --sizesof` does for each commit on the git-annex branch, so one could imagine adapting it to store its state on disk, so it can resume at a new git-annex branch commit.
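As a rough sketch of what that persisted state could look like (names are hypothetical, not git-annex's actual code), it only needs to record the git-annex branch commit the sizes were computed at, together with the per-repo totals, and fold in deltas from a diff up to a newer commit:

```haskell
import qualified Data.Map.Strict as M

type UUID = String
type Bytes = Integer
type Sha = String

-- | Persisted state: the git-annex branch commit the sizes were
-- computed at, and the total annexed size of each repository then.
data SizeState = SizeState
    { sizedCommit :: Sha
    , repoSizes   :: M.Map UUID Bytes
    }
    deriving (Show, Read)

-- | Fold in per-repo size deltas (computed by diffing from sizedCommit
-- to newcommit) and move the state forward to the new commit, so the
-- next run can resume from there.
advance :: SizeState -> Sha -> M.Map UUID Bytes -> SizeState
advance st newcommit deltas = SizeState
    { sizedCommit = newcommit
    , repoSizes   = M.unionWith (+) (repoSizes st) deltas
    }
```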
Perhaps a less expensive implementation than `git-annex log --sizesof` is possible, to get only the current sizes, if the past sizes are known at a particular git-annex branch commit. We don't care about sizes at intermediate points in time, which that command does calculate.

See `info --size-history` for the subtleties that had to be handled. In particular, diffing from the previous git-annex branch commit to the current one may yield lines that seem to indicate content was added to a repo, when in fact that repo already had the content at the previous commit and a redundant log line was recorded elsewhere. So it needs to look at the location log's value at the previous commit in order to determine whether a change to a log should be counted.
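A simplified sketch of that check (hypothetical code; real location log lines carry more detail than a timestamp, a presence flag, and a UUID): only count a repo as gaining or losing a key when its presence actually differs between the old and new versions of the log, which automatically ignores redundant log lines.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

type UUID = String
type Bytes = Integer

-- | Which repositories a location log says have the key, assuming the
-- simplified line format "timestamp 1|0 uuid".
presentIn :: String -> S.Set UUID
presentIn logcontent = S.fromList
    [ u | [_ts, "1", u] <- map words (lines logcontent) ]

-- | Size deltas caused by one location log changing between two
-- git-annex branch commits. A repo only counts as gaining the key if
-- the old version did not already list it as present, and only counts
-- as losing it if the old version did.
sizeDeltas :: Bytes -> String -> String -> M.Map UUID Bytes
sizeDeltas keysize oldlog newlog = M.fromList $
    [ (u, keysize)        | u <- S.toList (new `S.difference` old) ] ++
    [ (u, negate keysize) | u <- S.toList (old `S.difference` new) ]
  where
    old = presentIn oldlog
    new = presentIn newlog
```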
Worst case, that's queries of the location log file for every single key. If queried from git, that would be slow -- slower than `git-annex info`'s streaming approach. If they were all cached in a sqlite database, it might manage to be faster?
## incremental update via git diff
Could `git diff -U1000000` be used and the patch parsed to get the complete old and new location log? (Assuming no log file ever reaches a million lines.) I tried this in my big repo, and even diffing from the first git-annex branch commit to the last took 7.54 seconds.

Compare that with the method used by `git-annex info`'s size gathering, of dumping out the content of all files on the branch with `git ls-tree -r git-annex | awk '{print $3}' | git cat-file --batch --buffer`, which only takes 3 seconds. So, this is not ideal when diffing to too old a point.
Diffing in my big repo to the git-annex branch from 2020 takes 4 seconds.
... from 3 months ago takes 2 seconds.
... from 1 week ago takes 1 second.
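For reference, a minimal sketch of that patch parsing (not git-annex's actual code, and splitting the diff output into per-file hunks is omitted): with such a large context, each changed log file arrives as a single hunk, so the old version is the context plus '-' lines and the new version is the context plus '+' lines.

```haskell
import System.Process (readProcess)

-- | Run git diff with a huge context window between two git-annex
-- branch commits, so each changed file's whole content appears in a
-- single hunk.
gitDiff :: String -> String -> IO String
gitDiff oldcommit newcommit =
    readProcess "git" ["diff", "-U1000000", oldcommit, newcommit] ""

-- | Reconstruct the old and new versions of one file from the body of
-- such a hunk (the lines after its @@ header): old is context plus
-- '-' lines, new is context plus '+' lines.
fromHunk :: [String] -> (String, String)
fromHunk ls = (unlines old, unlines new)
  where
    old = [ drop 1 l | l <- ls, take 1 l `elem` [" ", "-"] ]
    new = [ drop 1 l | l <- ls, take 1 l `elem` [" ", "+"] ]
```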
## incremental update when merging git-annex branch
When merging git-annex branch changes into .git/annex/index, git-annex already diffs between the branch and the index and uses `git cat-file` to get both versions of each changed file in order to union merge them. That's essentially the same information needed to do the incremental update of the repo sizes. So the sizes could be updated at the same time as the git-annex branch is merged. That would be essentially free!
Note that the use of `git cat-file` in the union merge is not `--buffer` streaming, so it is slower than the patch parsing method discussed in the previous section. So it might be possible to speed up git-annex branch merging by using patch parsing.
Note that Database.ContentIdentifier and Database.ImportFeed also update by diffing from the old to the new git-annex branch (with `git cat-file` to read log files), so they could also be sped up by being done at git-annex branch merge time. Those are less expensive than diffing the location logs only because the logs they diff are less often used, and because the work is only done when relevant commands are run.

(Opened "optimise git-annex branch merge and database updates" about that possibility.)
## concurrency
Suppose a repository is almost full. Two concurrent threads or processes are considering sending two different keys to the repository. It can hold either key, but not both. So the size tracking seems to need to be provisionally updated for one key before it is sent, and then the check for the other key will show not enough space and it won't be sent. If the first key fails to get sent, the size needs to be reset back.
Note that `checkDiskSpace` deals with this by looking at `sizeOfDownloadsInProgress`. It would be possible to make a similar `sizeOfUploadsInProgressToRemote r`.
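A minimal sketch of that provisional accounting, along the lines of the `sizeOfDownloadsInProgress` approach (the names and the STM representation here are hypothetical, not git-annex's actual code): reserve the key's size against the remote before starting the transfer, and drop the reservation once the transfer succeeds or fails.

```haskell
import Control.Concurrent.STM
import Control.Exception (finally)
import qualified Data.Map.Strict as M

type UUID = String
type Bytes = Integer

-- | Sizes of uploads currently in progress, per remote.
type UploadsInProgress = TVar (M.Map UUID Bytes)

-- | Reserve space on a remote for a key if it fits, given the remote's
-- free space computed from maxsize and the location logs. Returns False
-- (reserving nothing) when the key would not fit once in-progress
-- uploads to that remote are counted.
reserve :: UploadsInProgress -> UUID -> Bytes -> Bytes -> STM Bool
reserve v remote freespace keysize = do
    m <- readTVar v
    let inprogress = M.findWithDefault 0 remote m
    if keysize + inprogress <= freespace
        then do
            writeTVar v (M.insertWith (+) remote keysize m)
            return True
        else return False

-- | Drop the reservation again, either because the upload finished
-- (the size tracking now counts the key) or because it failed.
release :: UploadsInProgress -> UUID -> Bytes -> STM ()
release v remote keysize =
    modifyTVar' v (M.adjust (subtract keysize) remote)

-- | Run an upload action with the reservation held for its duration.
uploadIfRoom :: UploadsInProgress -> UUID -> Bytes -> Bytes -> IO Bool -> IO Bool
uploadIfRoom v remote freespace keysize upload = do
    ok <- atomically (reserve v remote freespace keysize)
    if ok
        then upload `finally` atomically (release v remote keysize)
        else return False
```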