Support eg git-annex info --size-history=30d which would display the combined size of all repositories every 30 days throughout the history of the git-annex branch. This would allow graphing, analysis, etc of repo growth patterns.

Also, git-annex info somerepo --size-history=30d would display the size of only the selected repository.
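For example, the output might look something like this (a purely hypothetical transcript; the actual format, dates, and sizes are made up, not a design decision):

    $ git-annex info --size-history=30d
    2024-01-03 519.3 GB
    2024-02-02 524.1 GB
    2024-03-03 530.8 GB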

Maybe also a way to get the size of each repository plus total size in a single line of output?


Implementation of this is fairly subtle. My abandoned first try just went through git log and updated counters as the location logs were updated. That resulted in bad numbers. (The size eventually went negative, in fact!) The problem is that the git-annex branch is often updated both locally and on a remote, eg when copying a file to a remote. That results in 2 changes to the git-annex branch that both record the same data, so my naive implementation counted it twice.
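For example (made-up timestamps and UUID), copying a file to a remote typically results in both the local repository and the remote committing essentially the same location log line to their own copies of the git-annex branch, and both commits end up in the merged history:

    # local's commit, recording that the remote now has the key
    1719000000.000001s 1 26339d22-446b-11e0-9101-002170d25c55
    # the remote's own commit, recording the same fact
    1719000000.000002s 1 26339d22-446b-11e0-9101-002170d25c55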

I think it is not possible for an accumulation-based approach to be both fast and constant in memory. In the worst case, there is a fork of the branch that diverges hugely over a long period of time, so that divergence either has to be buffered in memory or recalculated repeatedly.

What I think needs to be done is to use git log --reverse --date-order git-annex. Feed the refs of the changed annex log files into catObjectStream to get the log file contents. (Or use --patch and parse the diff to extract the log file lines, which might be faster?) Parse the log files, and update a simple data structure:

    Map Key [UUIDPointer]

Where UUIDPointer is a small number that indexes into a separate Map of UUIDs. This avoids storing copies of the UUIDs themselves in the per-key lists.
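A minimal Haskell sketch of that structure (Key and UUID are String stand-ins for git-annex's real types here, and UUIDIntern, LocationMap, and internUUID are hypothetical names, not existing git-annex code):

    import qualified Data.Map.Strict as M

    -- Stand-ins for git-annex's real Key and UUID types.
    type Key = String
    type UUID = String

    -- A small number stored per key in place of the full UUID.
    type UUIDPointer = Int

    -- Interning table: every UUID seen so far, mapped to its pointer.
    type UUIDIntern = M.Map UUID UUIDPointer

    -- The main structure: which (interned) UUIDs currently have each key.
    type LocationMap = M.Map Key [UUIDPointer]

    -- Look up a UUID's pointer, allocating the next free one the first
    -- time that UUID is seen.
    internUUID :: UUID -> UUIDIntern -> (UUIDPointer, UUIDIntern)
    internUUID u intern = case M.lookup u intern of
        Just p  -> (p, intern)
        Nothing -> let p = M.size intern
                   in (p, M.insert u p intern)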

This is essentially union merging all forks of the git-annex branch at each commit, but far faster and in memory. Since union merging a git-annex branch can be done at any point and always results in a consistent view of the data, this will be consistent as well.

And when updating the data structure, a size counter can be updated when something actually changed, and left alone when a redundant log entry was seen.
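Continuing the sketch above, a hedged sketch of that per-log-line update (applyLogLine is hypothetical, and it assumes each parsed location log line has already been reduced to a key, its size, an interned UUID, and a present/absent flag). The running total only moves when the membership actually changes, so the redundant second record of the same transfer is a no-op:

    import qualified Data.Map.Strict as M
    import Data.List (delete)

    type Key = String                          -- stand-ins, as in the sketch above
    type UUIDPointer = Int
    type LocationMap = M.Map Key [UUIDPointer]

    -- Apply one parsed location log line saying that key k (of size
    -- ksize bytes) is present (True) or not (False) in the repository
    -- whose interned UUID is p. The Integer is the running combined
    -- size of all repositories.
    applyLogLine :: Key -> Integer -> UUIDPointer -> Bool
                 -> (LocationMap, Integer) -> (LocationMap, Integer)
    applyLogLine k ksize p present (m, total) =
        let ps = M.findWithDefault [] k m
        in case (present, p `elem` ps) of
            -- newly present in that repository: count it
            (True, False) -> (M.insert k (p:ps) m, total + ksize)
            -- no longer present there: uncount it
            (False, True) -> (M.insert k (delete p ps) m, total - ksize)
            -- redundant log (eg the same transfer recorded on both
            -- sides of a copy): nothing changes
            _             -> (m, total)

Sampling that running total at 30 day intervals of commit dates, as the commits stream past in --date-order, would then presumably give the --size-history output.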

This approach will use an amount of memory that scales with the number of keys and the number of copies. I mocked it up using my big repository. Storing every key in it in such a map, with 64 UUIDPointers in each list (many more than the usual number of copies), took 2 gb of memory. That is a lot, but most users have that much available if necessary. With a more usual 5 copies, memory use was only 0.5 gb. So I think this is an acceptable exception to git-annex's desire to use a constant amount of memory.

(I considered a bloom filter, but a false positive would wreck the statistics. An in-memory sqlite db might be more efficient, but probably not worth the bother.)

--Joey

done