ATM it takes about a minute for 'git annex info' to run on a sizeable but not huge repository with only ~450 files under annex but a few thousand files (~7000) in the tree. I am not quite sure why it takes that long, since it seems to care only about annexed files. Also, it might be beneficial to parallelize some traversal operations to take advantage of multiple CPUs/cores.
I sense this one has reached its end, it's fast enough, so done --Joey
Actually, it does indeed relate to all the files, since it reports "annexed files in working tree"... If interested, try it on this repo: https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/freesurfer.git . On a clean new clone it took only 30 seconds.
There's --fast if you don't need the expensive-to-obtain data.
AFAIK, the working tree traversal that makes up most of the overhead of info is mostly disk-bound.
Yep, I saw --fast, but it is of limited use: it doesn't even go through the git-annex branch to report the amount of storage occupied. I didn't check, but "working tree" is probably the currently checked-out branch, which is why it requires traversal and is thus slow. What is often also desired is to know the total number/size of annexed files I have locally in any given annex. So something along the lines of 'du -scm .git/annex/objects' (but via info in the git-annex branch instead). Maybe that could be included in the --fast portion?
As for disk-bound: in my case /tmp is in RAM, there is no disk activity, and git-annex keeps one core 100% busy, FWIW:
When run on a local repository, git-annex info does not look at the git-annex branch. That would be slower than traversing the directories. (Asking for info about a remote does look at the git-annex branch.)
git ls-tree does not have to look at files on disk, so is not comparable. The work 'git annex info' does is roughly the same as a du of .git/annex/objects and a readlink of each symlink in the work tree.
If you just want the .git/annex/objects size, perhaps it would make sense to have a way to get only one stat from git-annex status. Or perhaps it would be as good to 'du .git/annex/objects'?
Also, around the time you filed this, there was a change which turned out to cause all files in the work tree to be read in full, and buffered in memory for some time. This may have to do with some of the slowdown you saw, especially since you have a lot of non-annexed files in the tree. I fixed that problem in commit b0081598c7c580d6760374c42765e94e4750e793.
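To make the cost described above concrete, here is a rough Python sketch of the two traversals: a du-like sum over .git/annex/objects and a readlink of every symlink in the work tree. It is an illustration only, not how git-annex is implemented; the function names are made up for this example, it sums apparent file sizes rather than disk blocks as du does, and it assumes POSIX-style symlink targets.

```python
import os

def objects_size(objects_dir):
    """Sum the apparent sizes of regular files under objects_dir (roughly 'du')."""
    total = 0
    for root, _dirs, files in os.walk(objects_dir):
        for name in files:
            path = os.path.join(root, name)
            if not os.path.islink(path):
                total += os.lstat(path).st_size
    return total

def annexed_links(work_tree):
    """Yield (path, target) for work-tree symlinks pointing into .git/annex/objects."""
    for root, dirs, files in os.walk(work_tree):
        if ".git" in dirs:
            dirs.remove(".git")  # do not descend into the git directory
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                target = os.readlink(path)  # no need to open the file itself
                if ".git/annex/objects/" in target:
                    yield path, target
```

Note that the second function has to visit every entry in the tree, annexed or not, which is consistent with the run time depending on the total number of files.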
I was about to whine in a separate TODO but then remembered that the issue is not new... I wondered: since the size report depends on what is or is not present locally, and all of that relates directly to the state of the git-annex branch, could annex maybe cache the collected information, keyed on the given git-annex / current branch treeishes? Then subsequent invocations would be fast.
In my case I would want to list information on multiple annexes, e.g. all those in the current directory. If each one takes 3-4 seconds, then for 30 of them it takes minutes. With caching, at least subsequent runs should be much faster (in the case of no changes, which I think would be the frequent case).
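For the multiple-annexes use case, a wrapper outside git-annex could at least run the per-repository queries concurrently. The following Python sketch assumes each annex is a direct subdirectory containing .git/annex; find_annexes, run_info and info_all are hypothetical helper names, and whether this helps in practice depends on whether the runs are CPU-bound or disk-bound.

```python
import json
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def find_annexes(parent):
    """Return subdirectories of parent that look like git-annex repositories."""
    repos = []
    for entry in sorted(os.listdir(parent)):
        repo = os.path.join(parent, entry)
        if os.path.isdir(os.path.join(repo, ".git", "annex")):
            repos.append(repo)
    return repos

def run_info(repo):
    """Run 'git annex info --json' in repo and parse its output."""
    out = subprocess.run(
        ["git", "annex", "info", "--json"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)

def info_all(parent, run=run_info, workers=4):
    """Collect info for every annex under parent, several repositories at a time."""
    repos = find_annexes(parent)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(repos, pool.map(run, repos)))
```

Calling info_all(".") would then return a dictionary mapping each repository path to its parsed info.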
It's not clear to me what needs to be done here.
I tried cloning the freesurfer repository, and in it with a cold disk cache, 'git annex info' took 0.6 seconds (warm cache, 0.1 seconds).
It seems like you wanted to get the "local annex size" value, perhaps without the overhead of the "size of annexed files in working tree" value? In an indirect mode repository, the former value is obtained the same as 'du .git/annex/objects', and should be similarly fast.
Rereading my previous comment (it has been a while), I was suggesting a super-duper feature. I am not sure if it is still needed/desired since, upon a quick retry, info indeed seems relatively speedy when run "hot" (e.g. a 2nd time in a row; not sure how long the effect lasts ;)). The idea was to cache "status" information locally so that subsequent invocations could just pick it up from the cache if nothing has changed in terms of the git-annex branch and the current branch position. E.g. if I am at commit ABC and at git-annex branch state XYZ, why not have .git/annex/caches/info/XYZ-ABC.json, which would pretty much hold the output of 'git annex info --json', and which could be reused, without ANY traversal of the git-annex branch or the local file tree, on a subsequent call if still in state XYZ and at that commit? Whenever the git-annex branch progresses away from XYZ, all previous ones in the cache could be left to RiP.
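The caching idea above could look roughly like the following Python sketch. Nothing here exists in git-annex: the cache layout is the one proposed in the comment, and cache_path, cached_info and the compute callback are hypothetical names for illustration.

```python
import json
import os

def cache_path(git_dir, annex_ref, head_commit):
    """Proposed (hypothetical) layout: <git_dir>/annex/caches/info/XYZ-ABC.json"""
    return os.path.join(git_dir, "annex", "caches", "info",
                        f"{annex_ref}-{head_commit}.json")

def cached_info(git_dir, annex_ref, head_commit, compute):
    """Return cached info for this (git-annex branch, HEAD) pair, computing it on a miss."""
    path = cache_path(git_dir, annex_ref, head_commit)
    try:
        with open(path) as f:
            return json.load(f)  # hit: no traversal needed
    except (FileNotFoundError, json.JSONDecodeError):
        pass
    info = compute()  # miss: do the expensive work once
    cache_dir = os.path.dirname(path)
    os.makedirs(cache_dir, exist_ok=True)
    # Entries recorded for other git-annex branch states are stale; drop them.
    for name in os.listdir(cache_dir):
        if not name.startswith(f"{annex_ref}-"):
            os.remove(os.path.join(cache_dir, name))
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(info, f)
    os.replace(tmp, path)  # atomic rename, so readers never see a partial file
    return info
```

Here annex_ref and head_commit would be the XYZ and ABC of the example, which could be obtained with 'git rev-parse git-annex' and 'git rev-parse HEAD', and compute would run 'git annex info --json'. One limitation of keying on these two values alone is that uncommitted changes in the work tree would not invalidate the cache.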