todo/make annex info more efficientyohhttp://git-annex.branchable.com/todo/make_annex_info_more_efficient/git-annexikiwiki2017-12-05T18:51:49Zcomment 1http://git-annex.branchable.com/todo/make_annex_info_more_efficient/comment_1_afd5c806f7285442401b027f82a8c629/EbvxpTI_xP9Aod7Mg4cwGhgjrCrdM5s- [me.yahoo.com/a]2016-02-15T15:28:40Z2016-02-15T15:28:40Z
<p>actually it does indeed relate to all the files, since "annexed files in working tree"... if interested -- try on this repo https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/freesurfer.git . On a clean new clone it took only 30 seconds <img src="http://git-annex.branchable.com/smileys/smile4.png" alt=";)" /></p>
comment 2http://git-annex.branchable.com/todo/make_annex_info_more_efficient/comment_2_0770ef5c4c261949a565723073480dcb/joey2016-02-15T16:53:09Z2016-02-15T16:51:20Z
<p>There's --fast if you don't need the expensive to obtain data.</p>
<p>AFAIK, the working tree traversal that makes up most of the overhead of
info is mostly disk-bound.</p>
comment 3http://git-annex.branchable.com/todo/make_annex_info_more_efficient/comment_3_c022caab61061b1f77b78485089a9052/EbvxpTI_xP9Aod7Mg4cwGhgjrCrdM5s- [me.yahoo.com/a]2016-02-15T17:02:59Z2016-02-15T17:02:59Z
<p>yep, saw --fast but it is of kind of limited use: it doesn't even go through the git-annex branch to report the amount of storage occupied. I didn't check, but "working tree" is probably the currently checked out branch, which is why it requires traversal and is thus slow. What is often also desired is to know the total number/size of annexed files I have locally in any given annex. So something along the lines of 'du -scm .git/annex/objects' (but via info in the git-annex branch instead). Maybe that could be included in the --fast portion?</p>
<p>as for disk-bound: in my case /tmp is in RAM, there is no disk activity, and git-annex is 100% (1 core) busy, FWIW:</p>
<div class="highlight-sh"><pre class="hl"> Command being timed<span class="hl opt">:</span> <span class="hl str">"git annex info"</span>
User <span class="hl kwa">time</span> <span class="hl opt">(</span>seconds<span class="hl opt">):</span> <span class="hl num">27.69</span>
System <span class="hl kwa">time</span> <span class="hl opt">(</span>seconds<span class="hl opt">):</span> <span class="hl num">0.68</span>
Percent of CPU this job got<span class="hl opt">:</span> <span class="hl num">99</span><span class="hl opt">%</span>
Elapsed <span class="hl opt">(</span>wall <span class="hl kwc">clock</span><span class="hl opt">)</span> <span class="hl kwa">time</span> <span class="hl opt">(</span>h<span class="hl opt">:</span>mm<span class="hl opt">:</span>ss or m<span class="hl opt">:</span>ss<span class="hl opt">):</span> <span class="hl num">0</span><span class="hl opt">:</span><span class="hl num">28.42</span>
</pre></div>
comment 4http://git-annex.branchable.com/todo/make_annex_info_more_efficient/comment_4_923dd7c22920b389488ca2625225164c/EbvxpTI_xP9Aod7Mg4cwGhgjrCrdM5s- [me.yahoo.com/a]2016-02-15T17:05:08Z2016-02-15T17:05:08Z
FWIW the actual git ls-tree command annex invokes takes 0.01 sec <img src="http://git-annex.branchable.com/smileys/smile4.png" alt=";)" />
comment 5http://git-annex.branchable.com/todo/make_annex_info_more_efficient/comment_5_c6c8850aefe3ab81f1f113daa734695b/joey2017-01-27T18:18:18Z2016-02-19T15:58:17Z
<p>When run on a local repository, git-annex info does not look at the
git-annex branch. That would be slower than traversing the directories.
(Asking for info about a remote does look at the git-annex branch.)</p>
<p>git ls-tree does not have to look at files on disk, so is not comparable.
The work <code>git annex info</code> does is roughly the same as a du of
.git/annex/objects and a readlink of each symlink in the work tree.</p>
<p>If you just want the .git/annex/objects size, perhaps it would make sense
to have a way to get only one stat from git-annex status. Or perhaps
it would be as good to <code>du .git/annex/objects</code>?</p>
comment 6http://git-annex.branchable.com/todo/make_annex_info_more_efficient/comment_6_804dbb72757b09e3abad3e249f704da0/joey2016-02-19T18:56:19Z2016-02-19T18:49:38Z
<p>Also, around the time you filed this, there was a change which turned out
to cause all files in the work tree to be read in full, and buffered in
memory for some time. This may have to do with some of the slowdown you
saw, especially since you have a lot of non-annexed files in the tree.
I fixed that problem in commit b0081598c7c580d6760374c42765e94e4750e793.</p>
cache?http://git-annex.branchable.com/todo/make_annex_info_more_efficient/comment_7_44efc2bfcdde576aaca002595476a2a2/EbvxpTI_xP9Aod7Mg4cwGhgjrCrdM5s- [me.yahoo.com/a]2016-04-01T19:56:43Z2016-04-01T19:56:43Z
<p>I was about to whine in a separate TODO but then remembered that the issue is not new...
I wondered -- since the sizes report depends on what is present locally or not, and that all directly relates to the state of the git-annex branch, could annex maybe cache the collected information, associated with a given annex / current branch treeish? Then subsequent invocations would be fast.</p>
<p>In my case I would want to list information on multiple annexes, e.g. all those in the current directory. If each one takes 3-4 seconds, then for 30 of them it adds up to minutes. With caching, at least subsequent runs should be much faster (in the case of no changes, which would be the frequent case I think)</p>
comment 8http://git-annex.branchable.com/todo/make_annex_info_more_efficient/comment_8_90a22c51b19707ee0cecbe652c6ffa98/joey2017-12-05T17:01:39Z2017-10-30T18:56:14Z
<p>It's not clear to me what needs to be done here.</p>
<p>I tried cloning the freesurfer repository, and in it with a cold
disk cache, <code>git annex info</code> took 0.6 seconds (warm cache, 0.1 seconds).</p>
<p>It seems like you wanted to get the "local annex size" value, perhaps
without the overhead of the "size of annexed files in working tree"
value? In an indirect mode repository, the former value is obtained
the same way as <code>du .git/annex/objects</code>, and should be similarly fast.</p>
comment 9http://git-annex.branchable.com/todo/make_annex_info_more_efficient/comment_9_3bec13ce0c9f0932715d73106d576e86/yarikoptic2017-12-05T18:51:49Z2017-12-05T18:51:49Z
<p>rereading my previous comment (it has been a while), I was suggesting a super-duper feature. I am not sure if it is still needed/desired since, upon a quick try again, info seems to be indeed relatively speedy when run "hot" (e.g. 2nd time in a row; not sure how long the effect lasts ;)).
The idea was to cache "status" information locally so subsequent invocations could just pick it up from the cache if nothing has changed in terms of the git-annex branch and the current branch position. E.g. if I am at a commit ABC and at git-annex branch state XYZ, why not have .git/annex/caches/info/XYZ-ABC.json, which would pretty much hold the output of <code>git annex info --json</code>, and which it could reuse, without ANY traversal of the git-annex branch or the local file tree, on a subsequent call if still in state XYZ and at that commit. Whenever the git-annex branch progresses away from XYZ, all previous entries in the cache could be left to RIP.</p>
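<p>The caching idea above could be sketched roughly as follows. This is a hypothetical wrapper, not an existing git-annex feature: the helper names (<code>rev</code>, <code>cache_file</code>, <code>cached_info</code>, <code>run_annex_info</code>) are made up for illustration, and the sketch assumes that the git-annex branch hash plus the HEAD commit hash capture everything the report depends on, which would not hold with uncommitted changes in the work tree.</p>

```python
import json
import os
import subprocess


def rev(repo, ref):
    """Resolve a ref (e.g. "git-annex" or "HEAD") to a commit hash."""
    out = subprocess.check_output(
        ["git", "-C", repo, "rev-parse", "--verify", ref], text=True
    )
    return out.strip()


def cache_file(repo, annex_sha, head_sha):
    """Path of the cache entry, following the XYZ-ABC.json naming proposed above."""
    name = f"{annex_sha}-{head_sha}.json"
    return os.path.join(repo, ".git", "annex", "caches", "info", name)


def cached_info(repo, annex_sha, head_sha, compute):
    """Return the cached report for this state, or compute and store it."""
    path = cache_file(repo, annex_sha, head_sha)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    info = compute()
    cache_dir = os.path.dirname(path)
    os.makedirs(cache_dir, exist_ok=True)
    # Entries for earlier states are no longer useful, so drop them.
    for old in os.listdir(cache_dir):
        os.remove(os.path.join(cache_dir, old))
    with open(path, "w") as f:
        json.dump(info, f)
    return info


def run_annex_info(repo):
    """Run the expensive command once and parse its JSON output."""
    out = subprocess.check_output(
        ["git", "-C", repo, "annex", "info", "--json"], text=True
    )
    return json.loads(out)
```

<p>A caller would then use something like <code>cached_info(repo, rev(repo, "git-annex"), rev(repo, "HEAD"), lambda: run_annex_info(repo))</code>, so that only the first call in a given state pays for the traversal.</p>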