I found the command "git annex lock" very slow (much slower than the initial "git annex add" with SHA1) on a not-so-big directory, when run in a big repo. It seems that each underlying git command is slow, so I thought it would be better to run them once with all the files as arguments. I had to stop the lock command and ran "git checkout ." instead (I had not changed any file). Is this a correct alternative?
Thanks a LOT for this software, one that I have missed for a long time (but wasn't able to write myself)!
Rafaël
Running "git checkout" by hand is fine, of course.

The underlying problem is that git's operations on the index scale O(N) with the number of files in the repo. So a repo with a whole lot of files will have a big index, and any operation that changes the index, like the "git reset" this needs to do, has to read in the entire index and write out a new, modified version. It seems that git could be much smarter about its index data structures here, but I confess I don't understand them at all. I hope someone takes it on, as git's scalability to the number of files in the repo is becoming a new pain point, now that scalability to large files is "solved".

Still, it is possible to speed this up at git-annex's level. Rather than doing a "git reset" followed by a "git checkout", it can just run "git checkout HEAD -- file", and since that's a single command, it can be fed into the queueing machinery in git-annex (which exists mostly to work around this git malfeasance), so only one git command needs to be run to lock multiple files.

I've just implemented the above. In my music repo, this changed a lock of a CD's worth of files from taking ctrl-c long to 1.75 seconds. Enjoy!
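To make the batching concrete, here is a sketch of the difference (the file names are hypothetical):

    # before: two index-rewriting git commands per file
    git reset -- track01.mp3
    git checkout -- track01.mp3
    # after: one command covering many files, so the big index
    # only has to be read and rewritten once for the whole batch
    git checkout HEAD -- track01.mp3 track02.mp3 track03.mp3

Since each index-touching command pays the full O(N) cost, collapsing two commands per file into one command per batch is what turns minutes into seconds.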
(Hey, this even speeds up the one-file case greatly, since "git reset -- file" is slooooow; it seems to scan the entire repository tree. Yipes.)

Nice! So if I understand correctly, 'git reset -- file' was there to discard staged (but not committed) changes made to 'file' before checking out, so that the whole operation is equivalent to a direct 'git checkout HEAD -- file'? I'm curious about the "queueing machinery in git-annex": does it end up calling the one git command with multiple files as arguments? Does it correspond to the message "(Recording state in git...)"? Thanks!
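(A possible way to check this empirically, if the global --debug option is available in your version: it makes git-annex print the commands it runs, so one can watch whether many files are passed to a single git invocation. The directory name here is just a placeholder:

    git annex lock --debug somedir/

)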
Thank you for the answer to this problem!
I put 150k emails in maildir format into git-annex. Adding them took a couple of hours, and so did the first full sync. But after that, things didn't noticeably slow down; that was a surprise!
But then I wanted to do a one-time sync using rsync. I didn't find out how to tell rsync to follow symlinks at the destination instead of replacing the symlinks with normal files. So I first unlocked the 150k emails (which was moderately quick and probably only I/O bound) and then did the rsync with the --checksum option, which worked well too. The problems started when I wanted to lock the whole thing again. This took ages, and even after two days it had not output that it was adding any files at all. So I used

    find -type f | split -a3 -l100

and then

    for f in x*; do echo $f; git annex add $(cat $f); rm $f; done

This started off well, and I could finally see files being added. Unfortunately, when I came back after a couple of hours, progress had slowed to a crawl of only one file every few seconds. (A simpler batched variant is sketched below.)

The solution was to just "git checkout -- Mail" everything. This finished in a matter of seconds and left intact the new mail that had been copied over by rsync. Thanks a lot for the tip!
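As an aside on the split/for-loop batching above: a simpler equivalent, assuming GNU find and xargs, is to let xargs build the argument lists (this avoids the temporary x* files and is safe for odd file names):

    # NUL-separated names survive spaces and newlines; xargs packs
    # as many files into each git annex add invocation as will fit
    find Mail -type f -print0 | xargs -0 git annex add

Here "Mail" is the same directory used above.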
Let me hereby also report that git-annex seems to work without problems when used together with offlineimap and notmuch. At least I have not run into any problems with that setup yet.