I found the command "git annex lock" very slow (much slower than the initial "git annex add" with SHA1) on a not-so-big directory, when run in a big repo. It seems that each underlying git command is slow, so I thought it would be better to run them once with all the files as arguments. I had to stop the lock command and ran "git checkout ." instead (I had not changed any file). Is this a correct alternative?
Thanks a LOT for this software, one that I have missed for a long time (but wasn't able to write myself)!
Rafaël
Running "git checkout" by hand is fine, of course.

The underlying problem is that git's operations on the index scale O(N) with the number of files in the repo. So a repo with a whole lot of files will have a big index, and any operation that changes the index, like the "git reset" this needs to do, has to read in the entire index and write out a new, modified version. It seems that git could be much smarter about its index data structures here, but I confess I don't understand them at all. I hope someone takes it on, as git's scalability to the number of files in the repo is becoming a new pain point, now that scalability to large files is "solved".

Still, it is possible to speed this up at git-annex's level. Rather than doing a "git reset" followed by a "git checkout", it can just run "git checkout HEAD -- file", and since that's a single command, it can be fed into the queueing machinery in git-annex (which exists mostly to work around this git malfeasance), so only one git command needs to be run to lock multiple files.

I've just implemented the above. In my music repo, this changed a lock of a CD's worth of files from taking ctrl-c long to 1.75 seconds. Enjoy!
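To make the batching concrete, here is a sketch of the difference (the file names are hypothetical):

    # before: two index-rewriting git commands per file
    git reset -- track01.mp3
    git checkout -- track01.mp3
    # after: one command covering many files, so the big index
    # only has to be read and rewritten once for the whole batch
    git checkout HEAD -- track01.mp3 track02.mp3 track03.mp3

Since each index-touching command pays the full O(N) cost, collapsing two commands per file into one command per batch is what turns minutes into seconds.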
(Hey, this even speeds up the one-file case greatly, since "git reset -- file" is slooooow; it seems to scan the entire repository tree. Yipes.)

Nice! So if I understand correctly, 'git reset -- file' was there to discard staged (but not committed) changes made to 'file' before checking out, so that the whole operation is equivalent to a direct 'git checkout HEAD -- file'? I'm curious about the "queueing machinery in git-annex": does it end up calling the one git command with multiple files as arguments? Does it correspond to the message "(Recording state in git...)"? Thanks!
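(A possible way to check this empirically, if the global --debug option is available in your version: it makes git-annex print the commands it runs, so one can watch whether many files are passed to a single git invocation. The directory name here is just a placeholder:

    git annex lock --debug somedir/

)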
Thank you for the answer to this problem!
I put 150k emails in maildir format into git-annex. Adding them took a couple of hours, and so did the first full sync. But after that, things didn't noticeably slow down; that was a surprise!
But then I wanted to do a one-time sync using rsync. I didn't find out how to tell rsync to follow symlinks at the destination instead of replacing the symlinks with normal files. So I first unlocked the 150k emails (which was moderately quick and probably only I/O bound) and then did the rsync with the --checksum option, which worked well too. The problems started when I wanted to lock the whole thing again. This took ages, and even after two days it had not output that it was adding any files at all. So I used

    find -type f | split -a3 -l100

and then

    for f in x*; do echo $f; git annex add $(cat $f); rm $f; done

This started off well, and I could finally see files being added. Unfortunately, when I came back after a couple of hours, progress had slowed to a crawl of only one file every few seconds. (A simpler batched variant is sketched below.)

The solution was to just "git checkout -- Mail" everything. This finished in a matter of seconds and left intact the new mail that had been copied over by rsync. Thanks a lot for the tip!
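As an aside on the split/for-loop batching above: a simpler equivalent, assuming GNU find and xargs, is to let xargs build the argument lists (this avoids the temporary x* files and is safe for odd file names):

    # NUL-separated names survive spaces and newlines; xargs packs
    # as many files into each git annex add invocation as will fit
    find Mail -type f -print0 | xargs -0 git annex add

Here "Mail" is the same directory used above.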
Let me hereby also report that git-annex seems to work without problems when used together with offlineimap and notmuch. At least I have not run into any problems with that setup yet.