Just as git does not scale well with large files, it can also become painful to work with when you have a large number of files. Below are things I have found to minimise the pain.
Using version 4 index files
During operations which affect the index, git writes an entirely new index out to .git/index.lock and then renames it over .git/index. With a large number of files, this index file can be quite large and take several seconds to write every time you manipulate the index!
This can be mitigated by changing it to version 4 which uses path compression to reduce the filesize:
git update-index --index-version 4
NOTE: The git documentation warns that this version may not be supported by other git implementations like JGit and libgit2.
Personally, I saw a reduction from 516MB to 206MB (40% of original size) and got a much more responsive git!
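Note that this converts only the current index file; a freshly written index (for example in a new clone) will use the default version again. If I understand the config correctly, you can also ask git to default to version 4 for newly initialized index files:

git config index.version 4        # default version for newly initialized index files
# on git 2.24 or newer, feature.manyFiles turns this on too,
# along with other large-repo defaults such as the untracked cache:
git config feature.manyFiles true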
It may also be worth doing the same to git-annex's index:
GIT_INDEX_FILE=.git/annex/index git update-index --index-version 4
Though I didn't gain as much here with 89MB to 86MB (96% of original size).
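To confirm the conversion took effect, you can peek at the index header; the 4-byte version field directly follows the "DIRC" signature:

xxd -l 8 .git/index           # the second 4-byte field should read 0000 0004
xxd -l 8 .git/annex/index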
Packing
As I have gc disabled:
git config gc.auto 0
so that I control when it is run, I ended up with a lot of loose objects, which also slow git down. Using
git count-objects
to tell me how many loose objects I have; when I reach a threshold (~25000), I pack those loose objects and clean things up:
git repack -d
git gc
git prune
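This can be wrapped in a small script that only repacks once the loose object count crosses the threshold. A minimal sketch (the 25000 threshold and the parsing of git count-objects output are just my choices here):

#!/bin/sh
# repack-if-needed: pack loose objects once there are "too many" of them
threshold=25000
loose=$(git count-objects | awk '{print $1}')   # first field is the loose object count
if [ "$loose" -ge "$threshold" ]; then
    git repack -d
    git gc
    git prune
fi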
File count per directory
If it takes a long time to list the files in a directory, naturally, git(-annex) will be affected by this bottleneck.
You can avoid this by keeping the number of files in a directory to between 5000 and 20000 (depending on the filesystem and its settings).
fpart can be a very useful tool to achieve this.
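For example, fpart can split a flat directory into lists of at most N files each, which you can then move into per-bucket subdirectories before adding them to the annex. A rough sketch, assuming fpart's -f (maximum files per partition) and -o (output list prefix) options:

fpart -f 10000 -o /tmp/chunk .        # writes one file list per partition: /tmp/chunk.0, /tmp/chunk.1, ...
for list in /tmp/chunk.*; do
    dir="bucket-${list##*.}"          # e.g. bucket-0, bucket-1, ...
    mkdir -p "$dir"
    while IFS= read -r f; do
        mv "$f" "$dir/"
    done < "$list"
done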
This sort of usage was discussed in Handling a large number of files and "git annex sync" synced after 8 hours. -- CandyAngel
Forget tracking information
In addition to keeping track of where files are, git-annex keeps a log of where files used to be. This can take up space as well and slow down certain operations.
You can use the git-annex-forget command to drop historical location tracking info for files.
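For example (the --drop-dead option additionally forgets repositories that have been marked dead; syncing afterwards propagates the rewritten git-annex branch):

git annex forget               # drop historical location tracking info
git annex forget --drop-dead   # also forget repositories marked as dead
git annex sync                 # propagate the change to other remotes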
Note: this was discussed in scalability with lots of files. -- anarcat
As writing the index file becomes the bottleneck, turning on split index mode might help as well. See git-update-index's man page.
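A sketch of enabling it (note the mixed experience reported in a comment further down):

git update-index --split-index     # split .git/index into a large shared index plus a small delta
git config core.splitIndex true    # keep using split index for future index writes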
I have been playing with tracking a large number of URLs for about one month now. Having already been disappointed by how git performs when there is a very large number of files in the annex, I tested making multiple annexes. I did find that splitting the URLs into multiple annexes increased performance, but at the cost of extra housekeeping, duplicated URLs, and more work needed to keep track of the URLs. Part of the duplication and tracking problem was mitigated by using a dumb remote, such as rsync or directory, where a very large number of objects can be stored. The dumb remotes perform very well; however, each annex needed to be synced regularly with the dumb remote.
I found the dumb remote to be great for multiple annexes. I noticed that a person can create a new annex, extract a tarball of symlinks into the repo, then git commit the links. Subsequently, executing git-annex fsck --from dummy would set up the tracking info, which was pretty useful. However, I found that by the time I got to over fifty annexes, the overall performance was far worse than just storing the URLs and file paths in a PostgreSQL database. In fact, the URLs are already being stored and accessed from such a database, but I had the desire to access the URLs from multiple machines, which is a bit more difficult with a centralized database.
After reading the tips and pages discussing splitting the files into multiple directories and changing the index version, I decided to try a single annex to hold the URLs. Over the new year's weekend, I wrote a script that generates RSS files to use with importfeed to add URLs to this annex. I noticed that when using git commit, the load average of the host was in the mid twenties and persisted for hours until I had to kill the processes to be able to use the machine again (I would really like to know if there is a git config setting that would keep the load down, so the machine can be used during a commit). I gave up on git-annex sync this morning, since it was taking longer than I was able to sit in the coffee shop and wait for it (~3 hrs). I came back to the office and started git gc, which has been running for ~1hr.

When making the larger annex, I decided to use the hexadecimal value of uuid5(url) for each filename, and to use the two high nybbles and the two low nybbles for a two-stage directory structure, following the advice from CandyAngel. When my URLs are organized in this manner, I still need access to the database to perform the appropriate git-annex get, which impairs the use of this strategy, but I'm still testing and playing around. I suspended adding URLs to this annex until I get at least one sync performed.

The URL annex itself is not very big, and I am guessing the average file size to be close to 500K. The large number of URLs seems to be a problem I have yet to solve. I wanted to experiment with this to further the idea of public git-annex repositories, which seems useful, even though the utility of this idea is very limited at the moment.
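For reference, a bash sketch of deriving such a two-stage path from a URL, assuming util-linux's uuidgen (whose --sha1 --namespace @url --name options produce a UUIDv5 in the URL namespace); the URL, paths, and the use of addurl here are placeholders:

url="http://example.com/some/file.jpg"                     # placeholder URL
id=$(uuidgen --sha1 --namespace @url --name "$url" | tr -d '-')
dir="${id:0:2}/${id:30:2}"                                 # two high nybbles / two low nybbles
mkdir -p "$dir"
git annex addurl --file "$dir/$id" "$url"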
@umeboshi: Odd that you report your machine freezes during commits.. I find the exact opposite.. waiting for a long time with no load at all.
My current setup for my sorting annex (where I import all the files off old HDDs) is to have the HDD plugged into my home server (Atom 330 @ 1.6GHz) and import the files into a cloned (empty) annex. Doing so for 1.1M files (latest HDD) is a long wait, because 80% of the time is spent waiting for something to happen (with no load on the machine). Once that is done, the HDD is transferred to my desktop, where the annex is "joined" to the others and files are sorted in a dedicated VM[1], where commit times are reasonable.
[1] Fully virtualising my desktop is possibly the best thing I've ever done, in terms of setup. Locking up any VM affects none of the others (which is handy, as I discovered an issue that causes X to almost hardlock whenever libvo is used..).
Splitting the index (git update-index --split-index) doesn't work for me at all. While it may initially reduce the size of .git/index, making a commit inflates it back to its original size anyway.
I thought it might be some interaction with the v4 index and its compression mechanics, but it does the same with a v3 index. For (manufactured) example:
Do you have any information on actual times for working with big repos?
As an example, I created one with 400k files. After following the steps here, git status takes 8 seconds to complete. I have plenty of resources, so it's just slow. I am curious what sort of times you're getting with your big repos.

I will have to see if submodules help with this at all. This material is all reference information and isn't going to be changed very much, so it's possible I'd be better off with an "active" repo and a "reference" repo (maybe connected by submodule, maybe not).

Joey did make the suggestion of storing those sorts of files in a separate branch. I just did a test, and it appears that the limiting factor is in fact the number of files in the working tree. Deleting a lot of the files brought git back up to speed. So from a simplicity standpoint, I may want to have a reference branch with those files in it, and perhaps two local clones of the repo, one main and one reference, so I can explore and copy files from reference to main as needed.

Separate branch is a no-go: git annex info takes 3 minutes 30 seconds to report 320k annex keys. So for my purposes I think I will keep one slow reference repo and one fast working repo.
I'm trying to use a repo consisting of about 150,000 photos/videos. I tried all the tips here as well as the ones at "git annex sync" synced after 8 hours, and the time is still quite poor. I don't know if using the special remote directory with importtree=yes hurts; I don't think it should. The problem seems to be largely CPU-bound and RAM-bound; syncs can use many GB of RAM and a large amount of CPU time (even when there is no evident hashing of source files). --jobs=10 hasn't caused much evident parallelization. Changing the git index type, repacking, etc. rocketed through almost instantly and made no evident change. I'd be very interested in ideas here, because at this rate, a sync that is a no-op has been running for 15 minutes, just sitting there after "list source ok". I'll let it run and see what it does.
If it makes a difference, this is an unlocked repo (via git annex adjust --unlocked), not running assistant. There are no directories with excessive numbers of photos. The underlying filesystem is ZFS.
You may try the following: set the preferred-content expression for the repo to just present or anything, and then run git annex sync --content --all. This allows git-annex to use an optimization and should run faster. Don't use --jobs, and unset the annex.jobs git config, since these slow the optimization down a bit in my experience (note that just specifying --jobs=1 is not the same AFAIK).

See also Incremental git annex sync --content --all.
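For concreteness, that suggestion translates to something like this (using . for the local repository):

git annex wanted . present             # or: git annex wanted . anything
git config --unset annex.jobs          # make sure annex.jobs is not set
git annex sync --content --all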
Using unlocked files will slow down things in general, but from your description it doesn't sound like that's the issue here.
Importing a large tree from a special remote does have a price. git-annex has to list the files, check to see if these are files it has imported before (that is the cid lookup), and also has to feed the files and sha1s to git to build a tree object. Only once it's built the tree object can it see that no files have changed and it has nothing to do.
That is always going to be slower than making a git repository in the directory and running git-annex add there. Because in that case, git can use the index to quickly tell what files are modified or new, and skip the unchanged files.
Bear in mind that the directory special remote is not the only special remote that supports importing, and so the import interface has to be general enough to support others. So it can't use the filesystem level tricks that git is able to use with its index.
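For comparison, that faster alternative looks roughly like this (the path, description, and commit message are placeholders):

cd /path/to/source          # the directory the files live in
git init
git annex init "source"     # optional description
git annex add .
git commit -m "add source files"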
Hi Joey,
Thanks for this! I may be missing some important detail here...
So what I'm seeing is that it says "list source" and then stats every file in the importdir location. That makes sense, and goes through in a minute or two.
Then it sits at 100% CPU for an hour or two, reading a lot from the cidsdb, but not opening, stating, or reading from any other file.
Maybe it checksummed something while I wasn't watching, but generally while I'm watching at least, I don't see it actually checksumming every file on every call to import.
So is it building the tree object that's slow here? And, since it doesn't store mtime, how does it detect changes?
Thanks again for your patience with all these questions as I dive back into git-annex!