Just as git does not scale well with large files, it can also become painful to work with when you have a large number of files. Below are things I have found to minimise the pain.
Using version 4 index files
During operations which affect the index, git writes an entirely new index out to .git/index.lock and then renames it over .git/index. With a large number of files, this index file can be quite large and take several seconds to write every time you manipulate the index!
This can be mitigated by changing it to version 4 which uses path compression to reduce the filesize:
git update-index --index-version 4
NOTE: The git documentation warns that this version may not be supported by other git implementations like JGit and libgit2.
Personally, I saw a reduction from 516MB to 206MB (40% of original size) and got a much more responsive git!
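Note that this converts only the current index file; a freshly written index (for example in a new clone) will use the default version again. If I understand the config correctly, you can also ask git to default to version 4 for newly initialized index files:

git config index.version 4        # default version for newly initialized index files
# on git 2.24 or newer, feature.manyFiles turns this on too,
# along with other large-repo defaults such as the untracked cache:
git config feature.manyFiles true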
It may also be worth doing the same to git-annex's index:
GIT_INDEX_FILE=.git/annex/index git update-index --index-version 4
Though I didn't gain as much here with 89MB to 86MB (96% of original size).
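To confirm the conversion took effect, you can peek at the index header; the 4-byte version field directly follows the "DIRC" signature:

xxd -l 8 .git/index           # the second 4-byte field should read 0000 0004
xxd -l 8 .git/annex/index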
Packing
As I have gc disabled:
git config gc.auto 0
so that I control when it is run, I ended up with a lot of loose objects, which also slow git down. Using
git count-objects
to tell me how many loose objects I have; when I reach a threshold (~25000), I pack those loose objects and clean things up:
git repack -d
git gc
git prune
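This can be wrapped in a small script that only repacks once the loose object count crosses the threshold. A minimal sketch (the 25000 threshold and the parsing of git count-objects output are just my choices here):

#!/bin/sh
# repack-if-needed: pack loose objects once there are "too many" of them
threshold=25000
loose=$(git count-objects | awk '{print $1}')   # first field is the loose object count
if [ "$loose" -ge "$threshold" ]; then
    git repack -d
    git gc
    git prune
fi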
File count per directory
If it takes a long time to list the files in a directory, naturally, git(-annex) will be affected by this bottleneck.
You can avoid this by keeping the number of files in a directory to between 5000 and 20000 (depending on the filesystem and its settings).
fpart can be a very useful tool to achieve this.
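For example, fpart can split a flat directory into lists of at most N files each, which you can then move into per-bucket subdirectories before adding them to the annex. A rough sketch, assuming fpart's -f (maximum files per partition) and -o (output list prefix) options:

fpart -f 10000 -o /tmp/chunk .        # writes one file list per partition: /tmp/chunk.0, /tmp/chunk.1, ...
for list in /tmp/chunk.*; do
    dir="bucket-${list##*.}"          # e.g. bucket-0, bucket-1, ...
    mkdir -p "$dir"
    while IFS= read -r f; do
        mv "$f" "$dir/"
    done < "$list"
done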
This sort of usage was discussed in Handling a large number of files and "git annex sync" synced after 8 hours. -- CandyAngel
Forget tracking information
In addition to keeping track of where files are, git-annex keeps a log of where files used to be. This can take up space as well and slow down certain operations.
You can use the git-annex-forget command to drop historical location tracking info for files.
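For example (the --drop-dead option additionally forgets repositories that have been marked dead; syncing afterwards propagates the rewritten git-annex branch):

git annex forget               # drop historical location tracking info
git annex forget --drop-dead   # also forget repositories marked as dead
git annex sync                 # propagate the change to other remotes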
Note: this was discussed in scalability with lots of files. -- anarcat
As writing the index file becomes the bottleneck, turning on split index mode might help as well. See git-update-index's man page.
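A sketch of enabling it (note the mixed experience reported in a comment further down):

git update-index --split-index     # split .git/index into a large shared index plus a small delta
git config core.splitIndex true    # keep using split index for future index writes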
I have been playing with tracking a large number of URLs for about one month now. Having already been disappointed by how git performs when there is a very large number of files in the annex, I tested making multiple annexes. I did find that splitting the URLs into multiple annexes increased performance, but at the cost of extra housekeeping, duplicated URLs, and more work needed to keep track of the URLs. Part of the duplication and tracking problem was mitigated by using a dumb remote, such as rsync or directory, where a very large number of objects can be stored. The dumb remotes perform very well; however, each annex needed to be synced regularly with the dumb remote.
I found the dumb remote to be great for multiple annexes. I noticed that a person can create a new annex, extract a tarball of symlinks into the repo, then git commit the links. Subsequently, executing git-annex fsck --from dummy would set up the tracking info, which was pretty useful. However, I found that by the time I got to over fifty annexes, the overall performance was far worse than just storing the URLs and file paths in a PostgreSQL database. In fact, the URLs are already being stored and accessed from such a database, but I had the desire to access the URLs from multiple machines, which is a bit more difficult with a centralized database.
After reading the tips and pages discussing splitting the files into multiple directories and changing the index version, I decided to try a single annex to hold the URLs. Over the new year's weekend, I wrote a script that generates RSS files to use with importfeed to add URLs to this annex. I noticed that when using git commit, the load average of the host was in the mid twenties and persisted for hours until I had to kill the processes to be able to use the machine again (I would really like to know if there is a git config setting that would keep the load down, so the machine can be used during a commit). I gave up on git-annex sync this morning, since it was taking longer than I was able to sit in the coffee shop and wait for it (~3 hrs). I came back to the office and started git gc, which has been running for ~1hr.

When making the larger annex, I decided to use the hexadecimal value of uuid5(url) for each filename, and to use the two high nybbles and the two low nybbles for a two-stage directory structure, following the advice from CandyAngel. When my URLs are organized in this manner, I still need access to the database to perform the appropriate git-annex get, which impairs the use of this strategy, but I'm still testing and playing around. I suspended adding URLs to this annex until I get at least one sync performed.

The URL annex itself is not very big, and I am guessing the average file size to be close to 500K. The large number of URLs seems to be a problem I have yet to solve. I wanted to experiment with this to further the idea of public git-annex repositories, which seems useful, even though the utility of this idea is very limited at the moment.
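For reference, a bash sketch of deriving such a two-stage path from a URL, assuming util-linux's uuidgen (whose --sha1 --namespace @url --name options produce a UUIDv5 in the URL namespace); the URL, paths, and the use of addurl here are placeholders:

url="http://example.com/some/file.jpg"                     # placeholder URL
id=$(uuidgen --sha1 --namespace @url --name "$url" | tr -d '-')
dir="${id:0:2}/${id:30:2}"                                 # two high nybbles / two low nybbles
mkdir -p "$dir"
git annex addurl --file "$dir/$id" "$url"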
@umeboshi: Odd that you report your machine freezes during commits.. I find the exact opposite.. waiting for a long time with no load at all.
My current setup for my sorting annex (where I import all the files off old HDDs) is to have the HDD plugged into my home server (Atom 330 @ 1.6GHz) and import the files into a cloned (empty) annex. Doing so for 1.1M files (latest HDD) is a long wait, because 80% of the time is spent waiting for something to happen (with no load on the machine). Once that is done, the HDD is transferred to my desktop, where the annex is "joined" to the others and files are sorted in a dedicated VM[1], where commit times are reasonable.
[1] Fully virtualising my desktop is possibly the best thing I've ever done, in terms of setup. Locking up any VM affects none of the others (which is handy, as I discovered an issue that causes X to almost hardlock whenever libvo is used..).
Splitting the index (git update-index --split-index) doesn't work for me at all. While it may initially reduce the size of .git/index, making a commit inflates it back to its original size anyway.
I thought it might be some interaction with the v4 index and its compression mechanics, but it does the same with a v3 index. For (manufactured) example:
Do you have any information on actual times for working with big repos?
As an example, I created one with 400k files. After following the steps here, git status takes 8 seconds to complete. I have plenty of resources, so it's just slow. I am curious what sort of times you're getting with your big repos.

I will have to see if submodules help with this at all. This material is all reference information and isn't going to be changed very much, so it's possible I'd be better off with an "active" repo and a "reference" repo (maybe connected by submodule, maybe not).

Joey did make the suggestion of storing those sorts of files in a separate branch. I just did a test, and it appears that the limiting factor is in fact the number of files in the working tree. Deleting a lot of the files brought git back up to speed. So from a simplicity standpoint, I may want to have a reference branch with those files in it, and perhaps two local clones of the repo, one main and one reference, so I can explore and copy files from reference to main as needed.

Separate branch is a no-go: git annex info takes 3 minutes 30 seconds to report 320k annex keys. So for my purposes I think I will keep one slow reference repo and one fast working repo.
I'm trying to use a repo consisting of about 150,000 photos/videos. I tried all the tips here as well as the ones at "git annex sync" synced after 8 hours, and the time is still quite poor. I don't know if using the special remote directory with importtree=yes hurts; I don't think it should. The problem seems to be largely CPU-bound and RAM-bound; syncs can use many GB of RAM and a large amount of CPU time (even when there is no evident hashing of source files). --jobs=10 hasn't caused much evident parallelization. Changing the git index type, repacking, etc. rocketed through almost instantly and made no evident change. I'd be very interested in ideas here, because at this rate, a sync that is a no-op has been running for 15 minutes, just sitting there after "list source ok". I'll let it run and see what it does.
If it makes a difference, this is an unlocked repo (via git annex adjust --unlocked), not running assistant. There are no directories with excessive numbers of photos. The underlying filesystem is ZFS.
You may try the following: set the preferred-content expression for the repo to just present or anything, and then run git annex sync --content --all. This allows git-annex to use an optimization and should run faster. Don't use --jobs, and unset the annex.jobs git config, since these slow the optimization down a bit in my experience (note that just specifying --jobs=1 is not the same AFAIK).

See also Incremental git annex sync --content --all.
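For concreteness, that suggestion translates to something like this (using . for the local repository):

git annex wanted . present             # or: git annex wanted . anything
git config --unset annex.jobs          # make sure annex.jobs is not set
git annex sync --content --all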
Using unlocked files will slow down things in general, but from your description it doesn't sound like that's the issue here.
Importing a large tree from a special remote does have a price. git-annex has to list the files, check to see if these are files it has imported before (that is the cid lookup), and also has to feed the files and sha1s to git to build a tree object. Only once it's built the tree object can it see that no files have changed and it has nothing to do.
That is always going to be slower than making a git repository in the directory and running git-annex add there. Because in that case, git can use the index to quickly tell what files are modified or new, and skip the unchanged files.
Bear in mind that the directory special remote is not the only special remote that supports importing, and so the import interface has to be general enough to support others. So it can't use the filesystem level tricks that git is able to use with its index.
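For comparison, that faster alternative looks roughly like this (the path, description, and commit message are placeholders):

cd /path/to/source          # the directory the files live in
git init
git annex init "source"     # optional description
git annex add .
git commit -m "add source files"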
Hi Joey,
Thanks for this! I may be missing some important detail here...
So what I'm seeing is that it says "list source" and then stats every file in the importdir location. That makes sense, and goes through in a minute or two.
Then it sits at 100% CPU for an hour or two, reading a lot from the cidsdb, but not opening, stating, or reading from any other file.
Maybe it checksummed something while I wasn't watching, but generally while I'm watching at least, I don't see it actually checksumming every file on every call to import.
So is it building the tree object that's slow here? And, since it doesn't store mtime, how does it detect changes?
Thanks again for your patience with all these questions as I dive back into git-annex!