I have noticed performance getting really slow when adding files (git annex add .) to a directory already containing several hundred thousand files. When using git annex, is it recommended to split large numbers of files into multiple directories containing fewer files each? Is there a particular recommended way of handling large numbers of files (say, getting into the millions) in git annex?
Thanks
I've posted a few tips on this site about dealing with large (file count) repositories. I should probably collate them into a single post in tips (and I shall do so!).
In terms of directories containing hundreds of thousands of files, my research indicates that the slowness stems from how the filesystem (at least, for ext3) fetches the listing, as it requires multiple "trips" to query it (it only returns 32KB of metadata at a time or something like that).
So you can really speed it up by partitioning your files in such a way that a directory listing returns quickly. Personally, I use fpart and 5000 files per directory, but I think you can go up to about 20000.
If you run 'find' and it takes a really long time to start printing paths, you have too many files in one directory.
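In case it helps, here's a rough sketch of the partitioning idea. This is not fpart itself, just a hypothetical stand-in (the directory name and chunk size are made up) that moves a flat directory into numbered subdirectories of at most 5000 files each before you run git annex add:

```python
#!/usr/bin/env python3
# Hypothetical sketch, not fpart: split a flat directory into numbered
# subdirectories of at most CHUNK files each, so that directory listings
# (and "git annex add") stay fast.
import os
import shutil

SRC = "bigdir"   # the overly large flat directory (made-up name)
CHUNK = 5000     # files per subdirectory

# Snapshot the file names first, before we start creating subdirectories.
files = sorted(
    name for name in os.listdir(SRC)
    if os.path.isfile(os.path.join(SRC, name))
)

for i, name in enumerate(files):
    subdir = os.path.join(SRC, f"part{i // CHUNK:04d}")
    os.makedirs(subdir, exist_ok=True)
    shutil.move(os.path.join(SRC, name), os.path.join(subdir, name))

print(f"moved {len(files)} files into "
      f"{(len(files) + CHUNK - 1) // CHUNK} subdirectories")
```

After something like that, git annex add only ever has to enumerate a few thousand entries per directory.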
awesome tips page, thanks!
i was wondering... i seem to recall newer changes to ext4 that made it more scalable with large directories, the dir_index flag i think. did you try that and, if so, what was the effect? --anarcat

From what I have gleaned about dir_index, it speeds up finding particular files in a directory, but slows down enumeration of directory contents. It could very well end up as six of one, half a dozen of the other, since git-annex does a bit of both.
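For anyone who wants to see whether dir_index is already turned on for their filesystem, something along these lines works (a hedged sketch that just parses `tune2fs -l` output; the device path is made up):

```python
#!/usr/bin/env python3
# Hedged sketch: report whether dir_index is enabled on an ext3/ext4
# filesystem by parsing `tune2fs -l` output. The device path is made up.
import subprocess

DEVICE = "/dev/sdb1"   # hypothetical device backing the repository

out = subprocess.run(
    ["tune2fs", "-l", DEVICE],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    if line.startswith("Filesystem features:"):
        features = line.split(":", 1)[1].split()
        print("dir_index is enabled" if "dir_index" in features
              else "dir_index is NOT enabled")
        break
```

(If it isn't enabled, tune2fs -O dir_index followed by an e2fsck -fD pass is the usual route, but check your distro's docs first.)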
So yeah, I've benchmarked this a little (just with a loopback ext3 partition, dropping caches each time, so caveat emptor) and my findings match the above.
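The procedure was roughly along these lines (a simplified reconstruction, not my actual script; it needs root for the cache drop, and the mount point and file counts are just placeholders):

```python
#!/usr/bin/env python3
# Simplified reconstruction of the benchmark (not the original script): create
# N empty files in a directory on the loopback partition, drop the kernel
# caches, then time a full directory listing. Needs root for the cache drop.
import os
import time

TEST_DIR = "/mnt/ext3-loop/bench"   # assumed mount point of the loopback image
FILE_COUNTS = [125, 1000, 16000, 64000, 102400]

def drop_caches():
    """Flush dirty pages, then ask the kernel to drop page/dentry/inode caches."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def populate(n):
    """Create empty files f0000000 .. f<n-1>. Names overlap between runs,
    so each run only adds the files the previous run didn't create."""
    os.makedirs(TEST_DIR, exist_ok=True)
    for i in range(n):
        open(os.path.join(TEST_DIR, f"f{i:07d}"), "w").close()

def time_listing():
    drop_caches()
    start = time.monotonic()
    names = os.listdir(TEST_DIR)    # the operation being measured
    return len(names), time.monotonic() - start

for count in FILE_COUNTS:
    populate(count)
    n, secs = time_listing()
    print(f"{n} files: {secs:.3f}s")
```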
Without dir_index, enumerating 102400 files takes 56% more time than enumerating 125 files (0.69 seconds vs 1.08 seconds), but with dir_index, it takes 715% more time (0.6 seconds vs 4.86 seconds). However, the initial creation of the files was much faster with dir_index than without (though I wasn't timing that part).
The results I've gotten so far show that (with dir_index):

- min time: increases after 64000 files
- max time: increases after 128000 files
- mean time: increases after 16000 files
Pretty easy to see where the 20000 files number may have come from.
I'll post the numbers once I've run a longer test (100 iterations), in case anyone is interested. My conclusion is that enabling dir_index and keeping to about 16000 files per directory is optimal.