I have noticed performance getting really slow when adding files (git annex add .) to a directory already containing several hundred thousand files. When using git annex, is it recommended to split large numbers of files into multiple directories containing fewer files each? Is there a particular recommended way of handling large numbers of files (say, getting into the millions) in git annex?
Thanks
I've posted a few tips on this site about dealing with large (file count) repositories. I should probably collate them into a single post in tips (and I shall do so!).
In terms of directories containing hundreds of thousands of files, my research indicates that the slowness stems from how the filesystem (at least, for ext3) fetches the listing, as it requires multiple "trips" to query it (it only returns 32KB of metadata at a time or something like that).
So you can really speed it up by partitioning your files in such a way that a directory listing returns quickly. Personally, I use fpart and 5000 files per directory, but I think you can go up to about 20000.
If you run 'find' and it takes a really long time to start printing paths, you have too many files in one directory.
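In case it helps, here's a rough sketch of the partitioning idea. This is not fpart itself, just a hypothetical stand-in (the directory name and chunk size are made up) that moves a flat directory into numbered subdirectories of at most 5000 files each before you run git annex add:

```python
#!/usr/bin/env python3
# Hypothetical sketch, not fpart: split a flat directory into numbered
# subdirectories of at most CHUNK files each, so that directory listings
# (and "git annex add") stay fast.
import os
import shutil

SRC = "bigdir"   # the overly large flat directory (made-up name)
CHUNK = 5000     # files per subdirectory

# Snapshot the file names first, before we start creating subdirectories.
files = sorted(
    name for name in os.listdir(SRC)
    if os.path.isfile(os.path.join(SRC, name))
)

for i, name in enumerate(files):
    subdir = os.path.join(SRC, f"part{i // CHUNK:04d}")
    os.makedirs(subdir, exist_ok=True)
    shutil.move(os.path.join(SRC, name), os.path.join(subdir, name))

print(f"moved {len(files)} files into "
      f"{(len(files) + CHUNK - 1) // CHUNK} subdirectories")
```

After something like that, git annex add only ever has to enumerate a few thousand entries per directory.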
awesome tips page, thanks!
i was wondering... i seem to recall newer changes to ext4 that made it more scalable with large directories, the dir_index flag i think. did you try that and, if so, what was the effect? --anarcat

From what I have gleaned about dir_index, it speeds up finding particular files in a directory, but slows down enumeration of directory contents. It could very well end up as six of one, half a dozen of the other, since git-annex does a bit of both.
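For anyone who wants to see whether dir_index is already turned on for their filesystem, something along these lines works (a hedged sketch that just parses `tune2fs -l` output; the device path is made up):

```python
#!/usr/bin/env python3
# Hedged sketch: report whether dir_index is enabled on an ext3/ext4
# filesystem by parsing `tune2fs -l` output. The device path is made up.
import subprocess

DEVICE = "/dev/sdb1"   # hypothetical device backing the repository

out = subprocess.run(
    ["tune2fs", "-l", DEVICE],
    capture_output=True, text=True, check=True,
).stdout

for line in out.splitlines():
    if line.startswith("Filesystem features:"):
        features = line.split(":", 1)[1].split()
        print("dir_index is enabled" if "dir_index" in features
              else "dir_index is NOT enabled")
        break
```

(If it isn't enabled, tune2fs -O dir_index followed by an e2fsck -fD pass is the usual route, but check your distro's docs first.)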
So yeah, I've benchmarked this a little (just with a loopback ext3 partition, dropping caches each time, so caveat emptor) and my findings match the above.
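The procedure was roughly along these lines (a simplified reconstruction, not my actual script; it needs root for the cache drop, and the mount point and file counts are just placeholders):

```python
#!/usr/bin/env python3
# Simplified reconstruction of the benchmark (not the original script): create
# N empty files in a directory on the loopback partition, drop the kernel
# caches, then time a full directory listing. Needs root for the cache drop.
import os
import time

TEST_DIR = "/mnt/ext3-loop/bench"   # assumed mount point of the loopback image
FILE_COUNTS = [125, 1000, 16000, 64000, 102400]

def drop_caches():
    """Flush dirty pages, then ask the kernel to drop page/dentry/inode caches."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def populate(n):
    """Create empty files f0000000 .. f<n-1>. Names overlap between runs,
    so each run only adds the files the previous run didn't create."""
    os.makedirs(TEST_DIR, exist_ok=True)
    for i in range(n):
        open(os.path.join(TEST_DIR, f"f{i:07d}"), "w").close()

def time_listing():
    drop_caches()
    start = time.monotonic()
    names = os.listdir(TEST_DIR)    # the operation being measured
    return len(names), time.monotonic() - start

for count in FILE_COUNTS:
    populate(count)
    n, secs = time_listing()
    print(f"{n} files: {secs:.3f}s")
```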
Without dir_index, enumerating 102400 files takes 56% more time than enumerating 125 files (0.69 seconds vs 1.08 seconds), but with dir_index, it takes 715% more time (0.6 seconds vs 4.86 seconds). However, the initial creation of the files was much faster with dir_index than without (though I wasn't timing that part).
The results I've gotten so far show that (with dir_index):

- min time: increases after 64000 files
- max time: increases after 128000 files
- mean time: increases after 16000 files
Pretty easy to see where the 20000 files number may have come from.
I'll post the numbers once I've run a longer test (100 iterations), in case anyone is interested. My conclusion is that enabling dir_index and keeping to about 16000 files per directory is optimal.