In my analyses I often end up with many (>10k) small generated files in a single directory.
I would like to store these in git-annex in order to version them and possibly even synchronize them. The problem is that when a huge number of files is stored in the repository, the repository itself becomes huge and slow. There are some ways to improve the performance (1, 2, 3), but they don't solve the issue completely.
I was wondering whether it is possible to force git-annex to treat a single directory with many files as a single item, perhaps at the cost of abandoning checksum verification?
Have you thought about tarring/compressing the files and then including only the single archive object in the annex?
That would also give you the benefit of deduplicating the data.
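A minimal sketch of what I mean, assuming the repository has already been set up with `git init` and `git annex init` (the directory and archive names below are just placeholders):

```python
import subprocess
import tarfile

def pack_and_annex(directory, archive):
    """Pack a directory of many small files into one archive and annex it."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(directory, arcname=directory)
    # git-annex now only has to hash and track a single large object
    # instead of tens of thousands of small ones.
    subprocess.run(["git", "annex", "add", archive], check=True)
    subprocess.run(["git", "commit", "-m", f"snapshot of {directory}"], check=True)

pack_and_annex("results", "results.tar.gz")
```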
I considered this, but unfortunately in that form I cannot work with the individual files.
There are hacks like archivemount, but they are extremely slow for large numbers of files and not available on many systems.
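For reference, the kind of workaround I mean looks roughly like this, assuming archivemount and FUSE are installed (the paths are just examples):

```python
import os
import subprocess

# Mount the archive as a FUSE filesystem so the files inside can be
# browsed without extracting them, then unmount when done.
os.makedirs("mnt", exist_ok=True)
subprocess.run(["archivemount", "results.tar.gz", "mnt"], check=True)
# ... work with the files under mnt/ here ...
subprocess.run(["fusermount", "-u", "mnt"], check=True)
```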
Well, there's no way I know of to make git treat a directory as one item in its history, so even if git-annex treated the directory as a single item, it would not help much.
You can check a tar archive into git-annex as a single file and have hooks that handle unpacking it into the directory and repacking the directory back into the archive.
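A rough sketch of such pack/unpack helpers, assuming an archive called results.tar.gz that unpacks into a results/ directory (both names are made up, and these are plain helper functions rather than actual git hooks):

```python
import subprocess
import tarfile

ARCHIVE = "results.tar.gz"  # hypothetical archive tracked by git-annex
WORKDIR = "results"         # directory the archive unpacks into

def unpack():
    """Fetch the archive content from the annex and extract it for local work."""
    subprocess.run(["git", "annex", "get", ARCHIVE], check=True)
    with tarfile.open(ARCHIVE, "r:gz") as tar:
        tar.extractall()

def repack():
    """Repack the working directory and record the new archive version."""
    # Unlock so the annexed (read-only) file can be replaced with new content.
    subprocess.run(["git", "annex", "unlock", ARCHIVE], check=True)
    with tarfile.open(ARCHIVE, "w:gz") as tar:
        tar.add(WORKDIR, arcname=WORKDIR)
    subprocess.run(["git", "annex", "add", ARCHIVE], check=True)
    subprocess.run(["git", "commit", "-m", "update archived directory"], check=True)
```

Wiring these into actual hooks, aliases, or a small Makefile is then mostly a matter of taste.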