Hi there, I have an old archive drive with ~300k files, ~2 TB of data. They're files that I would like to use in my work, but I've had to move them off my machine due to space. I periodically copy files off of the archive when I need to work with them. This was, of course, before I had even heard of git-annex.
So now I'm wondering how I can start to integrate these files into my work. Two basic ideas I have are:
- git-annex the whole thing right away, and git annex get them onto my local machine as needed.
- Start an empty annex on the archive drive. Move files from the old archive location into the annex as needed.
So basically I'm wondering between annexing the whole thing to start, or gradually building up the annex.
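For concreteness, here's roughly what I imagine each approach looking like (paths, repo names, and descriptions below are just placeholders, and I may have details wrong):

```
# Approach 1: annex everything on the archive drive up front
cd /mnt/archive
git init
git annex init "archive drive"
git annex add .
git commit -m "annex the whole archive"

# ...then clone it onto my workstation and fetch content only as needed
git clone /mnt/archive ~/work/archive
cd ~/work/archive
git annex init "workstation"
git annex get path/to/needed-file

# Approach 2: start an empty annex on the drive and grow it gradually
mkdir /mnt/archive/annex && cd /mnt/archive/annex
git init
git annex init "archive drive"
mv ../old-archive/path/to/needed-file .
git annex add needed-file
git commit -m "add needed-file"
```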
I have no idea how well git-annex will work with 300k files / 2 TB of data.
How would you approach incorporating an old archive drive into a new annex?
2 TB of data is no problem. git does start to slow down as the number of files in a tree increases; 200,000 or so is where it might start to become noticeable. With this many files, updating .git/index will need to write out something like 50 MB of data to disk.
(git has some "split index" stuff that is supposed to help with this, but I have not had the best experience with it.)
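(If you do want to experiment with it anyway, it's just a config knob; this is a sketch, not a recommendation:)

```
# enable git's split index for this repository (use with caution)
git config core.splitIndex true
git update-index --split-index
```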
Committing the files to a branch other than master might be a reasonable compromise. Then you can just copy the git-annex symlinks over to master as needed, or check out the branch from time to time.
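A rough sketch of that workflow, with placeholder branch and file names:

```
# in the archive repository: commit everything to a side branch
git checkout -b archive
git annex add .
git commit -m "add archive files"

# back on master: copy over just the symlinks you need
git checkout master
git checkout archive -- path/to/file
git commit -m "bring path/to/file into master"
git annex get path/to/file
```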
The main bottleneck with that approach would be that the git-annex branch will also contain one location log file per annexed file, and writing to .git/annex/index will slow down a bit with so many files too. But git-annex has a lot of optimisations around batching writes to its index that should keep the impact minimal.
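If you're curious how large the git-annex branch gets, one way to peek at it (just a sketch) is to list its tree:

```
# count the files tracked on the git-annex branch
# (roughly one *.log location-log file per annexed file, plus a few metadata files)
git ls-tree -r --name-only git-annex | wc -l
```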
I think that could work nicely. I do like the idea of having my files annexed, and distributing them across machines that way, so this strikes me as a good compromise.
Thank you for the idea!