I am evaluating the best strategy for using git-annex to manage my media library. It consists of about 300,000 files totaling 1 TB of data.
My question is whether it would be advantageous to split the repo into several smaller ones (like Photos, Videos, Music, Books, ...).
Would this affect the performance of certain operations, i.e. operations with superlinear complexity (O(n^a) with a > 1)?
I am thinking of "git annex unused", which takes 22 minutes on my machine when run on the full repo.
Do you have any other interesting information on using git-annex at this scale?
git-annex unused needs to scan the entire repository. But it uses a bloom filter, so its complexity is O(n) in the number of keys.

git annex fsck scans the entire repository and also reads all available file content. But we have incremental fsck support now.

The rest of git-annex is designed to have good locality.
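For reference, a minimal sketch of how those scans can be kept manageable, assuming a reasonably current git-annex (the incremental options below come from the fsck man page; check your version):

    git annex unused                           # single O(n) pass over all keys
    git annex fsck --incremental               # start a content check, interrupt whenever you like
    git annex fsck --more                      # resume where the last incremental fsck stopped
    git annex fsck --incremental-schedule=30d  # restart a full pass on a rolling 30-day schedule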
The main problem you are likely to run into is inefficiencies with git's index file. This file records the status of every file in the repository, and commands like git add rewrite the whole file. git-annex uses a journal to minimise operations that need to rewrite the git index file, but this won't help you when you're using raw git commands in the repository.
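As a side note, plain git has a few settings that can shrink the cost of rewriting a large index; they are stock git features, but whether they help in a given setup depends on your git version, so treat this as a sketch:

    git update-index --index-version 4   # path-prefix compression makes the index file smaller
    git config core.untrackedCache true  # cache untracked-path lookups for git status
    git config feature.manyFiles true    # git >= 2.24: turns on similar large-repo defaults in one step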
The biggest problem I am actually facing is the git pack-objects command, which takes forever. I am not starting it manually, but "git annex sync" is.
Any advice on how to make this faster? Does it make sense at all to pack a bunch of incompressible symlinks?
I already tried the delta=false attribute, without any (major) effect.
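Concretely, this is what I mean (gitattributes(5) spells it as the unset form -delta rather than delta=false); whether something like pack.window would also help is just my guess:

    # .gitattributes in the repo root: never attempt delta compression
    * -delta

    # stock git setting; whether it speeds up the packs behind "git annex sync" is an assumption
    git config pack.window 0    # skip the delta search entirely when packing objects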