Hi,
So I have yet another idea to speed up git annex. For now only for the 2nd pass of git annex sync --content --all.
1. Do a normal (full) git annex sync. For every remote that we synced with, record the commit id of the current tip of the git-annex branch.
   - Record the commit id only if --content --all was specified.
   - Record the commit id only if the remote is actually available and every file was successfully transferred.
2. If any of the remotes doesn't have a commit id recorded, go to 1. Else do an incremental git annex sync: in the 2nd pass of git annex sync --content --all, only look at keys whose location log changed since the last (full or incremental) sync, found via
   git diff-tree -r --name-only <lowest recorded commit id of all remotes> git-annex
   Again, update the commit id of remotes that we successfully synced with.
3. If one of the following happens, remove all recorded commit ids of all remotes and go to 1. Else go to 2.
   - The preferred content expression of us or one of our remotes changed.
   - The preferred content expression of a group changed.
   - The group of any repo (not only remotes) changed. This way remotes whose preferred content expression contains copies=<group>:<numcopies> recheck all keys.
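To make the steps above a bit more concrete, here is a minimal sketch of the two checks on the diff-tree output. It is only an illustration, not the script mentioned below. The function names are made up. It assumes that per-key location logs sit two hash directories deep in the git-annex branch and end in .log, and that the settings from step 3 are stored in preferred-content.log, group-preferred-content.log and group.log. As far as I know, keys with unusual characters are stored under an escaped name, which this sketch does not undo.

```shell
# Read paths (one per line), as printed by
#   git diff-tree -r --name-only <commit> git-annex
# and print the key for every per-key location log.
# Assumption: location logs look like "f87/4d5/<key>.log"; top-level files
# such as uuid.log and other per-key files such as "<key>.log.web" are skipped.
paths_to_keys() {
    while IFS= read -r path; do
        case "$path" in
            */*/*.log)
                name=${path##*/}              # drop the hash directories
                printf '%s\n' "${name%.log}"  # drop the ".log" suffix
                ;;
        esac
    done
}

# Hypothetical helper: list the keys whose location log changed since commit $1.
changed_keys_since() {
    git diff-tree -r --name-only "$1" git-annex | paths_to_keys
}

# Hypothetical helper: succeed if any path on stdin is one of the settings
# files whose change should trigger a full sync (file names are an assumption).
needs_full_sync() {
    found=1
    while IFS= read -r path; do
        case "$path" in
            preferred-content.log|group-preferred-content.log|group.log)
                found=0
                ;;
        esac
    done
    return "$found"
}
```

Usage would be something like running git diff-tree once, feeding the output to needs_full_sync first, and only calling paths_to_keys on it if that check fails.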
This should be pretty reliable, but please double check. It has to be reliable enough to become the default.
So I implemented this with a bash script, see here. It's a bit hacky, since git annex sync doesn't have --batch, and even if it did, --batch can't work with keys. Instead it checks out a temporary branch, links the keys to sync in a directory and then uses --content-of= to sync only the keys within that directory.

I'll use it with my repos and see if it works reliably.
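For anyone who wants to try the same trick by hand, the linking step could look roughly like the sketch below. This is not the script linked above, just an assumption of how it can be done with git annex fromkey. The function name and the directory argument are made up, and checking out and cleaning up the temporary branch is left to the caller.

```shell
# Hypothetical helper: link every key read from stdin into directory $1
# (which must be inside the work tree), then sync only that directory.
# Assumes a temporary branch is already checked out.
sync_keys_in_dir() {
    linkdir=$1
    mkdir -p "$linkdir" || return 1
    while IFS= read -r key; do
        # --force allows linking a key whose content is not present locally;
        # stdin is redirected so the command cannot swallow the key list
        git annex fromkey --force "$key" "$linkdir/$key" < /dev/null || return 1
    done
    # transfer content only for the files below $linkdir
    git annex sync --content-of="$linkdir"
}
```

It would be fed with the list of changed keys, one per line, for example: changed keys | sync_keys_in_dir tmp-incremental-sync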