Hi,
So I have yet another idea to speed up git-annex, for now only for the second pass of `git annex sync --content --all`:
- Check which remotes are currently available (i.e. online and connected).
- If any of the available remotes doesn't have a commit recorded (see below), do a full sync.
- If numcopies.log, mincopies.log, trust.log, group.log, preferred-content.log, required-content.log, group-preferred-content.log or transitions.log (on the git-annex branch) changed since the last time we synced, do a full sync: a change to any of those logs may affect the preferred-content expressions, so every key needs to be reevaluated.
- If every check passed, do an incremental sync (a sketch of this decision logic follows the list).
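A minimal bash sketch of that decision logic, assuming a made-up hidden ref namespace `refs/annex-incremental/<remote>` for the recorded tips (the remote-availability check is left out):

```sh
#!/bin/bash
# Sketch only: decide between a full and an incremental sync.
full=false
for remote in $(git remote); do
    # No tip recorded for this remote yet: must do a full sync.
    tip=$(git rev-parse -q --verify "refs/annex-incremental/$remote") || { full=true; break; }
    # Any change to a log that can affect preferred content forces a full sync.
    if git diff --name-only "$tip" git-annex -- \
            numcopies.log mincopies.log trust.log group.log \
            preferred-content.log required-content.log \
            group-preferred-content.log transitions.log | grep -q .
    then
        full=true
        break
    fi
done
if $full; then
    git annex sync --content --all      # full sync
else
    echo "incremental sync (see below)" # placeholder
fi
```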
Full sync:
- Do a normal (full) git annex sync.
- For every remote that we synced with, record the commit id of the current tip of the git-annex branch.
- Record the commit id only if every file was successfully transferred/dropped (see the sketch below).
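Recording could be as simple as updating one hidden ref per remote after a fully successful sync; the ref namespace here is an assumption, not something git-annex maintains:

```sh
# Sketch: remember the git-annex branch tip we fully synced $remote at.
git update-ref "refs/annex-incremental/$remote" git-annex
```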
Incremental sync:
- In the second pass of `git annex sync --content --all`, only look at keys whose location log changed since the last (full or incremental) sync, via `git diff-tree -r --name-only <lowest recorded commit id of all remotes> git-annex` (see the sketch after this list).
- Again, update the commit ids of the remotes that we successfully synced with.
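To turn that diff into keys, the hash directories and the `.log` extension can be stripped from the changed paths; the layout assumed here is the git-annex branch's `xxx/yyy/<key>.log` scheme:

```sh
# Sketch: keys whose location logs changed since $lowest.
git diff-tree -r --name-only "$lowest" git-annex \
    | sed -n 's!^[0-9a-f]\{3\}/[0-9a-f]\{3\}/\(.*\)\.log$!\1!p'
```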
So I implemented this with a bash script, see here. It's a bit hacky, since `git annex sync` doesn't have `--batch`, and even if it did, `--batch` can't work with keys. Instead, the script checks out a temporary branch, links the keys to sync into a directory, and then uses `--content-of=` to sync only the keys within that directory. I'll use it with my repos and see if it works reliably.
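A rough bash sketch of that workaround (not the actual script; the branch and directory names are made up, and `git annex fromkey --force` is used to point files at keys whose content may be missing locally):

```sh
# Sketch: sync only specific keys by staging them as files on a
# throwaway branch and limiting the sync with --content-of.
git checkout -b tmp-sync
mkdir -p tosync
while read -r key; do
    git annex fromkey --force "$key" "tosync/$key"
done < changed-keys
git annex sync --content-of=tosync
git checkout - && git branch -D tmp-sync
```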
The script needed a few tweaks; you can find the updated version here.
I have been using the script for half a year now and everything works fine. I even had a problem with failed transfers due to a bad SATA cable, and the script handled it properly. `git annex fsck --all` confirms that everything is alright.

Thank you for rewording, which should not have been necessary, but seems to have helped my reading comprehension.
This does seem like a good idea! That diff should be fast, and if a location log changed, preferred content needs to be rechecked against the changed situation; if it didn't, we know preferred content will have the same result as currently applies. Elegant.
I suppose it needs to record the branch tip for each remote, because different remotes can be synced at different times. It can record it locally, in a hidden ref or something.
Your script checks for changes to preferred-content.log etc. by storing a copy and comparing it with the current one. But since it knows the old git-annex branch tip, it can just request a diff of those files between the old and new shas, eg. something along these lines (with `$oldtip` standing in for the recorded tip):
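```sh
# Assumed form; $oldtip is the recorded git-annex branch tip.
git diff --name-only "$oldtip" git-annex -- \
    numcopies.log mincopies.log trust.log group.log \
    preferred-content.log required-content.log \
    group-preferred-content.log transitions.log
```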
If that outputs anything, the logs changed and the optimisation can't be used.
Weirdly, this will often make `--all` faster than not using `--all`, because it will be able to quickly see there is nothing to do. It occurs to me that the same method could be used to tell when a non-all sync is a no-op, and so speed those up too, although only in the case where there was a previous `--all` sync. Or, it could record a tuple of (tree, git-annex branch), and use that to speed up non-all syncs, at least of the variety that don't operate on a specific list of files, but on a whole tree.
My recent optimisations of `git-annex sync` with importtree remotes use a similar diffing approach.

`git-annex satisfy` syncs `--content` by default, so this optimisation would be especially nice to have for it.

To implement this optimisation for a non-all sync, when the tree being synced has changed, it ought to diff from the old tree to the current tree, and sync those files. Preferred content can vary depending on filename, and diffing like that will avoid scanning every file in the whole tree.
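For instance, a sketch assuming the old tree sha was recorded alongside the git-annex branch tip:

```sh
# Sketch: files that changed in the synced tree itself.
git diff-tree -r --name-only "$oldtree" "$(git rev-parse 'HEAD^{tree}')"
```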
And when there are location log changes, it needs to also sync files in the tree that use keys whose location log changed, using the git-annex branch diff to find those keys. (And presumably then using the keys database to get back to the filenames.)
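From the command line, `git annex whereused` can do that key-to-filename lookup against the currently checked out tree; a sketch, reusing the changed-keys list from above:

```sh
# Sketch: map each changed key back to the files that use it.
while read -r key; do
    git annex whereused --key="$key"
done < changed-keys
```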
So, implementing an optimisation like this for a non-all sync has two separate diffs which would have to be combined together somehow.
Doing that in constant memory would be hard. A bloom filter cannot be used to check whether a file was already processed in the first diff and skip it in the second, because a false positive would skip a file whose location log did change. I think it would need to use an on-disk structure instead (eg sqlite)?
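A minimal bash sketch of such an on-disk seen-set via the sqlite3 CLI (the database path and helper names are made up):

```sh
# Sketch: exact membership test on disk, avoiding bloom-filter
# false positives when combining the two diffs.
db=.git/annex/seen.db    # hypothetical location
sqlite3 "$db" 'CREATE TABLE IF NOT EXISTS seen (file TEXT PRIMARY KEY);'

mark_seen() {
    local f=${1//\'/\'\'}    # escape single quotes for SQL
    sqlite3 "$db" "INSERT OR IGNORE INTO seen VALUES ('$f');"
}

was_seen() {
    local f=${1//\'/\'\'}
    [ -n "$(sqlite3 "$db" "SELECT 1 FROM seen WHERE file = '$f';")" ]
}
```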
None of which should prevent implementing this nice optimisation for `--all`.