Pondering syncing today. I will be doing syncing of the git repository first, and working on syncing of file data later.
The former seems straightforward enough, since we just want to push all changes to everywhere. Indeed, git-annex already has a sync command that uses a smart technique to allow syncing between clones without a central bare repository. (Props to Joachim Breitner for that.)
But it's not all easy. Syncing should happen as fast as possible, so changes show up without delay. Eventually it'll need to support syncing between nodes that cannot directly contact one-another. Syncing needs to deal with nodes coming and going; one example of that is a USB drive being plugged in, which should immediately be synced, but network can also come and go, so it should periodically retry nodes it failed to sync with. To start with, I'll be focusing on fast syncing between directly connected nodes, but I have to keep this wider problem space in mind.
One problem with git annex sync
is that it has to be run in both clones
in order for changes to fully propagate. This is because git doesn't allow
pushing changes into a non-bare repository; so instead it drops off a new
branch in .git/refs/remotes/$foo/synced/master
. Then when it's run locally
it merges that new branch into master
.
So, how to trigger a clone to run git annex sync
when syncing to it?
Well, I just realized I have spent two weeks developing something that can
be repurposed to do that! Inotify can watch for changes to
.git/refs/remotes
, and the instant a change is made, the local sync
process can be started. This avoids needing to make another ssh connection
to trigger the sync, so is faster and allows the data to be transferred
over another protocol than ssh, which may come in handy later.
So, in summary, here's what will happen when a new file is created:
- inotify event causes the file to be added to the annex, and immediately committed.
- new branch is pushed to remotes (probably in parallel)
- remotes notice new sync branch and merge it
- (data sync, TBD later)
- file is fully synced and available
Steps 1, 2, and 3 should all be able to be accomplished in under a second.
The speed of git push
making a ssh connection will be the main limit
to making it fast. (Perhaps I should also reuse git-annex's existing ssh
connection caching code?)