The git annex sync
command provides an easy way to keep several
repositories in sync.
Often git is used in a centralized fashion, with a central bare repository that changes are pulled from and pushed to using normal git commands. That works fine if you don't mind having a central repository.
But it can be harder to use git in a fully decentralized fashion, with no central repository, and still keep repositories in sync with one another. You have to remember to pull from each remote, and merge the appropriate branch after pulling. It's difficult to push to a remote, since git does not allow pushes into the currently checked out branch.
git annex sync makes it easier using a scheme devised by Joachim Breitner. The idea is to have a branch synced/master (actually, synced/$currentbranch) that is never directly checked out, and serves as a drop-point for other repositories to use to push changes.
When you run git annex sync, it merges the synced/master branch into master, receiving anything that's been pushed to it. (If there is a conflict in this merge, automatic conflict resolution is used to resolve it.) Then it fetches from each remote, and merges in any changes that have been made to the remotes too. Finally, it updates synced/master to reflect the new state of master, and pushes it out to each of the remotes.
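To make that concrete, here is a rough sketch of what a sync of the master branch boils down to for a single hypothetical remote named laptop; the real command also handles multiple remotes, other branch names, and the git-annex branch:

    # merge whatever other repositories have pushed to our drop-point branch
    git merge synced/master

    # fetch the remote and merge in its changes as well
    git fetch laptop
    git merge laptop/synced/master

    # update the drop-point branch to the new state of master and push it out,
    # avoiding a push into the branch checked out on the remote
    git branch -f synced/master master
    git push laptop master:synced/master git-annex:synced/git-annex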
This way, changes propagate around between repositories as git annex sync is run on each of them. Not every repository needs to be able to talk to every other repository; as long as the graph of repositories is connected, and git annex sync is run from time to time on each, a given change, made anywhere, will eventually reach every other repository.
(git-annex sync will also attempt to push the master branch to remotes, which does work for bare repositories.)
The workflow for using git annex sync is simple:

- Make some changes to files in the repository, using git-annex, or anything else.
- Run git annex sync to save the changes.
- Next time you're working on a different clone of that repository, run git annex sync to update it.
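For example, assuming two hypothetical clones on a laptop and a desktop that each have the other configured as a git remote, the round trip looks like this:

    # on the laptop: record a change
    git annex add bigfile.iso
    git annex sync

    # later, on the desktop: pick up that change
    git annex sync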
Note that by default, git annex sync only synchronises the git repositories, but does not transfer the content of annexed files. If you want to fully synchronise two repositories' content, you can use git annex sync --content. You can also configure preferred content settings to make only some content be synced.
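For illustration, assuming a remote named usbdrive, content syncing and a simple preferred content expression might be set up like this (the expression is only an example):

    # transfer annexed file contents along with the git metadata
    git annex sync --content

    # only sync podcasts to the usb drive
    git annex wanted usbdrive "include=podcasts/*"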
See git-annex-sync for the command's man page.
I came upon git-annex a few months ago. I saw immediately how it could help with some frustrations I've been having. One in particular is keeping my vimrc in sync across multiple locations and platforms. I finally took the time to give it a try after hitting my boiling point this morning. I went through the walkthrough and now I have an annex everywhere I need it.

git annex sync and my vimrc is up-to-date, simply grand! Thanks so much for making git-annex. -- Daniel Wozniak
git annex copy --to bareremote. You could run that in cron. Or, the assistant can be run as a daemon, and automatically syncs git-annex data.
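For example, a crontab entry along these lines, with a placeholder repository path and remote name, would push new content to the bare remote once an hour:

    # m h dom mon dow  command
    0 * * * *  cd /path/to/repo && git annex copy --to bareremote --quiet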
By default, git annex sync will sync to all remotes, unless you specify a remote. So, I have to specify, e.g., git annex sync origin. I can simplify this with aliases, I suppose, but I do a lot of teaching non-programmer scientists... so it'd be nice to be able to configure this (so beginning users don't have to keep track of as many things). Is there (or will there be) a way to do this?
Just in case you haven't considered such a scenario - maybe you have suggestions for how to collaborate more effectively with git annex (and avoid warning messages):
I'm trying to teach beginning scientist programmers (mostly graduate students), and a common scenario is to fork some scientific code. I'd like forking on github to be mundane, and not trigger warnings, and generally have as little for folks to explicitly keep track of as possible (this seems to be a common concern we share, which leads you to prefer syncing to all remotes without the option to configure the default behavior!).
However, I am currently working with students on forking and fixing up scientific code where the upstream maintainer doesn't want to allow pushes upstream, except via pull request. So, part of our approach is to set up some common shared datasets in git annex (and these just end up in our fork). If we have an "upstream" remote, git annex will try to sync with it, and report an error.
So - that's why I'd like to be able to configure the deactivation of syncing to a defined remote (e.g., "upstream"). However, if you have other suggestions to smooth the workflow, I would also like to hear those!
@Dav what kind of URL does the upstream remote have? Perhaps it would be sufficient to make sync skip trying to push to git:// and http[s]:// remotes. Both are unlikely to accept pushes, and in the cases where they do accept pushes it would be fine to need a manual git push.

Anyway, you can already configure which remotes get synced with. From the man page:

So: git config remote.upstream.annex-sync false
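In context, assuming the remote really is named upstream as in the scenario above, that amounts to the following; a plain sync then skips upstream, while naming it explicitly still syncs it:

    # never sync with the upstream fork by default
    git config remote.upstream.annex-sync false

    git annex sync            # skips upstream
    git annex sync upstream   # still syncs it when asked explicitly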
I noticed that in a test with 2 local repositories and around 2'000 files "git annex sync" is still very fast, but "git annex sync --content" takes multiple seconds. Is this avoidable?
I have a central repo and client repos. I want to copy all content to the central repo after a commit. Right now, I use "git annex group central backup", "git annex wanted central standard", and a hook that triggers "git annex sync --content" after each commit. Maybe there is a more efficient way to do this? Thanks for sharing thoughts.
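For reference, a setup along those lines (the remote name central is taken from the comment above; the hook goes in the usual git hooks directory) could look like this:

    # make the central repo want everything, using the backup standard group
    git annex group central backup
    git annex wanted central standard

    # push content out after every commit
    cat > .git/hooks/post-commit <<'EOF'
    #!/bin/sh
    git annex sync --content
    EOF
    chmod +x .git/hooks/post-commit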
I too feel that syncing all remotes by default is the right thing to do, but I think it should be limited to the 'master' and 'git-annex' branches. I often create branches that I want to keep local and do not want them to be synced. But I want the 'master' and 'git-annex' branches to be synced with all remotes.

So it would be nice to be able to set an option to sync all branches or just 'master' and 'git-annex', or to be able to ignore some branches during git annex sync.
Shri
I agree with mshri. It’s confusing to have every local branch wind up on every remote (and it hinders «git annex unused»).
I tried working around this by just including relevant branches in the «fetch» refspec, but this will only work until another remote pushes the branches again.
"git annex sync pushes all branches" -- it does not. It pushes only the git-annex branch and the currently checked out branch.

git annex sync --content has to check each file to see if any other repository wants it. This is necessarily going to get slow when there are a lot of files. The assistant does a similar syncing but uses some tricks to avoid scanning all the files too often, while still managing to keep them all in sync -- it can do this since it's a long-running daemon and is aware when files have changed.

"git annex sync ... fetches from each remote"
Well, I have two git annex-ed repositories where "git remote -v" properly lists the other repo, and "git annex sync foo" manages to pull from foo, but "git annex sync" without a remote name simply does a local sync. Also, neither command pushes anything anywhere.
So, where does "git annex" get its list of remotes from? What could prevent it from accessing them?
If a remote has "remote.<name>.annex-sync" set to false in the git config, git-annex sync will skip that remote unless you specify the name. That's probably what's going on in your case.

My way of working with git-annex doesn't seem to mesh well with the Assistant or even with git annex sync. I seem to have a bit of a control need when it comes to what gets committed when. But here's my workflow approximating what it does, with a twist. I have this in git config on mylaptop:

I don't need a synced/git-annex. If upstream is not up-to-date I fetch and merge. In this case upstream happens to be a bare git repo, so I don't need synced/master either. If upstream is non-bare, I use synced/master -- or sometimes I keep upstream usually checked out on an orphan branch and just switch into master to check things and then switch away to avoid conflict. If I can avoid it, I prefer not to have several branches where I don't know which one is the latest one.

But here's the twist, look at this row:

If I just do git push, close the lid and run into the forest, it may or may not have a non-fastforward event on master and git-annex ... but it always succeeds in pushing to the mylaptop remote on my server.

If I have added a batch of files, I usually push first to all my remotes, to get that precious metadata up there. At that point I don't care if there's a conflict upstream. Then I git annex copy to wherever, fetch all remotes, git annex merge, maybe merge master if I have to (usually not), then push to all remotes again. It's less of a bother than it sounds like. I don't even have any handy aliases for this, I prefer to just get the for loop from my command-line history.

1) When I have a branch "some/branch/name" containing slashes in its name, git-annex sync strips everything up to the last slash and creates "synced/name", which may clash with "some/other/name". Is there a workaround?
2) Could the "don't use synced branches" behavior referred to in the comments above somehow be configured on the repository side, so that everyone cloning it doesn't need to configure it for themselves?
@kartynnik, that's a bug: sync uses conflicting names for deep branches

Please file bugs there and not as comments here; it's too easy to lose track of a comment deep in a thread.
Hi,
how does "git-annex sync --content" transfers its file to a (regular) ssh-remote? I think it uses rsync.. Is that correct?
I want to use compression for the file transfers. Therefore, I tried in .git/config to set:
However, it seems that this crashes the upload. The sync just seems to hang.. Is it possible to use compression for the transfer? How?
@mario, great question! (Not the best place for such a question; start a thread on the forum next time.)
git-annex does use rsync when transferring files between ssh remotes. Rsync normally goes over ssh, and it might be better to enable compression at the ssh level. For example, I have "Compression yes" in ~/.ssh/config.
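For instance, a host entry like the following (the host name is a placeholder) enables compression for all ssh traffic to that server, including the transfers git-annex runs over ssh:

    # ~/.ssh/config
    Host myserver.example.com
        Compression yes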
I think that the reason your annex-rsync-upload-options setting broke it is that rsync needs --compress to be passed on to the other rsync process (in the remote repository), and that is run via git-annex-shell, which has a whitelist of options it will pass to rsync. Passing arbitrary options to rsync could allow unwanted behavior when git-annex-shell is being used as a security barrier. And --compress is one of the options that both the rsync sender and receiver have to agree on for the rsync protocol to work.
I have added a note to the man page about this limitation of what the rsync-options settings can be used to do.
I've finally taken the time to learn git-annex and am extraordinarily impressed by its usefulness and documentation.
I'm currently using git-annex as part of a scientific workflow, wherein I use git to track my analysis source code and LaTeX reports, and git-annex to handle large binary files (typically input data).
git annex sync is really handy for making sure my git-annex branch propagates between my remotes, and it's hard to beat the usefulness of git annex sync --content now that I've wrapped my head around standard groups. However, I'd prefer if there was a flag (or configurable option) to suppress git annex sync from pushing/pulling whatever branch currently happens to be checked out. I'm a pretty thoughtful committer and want more control over where my code branches (e.g., master) get pushed around. I saw the --no-pull and --no-push options for git annex sync, but it seems that this suppresses all push/pull behavior, and thus git annex sync --no-push --no-pull will not sync up my special git-annex branch. Is there an option or workflow that accomplishes what I'm looking for?

TL;DR: I want a way to tell git annex sync to leave my master branch (or whatever the currently checked out branch is) alone (no pushing/pulling), but otherwise behave normally (e.g., git annex sync will just push/pull my special git-annex branch around, or git annex sync --content will push/pull the special git-annex branch, and also move content around as it makes sense). Apologies if this is already possible, but I haven't been able to figure it out.

@Dan, there's an open todo about that: http://git-annex.branchable.com/todo/sync_--branches__to_sync_only_specified_branches___40__e.g._git-annex__41__/
Please follow up there if the suggested new option would work for you.
I'm currently cleaning up 3 machines (with the goal of eventually upgrading my OS's) and 2 large external drives filled with 10 plus years of backups, so my current situation is somewhat temporary and may not apply to others.
I've started using preferred content to manage which repos hang onto which content. My main cleanup workflow involves moving files into a staging repository and then adding them to the annex -- then letting the preferred content settings figure out where to send the content. If I know exactly where I want the content to go, I'll move it directly into the appropriate folder, but if I haven't figured that out yet, sometimes I'll just put it in a stage folder. I've simplified my preferred content settings to assume that I only have one big external drive where everything except the contents of the stage directory should go, but in reality it's split up a bit across the two drives I already mentioned...

I noticed the other day that I had some missing content in big/photo/raw, so I went into that folder and ran git annex get . to rehydrate the missing files.

Today I staged some new files and ran the following from my staging annex:

This is when I noticed some weirdness:

Basically, if two copies of the same content live in two different files that have an affinity to two or more mutually exclusive annexes, it seems like the rule that applies to the last file in the directory tree is arbitrarily going to be the one that wins out in the end. It also means if you have such a situation, you're going to see a strange dance like this every time you run git annex sync --content as the content moves across annexes, only to make its way back to where it started.

I'm currently running v6.2, so maybe this has been fixed in the interim. Has anybody else seen this? Do standard groups address this problem? I started out trying to use standard groups, but fell back on my own custom folder definitions when I couldn't figure out how to keep my standard groups from grabbing more content than I wanted them to.
Thanks!
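To sketch the kind of conflicting setup described above, with hypothetical repository names and preferred content expressions:

    # the big drive wants everything except stage/, the staging repo only stage/
    git annex wanted big "exclude=stage/*"
    git annex wanted stage "include=stage/*"

    # if stage/dupe.jpg and photos/dupe.jpg are the same content (same key),
    # the two expressions disagree about where that key belongs, so repeated
    # runs of git annex sync --content can shuffle it back and forth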
@dscheffy, https://git-annex.branchable.com/bugs/indeterminite_preferred_content_state_for_duplicated_file/
@joey Is this a bug or am I missing something?
Notes:
Flow 1

- git remote add test gcrypt::rsync://user@user.rsync.net:relative/path/to/repo
- git annex sync -> DOES NOT SYNC to test remote
- git clone gcrypt::rsync://user@user.rsync.net:relative/path/to/repo
- git push test git-annex master
- git clone gcrypt::rsync://user@user.rsync.net:relative/path/to/repo

Flow 2

- git remote add test gcrypt::rsync://user@user.rsync.net/full/path/to/repo
- git annex sync -> DOES SYNC to test remote
- git clone gcrypt::rsync://user@user.rsync.net:relative/path/to/repo
@talmukoydu you need to file a bug report and include things like the version of git-annex you are using: https://git-annex.branchable.com/bugs/
It would be useful to have a git annex sync --ff-only option. I have an alias for git pull --ff-only that I use most of the time, and it seems like a git annex counterpart would be reasonable. If only one of my local repo and the remote repo has changed, I'm happy to resolve things automatically. If both have changed, then I'm going to want to think about what to do -- maybe rebase locally, maybe something else. Of course, I can manually check before running git annex sync or use git pull --ff-only myself, but especially with several remotes, that could take some effort, and this is what we have computers for.

I guess there's a question of what to do when some remotes can be fast-forwarded to and others would need a merge. I think my ideal behavior is that if some updates can't be done without merge commits, it doesn't update any branches. But it'd also be fine to do as many updates as it can without any merges. Or do some prefix of the fast-forward updates, and then error out when it gets to the first merge. Whichever of these applies, of course it should display something if it can't handle things with fast-forwards exclusively.
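Until something like that exists, a rough manual approximation for a single hypothetical remote named origin might be the following; depending on the workflow, the branch to merge could be origin/master or origin/synced/master:

    # fetch without merging, then only merge if it is a fast-forward
    git fetch origin
    if git merge-base --is-ancestor master origin/synced/master; then
        git merge --ff-only origin/synced/master
    else
        echo "origin has diverged; resolve manually"
    fi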