The git annex sync
command provides an easy way to keep several
repositories in sync.
Often git is used in a centralized fashion, with a central bare repository that changes are pulled from and pushed to using normal git commands. That works fine if you don't mind having a central repository.
But it can be harder to use git in a fully decentralized fashion, with no central repository, and still keep repositories in sync with one another. You have to remember to pull from each remote, and merge the appropriate branch after pulling. It's difficult to push to a remote, since git does not allow pushes into the currently checked out branch.
git annex sync makes it easier using a scheme devised by Joachim Breitner. The idea is to have a branch synced/master (actually, synced/$currentbranch) that is never directly checked out, and serves as a drop-point for other repositories to use to push changes.
When you run git annex sync, it merges the synced/master branch into master, receiving anything that's been pushed to it. (If there is a conflict in this merge, automatic conflict resolution is used to resolve it.) Then it fetches from each remote, and merges in any changes that have been made to the remotes too. Finally, it updates synced/master to reflect the new state of master, and pushes it out to each of the remotes.
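To make that concrete, here is a rough sketch of what a sync of the master branch boils down to for a single hypothetical remote named laptop; the real command also handles multiple remotes, other branch names, and the git-annex branch:

    # merge whatever other repositories have pushed to our drop-point branch
    git merge synced/master

    # fetch the remote and merge in its changes as well
    git fetch laptop
    git merge laptop/synced/master

    # update the drop-point branch to the new state of master and push it out,
    # avoiding a push into the branch checked out on the remote
    git branch -f synced/master master
    git push laptop master:synced/master git-annex:synced/git-annex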
This way, changes propagate around between repositories as git annex sync is run on each of them. Not every repository needs to be able to talk to every other repository; as long as the graph of repositories is connected, and git annex sync is run from time to time on each, a given change, made anywhere, will eventually reach every other repository.
(git-annex sync will also attempt to push the master branch to remotes, which does work for bare repositories.)
The workflow for using git annex sync is simple:

- Make some changes to files in the repository, using git-annex, or anything else.
- Run git annex sync to save the changes.
- Next time you're working on a different clone of that repository, run git annex sync to update it.
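For example, assuming two hypothetical clones on a laptop and a desktop that each have the other configured as a git remote, the round trip looks like this:

    # on the laptop: record a change
    git annex add bigfile.iso
    git annex sync

    # later, on the desktop: pick up that change
    git annex sync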
Note that by default, git annex sync only synchronises the git repositories, but does not transfer the content of annexed files. If you want to fully synchronise two repositories' content, you can use git annex sync --content. You can also configure preferred content settings to make only some content be synced.
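For illustration, assuming a remote named usbdrive, content syncing and a simple preferred content expression might be set up like this (the expression is only an example):

    # transfer annexed file contents along with the git metadata
    git annex sync --content

    # only sync podcasts to the usb drive
    git annex wanted usbdrive "include=podcasts/*"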
See git-annex-sync for the command's man page.
I came upon git-annex a few months ago. I saw immediately how it could help with some frustrations I've been having. One in particular is keeping my vimrc in sync across multiple locations and platforms. I finally took the time to give it a try after hitting my boiling point this morning. I went through the walkthrough and now I have an annex everywhere I need it.

git annex sync and my vimrc is up-to-date, simply grand! Thanks so much for making git-annex. -- Daniel Wozniak
git annex copy --to bareremote. You could run that in cron. Or, the assistant can be run as a daemon, and automatically syncs git-annex data.
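For example, a crontab entry along these lines, with a placeholder repository path and remote name, would push new content to the bare remote once an hour:

    # m h dom mon dow  command
    0 * * * *  cd /path/to/repo && git annex copy --to bareremote --quiet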
By default, git annex sync will sync to all remotes, unless you specify a remote. So, I have to specify, e.g., git annex sync origin. I can simplify this with aliases, I suppose, but I do a lot of teaching non-programmer scientists... so it'd be nice to be able to configure this (so beginning users don't have to keep track of as many things). Is there (or will there be) a way to do this?
Just in case you haven't considered such a scenario - maybe you have suggestions for how to collaborate more effectively with git annex (and avoid warning messages):
I'm trying to teach beginning scientist programmers (mostly graduate students), and a common scenario is to fork some scientific code. I'd like forking on github to be mundane, and not trigger warnings, and generally have as little for folks to explicitly keep track of as possible (this seems to be a common concern we share, which leads you to prefer syncing to all remotes without the option to configure the default behavior!).
However, I am currently working with students on forking and fixing up scientific code where the upstream maintainer doesn't want to allow pushes upstream, except via pull request. So, part of our approach is to set up some common shared datasets in git annex (and these just end up in our fork). If we have an "upstream" remote, git annex will try to sync with it, and report an error.
So - that's why I'd like to be able to configure the deactivation of syncing to a defined remote (e.g., "upstream"). However, if you have other suggestions to smooth the workflow, I would also like to hear those!
@Dav what kind of URL does the upstream remote have? Perhaps it would be sufficient to make sync skip trying to push to git:// and http[s]:// remotes. Both are unlikely to accept pushes, and in the cases where they do accept pushes it would be fine to need a manual git push.

Anyway, you can already configure which remotes get synced with. From the man page:

So: git config remote.upstream.annex-sync false
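In context, assuming the remote really is named upstream as in the scenario above, that amounts to the following; a plain sync then skips upstream, while naming it explicitly still syncs it:

    # never sync with the upstream fork by default
    git config remote.upstream.annex-sync false

    git annex sync            # skips upstream
    git annex sync upstream   # still syncs it when asked explicitly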
I noticed that in a test with 2 local repositories and around 2'000 files "git annex sync" is still very fast, but "git annex sync --content" takes multiple seconds. Is this avoidable?
I have a central repo and client repos. I want to copy all content to the central repo after a commit. Right now, I use "git annex group central backup", "git annex wanted central standard", and a hook that triggers "git annex sync --content" after each commit. Maybe there is a more efficient way to do this? Thanks for sharing thoughts.
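For reference, a setup along those lines (the remote name central is taken from the comment above; the hook goes in the usual git hooks directory) could look like this:

    # make the central repo want everything, using the backup standard group
    git annex group central backup
    git annex wanted central standard

    # push content out after every commit
    cat > .git/hooks/post-commit <<'EOF'
    #!/bin/sh
    git annex sync --content
    EOF
    chmod +x .git/hooks/post-commit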
I too feel that syncing all remotes by default is the right thing to do, but I think it should be limited to the 'master' and 'git-annex' branches. I often create branches that I want to keep local and do not want them to be synced. But I want the 'master' and 'git-annex' branches to be synced with all remotes.

So it would be nice to be able to set an option to sync all branches or just 'master' and 'git-annex', or to be able to ignore some branches during git annex sync.
Shri
I agree with mshri. It’s confusing to have every local branch wind up on every remote (and it hinders «git annex unused»).
I tried working around this by just including relevant branches in the «fetch» refspec, but this will only work until another remote pushes the branches again.
"git annex sync pushes all branches" -- it does not. It pushes only the git-annex branch and the currently checked out branch.

git annex sync --content has to check each file to see if any other repository wants it. This is necessarily going to get slow when there are a lot of files. The assistant does a similar syncing but uses some tricks to avoid scanning all the files too often, while still managing to keep them all in sync -- it can do this since it's a long-running daemon and is aware when files have changed.

"git annex sync ... fetches from each remote"
Well, I have two git annex-ed repositories where "git remote -v" properly lists the other repo, and "git annex sync foo" manages to pull from foo, but "git annex sync" without a remote name simply does a local sync. Also, neither command pushes anything anywhere.
So, where does "git annex" get its list of remotes from? What could prevent it from accessing them?
If a remote has "remote.<name>.annex-sync" set to false in the git config, git-annex sync will skip that remote unless you specify the name. That's probably what's going on in your case.

My way of working with git-annex doesn't seem to mesh well with the Assistant or even with git annex sync. I seem to have a bit of a control need when it comes to what gets committed when. But here's my workflow approximating what it does, with a twist. I have this in git config on mylaptop:

I don't need a synced/git-annex. If upstream is not up-to-date I fetch and merge. In this case upstream happens to be a bare git repo, so I don't need synced/master either. If upstream is non-bare, I use synced/master -- or sometimes I keep upstream usually checked out on an orphan branch and just switch into master to check things and then switch away to avoid conflict. If I can avoid it, I prefer not to have several branches where I don't know which one is the latest one.

But here's the twist, look at this row:

If I just do git push, close the lid and run into the forest, it may or may not have a non-fastforward event on master and git-annex ... but it always succeeds in pushing to the mylaptop remote on my server.

If I have added a batch of files, I usually push first to all my remotes, to get that precious metadata up there. At that point I don't care if there's a conflict upstream. Then I git annex copy to wherever, fetch all remotes, git annex merge, maybe merge master if I have to (usually not), then push to all remotes again. It's less of a bother than it sounds like. I don't even have any handy aliases for this, I prefer to just get the for loop from my command-line history.

1) When I have a branch "some/branch/name" containing slashes in its name, git-annex sync strips everything up to the last slash and creates "synced/name", which may clash with "some/other/name". Is there a workaround?
2) Could the "don't use synced branches" behavior referred to in the comments above somehow be configured on the repository side, so that everyone cloning it doesn't need to configure it for themselves?
@kartynnik, that's a bug: sync uses conflicting names for deep branches

Please file bugs there and not as comments here; it's too easy to lose track of a comment deep in a thread.
Hi,
how does "git-annex sync --content" transfers its file to a (regular) ssh-remote? I think it uses rsync.. Is that correct?
I want to use compression for the file transfers. Therefore, I tried in .git/config to set:
However, it seems that this crashes the upload. The sync just seems to hang.. Is it possible to use compression for the transfer? How?
@mario, great question! (Not the best place for such a question; start a thread on the forum next time.)
git-annex does use rsync when transferring files between ssh remotes. Rsync normally goes over ssh, and it might be better to enable compression at the ssh level. For example, I have "Compression yes" in ~/.ssh/config.
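For instance, a host entry like the following (the host name is a placeholder) enables compression for all ssh traffic to that server, including the transfers git-annex runs over ssh:

    # ~/.ssh/config
    Host myserver.example.com
        Compression yes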
I think that the reason your annex-rsync-upload-options setting broke it is that rsync needs --compress to be passed on to the other rsync process (in the remote repository), and that is run via git-annex-shell, which has a whitelist of options it will pass to rsync. Passing arbitrary options to rsync could allow unwanted behavior when git-annex-shell is being used as a security barrier. And --compress is one of the options that both the rsync sender and receiver have to agree on for the rsync protocol to work.
I have added a note to the man page about this limitation of what the rsync-options settings can be used to do.
I've finally taken the time to learn git-annex and am extraordinarily impressed by its usefulness and documentation.
I'm currently using git-annex as part of a scientific workflow, wherein I use git to track my analysis source code and LaTeX reports, and git-annex to handle large binary files (typically input data).
git annex sync is really handy for making sure my git-annex branch propagates between my remotes, and it's hard to beat the usefulness of git annex sync --content now that I've wrapped my head around standard groups. However, I'd prefer if there was a flag (or configurable option) to suppress git annex sync from pushing/pulling whatever branch currently happens to be checked out. I'm a pretty thoughtful committer and want more control over where my code branches (e.g., master) get pushed around. I saw the --no-pull and --no-push options for git annex sync, but it seems that this suppresses all push/pull behavior, and thus git annex sync --no-push --no-pull will not sync up my special git-annex branch. Is there an option or workflow that accomplishes what I'm looking for?

TL;DR: I want a way to tell git annex sync to leave my master branch (or whatever the currently checked out branch is) alone (no pushing/pulling), but otherwise behave normally (e.g., git annex sync will just push/pull my special git-annex branch around, or git annex sync --content will push/pull the special git-annex branch, and also move content around as it makes sense). Apologies if this is already possible, but I haven't been able to figure it out.

@Dan, there's an open todo about that: http://git-annex.branchable.com/todo/sync_--branches__to_sync_only_specified_branches___40__e.g._git-annex__41__/
Please follow up there if the suggested new option would work for you.
I'm currently cleaning up 3 machines (with the goal of eventually upgrading my OS's) and 2 large external drives filled with 10 plus years of backups, so my current situation is somewhat temporary and may not apply to others.
I've started using preferred content to manage which repos hang onto which content. My main cleanup workflow involves moving files into a staging repository and then adding them to the annex -- then letting the preferred content settings figure out where to send the content. If I know exactly where I want the content to go, I'll move it directly into the appropriate folder, but if I haven't figured that out yet, sometimes I'll just put it in a stage folder. I've simplified my preferred content settings to assume that I only have one big external drive where everything except the contents of the stage directory should go, but in reality it's split up a bit across the two drives I already mentioned...

I noticed the other day that I had some missing content in big/photo/raw, so I went into that folder and ran git annex get . to rehydrate the missing files.

Today I staged some new files and ran the following from my staging annex:

This is when I noticed some weirdness:

Basically, if two copies of the same content live in two different files that have an affinity to two or more mutually exclusive annexes, it seems like the rule that applies to the last file in the directory tree is arbitrarily going to be the one that wins out in the end. It also means if you have such a situation, you're going to see a strange dance like this every time you run git annex sync --content as the content moves across annexes, only to make its way back to where it started.

I'm currently running v6.2, so maybe this has been fixed in the interim. Has anybody else seen this? Do standard groups address this problem? I started out trying to use standard groups, but fell back on my own custom folder definitions when I couldn't figure out how to keep my standard groups from grabbing more content than I wanted them to.
Thanks!
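To sketch the kind of conflicting setup described above, with hypothetical repository names and preferred content expressions:

    # the big drive wants everything except stage/, the staging repo only stage/
    git annex wanted big "exclude=stage/*"
    git annex wanted stage "include=stage/*"

    # if stage/dupe.jpg and photos/dupe.jpg are the same content (same key),
    # the two expressions disagree about where that key belongs, so repeated
    # runs of git annex sync --content can shuffle it back and forth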
@dscheffy, https://git-annex.branchable.com/bugs/indeterminite_preferred_content_state_for_duplicated_file/
@joey Is this a bug or am I missing something?
Notes:
Flow 1

- git remote add test gcrypt::rsync://user@user.rsync.net:relative/path/to/repo
- git annex sync -> DOES NOT SYNC to test remote
- git clone gcrypt::rsync://user@user.rsync.net:relative/path/to/repo
- git push test git-annex master
- git clone gcrypt::rsync://user@user.rsync.net:relative/path/to/repo

Flow 2

- git remote add test gcrypt::rsync://user@user.rsync.net/full/path/to/repo
- git annex sync -> DOES SYNC to test remote
- git clone gcrypt::rsync://user@user.rsync.net:relative/path/to/repo
@talmukoydu you need to file a bug report and include things like the version of git-annex you are using: https://git-annex.branchable.com/bugs/
It would be useful to have a git annex sync --ff-only option. I have an alias for git pull --ff-only that I use most of the time, and it seems like a git annex counterpart would be reasonable. If only one of my local repo and the remote repo has changed, I'm happy to resolve things automatically. If both have changed, then I'm going to want to think about what to do -- maybe rebase locally, maybe something else. Of course, I can manually check before running git annex sync or use git pull --ff-only myself, but especially with several remotes, that could take some effort, and this is what we have computers for.

I guess there's a question of what to do when some remotes can be fast-forwarded to and others would need a merge. I think my ideal behavior is that if some updates can't be done without merge commits, it doesn't update any branches. But it'd also be fine to do as many updates as it can without any merges. Or do some prefix of the fast-forward updates, and then error out when it gets to the first merge. Whichever of these applies, of course it should display something if it can't handle things with fast-forwards exclusively.
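Until something like that exists, a rough manual approximation for a single hypothetical remote named origin might be the following; depending on the workflow, the branch to merge could be origin/master or origin/synced/master:

    # fetch without merging, then only merge if it is a fast-forward
    git fetch origin
    if git merge-base --is-ancestor master origin/synced/master; then
        git merge --ff-only origin/synced/master
    else
        echo "origin has diverged; resolve manually"
    fi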