sync --branches to sync only specified branches (e.g. git-annex)

As we briefly discussed via email, it would be nice if sync could sync only some branches (e.g. git-annex) not all at once.

fixed --Joey


comment 1

As I suggested in email, this could be something like git annex sync --branch B and --branch could be repeated to add other branches, with the default being to sync them all. git annex sync --branch git-annex would need to be special cased I think.

Comment by joey — Mon Aug 8 15:40:55 2016


comment 2

This is quite old, is it still wanted?

git's remote.name.fetch config can make it only fetch a particular branch, so that's one way to do this without adding an option.
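As a rough illustration of the remote.name.fetch approach, the sketch below narrows one remote's fetch refspec to the git-annex branch. The remote name origin and the example.com URL are placeholders, and a scratch repository is used so nothing real is modified. Note that the refspec also governs plain git fetch for that remote.

```shell
# Scratch repository so the sketch can be run without touching real work.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git remote add origin https://example.com/project.git   # hypothetical URL

# "git remote add" installs the catch-all refspec
#   +refs/heads/*:refs/remotes/origin/*
# Replace it with one that covers only the git-annex branch.
git config remote.origin.fetch '+refs/heads/git-annex:refs/remotes/origin/git-annex'

# Show the refspec(s) now in effect for this remote.
git config --get-all remote.origin.fetch
```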

Comment by joey — Mon Aug 6 15:55:56 2018


Still Wanted

Thanks for directing me here from our conversation on the git-sync page.

Re your first comment here: this sounds like the behavior I'm looking for, in particular for git annex sync --branch git-annex to be a special case (ideally configurable so I don't have to type it all the time). I'm not sure if it should be handled by git's config or tracked in the git-annex branch, but probably the former (since otherwise the behavior could change pre and post sync, which might be surprising).

I'm trying to understand your second comment. I'm not 100% clear on what git annex sync actually does under the hood; if it's pushing and pulling, then presumably I'd need to tweak the git config entries for remote.name.{push,pull}, no? OTOH, with the synced/ branch workflow, perhaps it's fetching, then merging, then pushing, and that's why I should tweak the fetch settings? This would need to get set for all of my remotes too, right?

Assuming I'm understanding correctly, the downside to this approach is that it would also change the way base git works. Really, I'm looking for a workflow where git annex is (insofar as possible) narrowly responsible for managing the annex, and base git is responsible for everything it normally is. This would let me minimally modify my git-habits and just run git annex add and git annex sync --content from time to time to make sure that things are propagating in a sensible way based on how I've configured wanted/preferences. In particular, I'd like git fetch to still behave in the way that it used to, I just don't want git annex sync pulling and pushing around my code.

Comment by Dan — Mon Jul 22 17:06:20 2019


git annex sync commits changes, too?

I took a closer look at the man page for git-annex-sync.

My understanding is that running it without any options will

1. commit any changes (presumably something like commit -am 'some default message?')
2. fetch and merge synced/$(currentbranch) and git-annex branches (presumably from all remotes unless o/w configured)
3. push the branches from (2) back to the remotes

I'm realizing that what I really want is a git-annex-sync command that will just sync git-annex content. The workflow I'd like is

1. fetch and merge git-annex branch from all remotes (or as specified by arguments)
2. push git-annex branch back to remotes from (1)
3. If content related flags in the style of arguments from git-annex-sync (e.g., --content) are passed, sync them around.

So the command I think I'm looking for is something like git annex sync --no-commit --branch git-annex, but since it's still in todo, I assume the --branch git-annex behavior is not yet implemented. Moreover, it seems I can configure the --no-commit option to be the default by setting the annex.autocommit option to false (and this setting is handled by the git-annex branch, with options for local override via local gitconfig). If --branch gets implemented, I'd love for there to be a similar config-level option like annex.synccurrentbranch, with true giving the current (default) behavior wherein the current branch is fetched, merged, and pushed, and false giving an alternate behavior akin to the hypothetical git annex sync --branch git-annex.

Of course I think both features (adding --branch option to git-annex-sync; supporting annex.synccurrentbranch config) would be useful, but really just the latter would be enough for what I'm looking to do.

Thanks for your consideration and for such an excellent tool; it's really been a gamechanger for me.

Comment by Dan — Mon Jul 22 17:25:39 2019


Still wanted (update with example)

I see this page was recently edited (when todos were tagged), so I wanted to chime in that this is still a feature I'm looking for, and I have a much less hypothetical use case for it.

I'm a PhD student working on a research project where I supervise several undergraduates. We have a git repository that manages all of our code, and I let git-annex manage the large datafiles (also in the same repository) on which we run our code. The main repository is hosted on GitHub, and my students have read-only access to it. They've each made forks to which they have write access. We use a special remote that we all have write-access to, with wanted set to standard and group set to archive, so that it gets all of the content and distributes it as needed (the data is massive so git-annex is vital here since the student laptops can't realistically download it all at any one time).

They use pull requests to the main GitHub repository to integrate their code changes, but we need a way to get the content of the git-annex branches in their forks (which are pushed to from their local repos) into the git-annex branch in the main GitHub repository. The natural solution seems to be for me (who has read/write access to the main repo and the forks) to do this, essentially pulling in git-annex branches from their forks to my local repo, and then pushing it to the main repo on GitHub. It'd also be nice if I could then push this back to all of their forks, too. I can do this manually, but I think I'd need to actually check out the git-annex branch (or stuff it in another worktree) and then do lots of work manually (or automate it in a script).

First I tried git annex sync --no-commit --no-push --no-pull which (somewhat to my surprise) did pull the git-annex branches from their forks into my local repo, but didn't push git-annex back anywhere, and it neither pushed nor pulled master. So this was a good start, but I wanted to also push only the git-annex branch to the main repo (and ideally to their forks, too). So then I (foolishly) started dropping flags, and ended up inadvertently pulling their work-in-progress master branches into the mainline and pushing this super-merged thing back to all of them. I was able to do some resetting and quick force-pushes before anyone noticed, but I should've known better.

Throughout this process I'm trying to teach them how to use git-annex (it's pretty clearly the right tool for the job but need to be really careful with what git annex sync commands I encourage them to run since I don't want the,

I'd love it if there was like a --git-annex-branch-only option that I could pass that would then do all the pushing/pulling goodness of git annex sync but without touching master (or whatever branch happens to be checked out). I could then teach the students to always use this flag to avoid actually introducing changes to their master branch (they're still learning git, too, so they'd have a hard time recovering from this). Even better if this was configurable, and something I could stick in the git-annex-config options so that when they clone the repo this setting would propagate to them along with the git-annex branch.

Is something like this in the pipeline? Also, is there a simpler workaround I can do for now that doesn't involve tons of (manual) merges and pushes?

Thanks so much for such an excellent tool; if we didn't have this, we'd essentially just give up on version control for our scientific data, which would be a real bummer.

Comment by Dan — Thu Feb 13 20:08:45 2020


Re: Still wanted (update with example)

@Dan, thanks for explaining your use case.

In particular, I see why you don't want to pull their master branches with the unfinished whatever, but do want to pull their git-annex branch, and probably fetch their feature branches too.

I'm still unclear on why, after merging someone's feature branch into your branch (master I suppose), you would not want sync to push that updated branch back to origin? Is the issue not about pushing master to origin, but that you don't want it to push master to their forks? But if their forks contain other changes in their master branches, it would not overwrite the changes.

It does seem like setting remote.name.fetch would work in your use case, but I also understand why you might not want to use it -- refspecs are hard! -- and when you're dealing with feature branches that might be named anything, it's hard to write a refspec that does what you want, other than one that fetches everything and merges nothing.

So I do see the appeal of a git-annex sync --only-annex that separates concerns, letting you use whatever git commands you normally would to commit and pull and push everything, except for the git-annex branch.

And, that name implies it also syncs the annexed content, so no need to remember to use --content with it. (I want --content to be sync default, but there are backcompat issues with that so annex.synccontent is only an option.)

Soo, I'm leaning toward adding that option and not some other --branches option that lists branches to sync or whatever.

And, since git-annex config can set repo-wide annex.synccontent and annex.autocommit that change the behavior of git-annex sync, it could make sense to also have a setting that enables --only-annex by default. I don't know if I'd encourage setting that in your repo though; it might teach the students a non-standard git-annex behavior. Re that, it would be helpful if you could finish this interrupted thought of yours:

Throughout this process I'm trying to teach them how to use git-annex (it's pretty clearly the right tool for the job but need to be really careful with what git annex sync commands I encourage them to run since I don't want the,

Because I'm not yet seeing how any use of git-annex sync by the students could be problematic; it won't be able to push their master branch to your repo or anything.

Comment by joey — Mon Feb 17 17:15:10 2020


comment 7

Implemented --only-annex.

I'm going to close this todo, but do follow up if that does not adequately cover your use case.

Comment by joey — Mon Feb 17 19:07:47 2020


An overdue and overlong reply

It looks like this functionality was implemented before I could get my comment written, but I thought it might be useful to post it anyway. It seems like the implementing changes are now in master, so if I build from source I'll get these new features, right? I assume they'll also make it into the next release of git-annex (at which point I'll version bump at homebrew, which is what I'm having my students use to install git-annex).

Thanks for your thoughtful response. I also agree that having an --only-annex option is perfectly satisfactory and more nuanced --branch-to-sync options are probably overkill. As to whether --only-annex should imply --content, I'm more agnostic and defer to your wisdom. However, if I call git-annex with --only-annex --no-content, will it push/pull the git-annex branches and leave the content alone? From looking at your commit message, it sounds like there is now a --not-only-annex option which can override a configured only-annex property, but it's not clear how --no-content might enter the picture.

Let me try to finish my dangling thought from the last comment thread. For clarity, I'll introduce some labels for repositories and assume the only people working on this project are me (Dan) and two students, Alice and Bob. Let Dan-local refer to the repository on my laptop (and similar for {Alice,Bob}-local), let Dan-github refer to my repo on GitHub, and {Alice,Bob}-github refer to my students' forks. Dan has push access to {Dan,Alice,Bob}-github. Alice and Bob can fetch from {Dan,Alice,Bob}-github, but can only push to their own github repositories (Alice can push to Alice-github, Bob to Bob-github).

Without an --only-annex option, I have two primary concerns. The first is the thought I left dangling, which I'll now complete: Throughout this process I'm trying to teach them how to use git-annex (it's pretty clearly the right tool for the job) but need to be really careful with what git annex sync commands I encourage them to run since I don't want them to inadvertently pull changes into their local branches (especially integrating changes from one another) and then wind up being confused as to how things got there. Like many newcomers to git, they're still at the rote learning stage where they are memorizing commands to type and are still developing a mental model of what's happening when they fetch/pull/push. For this reason, I think that avoiding their local branches changing as a side-effect of a git-annex command (i.e., by specifying this option in the config that travels in git-annex branch) will make it easier for them to understand base git. There's some risk that they'll learn bad git-annex habits from this and be surprised at all the things git annex sync does when they use it elsewhere, but for now it seems easier to help them understand git but use git-annex mechanically, and once they're comfortable with that I can help them to understand what git-annex is actually doing and the nuances of git annex sync.

The second problem is that because I'm the only one with push access to Dan-GitHub, everything has to get there either via a pull request (which I can accept after review) or I need to push it there myself via Dan-local. In particular, to keep the git-annex branch in Dan-GitHub up to date, I need to be integrating {Alice,Bob}-github/git-annex (or perhaps synced/git-annex?) into Dan-local/git-annex and then push it to Dan-GitHub/git-annex. I can do this manually, but it's a lot of typing (especially if there are many more students than just Alice and Bob), so git annex sync seems like a nice way to accomplish this. However, it has the side effect of also pulling in {Alice,Bob}-GitHub/{synced/master,master} and then pushing that up to Dan-GitHub/synced/master, and if Alice and Bob are also running git annex sync, changes from Alice will show up in Bob-local/master and vice versa. Moreover, if they're also pushing e.g. Alice-local/master -> Alice-GitHub/master, their pull requests will suddenly get very noisy as they'll incorporate more than just their own changes, and for them to remedy this it will require careful use of git reset which is a dangerous command for them to run at this stage of their learning.

Git that I am, after running git annex sync I saw that my Dan-local/master was now ahead of Dan-GitHub/master, and I foolishly pushed, which now plopped half-baked code from Alice and Bob into the primary branch of our primary repository on github. It also had the unfortunate side-effect of closing out open pull requests from Alice and Bob (since github saw that their changes were now reachable from Dan-Github/master). I did some reset-ing, git annex sync --cleanup, and some force pushes to clean everything up before Alice or Bob could fetch, so other than having to re-open their pull requests, this didn't screw them up too much.

Finally, I want to clarify my understanding of the synced/branch workflow, which seems clever but I never fully understood it. From some simple experimenting (I have not waded very far into the source code), it seems that if I just run git annex sync (with no flags and assuming I haven't configured anything to do otherwise), and assuming that BRANCH is checked out locally, it will do the following, I think:

1. Stage and commit any changes in tracked files
2. Merge synced/BRANCH into BRANCH
3. Loop over remotes, for each
   1. Pull from the remote (seems like it just fetches all branches)
   2. Merge REMOTE/BRANCH into BRANCH
   3. If REMOTE/synced/BRANCH exists, merge it into BRANCH
4. Do octopus merge of all REMOTE/git-annex and REMOTE/synced/git-annex branches into local git-annex branch
5. Loop over remotes again, for each
   1. Push git-annex -> REMOTE/synced/git-annex
   2. Push git-annex -> REMOTE/git-annex
   3. Push BRANCH -> REMOTE/synced/BRANCH
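
For illustration, the BRANCH part of the list above can be walked through by hand with plain git against a single remote. This is a sketch of the steps as described in this comment, not of git-annex's actual implementation. The scratch repositories, the names alice and dan, and the identify helper are all made up for the example, and the git-annex branch is left out because git-annex maintains that branch itself.

```shell
work=$(mktemp -d)

# A bare repository stands in for REMOTE.
git init -q --bare "$work/remote.git"
git -C "$work/remote.git" symbolic-ref HEAD refs/heads/master

# Hypothetical helper: give a clone a commit identity.
identify() {
    git -C "$1" config user.name "$2"
    git -C "$1" config user.email "$2@example.com"
}

# Alice creates the first commit and publishes it.
git clone -q "$work/remote.git" "$work/alice" 2>/dev/null
git -C "$work/alice" symbolic-ref HEAD refs/heads/master
identify "$work/alice" alice
cd "$work/alice"
echo "v1" > notes.txt
git add notes.txt
git commit -q -m "initial commit"
git push -q origin master

# Dan clones. Alice then publishes more work to both master and
# synced/master, which is where sync is described as pushing BRANCH.
git clone -q "$work/remote.git" "$work/dan"
identify "$work/dan" dan
echo "v2" >> notes.txt
git commit -q -a -m "alice: update notes"
git push -q origin master master:refs/heads/synced/master

# Dan performs the pull and push halves by hand.
cd "$work/dan"
git fetch -q origin                   # fetch all branches from the remote
git merge -q origin/master            # merge REMOTE/BRANCH into BRANCH
git merge -q origin/synced/master     # merge REMOTE/synced/BRANCH into BRANCH
git push -q origin master:refs/heads/synced/master   # push BRANCH to REMOTE/synced/BRANCH
```

Both merges here are fast-forwards (the second finds nothing new), so Dan's master ends up at Alice's latest commit.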

I'm a little confused by what the synced/git-annex branches are for, but I suppose they're even less likely to ever be checked out than git-annex and provide a safeguard. I think they will be included in the octopus merge described above.

Step 3.2 (merge REMOTE/BRANCH into BRANCH) was a surprise to me based on my reading of the git annex sync documentation, since I expected it to integrate changes only from REMOTE/synced/BRANCH.

It seems like neither the sync documentation on branchable nor what is obtained with man git-annex-sync enumerates all of these steps, although reading them together gives an almost complete picture of what is going on. Since the documentation suggests the end user can just run these steps manually as an alternative to using git annex sync, it seems like it'd be helpful to very concretely document what those steps are. I'd be happy to take a crack at updating the documentation to be more thorough, but wanted to make sure I actually understand what is going on before doing so.

Again, I want to re-articulate how much I enjoy git annex and how difficult it would be to do any sort of version control for our data without it. I deeply appreciate the time and energy that you put into this very valuable and useful tool.

Comment by Dan — Mon Feb 17 22:59:19 2020


comment 9

Indeed, I had missed the case of --no-content combined with --only-annex. Now implemented.
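
A minimal sketch of how the two options combine, wrapped in a small shell function. The helper name annex_only_sync and the remote name in the usage notes are made up for this example; --only-annex and --no-content are the options discussed in this thread, and the function must be called from inside a git-annex repository.

```shell
# Hypothetical helper: sync the git-annex branch (and, by default, annexed
# content) without pulling, pushing, or committing other branches.
annex_only_sync() {
    git annex sync --only-annex "$@"
}

# Usage, from inside a git-annex repository:
#   annex_only_sync                  # git-annex branch plus annexed content
#   annex_only_sync --no-content     # git-annex branch only
#   annex_only_sync origin           # limit the sync to a remote named origin
```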

It will be in the next release, which has slipped one day due to the above. ;-)

I've improved the documentation of synced/ branches on the git-annex-sync man page, although users normally should not need to concern themselves with them.

I see where the man page confused you about REMOTE/synced/BRANCH, that was some particularly poor wording and is fixed.

The difficulty with documenting what git-annex sync does in extreme detail is that there are quite a lot of little hacks like synced branches that most users just don't need to know about, but help users in particular situations (who also generally don't know about or even notice it either).

Just for example, sync sometimes pulls from the same remote twice. Why a second pull? Well, it knows it has spent a long time at the --content step, and so pulling again before it pushes makes it much less likely that the push will fail due to some other change having been made on the remote in the meantime. If a user were manually pulling and pushing, they would most likely pull again if their push failed due to such a situation, so there's not much point documenting what sync does (which could also change if I find a better approach).
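
The manual fallback mentioned above (push, and if the push is rejected, pull again and retry) might look like the following with plain git. This is only a sketch of what a user could do by hand, not of what sync does internally. The scratch repositories and the clone names a and b are invented for the example.

```shell
work=$(mktemp -d)
git init -q --bare "$work/remote.git"
git -C "$work/remote.git" symbolic-ref HEAD refs/heads/master

# Clone "a" seeds the remote with one commit.
git clone -q "$work/remote.git" "$work/a" 2>/dev/null
cd "$work/a"
git symbolic-ref HEAD refs/heads/master
git config user.name a
git config user.email a@example.com
echo one > a.txt
git add a.txt
git commit -q -m "a: first"
git push -q origin master

# Clone "b" starts from that state.
git clone -q "$work/remote.git" "$work/b"
git -C "$work/b" config user.name b
git -C "$work/b" config user.email b@example.com

# While "b" is busy (say, transferring content), "a" pushes again.
echo two >> a.txt
git commit -q -a -m "a: second"
git push -q origin master

# "b" commits locally. Its first push is rejected as a non-fast-forward,
# so it pulls again and retries the push.
cd "$work/b"
echo local > b.txt
git add b.txt
git commit -q -m "b: local work"
if ! git push -q origin master 2>/dev/null; then
    git pull -q --no-rebase --no-edit origin master
    git push -q origin master
fi
```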

So I prefer to keep the description of sync as high level as possible.

Comment by joey — Tue Feb 18 16:23:28 2020


thanks

"--no-content combined with --only-annex. Now implemented" -- thanks a lot, I was also looking for that.

Comment by Ilya_Shlyakhter — Tue Feb 18 20:00:11 2020


Interaction with config annex.synccontent

So far, the new --only-annex option (and the related annex.synconlyannex config setting, set to true) is working beautifully for me, thanks!
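
For reference, a sketch of the setting mentioned above, applied as a local git config value in a scratch repository. The repo-wide form is shown only as a comment because it needs an initialized git-annex repository.

```shell
# Scratch repository so the sketch can be run without touching real work.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Local setting for this clone only: make sync behave as if --only-annex
# had been passed.
git config annex.synconlyannex true
git config --get annex.synconlyannex

# Repo-wide default, stored in the git-annex branch so other clones pick it
# up (not run here; requires "git annex init" first):
#   git annex config --set annex.synconlyannex true
# A single run can still opt out of the configured default:
#   git annex sync --not-only-annex
```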

As described previously, --only-annex now implies --content, but this can be overridden with something like git annex sync --only-annex --no-content (and this works successfully for me). I wanted to configure it so it will always not sync content. For example, when on a new clone, running git annex sync --only-annex without first configuring wanted/group assignments/etc can result in a lot of files moving around on a big repository, which I'd rather not interrupt brusquely with C-c.

I thought that perhaps setting annex.synccontent to false might achieve this, but from some experimenting it doesn't (doing git annex config --set annex.synccontent false and then git annex sync --only-annex will still sync content). This isn't a huge surprise, since it's not a documented feature for git annex config, which only describes what setting annex.synccontent to true will accomplish (and doesn't mention what setting it to false would do). However, it seems like it would be in the spirit of letting users override the --only-annex implies --content option, if desired.

Apologies if this is better to discuss over at the git annex config page, or if @joeyh doesn't want to implement it at all, but just wanted to suggest it.

Comment by Dan — Mon Feb 24 21:34:50 2020


comment 12

Generally, command-line options should override other configuration, so --only-annex seems like it should enable content syncing even if annex.synccontent=false were a documented setting, the same as --content should.

I think a case could be made for annex.synccontent=false + annex.synconlyannex=true not syncing content.

I do wonder though if that's the best approach. It kind of seems like what you really want is a way to configure a default preferred content expression for clones that do not have a specific one configured. And that seems more broadly useful than fine-tuning the interaction of two git-annex sync configurations.