pure git-annex only workflow

I’m using git annex to manage my movie collection on various devices – my laptop, an NSLU tucked away somewhere with lots of space, some external hard drives. For this use case, I do not need the full power of git as a version control system, so having to run "git commit" and come up with commit messages is annoying. Also, the following makes sense for a version control system, but not for my media collection:

$ git annex add Hot\ Fuzz\ -\ English.mkv 
add Hot Fuzz - English.mkv (checksum...) ok
(Recording state in git...)
$ git commit -m 'another movie added'
[master 851dc8a] another movie added
 1 files changed, 1 insertions(+), 0 deletions(-)
 create mode 120000 00 Noch nicht gesehen/Hot Fuzz - English.mkv
$ git push jeff
Counting objects: 38, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (20/20), done.
Writing objects: 100% (26/26), 2.00 KiB, done.
Total 26 (delta 11), reused 0 (delta 0)
remote: error: refusing to update checked out branch: refs/heads/master
remote: error: By default, updating the current branch in a non-bare repository
remote: error: is denied, because it will make the index and work tree inconsistent
remote: error: with what you pushed, and will require 'git reset --hard' to match
remote: error: the work tree to HEAD.
remote: error: 
remote: error: You can set 'receive.denyCurrentBranch' configuration variable to
remote: error: 'ignore' or 'warn' in the remote repository to allow pushing into
remote: error: its current branch; however, this is not recommended unless you
remote: error: arranged to update its work tree to match what you pushed in some
remote: error: other way.
remote: error: 
remote: error: To squelch this message and still keep the default behaviour, set
remote: error: 'receive.denyCurrentBranch' configuration variable to 'refuse'.
To jeff:/mnt/media/Movies
 ! [rejected]        git-annex -> git-annex (non-fast-forward)
 ! [remote rejected] master -> master (branch is currently checked out)
error: failed to push some refs to 'jeff:/mnt/media/Movies'
To prevent you from losing history, non-fast-forward updates were rejected
Merge the remote changes (e.g. 'git pull') before pushing again.  See the
'Note about fast-forwards' section of 'git push --help' for details.

It seems that to successfully make the new files known to the other side, I have to log into jeff and pull from my current machine.

What I would like to have is that

git annex add does not require a commit afterwards.
Changes to the files are automatically picked up with the next git-annex call (similar to how etckeeper works).
Commands "git annex push" and "git annex pull" that will sync the metadata (i.e. the list of files) in both directions without further manual intervention, at least not until the two repositories have diverged in a way that cannot be merged sensibly.

Summary: git-annex is great. git is not always. Please make it possible to use git annex without having to use git.


comment 1

First, you need a bare git repository that you can push to, and pull from. This simplifies most git workflows.
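A minimal sketch of why a bare repository helps, using temporary directories in place of real machines. The remote name "hub" and all paths are made up for illustration; a bare repository has no checked-out work tree, so a push to master is not refused.

```shell
# Sketch only: temporary directories stand in for real machines.
tmp=$(mktemp -d)

# A bare repository has no work tree, so any branch can be pushed to.
git init -q --bare "$tmp/hub.git"

# An ordinary repository with one commit on master.
git init -q "$tmp/laptop"
git -C "$tmp/laptop" symbolic-ref HEAD refs/heads/master
git -C "$tmp/laptop" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'another movie added'

# "hub" is a hypothetical remote name.
git -C "$tmp/laptop" remote add hub "$tmp/hub.git"
git -C "$tmp/laptop" push -q hub master

# The commit is now visible in the bare repository.
hub_subject=$(git --git-dir="$tmp/hub.git" log -1 --format=%s master)
echo "$hub_subject"
```

Every other repository would then add the same bare repository as a remote and pull from it.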

Secondly, I use mr, with this in .mrconfig:

[DEFAULT]
lib =
        annexupdate() {
                git commit -a -m update || true
                git pull "$@"
                git annex merge
                git push || true
        }

[lib/sound]
update = annexupdate
[lib/big]
update = annexupdate

Which makes "mr update" in repositories where I rarely care about git details take care of syncing my changes.

I also make "mr update" do a "git annex get" of some files in some repositories that I want to always populate. git-annex and mr go well together.

Perhaps my annexupdate above should be available as "git annex sync"?

Comment by joey — Fri Dec 9 22:56:11 2011


comment 2

Thanks for the tips so far. I guess a bare-only repo helps, but it is also something that I don’t need (for my use case), and only have to set up because git works like this.

Also, if I have a mobile device that I want to push to, then I’d have to have two repositories on the device, as I might not be able to reach my main bare repository when traveling, but I cannot push to the „real“ repo on the mobile device from my computer. I guess I am spoiled by darcs, which will happily push to a checked out remote repository, updating the checkout if possible without conflict.

If I introduce a central bare repository to push to and pull from, I’d still have to have the other non-bare repos as remotes, so that git-annex will know about them and their files, right?

I’d appreciate a "git annex sync" that does what you described (commit all, pull, merge, push). Especially if it comes in a "git annex sync --all" variant that syncs all reachable repositories.

Comment by nomeata — Sat Dec 10 16:28:29 2011


comment 3

Git can actually push into a non-bare repository, so long as the branch you change there is not a checked out one. Pushing into remotes/$foo/master and remotes/$foo/git-annex would work, however determining the value that the repository expects for $foo is something git cannot do on its own. And of course you'd still have to git merge remotes/$foo/master to get the changes.
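A rough sketch of that idea, with two temporary non-bare repositories. The names "jeff" and "laptop" are placeholders, and "laptop" here plays the role of $foo; the receiving side has to agree on that name by convention, which is exactly the part git cannot work out on its own.

```shell
# Sketch only: two non-bare repositories in temporary directories.
tmp=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q "$tmp/jeff"
git -C "$tmp/jeff" symbolic-ref HEAD refs/heads/master
git -C "$tmp/jeff" commit -q --allow-empty -m 'initial'
git clone -q "$tmp/jeff" "$tmp/laptop"
git -C "$tmp/laptop" commit -q --allow-empty -m 'another movie added'

# master is checked out on jeff, so push to a remotes/ ref instead.
git -C "$tmp/laptop" push -q origin master:refs/remotes/laptop/master

# On jeff, the pushed change still has to be merged explicitly.
git -C "$tmp/jeff" merge -q remotes/laptop/master
jeff_subject=$(git -C "$tmp/jeff" log -1 --format=%s)
echo "$jeff_subject"
```

The git-annex branch could be pushed the same way, with a refspec such as git-annex:refs/remotes/laptop/git-annex.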

Yes, you still keep the non-bare repos as remotes when adding a bare repository, so git-annex knows how to get to them.

I've made git annex sync run the simple script above. Perhaps it can later be improved to sync all repositories.

Comment by joey — Sat Dec 10 19:43:04 2011


comment 4

I thought about this some more, and I think I have a pretty decent solution that avoids a central bare repository. Instead of pushing to master (which git does not like) or trying to guess the remote branch name on the other side, there is a well-known branch name, say git-annex-master. Then a sync command would do something like this (untested):

git commit -a -m 'git annex sync' # ideally with a description derived from the diff
git merge git-annex-master
git pull someremote git-annex-master # for all reachable remotes. Or better to use fetch and then merge everything in one command?
git branch -f git-annex-master # (or checkout git-annex-master, merge master, checkout master; but since we merged before, this should have the same effect)
git annex merge
git push someremote git-annex-master # for all reachable remotes

The nice things are: One can push to any remote repository, and thus avoid the issue of pushing to a portable device; the merging happens on the master branch, so if it fails to merge automatically, regular git foo can resolve it, and all changes eventually reach every repository.

What do you think?

Comment by nomeata — Tue Dec 13 18:16:08 2011


comment 5

After some experimentation, this seems to work better:

git commit -a -m 'git annex sync'
git merge git-annex-master
for remote in $(git remote)
do
    git fetch $remote
    git merge $remote/git-annex-master
done
git branch -f git-annex-master
git annex merge
for remote in $(git remote)
do
    git push $remote git-annex git-annex-master
done

Maybe this approach can be enhanced to skip stuff gracefully if there is no git-annex-master branch and then be added to what "git annex sync" does; this way those who want to use the feature can do so by running "git branch git-annex-master" once. Or, if you like this and want to make it default, just make "git annex init" create the git-annex-master branch :-)
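One way the graceful skipping could look, as an untested-in-practice sketch rather than what "git annex sync" actually does. The helper names ref_exists and sync_if_enabled are invented here, and the "git commit -a" and "git annex merge" steps of the script above are left out to keep the example short.

```shell
# Name of the well-known branch proposed above.
sync_branch=git-annex-master

# Succeeds only if the given fully qualified ref exists.
ref_exists() {
    git show-ref --verify --quiet "$1"
}

# Hypothetical helper: sync only in repositories that opted in.
sync_if_enabled() {
    if ! ref_exists "refs/heads/$sync_branch"; then
        echo "no $sync_branch branch, skipping"
        return 0
    fi
    for remote in $(git remote); do
        git fetch -q "$remote" || continue
        # Merge only from remotes that carry the branch too.
        if ref_exists "refs/remotes/$remote/$sync_branch"; then
            git merge -q --no-edit "$remote/$sync_branch" || return 1
        fi
    done
    # Record the merged state under the well-known name.
    git branch -f "$sync_branch"
    for remote in $(git remote); do
        if ref_exists "refs/remotes/$remote/$sync_branch"; then
            git push -q "$remote" "$sync_branch" || true
        fi
    done
    echo "synced"
}
```

Running "git branch git-annex-master" once in a repository is then the opt-in switch; without it, sync_if_enabled does nothing.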

Comment by nomeata — Tue Dec 13 18:47:18 2011


comment 6

It would be clearer to call "git-annex-master" "synced/master" (or really "synced/$current_branch"). That does highlight that this method of syncing is not particularly specific to git-annex.

I think this would be annoying to those who do use a central bare repository, because of the unnecessary pushing and pulling to other repos, which could be expensive to do, especially if you have a lot of interconnected repos. So having a way to enable/disable it seems best.

Maybe you should work up a patch to Command/Sync.hs, since I know you know Haskell.

Comment by joey — Tue Dec 13 20:53:23 2011


comment 7

I agree on the naming suggestions, and that it does not suit everybody. Maybe I’ll think some more about it. The point is: I’m trying to make life easy for those who do not want to manually create some complicated setup, so if it needs configuration, it is already off that track. But turning the current behavior into something people have to configure is also not well received by the users.

Given that "git annex sync" is a new command, maybe it is fine to have this as a default behavior, and offer an easy way out. The easy way out could be one of two flags that can be set for a repo (or a remote):

"central", which makes git annex sync only push to and pull from that repo (unless a different remote is given on the command line)
"unsynced", which makes git annex sync skip the repo.

Maybe central is enough.

Comment by nomeata — Sun Dec 18 12:08:51 2011


comment 8

I don't mind changing the behavior of git-annex sync, certainly..

Looking thru git's documentation, I found some existing configuration that could be reused following your idea. There is a remote.<name>.skipDefaultUpdate and a remote.<name>.skipFetchAll. Though both have to do with fetches, not pushes. Another approach might be to use git's remote group stuff.
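To illustrate how such an existing setting could be reused, here is a small sketch in which a sync script consults remote.<name>.skipDefaultUpdate before touching a remote. The function names are hypothetical, and nothing here implies that "git annex sync" itself reads this setting.

```shell
# Succeeds if remote $1 is marked with git's skipDefaultUpdate setting.
remote_is_skipped() {
    [ "$(git config --bool --get "remote.$1.skipDefaultUpdate")" = "true" ]
}

# Print the remotes a sync would still visit, one per line.
remotes_to_sync() {
    for remote in $(git remote); do
        remote_is_skipped "$remote" || echo "$remote"
    done
}
```

A remote would be excluded with, for example, "git config remote.jeff.skipDefaultUpdate true", where "jeff" is just a sample remote name.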

Comment by joey — Mon Dec 19 18:29:01 2011


comment 9

Another option that would please the naive user without hindering the more advanced user: "git annex init", by default, creates a synced/master branch. "git annex sync" will pull from every synced/master branch it finds, and also push to any synced/master branch it finds, but will not create any. So by default (at least for new users), this provides simple one-step syncing.

Advanced users can disable this per-repo by just deleting the synced/master branch. Presumably the logic will be: Every repo that should not be pushed to, because it has access to some central repo, should not have a synced/master branch. Every other repo, including the (or one of the few) central repos, will have the branch.

This is not the most expressive solution, as it does not allow configuring syncing between arbitrary pairs of repos, but it feels like a good compromise between that and simplicity and transparency.

I think it's about time that I provide less talk and more code. I’ll see when I find the time :-)

Comment by nomeata — Mon Dec 19 22:56:26 2011


Finally some code

The repository at http://git.nomeata.de/?p=git-annex.git;a=summary contains changes to Command/Sync.hs (and to the manpage) that implement this behavior. The functionality should be fine; the progress output is not very nice yet, but I’m not sure if I really understood the various Command types. It also should be more easily discoverable how to activate the behavior (by running "git branch synced/master") by providing a helpful message, at least unless git annex init creates the branch by default.

Comment by nomeata — Thu Dec 29 19:58:31 2011


comment 11

OMG, my first sizable haskell patch!

So trying this out..

In each repo I want to sync, I first git branch synced/master

Then in each repo, I found I had to pull from each of its remotes, to get the tracking branches that defaultSyncRemotes looks for to know those remotes are syncable. This was the surprising thing for me, I had expected sync to somehow work out which remotes were syncable without my explicit pull. And it was not very obvious that sync was not doing its thing before I did that, since it still does a lot of "stuff".
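The check described here can be approximated in shell. This is only an approximation of the idea (a remote counts as syncable once a synced/master tracking branch for it is known locally), not the actual Haskell code of defaultSyncRemotes; the function name is made up.

```shell
# Print each remote for which a synced/master tracking branch is known locally.
syncable_remotes() {
    for remote in $(git remote); do
        if git show-ref --verify --quiet "refs/remotes/$remote/synced/master"; then
            echo "$remote"
        fi
    done
}
```

Right after "git remote add", a remote is not listed; it only shows up once a fetch or pull has created the tracking branch, which matches the surprise described above.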

Once set up properly, git annex sync fetches from each remote, merges, and then pushes to each remote that has a synced branch. Changes propagate around even when some links are one-directional. Cool!

So it works fine, but I think more needs to be done to make setting up syncing easier. Ideally, all a user would need to do is run "git annex sync" and it syncs from all remotes, without needing to manually set up the synced/master branch.

While this would lose the ability to control which remotes are synced, I think that being able to git annex sync origin and only sync from/to origin is sufficient, for the centralized use case.

Code review:

Why did you make branch strict?

There is a bit of a bug in your use of Command.Merge.start. The git-annex branch merge code only runs once per git-annex run, and often this comes before sync fetches from the remotes, leading to a push conflict. I've fixed this in my "sync" branch, along with a few other minor things.

mergeRemote merges from refs/remotes/foo/synced/master. But that will only be up-to-date if git annex sync has recently been run there. Is there any reason it couldn't merge from refs/remotes/foo/master?

Comment by joey — Fri Dec 30 21:49:06 2011


comment 12

I have made a new autosync branch, where all that the user needs to do is run git annex sync and it automatically sets up the synced/master branch. I find this very easy to use, what do you think?

Note that autosync is also pretty smart about not running commands like "git merge" and "git push" when they would not do anything. So you may find git annex sync not showing all the steps you'd expect. The only step a sync always performs now is pulling from the remotes.

Comment by joey — Fri Dec 30 23:45:57 2011


comment 13

I have merged my autosync branch, the improved sync command will be in this year's last git-annex release!

Comment by joey — Sat Dec 31 18:34:31 2011


comment 14

Sorry for not replying earlier, but my non-mailinglist-communications-workflows are suboptimal :-)

> Then in each repo, I found I had to pull from each of its remotes, to get the tracking branches that defaultSyncRemotes looks for to know those remotes are syncable. This was the surprising thing for me, I had expected sync to somehow work out which remotes were syncable without my explicit pull. And it was not very obvious that sync was not doing its thing before I did that, since it still does a lot of "stuff".

Right. But "git fetch" ought to be enough.

Personally, I’d just pull and push everywhere, but you pointed out that it ought to be manageable. The existence of the synced/master branch is the flag that indicates this, so you need to propagate this once. Note that if the branch were already created by "git annex init", then this would not be a problem.

It is not required to use "git fetch" once; you can also call "git annex sync <remote>" once with the remote explicitly mentioned, as this would involve a fetch.

> While this would lose the ability to control which remotes are synced, I think that being able to git annex sync origin and only sync from/to origin is sufficient, for the centralized use case.

I’d leave this decision to you. But I see that you took the decision already, as your code now creates the synced/master branch when it does not exist (e290f4a8).

> Why did you make branch strict?

Because it did not work otherwise :-). It uses pipeRead, which is lazy, and for some reason git and/or your utility functions did not like that the output of the command was not consumed before the next git command was called. I did not investigate further. For better code, I’d suggest adding a function like pipeRead that completely reads the git output before returning, thus avoiding any issues with lazy IO.

> mergeRemote merges from refs/remotes/foo/synced/master. But that will only be up-to-date if git annex sync has recently been run there. Is there any reason it couldn't merge from refs/remotes/foo/master?

Hmm, good question. It is probably safe to merge from both, and push only to synced/master. But which one first? synced/master can be ahead if the repo was synced to from somewhere else, master can be ahead if there are local changes. Maybe git merge should be called on all remote heads simultaneously, thus generating only one commit for the merge. I don’t know how well that works in practice.

Thanks for including my code, Joachim

Comment by nomeata — Mon Jan 2 14:02:04 2012


comment 15

With a lazy branch, I get "git-annex: no branch is checked out". Weird.. my best guess is that it's because this is running at the seek stage, which is unusual, and the value is not used until a later stage and so perhaps the git command gets reaped by some cleanup code before its output is read.

(pipeRead is lazy because often it's used to read large quantities of data from git that are processed progressively.)

I did make it merge both branches, separately. It would be possible to do one single merge, but it's probably harder for the user to recover if there are conflicts in an octopus merge. The order of the merges does not seem to me to matter much, barring conflicts it will work either way. Dealing with conflicts during sync is probably a weakness of all this; after the first conflict the rest of the sync will continue failing.
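The difference between the two strategies can be sketched in shell. This is an illustration of merging both branches separately, not the code in Command/Sync.hs; the function name merge_remote is borrowed from the discussion, and stopping at the first failed merge is an assumption made for this example.

```shell
# Merge a remote's master and synced/master one after the other.
# Stops at the first failed merge so the conflict can be resolved by hand.
merge_remote() {
    for ref in "refs/remotes/$1/master" "refs/remotes/$1/synced/master"; do
        # Skip refs this repository does not know about.
        git show-ref --verify --quiet "$ref" || continue
        git merge -q --no-edit "$ref" || return 1
    done
}
```

The single-merge alternative would pass both refs to one "git merge" call, producing an octopus merge; a conflict there leaves less of a clue about which branch caused it.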

Comment by joey — Mon Jan 2 16:01:49 2012

