syncing non-git trees with git-annex

I have a bunch of directory trees with large data files scattered over various computers and disk drives - they contain photos, videos, music, and so on. In many cases I initially copied one of these trees from one machine to another just as a cheap and dirty backup, and then made small modifications to both trees in ways I no longer remember. For example, I returned from a trip with a bunch of new photos, and then might have rotated some of them 90 degrees on one machine, and edited or renamed them on another.

What I want to do now is use git-annex as a way of initially synchronising the trees, and then fully managing them on an ongoing basis. Note that the trees are not yet git repositories. In order to detect straightforward file renames, I believe that the SHA1 backend probably makes the most sense.
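
To illustrate why a content-based backend helps with renames: the key depends only on what is in the file, so a copy that was renamed in one tree still hashes to the same value. Below is a minimal sketch of that idea outside git-annex. It assumes GNU sha1sum is available and that file names contain no newlines or backslashes; same_content is a made-up helper name, not a git-annex command.

```shell
# Print "path-in-first-tree<TAB>path-in-second-tree" for every file in the
# second tree whose content also exists somewhere in the first tree.
# Hypothetical helper for illustration only.
same_content() {
    # $1 = first tree, $2 = second tree
    {
        find "$1" -type f -exec sha1sum {} + | sed 's/^/A /'
        find "$2" -type f -exec sha1sum {} + | sed 's/^/B /'
    } | awk '
        # Each line is "<A|B> <40 hex digits>  <path>", so the path starts at column 45.
        { path = substr($0, 45) }
        # Remember one path per hash from the first tree.
        $1 == "A" { seen[$2] = path; next }
        # Report files of the second tree whose hash was already seen.
        $1 == "B" && ($2 in seen) { printf "%s\t%s\n", seen[$2], path }
    '
}
```

Running same_content $td/a $td/b before setting anything up gives a rough idea of how many files the two trees already share, whatever they are called. If the first tree holds several copies of the same content, only one of them is reported.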

I've been playing around and arrived at the following setup procedure. For the sake of discussion, I assume that we have two trees a and b which live in the same directory referred to by $td, and that all large files end with the .avi suffix.

 # Set up git in 'a'.
 cd $td/a
 git init

 # Set up git-annex in 'a'.
 echo '* annex.backend=SHA1' > .gitattributes
 git add .gitattributes
 git commit -m'use SHA1 backend'
 git annex init

 # Annex all large files.
 find . -name '*.avi' -print0 | xargs -0 git annex add
 git add .
 git commit -m'Initial import'

 # Set up git in 'b'.
 cd $td/b
 git clone -n $td/a new
 mv new/.git .
 rmdir new
 git reset # reset the index to HEAD (a's commit) without touching b's working tree

 # Set up git-annex in 'b'.
 # This merges a's (origin's) git-annex branch into the local git-annex branch.
 git annex init

 # Annex all large files - because we're using SHA1 backend, some
 # should hash to the same keys as in 'a'.
 find . -name '*.avi' -print0 | xargs -0 git annex add
 git add .
 git commit -m'Changes in b tree'

 git remote add a $td/a

 # Now pull changes in 'b' back to 'a'.
 cd $td/a
 git remote add b $td/b
 git pull b master

This seems to work, but have I missed anything?

comment 1

This is an entirely reasonable way to go about it.

However, doing it this way causes files in B to always "win": if the same filename is in both repositories with differing content, the version added in B will supersede the version from A. If A has a file that is not in B, a git commit -a in B will commit a deletion of that file.

I might do it your way and look at the changes in B before (or even after) committing them to see if files from A were deleted or changed.
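
One way to do that review with plain git, as a sketch: after the git reset step in B, HEAD still points at A's last commit, so diffing the working tree against HEAD shows which of A's files the next commit would delete (D), modify (M) or type-change (T). The function name pending_changes is made up for this example. Note that with git 2.0 or later, git add . stages deletions as well, so the check is worth doing even without commit -a.

```shell
# List paths that differ from HEAD in a way that would lose or alter
# something A had: D = deleted, M = modified, T = type changed.
# Hypothetical helper; newly added files are deliberately left out.
pending_changes() {
    # $1 = repository directory (defaults to the current directory)
    git -C "${1:-.}" diff --name-status --diff-filter=DMT HEAD
}
```

Each output line is a status letter, a tab, and the path. An empty result means committing in B would neither remove nor change anything that came from A.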

Or, I might just instead keep B in a separate subdirectory in the repository, set up like so:

mv b old_b
git clone a b
cd b
mv ../old_b .
git annex add old_b --not --exclude '*.avi'

Or, a third way would be to commit A to a branch like branchA and B to a separate branchB, and not merge the branches at all.

Comment by joey — Wed Dec 14 17:31:31 2011

comment 2

As joey points out, the problem is that B overwrites A, so any files in A that aren't in B will be removed. But the suggestion to keep B in a separate subdirectory in the repository means I'll end up with duplicates of files in both A and B. What I want is the merged superset of all files from both A and B, with only one copy of identical files.

The problem is that unique symlinks in A/master are deleted when B/master is merged in. To add back the deleted files after the merge you can do this:

# Check out a single deleted file called deleted_file_name.
git checkout master~1 deleted_file_name
# List the names of all files deleted between master~1 and master.
git diff master~1 master --name-only --diff-filter=D
# Check out all files deleted between master~1 and master.
git diff master~1 master --name-only --diff-filter=D | xargs git checkout master~1

Once the first merge has been done after set up, you can continue to make changes to A and B and future merges won't require accounting for deleted files in this way.
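
For trees whose file names contain spaces, the restore step can be made NUL-safe. This is only a sketch: it assumes GNU xargs (for -0 and -r), restore_deleted is a made-up helper name, and its two arguments default to the master~1 and master commits discussed in this comment.

```shell
# Restore every file that exists in the "before" commit but was deleted in
# the "after" commit. Hypothetical helper for illustration.
restore_deleted() {
    # $1 = commit before the merge, $2 = commit after the merge
    before=${1:-master~1}
    after=${2:-master}
    # -z and -0 pass names NUL-terminated; -r skips the checkout entirely
    # when nothing was deleted (otherwise git would switch to "$before").
    git diff -z --name-only --diff-filter=D "$before" "$after" |
        xargs -0 -r git checkout "$before" --
}
```

The restored files end up both in the working tree and in the index, so a git commit is still needed afterwards to record them.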

Comment by Oliver — Fri Dec 23 22:04:08 2011

re-annexing previously annexed files

Here's another handy command line, which annexes all files in repo B that have already been annexed in repo A:

git status --porcelain | sed -n '/^ T /{s///;p}' | xargs git annex add

The 'T' that git status prints for these files indicates a type change: the file is a symlink to the annex in repo A, but a normal file in repo B.
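
The same set of files can also be listed through git diff, which avoids parsing the status output and copes with spaces in names. A sketch, with typechanged as a made-up helper name:

```shell
# Print the NUL-terminated names of tracked files whose type differs between
# the index and the working tree (for example symlink vs. regular file).
# Hypothetical helper for illustration.
typechanged() {
    # --diff-filter=T keeps only type changes; -z terminates names with NUL
    git diff -z --name-only --diff-filter=T
}
```

Assuming GNU xargs, the result could then be fed to git-annex with typechanged | xargs -0 -r git annex add, which should have the same effect as the sed pipeline above.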

Comment by Adam — Thu Mar 29 21:41:54 2012
