I have a bunch of directory trees with large data files scattered over various computers and disk drives - they contain photos, videos, music, and so on. In many cases I initially copied one of these trees from one machine to another just as a cheap and dirty backup, and then made small modifications to both trees in ways I no longer remember. For example, I returned from a trip with a bunch of new photos, and then might have rotated some of them 90 degrees on one machine, and edited or renamed them on another.
What I want to do now is use git-annex as a way of initially synchronising the trees, and then fully managing them on an ongoing basis. Note that the trees are not yet git repositories. In order to be able to detect straightforward file renames, I believe that the SHA1 backend probably makes the most sense.
I've been playing around and arrived at the following setup procedure. For the sake of discussion, I assume that we have two trees a and b which live in the same directory referred to by $td, and that all large files end with the .avi suffix.
# Setup git in 'a'.
cd $td/a
git init
# Setup git-annex in 'a'.
echo '* annex.backend=SHA1' > .gitattributes
git add .gitattributes
git commit -m'use SHA1 backend'
git annex init
# Annex all large files.
find . -name '*.avi' -print0 | xargs -0 git annex add
git add .
git commit -m'Initial import'
# Setup git in 'b'.
cd $td/b
git clone -n $td/a new
mv new/.git .
rmdir new
git reset # rebuild the index from HEAD (a's commit, inherited via the clone); b's working tree is left untouched
# Setup git-annex in 'b'.
# This merges a's (origin's) git-annex branch into the local git-annex branch.
git annex init
# Annex all large files - because we're using SHA1 backend, some
# should hash to the same keys as in 'a'.
find . -name '*.avi' -print0 | xargs -0 git annex add
git add .
git commit -m'Changes in b tree'
git remote add a $td/a
# Now pull changes in 'b' back to 'a'.
cd $td/a
git remote add b $td/b
git pull b master
This seems to work, but have I missed anything?
This is an entirely reasonable way to go about it.
However, doing it this way causes files in B to always "win": if the same filename is in both repositories with differing content, the version added in B will supersede the version from A. If A has a file that is not in B, a git commit -a in B will commit a deletion of that file.
I might do it your way and look at the changes in B before (or even after) committing them to see if files from A were deleted or changed.
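A rough sketch of that check, using plain git: once everything in B is staged but not yet committed, ask git which paths the pending commit would delete, modify or type-change relative to HEAD (which is A's last commit at that point). The function name staged_losses is made up for illustration.

```shell
# Hypothetical helper; run inside B after 'git annex add' and 'git add .',
# but before 'git commit'.
# Lists staged paths that delete (D), modify (M) or type-change (T)
# something that exists in HEAD, i.e. in A's last commit.
staged_losses() {
    git diff --cached --name-status --diff-filter=DMT
}
```

Any line starting with D is a file that exists in A but not in B. Files that are new in B are left out of the listing on purpose.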
Or, I might just instead keep B in a separate subdirectory in the repository, set up like so:
Or, a third way would be to commit A to a branch like branchA and B to a separate branchB, and not merge the branches at all.
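For the branch approach, a minimal sketch of how the two unmerged branches could be compared, assuming both branchA and branchB have already been fetched into the same repository. The helper name compare_branches is hypothetical.

```shell
# Hypothetical helper: summarise how two unmerged branches differ.
# Status letters are relative to the first branch:
#   D   = path exists only in the first branch
#   A   = path exists only in the second branch
#   M/T = path exists in both, with different content or file type
# --no-renames keeps renames as a plain D + A pair instead of an R entry.
compare_branches() {
    git diff --name-status --no-renames "$1" "$2"
}
```

For example, `compare_branches branchA branchB` lists what B dropped, added or changed compared with A, without touching either branch.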
As joey points out, the problem is that B overwrites A, so any files in A that aren't in B will be removed. But the suggestion to keep B in a separate subdirectory in the repository means I'll end up with duplicates of files in both A and B. What I want is the merged superset of all files from both A and B, with only one copy of identical files.
The problem is that unique symlinks in A/master are deleted when B/master is merged in. To add back the deleted files after the merge you can do this:
Once the first merge has been done after set up, you can continue to make changes to A and B and future merges won't require accounting for deleted files in this way.
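For the restore step after that first merge, here is a minimal sketch using plain git; it may differ from the command the poster had in mind. It relies on ORIG_HEAD, which git sets to the pre-merge tip during a pull, and it assumes GNU xargs for the -r flag. The function name restore_deleted is made up for illustration.

```shell
# Hypothetical helper; run in A right after the first 'git pull b master'.
# ORIG_HEAD is where A's branch pointed before the pull.
restore_deleted() {
    # List paths that existed before the merge but are gone afterwards,
    # NUL-separated so unusual filenames survive, and check them out again
    # from the pre-merge commit (this also stages them).
    git diff -z --name-only --diff-filter=D ORIG_HEAD HEAD |
        xargs -0 -r git checkout ORIG_HEAD --
}
```

The restored symlinks are left staged, so a following git commit in A records them.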
Here's another handy command-line which annexes all files in repo B which have already been annexed in repo A:
The 'T' output by git status for these files indicates a type change: the file is a symlink into the annex in repo A, but a normal file in repo B.
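A sketch of one way to pick out those type-changed files, not necessarily the original command-line. It assumes it is run in B after the git reset step and before anything is staged, so that the index still reflects A's commit. The function name typechanged_paths is hypothetical.

```shell
# Hypothetical helper; run in B after 'git reset', before staging anything.
# Prints NUL-separated paths whose type differs between the index (A's
# commit) and the working tree: symlinks in A, regular files in B.
typechanged_paths() {
    git diff -z --name-only --diff-filter=T
}

# Usage (needs git-annex, and GNU xargs for -r):
#   typechanged_paths | xargs -0 -r git annex add
```

Files whose content matches A's copy should then turn back into the same symlink, while files with different content show up as modified.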