I have a large git repository with binary files scattered over different branches. I want to switch to git-annex mainly for performance reasons, but I don't want to lose my history.

I tried to rewrite the (cloned) repository with git-filter-branch but failed miserably for several reasons:

  • --tree-filter performs its operations in a temporary directory (.git-rewrite/t/) so the symlinks point to the wrong destination (../../.git/annex/).
  • annex log files are stored in .git-annex/ instead of .git-rewrite/t/.git-annex/ so the filter operation misses them
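
The symlink problem can be reproduced without git or git-annex. The following is a minimal sketch, assuming a throwaway directory layout that only mimics a repository; the names sandbox, KEY and file.rpm are made up for illustration. It shows that a target computed from inside .git-rewrite/t/ dangles when the same link sits at the repository root, and that stripping the leading ../../ repairs it.

```shell
# Hypothetical sandbox; nested two levels deep so that ../../ stays inside it.
sandbox=$(mktemp -d)
repo="$sandbox/x/y/repo"
mkdir -p "$repo/.git/annex" "$repo/.git-rewrite/t"
echo data > "$repo/.git/annex/KEY"

# Target as seen from the temporary tree, two levels below the repository root.
annexdest=../../.git/annex/KEY
ln -s "$annexdest" "$repo/.git-rewrite/t/file.rpm"   # resolves in the temporary tree
ln -s "$annexdest" "$repo/file.rpm"                  # same target at the root: dangling

if [ -e "$repo/.git-rewrite/t/file.rpm" ]; then in_temp=yes; else in_temp=no; fi
if [ -e "$repo/file.rpm" ]; then before_fix=yes; else before_fix=no; fi

# Strip the leading ../../ so the target is relative to the repository root.
fixed=${annexdest#../../}
ln -sf "$fixed" "$repo/file.rpm"
if [ -e "$repo/file.rpm" ]; then after_fix=yes; else after_fix=no; fi

echo "temp tree: $in_temp, root before: $before_fix, root after: $after_fix"
```

The ${annexdest#../../} expansion is the same one used in the migration script further down.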

Any suggestions on how to proceed?

EDIT 3/2/2010 I finally got it working for my purposes. Hardest part was preserving the branches while injecting the new git annex setup base commit.

Clone repository

git clone original migrate
cd migrate
git checkout mybranch
git checkout master
git remote rm origin

Inject git annex setup base commit and repair branches

git symbolic-ref HEAD refs/heads/newroot
git rm --cached *
git clean -f -d
git annex init master
echo \*.rpm annex.backend=SHA1 >> .gitattributes
git commit -m "store rpms in git annex" .gitattributes
git cherry-pick $(git rev-list --reverse master | head -1)
git rebase --onto newroot newroot master
git rebase --onto master mybranch~1 mybranch
git branch -d newroot

Migrate repository

mkdir .temp
cp .git-annex/* .temp/
MYWORKDIR=$(pwd) git filter-branch \
 --tag-name-filter cat \
 --tree-filter '
    mkdir -p .git-annex;
    cp ${MYWORKDIR}/.temp/* .git-annex/;
    for rpm in $(git ls-files | grep "\.rpm$"); do
        echo;
        git annex add $rpm;
        annexdest=$(readlink $rpm);
        if [ -e .git-annex/$(basename $annexdest).log ]; then
            echo "FOUND $(basename $annexdest).log";
        else
            echo "COPY $(basename $annexdest).log";
            cp ${MYWORKDIR}/.git-annex/$(basename $annexdest).log .git-annex/;
            cp ${MYWORKDIR}/.git-annex/$(basename $annexdest).log ${MYWORKDIR}/.temp/;
        fi;
        ln -sf ${annexdest#../../} $rpm;
    done;
    git reset HEAD .git-rewrite;
    :
    ' -- $(git branch | cut -c 3-)
rm -rf .temp
git reset --hard

TODO:

  • Find a way to repair branches automatically (detect branch points and run appropriate git rebase commands)
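
As a possible starting point for the detection half of this TODO, here is a minimal sketch, assuming branch points are measured against master and that it runs before master is rebased. The helper name branch_points and the throwaway demo repository are made up for illustration; git for-each-ref and git merge-base are standard git commands.

```shell
# Print "<branch> <fork commit>" for every local branch except master.
branch_points() {
    git for-each-ref --format='%(refname:short)' refs/heads |
    while read -r branch; do
        [ "$branch" = master ] && continue
        printf '%s %s\n' "$branch" "$(git merge-base master "$branch")"
    done
}

# Demo in a throwaway repository (hypothetical names, empty commits).
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/master
commit() { git -c user.name=demo -c user.email=demo@example.invalid commit -q --allow-empty -m "$1"; }
commit one
fork=$(git rev-parse HEAD)      # mybranch will fork here
git branch mybranch
commit two                      # master moves on
git checkout -q mybranch
commit three                    # mybranch diverges
points=$(branch_points)
echo "$points"
```

Each fork commit could then serve as the upstream argument of git rebase --onto <newbase> <fork> <branch>. Mapping a fork commit to its rewritten counterpart on the new master is not covered by this sketch.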

I'll be happy to try any suggestions to improve this migration script.

P.S. Is there a way to edit comments?

I don't know how to approach this yet, but I support the idea -- it would be great if there was a tool that could punch files out of git history and put them in the annex. (Of course with typical git history rewriting caveats.)

Sounds like it might be enough to add a switch to git-annex that overrides where it considers the top of the git repository to be?

Comment by joey Fri Feb 25 05:16:48 2011

My current workflow looks like this (I'm still experimenting):

Create backup clone for migration

git clone original migrate
cd migrate
for branch in $(git branch -a | grep remotes/origin | grep -v HEAD); do git checkout --track $branch; done

Inject git annex initialization at repository base

git symbolic-ref HEAD refs/heads/newroot
git rm --cached *.rpm
git clean -f -d
git annex init master
git cherry-pick $(git rev-list --reverse master | head -1)
git rebase --onto newroot newroot master
git rebase master mybranch # how to automate this for all branches?
git branch -d newroot

Start migration with tree filter

echo \*.rpm annex.backend=SHA1 > .git/info/attributes
MYWORKDIR=$(pwd) git filter-branch --tree-filter ' \
    if [ ! -d .git-annex ]; then \
        mkdir .git-annex; \
        cp ${MYWORKDIR}/.git-annex/uuid.log .git-annex/; \
        cp ${MYWORKDIR}/.gitattributes .; \
    fi
    for rpm in $(git ls-files | grep "\.rpm$"); do \
        echo; \
        git annex add $rpm; \
        annexdest=$(readlink $rpm); \
        if [ -e .git-annex/$(basename $annexdest).log ]; then \
            echo "FOUND $(basename $annexdest).log"; \
        else \
            echo "COPY $(basename $annexdest).log"; \
            cp ${MYWORKDIR}/.git-annex/$(basename $annexdest).log .git-annex/; \
        fi; \
        ln -sf ${annexdest#../../} $rpm; \
    done; \
    git reset HEAD .git-rewrite; \
    : \
    ' -- $(git branch | cut -c 3-)
rm -rf .temp
git reset --hard

There are still some drawbacks:

  • git history shows that git annex log files are modified with each checkin
  • branches have to be rebased manually before starting migration

Comment by tyger Tue Mar 1 14:07:50 2011

Sounds like it might be enough to add a switch to git-annex that overrides where it considers the top of the git repository to be?

It should be sufficient to honor the GIT_DIR/GIT_WORK_TREE/GIT_INDEX_FILE environment variables. git filter-branch sets GIT_WORK_TREE to ., but this can be mitigated by starting the filter script with 'GIT_WORK_TREE=$(pwd $GIT_WORK_TREE)'. For example, with GIT_DIR=/home/tyger/repo/.git and GIT_WORK_TREE=/home/tyger/repo/.git-rewrite/t, git annex should be able to compute the correct relative path, or maybe use absolute paths in symlinks.
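
One way to make a relative GIT_WORK_TREE absolute is to cd into it in a subshell and print the resulting directory. The following is a minimal sketch of that idea, not something git-annex provides; the helper name abs_work_tree and the sandbox directory are made up for illustration.

```shell
# Resolve a possibly relative directory to an absolute path.
abs_work_tree() {
    # cd inside a subshell so the caller's working directory is untouched
    (cd "${1:-.}" && pwd)
}

# Simulate the situation described above: the filter runs with the temporary
# tree as its current directory and GIT_WORK_TREE set to "." (sandbox is hypothetical).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/.git-rewrite/t"
result=$(cd "$sandbox/.git-rewrite/t" && GIT_WORK_TREE=. && abs_work_tree "$GIT_WORK_TREE")
echo "$result"
```

Inside a real filter script the equivalent line would be GIT_WORK_TREE=$(abs_work_tree "$GIT_WORK_TREE") followed by export GIT_WORK_TREE.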

Another problem I observed is that git annex add automatically commits the symlink; this behaviour doesn't work well with --tree-filter. git annex commits the wrong path (.git-rewrite/t/LINK instead of LINK). Also, --tree-filter doesn't expect the filter script to commit anything; new files in the temporary work tree are committed by filter-branch on each iteration of the filter script (and missing files are removed).

Comment by tyger Wed Mar 2 08:15:37 2011

For the portion: git rebase master mybranch # how to automate this for all branches?

Try this:

branch_to_ignore='git-annex|master|newroot'
# Note: with "$branch~" as upstream, only the tip commit of each branch is replayed.
for branch in $(git for-each-ref --sort=-committerdate refs/heads --format='%(refname:short)' | grep -E -v "$branch_to_ignore")
do
    echo "Rebasing branch $branch onto master...."
    git rebase --onto master "$branch~" "$branch"
done

Feel free to add/correct as necessary.

Comment by Laura Thu Jan 16 17:47:45 2014