I have a git annex repo for all my media that has grown to 57866 files and git operations are getting slow, especially on external spinning hard drives, so I decided to split it into separate repositories.

Here is how to split out a repository that contains a subset of the files in the larger repository. The larger repository is left as-is, but similar methods can be used to remove the files from it. Or, it can be deleted once it gets split up into several smaller repositories.

(This is the reverse of [[migrating two seperate disconnected directories to git annex]].)

Suppose the old big repo is at ~/oldrepo, and you want to split out photos from it, and those are all located inside ~/oldrepo/photos.

First, let's create a new empty repo.

mkdir ~/photos
cd photos
git init

Now to populate the new repo with the files we want from the old repo. We can use git filter-branch to create a git branch that contains only the history of the files in photos. That command has a lot of options and ways to use it, but here is one simple way:

cd ~/oldrepo

# filter a branch to with only the files wanted by the new repository
git branch split-master master
git filter-branch --prune-empty --subdirectory-filter photos split-master

# replace the new repo's master branch with the filtered branch
git push ~/photos split-master
git branch -D split-master
cd ~/photos
git reset --hard split-master
git branch -d split-master

Next, the git-annex branch needs to be filtered to include only the files in photos, and that filtered branch sent to the new repository. That can be done with the git-annex-filter-branch(1) command.

cd ~/oldrepo
annexrev=$(git annex filter-branch photos --include-all-key-information --include-all-repo-config --include-global-config)
git push ~/photos $annexrev:refs/heads/git-annex

Next, initialize git-annex on the new repository. This uses the same annex.uuid as was in the old repository. That's ok, because the repository that's been split off will never have the old repository as a remote.

cd ~/photos
git annex reinit $(git config --file ../tofilter/.git/config annex.uuid)

Finally the annexed file contents need to be copied to the new repository:

cd ~/photos

# Hardlink all the annexed data from the old repo
cp -rl ~/oldrepo/.git/annex/objects .git/annex/

# Remove unneeded hard links
git annex unused --quiet
git annex drop --unused --force

# Fix up annex links to content and make sure it's all ok.
git annex fsck

Warning: This method of copying the annexed file contents and dropping the unused ones causes the git-annex branch to log information.

alternative older method

Here is another way to do it. Suppose the old big repo is at ~/oldrepo:

# Create a new repo for photos only
mkdir ~/photos
cd photos
git init
git annex init laptop

# Hardlink all the annexed data from the old repo
cp -rl ~/oldrepo/.git/annex/objects .git/annex/

# Regenerate the git annex metadata
git annex fsck --fast

# Also split the repo on the usb key
cd /media/usbkey
git clone ~/photos
cd photos
git annex init usbkey
cp -rl ../oldrepo/.git/annex/objects .git/annex/
git annex fsck --fast

# Connect the annexes as remotes of each other
git remote add laptop ~/photos
cd ~/photos
git remote add usbkey /media/usbkey

At this point, I went through all repos doing standard cleanup:

# Remove unneeded hard links
git annex unused
git annex dropunused --force 1-12345

# Sync
git annex sync

To make sure nothing is missing, I used git annex find --not --in=here to see if, for example, the usbkey that should have everything could be missing some thing.

Update: Antoine Beaupré pointed me to this tip about Repositories with large number of files which I will try next time one of my repositories grows enough to hit a performance issue.

This document was originally written by Enrico Zini and added to this wiki by anarcat.