I have a git annex repo for all my media that has grown to 57866 files and git operations are getting slow, especially on external spinning hard drives, so I decided to split it into separate repositories.
Here is how to split out a repository that contains a subset of the files in the larger repository. The larger repository is left as-is, but similar methods can be used to remove the files from it. Or, it can be deleted once it gets split up into several smaller repositories.
(This is the reverse of [[migrating two seperate disconnected directories to git annex]].)
Suppose the old big repo is at ~/oldrepo, and you want to split out photos from it, and those are all located inside ~/oldrepo/photos.
First, let's create a new empty repo.
mkdir ~/photos
cd ~/photos
git init
Now to populate the new repo with the files we want from the old repo. We can use git filter-branch to create a git branch that contains only the history of the files in photos. That command has a lot of options and ways to use it, but here is one simple way:
cd ~/oldrepo
# filter a branch so that it has only the files wanted by the new repository
git branch split-master master
git filter-branch --prune-empty --subdirectory-filter photos split-master
# replace the new repo's master branch with the filtered branch
git push ~/photos split-master
git branch -D split-master
cd ~/photos
git reset --hard split-master
git branch -d split-master
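To see what the subdirectory filter does before running it on real data, here is a small throwaway sketch using plain git in a temporary directory. The directory layout, file names, and the music folder are made up for illustration. FILTER_BRANCH_SQUELCH_WARNING is set only to skip the warning that newer git versions print before filter-branch starts.

```shell
set -eu
# Skip the interactive warning newer git versions show for filter-branch
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway stand-in for ~/oldrepo (hypothetical contents)
tmp=$(mktemp -d)
git init -q "$tmp/oldrepo"
cd "$tmp/oldrepo"
git config user.email demo@example.com
git config user.name demo
mkdir photos music
echo a > photos/a.jpg
echo b > music/b.mp3
git add .
git commit -q -m "add media"

# Same two commands as in the tip, applied to the current branch
git branch split-master
git filter-branch --prune-empty --subdirectory-filter photos split-master >/dev/null 2>&1

# The filtered branch has only the photos, moved up to the top level
split_files=$(git ls-tree -r --name-only split-master)
echo "$split_files"
```

The original branch is untouched; only split-master is rewritten, which is why it is safe to delete it again after pushing.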
Next, the git-annex branch needs to be filtered to include only the files in photos, and that filtered branch sent to the new repository. That can be done with the git-annex-filter-branch(1) command.
cd ~/oldrepo
annexrev=$(git annex filter-branch photos --include-all-key-information --include-all-repo-config --include-global-config)
git push ~/photos $annexrev:refs/heads/git-annex
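The push above sends a bare commit id rather than a branch name, since the filtered commit is not on any branch in the old repository. The sketch below shows the same kind of push with plain git only. It uses git commit-tree as a stand-in to get a commit that no ref points at, so it does not exercise git-annex itself, and the repository names are hypothetical.

```shell
set -eu
tmp=$(mktemp -d)
git init -q "$tmp/src"
git init -q "$tmp/dst"
cd "$tmp/src"
git config user.email demo@example.com
git config user.name demo
echo data > file
git add file
git commit -q -m "initial"

# Stand-in for the filtered commit: a commit object with no ref pointing at it
rev=$(git commit-tree "HEAD^{tree}" -m "unreferenced demo commit")

# Pushing <commit>:<full ref name> creates that branch in the destination
git push -q "$tmp/dst" "$rev:refs/heads/git-annex"

pushed=$(git -C "$tmp/dst" rev-parse refs/heads/git-annex)
echo "$pushed"
```

Giving the full refs/heads/git-annex name matters here, because git cannot guess what kind of ref to create from a bare commit id.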
Next, initialize git-annex on the new repository. This uses the same annex.uuid as was in the old repository. That's ok, because the repository that's been split off will never have the old repository as a remote.
cd ~/photos
git annex reinit "$(git config --file ~/oldrepo/.git/config annex.uuid)"
Finally the annexed file contents need to be copied to the new repository:
cd ~/photos
# Hardlink all the annexed data from the old repo
cp -rl ~/oldrepo/.git/annex/objects .git/annex/
# Remove unneeded hard links
git annex unused --quiet
git annex drop --unused --force
# Fix up annex links to content and make sure it's all ok.
git annex fsck
Warning: This method of copying the annexed file contents and dropping the unused ones causes the git-annex branch to log information.
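The cp -rl step only works when both repositories are on the same filesystem, because hard links cannot cross filesystems. A quick way to check that the copy really produced hard links rather than full copies is sketched below. It assumes GNU cp (for the -l option) and a shell whose test builtin supports -ef (bash and dash do), and it uses a simplified made-up object directory rather than a real annex layout.

```shell
set -eu
tmp=$(mktemp -d)

# Simplified stand-in for the annex object directories (hypothetical layout)
mkdir -p "$tmp/oldrepo/.git/annex/objects/aa" "$tmp/photos/.git/annex"
echo "content" > "$tmp/oldrepo/.git/annex/objects/aa/key"

# Same command shape as in the tip
cp -rl "$tmp/oldrepo/.git/annex/objects" "$tmp/photos/.git/annex/"

# -ef is true when both paths refer to the same inode on the same device
if [ "$tmp/oldrepo/.git/annex/objects/aa/key" -ef "$tmp/photos/.git/annex/objects/aa/key" ]; then
    linked=yes
else
    linked=no
fi
echo "hard linked: $linked"
```

Since both names point at the same data, the linked objects take no extra disk space until one side is dropped or deleted.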
alternative older method
Here is another way to do it. Suppose the old big repo is at ~/oldrepo:
# Create a new repo for photos only
mkdir ~/photos
cd ~/photos
git init
git annex init laptop
# Hardlink all the annexed data from the old repo
cp -rl ~/oldrepo/.git/annex/objects .git/annex/
# Regenerate the git annex metadata
git annex fsck --fast
# Also split the repo on the usb key
cd /media/usbkey
git clone ~/photos
cd photos
git annex init usbkey
cp -rl ../oldrepo/.git/annex/objects .git/annex/
git annex fsck --fast
# Connect the annexes as remotes of each other
git remote add laptop ~/photos
cd ~/photos
git remote add usbkey /media/usbkey/photos
At this point, I went through all repos doing standard cleanup:
# Remove unneeded hard links
git annex unused
git annex dropunused --force 1-12345
# Sync
git annex sync
To make sure nothing is missing, I used git annex find --not --in=here to see if, for example, the usbkey that should have everything could be missing something.
Update: Antoine Beaupré pointed me to this tip about Repositories with large number of files which I will try next time one of my repositories grows enough to hit a performance issue.
This document was originally written by Enrico Zini and added to this wiki by anarcat.
2021 update: The new git-annex-filter-branch command can be used to produce a filtered version of the git-annex branch that only includes information for the files you want. I have updated the tip to show how to do it that way, and kept the old way as an alternative.
The old, alternative way is a simple way to split a repository, but the resulting split git repository will be larger than is really necessary. (The new method avoids this problem.)
When you dropunused all the hard links that are not present in the repository, git-annex will commit a log to the git-annex branch saying "I don't have this content" for each of them. That seems unnecessary since it probably does not have an earlier log saying it contained the content that was hard linked into it, and perhaps could be improved in git-annex to not record that unnecessarily, but that's what it does currently.
So I suggest running git annex forget after the dropunused or at some later point. That will delete all traces of those log files from the git-annex branch.
Indeed it would be nice if there was an easy way to split a git annex repository into smaller ones, while those smaller ones also obtain all the git-annex branch availability/metadata information about the files they inherit. The situation comes up quite frequently whenever it is desired to modularize bigger repositories. The simplest use case is to make a specific subdirectory into a git/git-annex submodule. Is there a way/recipe to easily accomplish also moving all git-annex branch metadata? And the original repository should get those files removed within its git tree.
One possible way we see is to clone the original repository, remove all other files, move subdirectory files "up" the needed number of directories, and then rewrite git history to forget and then use annex forget, but that one wouldn't "forget" information about the files which are not in the current tree, so it would also require some manual trimming of the git-annex branch before annex forget.
But maybe there is a better way?
I have my repositories set up like this:
local repository (local) <-> bare backup repository (bkp)
My repository has several subfolders dir1, dir2, dir3, .., that I want to split up in separate repositories.
What I did was this:
Now I'm left with the only directory that I want to keep. My plan would be to run
Repeat for dir1 .. dirN.
I know it's a lot of back and forth copying but apart from that are there any downsides with this approach?
Thank you!
@jochen.keil, your use of drop --force risks data loss if a file is not backed up to a remote. I don't think the --force is necessary. And it should be possible to use git annex dropunused instead, after the filtering.
If you are doing this in several copies of the repo, you will end up with multiple git-annex repos that all have the same annex.uuid. While you intend to keep these repos separate, that's still somewhat asking for trouble.
You have not filtered the git-annex branch at all, so it still has information about files that you have filtered out of the git history otherwise.
To solve both, I think you could git-annex reinit with a new uuid you make up, git annex fsck --fast to update its location tracking to use the new uuid, and then git-annex dead the old uuid, and all remotes (such as bkp). Then git annex forget will clean out the data in the git-annex branch about the files you've filtered out.
Hello,
I want to split my big git annex repo into smaller ones in order to work around scalability issues (too many files), and to enable simpler synchronization rules and authorizations. As a first step, I'd like to split out my family pictures in a separate repo which does not know about the other content.
Based on the instructions from this page, I came up with the test script below. It works, thank you!
However, there is a drawback: in order to really forget the non-pictures content, I had to also forget about an old offsite backup (a git clone), which holds (some of) the pictures. Is there a way to keep tracking the pictures in this backup?
I think I wouldn't have had the issue if this backup had been a special remote instead of a clone.
Exactly what does this command do? What metadata is regenerated and what does "regenerated" in this context mean?
Hi,
I went through the hassle of splitting up repositories once again. I wrote earlier about it (comment 4) and got some valuable input from Joey!
My repositories contain pictures only. They are organized by YYYY/MM/DD/YYYY-MM-DDTHH:MM:SS_Filename.ext where the YYYY directory is also the repository (i.e. contains the .git subfolder). Every repository is backed by a bare repository which is located on a much larger, but slower, zfs RAID. Here's an example:
My problem was that I had a bunch (hundreds) of files that I didn't want to be there. So, the first step was to create a list of globs of all files that I wanted to move away, for example:
I used bash to expand those globs to a list of files:
Then I cloned the original repository (here: 2020) twice, and removed the remote:
I also created two new bare repositories:
Where I hardlinked the annexed objects from the original bare repository (2020.git):
After that I went back to the first clone, added the new remote and fixed the file location using fsck:
Once that was done I could easily remove the files:
Fixing up the second repository requires a second list of files: everything that's left over in 2020.new.1:
Now in the second repository I did the same as for the first but with files.2.txt:
Before removing the old repositories (2020 and 2020.git) I double checked that all files are still around:
Finally, there should be no differences and it's safe to rm -rf 2020 and 2020.git.
Hope this helps. Any comments and suggestions welcome!