I worked out how to retroactively annex a large file that had been checked into a git repo some time ago. I thought this might be useful for others, so I am posting it here.
Suppose you have a git repo where somebody checked in a large file that you would like to have annexed. There are a bunch of commits after it, and you don't want to lose history, but you also don't want everybody to have to retrieve the large file when they clone the repo. This procedure rewrites history as if the file had been annexed when it was originally added.
This command works for me. It relies on the current behavior of git, which is to use a directory named .git-rewrite/t/ at the top of the git tree for the extracted tree. It will not be fast, and it will rewrite history, so be sure that everybody who has a copy of your repo is OK with accepting the new history. If the behavior of git changes, you can specify the directory to use with the -d option. Currently, the t/ directory is created inside the directory you specify, so "-d ./.git-rewrite/" should be roughly equivalent to the default.
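For example, to pin down the scratch directory explicitly (the filter body is elided here; the full command follows below):

```sh
# -d chooses where the temporary tree is extracted; a t/ subdirectory is
# created inside it, so this matches the current default location
git filter-branch -d ./.git-rewrite/ --tree-filter '<filter commands>' --tag-name-filter cat -- --all
```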
Enough with the explanation, on to the command:
git filter-branch --tree-filter 'for FILE in file1 file2 file3;do if [ -f "$FILE" ] && [ ! -L "$FILE" ];then git rm --cached "$FILE";git annex add "$FILE";ln -sf `readlink "$FILE"|sed -e "s:^../../::"` "$FILE";fi;done' --tag-name-filter cat -- --all
Replace file1 file2 file3... with whatever paths you want retroactively annexed. For example, to retroactively annex bigfile1.bin in the top directory and subdir1/bigfile2.bin, try:
git filter-branch --tree-filter 'for FILE in bigfile1.bin subdir1/bigfile2.bin;do if [ -f "$FILE" ] && [ ! -L "$FILE" ];then git rm --cached "$FILE";git annex add "$FILE";ln -sf `readlink "$FILE"|sed -e "s:^../../::"` "$FILE";fi;done' --tag-name-filter cat -- --all
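The same tree filter can be spread over multiple lines, which may be easier to read and adapt (functionally equivalent as far as I can tell; the file list is just the example from above):

```sh
git filter-branch --tree-filter '
  for FILE in bigfile1.bin subdir1/bigfile2.bin; do
    if [ -f "$FILE" ] && [ ! -L "$FILE" ]; then
      git rm --cached "$FILE"
      git annex add "$FILE"
      # strip the extra ../../ that results from the filter running in .git-rewrite/t/
      ln -sf "$(readlink "$FILE" | sed -e "s:^../../::")" "$FILE"
    fi
  done
' --tag-name-filter cat -- --all
```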
If your repo has tags, you should take a look at the git-filter-branch man page's description of the --tag-name-filter option and decide what you want to do. By default this will rewrite the tags "nearly properly".
You'll probably also want to look at the git-filter-branch man page's section titled "CHECKLIST FOR SHRINKING A REPOSITORY" if you want to free up the space in the existing repo whose history you just changed.
Based on the hints given here I've worked on a filter that both annexes files and adds urls, via filter-branch:
https://gitorious.org/arand-scripts/arand-scripts/blobs/master/annex-filter
The script above is very specific, but I think there are a few ideas in it that can be used in general: the overall structure is a tree-filter script that is run via git filter-branch, or similar.
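A sketch of how such a tree-filter script would typically be invoked (the script path is an assumption):

```sh
# Run the custom filter over every commit on all refs, keeping tag names
# pointing at the rewritten commits
git filter-branch --tree-filter /path/to/annex-filter --tag-name-filter cat -- --all
```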
It uses find to make sure the only rewritten symlinks are for the newly annexed files; this way it is possible to annex an unknown set of filenames.

Using -c annex.alwayscommit=false and doing a git annex merge at the end instead might be faster.

One thing I noticed is that git-annex needs to checksum each file even if it was previously annexed (rather obviously, since there is no general way to tell whether the file is the same as the old one without checksumming). But in the specific case where we are replacing files that are already in git, we do actually have the sha1 checksum for each file in question, which could be used.
So, trying to work with this, I wrote a filter script that starts out by annexing everything in the first commit and continuously writes out sha1 <-> filename <-> git-annex-object triplets to a global file. When it starts on the next commit, it compares the sha1s in the index with those in the global file, and any matches are manually symlinked directly to the corresponding git-annex object, without checksumming.
I've done a few tests and this seems to be considerably faster than letting git-annex checksum everything.
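A minimal, hypothetical sketch of that idea (helper and cache names are made up; this is not the actual annex-ffilter script):

```sh
# The index already knows each file's git sha1, so content seen in an earlier
# commit can be symlinked to its annex object without re-checksumming.
CACHE=/tmp/sha1-to-link        # lines of "<blob-sha1> <symlink-target>"
touch "$CACHE"

annex_one () {
    FILE="$1"
    SHA1=$(git rev-parse ":$FILE")                            # blob sha1 from the index
    LINK=$(awk -v s="$SHA1" '$1 == s { print $2 }' "$CACHE")
    git rm --cached --quiet "$FILE"
    if [ -n "$LINK" ]; then
        # Same blob already annexed in an earlier commit: reuse its symlink target.
        # Caveat: the target is relative, so this only works at the same directory depth.
        ln -sf "$LINK" "$FILE"
        git add "$FILE"
    else
        git annex add "$FILE"                                 # checksums and annexes
        echo "$SHA1 $(readlink "$FILE")" >> "$CACHE"
    fi
}
```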
This is from a git-svn import of the (free software) Red Eclipse game project; there are approximately 3500 files (images, maps, models, etc.) being annexed in each commit, and around 5300 commits, which is why I really, really care about speed:
10 commits: ~7min
100 commits: ~38min
For comparison between the old and the new method (the difference should increase with the number of commits):
old, 20 commits: ~32min
new, 20 commits: ~11min
The script itself is a bit of a monstrosity in bash(/grep/sed/awk/git), and the files that are annexed are hardcoded (removed in forming $oldindexfiles), but should be fairly easy to adapt:
https://gitorious.org/arand-scripts/arand-scripts/blobs/master/annex-ffilter
The usage would be something like:
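For example (the script path is an assumption; the annex.alwayscommit trick is the one described above):

```sh
# Run the filter with automatic git-annex branch commits disabled, then merge
# the accumulated git-annex state once at the end
git -c annex.alwayscommit=false filter-branch --tree-filter /path/to/annex-ffilter \
    --tag-name-filter cat -- --all
git annex merge
```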
I suggest you use it with at least two orders of magnitude more caution than normal filter-branch.
Hope it might be useful for someone else wrestling with filter-branch and git-annex.
Thanks for the tip! One question though: how do I push this new history out to all my other annexes? All I managed to do was revert the rewrite, so the raw file appeared again...
I recently had the need of re-kind-of-annexing an unusually large repo (one of the largest?). With some tricks and the right code I managed to get it down to 19 minutes for 170000 commits, extracting ~8GB of blobs. I'm attaching the link here as I feel it might be helpful for very large projects (where git-filter-branch can become prohibitively slow):
https://www.primianotucci.com/blog/large-scale-git-history-rewrites
Hmm, guyz? Are you serious with these scripts?
- files are indexed as both removed and untracked and are still in place
- files are seen as untracked
- files are replaced with symlinks and are in the index
Make sure that you don't have annex.largefiles settings that would prevent annexing the files.
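For instance, a quick way to look for such settings (these are just the usual places they live):

```sh
# Any of these could keep the files out of the annex
git config --show-origin --get-all annex.largefiles
find . -name .gitattributes -not -path './.git/*' -exec grep -Hn largefiles {} +
```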
Wow, scary
Dilyin's comment is scary. It suggests bad things can happen, but is not very clear.
Bloated history is one thing.
Obviously broken repo is bad but can be (slowly) recovered from remotes.
Subtly crippled history that you don't notice can be a major problem (especially once you have propagated it to all your remotes to "recover from bloat").
More common than it seems
There's a case probably more common than people actually report: mistakenly doing git add instead of git annex add and realizing it only after a number of commits. Doing git annex add at that time will have the file duplicated (regular git and annex).

Extra wish: when doing git annex add of a file that is already present in git history, git-annex could notice and tell.
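A small check for that situation (bigfile.bin is a placeholder path):

```sh
# For each commit that touched the path, report whether it stored the file as a
# regular git blob or as an annex symlink
for c in $(git log --format=%H -- bigfile.bin); do
    mode=$(git ls-tree "$c" bigfile.bin | awk '{ print $1 }')
    case "$mode" in
        100644|100755) echo "$c: regular git blob" ;;
        120000)        echo "$c: annex symlink" ;;
    esac
done
```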
Simple solution?

Can anyone elaborate on the scripts provided here? Are they safe? What can happen if improperly used or in corner cases? Is there a simpler way, e.g. via .gitattributes? Thank you.
Been using the one-liner. Despite the warning, I'm not dead yet.
There's much more to do than the one-liner.
This post offers instructions.
First simple try: slow
It was slow (estimated >600s for 189 commits).
In tmpfs: about 6 times faster
I cloned the repository into /run/user/1000/rewrite-git, which is a tmpfs mount point. (The machine has plenty of RAM.)
There I also did git annex init, and git-annex found its state branches.

On the second try I did one additional step there, so that filter-branch would clean that, too.
There, the filter-branch operation finished in 90s on the first try and 149s on the second try. .git/objects wasn't smaller.

Practicing reduction on clone
This produced no visible benefit:
    time git gc --aggressive
    time git repack -a -d
Even cloning and retrying on the clone didn't help. Oh, but I should have done git clone file:///path as said in the git-filter-branch man page's section titled "CHECKLIST FOR SHRINKING A REPOSITORY".

This (as seen on https://rtyley.github.io/bfg-repo-cleaner/ ) was efficient:
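Presumably this is the cleanup sequence that page recommends; a sketch:

```sh
# Expire all reflog entries and repack aggressively so that objects only
# reachable from the pre-rewrite history are actually pruned
git reflog expire --expire=now --all
git gc --prune=now --aggressive
```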
.git/objects shrunk from 148M to 58M.

All this was on a clone of the repo in tmpfs.
Propagating cleaned up branches to origin
This confirmed that filter-branch did not change the last tree:
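One way such a check can look (this assumes the refs/original backup that filter-branch leaves behind):

```sh
# Identical tree hashes mean the final checked-in content is unchanged
git rev-parse master^{tree} refs/original/refs/heads/master^{tree}
```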
A plain push was, expectedly, refused. On origin, I checked out the hash of current master, then pushed the rewritten branch from the tmpfs clone. Looks good.
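A sketch of that sequence (the branch name and the forced push are assumptions):

```sh
# From the tmpfs clone: a plain push is rejected (non-fast-forward / checked-out branch)
git push origin master

# On origin (non-bare): detach HEAD so the branch ref can be updated by a push
git checkout --detach master

# Back on the tmpfs clone: push the rewritten branch, forcing the update
git push --force origin master
```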
I'm not doing the aggressive shrink now, because of the "two orders of magnitude more caution than normal filter-branch" recommended by arand.
Now what? Check if precious not broken
I'm planning to do the same operation on the other repos, then check that git annex sync works between all those fellows.

Joey, does this seem okay? Any comment?
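A few standard checks that should confirm nothing precious broke (run in each of the repos):

```sh
git annex sync       # exchange git and git-annex branch state with the other repos
git annex fsck       # verify that the annexed content matches its keys
git annex whereis    # confirm every annexed file still has at least one known copy
```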