safely dropping git-annex history

the git-annex branch of a repository i've had running since 2010 has grown to unmanageable dimensions (5gb in a fresh clone of the git-annex branch, while the master branch has merely 40mb, part of which is due to checked-in files), so git-annex merges now take on the order of 15 minutes. getting an initial clone of the git-annex branch (not the data) takes hours in the "remote: Counting objects" phase alone (admittedly, the origin server is limited in ram, so it spends its time swapping the git process back and forth).

is there a recommended way to reset the git-annex branch in a coordinated way? of course, this would have to happen on all copies of the repo at the same time.

the workflow i currently imagine is

rename all copies of the repository (the_repo → the_repo-old, the_repo.git → the_repo-old.git)
clone the old origin repository to a new origin with --single-branch. (this would be the opportunity to run git filter-branch --prune-empty --index-filter 'git rm --cached --ignore-unmatch .git-annex -r' master as well, to get rid of commits of pre-whatever versions)
git annex init on the master repository
clone it to all the other copies and git annex init there
set all the configuration options (untrusted repos etc) again
either
- git annex reinject the files that are already present on the respective machines, or
- move the .git/annex/objects files over from the original locations, and use git annex fsck to make git-annex discover which files it already has, if that works. (i have numcopies=2, thus i'd dare to move instead of copy even when trying this out the first time. complete copies, even of partially checked out clones, will exceed the capacities of most clients)
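
The second option (moving the object store over and letting fsck rediscover it) could be sketched roughly as follows. The function name and both path arguments are hypothetical. The sketch assumes the old and new clones live on the same filesystem, so that mv is a rename rather than a copy, and that the new clone holds no annexed content yet.

```shell
# Hypothetical helper: move annexed content from an old clone into a
# freshly initialised one, then let git-annex rebuild its location log.
# Assumes both clones are on the same filesystem and that the new clone
# does not yet have a .git/annex/objects directory.
adopt_old_objects() {
    old_repo=$1
    new_repo=$2

    if [ -e "$new_repo/.git/annex/objects" ]; then
        echo "refusing: $new_repo already has annexed objects" >&2
        return 1
    fi

    mkdir -p "$new_repo/.git/annex" &&
    mv "$old_repo/.git/annex/objects" "$new_repo/.git/annex/objects" &&
    # fsck walks the annexed files in the work tree, verifies any content
    # it finds, and updates the location log accordingly
    (cd "$new_repo" && git annex fsck)
}
```

It would be called as adopt_old_objects /path/to/the_repo-old /path/to/the_repo, once per machine, after git annex init has been run in the new clone.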

my questions in that endeavor are:

is there already a standard workflow for this?
if not, will the above do the trick?
can anything be done to avoid such problems in future?

comment 1

Yes, you can use fsck like that. I outlined a similar approach here, and I think you don't even need to make new git repositories, just delete the old branch and git gc it -- but I've not heard of anyone doing this yet.

So, since 2010, your repo must have gone through at least one and probably two repository format changes, which bloated the git branch. Hopefully we'll have no more of those. My largest repo that also went through that is under 150 mb however.

There was a recent bug fix where git annex copy unnecessarily updated the location log even when the file was already copied. That kind of thing can bloat the repository, especially if you had that in a cron job... You might find git annex log useful to look through the history of files and see if there have been a lot of location changes logged for whatever reason.

Comment by joeyh.name — Wed Oct 31 16:03:55 2012

worked well

the procedure i outlined originally worked well for me; the method chosen for reinjection was moving over the .git/annex/objects directory and doing a git annex fsck.

special care had to be taken of the special remote (rsync+gpg) -- i guess that's why they are called special ;-) . as described in the forum post you linked, i had to copy over remote.log and the uuid.log line from the old git-annex branch -- otherwise, a git annex initremote would have generated a new hmac, effectively resetting the remote repo.

the formerly 5gb git-annex branch (admittedly not git gc'd recently, but that just wasn't feasible any more) shrunk down to around 25mb of current location information. i'll keep an eye on how it's growing to see if the problem is inherent or if it was just old bugs causing trouble.
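
To keep an eye on the branch's growth with plain git, something like the following could be run from time to time. The function name is made up for illustration; the two git commands it wraps are standard.

```shell
# Hypothetical helper: report how much history the git-annex branch
# carries and how large the object store of this clone is.
annex_branch_stats() {
    # number of commits reachable from the git-annex branch
    printf 'git-annex commits: %s\n' "$(git rev-list --count git-annex)"
    # loose and packed object counts and sizes for the whole repository
    git count-objects -vH
}
```

Note that count-objects covers every branch in the clone, so the numbers only isolate the git-annex branch in a clone that fetched nothing else.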

Comment by chrysn — Sun Nov 4 12:23:56 2012

drop "content removed from annex" history

Joey,

dropping the git-annex branch and a subsequent fsck worked. Moreover, my repository had grown to over 700k objects due to a silly cycle of git annex add / git annex unannex, which bloated both the git-annex and master history. To clean that up, I successfully performed a squashed rebase of master onto itself.

Here's what I did, in detail:

$ git checkout git-annex
$ cp *.log ..
$ git checkout master
$ git branch -D git-annex
$ git branch -D synced/git-annex
$ git checkout <first commit>
$ git checkout -b git-annex
$ cp ../*.log .
$ <remove the changes done in the first commit, my case just adding a .gitignore>
$ git add *.log
$ git commit --amend -m 'Init'

With this, I got rid of the many update commits. Now, the fun part:

$ git checkout master
$ git rebase -i <first commit>
<In the git-rebase-todo, I squashed almost everything, except a few commits I wanted to preserve>
<save and quit the editor, e.g. :wq in vim>

Rebase went fine, and I was left with a clean master. I brought also synced/master up to date:

$ git checkout synced/master
$ git reset --hard master

Now I re-created all the location links with fsck:

$ git annex fsck

And eventually, got rid of the redundant history:

$ git reflog expire --expire=now --expire-unreachable=now --all
$ git gc --prune=now
$ git repack
$ git prune

yay, 500k objects less ^_^'.

Comment by vjt — Tue Jun 18 02:12:01 2013

would git annex fsck also recreate location tracking info for special remotes?

Suppose I want to drop history like discussed in this entry. How well does this deal with special (e.g. directory on an offline disk) remotes?

Comment by Michael — Wed Jul 17 17:02:50 2013

comment 5

Also, would simply squashing git-annex branch history (without fsck etc) work? This seems easier.

Comment by Michael — Wed Jul 17 17:06:39 2013

comment 6

You can use git annex fsck --from remote to verify that every file that location tracking thinks is on the remote still is there. It's inefficient though -- it has to download the whole file to check the special remote still has the right content! That transfer can be avoided by adding --fast.

This is documented in the man page.
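
As a sketch, the quick check of a special remote would look something like this. The function name and the remote name passed to it are placeholders; the fsck options are the ones named above.

```shell
# Hypothetical wrapper around the command described above.
# $1 is the name of the (special) remote to check.
check_remote_fast() {
    # --fast avoids downloading the content and only checks that the
    # remote still reports having each file
    git annex fsck --from "$1" --fast
}
```

For a directory special remote on an offline disk this would be run while the disk is attached, e.g. check_remote_fast mydisk, where mydisk is a made-up remote name.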

Comment by joeyh.name — Wed Jul 17 18:57:12 2013

comment 7

I don't see any reason why squashing git-annex branch history would not work. If you squash it to the same sha in each clone, things would be very happy, but even if you squash it to different shas, the union merge should result in those different versions of the same data automatically merging together.

Comment by joeyh.name — Wed Jul 17 18:59:03 2013

git annex merge driver?

I've tried rebasing git-annex branch, and I hit a bunch of conflicts (both in uuid.log and for individual content file logs) of the form:


<<<<<<< HEAD
1369615760.859476s 1 016d9095-0cbc-4734-a498-4e0421e257d7
=======
1369615760.845334s 1 016d9095-0cbc-4734-a498-4e0421e257d7
>>>>>>> 52e60e8... update
1369615359.195672s 1 38c359dc-a7d9-498d-a818-2e9beae995b8

As I understand, git-annex has a special timestamp-based merge driver to deal with these. Is there a way to use that with git rebase?

Comment by Michael — Fri Jul 19 17:04:53 2013

git checkout --orphan

Instead of rebase, --orphan seems to be the right answer for pruning history: create a new git-annex orphan branch and git add and commit the files. So:



git status
# verify there are no uncommitted or untracked files

# master branch
git branch -m old-master
git checkout --orphan master
git add .
git commit -m 'first commit'

# git annex branch
git branch -m git-annex old-git-annex
# check out the renamed branch so its files are in the work tree
git checkout old-git-annex
git checkout --orphan git-annex
git add .
git commit -m 'first commit'
git checkout master

# at this point, you may want to double-check that everything is still OK

# finally, remove branches and clean up the objects:
git branch -D old-master old-git-annex
git reflog expire --expire=now --all
git prune
git gc

The repo remains functional and .git is smaller.

Comment by Michael — Fri Jul 19 17:49:37 2013

comment 10

Since you seem to have found a way that works, I'm only answering for completeness: To resolve a conflicted merge on the git-annex branch, you can just add all lines present in either side of the merge, in any order. This won't necessarily be the most minimal resolution, but it is guaranteed to always be a valid one. There is a git-union-merge program in the git-annex source (not built by default) that can do that when merging any set of branches.
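
For a conflicted log file like the one quoted earlier in the thread, adding all lines present in either side amounts to dropping the conflict markers and keeping every remaining line once. A minimal sketch with standard tools follows; the function name is made up, and it assumes git's default conflict style and that no log line itself starts with a conflict marker.

```shell
# Hypothetical helper: resolve a conflicted git-annex log file by union.
# Prints every non-marker line once; the order does not matter for
# these logs, so sorting is only used to drop duplicates.
union_resolve() {
    grep -v -E '^(<<<<<<<|=======|>>>>>>>)' "$1" | sort -u
}
```

During a rebase, each conflicted file could then be handled with union_resolve uuid.log > uuid.log.tmp && mv uuid.log.tmp uuid.log && git add uuid.log. This is not the timestamp-aware merge that git-annex performs itself, just the always-valid resolution described above.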

Comment by joeyh.name — Sat Jul 20 20:03:20 2013

propagating squashed history to other remotes

The easiest method seems to be to force-push git-annex and master to other remotes, e.g.

git push -f myremote git-annex:refs/heads/git-annex

Before doing this, make sure location logs etc had a chance to propagate across all remotes.

It's a good idea to remove the synced/ branches before running git annex sync on the repos with rewritten history, too:

git branch -D synced/master
git branch -D synced/git-annex

Comment by Michael — Mon Aug 12 01:44:59 2013

comment 12

git annex forget automates this now, without needing to force-push or have a flag day. Needs a version of git-annex supporting it installed on all the computers you use the repo on. Repos notice they need to forget when git annex is run in them, and do, automatically.
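
In practice that might look like the sketch below, run in any one clone. The wrapper name is hypothetical, and the --force flag reflects my understanding that current versions ask for it before rewriting the branch; check git annex forget's man page for the installed version.

```shell
# Hypothetical wrapper: rewrite the git-annex branch without its history,
# then sync so that other clones learn they should forget as well.
forget_annex_history() {
    # --force confirms the rewrite (assumption, see the note above);
    # adding --drop-dead would also prune references to repositories
    # that have been marked dead
    git annex forget --force &&
    git annex sync
}
```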

Comment by joeyh.name — Wed Sep 4 06:38:00 2013

Are these methods still working?

I am new to git annex. I like the concept of partial checkout, given that I have multiple storages.

My repo is small right now, 2000+ files collectively amounting to 24 MB in size. However, as soon as I add and unlock the repo, the size of the annex folder becomes 50+ MB. Is this normal? When I sync with Google Drive, the size goes to 90+ MB after syncing.

I feel like I must be doing something wrong. Should the size grow up to 4 times the original data? I tried the methods listed in the comments above, and none worked. Seeing that the last comment is 4+ years old, I wanted to know if these methods are still viable.

Comment by Rizwan — Tue Jun 12 14:58:03 2018

comment 14

@Rizwan yes, the things documented here still work. But they only drop old history about the past locations of files; they do not drop information about the current locations of files. git-annex is keeping track of which files you have stored in the local repository and on the remote, and that information necessarily takes up space.

You're using git-annex with small files -- around 12 kilobytes average file size. It's designed for mostly much larger files, so you will see a larger percentage of overhead in using it than you would if your files were bigger.

Comment by joey — Tue Jun 12 16:11:27 2018
