Yesterday I spent making a release, and shopping for a new laptop, since this one is dying. (Soon I'll be able to compile git-annex fast-ish! Yay!) And thinking about wishlist: dropping git-annex history.
Today, I added the git annex forget command. It has been lightly tested so far, seems to work, and is living in the forget branch until I gain confidence in it. It should be perfectly safe to use, even if it's buggy, because you can use git reflog git-annex to pull out and revert to an old version of your git-annex branch. So if you've been wanting this feature, please beta test!
I actually implemented something more generic than just forgetting git history. There's now a whole mechanism for git-annex doing distributed transitions of whatever sort is needed.
There were several subtleties involved in distributed transitions:
First is how to tell when a given transition has already been done on a branch. At first I was thinking that the transition log should include the sha of the first commit on the old branch that got rewritten. However, that would mean that after a single transition had been done, every git-annex branch merge would need to look up the first commit of the current branch, to see if it's done the transition yet. That's slow! Instead, transitions are logged with a timestamp, and as long as a branch contains a transition with the same timestamp, it's been done.
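A minimal sketch of that timestamp check, in Python rather than git-annex's own code. The `Transition` type, its field names, and `pending_transitions` are hypothetical names for illustration; the real log format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    name: str       # hypothetical label, e.g. "ForgetGitHistory"
    timestamp: int  # seconds since the epoch when the transition was requested


def pending_transitions(local_log, remote_log):
    """Return the transitions recorded in remote_log that local_log lacks.

    Two entries count as the same transition when both name and timestamp
    match, so no lookup of the branch's first commit is needed.
    """
    done = set(local_log)
    return [t for t in remote_log if t not in done]


local = [Transition("ForgetGitHistory", 1377600000)]
remote = local + [Transition("ForgetGitHistory", 1377700000)]
print(pending_transitions(local, remote))
```

The comparison is a set lookup on the log entries, which is the point of using timestamps: the cost does not depend on how long the branch history is.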
A really tricky problem is what to do if the local repository has transitioned, but a remote has not, and changes keep being made to the remote. What it does so far is incorporate the changes from the remote into the index, and re-run the transition code over the whole thing to yield a single new commit. This might not be very efficient (once I write the more full-featured transition code), but it lets the local repo keep up with what's going on in the remote, without directly merging with it (which would revert the transition). And once the remote repository has its git-annex upgraded to one that knows about transitions, it will finish up the transition on its side automatically, and the two branches will once again merge.
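Roughly, the idea looks like the following sketch. It models the index as a plain dict from file name to log lines, assumes the remote's changes are folded in with a line-wise union, and represents the resulting commit as a dict with no parents. All function names here are made up for illustration and are not git-annex's API.

```python
def union_merge(ours, theirs):
    """Merge two {path: [log lines]} mappings, keeping every distinct line."""
    merged = {path: list(lines) for path, lines in ours.items()}
    for path, lines in theirs.items():
        target = merged.setdefault(path, [])
        for line in lines:
            if line not in target:
                target.append(line)
    return merged


def integrate_untransitioned(local_index, remote_tree, transition):
    """Fold an untransitioned remote into the index, then redo the transition.

    The result is modelled as one new commit without parents, so the
    remote's old history is never merged in directly.
    """
    merged = union_merge(local_index, remote_tree)
    return {"tree": transition(merged), "parents": []}


def forget_history(index):
    """Hypothetical stand-in for the forget transition: keep content as-is."""
    return dict(index)


commit = integrate_untransitioned(
    {"uuid.log": ["laptop"]},
    {"uuid.log": ["server"], "remote.log": ["origin"]},
    forget_history,
)
print(commit)
```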
Related to the previous problem, we don't want to keep trying to merge from a remote branch when it's not yet transitioned. So a blacklist is used, of untransitioned commits that have already been integrated.
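The blacklist bookkeeping could be sketched like this. It assumes remote branch tips are given as a mapping from ref name to commit sha, and the `is_transitioned` and `integrate` callbacks are hypothetical placeholders for whatever the real code does.

```python
def integrate_new_remotes(remote_tips, is_transitioned, blacklist, integrate):
    """Integrate untransitioned remote tips that have not been seen before.

    remote_tips: {ref_name: commit_sha}
    blacklist:   set of untransitioned shas that were already integrated
    Returns the updated blacklist. Transitioned remotes are skipped here
    because they can be merged the normal way.
    """
    updated = set(blacklist)
    for ref, sha in sorted(remote_tips.items()):
        if is_transitioned(sha) or sha in updated:
            continue
        integrate(ref, sha)
        updated.add(sha)
    return updated


seen = []
tips = {"origin/git-annex": "aaa111", "backup/git-annex": "bbb222"}
blacklist = integrate_new_remotes(
    tips, lambda sha: sha == "bbb222", set(), lambda ref, sha: seen.append(ref)
)
print(seen, blacklist)
```

Running it a second time with the same tips does nothing, and a new commit on the untransitioned remote gets integrated once and then blacklisted in turn.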
One really subtle thing is that when the user does a transition more complicated than git annex forget, like the git annex forget --dead that I need to implement to forget dead remotes, they're not just telling git-annex to forget whatever dead remotes it knows about right now. They're actually telling git-annex to perform the transition one time on every existing clone of the repository, at some point in the future. Repositories with unfinished transitions could hang around for years, and at some future point when git-annex runs in the repository again, it would merge in the current state of the world, and re-do the transition. So you might tell it to forget dead remotes today, and then the very repository you ran that in later becomes dead, and a long-slumbering repo wakes up and forgets about the repo that started the whole process! I hope users don't find this massively confusing, but that's how the implementation works right now.
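To make that concrete, here is a toy model of a "forget dead remotes" transition, since that feature is not implemented yet. The log layout, the key name, and the repository names are all invented for illustration. The point is only that the transition is applied to whatever set of dead repositories is known at the time it runs, not at the time it was requested.

```python
def forget_dead(location_log, dead_repos):
    """Drop every dead repository from a {key: [repo, ...]} location log."""
    return {
        key: [repo for repo in repos if repo not in dead_repos]
        for key, repos in location_log.items()
    }


# Hypothetical log: one key, present in three repositories.
log = {"SHA256-s1--abc": ["laptop", "usb", "server"]}

# Run today on the laptop, when only the usb drive is dead.
today = forget_dead(log, {"usb"})

# Run much later in a slumbering clone, after the laptop has died too.
later = forget_dead(log, {"usb", "laptop"})

print(today)
print(later)
```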
I think I have at least two more days of work to do to finish up this feature.
I still need to add some extra features like forgetting about dead remotes, and forgetting about keys that are no longer present on any remote.
After git annex forget, git annex sync will fail to push the synced/annex branch to remotes, since the branch is no longer a fast-forward of the old one. I will probably fix this by making git annex sync do a fallback push of a unique branch in this case, like the assistant already does. Although I may need to adjust that code to handle this case, too.
For some reason the automatic transitioning code triggers a "(recovery from race)" commit. This is certainly a bug somewhere, because you can't have a race with only 1 participant.
Today's work was sponsored by Richard Hartmann.
I have some repos where due to some hiccups file versions (not in the working tree anymore) were lost and now they come up again and again when fsck is running. So I would be happy if I could make my repos forget these not available files via "git annex forget $key" and perhaps even have a better solution to show all objects with numcopies=0.
git annex fsck will shut up. git annex fsck in.
As discussed on irc. Fsck --all does check more than the working tree, and therefore for fsck to not complain this would be a worthy feature to be added. (git annex forget $key)
Another thing I found, which was annoying, is that I have objects in my annex that are not tracked anywhere, it seems. "git annex fsck --all" complains about not having access to the object. "git log --stat -S '$key'" doesn't have any record. "git annex fsck" has no issues and "git annex unused" comes up empty too. I'm not sure where these objects still reside, or how to remove this annoying failure.
So not only should "git annex forget $key" remove references from within all branches, but should also clean up the aforementioned loose objects, which are neither unused, nor available, nor referenced.
Is there any update on cleaning up object/file references to objects/content that are not present at all and are lost? I would love my git annex fsck --all to show current failures and not these old files all the time. Thanks