Situation/trouble
I have a set of big repos. Each full replica manages about 172.000 annexed files plus a number of small regular files in git history, for a total of about 1.8TB. At filesystem level, find reports about 924.000 entries (directories, files, symlinks).
That worked rather well for a while, except that a number of operations are slow (even outside git-annex, e.g. a plain find takes more than an hour the first time). Also, the git part got rather heavy: hundreds of megabytes that should have been annexed were committed as regular files, and vice versa.
The whole setup even survived some catastrophe rather well, for example 6 months ago when the first 1.5GB of one hard drive was accidentally overwritten. fsck with an alternate superblock fixed the lower level, while git annex repair fixed the rest nicely.
Last night, though, the git parts got corrupted and I'm struggling to get things back to a sane state. git log shows only recent history, then fails. Various attempts with git annex repair have failed so far. I'll try again, adding a new local "bare" git clone of the server as a remote for git annex repair to use.
Storage media and filesystems seem sane, still. Software has been unchanged for a long time:
- client runs Ubuntu 16.04 with locally compiled git-annex version: 6.20161001-gade6ab4
- server runs with locally compiled (in a Debian unstable chroot) git-annex version: 6.20161011-g3135d35
Considered solution
I'm considering:
- recreating a new set of replicas
- each replica on same filesystem as old one
- recreated only from the checked-out tree and the .git/annex/objects tree
- without copying data (re-reading on import is okay, but no room on a 2TB disk for duplicating 1.8TB)
- non-constraint: this will lose detailed history which is an inconvenience we can live with.
Solution, practically
(1) Assuming git annex fsck can take into account objects manually placed into .git/annex/objects:
mkdir $newrepo
cd $newrepo
git init
git annex init
cp -al $oldrepo/* $newrepo/ # ignores .git and other .* (dotfiles)
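mkdir -p $newrepo/.git/annex/objects # make sure the destination directory exists
cp -al $oldrepo/.git/annex/objects/* $newrepo/.git/annex/objects/ # destination is a directory, not a glob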
git annex fsck # will this find and use result of cp ?
# git annex fsck will also tell if some checked-out files lack their annexed data
git remote add ...(other replicas)...
git annex sync ...(other replicas)...
git annex unused # will tell if some files don't appear in checked-out tree?
(2) Else...
mkdir $newrepo
cd $newrepo
git init
git annex init
cp -al $oldrepo/* $newrepo/ # ignores .git and other .* (dotfiles)
cp -al $oldrepo/.git/annex/objects/* ${newrepo}.objectdup
git annex reinject --known ${newrepo}.objectdup # will that perform a copy? It must not in this case.
# or something like find "${newrepo}.objectdup" -type f -exec git annex reinject --known {} \;
git annex fsck # will tell if some checked-out files lack their annexed data
git remote add ...(other replicas)...
git annex sync ...(other replicas)...
# if some files in $oldrepo/.git/annex/objects/* don't appear in checked-out tree, they won't be picked up by reinject and will remain in ${newrepo}.objectdup
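Both variants rely on cp -al creating hard links rather than copies, and on the * glob skipping .git and other dotfiles. Here is a small sketch to check both on a throwaway directory before touching 1.8TB. It assumes GNU coreutils (cp -al, stat -c) and a filesystem that supports hard links. All directory and file names are made up for illustration.

```shell
#!/bin/sh
# Toy check only: paths and file names below are illustrative, not from the real repos.
set -eu

demo=$(mktemp -d)
oldrepo="$demo/old"
newrepo="$demo/new"

# Fake "old" replica: one object, one regular file, one dotfile.
mkdir -p "$oldrepo/.git/annex/objects/aa" "$oldrepo/docs" "$newrepo"
echo "payload" > "$oldrepo/.git/annex/objects/aa/key1"
echo "small file" > "$oldrepo/docs/readme.txt"
echo "hidden" > "$oldrepo/.hidden"

# The shell glob * does not match dotfiles, so .git and .hidden are skipped.
cp -al "$oldrepo"/* "$newrepo"/

# Hard-link the object store separately, into an existing directory.
mkdir -p "$newrepo/.git/annex/objects"
cp -al "$oldrepo"/.git/annex/objects/* "$newrepo/.git/annex/objects/"

# Same inode number on both sides means the data was not duplicated.
inode_old=$(stat -c %i "$oldrepo/.git/annex/objects/aa/key1")
inode_new=$(stat -c %i "$newrepo/.git/annex/objects/aa/key1")
# A link count of 2 means old and new names share one file.
link_count=$(stat -c %h "$newrepo/docs/readme.txt")

echo "old inode: $inode_old, new inode: $inode_new, link count: $link_count"
```

If the two inode numbers differ, cp fell back to copying (or the two trees are on different filesystems) and the 2TB disk would fill up.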
Questions
No one wins when a lot of time is spent on dead ends. Before I spend time testing if solutions 1 and 2 can work, is there any caveat to mention?
For example, perhaps one must clone from a common empty ancestor instead of creating independent annexes and then syncing them?
What else? Is the whole approach sane? Doomed? Anything simpler/better?
Thanks a lot.
My solution is very roundabout but preserves a lot of information. It did involve buying another drive (and exclusively using v5 indirect mode!).
I create a new repository (on the new drive) into which I import all the contents of the "keyfiles" (contents of .git/annex/objects). Then I create another repository with the filelinks (symlinks pointing to .git/annex/objects). Adding the keyfiles repository as a remote to this one lets me see which content is present and valid, which got corrupted, which is missing, etc.
Then I can move the valid content from this recovery annex into a proper annex and try to repair or find the corrupted or missing content.
Thanks @CandyAngel. This is similar to what I'm doing and somewhat validates it. I'm trying to repair on the same filesystem without a long recopy. I don't understand why your solution is specific to v5 indirect mode.
Indeed, git annex fsck can take into account objects manually placed into .git/annex/objects.
Let's create a repo:
Everything is fine.
Let's say this repo has its git structures broken and we rebuild it from the checked-out tree and .git/annex/objects. We'll lose tree history and location tracking history but recover content.
Ok, we have an empty repo. Let's import the tree.
Notice that git annex repair does not care about annexed objects, only history data.
But fsck notices the missing objects.
So, in a sense, git annex fsck and git annex repair operate on nearly independent things.
Let's get the annexed objects back.
Hooray, fsck notices that objects are back.
Conclusion
I can use approach (1). Extra benefit: it will notice if some files got corrupted on the filesystem.
With approach (2), any file that was corrupted on the filesystem would have been accepted as the correct content, and I'd prefer to avoid that.
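The experiment described in this reply can be sketched end to end on a throwaway repo. This is only a sketch under assumptions: a reasonably recent git-annex on the PATH, GNU cp, and a temporary directory that supports symlinks and hard links. Repo names, the file name big.bin and its content are made up. It prints only exit statuses and does not try to reproduce any particular fsck output.

```shell
#!/bin/sh
# Sketch only: repo names, file name and content are illustrative.
set -eu

# Skip cleanly when git-annex is not installed.
command -v git-annex >/dev/null 2>&1 || { echo "git-annex not found, skipping"; exit 0; }

work=$(mktemp -d)
cd "$work"

# 1. A small "old" repo with one annexed file.
git init -q old
cd old
git config user.email "demo@example.invalid"
git config user.name "demo"
git annex init "old" >/dev/null
echo "some annexed content" > big.bin
git annex add big.bin >/dev/null
git commit -q -m "add big.bin"
cd ..

# 2. A fresh repo rebuilt from the checked-out tree only (symlink, no object yet).
git init -q new
cd new
git config user.email "demo@example.invalid"
git config user.name "demo"
git annex init "new" >/dev/null
cp -a ../old/big.bin .
git add big.bin
git commit -q -m "import tree"

# fsck may report a problem here (content is missing), so keep set -e from aborting.
fsck_before=0
git annex fsck >/dev/null 2>&1 || fsck_before=$?

# 3. Hard-link the object store into place, then fsck again.
mkdir -p .git/annex/objects
cp -al ../old/.git/annex/objects/* .git/annex/objects/
fsck_after=0
git annex fsck >/dev/null 2>&1 || fsck_after=$?

echo "fsck exit status before objects: $fsck_before, after: $fsck_after"
```

Since fsck verifies checksums by default, a zero status after step 3 also indicates that the hard-linked content matches its key, which is the extra benefit of approach (1) mentioned above.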