Imagine putting a git-annex drive in a time capsule. In 20, or 50, or 100 years, you'd like its contents to be as accessible as possible to whoever digs it up.

This is a hard problem. git-annex cannot completely solve it, but it does its best to not contribute to the problem. Here are some aspects of the problem:

  • How are files accessed? git-annex is careful to add as little complexity as possible to accessing files in a repository. Nothing needs to be done to extract files from the repository; they are there on disk in the usual way, with just some symlinks pointing at the annexed file contents. Neither git-annex nor git is needed to get at the file contents.

    (Also, git-annex provides an "uninit" command that moves everything out of the annex, if you should ever want to stop using it.)

  • What file formats are used? Will they still be readable? To deal with this, it's best to stick to plain text files and the most common image, sound, etc. formats. Consider storing the same content in multiple formats.

  • What filesystem is used on the drive? Will that filesystem still be available? Whatever you choose to use, git-annex can put files on it. Even if you choose (ugh) FAT.

  • What is the hardware interface of the drive? Will hardware still exist to talk to it?

  • What if some of the data is damaged? git-annex facilitates storing a configurable number of copies of the file contents. The metadata about your files is stored in git, and so every clone of the repository means another copy of that is stored. Also, git-annex uses filenames for the data that encode everything needed to match it back to the metadata. So if a filesystem is badly corrupted and all your annexed files end up in lost+found, they can easily be lifted back out into another clone of the repository. Even if the filenames are lost, it's possible to recover data from lost+found.
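To illustrate the last point, here is a minimal sketch of how content with lost filenames could be matched back to its key. It assumes the files were added with the plain SHA256 backend, whose keys look like `SHA256-s<size>--<hexdigest>`; keys from the SHA256E backend also carry a file extension, which cannot be recovered from content alone, and non-checksum backends such as WORM cannot be rebuilt this way at all. The function names `sha256_key` and `index_recovered` are made up for this example and are not part of git-annex.

```python
import hashlib
import os


def sha256_key(path, chunk_size=1 << 20):
    """Rebuild a SHA256-backend style key from a file's content and size.

    Assumes the key layout SHA256-s<size>--<hexdigest>.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    size = os.path.getsize(path)
    return f"SHA256-s{size}--{digest.hexdigest()}"


def index_recovered(directory):
    """Map each reconstructed key to the path of a recovered file."""
    found = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            # Only hash regular files; skip symlinks and special files.
            if os.path.isfile(full) and not os.path.islink(full):
                found[sha256_key(full)] = full
    return found
```

Running `index_recovered("lost+found")` would give a key-to-file table that can be compared against the keys recorded in any surviving clone of the repository.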

Imagine a rather contrived doomsday scenario: the file paths and/or basenames are important and, for some reason, the symlinks are not present (perhaps they got deleted, or aren't supported). git and git-annex no longer exist and let's assume knowledge of git internals is not useful here. All the content is there, stored under hashed file names under .git/annex/objects.

I may be missing something obvious, but I think the options for restoring file paths include:

  • direct mode bypasses this issue; all the files are right there.
  • the WORM backend perhaps carries enough information in the object file names to work with.
  • file content/metadata may be sufficient to easily recreate a sensible directory structure in some cases, so no worries.

The first two options may represent compromises in various use cases, and the last may not be applicable or, if it is, practical. In lieu of these, the object-path mapping could trivially be backed up in plain text. Like I said, I may be overlooking something here that makes this unnecessary or even a non-concern (actually, I've convinced myself it's not a serious concern in most of the use cases I've considered, but crossing i's and dotting t's).
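A rough sketch of such a plain-text backup, assuming an indirect-mode repository on a POSIX system where each annexed file is a symlink whose target contains `.git/annex/objects/` and ends in the key name. The helper names `annex_path_map` and `write_path_map` are hypothetical, and paths containing tabs or newlines would need escaping that is not handled here.

```python
import os


def annex_path_map(repo_root):
    """Yield (relative path, key) for each symlink pointing into the annex."""
    for root, dirs, files in os.walk(repo_root):
        # Do not descend into the git directory itself.
        if ".git" in dirs:
            dirs.remove(".git")
        dirs.sort()
        for name in sorted(files):
            full = os.path.join(root, name)
            if not os.path.islink(full):
                continue
            target = os.readlink(full)
            if ".git/annex/objects/" in target:
                # The last component of the link target is the key.
                key = os.path.basename(target)
                yield os.path.relpath(full, repo_root), key


def write_path_map(repo_root, out_path):
    """Write one 'key<TAB>path' line per annexed file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for rel_path, key in annex_path_map(repo_root):
            out.write(f"{key}\t{rel_path}\n")
```

The resulting file could sit next to the data on the drive: someone with no knowledge of git would only need to find the object named in the first column and rename it to the path in the second.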

Comment by electrichead Mon Aug 25 15:51:00 2014