To make sure we can archive our data safely, we need to:
1. Store revisions
2. Allow files to be tracked while moved to archival spaces
3. Be platform-agnostic
4. Sync
5. Protect against bit-rot
1 and 3 are handled by git itself; everything is a straightforward graph structure made up of plain-text pointers (accepting that some filesystems do not easily expose file metadata, but that's on them, as we can simply choose to use a different system if that's important).
2 and 4 seem to be handled by git-annex.
But 5 is missing.
Thankfully, we already have a technology that can fill in elegantly here: parity files.
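As a rough illustration of why parity data protects against bit-rot, here is a toy sketch using a single XOR parity block. This is a deliberate simplification: PAR2 uses Reed-Solomon coding and can survive more than one damaged block, while plain XOR can rebuild only one. The helper name `xor_blocks` and the sample blocks are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three equal-sized data blocks (hypothetical file contents).
data = [b"cont", b"ract", b".pdf"]

# The parity block is stored next to the data.
parity = xor_blocks(data)

# Suppose block 1 is lost or corrupted: XOR-ing the surviving blocks
# with the parity block reconstructs it.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
```

The extra storage is one block no matter how many data blocks there are, which is the efficiency argument for parity over keeping full copies.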
Two potential user stories:
Put everything together
- This user wants everything together and in the filesystem in case one of the tools she relies on disappears.
Might have a structure like this:
- Project
  - documents
    - contract.pdf
    - contract.pdf.vol000+01.par2
    - contract.pdf.vol001+02.par2
    - contract.pdf.vol003+04.par2
    - Client brochure.zip
    - Client brochure.zip.vol000+01.par2
    - Client brochure.zip.vol001+02.par2
    - Client brochure.zip.vol003+04.par2
Or like this:
- Project
  - documents
    - contract.pdf
    - Client brochure.zip
    - documents.vol000+01.par2
    - documents.vol001+02.par2
    - documents.vol003+04.par2
Keep everything clean
- This user doesn't want to clutter folders with extra files. He would rather have only the data files themselves, in case they need to be zipped and sent to clients. If he used the first setup, he would delete *.par2 before zipping, leading to potential data loss.
- Might have a structure like this:
- Project
  - documents
    - contract.pdf
    - Client brochure.zip
  - [git-annex]
    - contract.pdf.vol000+01.par2
    - contract.pdf.vol001+02.par2
    - contract.pdf.vol003+04.par2
    - Client brochure.zip.vol000+01.par2
    - Client brochure.zip.vol001+02.par2
    - Client brochure.zip.vol003+04.par2
This would also enhance the data-checking capabilities of git-annex, as data loss could be fixed and new parity files generated from the recovered files transparently, self-healing the archive.
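The two layouts from the user stories could be expressed as a small path-mapping rule. The following is only a sketch: the function `parity_base`, the layout names `"together"` and `"clean"`, and the `.parity` directory are hypothetical (the post only shows a `[git-annex]` placeholder), and paths are assumed to be relative to the repository root. Unlike the flat listing in the second story, this sketch mirrors the data file's subdirectory inside the parity directory so that two files with the same name cannot collide.

```python
from pathlib import PurePosixPath


def parity_base(data_file, layout, parity_dir=".parity"):
    """Return the base .par2 path for a data file under a given layout.

    data_file is assumed to be a path relative to the repository root.
    """
    data_file = PurePosixPath(data_file)
    par2_name = data_file.name + ".par2"
    if layout == "together":
        # Parity files live right next to the data file.
        return data_file.with_name(par2_name)
    if layout == "clean":
        # Parity files live in a separate tree that mirrors the data tree.
        return PurePosixPath(parity_dir) / data_file.parent / par2_name
    raise ValueError(f"unknown layout: {layout!r}")


together = parity_base("Project/documents/contract.pdf", "together")
clean = parity_base("Project/documents/contract.pdf", "clean")
```

The recovery volumes (`.vol000+01.par2` and so on) would be named by the parity tool from this base name.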
git repositories don't contain parity files for their data. Instead, git relies on multiple copies of the repository to keep things safe. Not as efficient as parity files, but a lot easier, and protects against many more disasters than do parity files. git-annex takes the same approach. Lots Of Copies Keeps Stuff Safe.
Even if git-annex started generating parity files for its objects, the git repository would still not have them, so bit flips could still corrupt your git-annex repository.
Nothing stops you from writing git hooks that maintain parity files alongside all the files in a git repository. If you do that, you'll get parity files for the git-annex files too. But I don't see this being needed in git-annex itself and AFAICS there are plenty of hooks in git and git-annex to allow doing that.
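A hook along those lines might look roughly like the sketch below, written as if it were called from a git pre-commit hook. It assumes the par2cmdline tool is installed as `par2` and uses its `create` command with the `-r` redundancy option; the function names and the 10% default are made up for illustration. Staging the generated files and refreshing parity for files that already have it are left out.

```python
import subprocess


def needs_parity(path):
    """Skip the parity files themselves so they don't get parity of their own."""
    return not path.endswith(".par2")


def par2_create_command(path, redundancy_percent=10):
    """Build the argv for creating parity files next to a data file."""
    return ["par2", "create", f"-r{redundancy_percent}", f"{path}.par2", path]


def staged_files():
    """List files added, copied, or modified in the index (NUL-separated)."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "-z", "--diff-filter=ACM"],
        check=True, capture_output=True, text=True,
    )
    return [p for p in result.stdout.split("\0") if p]


def main():
    """Entry point a hook script would call; not invoked automatically here."""
    for path in filter(needs_parity, staged_files()):
        subprocess.run(par2_create_command(path), check=True)
```

In a git-annex repository the staged paths are typically pointers or symlinks to the annexed content, so whether `par2` ends up reading the real content depends on how the repository is set up; that part is untested here.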
Unless I'm not understanding correctly, Git and git-annex have different expected use-cases.
With Git, it is assumed that you will have a repository for each contained project, and keep the number of files small for easy replication across devices (which leads to multiple copies). With git-annex, it's possible (and seems more convenient for setup/clients) to have a monolithic repository where parts of it are replicated to devices. Yes, it's best practice to have multiple complete copies, but as the repository grows to 4TB, 12TB, or more, that becomes much less likely, especially for an 'I guess I'm not a casual user any more' user who has simply been purchasing hard drives and maybe a NAS when needed.
Certainly, I'm just being selfish: git-annex looks like an excellent answer to my ongoing question of 'how could I do files better', and this suggestion is squarely aimed at my personal wishlist. If checking integrity and self-healing large-file repositories doesn't interest anyone else then I'll look into rolling my own solution to share. In the meantime, which git-annex hooks do you suggest I look at? (Bonus points if you can share the high-level logic that would make this work)
I've wanted parity files in git-annex for a really long time, but never asked about it because I expected the response it got :P
So there are other people interested in this!
The post-update-annex hook is run whenever the git-annex branch changes, so any change to the annexed files stored in the repository will be followed by this hook running.
But then how would that hook figure out which content files have been added, in order to add parity files for them? Well, someone else had a similar type of redundancy information they wanted to add; see "Log function to enumerate all recent git-annex changes" for a thread about it.
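One possible way for such a hook to work out which keys were touched is to look at the paths changed on the git-annex branch, for example from `git diff --name-only <old> <new>` between two of its commits. The sketch below assumes that per-key location logs on that branch are stored as `xxx/yyy/<key>.log`, with other bookkeeping files (such as `uuid.log` at the top level) laid out differently; that layout is an assumption here and should be checked against the git-annex internals documentation. The function name is hypothetical.

```python
from pathlib import PurePosixPath


def keys_from_branch_paths(paths):
    """Pick out annex keys from paths changed on the git-annex branch.

    Assumes location logs look like "xxx/yyy/<key>.log"; anything else
    (top-level logs, other per-key files) is ignored.
    """
    keys = []
    for raw in paths:
        path = PurePosixPath(raw)
        if len(path.parts) == 3 and path.name.endswith(".log"):
            key = path.name[: -len(".log")]
            if key not in keys:  # keep first-seen order, drop duplicates
                keys.append(key)
    return keys


changed = [
    "uuid.log",
    "f87/4d5/SHA256E-s3--abc.pdf.log",
    "f87/4d5/SHA256E-s3--abc.pdf.log.met",
]
keys = keys_from_branch_paths(changed)
```

A changed location log only says that a key's location information changed somewhere, not that the content was added to this repository, so the hook would still need to check whether the content is present locally (for instance with `git annex contentlocation <key>`) before generating parity for it.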