This page tries to regroup a set of Really Bad Ideas people had with git-annex in the past that can lead to catastrophic data loss, abusive disk usage, improper swearing and other unfortunate experiences.
This could also be called the "git annex worst practices", but is different than what git annex is not in that it covers normal use cases of git-annex, just implemented in the wrong way. Hopefully, git-annex should make it as hard as possible to do those things, but sometimes, you just can't help it, people figure out the worst possible ways of doing things.
.git/annex directory, in the hope of saving
disk space, is a horrible idea. The general antipattern is:
git clone repoA repoB mv repoB/.git/annex repoB/.git/annex.bak ln -s repoA/.git/annex repoB/.git/annex
This is bad because git-annex will believe it has two copy of the files and then would let you drop the single copy, therefore leading to data loss.
The proper way of doing this is through git-annex's hardlink support,
by cloning the repository with the
git clone --shared repoA repoB
This will setup repoB as an "untrusted" repository and use hardlinks to copy files between the two repos, using space only once. This works, of course, only on filesystems that support hardlinks, but that's usually the case for filesystems that support symlinks.
- share .git/annex/objects across multiple repositories on one machine
- at least one IRC discussion
Probably no way to fix this in git-annex - if users want to shoot themselves in the foot by messing with the backend, there's not much we can do to change that in this case.
To quote the git-annex-reinit manpage:
Normally, initializing a repository generates a new, unique identifier (UUID) for that repository. Occasionally it may be useful to reuse a UUID -- for example, if a repository got deleted, and you're setting it back up.
git-annex-reinit can be used to reuse UUIDs for deleted repositories. But what happens if you reuse the UUID of an existing repository, or a repository that hasn't been properly emptied before being declared dead? This can lead to git-annex getting confused because, in that case, git-annex may think some files are still present in the revived repository (while they may not actually be).
This should never result in data loss, because git-annex does not trust its records about the contents of a repository, and checks that it really contains files before dropping them from other repositories. (The one exception to this rule is trusted repositories, whose contents are never checked. See the next two sections for more about problems with trusted repositories.)
The proper way of using reinit is to make sure you run
git-annex-fsck (optionally with
--fast to save time) on the
revived repo right after running reinit. This will ensure that at
least the location log will be updated, and git-annex will notice if
files are missing.
An improvement to git-annex here would be to allow reinit to work without arguments to at least not encourage UUID reuse.
When you use git-annex-trust on a repository, you disable some very important sanity checks that make sure that git-annex never loses the content of files. So trusting a repository is a good way to shoot yourself in the foot and lose data. Like the man page says, "Use with care."
When you have made git-annex trust a repository, you can lose data
by dropping files from that repository. For example, suppose file
present in the trusted repository, and also in a second repository.
Now suppose you run
git annex drop foo in both repositories.
Normally, git-annex will not let both copies of the file be removed,
but if the trusted repository is able to verify that the second
repository has a copy, it will delete its copy. Then the drop in the second
repository will trust the trusted repository still has its copy,
and so the last copy of the file gets deleted.
Either avoid using trusted repositories, or avoid dropping content
from them, or make sure you
git annex sync just right, so
other reposities know that data has been removed from a trusted repository.
Another way trusted repositories are unsafe is that even after they're deleted, git-annex will trust that they contained the files they used to contain.
Always use git-annex-dead to tell git-annex when a repository has been deleted, especially if it was trusted.
Feel free to add your lessons in catastrophe here! It's educational and fun, and will improve git-annex for everyone.