Hello everyone,
We are distributing large datasets with git-annex on our own infrastructure (for scientific purposes) and we need to implement long-term backups at a different site. What is most important for us is to back up the original data files, which are expensive to produce. These backups should only be retrieved in case of a disaster, so retrieval costs and delays don't really matter. What matters most is keeping storage costs as low as possible. Amazon Glacier is therefore a good fit for this purpose.
At first sight, the easiest way is to use the glacier special remote, as it makes it very easy to back up the files we need with `git annex copy`. Another advantage is hybrid encryption: git-annex allows flexible end-to-end encryption with redundant decryption keys.
However, there seem to be a few pitfalls:
- What if a disaster occurs and we lose the git repository, including the whole history and the git-annex branch? Could we still recover the data itself with this strategy? First assuming no encryption, then assuming encryption and that we have access to only the cipher (stored in remote.log) and at least one decryption private key. Since we would not be able to `git-annex get` the files, it will be painful, but I am wondering if this is even possible at all, as for instance Glacier vaults don't seem to store filenames...
- Datasets might undergo updates after being archived. We are not very interested in backing up these updates, or only very infrequently, as Glacier archives are meant to back up the underlying data in our case. But what if e.g. some large files are moved or merged? Will this break `git annex get`?
- glacier-cli does not seem to be actively maintained. For instance, it breaks with python3 (I had to use python2.7 to `git annex get` files). Glacier is meant for long-term backups. Wouldn't it be a problem if breaking changes occur at some level (AWS/boto) and glacier-cli remains outdated? That seems more plausible than a failure of Glacier itself.
Based on these considerations, I am wondering whether we should consider other options, including :
- not using glacier but s3 w/ glacier storage class
- not using git-annex at all for the archives but rclone for instance
I am also wondering whether glacier would be suited to frequent backups without uncontrolled costs, even though this is not really our need here; the most important thing is to have a copy of the data after it has been organised and cleaned up.
Thank you for your insights !
Realistically, you do need the git branches to restore, as well as the contents of the special remote. This applies to git-annex and special remotes in general. Otherwise you'll have all the content, but not the filenames the content belongs to. It is relatively easy to back the git branches up though: they're small (metadata only, if you add all data files through git-annex) and git being natively distributed makes it easy. You do need to do it though, which means that relying on the Glacier special remote for disaster recovery isn't enough on its own.
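As one possible way to keep a copy of the git branches, a bundle file can hold every ref (including the `git-annex` branch) in a single file that can be stored anywhere. The sketch below is only an illustration: it assumes `git` is on the PATH, and the function names `bundle_command` and `backup_branches` are made up for this example.

```python
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def bundle_command(bundle_path: PathLike) -> List[str]:
    """Build the git command that writes all refs into one bundle file."""
    # "--all" selects every ref, so the git-annex branch is included too.
    return ["git", "bundle", "create", str(bundle_path), "--all"]


def backup_branches(repo_dir: PathLike, bundle_path: PathLike) -> None:
    """Create the bundle from the repository at repo_dir, then verify it."""
    # Resolve the path first, because git runs with repo_dir as its cwd.
    bundle = Path(bundle_path).resolve()
    subprocess.run(bundle_command(bundle), cwd=repo_dir, check=True)
    # "git bundle verify" checks that the bundle is valid and usable.
    subprocess.run(["git", "bundle", "verify", str(bundle)],
                   cwd=repo_dir, check=True)


# Example call (hypothetical paths):
# backup_branches("/data/my-dataset", "/backups/my-dataset.bundle")
```

The resulting file only contains git metadata, not the annexed file contents, so it should stay small. It can later be used as a source for `git clone` or `git fetch`.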
git-annex uses content-addressed storage. If you make a change (see `git annex lock` and `git annex unlock`), you're effectively adding a new file, and you can decide whether to keep the old file or not. git-annex will handle whichever you leave in your git branch, regardless of what it's named.

I'm the glacier-cli author and I'm still around. Are you the person who filed the issue on Python 3 support the other day? I haven't had a chance to look yet. Mostly glacier-cli is trivial (it's really not much code - take a look!), "done", and I'll take PRs to fix it to work against the latest boto library version, which is what I assume is the problem. I would have written it against Python 3 to start with, but boto didn't support Python 3 at the time. glacier-cli is designed to be a "thin" wrapper: it doesn't use any special format, and your git-annex data just ends up in Glacier as-is, so it should be extractable with relatively straightforward effort with any other tool, not just glacier-cli. However, glacier-cli is still just something I wrote and maintain in my spare time, so if you aren't confident diving in to fix something if I'm not available or vanish, and you're not sure about finding someone else to do that either, then that's fair enough.
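To illustrate the content-addressed storage point above, here is a simplified sketch of how a key can be derived from file content alone. It imitates the general shape of a git-annex SHA256 key (backend name, size, hash), but it is an approximation for illustration, not git-annex's actual code; the function name `annex_style_key` is made up.

```python
import hashlib


def annex_style_key(content: bytes) -> str:
    """Derive a key from the content only; the filename plays no part."""
    digest = hashlib.sha256(content).hexdigest()
    # Assumed layout for illustration: BACKEND-s<size>--<hash>
    return f"SHA256-s{len(content)}--{digest}"


original = b"raw measurements, version 1\n"
modified = b"raw measurements, version 2\n"

# The same bytes give the same key, whatever the file is called or
# wherever it is moved in the tree.
key_a = annex_style_key(original)
key_b = annex_style_key(original)

# Changed bytes give a different key, i.e. effectively a new file.
key_c = annex_style_key(modified)

print(key_a == key_b)
print(key_a == key_c)
```

Because the key depends only on the bytes, moving or renaming a file leaves the archived copy usable, while an edited file is stored under a new key alongside (or instead of) the old one.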