I am interested in using `git annex` to manage encrypted backups to Amazon S3/Glacier. So `git annex` will be used with the main file directory in direct mode and an encrypted S3 or Glacier remote set up in archive mode, and then `git annex add .` and `git annex sync` will be run periodically. The intent is for this setup to be a backup for catastrophic failure, so I want to make sure I take care of future-proofing and disaster recovery properly. So my basic question is what would I need to have backed up and what would I have to do if the computer with the main repository died. I try to break that out into more specific questions below.
S3/Glacier remotes store the contents of `.git/annex/objects` in encrypted form with hashes for file names and nothing else (other than a uuid). The hashes do not match the keys in the main repo. Are they the same keys encrypted? Is there a way to look up the S3 file name corresponding to a file in the repo?
For `shared` encryption, I see the cipher text in `remote.log` in the `git-annex` branch. Assuming I didn't have access to `git annex`, what would I need to do to convert that cipher text into a form that I could use with `gpg` to decrypt files?
Same question but for `hybrid` encryption rather than `shared`. I assume the answer is similar but I need to decrypt the cipher first with my gpg key? How do I do that?
Assuming I did have access to `git annex`, what would I need to create a new repo on a new computer with access to all of the files in the S3/Glacier bucket? I think I would need my Amazon credentials (possibly already embedded in the git repo), my gpg key if using hybrid or public key encryption, and the `.git` folder as it was the last time files were pushed to the S3/Glacier remote (which would have the necessary decryption information for shared encryption). Is that right? I guess mainly I am checking that the remote does not store any metadata about the repo, so for `git annex` to be able to pull files back out I would need a backup of the `.git` directory and that backup would need to be up to date (can't just copy `remote.log` and have `git annex` work out the rest from the remote's contents). So for a full backup, my script would need to `tar` the `.git` directory, encrypt it, and push it to S3/Glacier separately after `git annex` does a sync. Then I could recover everything as long as I had a secure backup of my Amazon credentials and my encryption key(s).
That's what's "special" about special remotes vs regular git remotes: They only store the content of annexed files and not the git repository. Back up the git repository separately (and your gpg key if it's being used, and the credentials if you didn't use `embedcreds=yes`).
To use your backup, you can make a clone of the backed up git repository and use `git annex enableremote` to enable it to use the special remote.
See encryption for details of how the encryption is implemented. I've seen people follow that and manually use the data from the git repo to decrypt files, but I don't have a pointer to an example at the moment.
Thanks for the response.
With `pubkey` encryption, I am able to decrypt the remote's files normally using just `gpg --decrypt` with the key pair used to encrypt them in my keyring. One thing I don't understand about `pubkey` encryption: what is in the `cipher=` entry in `remote.log` after the 256 bytes representing the HMAC cipher? The `gpg` key pair is used for encryption, so there is no encryption cipher to put in `remote.log` after the HMAC cipher, I would have thought.
I understand the basic encryption setup, but I don't know how to use `gpg` to work with a raw cipher or raw encrypted text. For example with `pubkey` encryption, my interpretation is that the `cipher=` entry in `remote.log` is encrypted with the public key, but I can't just pass the entry to `gpg --decrypt` because `gpg` expects the encrypted input to be formatted in a way that specifies which key to use to decrypt. Otherwise it says `gpg: no valid OpenPGP data found.` Similarly for `shared` encryption, I don't see how to pass the second half of the `remote.log` `cipher=` entry to `gpg` to decrypt the remote's files.
I am also having trouble generating the special remote's keys. I created a `directory` remote with `shared` encryption (so that the `cipher=` entry in `remote.log` would not be encrypted). Then I added one file called `tmp.txt` which was stored in the annex as `SHA256E-s9--9c73fdec185d79405f58fc8b4e0ac22fa5ed2de7b7611a61b37606c905509650.txt`.
I did `git checkout git-annex -- remote.log` and then tried the following:
but the output does not match the GPGHMACSHA1 file name in the remote, and I don't understand why. I tried other variations as well (dropping the `.txt` or the `SHA256E-s9--` prefix) but they did not work either.
Small update: I have gotten the HMAC and decryption steps to work for shared encryption. I didn't figure out what was wrong with my HMAC command above yet, but I was able to reproduce the special remote's keys in Python using the `hmac` and `hashlib` libraries. I was trying to hash the correct string in the comment above (the full key with the `SHA...` prefix and the file extension suffix). For decryption, I used `gpg` on the encrypted file in the special remote and passed it the cipher from `remote.log` with the first 256 bytes removed as the passphrase (in the format returned by `base64.b64decode()` in Python).
I still need to figure out how to decrypt the ciphers for `pubkey` and `hybrid`.
I will try to put together a tip with the steps needed to reproduce special remote keys and to decrypt special remote files using only command line tools after that (assuming I can translate the Python steps back to command line tools; otherwise part of the steps will be in Python).
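The Python approach described above for `shared` encryption can be sketched roughly as follows. This is only a sketch under assumptions: that the base64-decoded `cipher=` value splits into a 256-byte HMAC key followed by the gpg passphrase, and that the file name on the remote is `GPGHMACSHA1--` followed by the hex HMAC-SHA1 of the full annex key. The function names and the random stand-in cipher are made up for illustration.

```python
import base64
import hashlib
import hmac
import os

HMAC_KEY_LEN = 256  # assumption: the first 256 decoded bytes are the HMAC key


def split_shared_cipher(cipher_b64):
    """Split a shared-encryption cipher= value into (hmac_key, gpg_passphrase)."""
    raw = base64.b64decode(cipher_b64)
    return raw[:HMAC_KEY_LEN], raw[HMAC_KEY_LEN:]


def remote_key_name(hmac_key, annex_key):
    """Assumed name of the encrypted object on the remote for an annex key.

    annex_key is the full key, including the SHA256E-s9-- style prefix
    and the file extension.
    """
    digest = hmac.new(hmac_key, annex_key.encode("utf-8"), hashlib.sha1).hexdigest()
    return "GPGHMACSHA1--" + digest


if __name__ == "__main__":
    # Random stand-in for the cipher= entry; a real one comes from remote.log
    # on the git-annex branch.
    fake_cipher = base64.b64encode(os.urandom(512)).decode("ascii")
    mac_key, passphrase = split_shared_cipher(fake_cipher)
    key = "SHA256E-s9--9c73fdec185d79405f58fc8b4e0ac22fa5ed2de7b7611a61b37606c905509650.txt"
    print(remote_key_name(mac_key, key))
```

The passphrase bytes returned by `split_shared_cipher` would then be what gets handed to `gpg` when decrypting the file of that name on the remote.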
Decrypting the cipher for `hybrid` and `pubkey` remotes is actually pretty straightforward: `echo -n <cipher_entry> | base64 -d | gpg --decrypt`. I think with that I have enough to put together a short summary of decrypting by hand the contents of `hybrid`, `pubkey`, and `shared` special remotes.
When using `pubkey`, the second 256 bytes of ciphertext are currently not used for anything.
For `hybrid` and `shared`, those 256 bytes of ciphertext are used as a symmetric cipher. So the gpg option to use for both encrypting and decrypting is `--symmetric`. gpg then prompts for a passphrase, and the ciphertext is what's used.
joey, thanks for answering my questions. I have put together a bash script to check my understanding of how the different encryption methods work. The script is below. With it, I am able to generate the encrypted keys for items for `hybrid` and `shared` encrypted `directory` remotes and to decrypt items for `hybrid`, `shared`, and `pubkey` encrypted `directory` remotes (and presumably other special remotes, but I tested with `directory` remotes since those were simplest). The only thing I have not been able to get to work is the generation of the encrypted item keys for `pubkey` remotes. Do you see what I am doing wrong for that case? Once I figure that out, I can make a tip entry on the wiki with the script.
I think it's great you're working on this, and future proofing is probably a good place to put the result.
I did some digging, and the cipher used for pubkey includes a trailing newline. The trailing newline seems to be lost in your script, and adding it makes it work.
I think the actual problem in the script is in this line:
I verified that decrypt_cipher outputs the trailing newline, but in capturing its output, the shell chomps it. Can't remember ATM if there's a way to avoid that in shell scripting.
Arguably git-annex should not be treating that newline as part of the cipher text, although it's probably too late to change that, certainly for existing repositories.
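The newline loss can be reproduced outside the script. The sketch below (in Python, with made-up sample data rather than a real cipher) captures the same command output once through shell command substitution and once directly as bytes. It assumes a POSIX `sh` is on the PATH. Capturing the bytes directly, rather than through `$(...)`, is one way to keep the trailing newline, and the HMAC comes out different depending on whether the newline is present.

```python
import hashlib
import hmac
import subprocess

# Hypothetical stand-in for a command such as decrypt_cipher whose output
# ends with a newline.
EMIT = "printf 'not-a-real-cipher\\n'"


def capture_direct():
    """Capture the command's stdout as raw bytes; nothing is stripped."""
    return subprocess.run(["sh", "-c", EMIT], capture_output=True, check=True).stdout


def capture_via_substitution():
    """Capture via $(...) inside the shell, which drops trailing newlines."""
    script = 'x="$(' + EMIT + ')"; printf "%s" "$x"'
    return subprocess.run(["sh", "-c", script], capture_output=True, check=True).stdout


def mac(key, message=b"SHA256E-s9--example.txt"):
    """HMAC-SHA1 of a sample annex key under the given HMAC key."""
    return hmac.new(key, message, hashlib.sha1).hexdigest()


if __name__ == "__main__":
    direct = capture_direct()
    chomped = capture_via_substitution()
    print(repr(direct), repr(chomped))
    print("same HMAC:", mac(direct) == mac(chomped))
```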
Thanks for spending the time to check the script. It would have taken me quite a while to catch that issue. I was also a little bit surprised that the hybrid encryption mode uses the base64 encoded cipher string while the shared encryption mode uses the cipher after decoding from base64.
I can now reproduce the HMAC'ed keys and decrypt the files for all the supported forms of encryption using openssl and gpg, which makes me feel comfortable using them for long term storage.
I'm trying to add support for the `sharedpubkey` encryption scheme, and while decrypting the files is straightforward, I cannot get the "lookup_key" function to work with this scheme. I tried the following:
as well as many other variations (full cipher, no base64, etc.) without success. The cipher is said to be unencrypted, so I guess it is base64 encoded only. Would it be possible that an extra character has been added to the cipher like in the "pubkey" encryption scheme?
My tests are done in a directory special remote, sharedpubkey encryption, no chunks, a single test file in the repository.
@oliv5 sharedpubkey's cipher has the same newline problem as pubkey does, as discussed above. Unlike pubkey, it has to be base64-decoded first, and then the extra newline appended to that.
Note that base64 -d does emit the newline (verified with hexdump); again the shell is shooting you in the foot by eliminating it.
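For `sharedpubkey`, the description above suggests the HMAC key is simply the base64-decoded `cipher=` value with its trailing newline left in place. A rough Python sketch of that lookup follows. The `GPGHMACSHA1--` name prefix, the function name, and the sample cipher are assumptions for illustration, and this has not been checked against a real remote. `base64.b64decode` returns every decoded byte, so nothing needs to be re-appended.

```python
import base64
import hashlib
import hmac


def sharedpubkey_remote_name(cipher_b64, annex_key):
    """Assumed lookup: HMAC-SHA1 keyed with the decoded cipher, newline included."""
    mac_key = base64.b64decode(cipher_b64)  # keeps a trailing b"\n" if one is encoded
    digest = hmac.new(mac_key, annex_key.encode("utf-8"), hashlib.sha1).hexdigest()
    return "GPGHMACSHA1--" + digest


if __name__ == "__main__":
    annex_key = "SHA256E-s9--example.txt"
    # Made-up cipher whose decoded form ends in a newline, as described above.
    sample = base64.b64encode(b"example-mac-key\n").decode("ascii")
    # What a shell pipeline would effectively use after $(...) strips the newline.
    stripped = base64.b64encode(b"example-mac-key").decode("ascii")
    print(sharedpubkey_remote_name(sample, annex_key))
    print(sharedpubkey_remote_name(stripped, annex_key))
```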
BTW, a very simple code hack that makes it easy to dump out the cipher git-annex is using:
Then git-annex info remote will display it. Obviously, this patch is insecure.