This was the design doc for encryption and is preserved for the curious. For an example of using git-annex with an encrypted S3 remote, see using Amazon S3.
encryption key management
The basis of this scheme was originally developed by Lars Wirzenius et al for Obnam.
Data is encrypted by GnuPG, using a symmetric cipher. The cipher is generated by GnuPG when the special remote is created. By default the best entropy pool is used, so the generation may take a while; one can use initremote with the --fast option to speed things up, at the expense of using lower-quality random numbers. The generated cipher is then checked into your git repository, encrypted using one or more OpenPGP public keys. This scheme allows new OpenPGP private keys to be given access to content that has already been stored in the remote.
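For instance, setting up such a remote might look like this; the remote name and keyid are hypothetical, and encryption=hybrid is the mode in current git-annex that corresponds to the scheme described here:

    # cipher generation uses the best entropy pool and may take a while
    git annex initremote mys3 type=S3 encryption=hybrid keyid=2512E3C7
    # or, with --fast, trade randomness quality for speed
    git annex initremote --fast mys3 type=S3 encryption=hybrid keyid=2512E3C7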
Different encrypted remotes each need to be able to use a different cipher. Allowing multiple ciphers to be used within a single remote would add a lot of complexity, so it is not supported. Instead, if you want a new cipher, create a new S3 bucket, or whatever. There does not seem to be much benefit to using the same cipher for two different encrypted remotes.
So, the encrypted cipher is just stored with the rest of a remote's configuration in remote.log (see internals). When git annex initremote makes a remote, it generates a random symmetric cipher and encrypts it with the specified gpg key. To allow another gpg public key access, update the encrypted cipher to be encrypted to both gpg keys.
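In current git-annex, granting access to an additional key is done with keyid+= when enabling the remote; a sketch, with hypothetical remote name and keyid:

    # re-encrypts the stored cipher so a second gpg key can also decrypt it
    git annex enableremote mys3 keyid+=788A3F4C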
Note that there's a shared encryption mode where the cipher is not encrypted. When this mode is used, any clone of the git repository can decrypt files stored in its special remote.
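A sketch of setting up a shared-encryption remote, with hypothetical names:

    # the cipher is stored unencrypted in the git repository, so any
    # clone of the repository can decrypt content on this remote
    git annex initremote mydir type=directory directory=/mnt/backup encryption=shared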
filename enumeration
If the names of files are encrypted or securely hashed, or whatever is chosen, this makes it harder for git-annex (let alone untrusted third parties!) to get a list of the files that are stored on a given encrypted remote. But, does git-annex really ever need to do such an enumeration?
Apparently not. git annex unused --from remote can now check for unused data that is stored on a remote, and it does so based only on location log data for the remote. This assumes that the location log is kept accurately.
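For example, with a hypothetical remote named mys3:

    # list data on the remote that no current file uses, per the location log
    git annex unused --from mys3
    # then drop, say, unused item number 1 from that remote
    git annex dropunused --from mys3 1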
What about git annex fsck --from remote? Such a command should be able to, for each file in the repository, contact the encrypted remote to check if it has the file. This can be done without enumeration, although it will mean running gpg once per file fscked, to get the encrypted filename.
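For example, with a hypothetical remote named mys3:

    # checks, file by file, that the remote still has the encrypted content
    git annex fsck --from mys3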
So, the names of files stored in the remote should be encrypted. But, it needs to be a repeatable encryption, so they cannot just be gpg encrypted; that would yield a new name each time. Instead, HMAC is used. Any hash could be used with HMAC. SHA-1 is the default, but other hashes can be chosen for new remotes.
It was suggested that it might not be wise to use the same cipher for both gpg and HMAC. Being paranoid, it's best not to tie the security of one to the security of the other. So, the encrypted cipher described above is actually split in two; the first half is used for HMAC, and the second half for gpg.
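A rough shell illustration of the idea, using openssl; the file name, the split size, and the key value below are assumptions made for the sketch, not git-annex's actual storage format:

    # decrypt the stored cipher (hypothetical file name)
    cipher=$(gpg --quiet --decrypt cipher.gpg)
    # take the first half of the cipher as the HMAC secret (split size is a guess)
    hmacsecret=$(printf '%s' "$cipher" | head -c 256)
    # the same annex key always HMACs to the same remote-side name
    annexkey="SHA1-s1048576--0123456789abcdef0123456789abcdef01234567"  # hypothetical
    printf '%s' "$annexkey" | openssl dgst -sha1 -hmac "$hmacsecret"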
Does the HMAC cipher need to be gpg encrypted? Imagine if it were stored in plaintext in the git repository. Anyone who can access the git repository already knows the actual filenames, and typically also the content hashes of annexed content. Having access to the HMAC cipher could perhaps be said to only let them verify data they already know.
While this seems a pretty persuasive argument, I'm not 100% convinced, and anyway, most times that the HMAC cipher is needed, the gpg cipher is also needed. Keeping the HMAC cipher encrypted does slow down two things: dropping content from encrypted remotes, and checking if encrypted remotes really have content. If it's later determined to be safe to not encrypt the HMAC cipher, the current design allows changing that, even for existing remotes.
other use of the symmetric cipher
The symmetric cipher can be used to encrypt content other than what is sent to the remote. In particular, it may make sense to encrypt whatever access keys are used by the special remote with the cipher, and store that in remote.log. This way anyone whose gpg key has been given access to the cipher can also get access to whatever other credentials are needed to use the special remote.
For example, the S3 special remote does this if configured with embedcreds=yes.
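A sketch of that configuration, with hypothetical names and credentials; the credentials are read from the environment and stored, encrypted with the cipher, in remote.log:

    export AWS_ACCESS_KEY_ID=AKIAEXAMPLE           # hypothetical
    export AWS_SECRET_ACCESS_KEY=examplesecret     # hypothetical
    git annex initremote mys3 type=S3 encryption=hybrid keyid=2512E3C7 embedcreds=yes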
risks
A risk of this scheme is that, once the symmetric cipher has been obtained, it allows full access to all the encrypted content. Indeed, anyone owning a key that used to be granted access could already have decrypted the cipher and stored a copy. While it is possible to remove a key with keyid-=, that option is designed for a completely different purpose and does not actually revoke access.
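For completeness, the removal looks like this, with hypothetical names; it only stops encrypting the cipher to that key going forward:

    # does NOT revoke access for anyone who already saved the decrypted cipher
    git annex enableremote mys3 keyid-=788A3F4C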
If git-annex stores the decrypted symmetric cipher in memory, then there is a risk that it could be intercepted from there by an attacker. Gpg mitigates this type of risk by using locked memory. For git-annex, note that an attacker with local machine access can tell at least all the filenames and metadata of files stored in the encrypted remote anyway, and can access whatever content is stored locally.
This design does not address obfuscating the size of files by chunking them. However, chunking was later added; see chunks.
New encryption keys could be used for different directories/files/patterns/times/whatever. One could then encrypt this new key for the public keys of other people/machines and push them out along with the actual data. This would allow some level of access restriction or future revocation. git-annex would need to keep track of which files can be decrypted with which keys. I am undecided if that information needs to be encrypted or not.
Encrypted object files should be checksummed in encrypted form so that it's possible to verify integrity without knowing any keys. Same goes for encrypted keys, etc.
Chunking files in this context seems like needless overkill. Chunking might make sense for storing a DVD image on CDs or similar, at some point. But not for encryption, imo. Coming up with sane chunk sizes for all use cases is literally impossible and, as you pointed out, correlation by the remote admin is trivial.
I see no use case for verifying encrypted object files w/o access to the encryption key. And possible use cases for not allowing anyone to verify your data.
If there are to be multiple encryption keys usable within a single encrypted remote, then they would need to be given some kind of name (since a symmetric key is used, there is no pubkey to provide a name), and the name encoded in the files stored in the remote. While certainly doable, I'm not sold that adding a layer of indirection is worthwhile. It only seems it would be worthwhile if setting up a new encrypted remote was expensive to do. Perhaps that could be the case for some type of remote other than S3 buckets.
Assuming you're storing your encrypted annex with me and I with you, our regular cron jobs to verify all data will catch corruption in each other's annexes.
Checksums of the encrypted objects could be optional, mitigating any potential attack scenarios.
It's not only about the cost of setting up new remotes. It would also be a way to keep data in one annex while making it accessible only in a subset of them. For example, I might need some private letters at work, but I don't want my work machine to be able to access them all.
@Richard the easy way to deal with that scenario is to set up a remote that work can access, and only put in it files work should be able to see. Needing to specify which key a file should be encrypted to, when putting it in a remote that supported multiple keys, would add another level of complexity that this approach avoids.
Of course, the right approach is probably to have a separate repository for work. If you don't trust it with seeing file contents, you probably also don't trust it with the contents of your git repository.
"For git-annex, note that an attacker with local machine access can tell at least all the filenames and metadata of files stored in the encrypted remote anyway, and can access whatever content is stored locally."
Better security is given by sshfs + cryptmount, which I used when I recently set up a git-annex repository on a free shell account from a provider I do not trust.
See http://code.cjb.net/free-secure-online-backup.html for what I did to get a really secure solution.
Kind regards,
Hans Ekbrand
Hans,
You are misunderstanding how git-annex encryption works. The "untrusted host" and the "local machine" are not the same machine. git-annex only transfers pre-encrypted files to the "untrusted host".
You should set up a git-annex encrypted remote and watch how it works, so you can see for yourself that it is not insecure.
Your solution does not provide better security; it accomplishes the same thing as git-annex in a more complicated way. In addition, since you are mounting the image from the client, your solution will not work with multiple clients.
Justin,
thanks for clearing that up. It's great that git-annex has implemented mechanisms to work securely on untrusted hosts. My solution is thus only interesting for files that are impractical to manage with git-annex (e.g. data for/from applications that need rw-access to a large number of files). And, possibly, for providers that do not provide rsync.
Your remark that my solution does not work with more than one client is not entirely accurate. No more than one client can access the repository at any given time, but as long as access is not simultaneous, any number of clients can access the repository. Still, your point is taken; it's a limitation I should mention.
It would be interesting to compare the performance of individually encrypted files to an encrypted image file. My intuition says that the encrypted image file should be faster, but that's just a guess.