The idea is to have a type of key that is based on a hash of the annexed file, but with some form of encryption or password protection.
So someone who can access the git repository is prevented from checking known hashes of files, and so from learning about content in the repository that is not available to them. (However, they would be able to see filenames and commit metadata, which might also expose information about what the files are. This is not a fully encrypted git repository of the kind that can be made with gcrypt.)
This might be an added layer of security, or the git repository might be made public, including perhaps some of the annexed files, while this is used to obscure the hashes of some other annexed files. For example, a scientific dataset might make public files that are derived from patient data, and use this for the patient data files.
If two such repositories were merged, git-annex would need to somehow be able to tell how to decrypt a key, which could have come from either repository. So it seems that the key needs to include within it some identifier of the secret that is used to encrypt it. For example:
ENC--ident--foo
Where ident would be something like a UUID of the secret, and foo is the file's hash (and perhaps size, or indeed a whole regular key) that is encrypted by the secret.
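A minimal sketch of how such a key could be taken apart to find the right secret after a merge. The function names, the dictionary of secrets, and the choice of `--` as the only separator are assumptions for illustration, not git-annex's actual key parser.

```python
from typing import NamedTuple


class EncKey(NamedTuple):
    ident: str    # identifier of the secret (e.g. a UUID)
    payload: str  # the encrypted hash, opaque without the secret


def parse_enc_key(key: str) -> EncKey:
    """Split a hypothetical 'ENC--ident--foo' key into its parts."""
    # Raises ValueError if there are fewer than three '--'-separated parts.
    backend, ident, payload = key.split("--", 2)
    if backend != "ENC" or not ident or not payload:
        raise ValueError(f"not an ENC key: {key!r}")
    return EncKey(ident, payload)


def secret_for(key: str, secrets: dict[str, bytes]) -> bytes:
    """Pick the secret a key names, e.g. after merging two repositories."""
    ident = parse_enc_key(key).ident
    try:
        return secrets[ident]
    except KeyError:
        raise LookupError(f"no secret known for ident {ident!r}") from None
```

With the ident stored in the key, a merged repository only needs a table of secrets indexed by ident to decide which secret applies to which key.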
Should the key's size be included only in encrypted form, or in plaintext?
Since git-annex constructs a Key without using the associated Backend, it currently parses the size field the same way for each type of key. So, it does not seem to be possible to encrypt the size field and use that encrypted size to populate keySize. What would be doable, though, is to replace uses of keySize with a new Backend method that returns the size.
Should the encryption method be reversible?
If the encryption method is not reversible, the key's size would need to be included in plaintext, or left out entirely.
But it does not seem necessary for the encryption method to be reversible otherwise. Consider if scrypt was used. When adding a file, git-annex would first hash it, and then run it through scrypt. That is not reversible, so when fscking a file, just repeat the same process and compare the resulting scrypt keys.
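The hash-then-scrypt process described above might look roughly like this in Python. This is only a sketch: the cost parameters, the use of the secret as scrypt's salt input, and the function names are all assumptions, and git-annex itself has no such backend.

```python
import hashlib
import hmac

# Hypothetical cost parameters; real values would need tuning.
SCRYPT_N, SCRYPT_R, SCRYPT_P = 2**14, 8, 1


def derive_key(content: bytes, secret: bytes) -> str:
    """Hash the content, then run the hash through scrypt with the secret."""
    digest = hashlib.sha256(content).digest()
    derived = hashlib.scrypt(
        digest, salt=secret, n=SCRYPT_N, r=SCRYPT_R, p=SCRYPT_P,
        maxmem=64 * 1024 * 1024, dklen=32,
    )
    return derived.hex()


def fsck(content: bytes, key: str, secret: bytes) -> bool:
    """Verify by repeating the derivation and comparing; nothing is decrypted."""
    return hmac.compare_digest(derive_key(content, secret), key)
```

Verification never needs to invert anything: it recomputes the key from the file and the secret, and compares the result with the recorded key.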
Not being reversible is a nice benefit, because it makes brute-forcing much harder for an attacker. If it's reversible, the attacker can brute-force the user's password, looking for a password that decrypts to something that looks right.
--Joey
Hey joey,
As a simple-to-implement yet quite effective approach to the problem of storing some secrets in a public git-annex repo, wouldn't it be enough to use the output of a very slow hash/key derivation function (like scrypt) as the keys for those specific files? The hash can be public when brute-forcing is infeasible. So for git-annex:
SCRYPT-n10-r100-p1--...
With sane defaults (maybe settings that make hashing take several seconds?), this would make git-annex a very nice way of hiding some files' content in public repositories while still tracking it.
Some resources:
Including the salt in the key is an interesting idea. I am not sure I buy that it would be secure.
Normally with scrypt (or argon2 etc), and a strong password, the attacker has to make a huge number of guesses, and so the comparatively modest amount of work per guess is enough to make it infeasible for them to succeed.
Here, though, the attacker will only be interested in guessing the hashes of known files that they care about. That might be millions of files, and so a reasonable amount of work to try them all. Probably less work than guessing a good password, so the hash difficulty parameters would need to be turned up to secure against that attacker.
But... If the attacker only cares about a single file, they only have to run scrypt once.
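To make the attacker's side concrete, here is a rough sketch of the check they would run when the salt and parameters are public because they are part of the key. The parameter values, the single shared salt, and all function names are assumptions for illustration.

```python
import hashlib

# Everything here is public when the salt and parameters live in the key.
N, R, P = 2**14, 8, 1


def public_key_for(content: bytes, salt: bytes) -> str:
    """Derive the key exactly as any reader of the repository could."""
    digest = hashlib.sha256(content).digest()
    return hashlib.scrypt(digest, salt=salt, n=N, r=R, p=P,
                          maxmem=64 * 1024 * 1024, dklen=32).hex()


def attacker_checks(known_file: bytes, salt: bytes, published: set[str]) -> bool:
    """One scrypt call answers 'is this known file in the repository?'"""
    return public_key_for(known_file, salt) in published


def guessing_cost_days(candidates: int, seconds_per_hash: float) -> float:
    """CPU-days needed to try every candidate file once."""
    return candidates * seconds_per_hash / 86400
```

With these hypothetical numbers, a million candidate files at five seconds per hash comes to roughly 58 CPU-days, which is a modest budget for a motivated attacker, while a single known file costs one call.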
Well, the attacker would still need to run scrypt many times to brute-force the actual content, IIUC.
I understand it like this (might be wrong, I'm no security expert):
git annex fsck is slow on that one file. Big files hash slowly anyway as well...). It should definitely be harder than cracking a breached password database where it is clear that the passwords don't contain newlines and there most likely is a size limit, etc.
If my above assumptions are correct, an scrypt key backend for git-annex should make for a nice way of hiding the content of some files in a public repo, right?
P.S.: Thinking about the 'that secured file could contain potentially large random comments acting as a salt next to the actual secret' together with removing the size from the key (git annex migrate --remove-size) should already be pretty safe with the default SHA256-backend, right? 🤔
I'm talking about a single file that the attacker already knows the content of, and wishes to determine if it's present in the repository. That takes a single scrypt operation, so scrypt's added difficulty is not relevant.
If the attacker doesn't know the file content, and is trying to hash random values until they find one that matches the git-annex key, then a sha2 is equally good protection as scrypt, because the amount of work is computationally infeasible in both cases.
Ah alright, in that case scrypt doesn't help, I agree. Among the large userbase of such versatile software as git-annex there surely are people who could make use of hiding the presence of some 'common' files from their public repos. Though it rather seems like quite an edge case to me 😅 Better go full-gcrypt in that case I'd say...
In that case, this'll be enough for my application, thanks! 🙏
Ok, I think we've concluded there is nothing to do here! I'll close this todo then.