Designate a metadata field, say alt_keys, to store alternate keys for the content designated by the key with the metadata. Then, after initially adding a URL key, and after some time getting its content, a checksum-based key such as MD5 could be added as the URL key's metadata. Then, without needing to migrate, the URL key could be treated like checksum-based keys, e.g. downloaded from untrusted remotes, fsck'ed, etc.
The problem with migrating keys is that a separate copy of the contents is stored in the annex under the old key; but if you force-drop that, symlinks in older commits will become invalid. You could rewrite git history, but that brings its own problems.
Also, sometimes one can determine the MD5 from the URL without downloading the file; e.g. with gsutil stat for gs:// URIs, or by downloading an .md5 file stored next to the main file, or because an MD5 was computed by a workflow manager that produced the file (Cromwell does this). The special remote's "CHECKURL" implementation could record an MD5E key in the alt_keys metadata field of the URL key. Then 'addurl --fast' could check alt_keys, and store in git an MD5E key rather than a URL key, if available.
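As a rough illustration of the CHECKURL idea above, here is a minimal sketch of turning an MD5 that a remote reports without downloading into an MD5E-style key name. It assumes the digest arrives base64-encoded (as GCS tooling typically reports it) and that the size is known. The function name md5e_key is hypothetical, and the extension handling is deliberately simplified compared to what git-annex itself does.

```python
import base64
import os


def md5e_key(md5_b64: str, size: int, filename: str) -> str:
    """Build an MD5E-style key name from a base64-encoded MD5 digest.

    Hypothetical helper, not part of git-annex.
    """
    # Convert the base64 digest into the hex form used in key names.
    md5_hex = base64.b64decode(md5_b64).hex()
    # Simplification: keep only the final extension of the filename.
    # git-annex's own rules for preserving extensions are more involved.
    ext = os.path.splitext(filename)[1]
    return f"MD5E-s{size}--{md5_hex}{ext}"
```

A special remote could record the resulting name in alt_keys, or 'addurl --fast' could use it directly in place of a URL key.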
This would mean that, every time something about a key is looked up in the git-annex branch, it would also need to look at the metadata to see if this alt_keys field is set. So it doubles the time of every single query of the git-annex branch.
I don't think that's a good idea; querying the git-annex branch is already often a bottleneck to commands.
The proposed implementation may be inefficient, but the idea has merit.
What if that information is stored in a place where it can be used to verify migrations?
For example, when recording in git-annex:aaa/bbb/SHA1-s1234--somehash.log that the migrating remote dropped the data, a record could be added somewhere near there that this was migrated to SHA512-s1234--longerhash. When all the other remotes are then asked to drop that file, they can actually do so, because they see that it has been migrated, can verify the migration, and are free to drop the file.
Even later, when a remote wants to get an old name (e.g. because it checked out an old version of master), it can look up the key, find where it was migrated to, and make the data available under that old name (by copying, or maybe by placing a symlink pointing from .git/annex/objects/Aa/Bb/SHA1-s1234--somehash/SHA1-s1234--somehash to the new one).
So, fully and properly implementing what the title of this todo suggests -- "alternate keys for same content" -- might be hard. But simply adding checksums to WORM/URL keys, stored separately on the git-annex branch rather than encoded in the key's name, is simpler. This would let some WORM/URL keys be treated as checksum-based keys when getting contents from untrusted remotes or when checking integrity with git-annex-fsck. But this isn't really "alternate keys for same content": the content would be stored under only the WORM/URL key under which it was initially recorded. The corresponding MD5 key would not be recorded in location tracking as present.
Checking whether a WORM/URL key has an associated checksum could be sped up by keeping a Bloom filter representing the set of WORM/URL keys for which alt_keys is set.
In the addurl --fast case for special remotes, where the remote can determine a file's checksum without downloading, a checksum-based key would be recorded to begin with, as happens with addurl without --fast. Currently I do this by manually calling plumbing commands like git-annex-setpresentkey, but having addurl do it seems better.
There is also aaa/bbb/*.log.cid in git-annex branch for "per-remote content identifiers for keys". It could be another place to store alternate keys, but it is per-remote, so... no.
As for the metadata field alt_keys, it is another case of "setting a metadata field to a key" in Bidirectional metadata.
Also, there is an interesting idea of git-annex-migrate using git-replace.
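The Bloom filter suggestion made earlier in this comment can be sketched roughly as follows. This is only an illustration of the data structure, not git-annex code: the class name, the default num_bits and num_hashes values, and the use of SHA-256 to derive bit positions are all assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over key names (illustrative only)."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not added"; True means "possibly added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Only when might_contain returns True would the metadata actually need to be read; a False answer is definitive, so most lookups of keys without a recorded checksum could skip the second query. False positives are possible, so a True answer still has to be confirmed against the branch.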
By the way, as far as I know (maybe things have changed since then), ipfs has a similar problem of different identifiers for the same content, because an identifier encodes how things are stored. And hash functions can also be changed.
I wonder if storing checksums in a general-purpose mutable metadata field may cause security issues. Someone could use the git-annex-metadata command to overwrite the checksum. It should be stored in a read-only field written only by git-annex itself, like the field-lastchanged metadata already is.
Of course, if someone is able to write the git-annex branch directly, or get the user to pull merges to it, they could alter the checksum stored there. Maybe, only trust stored checksums if merge.verifySignatures=true?
I think Ilya Shlyakhter gets to a fundamental problem in his comment above. Any way that git-annex stores data about an alternate key that is recorded in git, allows anyone to spoof bad data.
For example, if I have a SHA256 key stored in git-annex, it would be a bad security hole if I fetched from Ilya's repository and suddenly git-annex was willing to accept some MD5 key as being the same content as my SHA256 key. Even if the two keys had the same content currently, that MD5 key can be collision attacked later.
So there would need to be a direction in which key upgrades were allowed. Which is fine for WORM -> SHA256, but less clear for SHA1 -> SHA256 and much less clear for other pairs of modern hashes.
I did a small experiment to gauge how much the git repo size would grow if migration were recorded in log files in the git-annex branch.
In my experiment, I started with 1000 files using sha256. The size of the git objects (after repack by git gc --aggressive) was 0.5 MB. I then migrated them to sha512, which increased the size of git objects to 1.1 MB (after repacking).
Then I recorded in the git-annex branch additional log files for each of the sha512 keys that contained the corresponding sha256 key. That grew the git objects to 1.4 MB after repacking.
This was a little disappointing. I'd hoped that repacking would avoid duplication of the sha256 keys, which are both in the log files I wrote and are used as filenames. But the data I wrote to the logs is only 75 kB total, and git grew 4x that.
I tried the same thing except instead of separate log files I added to git one log file that contained pairs of sha256 and sha512 keys. That log file was 213 kB and adding it to the git repo grew it by 102 kB. So there was some compression there, but less than I would have hoped, and not much better than just gzip -9 of the log file (113 kB). Of course putting all the migration information in a single file like this would add a lot of complexity to accessing it.
So adding this information to the git-annex branch would involve at best around a 16% overhead, which is a surprising amount.
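For readers who want to check the ratios quoted in the experiment, here is a small sketch that recomputes them from the stated sizes. It assumes decimal units (1 MB = 1000 kB), and the helper names are made up for illustration. The 16% figure is not recomputed here, since the comment does not say which baseline it is measured against.

```python
KB_PER_MB = 1000  # assumption: decimal units


def growth_factor(before_mb: float, after_mb: float, payload_kb: float) -> float:
    """How many times larger the repo growth is than the data written."""
    return (after_mb - before_mb) * KB_PER_MB / payload_kb


def stored_fraction(stored_kb: float, raw_kb: float) -> float:
    """Fraction of the raw size that remains after compression."""
    return stored_kb / raw_kb


# One log file per key: 75 kB written, repo grew from 1.1 MB to 1.4 MB.
per_key_logs = growth_factor(1.1, 1.4, 75)
# Single log file: 213 kB raw, 102 kB growth in git, 113 kB with gzip -9.
single_file_git = stored_fraction(102, 213)
single_file_gzip = stored_fraction(113, 213)

print(f"{per_key_logs:.1f}x, {single_file_git:.0%}, {single_file_gzip:.0%}")
```

With these inputs the per-key logs come out at about 4x the data written, and the single file is stored at roughly 48% of its raw size in git versus roughly 53% with gzip -9.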
(It would be possible to make git-annex forget --drop-dead remove the information about old migrated keys if they later get marked as dead, and so regain the space.)
This is also rather redundant information to store in git, since most of the time when file foo has been migrated, the old key can be determined by looking at git log foo. Not always of course, because foo might have been renamed after migration, for example.
Another way to store migration information in the git-annex branch would be to graft in the pre-migration tree and the post-migration tree. Diffing those two trees would show what migrated, and most of the time this would use almost no additional space in git, because the user will have committed both those trees anyway, or something very close to them. But it would be more expensive to extract the migration information then, and this would need a local cache of migrations to be built up from examining those diffs.
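The tree-diffing approach could look something like the sketch below. It models each grafted tree as a plain mapping from file path to key, which is an assumption made for illustration; a real implementation would read the trees from git and would also need a way to tell a migration apart from an ordinary content change at the same path. The function name is hypothetical.

```python
def migrations_from_trees(pre: dict[str, str], post: dict[str, str]) -> dict[str, str]:
    """Map old keys to new keys for paths whose key changed between trees."""
    migrated = {}
    for path, old_key in pre.items():
        new_key = post.get(path)
        # Paths missing from the post-migration tree, or unchanged, are skipped.
        if new_key is not None and new_key != old_key:
            migrated[old_key] = new_key
    return migrated


pre_tree = {"foo": "SHA1-s1234--somehash", "bar": "SHA1-s99--otherhash"}
post_tree = {"foo": "SHA512-s1234--longerhash", "bar": "SHA1-s99--otherhash"}
cache = migrations_from_trees(pre_tree, post_tree)
```

The resulting mapping is the kind of local cache mentioned above: it would be built once per pair of grafted trees and then consulted when an old key needs to be resolved.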
About the idea of recording a checksum of the content of a URL or WORM key, without migrating to a SHA key, that does seem worth considering. (And maybe was the original idea of this todo really..)
If that were implemented, it would be necessary for more than one checksum to be able to be recorded for a given URL key. Because different clones might get different content from the URL and each add its checksum.
So, this would not be as strong an assurance as using a SHA key that you're referring to a specific piece of data. It would be useful to protect against bit rot, but not as a way to pin a file to a particular version. Which is often something one does want to do in a git repository!
I do think that implementing that would be a lot simpler. And it would only affect performance when verifying the content of URL or WORM keys, when it would need to look up the checksum in the git-annex branch.
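To make the "more than one checksum per URL key" point concrete, here is a minimal sketch of verification against a set of recorded checksums. How the checksums would actually be stored in the git-annex branch is not decided anywhere in this thread, so the (algorithm, hex digest) pair representation and the function name are assumptions.

```python
import hashlib


def verify_content(content: bytes, recorded_checksums) -> bool:
    """Return True if content matches any recorded (algorithm, hexdigest) pair."""
    for algo, expected_hex in recorded_checksums:
        # hashlib.new accepts algorithm names such as "md5" or "sha256".
        actual_hex = hashlib.new(algo, content).hexdigest()
        if actual_hex == expected_hex.lower():
            return True
    return False


# Two clones fetched different content from the same URL and each
# recorded the checksum of what it got.
recorded = [
    ("md5", hashlib.md5(b"version one").hexdigest()),
    ("sha256", hashlib.sha256(b"version two").hexdigest()),
]
```

Accepting any recorded checksum is what makes this weaker than a SHA key: it detects bit rot in either version, but cannot say which version the repository meant.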