Designate a metadata field, say alt_keys, to store alternate keys for the content designated by the key carrying the metadata. Then, after initially adding a URL key and later getting its content, a checksum-based key such as an MD5 key could be added to the URL key's metadata. Then, without needing to migrate, the URL key could be treated like a checksum-based key, e.g. downloaded from untrusted remotes, fsck'ed, etc.
The problem with migrating keys is that a separate copy of the contents is stored in the annex under the old key; but if you force-drop that, symlinks in older commits will become invalid. You could rewrite git history, but that brings its own problems.
Also, sometimes one can determine the MD5 from the URL without downloading the file; e.g. with gsutil stat for gs:// URIs, or by downloading an .md5 file stored next to the main file, or because an MD5 was computed by a workflow manager that produced the file (Cromwell does this). The special remote's "CHECKURL" implementation could record an MD5E key in the alt_keys metadata field of the URL key. Then 'addurl --fast' could check alt_keys, and store in git an MD5E key rather than a URL key, if available.
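To make the sidecar case concrete, here is a minimal sketch of turning an MD5 obtained without downloading (for example from an .md5 file next to the main file) into an MD5E-style key name. The helper names parse_md5_sidecar and md5e_key are hypothetical and not part of git-annex; the sketch assumes the file size is also known, and its extension handling is deliberately simpler than whatever rules git-annex itself applies.

```python
import os
import re


def parse_md5_sidecar(text: str) -> str:
    """Extract the hex digest from md5sum-style text ("<hex>  <filename>")."""
    parts = text.split()
    if not parts:
        raise ValueError("empty .md5 file")
    token = parts[0].lower()
    if not re.fullmatch(r"[0-9a-f]{32}", token):
        raise ValueError(f"not an MD5 hex digest: {token!r}")
    return token


def md5e_key(md5_hex: str, size: int, filename: str) -> str:
    """Build a key name of the form MD5E-s<size>--<md5><.ext>.

    Assumption: only the last extension of the filename is kept; this is a
    simplification for illustration.
    """
    ext = os.path.splitext(filename)[1]
    return f"MD5E-s{size}--{md5_hex}{ext}"
```

For instance, md5e_key(parse_md5_sidecar(sidecar_text), size, "reads.bam") would give a key that could be written into alt_keys, or stored in git directly in the 'addurl --fast' case.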
This would mean that, every time something about a key is looked up in the git-annex branch, it would also need to look at the metadata to see if this alt_keys field is set.
So it doubles the time of every single query of the git-annex branch.
I don't think that's a good idea; querying the git-annex branch is already often a bottleneck for commands.
The proposed implementation may be inefficient, but the idea has merit.
What if that information is stored in a place where it can be used to verify migrations?
For example, when recording that the migrating remote dropped the data in git-annex:aaa/bbb/SHA1-s1234--somehash.log, a record could be added somewhere nearby that this key was migrated to SHA512-s1234--longerhash. Then, when all the other remotes are asked to drop that file, they can actually do so: they see that it has been migrated, can verify the migration, and are free to drop the file.
Even later, when a remote wants to get the content under an old name (e.g. because it checked out an old version of master), it can look up the key, find where it was migrated to, and make the data available under the old name (by copying, or maybe by placing a symlink pointing from .git/annex/objects/Aa/Bb/SHA1-s1234--somehash/SHA1-s1234--somehash to the new one).
So, to fully and properly implement what the title of this todo suggests -- "alternate keys for same content" -- might be hard. But to simply enable adding checksums to WORM/URL keys, stored separately on the git-annex branch rather than encoded in the key's name, is simpler. This would let some WORM/URL keys be treated as checksum-based keys when getting contents from untrusted remotes or when checking integrity with git-annex-fsck. But this isn't really "alternate keys for same content": the content would be stored only under the WORM/URL key under which it was initially recorded. The corresponding MD5 key would not be recorded in location tracking as present.
Checking whether a WORM/URL key has an associated checksum could be sped up by keeping a Bloom filter representing the set of WORM/URL keys for which alt_keys is set.
In the addurl --fast case for special remotes, where the remote can determine a file's checksum without downloading, a checksum-based key would be recorded to begin with, as happens with addurl without --fast. Currently I do this by manually calling plumbing commands like git-annex-setpresentkey, but having addurl do it seems better.
There is also aaa/bbb/*.log.cid in the git-annex branch for "per-remote content identifiers for keys". It could be another place to store alternate keys, but it is per-remote, so... no.
As for the metadata field alt_keys, it is another case of "setting a metadata field to a key" in Bidirectional metadata.
Also, there is an interesting idea of git-annex-migrate using git-replace.
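The Bloom filter suggested earlier in this thread, for cheaply ruling out WORM/URL keys that have no alt_keys entry, could look roughly like the sketch below. This is an illustration only: the class, its parameters (num_bits, num_hashes), and the salted SHA-256 scheme are assumptions made here, and nothing like it is claimed to exist in git-annex. A negative answer is definite, so the metadata lookup can be skipped; a positive answer may be a false positive and still needs the real lookup.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size Bloom filter over key names (illustrative parameters)."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely not added"; True means "possibly added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

A caller would add every key whose alt_keys field gets set, and consult might_contain before paying for the extra metadata query.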
By the way, as far as I know (maybe things have changed since I last looked), ipfs has a similar problem of different identifiers for the same content, because its identifiers encode how things are stored. And hash functions can also be changed.
I wonder if storing checksums in a general-purpose mutable metadata field may cause security issues. Someone could use the git-annex-metadata command to overwrite the checksum. It should be stored in a read-only field written only by git-annex itself, like the field-lastchanged metadata already is.
Of course, if someone is able to write the git-annex branch directly, or get the user to pull merges to it, they could alter the checksum stored there. Maybe only trust stored checksums if merge.verifySignatures=true?
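As a rough sketch of that last idea, a tool could refuse to rely on stored checksums unless git is configured to verify signatures on merges. The function names below are hypothetical, and this is not a description of how git-annex behaves; it only assumes that git is on PATH and that merge.verifySignatures is read through git config with --bool so that git normalizes the value.

```python
import subprocess
from typing import Optional


def git_bool(value: Optional[str]) -> bool:
    """Interpret normalized `git config --bool` output; unset counts as false."""
    return value is not None and value.strip() == "true"


def read_verify_signatures(repo_dir: str = ".") -> Optional[str]:
    """Return merge.verifySignatures as "true"/"false", or None if unset or invalid."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "config", "--bool", "--get",
         "merge.verifySignatures"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() if result.returncode == 0 else None


def trust_stored_checksums(repo_dir: str = ".") -> bool:
    """Only trust checksums from the git-annex branch if merges are verified."""
    return git_bool(read_verify_signatures(repo_dir))
```

Note that this only gates on the configuration setting; it does not itself check that the commits on the git-annex branch are actually signed.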