todo/alternate keys for same contentgit-annexhttp://git-annex.branchable.com/todo/alternate_keys_for_same_content/git-annexikiwiki2023-12-01T18:43:01Zcomment 1http://git-annex.branchable.com/todo/alternate_keys_for_same_content/comment_1_7a7f287bcde5353072100294dd8edce6/joey2020-06-17T01:18:32Z2020-01-30T18:36:17Z
<p>This would mean that, every time something about a key is looked up in the
git-annex branch, it would also need to look at the metadata to see if this
<code>alt_keys</code> field is set.</p>
<p>So it doubles the time of every single query of the git-annex branch.</p>
<p>I don't think that's a good idea; querying the git-annex branch is already
often a bottleneck for commands.</p>
alternate keyshttp://git-annex.branchable.com/todo/alternate_keys_for_same_content/comment_2_0f464f4970e499371fcb65e0d06202cf/Ilya_Shlyakhter2020-06-17T01:18:32Z2020-01-31T19:23:35Z
"every time something about a key is looked up in the git-annex branch, it would also need to look at the metadata to see if this alt_keys field is set" -- not every time, just when checking if the key is checksum-based, and if content matches the checksum. Also, isn't metadata <a href="http://git-annex.branchable.com/design/caching_database/">cached in a database</a>?
Re: comment 1 http://git-annex.branchable.com/todo/alternate_keys_for_same_content/comment_3_c99b23e878e37e205a3182d3b6d3f2b2/chrysn2020-06-17T01:18:32Z2020-01-31T19:47:59Z
<p>The proposed implementation may be inefficient, but the idea has merit.</p>
<p>What if that information is stored in a place where it can be used to verify migrations?</p>
<p>For example, when recording in <code>git-annex:aaa/bbb/SHA1-s1234--somehash.log</code> that the migrating remote dropped the data, a record could be added nearby saying that the key was migrated to SHA512-s1234--longerhash. When all the other remotes are then asked to drop that file, they can do so: they see that it has been migrated, can verify the migration, and are free to drop the file.</p>
<p>Even later, when a remote wants to get an old name (e.g. because it checked out an old version of master), it can look up the key, find where it was migrated to, and make the data available under its own name (by copying, or maybe by placing a symlink at <code>.git/annex/objects/Aa/Bb/SHA1-s1234--somehash/SHA1-s1234--somehash</code> that points to the new one).</p>
comment 4http://git-annex.branchable.com/todo/alternate_keys_for_same_content/comment_4_a5fb6045595da0c490098e46f76db9b8/Ilya_Shlyakhter2020-06-17T01:18:32Z2020-01-31T20:32:00Z
"can be used to verify migrations" -- my hope was to <em>avoid</em> migrations, i.e. to get the benefit you'd get from migrating to a checksum-based key, without doing the migration.
simpler proposalhttp://git-annex.branchable.com/todo/alternate_keys_for_same_content/comment_5_230d35bd623189818002901455964ca4/Ilya_Shlyakhter2020-06-17T01:18:32Z2020-01-31T21:46:57Z
<p>So, to fully and properly implement what the title of this todo suggests -- "alternate keys for same content" -- might be hard. But to simply enable adding checksums to WORM/URL keys, stored separately on the git-annex branch rather than encoded in the key's name, is simpler. This would let some WORM/URL keys be treated as checksum-based keys when getting contents from untrusted remotes or when checking integrity with <code>git-annex-fsck</code>. But this isn't really "alternate keys for same content": the content would be stored under only the WORM/URL key under which it was initially recorded. The corresponding MD5 key would not be recorded in <a href="http://git-annex.branchable.com/location_tracking/">location tracking</a> as present.</p>
<p>Checking whether a WORM/URL key has an associated checksum could be sped up by keeping a Bloom filter representing the set of WORM/URL keys for which <code>alt_keys</code> is set.</p>
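<p>A minimal sketch of that Bloom filter idea, in Python rather than git-annex's own Haskell. The class, the sizes, and the example key are all hypothetical; the point is only that a negative answer from the filter is definitive, so the metadata lookup can be skipped for most keys.</p>

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Hypothetical set of WORM/URL keys that have a checksum recorded.
keys_with_checksum = BloomFilter()
keys_with_checksum.add("WORM-s1234-m1580000000--example.dat")


def needs_metadata_lookup(key: str) -> bool:
    """Return False only when the key definitely has no recorded checksum."""
    return keys_with_checksum.might_contain(key)
```

<p>A True result still requires the real lookup, since the filter can report false positives; only the False case saves work.</p>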
<p>In the <code>addurl --fast</code> case for special remotes, where the remote can determine a file's checksum without downloading, a checksum-based key would be recorded to begin with, as happens with <code>addurl</code> without <code>--fast</code>. Currently I do this by manually calling plumbing commands like <code>git-annex-setpresentkey</code>, but having <code>addurl</code> do it seems better.</p>
comment 6http://git-annex.branchable.com/todo/alternate_keys_for_same_content/comment_6_ae8355ec917e7a7a240cdb88714c55d0/Chel2020-06-17T01:18:32Z2020-02-01T02:32:01Z
<p>There is also <code>aaa/bbb/*.log.cid</code> in the git-annex branch for "per-remote content identifiers for keys".
It could be another place to store alternate keys, but it is per-remote, so... no.</p>
<p>As for the metadata field <code>alt_keys</code> — it is another case of
"<a href="http://git-annex.branchable.com/todo/Bidirectional_metadata/#comment-788380998b25267c5b99c4a865277102">setting a metadata field to a key</a>"
in <a href="http://git-annex.branchable.com/todo/Bidirectional_metadata/">Bidirectional metadata</a>.</p>
<p>Also, there is an interesting idea of <a href="http://git-annex.branchable.com/todo/git-annex-migrate_using_git-replace/">git-annex-migrate using git-replace</a>.</p>
<p>By the way, as far as I know (maybe things have changed since then),
ipfs has a similar problem of different identifiers for the same content,
because its identifiers encode how things are stored. And hash functions can also be changed.</p>
potential security issues?http://git-annex.branchable.com/todo/alternate_keys_for_same_content/comment_7_1c0a975893c63c14b3f6e17712b5191c/Ilya_Shlyakhter2020-06-17T01:18:32Z2020-02-06T21:00:55Z
<p>I wonder if storing checksums in a general-purpose mutable metadata field may cause security issues. Someone could use the <a href="http://git-annex.branchable.com/git-annex-metadata/"><code>git-annex-metadata</code></a> command to overwrite the checksum. It should be stored in a read-only field written only by <code>git-annex</code> itself, like the <code>field-lastchanged</code> metadata already is.</p>
<p>Of course, if someone is able to write to the <a href="http://git-annex.branchable.com/internals/#The_git-annex_branch">git-annex branch</a> directly, or get the user to pull and merge their changes to it, they could alter the checksum stored there. Maybe stored checksums should only be trusted if <code>merge.verifySignatures=true</code>?</p>
comment 8http://git-annex.branchable.com/todo/alternate_keys_for_same_content/comment_8_4b16c48a2d9f4926d63f6ab54fe801d3/joey2023-11-30T21:07:21Z2023-11-30T20:43:53Z
<p>I think Ilya Shlyakhter gets to a fundamental problem in his comment above.
Whatever way git-annex stores data about an alternate key, if it is recorded
in git, anyone can spoof bad data.</p>
<p>For example, if I have a SHA256 key stored in git-annex, it would be a bad
security hole if I fetched from Ilya's repository and suddenly git-annex
was willing to accept some MD5 key as being the same content as my SHA256
key. Even if the two keys had the same content currently, that MD5 key can
be collision attacked later.</p>
<p>So there would need to be a direction in which key upgrades were allowed.
Which is fine for <code>WORM -> SHA256</code>, but less clear for <code>SHA1 -> SHA256</code>
and much less clear for other pairs of modern hashes.</p>
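<p>A rough sketch of what a one-directional upgrade policy could look like, in Python. The backend sets and the function names are assumptions made for illustration, not anything git-annex implements. It deliberately accepts only the case described above as fine (a non-checksum key upgraded to SHA256) and refuses everything else, including the unclear hash-to-hash pairs.</p>

```python
# Hypothetical policy tables, for illustration only.
NON_CHECKSUM_BACKENDS = {"WORM", "URL"}  # keys that make no claim about content
ACCEPTED_TARGETS = {"SHA256"}            # assumption: hashes trusted as upgrade targets


def backend_of(key: str) -> str:
    """Return the backend name, the part of a key before its first '-'."""
    return key.split("-", 1)[0]


def upgrade_allowed(old_key: str, new_key: str) -> bool:
    """Allow only upgrades from a non-checksum key to an accepted hash."""
    return (
        backend_of(old_key) in NON_CHECKSUM_BACKENDS
        and backend_of(new_key) in ACCEPTED_TARGETS
    )


# WORM -> SHA256 is accepted; SHA256 -> MD5 and SHA1 -> SHA256 are refused.
print(upgrade_allowed("WORM-s1234-m1580000000--file", "SHA256-s1234--somehash"))
print(upgrade_allowed("SHA256-s1234--somehash", "MD5-s1234--otherhash"))
print(upgrade_allowed("SHA1-s1234--somehash", "SHA256-s1234--longerhash"))
```

<p>Under this policy, an alternate-key record fetched from someone else's repository claiming that a SHA256 key equals some MD5 key would simply be ignored.</p>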
git-annex branch size when storing migration informationhttp://git-annex.branchable.com/todo/alternate_keys_for_same_content/comment_9_42d240bbfc6ab858219ffa0f873c3eb4/joey2023-12-01T18:17:03Z2023-12-01T16:10:11Z
<p>I did a small experiment to gauge how much the git repo size would grow if
migration were recorded in log files in the git-annex branch.</p>
<p>In my experiment, I started with 1000 files using sha256. The size of the
git objects (after repack by git gc --aggressive) was 0.5 mb. I then
migrated them to sha512, which increased the size of git objects to 1.1 mb
(after repacking).</p>
<p>Then I recorded in the git-annex branch additional log files for each of
the sha512 keys that contained the corresponding sha256 key. That grew the
git objects to 1.4 mb after repacking.</p>
<p>This was a little disappointing. I'd hoped that repacking would avoid
duplication of the sha256 keys, which are both in the log files I wrote
and are used as filenames. But the data I wrote to the logs is only 75 kb
total, and git grew 4x that.</p>
<p>I tried the same thing except instead of separate log files I added to git
one log file that contained pairs of sha256 and sha512 keys. That log file
was 213 kb and adding it to the git repo grew it by 102 kb. So there was
some compression there, but less than I would have hoped, and not much
better than just gzip -9 of the log file (113 kb). Of course putting all
the migration information in a single file like this would add a lot of
complexity to accessing it.</p>
<p>So adding this information to the git-annex branch would involve at best
around a 16% overhead, which is a surprising amount.</p>
<p>(It would be possible to make <code>git-annex forget --drop-dead</code> remove the
information about old migrated keys if they later get marked as dead, and
so regain the space.)</p>
<p>This is also rather redundant information to store in git, since most
of the time when file foo has been migrated, the old key can be determined
by looking at <code>git log foo</code>. Not always of course because foo might have
been renamed after migration, for example.</p>
<p>Another way to store migration information in the git-annex branch would be
to graft in the pre-migration tree and the post-migration tree. Diffing
those two trees would show what migrated, and most of the time this would
use almost no additional space in git, because the user will have committed
both those trees anyway, or something very close to them. But it would be
more expensive to extract the migration information then, and this would
need a local cache of migrations to be built up from examining those diffs..</p>
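<p>To make the tree-diffing idea concrete, here is a small Python sketch. It assumes each grafted tree has already been read into a mapping from file path to key (how that is read out of git is left aside), and the paths and keys are made up. The resulting dictionary stands in for the local cache of migrations mentioned above.</p>

```python
from typing import Dict


def migrations_between(
    pre_tree: Dict[str, str], post_tree: Dict[str, str]
) -> Dict[str, str]:
    """Map each old key to its new key for paths whose key changed."""
    migrations: Dict[str, str] = {}
    for path, old_key in pre_tree.items():
        new_key = post_tree.get(path)
        # Paths missing from the post-migration tree, or unchanged, are skipped.
        if new_key is not None and new_key != old_key:
            migrations[old_key] = new_key
    return migrations


# Hypothetical pre- and post-migration trees.
pre = {
    "foo": "SHA256-s1234--aaaa",
    "bar": "SHA256-s99--bbbb",
    "baz": "SHA512-s5--cccc",
}
post = {
    "foo": "SHA512-s1234--dddd",
    "bar": "SHA512-s99--eeee",
    "baz": "SHA512-s5--cccc",
}

migration_cache = migrations_between(pre, post)
```

<p>Because the two trees are snapshots taken around the migration itself, a later rename of foo does not affect the result, unlike the <code>git log foo</code> approach.</p>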
Re: simpler proposalhttp://git-annex.branchable.com/todo/alternate_keys_for_same_content/comment_10_22ff867952875856b20339a8829c5944/joey2023-12-01T18:43:01Z2023-12-01T18:00:20Z
<p>About the idea of recording a checksum of the content of a URL or WORM key,
without migrating to a SHA key, that does seem worth considering. (And
maybe was the original idea of this todo really..)</p>
<p>If that were implemented, it would be necessary for more than one checksum
to be able to be recorded for a given URL key. Because different
clones might get different content from the URL and each add its checksum.</p>
<p>So, this would not be as strong an assurance as using a SHA key that you're
referring to a specific piece of data. It would be useful to protect
against bit rot, but not as a way to pin a file to a particular version.
Which is often something one does want to do in a git repository!</p>
<p>I do think that implementing that would be a lot simpler. And it would
only affect performance when verifying the content of URL or WORM keys,
when it would need to look up the checksum in the git-annex branch.</p>
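<p>A small Python sketch of recording several checksums per key and verifying against any of them. The in-memory dictionary stands in for whatever would be stored in the git-annex branch, and the function names, the example key, and the choice of SHA-256 are all assumptions.</p>

```python
import hashlib
from typing import Dict, Optional, Set

# Hypothetical stand-in for per-key checksum records in the git-annex branch.
recorded_checksums: Dict[str, Set[str]] = {}


def record_checksum(key: str, content: bytes) -> None:
    """Add the SHA-256 of this content to the set recorded for the key."""
    digest = hashlib.sha256(content).hexdigest()
    recorded_checksums.setdefault(key, set()).add(digest)


def verify(key: str, content: bytes) -> Optional[bool]:
    """Return None if nothing is recorded, else whether any checksum matches."""
    known = recorded_checksums.get(key)
    if not known:
        return None
    return hashlib.sha256(content).hexdigest() in known


# Two clones fetched different content from the same URL and each recorded it.
url_key = "URL--http://example.com/file"
record_checksum(url_key, b"version one")
record_checksum(url_key, b"version two")
```

<p>Both versions verify, while corrupted content does not. That illustrates the limit described above: the check catches bit rot, but it cannot pin the key to one particular version.</p>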
comment 11http://git-annex.branchable.com/todo/alternate_keys_for_same_content/comment_11_3323eff3d94d366595bf2b7e78c01dce/joey2023-12-01T18:43:01Z2023-12-01T18:41:30Z
See <a href="http://git-annex.branchable.com/todo/distributed_migration/">distributed migration</a>...