alternate keys for same content

Designate a metadata field, say alt_keys, to store alternate keys for the content designated by the key that carries the metadata. Then, after initially adding a URL key and later retrieving its content, a checksum-based key such as MD5 could be added to the URL key's metadata. Without needing to migrate, the URL key could then be treated like a checksum-based key, e.g. downloaded from untrusted remotes, fsck'ed, etc.

The problem with migrating keys is that a separate copy of the contents is stored in the annex under the old key; but if you force-drop that, symlinks in older commits will become invalid. You could rewrite git history, but that brings its own problems.

Also, sometimes one can determine the MD5 from the URL without downloading the file; e.g. with gsutil stat for gs:// URIs, or by downloading an .md5 file stored next to the main file, or because an MD5 was computed by a workflow manager that produced the file (Cromwell does this). The special remote's "CHECKURL" implementation could record an MD5E key in the alt_keys metadata field of the URL key. Then 'addurl --fast' could check alt_keys, and store in git an MD5E key rather than a URL key, if available.
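
The key selection described above could look roughly like the following. This is only a sketch, not git-annex code: the `alt_keys` dict stands in for the proposed metadata field, and `CHECKSUM_BACKENDS`, `backend_of` and `choose_key` are hypothetical names introduced for illustration.

```python
# Backends treated as checksum-based in this sketch (an assumed, partial list).
CHECKSUM_BACKENDS = {
    "MD5", "MD5E", "SHA1", "SHA1E",
    "SHA256", "SHA256E", "SHA512", "SHA512E",
}


def backend_of(key: str) -> str:
    """Return the backend prefix of a key such as 'MD5E-s1234--hash.ext'."""
    return key.split("-", 1)[0]


def choose_key(url_key: str, alt_keys: dict[str, list[str]]) -> str:
    """Prefer a recorded checksum-based alternate key over the URL key.

    alt_keys maps a key to the alternate keys recorded for it; here it is
    a plain dict standing in for the proposed alt_keys metadata field.
    """
    for candidate in alt_keys.get(url_key, []):
        if backend_of(candidate) in CHECKSUM_BACKENDS:
            return candidate
    # Nothing better is known, so keep the URL key.
    return url_key
```

With this shape, 'addurl --fast' would only ever replace a URL key when CHECKURL (or some other source) had already recorded a checksum for it.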

comment 1

This would mean that, every time something about a key is looked up in the git-annex branch, it would also need to look at the metadata to see if this alt_keys field is set.

So it doubles the time of every single query of the git-annex branch.

I don't think that's a good idea; querying the git-annex branch is already often a bottleneck for commands.

Comment by joey — Thu Jan 30 18:36:17 2020

alternate keys

"every time something about a key is looked up in the git-annex branch, it would also need to look at the metadata to see if this alt_keys field is set" -- not every time, just when checking if the key is checksum-based, and if content matches the checksum. Also, isn't metadata cached in a database?

Comment by Ilya_Shlyakhter — Fri Jan 31 19:23:35 2020

Re: comment 1

The proposed implementation may be inefficient, but the idea has merit.

What if that information is stored in a place where it can be used to verify migrations?

For example, when recording in git-annex:aaa/bbb/SHA1-s1234--somehash.log that the migrating remote dropped the data, a record could be added somewhere near there that this key was migrated to SHA512-s1234--longerhash. When all the other remotes are then asked to drop that file, they can actually do so, because they see that it has been migrated, can verify the migration, and are free to drop the file.

Even later, when a remote wants to get content under an old name (e.g. because it checked out an old version of master), it can look up the key, find where it was migrated to, and make the data available under the old name (by copying, or maybe by placing a symlink at .git/annex/objects/Aa/Bb/SHA1-s1234--somehash/SHA1-s1234--somehash that points to the new key's object).
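
To make the lookup concrete, here is a minimal sketch of following such migration records from an old key to its current one. The `migrations` dict is a stand-in for whatever record would be stored near the location log, and `resolve_migrated` is a hypothetical helper, not an existing git-annex function.

```python
def resolve_migrated(key: str, migrations: dict[str, str]) -> str:
    """Follow recorded migrations from an old key to the newest known key.

    migrations maps an old key to the key it was migrated to.
    A key with no record is returned unchanged.
    """
    seen = set()
    while key in migrations:
        if key in seen:
            # Records could be wrong or malicious; refuse to loop forever.
            raise ValueError(f"migration cycle involving {key}")
        seen.add(key)
        key = migrations[key]
    return key
```

A remote that checked out an old commit would call this with the old key and then expose the resolved key's content under the old name.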

Comment by chrysn — Fri Jan 31 19:47:59 2020

comment 4

"can be used to verify migrations" -- my hope was to avoid migrations, i.e. to get the benefit you'd get from migrating to a checksum-based key, without doing the migration.

Comment by Ilya_Shlyakhter — Fri Jan 31 20:32:00 2020

simpler proposal

So, to fully and properly implement what the title of this todo suggests -- "alternate keys for same content" -- might be hard. But to simply enable adding checksums to WORM/URL keys, stored separately on the git-annex branch rather than encoded in the key's name, is simpler. This would let some WORM/URL keys be treated as checksum-based keys when getting contents from untrusted remotes or when checking integrity with git-annex-fsck. But this isn't really "alternate keys for same content": the content would be stored under only the WORM/URL key under which it was initially recorded. The corresponding MD5 key would not be recorded in location tracking as present.

Checking whether a WORM/URL key has an associated checksum could be sped up by keeping a Bloom filter representing the set of WORM/URL keys for which alt_keys is set.
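
A minimal sketch of such a filter follows. The sizing (8192 bits, 4 hash functions) and the use of SHA-256 to derive bit positions are arbitrary assumptions for illustration; nothing here reflects how git-annex would actually store or size it.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over key names; parameters are illustrative only."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Only when `might_contain` returns True would the slower alt_keys lookup be needed; a False answer is always reliable, so keys without a recorded checksum usually cost nothing extra.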

In the addurl --fast case for special remotes, where the remote can determine a file's checksum without downloading, a checksum-based key would be recorded to begin with, as happens with addurl without --fast. Currently I do this by manually calling plumbing commands like git-annex-setpresentkey, but having addurl do it seems better.

Comment by Ilya_Shlyakhter — Fri Jan 31 21:46:57 2020

comment 6

There is also aaa/bbb/*.log.cid in git-annex branch for "per-remote content identifiers for keys". It could be another place to store alternate keys, but it is per-remote, so... no.

As for the metadata field alt_keys — it is another case of "setting a metadata field to a key" in Bidirectional metadata.

Also, there is an interesting idea of git-annex-migrate using git-replace.

By the way, as far as I know (maybe things have changed since then), IPFS has a similar problem of different identifiers for the same content, because the identifier encodes how things are stored. And hash functions can also be changed.

Comment by Chel — Sat Feb 1 02:32:01 2020

potential security issues?

I wonder if storing checksums in a general-purpose mutable metadata field may cause security issues. Someone could use the git-annex-metadata command to overwrite the checksum. It should be stored in a read-only field written only by git-annex itself, like the field-lastchanged metadata already is.

Of course, if someone is able to write the git-annex branch directly, or get the user to pull merges to it, they could alter the checksum stored there. Maybe, only trust stored checksums if merge.verifySignatures=true?

Comment by Ilya_Shlyakhter — Thu Feb 6 21:00:55 2020

comment 8

I think Ilya Shlyakhter gets to a fundamental problem in his comment above. Any way of storing data about an alternate key that is recorded in git allows anyone to spoof bad data.

For example, if I have a SHA256 key stored in git-annex, it would be a bad security hole if I fetched from Ilya's repository and suddenly git-annex was willing to accept some MD5 key as being the same content as my SHA256 key. Even if the two keys had the same content currently, that MD5 key can be collision attacked later.

So there would need to be a direction in which key upgrades were allowed. Which is fine for WORM -> SHA256, but less clear for SHA1 -> SHA256 and much less clear for other pairs of modern hashes.
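
As a rough illustration of such a direction rule, the sketch below accepts only the clear-cut case and refuses everything else. The two backend sets and the name `upgrade_allowed` are assumptions made for this example; which pairs should really be allowed is exactly the open question.

```python
# Backends whose key names carry no content checksum (assumed set).
NON_CHECKSUM_BACKENDS = {"WORM", "URL"}
# Backends accepted as upgrade targets in this sketch (assumed set).
ACCEPTED_TARGETS = {"SHA256", "SHA256E", "SHA512", "SHA512E"}


def upgrade_allowed(old_key: str, new_key: str) -> bool:
    """Allow only non-checksum -> strong-hash upgrades.

    Anything less clear (SHA1 -> SHA256, or one modern hash to another)
    is conservatively refused here.
    """
    old_backend = old_key.split("-", 1)[0]
    new_backend = new_key.split("-", 1)[0]
    return old_backend in NON_CHECKSUM_BACKENDS and new_backend in ACCEPTED_TARGETS
```

Under this rule a pulled record claiming that a SHA256 key is equivalent to some MD5 key would simply be ignored.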

Comment by joey — Thu Nov 30 20:43:53 2023

git-annex branch size when storing migration information

I did a small experiment to gauge how much the git repo size would grow if migration were recorded in log files in the git-annex branch.

In my experiment, I started with 1000 files using sha256. The size of the git objects (after repack by git gc --aggressive) was 0.5 mb. I then migrated them to sha512, which increased the size of git objects to 1.1 mb (after repacking).

Then I recorded in the git-annex branch additional log files for each of the sha512 keys that contained the corresponding sha256 key. That grew the git objects to 1.4 mb after repacking.

This was a little disappointing. I'd hoped that repacking would avoid duplication of the sha256 keys, which are both in the log files I wrote and are used as filenames. But the data I wrote to the logs is only 75 kb total, and git grew 4x that.

I tried the same thing except instead of separate log files I added to git one log file that contained pairs of sha256 and sha512 keys. That log file was 213 kb and adding it to the git repo grew it by 102 kb. So there was some compression there, but less than I would have hoped, and not much better than just gzip -9 of the log file (113 kb). Of course putting all the migration information in a single file like this would add a lot of complexity to accessing it.

So adding this information to the git-annex branch would involve at best around a 16% overhead, which is a surprising amount.

(It would be possible to make git-annex forget --drop-dead remove the information about old migrated keys if they later get marked as dead, and so regain the space.)

This is also rather redundant information to store in git, since most of the time when file foo has been migrated, the old key can be determined by looking at git log foo. Not always of course because foo might have been renamed after migration, for example.

Another way to store migration information in the git-annex branch would be to graft in the pre-migration tree and the post-migration tree. Diffing those two trees would show what migrated, and most of the time this would use almost no additional space in git, because the user will have committed both those trees anyway, or something very close to them. But it would be more expensive to extract the migration information then, and this would need a local cache of migrations to be built up from examining those diffs.
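
The diffing step might be sketched as follows, with each tree reduced to a plain mapping from file path to key. Reading real git trees is left out, and `migrations_from_trees` is a hypothetical name used only for this example.

```python
def migrations_from_trees(
    before: dict[str, str], after: dict[str, str]
) -> dict[str, str]:
    """Build an old-key -> new-key cache from two path -> key mappings.

    before and after stand in for the grafted pre-migration and
    post-migration trees. A path whose key differs between the two
    is taken to be a migration.
    """
    migrated: dict[str, str] = {}
    for path, old_key in before.items():
        new_key = after.get(path)
        if new_key is not None and new_key != old_key:
            migrated[old_key] = new_key
    return migrated
```

The returned dict is the kind of local cache mentioned above; it would have to be rebuilt or extended whenever a new pair of trees shows up.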

Comment by joey — Fri Dec 1 16:10:11 2023

Re: simpler proposal

About the idea of recording a checksum of the content of a URL or WORM key, without migrating to a SHA key, that does seem worth considering. (And maybe was the original idea of this todo really..)

If that were implemented, it would be necessary for more than one checksum to be able to be recorded for a given URL key. Because different clones might get different content from the URL and each add its checksum.

So, this would not be as strong an assurance as using a SHA key that you're referring to a specific piece of data. It would be useful to protect against bit rot, but not as a way to pin a file to a particular version. Which is often something one does want to do in a git repository!

I do think that implementing that would be a lot simpler. And it would only affect performance when verifying the content of URL or WORM keys, when it would need to look up the checksum in the git-annex branch.
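
A minimal sketch of that verification step, assuming the recorded checksums have already been read from the git-annex branch into a list of (algorithm, hex digest) pairs. The storage format and the name `content_matches_any` are assumptions for illustration.

```python
import hashlib


def content_matches_any(content: bytes, recorded: list[tuple[str, str]]) -> bool:
    """Return True if content matches at least one recorded checksum.

    recorded holds (algorithm, hex digest) pairs, e.g. ("md5", "5d41...").
    Several entries may exist because different clones may have seen
    different content at the same URL.
    """
    for algorithm, expected_hex in recorded:
        actual = hashlib.new(algorithm, content).hexdigest()
        if actual == expected_hex.lower():
            return True
    return False
```

Accepting a match against any entry is what makes this weaker than a SHA key: it detects bit rot, but it cannot say which of the recorded versions was intended.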

Comment by joey — Fri Dec 1 18:00:20 2023

comment 11

See distributed migration...

Comment by joey — Fri Dec 1 18:41:30 2023

Tags: unlikely

Last edited Wed Jun 17 01:18:32 2020