Currently, SHA256E creates duplicate files for different extensions, i.e.:
$ l * && l -Li * && sha256sum *
lrwxrwxrwx 1 atemu users 198 2021-05-04 03:47 random.1 -> .git/annex/objects/F9/Kk/SHA256E-s104857600--2fdbdc9c3b23d1986a743aede593765e57ade9f173f9fd9766057f0efd63197a.1/SHA256E-s104857600--2fdbdc9c3b23d1986a743aede593765e57ade9f173f9fd9766057f0efd63197a.1
lrwxrwxrwx 1 atemu users 198 2021-05-05 10:01 random.2 -> .git/annex/objects/Pm/J1/SHA256E-s104857600--2fdbdc9c3b23d1986a743aede593765e57ade9f173f9fd9766057f0efd63197a.2/SHA256E-s104857600--2fdbdc9c3b23d1986a743aede593765e57ade9f173f9fd9766057f0efd63197a.2
3720 -r--r--r-- 1 atemu users 100M 2021-05-04 03:47 random.1
49696 -r--r--r-- 1 atemu users 100M 2021-05-05 10:01 random.2
2fdbdc9c3b23d1986a743aede593765e57ade9f173f9fd9766057f0efd63197a random.1
2fdbdc9c3b23d1986a743aede593765e57ade9f173f9fd9766057f0efd63197a random.2
These have the exact same content though; they could be hardlinks of one another instead and nothing would change.
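To make the idea concrete, here is a minimal sketch (not how git-annex does anything) of replacing one of two identical files with a hard link to the other. The helper names and the `.tmp-link` suffix are made up for illustration, and it ignores that real annex object files and their directories are write-protected.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def link_if_identical(keep, duplicate):
    """Replace `duplicate` with a hard link to `keep` if the contents match.

    Returns True if both names refer to the same inode afterwards.
    """
    if os.path.samefile(keep, duplicate):
        return True  # already hard links of one another
    if sha256_of(keep) != sha256_of(duplicate):
        return False
    tmp = duplicate + ".tmp-link"  # hypothetical temporary name
    os.link(keep, tmp)             # create the new link first
    os.replace(tmp, duplicate)     # then atomically swap it into place
    return True


def demo():
    """Create two identical files and deduplicate them."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "random.1")
        b = os.path.join(d, "random.2")
        for path in (a, b):
            with open(path, "wb") as f:
                f.write(b"same bytes" * 1000)
        linked = link_if_identical(a, b)
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        return linked, same_inode, os.stat(a).st_nlink
```

After `demo()` both names share one inode, so the data is stored only once.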
There is some precedent for this; git annex migrate hard links the annex objects for the old and new key.
But hard links are also used for unlocked files with annex.thin, and when the object file is already hard linked somewhere else, git-annex is unable to hard link it to the unlocked file location, and so annex.thin doesn't work.
It would perhaps make sense for such a hard link to be broken when populating an unlocked file when annex.thin is enabled, i.e. replace the object file with a copy of itself, which can then be hard linked to the unlocked file. That would also improve the situation when it's been migrated.
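A rough sketch of that "break the link, then hard link" step, assuming plain files in a scratch directory. `break_hard_link`, `populate_thin` and the `.tmp-copy` suffix are hypothetical names, not git-annex functions.

```python
import os
import shutil
import tempfile


def break_hard_link(path):
    """If `path` has other hard links, replace it with a private copy."""
    if os.stat(path).st_nlink <= 1:
        return False
    tmp = path + ".tmp-copy"  # hypothetical temporary name
    shutil.copy2(path, tmp)   # copies data and metadata to a new inode
    os.replace(tmp, path)     # atomically swap the copy into place
    return True


def populate_thin(object_file, unlocked_file):
    """Hard link an object file to an unlocked file, breaking other links first."""
    break_hard_link(object_file)
    os.link(object_file, unlocked_file)


def demo():
    with tempfile.TemporaryDirectory() as d:
        old_key = os.path.join(d, "object.old")  # e.g. left over from a migration
        new_key = os.path.join(d, "object.new")
        worktree = os.path.join(d, "unlocked")
        with open(old_key, "wb") as f:
            f.write(b"content")
        os.link(old_key, new_key)
        populate_thin(new_key, worktree)
        return (
            os.path.samefile(new_key, worktree),  # thin link exists
            os.path.samefile(old_key, new_key),   # old key no longer shares the inode
            os.stat(old_key).st_nlink,
        )
```

The cost is one extra copy of the content, but the other key's object can no longer be modified through the unlocked file.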
This feature would also need a way to find all the other equivalent keys that are in use, which would have to be done whenever an object file gets populated. So it would need to be very fast, otherwise it would slow down e.g. git annex get. While the keys database does have the necessary information in it (finally), it would need to select keys matching a pattern, which I doubt would be an optimised query, even if the index in the database can be used. (See https://sqlite.org/optoverview.html#like_opt) Even if the query were optimised, it could turn out to be a significant speed hit.
And I have a feeling that most people bothered by this duplication would just do better to migrate away from SHA256E to SHA256.
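For illustration, the kind of pattern query this would need might look like the sketch below. The one-column `keys` table is an assumption made up for the example and is not the real git-annex keys database schema. Whether SQLite can use an index for the LIKE depends on the conditions described at the link above.

```python
import sqlite3


def equivalent_keys(conn, key):
    """Find other stored keys that share the size-and-hash part of `key`."""
    base = key.split(".", 1)[0]  # drop the extension from a SHA256E key
    # Hypothetical schema: a table `keys` with a single TEXT column `key`.
    rows = conn.execute(
        "SELECT key FROM keys WHERE key LIKE ? AND key != ? ORDER BY key",
        (base + ".%", key),
    )
    return [row[0] for row in rows]


def demo():
    digest = "2fdbdc9c3b23d1986a743aede593765e57ade9f173f9fd9766057f0efd63197a"
    other = "0" * 64  # placeholder hash for an unrelated key
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keys (key TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO keys VALUES (?)",
        [
            ("SHA256E-s104857600--" + digest + ".1",),
            ("SHA256E-s104857600--" + digest + ".2",),
            ("SHA256E-s104857600--" + other + ".1",),
        ],
    )
    found = equivalent_keys(conn, "SHA256E-s104857600--" + digest + ".1")
    conn.close()
    return found
```

Here `demo()` finds only the `.2` key. A query like this would have to run every time an object file is populated, which is where the speed concern comes from.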
Why? You should be able to create as many hardlinks of any hardlink as you please or am I missing something? A "regular" file is just a hardlink with refcount = 1 and if it works there, it should work for any other refcount.
A simpler approach might be to change the file layout so that files with the same hash are always in the same directory, I.e.:
->
This is basically the SHA256 backend, but the files in the final dir have the respective extensions like in the SHA256E backend and are hardlinks of one another (SHA256Hybrid/SHA256Hardlink?).
Lookups should be super cheap with this method because files with the same hash are simply in the same directory as all the others. These probably shouldn't be treated as separate keys either but as "instances" of the same key.
@Atemu because the hard link to an unlocked file risks it being modified, and losing the content of another key is outside the risk profile allowed by annex.thin.
Also it could be hard linked for other reasons, like a local clone does, and that also should not be exposed to modification.
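The risk is easy to show outside of git-annex. In this small sketch (file names are made up), an in-place write through one hard link changes what every other name for that inode sees:

```python
import os
import tempfile


def demo():
    with tempfile.TemporaryDirectory() as d:
        obj = os.path.join(d, "object")         # stands in for an annex object
        unlocked = os.path.join(d, "unlocked")  # stands in for a thin unlocked file
        with open(obj, "wb") as f:
            f.write(b"original")
        os.link(obj, unlocked)
        # Edit the unlocked file in place, as an editor or script might.
        with open(unlocked, "r+b") as f:
            f.write(b"MODIFIED")
        with open(obj, "rb") as f:
            return f.read()
```

`demo()` returns the modified bytes even though only the "unlocked" name was written to. If a second key's object were a third link to the same inode, its content would be lost as well.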
Moving all the object files to the same directory does not help in the case of git-annex get, where the object files don't exist yet. It would still need a way to find all the equivalent keys to populate those.
I see. I guess annex.thin's risk profile would simply need to be different when using SHA256H then. If you use annex.thin with hardlinks, you're at high risk of accidental modification anyways, so that wouldn't be much of a change tbh.
Also, since it'd be the exact same file and key, such behaviour should be expected and would be exactly the same as with the SHA256 backend.
I'm not 100% familiar with how get works but what I'm thinking of might work like this:
git annex get random.1
SHA256H should behave just like SHA256 here; it gets the SHA256-indexed key with the same logic. The only difference should be that it adds another file with a suffix when writing.
But making a hardlink from SHA256-foo.1 to SHA256-foo does not help when another file calls it SHA256-foo.2 ...
Ah I see. The whole tree would have to be scanned for SHA256H references of a certain SHA256 key on get and tree-modifying commands like checkout.
With the existing backends that doesn't need to happen since the key existing in the local repo implies that it is reachable via the symlinks in the checkout.
Trickier than I had anticipated, thank you for your insights!
Maybe the extension aliases of a key could be recorded and simply applied in every repo a key exists in. This might require another lookup on get (though it could perhaps be done in the same lookup) but, since it'd require a lookup of a key anyways under SHA256E because they're entirely separate, we wouldn't lose on performance either.
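As a very rough sketch of what I mean, with everything here hypothetical (`AliasRecord`, `populate_aliases`, the flat objects directory and the example key are made up, and none of it exists in git-annex): record the extensions seen per base key, union the records between repos, and create one hard-linked name per extension when the content arrives.

```python
import os
import tempfile
from collections import defaultdict


class AliasRecord:
    """Hypothetical record of which extensions are in use for each base key."""

    def __init__(self):
        self._extensions = defaultdict(set)

    def add(self, base_key, extension):
        self._extensions[base_key].add(extension)

    def extensions(self, base_key):
        return sorted(self._extensions.get(base_key, ()))

    def merge(self, other):
        """Union in the aliases learned from another repository."""
        for base_key, exts in other._extensions.items():
            self._extensions[base_key] |= exts


def populate_aliases(objects_dir, base_key, content, record):
    """Write the content once, then hard link one name per recorded extension."""
    exts = record.extensions(base_key)
    if not exts:
        return []
    first = os.path.join(objects_dir, base_key + exts[0])
    with open(first, "wb") as f:
        f.write(content)
    paths = [first]
    for ext in exts[1:]:
        path = os.path.join(objects_dir, base_key + ext)
        os.link(first, path)
        paths.append(path)
    return paths


def demo():
    base_key = "SHA256E-s7--abc123"  # made-up key, not a real hash
    local, remote = AliasRecord(), AliasRecord()
    local.add(base_key, ".1")
    remote.add(base_key, ".2")
    local.merge(remote)              # what propagating aliases would amount to
    with tempfile.TemporaryDirectory() as d:
        paths = populate_aliases(d, base_key, b"content", local)
        names = [os.path.basename(p) for p in paths]
        return names, os.stat(paths[0]).st_nlink
```

The merge being a plain set union is what makes me think the lookup itself could stay cheap; keeping the record up to date across repos is the hard part.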
The only problem would be propagating new aliases on sync...
Right. It is possible to get this into a sqlite database and update it incrementally. The keys database is populated this way, but that's by tracking changes to the HEAD branch. This would instead involve tracking changes to the git-annex branch.
See cache key info for discussion of this kind of thing.