avoid duplicating key twice in symlink to object file

The link targets of annexed files are currently very long. This creates problems, e.g. when browsing directories in Emacs (I mostly work through a text terminal). Ideally, the key would not be repeated in the link, but I understand this is hard to do compatibly. Maybe the following simpler alternative could be implemented? Key checksums are currently represented in base16, using only the characters 0-9a-f. The same information could be represented with shorter strings using base64url or another encoding that draws on a larger range of characters. So for each backend you'd add a corresponding one that does the same thing, but encodes the checksum part of the key as a shorter string.
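To make the size difference concrete, here is a minimal sketch that encodes the same SHA-256 digest both ways. The function name `checksum_encodings` is made up for illustration and is not part of git-annex; dropping the trailing `=` padding is also an assumption about how such a backend might write the checksum.

```python
import base64
import hashlib

def checksum_encodings(content: bytes):
    """Return (base16, base64url) text forms of the SHA-256 digest of content."""
    digest = hashlib.sha256(content).digest()  # 32 raw bytes
    hex_form = digest.hex()                    # base16: 2 characters per byte
    # base64url: 4 characters per 3 bytes; padding stripped (an assumption)
    b64url_form = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return hex_form, b64url_form

hex_form, b64url_form = checksum_encodings(b"example content")
print(len(hex_form), len(b64url_form))  # 64 43
```

Both strings carry the same 256 bits; only the alphabet differs.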

Or, if you're tired of backend requests, maybe implement a scheme for external backends, like the one for external special remotes? For external backend EXTNNN the user would put a script git-annex-external-backend-NNN in the path; the script would support commands like calckey and examinekey. Then I could also implement e.g. canonicalizing backends that strip away variable but semantically irrelevant information before computing the checksum.
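As a rough illustration of what a canonicalizing checksum could look like, the sketch below treats CRLF and LF line endings as equivalent before hashing. Both the function name `canonical_checksum` and the choice of line endings as the "semantically irrelevant" detail are invented for this example; no git-annex interface is used or implied.

```python
import hashlib

def canonical_checksum(content: bytes) -> str:
    """Hash content after normalizing line endings (hypothetical canonicalization)."""
    canonical = content.replace(b"\r\n", b"\n")
    return hashlib.sha256(canonical).hexdigest()

unix_text = b"alpha\nbeta\n"
dos_text = b"alpha\r\nbeta\r\n"

# The plain checksums differ, but the canonicalized ones agree.
raw_differs = hashlib.sha256(unix_text).hexdigest() != hashlib.sha256(dos_text).hexdigest()
canonical_matches = canonical_checksum(unix_text) == canonical_checksum(dos_text)
print(raw_differs, canonical_matches)
```

With such a backend, two files that differ only in the ignored detail would map to the same key.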

comment 1

The key is in the path twice as a security measure (the write bit is removed from the directory, to prevent rm -rfing all the files away by mistake).

This is known to cause slowness while traversing the objects directory, which is why there is repository tuning. Perhaps you want to set some of these?

Comment by CandyAngel — Thu Sep 27 07:39:57 2018

comment 2

"The key is in the path twice as a security measure" -- would key/f.txt be less secure than key/key.txt? I thought the security comes from having both a dir and a file, not from them both having the key in their name?
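The permission mechanism under discussion can be sketched as follows. This is only an illustration on a temporary directory, not git-annex code: the names `KEY`, `f` and `lock_down` are placeholders, and making the file itself read-only is an assumption. On POSIX systems, deleting a file requires write permission on its containing directory, so the directory's mode, not the file's name, is what blocks an accidental removal.

```python
import os
import stat
import tempfile

def lock_down(key_dir: str, object_file: str) -> None:
    """Remove write bits from the object file and from its directory."""
    os.chmod(object_file, 0o444)
    os.chmod(key_dir, 0o555)

with tempfile.TemporaryDirectory() as tmp:
    key_dir = os.path.join(tmp, "KEY")  # placeholder for the per-key directory
    os.mkdir(key_dir)
    object_file = os.path.join(key_dir, "f")  # the short name proposed above
    with open(object_file, "wb") as handle:
        handle.write(b"data")

    lock_down(key_dir, object_file)
    dir_mode = stat.S_IMODE(os.stat(key_dir).st_mode)
    dir_writable = bool(dir_mode & stat.S_IWUSR)
    print(oct(dir_mode), dir_writable)

    # Restore the write bit so the temporary directory can be cleaned up.
    os.chmod(key_dir, 0o755)
```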

Comment by Ilya_Shlyakhter — Thu Sep 27 11:09:04 2018

comment 3

I'm not convinced that git-annex should try to make the symlinks shorter just because some programs have UIs that don't work well with longer symlinks. UIs can be improved.

I like to use ls -lL for example, which conveniently avoids displaying the symlink target and also shows the size of the annexed file.

Using the MD5 backend will also give you much shorter symlinks.

External backends are an interesting idea, but needing to deal with the backend being missing or failing to work could have wide repercussions in the code base. It feels like too much complexity for too little gain.

Comment by joey — Thu Oct 4 18:35:28 2018

comment 4

"I'm not convinced that git-annex should try to make the symlinks shorter just because some programs have UIs that don't work well with longer symlinks" -- UI is just one plus of shorter keys. Another is that some systems can't handle long paths; e.g. the backends page says not to use the 512 or 384 hashes on Windows. Another is that long keys and symlinks increase the amount of data git deals with, which can matter for large repos. Using base64 encoding for hashes would shorten the checksum part of the key by a third; not repeating the key in symlinks would give roughly another factor of 2.
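A back-of-the-envelope check of those savings, as a sketch under stated assumptions: the `aa/bb` hash directories in `PREFIX` are placeholders rather than real values, the base64url key reuses the `SHA256` backend name purely for comparison (no such variant is assumed to exist), and `link_target_lengths` is a made-up helper.

```python
import base64
import hashlib

PREFIX = ".git/annex/objects/aa/bb/"  # placeholder; real hash directories differ

def link_target_lengths(content: bytes) -> dict:
    """Length of the symlink target under each layout and checksum encoding."""
    digest = hashlib.sha256(content).digest()
    b64 = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    hex_key = f"SHA256-s{len(content)}--{digest.hex()}"
    # Hypothetical: same key layout, checksum written in base64url instead.
    b64_key = f"SHA256-s{len(content)}--{b64}"
    return {
        "key/key, base16": len(PREFIX + hex_key + "/" + hex_key),
        "key/key, base64url": len(PREFIX + b64_key + "/" + b64_key),
        "key/f, base16": len(PREFIX + hex_key + "/f"),
        "key/f, base64url": len(PREFIX + b64_key + "/f"),
    }

lengths = link_target_lengths(b"example content")
for layout, n in lengths.items():
    print(f"{layout}: {n} characters")
```

Because the fixed prefix and the non-checksum parts of the key do not shrink, the overall saving on the whole link target is somewhat smaller than the per-checksum ratios suggest.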

Comment by Ilya_Shlyakhter — Fri Oct 12 01:05:30 2018

comment 5

I have been arguing for removal of the KEY-directory/ for a while; see e.g. as far back as https://github.com/datalad/datalad/issues/32 . There is an issue/discussion on this website too somewhere, but I couldn't find it quickly. IMHO it is just a "tech" problem, i.e. no design principle forbids fixing it. It might lead to performance issues though, since the containing directory would then need to be chmod'ed back and forth to make changes to the KEY-file under it, but that is probably very similar to what happens now anyway.

FWIW in DataLad we moved to the MD5E backend as the default to at least somewhat relieve the burden of long symlinks. I think we are "secure" enough for what we use DataLad for here.

Comment by yarikoptic — Fri Oct 12 13:42:07 2018

comment 6

Removing KEY/directory could give more savings, but sometimes there is more than one file there (e.g. key metadata), so the dir makes sense. But the content filename in the dir needn’t repeat the key. But changing that could be hard. Adding backend variants with base64-encoded checksums seems possible though?

Comment by Ilya_Shlyakhter — Fri Oct 12 13:58:05 2018

comment 7

Since there is a separate todo item external backends, let's not discuss that idea here.

key/f would have been a great idea to have had 10 years ago. (Although it does mean that if the object file somehow gets moved out of its directory, there's no indication in its name that it's a git-annex object file)

But if that's all this todo is about, we'd need some kind of transition plan for existing repos with history containing symlinks to key/key. I doubt there is a good way to make that transition.

Comment by joey — Thu Jan 30 18:49:47 2020

re: shorter symlinks

I need neither a transition strategy nor the safety. Create a tunable for new repos that disables the key directory completely (and, if that is not sustainable, even the read-only bit safety), and that is enough. I will replay my whole history onto a new repo and take fs snapshots more frequently. I can even live without base64. Changing defaults is undesirable, and that is understandable. But we would still prefer to have an option, even if we have to bear the whole brunt of the consequences ourselves. Sometimes I wonder if it is worth learning Haskell just to fork git-annex for specific needs, but that seems a lame reason for such an endeavor.