In both the .git/annex directory and the git-annex branch, two levels of hash directories are used, to avoid issues with too many files in one directory.
Two separate hash methods are used.
hashdirmixed is only used for non-bare git repositories. (We'd like to stop using this, but it'd be too annoying to change all the git-annex symlinks!)
hashdirlower is used for bare git repositories, the git-annex branch, and on special remotes as well.
Note that git annex find and git annex examinekey can be used with the --format option to find the hash directories. The explanation below is only for completeness.
new hash format
This uses two directories, each with a three-letter name, such as "f87/4d5".
The directory names come from the first 6 characters of the md5sum of the key when serialized as a hex string.
For example:
echo -n "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" | md5sum
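The same computation can be sketched in Python. This is only an illustration of the description above, not code from git-annex; the function name hashdirlower is borrowed from the text, and the sketch assumes the key is serialized as UTF-8 before hashing.

```python
import hashlib


def hashdirlower(key: str) -> str:
    """Return the two-level hash directory for a key, in the form 'xxx/yyy'."""
    # Hex md5sum of the serialized key (assumption: UTF-8 encoding).
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # The first 6 hex characters give the two three-letter directory names.
    return f"{digest[:3]}/{digest[3:6]}"


if __name__ == "__main__":
    key = "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    print(hashdirlower(key))
```

The printed value should match the first 6 characters of the md5sum shown by the shell command, with a slash after the third character.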
old hash format
This uses two directories, each with a two-letter name, such as "pX/1J".
It takes the md5sum of the key, but rather than treating it as a string, represents it as four 32-bit words. Only the first word is used. It is converted into a string by the same mechanism that would be used to encode a normal md5sum value into a string, but where that would normally encode the bits using the 16 characters 0-9a-f, this instead uses the 32 characters "0123456789zqjxkmvwgpfZQJXKMVWGPF". The first 2 letters of the resulting string are the first directory, and the second 2 are the second directory.
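A rough Python sketch of this computation follows. It is not git-annex's own code, and it relies on three details that the description above does not spell out: the first word is assumed to be read little-endian from the first 4 bytes of the digest, each character is assumed to come from 5 bits with 1 bit skipped between groups, and the two characters of each pair are assumed to be swapped. All three should be checked against display_32bits_as_dir in Locations.hs. The names dir_from_word and hashdirmixed are chosen for illustration.

```python
import hashlib
import struct

# The 32 characters named in the text above.
ALPHABET = "0123456789zqjxkmvwgpfZQJXKMVWGPF"


def dir_from_word(word: int) -> str:
    """Encode a 32-bit word as an 'xx/yy' directory pair."""
    # Take 5 bits at a time, stepping by 6 bits, so 1 bit is skipped
    # between consecutive groups (assumption, see lead-in).
    chars = [ALPHABET[(word >> (6 * i)) & 31] for i in range(4)]
    # Assumption: the characters of each pair are swapped.
    return f"{chars[1]}{chars[0]}/{chars[3]}{chars[2]}"


def hashdirmixed(key: str) -> str:
    """Return the old-format hash directory for a key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Assumption: the first 32-bit word is little-endian.
    (word,) = struct.unpack("<I", digest[:4])
    return dir_from_word(word)


if __name__ == "__main__":
    key = "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    print(hashdirmixed(key))
```

Splitting out dir_from_word makes the bit handling easy to check on small inputs, independently of md5.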
chunk keys
The same hash directory is used for a chunk key as would be used for the key that it's a chunk of.
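One way to picture this is to strip the chunk fields from the key before hashing it. The sketch below is an illustration only. It assumes chunk keys carry extra fields of the form -S<number> (chunk size) and -C<number> (chunk number) before the "--" separator; that layout is not described on this page, so treat it as an assumption. The helper names non_chunk_key and hashdirlower are hypothetical.

```python
import hashlib
import re


def non_chunk_key(key: str) -> str:
    """Return the key with any chunk fields removed."""
    # Fields come before the first "--"; the key name comes after it.
    fields, sep, name = key.partition("--")
    # Assumption: chunk fields look like -S<digits> and -C<digits>.
    fields = re.sub(r"-[SC]\d+", "", fields)
    return fields + sep + name


def hashdirlower(key: str) -> str:
    """New-format hash directory, computed on the non-chunk key."""
    digest = hashlib.md5(non_chunk_key(key).encode("utf-8")).hexdigest()
    return f"{digest[:3]}/{digest[3:6]}"


if __name__ == "__main__":
    parent = "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    # Hypothetical chunk key built from the parent, for illustration.
    chunk = parent.replace("-s0--", "-s0-S1000-C1--")
    print(hashdirlower(parent) == hashdirlower(chunk))
```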
The correct old hash value for the empty file SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 is pX/ZJ.
The text describes the old hash value computation incorrectly, because it doesn't mention that 1 bit is skipped between each group of 5 bits. See the sample implementation in display_32bits_as_dir in https://github.com/joeyh/git-annex/blob/master/Locations.hs
1c to support Péter's statement:
The only reason for the letter choice is that it avoids making random words with possibly unintentional meanings.
why the extra processing to generate the hashing directories?
We already have a hash here. For example,
SHA256E-s8242375--5f82490990812ad3feabb02355750710a9d94283ab256d1c691c3bf8d7d9fbe3.ogg
contains the long hash 5f82490990812ad3feabb02355750710a9d94283ab256d1c691c3bf8d7d9fbe3.
Why not use the first characters of that? It will not change for a given file, and it has a higher chance of generating collisions (which is a good thing here, because we can reuse directories). In other words, why aren't the hash directories of
SHA256E-s8242375--5f82490990812ad3feabb02355750710a9d94283ab256d1c691c3bf8d7d9fbe3.ogg
simply 5f8/249? --anarcat