hashing

In both the .git/annex directory and the git-annex branch, two levels of hash directories are used, to avoid issues with too many files in one directory.

Two separate hash methods are used.

hashdirmixed is only used for non-bare git repositories. (We'd like to stop using this, but it'd be too annoying to change all the git-annex symlinks!)
hashdirlower is used for bare git repositories, the git-annex branch, and on special remotes as well.

Note that git annex find and git annex examinekey can be used with the --format option to find the hash directories. The explanation below is only for completeness.

new hash format

This uses two directories, each with a three-letter name, such as "f87/4d5".

The directory names come from the first 6 characters of the md5sum of the key when serialized as a hex string.

For example:

echo -n "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" | md5sum
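The same computation can be sketched in Python. This is only an illustration, not git-annex's own code: the function name `hashdirlower` is borrowed from the `--format` variable, and encoding the key as UTF-8 before hashing is an assumption that holds for plain ASCII keys like the one above.

```python
import hashlib


def hashdirlower(key: str) -> str:
    """Return the two-level lower-case hash directory for a key."""
    # md5sum of the serialized key, as a hex string
    # (assumption: the key is encoded as UTF-8 bytes).
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # The first three hex characters name the first directory,
    # the next three name the second.
    return f"{digest[:3]}/{digest[3:6]}"


key = "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(hashdirlower(key))
```

Note that git annex examinekey prints the directory with a trailing slash; the sketch leaves it off.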

old hash format

This uses two directories, each with a two-letter name, such as "pX/ZJ".

It takes the md5sum of the key but, rather than treating it as a string, represents it as four 32-bit words. Only the first word is used. It is converted into a string by the same mechanism that would be used to encode a normal md5sum value into a string, but where that would normally encode the bits using the 16 characters 0-9a-f, this instead uses the 32 characters "0123456789zqjxkmvwgpfZQJXKMVWGPF". The first 2 letters of the resulting string are the first directory, and the next 2 are the second directory.
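A rough Python sketch of this computation follows. It is not the reference implementation; for that, see display_32bits_as_dir in the git-annex source. Several details here are not spelled out in the description above and are taken from my reading of that function and of the comments on this page: the first four bytes of the md5 are read as a little-endian word, one bit is skipped after each group of 5 bits (so each character starts 6 bits after the previous one), and the two characters within each pair are swapped. The function name and the UTF-8 encoding of the key are assumptions of the sketch.

```python
import hashlib
import struct

# The 32 characters used in place of the usual hex digits.
ALPHABET = "0123456789zqjxkmvwgpfZQJXKMVWGPF"


def hashdirmixed(key: str) -> str:
    """Return the two-level mixed-case hash directory for a key."""
    # Raw md5 of the serialized key (assumption: UTF-8 encoded).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Only the first 32-bit word is used, read as little-endian.
    (word,) = struct.unpack("<I", digest[:4])
    # Take 5 bits at a time, advancing 6 bits between groups,
    # so 1 bit is skipped after each group.
    chars = [ALPHABET[(word >> (6 * i)) & 31] for i in range(4)]
    # Within each pair, the two characters are swapped.
    return f"{chars[1]}{chars[0]}/{chars[3]}{chars[2]}"


key = "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(hashdirmixed(key))
```

For the empty-file key used as the example, the result can be checked against the output of git annex examinekey --format='${hashdirmixed}', which the comments below report as pX/ZJ/ (with a trailing slash that the sketch omits).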

chunk keys

The same hash directory is used for a chunk key as would be used for the key that it's a chunk of.
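One way to picture this is to strip the chunk information from the key before hashing. The sketch below assumes that chunk keys carry their chunk size and chunk number as `-S<number>` and `-C<number>` fields before the `--` separator; that is my recollection of the key format, not something stated on this page. The helper name `non_chunk_key`, the example chunk key, and the UTF-8 encoding are likewise illustrative.

```python
import hashlib
import re


def non_chunk_key(key: str) -> str:
    """Drop the chunk fields from a key, leaving the key it is a chunk of."""
    # Fields come before the first "--"; the key name comes after it.
    fields, sep, name = key.partition("--")
    # Assumption: chunk size and chunk number appear as -S<n> and -C<n>.
    return re.sub(r"-[SC]\d+", "", fields) + sep + name


def hashdirlower(key: str) -> str:
    """Hash directory computed from the non-chunk form of the key."""
    digest = hashlib.md5(non_chunk_key(key).encode("utf-8")).hexdigest()
    return f"{digest[:3]}/{digest[3:6]}"


parent = "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
chunk = "SHA256E-s0-S1048576-C1--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
# Both keys end up in the same hash directory.
print(hashdirlower(parent) == hashdirlower(chunk))
```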

comment 1

The correct old hash value for the empty file SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 is pX/ZJ .

The text describes the old hash value computation incorrectly, because it doesn't mention that 1 bit is skipped between each group of 5 bits. See the sample implementation in display_32bits_as_dir in https://github.com/joeyh/git-annex/blob/master/Locations.hs

Comment by Péter — Fri Jan 31 00:45:47 2014

comment 2

1c to support Péter's statement:

$> git annex examinekey --format='${hashdirmixed}' "SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
pX/ZJ/%

Comment by Yaroslav — Thu Dec 4 20:26:47 2014

any particular reason for the chosen characters for base32 encoding

Are the characters "0123456789zqjxkmvwgpfZQJXKMVWGPF" chosen randomly for the base32 encoding, or was there a reason to choose exactly these?

Comment by josch — Sat Jan 31 17:13:57 2015

comment 4

The only reason for the letter choice is that it avoids making random words with possibly unintentional meanings.

Comment by joey — Wed Feb 4 17:14:24 2015

why md5sum?

Why the extra processing to generate the hashing directories?

We already have a hash here: for example, SHA256E-s8242375--5f82490990812ad3feabb02355750710a9d94283ab256d1c691c3bf8d7d9fbe3.ogg has a long 5f82490990812ad3feabb02355750710a9d94283ab256d1c691c3bf8d7d9fbe3 hash. Why not use the first characters of that? This will not change for a given file, and has a higher chance of generating collisions (which is a good thing here, because we can reuse directories).

In other words, why aren't the hashes of SHA256E-s8242375--5f82490990812ad3feabb02355750710a9d94283ab256d1c691c3bf8d7d9fbe3.ogg simply 5f8/249? --anarcat

Comment by anarcat [id.koumbit.net] — Fri Feb 13 15:59:46 2015

re: why md5sum?

Not all types of keys contain hashes.

Comment by joey — Tue Feb 17 21:51:59 2015

Python implementation

I wrote a Python implementation of the two hashing functions for a project of mine. Here it is, hoping it can be helpful for someone.

Comment by giomasce — Sun Mar 22 22:38:54 2015
