When a file is annexed, a key is generated from its content and/or metadata. The file checked into git symlinks to the key. This key can later be used to retrieve the file's content (its value).

Multiple pluggable key-value backends are supported, and a single repository can use different ones for different files.

  • SHA256E -- The default backend for new files, combines a 256-bit SHA-2 hash of the file's content with the file's extension. This allows verifying that the file content is right, and can avoid duplicates of files with the same content. Its need to generate checksums can make it slower for large files.
  • SHA256 -- Does not include the file extension in the key, which can lead to better deduplication but can confuse some programs.
  • WORM ("Write Once, Read Many") -- This assumes that any file with the same filename, size, and modification time has the same content. This is the least expensive backend, recommended for really large files or slow systems.
  • SHA512, SHA512E -- The largest SHA-2 hash, for the very paranoid.
  • SHA1, SHA1E, MD5, MD5E -- Smaller hashes than SHA256 for those who want a checksum but are not concerned about security.
  • SHA384, SHA384E, SHA224, SHA224E -- Hashes for people who like unusual sizes.
  • SKEIN512, SKEIN512E, SKEIN256, SKEIN256E -- Skein hash, a well-regarded SHA3 hash competition finalist.

Note that the SHA512, SKEIN512 and SHA384 backends generate long paths, which are known not to work on Windows. If interoperability with Windows is a concern, avoid those backends.
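
To make the shape of these keys concrete, here is a minimal shell sketch that builds a key in the style of the SHA256 and SHA256E backends. It is not git-annex's own code: annex_key_sketch is a hypothetical helper, it assumes GNU sha256sum is installed, and it takes only the text after the last dot as the extension, which is simpler than what git-annex really does.

```shell
# Sketch only: annex_key_sketch is a hypothetical helper, not a git-annex command.
# Usage: annex_key_sketch FILE [yes|no]   (yes = SHA256E style, no = SHA256 style)
annex_key_sketch() {
    file=$1
    with_ext=${2:-yes}
    size=$(wc -c < "$file" | tr -d ' ')          # content size in bytes
    hash=$(sha256sum "$file" | cut -d' ' -f1)    # hex SHA-256 of the content
    name=$(basename "$file")
    case "$with_ext:$name" in
        # Simplification: treat everything after the last dot as the extension.
        yes:*.*) printf 'SHA256E-s%s--%s.%s\n' "$size" "$hash" "${name##*.}" ;;
        yes:*)   printf 'SHA256E-s%s--%s\n' "$size" "$hash" ;;
        *)       printf 'SHA256-s%s--%s\n' "$size" "$hash" ;;
    esac
}
```

In this sketch, two files with the same content and the same extension get the same key whatever their names, paths, or modification times are, which is what makes deduplication possible. A SHA-512 digest is 128 hex characters against 64 for SHA-256, and the symlink target quoted in the comments below contains the key twice, which is where the long paths come from.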

The annex.backends git-config setting can be used to list the backends git-annex should use. The first one listed will be used by default when new files are added.
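
As a rough illustration, the setting could be written and read back like this. The scratch repository and the cfg_repo variable are only there to keep the example self-contained, and the space-separated value is an assumption about how the list is written.

```shell
# Scratch repository so the example does not touch a real one (cfg_repo is illustrative).
cfg_repo=$(mktemp -d)
git init -q "$cfg_repo"

# Assumption: backends are listed space-separated; the first one is the default for new files.
git -C "$cfg_repo" config annex.backends "SHA256E WORM"

# Read the setting back.
git -C "$cfg_repo" config --get annex.backends
```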

For finer control of what backend is used when adding different types of files, the .gitattributes file can be used. The annex.backend attribute can be set to the name of the backend to use for matching files.

For example, to use the SHA256E backend for sound files, which tend to be smallish and might be modified or copied over time, while using the WORM backend for everything else, you could set in .gitattributes:

* annex.backend=WORM
*.mp3 annex.backend=SHA256E
*.ogg annex.backend=SHA256E
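
To check which backend attribute a given path would pick up, plain git can be asked directly with git check-attr. The sketch below recreates the example above in a scratch repository; attr_repo and the file names are only illustrative, and the paths do not need to exist.

```shell
# Scratch repository holding the example .gitattributes (attr_repo is illustrative).
attr_repo=$(mktemp -d)
git init -q "$attr_repo"
cat > "$attr_repo/.gitattributes" <<'EOF'
* annex.backend=WORM
*.mp3 annex.backend=SHA256E
*.ogg annex.backend=SHA256E
EOF

# Later lines in .gitattributes override earlier ones for paths they match.
git -C "$attr_repo" check-attr annex.backend -- song.mp3 notes.txt
# song.mp3: annex.backend: SHA256E
# notes.txt: annex.backend: WORM
```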

It turns out that (at least on x86-64 machines) SHA512 is faster than SHA256. In some benchmarks I performed [1], SHA256 was 1.8–2.2x slower than SHA1, while SHA512 was only 1.5–1.6x slower.

SHA224 and SHA384 are effectively just truncated versions of SHA256 and SHA512, so their performance characteristics are identical.

[1] time head -c 100000000 /dev/zero | shasum -a 512

Comment by NanoTech Fri Aug 10 04:37:32 2012

In case you came here looking for the URL backend.

The URL backend

Several documents on the web refer to a special "URL backend", e.g. Large file management with git-annex [LWN.net]. Such historical content will never be updated, yet it still sends people to pages that keep changing.

Why a URL backend?

It is interesting because you can:

  • let git-annex rely on the fact that extra copies of some documents are available at any time (but from something that is not a git repository).
  • track these documents like your own with all git features, which opens up some truly marvelous combinations, which this margin is too narrow to contain (Pierre d.F. wouldn't disapprove ;-).

How/Where now?

git-annex used to have a URL backend. It seems that the design changed into a "special remote" feature, not limited to the web. You can now track files available through plain directories, rsync, webdav, some cloud storage, etc, even clay tablets. For details see special remotes.

Comment by Stéphane Thu Jan 3 10:59:35 2013

It's a bit confusing to read that SHA256 does not include the file extension, from which I can deduce that SHA256E does include it. What else does it include? I used to "seed" my git-annex with locally available data by "git-annex add"-ing it in a temporary folder without doing a commit, and then initiating a copy from the slow remote annex repo. My theory was that the copy from the remote sees the pre-seeded files and does not need to copy them again.

But does this theory hold true for different file names, extensions, modification dates, and full paths? Maybe you could also link to the code that implements the different backends so that curious readers can check for themselves.

Thank you!

Comment by Thomas Wed Jul 31 11:55:09 2013

I'd really like to have a SHA256e backend -- same as SHA256E but making sure that extensions of the files in .git/annex are converted to lower case. I normally try to convert filenames from cameras etc to lower case, but not all people that I share annex with do so consistently. In my use case, I need to be able to find duplicates among files and .jpg vs .JPG throws git annex dedup off. Otherwise E backends are superior to non-E for me. Thanks, Michael.

Comment by Michael Wed Oct 30 02:00:45 2013
The page states "[non-E backends] can confuse some programs". I like the ideal simplicity and recoverability of pure checksum backends but "confusion" sounds a bit worrying. Any practical examples of these problems to help me choose?
Comment by Jarno Wed Oct 30 21:25:00 2013
Some examples of problems with the raw SHA backends include, IIRC, calibre, and many programs on OSX. These programs look at the extension of the filename the symlink points at.
Comment by joeyh.name Fri Nov 1 15:47:26 2013

Related to the question posed in http://git-annex.branchable.com/forum/switching_backends/ : can git annex be told to use the existing backend for a given file?

The use case for this is that you have an existing repo that started out e.g. with SHA256, but new files are being added with SHA256E since that's the default now.

But I was doing:

git annex edit .
rsync /some/old/copy/ .
git annex add .

I was expecting it to show no changes for existing files, but it did. It would be nice if that were not the case.

Comment by Ævar Arnfjörð Tue Aug 5 21:35:34 2014
Ævar, you can use git annex add --backend=SHA256 to temporarily override the backend.
Comment by joeyh.name Tue Aug 12 18:00:46 2014

The SHA* backends generate overly complicated paths:

lrwxrwxrwx 1 root root 193 Apr 22 2009 test.ogg -> ../../../.git/annex/objects/fX/pz/SHA256-s71983--4a55ff578b4c592c06a1f4d9e0f8a6949ea9961d9717fc22e7b3c412620ac890/SHA256-s71983--4a55ff578b4c592c06a1f4d9e0f8a6949ea9961d9717fc22e7b3c412620ac890

I don't want the additional directory. What is it for?? It contains exactly one file and adds a couple of disk seeks to file lookup.

Comment by Matthias Tue Jan 6 09:41:03 2015

@Matthias, that directory structure is not controlled by the backend. It is explained in internals.
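
For what it's worth, the backend-controlled part is just the key, i.e. the last component of the symlink target. A small sketch of pulling it apart with standard shell tools follows; describe_annex_link is a hypothetical helper, and the field layout is inferred only from the example path quoted above.

```shell
# Sketch only: describe_annex_link is a hypothetical helper, not a git-annex command.
describe_annex_link() {
    target=$(readlink "$1")      # e.g. ../../../.git/annex/objects/fX/pz/KEY/KEY
    key=${target##*/}            # last path component is the key
    backend=${key%%-*}           # text before the first "-" names the backend
    rest=${key#*-s}              # drop everything up to the "-s" size marker
    size=${rest%%-*}             # digits up to the next "-" are the size in bytes
    printf 'backend=%s size=%s key=%s\n' "$backend" "$size" "$key"
}
```

Run against the test.ogg symlink from the listing above, this would report the SHA256 backend and a size of 71983 bytes; the fX/pz directories and the per-key directory are not part of the key.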

Comment by joey Tue Jan 6 17:58:28 2015

In many cases MD5 would probably be sufficient to cover the space of available content, and it would offer:

  • an even smaller size than SHA1 (thus a smaller git-annex footprint)
  • immediate matching against the MD5SUMs that are often distributed alongside files
  • matching against ETags in S3 buckets (whenever the object wasn't a multipart upload)

Or is use of the MD5 hash really not recommended even for cases that are not cryptographically critical?
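
As a sketch of the ETag matching mentioned above: the comparison comes down to checking a local MD5 against the ETag string. etag_matches is a hypothetical helper, it assumes GNU md5sum, and it assumes the ETag arrives wrapped in double quotes and belongs to an object that was not uploaded in multiple parts.

```shell
# Sketch only: etag_matches is a hypothetical helper.
# Usage: etag_matches FILE ETAG   (exit status 0 when the file's MD5 equals the ETag)
etag_matches() {
    local_md5=$(md5sum "$1" | cut -d' ' -f1)                  # hex MD5 of the local file
    etag=$(printf '%s' "$2" | tr -d '"' | tr 'A-F' 'a-f')     # strip quotes, lowercase hex
    [ "$local_md5" = "$etag" ]
}
```

For example, `etag_matches photo.jpg '"<etag from the bucket listing>"' && echo same` would print "same" only when the two digests agree.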

Comment by Yaroslav Thu Jan 29 22:07:40 2015

I've added MD5 and MD5E. Of course, if you choose to use these, or the WORM backend, you give up the cryptographic verification that the content currently in your repository is the same content that was in it before. Whether that matters in your application is up to you.

Comment by joey Wed Feb 4 17:25:45 2015
for the MD5/MD5E (and now I have found "email replies to me" - I will become a power user of branchable ;) )
Comment by Yaroslav Mon Feb 9 14:04:27 2015