The "backend" in git-annex specifies how a key is generated from a file's content and/or filesystem metadata. Most backends are different kinds of hashes. A single repository can use different backends for different files. The key includes the backend that is used for that key.
## configuring which backend to use
The `annex.backend` git-config setting can be used to configure the default backend to use when adding new files.
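For example, to make SHA512E the default backend from now on (the choice of backend here is just illustrative):

```
git config annex.backend SHA512E
```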
For finer control of what backend is used when adding different types of files, the `.gitattributes` file can be used. The `annex.backend` attribute can be set to the name of the backend to use for matching files.

For example, to use the SHA256E backend for sound files, which tend to be smallish and might be modified or copied over time, while using the WORM backend for everything else, you could set in `.gitattributes`:

```
* annex.backend=WORM
*.mp3 annex.backend=SHA256E
*.ogg annex.backend=SHA256E
```
## recommended backends to use
- `SHA256E` -- The default backend for new files. It combines a 256 bit SHA-2 hash of the file's content with the file's extension. This allows verifying that the file content is right, and can avoid duplicates of files with the same content. Its need to generate checksums can make it slower for large files.
- `SHA256` -- SHA-2 hash that does not include the file extension in the key, which can lead to better deduplication but can confuse some programs.
- `SHA512`, `SHA512E` -- Best SHA-2 hash, for the very paranoid.
- `SHA384`, `SHA384E`, `SHA224`, `SHA224E` -- SHA-2 hashes for people who like unusual sizes.
- `SHA3_512`, `SHA3_512E`, `SHA3_384`, `SHA3_384E`, `SHA3_256`, `SHA3_256E`, `SHA3_224`, `SHA3_224E` -- SHA-3 hashes, for bleeding edge fun.
- `SKEIN512`, `SKEIN512E`, `SKEIN256`, `SKEIN256E` -- Skein hash, a well-regarded finalist in the SHA-3 hash competition.
- `BLAKE2B160`, `BLAKE2B224`, `BLAKE2B256`, `BLAKE2B384`, `BLAKE2B512`, `BLAKE2B160E`, `BLAKE2B224E`, `BLAKE2B256E`, `BLAKE2B384E`, `BLAKE2B512E` -- Fast Blake2 hash variants optimised for 64 bit platforms.
- `BLAKE2S160`, `BLAKE2S224`, `BLAKE2S256`, `BLAKE2S160E`, `BLAKE2S224E`, `BLAKE2S256E` -- Fast Blake2 hash variants optimised for 32 bit platforms.
- `BLAKE2BP512`, `BLAKE2BP512E` -- Fast Blake2 hash variants optimised for 4-way CPUs.
- `BLAKE2SP224`, `BLAKE2SP256`, `BLAKE2SP224E`, `BLAKE2SP256E` -- Fast Blake2 hash variants optimised for 8-way CPUs.
- `VURL` -- This is like a `URL` key (see below), but the content can be verified with a cryptographically secure checksum that is recorded in the git-annex branch. It's generated when using eg, `git-annex addurl --fast --verifiable`.
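For instance, a VURL key is generated by a command like this (the URL here is just a placeholder):

```
git annex addurl --fast --verifiable https://example.com/file.iso
```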
## non-cryptographically secure backends
The backends below do not guarantee cryptographically that the content of an annexed file remains unchanged.
- `SHA1`, `SHA1E`, `MD5`, `MD5E` -- Smaller hashes than `SHA256` for those who want a checksum but are not concerned about security.
- `WORM` ("Write Once, Read Many") -- This assumes that any file with the same filename, size, and modification time has the same content. This is the least expensive backend, recommended for really large files or slow systems.
- `URL` -- This is a key that is generated from the url to a file. It's generated when using eg, `git annex addurl --fast`, when the file content is not available for hashing. The key may not contain the full URL; for long URLs, part of the URL may be represented by a checksum. The URL key may contain `&` characters; be sure to quote the key if passing it to a shell script. These types of keys are distinct from URLs/URIs that may be attached to a key (using any backend) indicating the key's location on the web or in one of the special remotes.
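For example, when handling such a key in a shell script (the key below is illustrative):

```
key='URL--http://example.com/download.cgi?id=42&type=iso'  # note the & characters in the key
git annex whereis --key="$key"                             # quoted, so the shell does not split it
```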
## external backends
While most backends are built into git-annex, it also supports external backends. These are programs with names like `git-annex-backend-XFOO`, which can be provided by others. See the external backend protocol for details about how to write them.
Here's a list of external backends. Edit this page to add yours to the list.
- `git-annex-backend-XFOO` is a demo program implementing the protocol with a shell script.
Like with git-annex's builtin backends, you can add "E" to the end of the name of an external backend, to get a version that includes the file extension in the key.
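For example, assuming the demo `git-annex-backend-XFOO` program above is installed somewhere in PATH, files can be added with it like this:

```
git annex add --backend=XFOO somefile    # key generated by the external backend
git annex add --backend=XFOOE otherfile  # same backend, but the key includes the file extension
```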
## internal use backends
Keys using these backends can sometimes be visible, but they are used by git-annex for its own purposes, and not for your annexed files.
- `GIT` -- This is used internally by git-annex when exporting trees containing files stored in git, rather than git-annex. It represents a git sha. This is never used for git-annex links, but information about keys of this type is stored in the git-annex branch.
- `GITBUNDLE` and `GITMANIFEST` -- Used by git-remote-annex to store a git repository in a special remote. See the git-remote-annex page for details about these.
## notes
If you want to be able to prove that you're working with the same file contents that were checked into a repository earlier, you should avoid using non-cryptographically-secure backends, and will need to use signed git commits. See using signed git commits for details.
Retrieval of WORM and URL from many special remotes is prohibited for security reasons.
Note that the various 512 and 384 length hashes result in long paths, which are known to not work on Windows. If interoperability on Windows is a concern, avoid those.
It turns out that (at least on x86-64 machines) `SHA512` is faster than `SHA256`. In some benchmarks I performed[1], `SHA256` was 1.8–2.2x slower than `SHA1`, while `SHA512` was only 1.5–1.6x slower. `SHA224` and `SHA384` are effectively just truncated versions of `SHA256` and `SHA512`, so their performance characteristics are identical.

[1] `time head -c 100000000 /dev/zero | shasum -a 512`
### The URL backend

In case you came here looking for the URL backend: several documents on the web refer to a special "URL backend", e.g. "Large file management with git-annex" [LWN.net]. Historical content will never be updated, yet it drives people to living places.

### Why a URL backend?

It is interesting because you can have git-annex rest on the fact that some documents are available as extra copies, available at any time (but from something that is not a git repository).

### How/Where now?

git-annex used to have a URL backend. It seems that the design changed into a "special remote" feature, not limited to the web. You can now track files available through plain directories, rsync, webdav, some cloud storage, etc., even clay tablets. For details see special remotes.

It's a bit confusing to read that SHA256 does not include the file extension, from which I can deduce that SHA256E does include it. What else does it include? I used to "seed" my git-annex with locally available data by "git annex add"-ing it in a temporary folder without doing a commit, and then initiating a copy from the slow remote annex repo. My theory was that the remote copy sees the pre-seeded files and does not need to copy them again.
But does this theory hold true for different file names, extensions, modification date, full path? Maybe you could also link to the code that implements the different backends so that curious readers can check for themselves.
Thank you!
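(For reference: besides the hash and the extension, an E-variant key also records the backend name and the file size. The size and hash below are placeholders:)

```
SHA256E-s1048576--<sha256 of the content, 64 hex digits>.mp3
```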
I'd really like to have a SHA256e backend -- same as SHA256E but making sure that extensions of the files in .git/annex are converted to lower case. I normally try to convert filenames from cameras etc to lower case, but not all people that I share annex with do so consistently. In my use case, I need to be able to find duplicates among files and .jpg vs .JPG throws git annex dedup off. Otherwise E backends are superior to non-E for me. Thanks, Michael.
Related to the question posed in http://git-annex.branchable.com/forum/switching_backends/ -- can git annex be told to use the existing backend for a given file?

The use case for this is that you have an existing repo that started out e.g. with SHA256, but new files are being added with SHA256E since that's the default now.

But I was doing:

And was expecting it to show no changes for existing files, but it did; it would be nice if that was not the case.

You can use `git annex add --backend=SHA256` to temporarily override the backend.

The SHA* backends generate too-complicated paths:
```
lrwxrwxrwx 1 root root 193 Apr 22  2009 test.ogg -> ../../../.git/annex/objects/fX/pz/SHA256-s71983--4a55ff578b4c592c06a1f4d9e0f8a6949ea9961d9717fc22e7b3c412620ac890/SHA256-s71983--4a55ff578b4c592c06a1f4d9e0f8a6949ea9961d9717fc22e7b3c412620ac890
```
I don't want the additional directory. What is it for?? It contains exactly one file and adds a couple of disk seeks to file lookup.
@Matthias, that directory structure is not controlled by the backend. It is explained in internals.
Probably in many cases an MD5 sum might be sufficient to cover the space of the available files. Or is use of the MD5 hash really not recommended even for non-security-critical cases?
I've added MD5 and MD5E. Of course, if you choose to use these, or the WORM backend, you give up the cryptographic verification that the content currently in your repository is the same content that was in it before. Whether that matters in your application is up to you.
It's not explicit, but `git annex info $FILE` tells you the key, which has the backend as its first component.

I don't think there are any situations where the first component of the key isn't the backend, but don't hold me to that, please.

Or I could not be an idiot and tell you the command specifically for looking up the key for a file: `lookupkey`.

So to get the backend (if the first component is always the backend):
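A minimal sketch, assuming the first component is indeed always the backend:

```
git annex lookupkey "$FILE" | cut -d- -f1
```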
Hi,

I'd like to be able to verify the consistency of the files on an rsync remote without having access to the git repository or the gpg key. This can easily be done with unencrypted files by running "sha256sum filename". Is there a way to do the same thing with encrypted files?
Thank you very much!
@junk, this page is not really the place to ask such an unrelated question. Please use the forum for such questions.
(Anyway, git-annex uses gpg to encrypt data, so you can perhaps use gpg to check the embedded checksum, but I have never done it, and git-annex certainly doesn't support doing it.)
Hello,
TL;DR: I second Michael's wish for hashing backends that align extensions to lowercase.

### Context: files with the same content whose extensions differ in case

I realized a moment ago that git-annex basically automatically deduplicates with file granularity, which is very nice... unless duplicates have varying case, which does happen. For some cameras, if you download files through a cable you get one file name with one case; if you read the card directly with a card reader you get another case (and another filename, by the way).

I invite anyone interested to drop a line here.
### Workaround
I understand I can align case after-the-fact with a bash shell command like below. Beware: the man page of `rename` says there exist other versions that don't check for the destination file, so the line below might, in some specific cases (two files with the same name but different content, where the file names only differ in the case of the extension), cause you to lose some information. Or perhaps other cases. Make sure you know what you are doing; I'm not responsible.
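A minimal sketch of such a command, assuming bash (whose `${f,,}` expansion lowercases) and camera files named `*.JPG`:

```
for f in *.JPG; do
    [ -e "${f,,}" ] || git mv "$f" "${f,,}"   # skip if the lowercase name already exists
done
```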
If you prefer to align to upper-case, replace `,,` with `^^`. This is bash syntax.

### Please consider a SHA256e backend (and others)

Anyway, the shell command above is a workaround. A case-insensitive hashing backend seems a natural thing to do. It would bring the best of both worlds: deduplicate efficiently while not confusing programs that depend on the symlink target having a particular extension.
@Ilya, indeed, this page is talking about filesystem metadata. I've updated it for clarity.
There is not currently a way to switch backend based on file size, although you can use annex.largefiles to make it check eg smaller files directly into git rather than annexing them.
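For example, a sketch (the size cutoff is arbitrary):

```
git config annex.largefiles 'largerthan=100kb'
```

With that set, `git annex add` checks files under 100kb directly into git instead of annexing them.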
It seems that *E backends ignore file extensions longer than four chars: https://git-annex.branchable.com/bugs/file_extensions_of__62__4_chars_ignored_by__42__E_backends/ Is there some reason for doing it this way?

@Ilya_Shlyakhter, it's a heuristic; what constitutes a file extension is not very well defined (consider ".tar.gz" and ".not-an-extension"). The heuristic has been refined over the years, but will never be perfect. But, I don't know of any 5+ character file extensions in common use!
It's only used to avoid uploading one chunk from one object that the key points to, and then later upload a chunk from a different object.
While WORM keys could in theory "collide" and the same key point to different content, that's no different than MD5 or SHA1 keys colliding; it's a smallish risk, easily quantified, and you take that risk by choosing to use those keys.
The risk that the content at an url might change varies over time or something like that, so I think it makes sense to treat URL keys as specially unstable.
"The risk that the content at an url might change varies over time or something like that, so I think it makes sense to treat URL keys as specially unstable." -- but, if I understand correctly, a URL key does not actually represent a URL? Rather, a URL can be attached to any key, and if the contents of some URLs claimed by a remote is unstable, such remotes should be marked as untrusted; while if the contents of a URL key is stored in a trusted remote, that contents is not unstable. But URL and WORM keys are both "unstable" in that their contents can't be verified.
[[todo/alternate_keys_for_same_content]] could mitigate that.
The scenario that isStableKey is being used to guard against is two repos downloading the content of an url and each getting different content, followed by one repo uploading some chunks of its content and then the other repo "finishing" the upload with chunks of its different content. That would result in a mishmash of chunks being stored in the remote.
It's true that it could also happen using WORM with an url attached to it. (Not with other types of keys that verify a checksum.) Though it seems much less likely, since the file size is at least checked for WORM, while with URL keys there's often no recorded file size. And, WORMs don't typically have urls attached (I can't think of a single time I've ever done that, it just feels like asking for trouble), while URL keys always do.
If this is a serious concern, I'd suggest you open a todo or bug report about it, there are far too many comments to wade through here already. We could think about, perhaps not allowing download of WORM keys from urls or something like that..
I'd like to be able to have a "thin" repo on a FAT32 filesystem. Since this precludes hardlinks, is there a way to make a backend that just keeps track of the file's hash so we can detect when it changes? This would obviously need to rely on having copies in other repos for backup purposes. I'm thinking a mode that behaves more like Unison, which just used its fingerprint file to detect changes that need to be synced.
There would still be a file in the backend named by SHA256, but instead of storing the content it would store the location of possible local copies of the file. This would obviously need to use a smudge filter. It could be the default backend for thin repos on filesystems that don't support hardlinks.
@annex2384 "Backend which doesn't store files at all?" -- are you sure you're thinking of backends and not of special remotes? Backends don't "store files", special remotes do. Backends create keys identifying specific contents.
Not sure I fully understand your use case, but you could write an external special remote that, for a given git-annex key, stores "the location of possible local copies of the file", e.g. using `SETSTATE` or `SETURIPRESENT`.
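A sketch of the messages such a remote might send back to git-annex over the external special remote protocol (the key and path here are placeholders):

```
SETURIPRESENT SHA256E-s1048576--<hash>.jpg file:///mnt/offline-disk/photos/img_0001.jpg
SETSTATE SHA256E-s1048576--<hash>.jpg /mnt/offline-disk/photos/img_0001.jpg
```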
What do you think about a new backend? There are some hash functions out there that aren't necessarily cryptographically secure, but are more performant and certainly better than metadata checks at deduplication.
Modern filesystems like BTRFS implement checksumming at the filesystem level for both data and metadata to detect (but not repair, not without some sort of parity data or mirroring) bitrot. Based on BTRFS's nice checksumming overview, maybe CRC32C or xxHash would be good options? Their benchmarks suggest they're 60× and 40× faster than SHA256, respectively. According to the benchmark source, xxHash offers improved collision resistance over CRC32C, but the latter is more accessible on legacy systems (not much of a concern for git-annex, I think, given this would be a new, opt-in feature anyway). Since those benchmarks, the latest version, XXH3, has received a stable release, and is apparently able to keep pace with RAM sequential read.
I was even thinking there could be an option to tap into filesystem checksumming, but BTRFS does this at the block level or after compression¹, so that's off the table.
Please let me know if this is the right place for this, or if I should've opened a forum post.
See the "External backends" section on this page for info on adding your own backends.