The "backend" in git-annex specifies how a key is generated from a file's content and/or filesystem metadata. Most backends are different kinds of hashes. A single repository can use different backends for different files. The key includes the backend that is used for that key.
## configuring which backend to use
The `annex.backend` git-config setting can be used to configure the default backend to use when adding new files.
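For example, to make SHA512E the default backend from now on (the choice of backend here is just illustrative):

```
git config annex.backend SHA512E
```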
For finer control of what backend is used when adding different types of files, the `.gitattributes` file can be used. The `annex.backend` attribute can be set to the name of the backend to use for matching files.

For example, to use the SHA256E backend for sound files, which tend to be smallish and might be modified or copied over time, while using the WORM backend for everything else, you could set in `.gitattributes`:

```
* annex.backend=WORM
*.mp3 annex.backend=SHA256E
*.ogg annex.backend=SHA256E
```
## recommended backends to use
- `SHA256E` -- The default backend for new files. It combines a 256 bit SHA-2 hash of the file's content with the file's extension. This allows verifying that the file content is right, and can avoid duplicates of files with the same content. Its need to generate checksums can make it slower for large files.
- `SHA256` -- SHA-2 hash that does not include the file extension in the key, which can lead to better deduplication but can confuse some programs.
- `SHA512`, `SHA512E` -- Best SHA-2 hash, for the very paranoid.
- `SHA384`, `SHA384E`, `SHA224`, `SHA224E` -- SHA-2 hashes for people who like unusual sizes.
- `SHA3_512`, `SHA3_512E`, `SHA3_384`, `SHA3_384E`, `SHA3_256`, `SHA3_256E`, `SHA3_224`, `SHA3_224E` -- SHA-3 hashes, for bleeding edge fun.
- `SKEIN512`, `SKEIN512E`, `SKEIN256`, `SKEIN256E` -- Skein hash, a well-regarded finalist in the SHA-3 hash competition.
- `BLAKE2B160`, `BLAKE2B224`, `BLAKE2B256`, `BLAKE2B384`, `BLAKE2B512`, `BLAKE2B160E`, `BLAKE2B224E`, `BLAKE2B256E`, `BLAKE2B384E`, `BLAKE2B512E` -- Fast Blake2 hash variants optimised for 64 bit platforms.
- `BLAKE2S160`, `BLAKE2S224`, `BLAKE2S256`, `BLAKE2S160E`, `BLAKE2S224E`, `BLAKE2S256E` -- Fast Blake2 hash variants optimised for 32 bit platforms.
- `BLAKE2BP512`, `BLAKE2BP512E` -- Fast Blake2 hash variants optimised for 4-way CPUs.
- `BLAKE2SP224`, `BLAKE2SP256`, `BLAKE2SP224E`, `BLAKE2SP256E` -- Fast Blake2 hash variants optimised for 8-way CPUs.
- `VURL` -- This is like a `URL` key (see below), but the content can be verified with a cryptographically secure checksum that is recorded in the git-annex branch. It's generated when using eg, `git-annex addurl --fast --verifiable`.
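For instance, a VURL key is generated by a command like this (the URL here is just a placeholder):

```
git annex addurl --fast --verifiable https://example.com/file.iso
```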
## non-cryptographically secure backends
The backends below do not guarantee cryptographically that the content of an annexed file remains unchanged.
- `SHA1`, `SHA1E`, `MD5`, `MD5E` -- Smaller hashes than `SHA256` for those who want a checksum but are not concerned about security.
- `WORM` ("Write Once, Read Many") -- This assumes that any file with the same filename, size, and modification time has the same content. This is the least expensive backend, recommended for really large files or slow systems.
- `URL` -- This is a key that is generated from the url to a file. It's generated when using eg, `git annex addurl --fast`, when the file content is not available for hashing. The key may not contain the full URL; for long URLs, part of the URL may be represented by a checksum. The URL key may contain `&` characters; be sure to quote the key if passing it to a shell script. These types of keys are distinct from URLs/URIs that may be attached to a key (using any backend) indicating the key's location on the web or in one of the special remotes.
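For example, when handling such a key in a shell script (the key below is illustrative):

```
key='URL--http://example.com/download.cgi?id=42&type=iso'  # note the & characters in the key
git annex whereis --key="$key"                             # quoted, so the shell does not split it
```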
## external backends
While most backends are built into git-annex, it also supports external backends. These are programs with names like `git-annex-backend-XFOO`, which can be provided by others. See the external backend protocol for details about how to write them.
Here's a list of external backends. Edit this page to add yours to the list.
- `git-annex-backend-XFOO` is a demo program implementing the protocol with a shell script.
Like with git-annex's builtin backends, you can add "E" to the end of the name of an external backend, to get a version that includes the file extension in the key.
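For example, assuming the demo `git-annex-backend-XFOO` program above is installed somewhere in PATH, files can be added with it like this:

```
git annex add --backend=XFOO somefile    # key generated by the external backend
git annex add --backend=XFOOE otherfile  # same backend, but the key includes the file extension
```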
## internal use backends
Keys using these backends can sometimes be visible, but they are used by git-annex for its own purposes, and not for your annexed files.
- `GIT` -- This is used internally by git-annex when exporting trees containing files stored in git, rather than git-annex. It represents a git sha. This is never used for git-annex links, but information about keys of this type is stored in the git-annex branch.
- `GITBUNDLE` and `GITMANIFEST` -- Used by git-remote-annex to store a git repository in a special remote. See the git-remote-annex page for details about these.
## notes
If you want to be able to prove that you're working with the same file contents that were checked into a repository earlier, you should avoid using non-cryptographically-secure backends, and will need to use signed git commits. See using signed git commits for details.
Retrieval of WORM and URL from many special remotes is prohibited for security reasons.
Note that the various 512 and 384 length hashes result in long paths, which are known to not work on Windows. If interoperability on Windows is a concern, avoid those.
It turns out that (at least on x86-64 machines) `SHA512` is faster than `SHA256`. In some benchmarks I performed[1], `SHA256` was 1.8–2.2x slower than `SHA1`, while `SHA512` was only 1.5–1.6x slower. `SHA224` and `SHA384` are effectively just truncated versions of `SHA256` and `SHA512`, so their performance characteristics are identical.

[1] `time head -c 100000000 /dev/zero | shasum -a 512`
### The URL backend

In case you came here looking for the URL backend: several documents on the web refer to a special "URL backend", e.g. "Large file management with git-annex" [LWN.net]. Historical content will never be updated, yet it drives people to living places.

### Why a URL backend?

It is interesting because you can have git-annex rest on the fact that some documents are available as extra copies, available at any time (but from something that is not a git repository).

### How/Where now?

git-annex used to have a URL backend. It seems that the design changed into a "special remote" feature, not limited to the web. You can now track files available through plain directories, rsync, webdav, some cloud storage, etc., even clay tablets. For details see special remotes.

It's a bit confusing to read that SHA256 does not include the file extension, from which I can deduce that SHA256E does include it. What else does it include? I used to "seed" my git-annex with locally available data by "git annex add"-ing it in a temporary folder without doing a commit, and then initiating a copy from the slow remote annex repo. My theory was that the remote copy sees the pre-seeded files and does not need to copy them again.
But does this theory hold true for different file names, extensions, modification date, full path? Maybe you could also link to the code that implements the different backends so that curious readers can check for themselves.
Thank you!
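(For reference: besides the hash and the extension, an E-variant key also records the backend name and the file size. The size and hash below are placeholders:)

```
SHA256E-s1048576--<sha256 of the content, 64 hex digits>.mp3
```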
I'd really like to have a SHA256e backend -- same as SHA256E but making sure that extensions of the files in .git/annex are converted to lower case. I normally try to convert filenames from cameras etc to lower case, but not all people that I share annex with do so consistently. In my use case, I need to be able to find duplicates among files and .jpg vs .JPG throws git annex dedup off. Otherwise E backends are superior to non-E for me. Thanks, Michael.
Related to the question posed in http://git-annex.branchable.com/forum/switching_backends/ -- can git annex be told to use the existing backend for a given file?

The use case for this is that you have an existing repo that started out e.g. with SHA256, but new files are being added with SHA256E since that's the default now.

But I was doing:

And was expecting it to show no changes for existing files, but it did; it would be nice if that was not the case.

You can use `git annex add --backend=SHA256` to temporarily override the backend.

The SHA* backends generate too-complicated paths:
```
lrwxrwxrwx 1 root root 193 Apr 22  2009 test.ogg -> ../../../.git/annex/objects/fX/pz/SHA256-s71983--4a55ff578b4c592c06a1f4d9e0f8a6949ea9961d9717fc22e7b3c412620ac890/SHA256-s71983--4a55ff578b4c592c06a1f4d9e0f8a6949ea9961d9717fc22e7b3c412620ac890
```
I don't want the additional directory. What is it for?? It contains exactly one file and adds a couple of disk seeks to file lookup.
@Matthias, that directory structure is not controlled by the backend. It is explained in internals.
Probably in many cases an MD5 sum might be sufficient to cover the space of the available files. Or is use of the MD5 hash really not recommended even for non-security-critical cases?
I've added MD5 and MD5E. Of course, if you choose to use these, or the WORM backend, you give up the cryptographic verification that the content currently in your repository is the same content that was in it before. Whether that matters in your application is up to you.
It's not explicit, but `git annex info $FILE` tells you the key, which has the backend as its first component.

I don't think there are any situations where the first component of the key isn't the backend, but don't hold me to that, please.

Or I could not be an idiot and tell you the command specifically for looking up the key for a file: `lookupkey`.

So to get the backend (if the first component is always the backend):
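A minimal sketch, assuming the first component is indeed always the backend:

```
git annex lookupkey "$FILE" | cut -d- -f1
```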
Hi,

I'd like to be able to verify the consistency of the files on an rsync remote without having access to the git repository or the gpg key. This can easily be done with unencrypted files by running "sha256sum filename". Is there a way to do the same thing with encrypted files?
Thank you very much!
@junk, this page is not really the place to ask such an unrelated question. Please use the forum for such questions.
(Anyway, git-annex uses gpg to encrypt data, so you can perhaps use gpg to check the embedded checksum, but I have never done it, and git-annex certainly doesn't support doing it.)
Hello,
TL;DR: I second Michael's wish for hashing backends that align extensions to lowercase.

### Context: files with the same content whose extensions differ in case

I realized a moment ago that git-annex basically automatically deduplicates with file granularity, which is very nice... unless duplicates have varying case, which does happen. For some cameras, if you download files through a cable you get one file name with one case; if you read the card directly with a card reader you get another case (and another filename, by the way).

I invite anyone interested to drop a line here.
### Workaround
I understand I can align case after-the-fact with a bash shell command like below. Beware: the man page of `rename` says there exist other versions that don't check for the destination file, so the line below might, in some specific cases (two files with the same name but different content, where the file names only differ in the case of the extension), cause you to lose some information. Or perhaps other cases. Make sure you know what you are doing; I'm not responsible.
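A minimal sketch of such a command, assuming bash (whose `${f,,}` expansion lowercases) and camera files named `*.JPG`:

```
for f in *.JPG; do
    [ -e "${f,,}" ] || git mv "$f" "${f,,}"   # skip if the lowercase name already exists
done
```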
If you prefer to align to upper-case, replace `,,` with `^^`. This is bash syntax.

### Please consider a SHA256e backend (and others)

Anyway, the shell command above is a workaround. A case-insensitive hashing backend seems a natural thing to do. It would bring the best of both worlds: deduplicate efficiently while not confusing programs that depend on the symlink target having a particular extension.
@Ilya, indeed, this page is talking about filesystem metadata. I've updated it for clarity.
There is not currently a way to switch backend based on file size, although you can use annex.largefiles to make it check eg smaller files directly into git rather than annexing them.
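For example, a sketch (the size cutoff is arbitrary):

```
git config annex.largefiles 'largerthan=100kb'
```

With that set, `git annex add` checks files under 100kb directly into git instead of annexing them.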
It seems that *E backends ignore file extensions longer than four chars: https://git-annex.branchable.com/bugs/file_extensions_of__62__4_chars_ignored_by__42__E_backends/ Is there some reason for doing it this way?

@Ilya_Shlyakhter, it's a heuristic; what constitutes a file extension is not very well defined (consider ".tar.gz" and ".not-an-extension"). The heuristic has been refined over the years, but will never be perfect. But, I don't know of any 5+ character file extensions in common use!
It's only used to avoid uploading one chunk from one object that the key points to, and then later upload a chunk from a different object.
While WORM keys could in theory "collide" and the same key point to different content, that's no different than MD5 or SHA1 keys colliding; it's a smallish risk, easily quantified, and you take that risk by choosing to use those keys.
The risk that the content at an url might change varies over time or something like that, so I think it makes sense to treat URL keys as specially unstable.
"The risk that the content at an url might change varies over time or something like that, so I think it makes sense to treat URL keys as specially unstable." -- but, if I understand correctly, a URL key does not actually represent a URL? Rather, a URL can be attached to any key, and if the contents of some URLs claimed by a remote is unstable, such remotes should be marked as untrusted; while if the contents of a URL key is stored in a trusted remote, that contents is not unstable. But URL and WORM keys are both "unstable" in that their contents can't be verified.
[[todo/alternate_keys_for_same_content]] could mitigate that.
The scenario that isStableKey is being used to guard against is two repos downloading the content of an url and each getting different content, followed by one repo uploading some chunks of its content and then the other repo "finishing" the upload with chunks of its different content. That would result in a mishmash of chunks being stored in the remote.
It's true that it could also happen using WORM with an url attached to it. (Not with other types of keys that verify a checksum.) Though it seems much less likely, since the file size is at least checked for WORM, while with URL keys there's often no recorded file size. And, WORMs don't typically have urls attached (I can't think of a single time I've ever done that, it just feels like asking for trouble), while URL keys always do.
If this is a serious concern, I'd suggest you open a todo or bug report about it, there are far too many comments to wade through here already. We could think about, perhaps not allowing download of WORM keys from urls or something like that..
I'd like to be able to have a "thin" repo on a FAT32 filesystem. Since this precludes hardlinks, is there a way to make a backend that just keeps track of the file's hash so we can detect when it changes? This would obviously need to rely on having copies in other repos for backup purposes. I'm thinking a mode that behaves more like Unison, which just used its fingerprint file to detect changes that need to be synced.
There would still be a file in the backend named by SHA256, but instead of storing the content it would store the location of possible local copies of the file. This would obviously need to use a smudge filter. It could be the default backend for thin repos on filesystems that don't support hardlinks.
@annex2384 "Backend which doesn't store files at all?" -- are you sure you're thinking of backends and not of special remotes? Backends don't "store files", special remotes do. Backends create keys identifying specific contents.
Not sure I fully understand your use case, but you could write an external special remote that, for a given git-annex key, stores "the location of possible local copies of the file", e.g. using `SETSTATE` or `SETURIPRESENT`.
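A sketch of the messages such a remote might send back to git-annex over the external special remote protocol (the key and path here are placeholders):

```
SETURIPRESENT SHA256E-s1048576--<hash>.jpg file:///mnt/offline-disk/photos/img_0001.jpg
SETSTATE SHA256E-s1048576--<hash>.jpg /mnt/offline-disk/photos/img_0001.jpg
```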
What do you think about a new backend? There are some hash functions out there that aren't necessarily cryptographically secure, but are more performant and certainly better than metadata checks at deduplication.
Modern filesystems like BTRFS implement checksumming at the filesystem level for both data and metadata to detect (but not repair, not without some sort of parity data or mirroring) bitrot. Based on BTRFS's nice checksumming overview, maybe CRC32C or xxHash would be good options? Their benchmarks suggest they're 60× and 40× faster than SHA256, respectively. According to the benchmark source, xxHash offers improved collision resistance over CRC32C, but the latter is more accessible on legacy systems (not much of a concern for git-annex, I think, given this would be a new, opt-in feature anyway). Since those benchmarks, the latest version, XXH3, has received a stable release, and is apparently able to keep pace with RAM sequential read.
I was even thinking there could be an option to tap into filesystem checksumming, but BTRFS does this at the block level or after compression¹, so that's off the table.
Please let me know if this is the right place for this, or if I should've opened a forum post.
See the "External backends" section on this page for info on adding your own backends.