backendsgit-annexhttp://git-annex.branchable.com/backends/git-annexikiwiki2024-02-01T11:53:54ZSHA performancehttp://git-annex.branchable.com/backends/comment_1_375bb1fb5973e8fa67b763f2dd6e404b/NanoTech2013-11-27T22:47:37Z2012-08-10T04:37:32Z
<p>It turns out that (at least on x86-64 machines) <code>SHA512</code> <a href="https://community.emc.com/community/edn/rsashare/blog/2010/11/01/sha-2-algorithms-when-sha-512-is-more-secure-and-faster">is faster than</a> <code>SHA256</code>. In some benchmarks I performed<sup>1</sup> <code>SHA256</code> was 1.8–2.2x slower than <code>SHA1</code> while <code>SHA512</code> was only 1.5–1.6x slower.</p>
<p><code>SHA224</code> and <code>SHA384</code> are effectively just truncated versions of <code>SHA256</code> and <code>SHA512</code> so their performance characteristics are identical.</p>
<p><sup>1</sup> <code>time head -c 100000000 /dev/zero | shasum -a 512</code></p>
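<p>A minimal sketch of how the measurement in the footnote could be repeated for all three algorithms. It assumes the GNU coreutils tools <code>sha1sum</code>, <code>sha256sum</code> and <code>sha512sum</code> instead of <code>shasum</code>, and bash for the <code>time</code> keyword; the function names are made up for illustration.</p>

```shell
# Hash the first BYTES bytes of /dev/zero with TOOL and print only the digest.
hash_zeros() {  # usage: hash_zeros TOOL BYTES
  head -c "$2" /dev/zero | "$1" | cut -d' ' -f1
}

# Time each tool over the same input (100 MB by default, as in the footnote).
# The timings go to stderr; the digests are discarded.
bench_sha() {  # usage: bench_sha [BYTES]
  for tool in sha1sum sha256sum sha512sum; do
    echo "== $tool"
    time hash_zeros "$tool" "${1:-100000000}" >/dev/null
  done
}
```

<p>Calling <code>bench_sha</code> a few times and comparing the "real" figures should show whether the ordering reported above holds on a given machine; results will vary with CPU and tool version.</p>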
Tracking remote copies not even stored locally / URL backend turned into a "special remote".http://git-annex.branchable.com/backends/comment_2_1f2626eca9004b31a0b7fc1a0df8027b/Stéphane2013-11-27T22:47:37Z2013-01-03T10:59:35Z
<p>In case you came here looking for the URL backend.</p>
<h2>The URL backend</h2>
<p>Several documents on the web refer to a special "URL backend", e.g. <a href="http://lwn.net/Articles/419241/">Large file management with git-annex [LWN.net]</a>. Such historical content will never be updated, yet it still sends people to pages that have since changed.</p>
<h2>Why a URL backend ?</h2>
<p>It is interesting because you can:</p>
<ul>
<li>let <code>git-annex</code> rely on the fact that extra copies of some documents are available at any time (but from something that is not a git repository).</li>
<li>track these documents like your own with all git features, which opens up some truly marvelous combinations, which this margin is too narrow to contain (Pierre d.F. wouldn't disapprove ;-).</li>
</ul>
<h2>How/Where now ?</h2>
<p><code>git-annex</code> used to have a URL backend. It seems that the design changed into a "special remote" feature, not limited to the web. You can now track files available through plain directories, rsync, webdav, some cloud storage, etc, even clay tablets. For details see <a href="http://git-annex.branchable.com/special_remotes/">special remotes</a>.</p>
Please be more specific about what information goes into the keyhttp://git-annex.branchable.com/backends/comment_3_fdcbf8727fdefb9942a54689234b9698/Thomas2013-11-27T22:47:37Z2013-07-31T11:55:09Z
<p>It's a bit confusing to read that SHA256 does not include the file extension, from which I can deduce that SHA256E does include it. What else does it include? I used to "seed" my git-annex with locally available data by "git-annex add"-ing it in a temporary folder without doing a commit, and then initiating a copy from the slow remote annex repo. My theory was that the remote copy sees the pre-seeded files and does not need to copy them again.</p>
<p>But does this theory hold true for different file names, extensions, modification dates, or full paths? Maybe you could also link to the code that implements the different backends so that curious readers can check for themselves.</p>
<p>Thank you!</p>
SHA256ehttp://git-annex.branchable.com/backends/comment_4_46591a3ba888fb686b1b319b80ca2c22/Michael2013-11-27T22:47:37Z2013-10-30T02:00:45Z
<p>I'd really like to have a SHA256e backend -- same as SHA256E but making sure that extensions of the files in .git/annex are converted to lower case. I normally try to convert filenames from cameras etc to lower case, but not all people that I share annex with do so consistently.
In my use case, I need to be able to find duplicates among files and .jpg vs .JPG throws git annex dedup off. Otherwise E backends are superior to non-E for me. Thanks, Michael.</p>
Non-E backend drawbacks?http://git-annex.branchable.com/backends/comment_5_2210c7ff2d5812fb3b778ac172291656/Jarno2013-11-27T22:47:37Z2013-10-30T21:25:00Z
The page states "[non-E backends] can confuse some programs". I like the ideal simplicity and recoverability of pure checksum backends but "confusion" sounds a bit worrying. Any practical examples of these problems to help me choose?
comment 6http://git-annex.branchable.com/backends/comment_6_82f239b58680a2681bd8074c7ef9584d/joeyh.name2013-11-27T22:47:37Z2013-11-01T15:47:26Z
Some examples of problems with the raw SHA backends include, IIRC, calibre, and many programs on OSX. These programs look at the extension of the filename the symlink points at.
Can annex use existing backends when amending existing files?http://git-annex.branchable.com/backends/comment_7_4aa8cfaec1090f79fed530720e4ddad4/Ævar Arnfjörð2014-08-05T21:35:34Z2014-08-05T21:35:34Z
<p>Related to the question posed in http://git-annex.branchable.com/forum/switching_backends/ can git annex be told to use the existing backend for a given file?</p>
<p>The use case for this is that you have an existing repo that started out e.g. with SHA256, but new files are being added with SHA256E since that's the default now.</p>
<p>But I was doing:</p>
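<pre><code># Unlock the annexed files so they can be overwritten
git annex edit .
# Copy the old files over them (-a recurses into the directory and preserves metadata)
rsync -a /some/old/copy/ .
# Re-add; files are keyed with the currently configured backend
git annex add .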
</code></pre>
<p>I was expecting it to show no changes for existing files, but it did. It would be nice if that were not the case.</p>
comment 8http://git-annex.branchable.com/backends/comment_8_c40d2c2c929ad3239ee5d529e307c746/joeyh.name2014-08-12T18:00:46Z2014-08-12T18:00:46Z
Ævar, you can use <code>git annex add --backend=SHA256</code> to temporarily override the backend.
Over-long pathnames?http://git-annex.branchable.com/backends/comment_9_5bef7f76f5e5b73a8e404fe6172b4368/Matthias2015-01-06T09:41:03Z2015-01-06T09:41:03Z
<p>the SHA* backends generate too-complicated paths:</p>
<pre><code>lrwxrwxrwx 1 root root 193 Apr 22 2009 test.ogg -> ../../../.git/annex/objects/fX/pz/SHA256-s71983--4a55ff578b4c592c06a1f4d9e0f8a6949ea9961d9717fc22e7b3c412620ac890/SHA256-s71983--4a55ff578b4c592c06a1f4d9e0f8a6949ea9961d9717fc22e7b3c412620ac890
</code></pre>
<p>I don't want the additional directory. What is it for?? It contains exactly one file and adds a couple of disk seeks to file lookup.</p>
comment 10http://git-annex.branchable.com/backends/comment_10_920cd139dfec8adb1089f5acf26de4d2/joey2015-01-06T22:00:56Z2015-01-06T17:58:28Z
<p>@Matthias, that directory structure is not controlled by the backend.
It is explained in <a href="http://git-annex.branchable.com/internals/">internals</a></p>
add MD5SUM (with E) backend?http://git-annex.branchable.com/backends/comment_11_f0f6316bbdc971a9ab157de9bbb9f74c/Yaroslav2015-01-29T22:07:40Z2015-01-29T22:07:40Z
<p>In many cases MD5SUM would probably be sufficient to cover the number of files involved, and</p>
<ul>
<li>its size would be even smaller than SHA1 (thus smaller git-annex footprint)</li>
<li>immediate matching to often distributed MD5SUMs</li>
<li>matching to ETags (whenever the object wasn't a multipart upload) in S3 buckets</li>
</ul>
<p>Or is use of an MD5SUM hash really not recommended even for non-encryption-critical cases?</p>
MD5http://git-annex.branchable.com/backends/comment_12_da76dff5fe712318d7d4313f1d827883/joey2015-02-04T19:01:25Z2015-02-04T17:25:45Z
<p>I've added MD5 and MD5E. Of course, if you choose to use these, or the WORM
backend, you give up the cryptographic verification that the content
currently in your repository is the same content that was in it before.
Whether that matters in your application is up to you.</p>
THANK YOU JOEYhttp://git-annex.branchable.com/backends/comment_13_578423935bc71cdbdc23c3db06d1e870/Yaroslav2015-02-09T14:04:27Z2015-02-09T14:04:27Z
for the MD5/MD5E (and now I have found "email replies to me" - I will become a power user of branchable <img src="http://git-annex.branchable.com/smileys/smile4.png" alt=";)" /> )
Backend of specified filehttp://git-annex.branchable.com/backends/comment_14_57154dcd1041a33f220f9105b709be89/xelez02015-07-05T13:19:43Z2015-07-05T13:19:43Z
How can I determine the backend of a specified file? I've looked through the man pages and can't find it.
comment 15http://git-annex.branchable.com/backends/comment_15_b3445fd1f379346c642a27211c6c798b/CandyAngel2015-07-05T14:54:18Z2015-07-05T14:54:18Z
<p>It's not explicit, but 'git annex info $FILE' tells you the key, which has the backend as its first component:</p>
<pre><code>## git annex info CG\ Cookie/Compositing\ in\ Blender/01_CompositingInBlender_SourceFiles.zip
file: CG Cookie/Compositing in Blender/01_CompositingInBlender_SourceFiles.zip
size: 744.51 megabytes
key: SHA256E-s744506832--08d2daced60b5eb6509044d5eefca82e7a6899350f49adc0083014229739515e.zip
</code></pre>
<p>I don't think there are any situations where the first component of the key isn't the backend, but don't hold me to that, please <img src="http://git-annex.branchable.com/smileys/smile.png" alt=":)" /></p>
comment 16http://git-annex.branchable.com/backends/comment_16_c68dfaeee2ef18f420f7e11ff5f604b9/CandyAngel2015-07-05T15:00:46Z2015-07-05T15:00:46Z
<p>Or I could not be an idiot and tell you the command specifically looking up a key for a file: lookupkey</p>
<pre><code>## git annex lookupkey CG\ Cookie/Compositing\ in\ Blender/01_CompositingInBlender_SourceFiles.zip
SHA256E-s744506832--08d2daced60b5eb6509044d5eefca82e7a6899350f49adc0083014229739515e.zip
</code></pre>
<p>So to get the backend (if the first component is always the backend):</p>
<pre><code>## git annex lookupkey CG\ Cookie/Compositing\ in\ Blender/01_CompositingInBlender_SourceFiles.zip | cut -d- -f1
SHA256E
</code></pre>
comment 17http://git-annex.branchable.com/backends/comment_17_557a622b3304eb86fed52896e0b6cbda/joey2015-07-06T16:26:48Z2015-07-06T16:03:53Z
See <a href="http://git-annex.branchable.com/internals/key_format/">key format</a>.
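<p>A small sketch of pulling a key apart with plain POSIX shell, based only on the keys shown in this thread (<code>BACKEND-sSIZE--NAME</code>). The function names are hypothetical, and it assumes that the backend name contains no <code>-</code> and that the first <code>--</code> separates the fields from the name.</p>

```shell
# Backend: everything before the first "-".
key_backend() { printf '%s\n' "${1%%-*}"; }

# Name: everything after the first "--" (for hash backends, the digest plus any extension).
key_name() { printf '%s\n' "${1#*--}"; }

# Size: the optional "s<digits>" field between the backend and "--".
# Runs in a subshell so the IFS and globbing changes do not leak to the caller.
key_size() (
  IFS=-
  set -f
  for field in ${1%%--*}; do
    case $field in
      s[0-9]*) printf '%s\n' "${field#s}"; exit 0 ;;
    esac
  done
  exit 1   # no size field recorded in this key
)
```

<p>For example, <code>key_backend "$(git annex lookupkey FILE)"</code> gives the same answer as the <code>cut</code> pipeline above, and <code>key_size</code> returns a non-zero status for keys that record no size.</p>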
Howto verify encrypted fileshttp://git-annex.branchable.com/backends/comment_18_784b29b086503a2b4913558350526ee1/junk2016-08-14T17:07:03Z2016-08-14T17:07:03Z
<p>Hi,</p>
<p>I'd like to be able to verify the consistency of the files on an rsync remote without having access to the git repository or the gpg key. This can easily be done with unencrypted files by running "sha256sum filename". Is there a way to do the same thing with encrypted files?</p>
<p>Thank you very much!</p>
comment 19http://git-annex.branchable.com/backends/comment_19_cfcaeea6fdded241f5d36ef251ef8010/joey2016-09-05T19:49:29Z2016-09-05T19:47:48Z
<p>@junk, this page is not really the place to ask such an unrelated question.
Please use the <a href="http://git-annex.branchable.com/forum/">forum</a> for such questions.</p>
<p>(Anyway, git-annex uses gpg to encrypt data, so you can perhaps use gpg to
check the embedded checksum, but I have never done it, and git-annex
certainly doesn't support doing it.)</p>
Lower-case extension backends: +1http://git-annex.branchable.com/backends/comment_20_29e500adf7e8614a938b3b714c9bd112/stephane-gourichon-lpad2016-10-22T17:55:13Z2016-10-22T17:55:13Z
<blockquote><p>SHA256e
I'd really like to have a SHA256e backend -- same as SHA256E but making sure that extensions of the files in .git/annex are converted to lower case. I normally try to convert filenames from cameras etc to lower case, but not all people that I share annex with do so consistently. In my use case, I need to be able to find duplicates among files and .jpg vs .JPG throws git annex dedup off. Otherwise E backends are superior to non-E for me. Thanks, Michael.</p></blockquote>
<p>Hello,</p>
<p>TL;DR: <em>I second Michael's wish for hashing backends that align extensions to lowercase.</em></p>
<h2>Context: files with the same content, extensions differing in case</h2>
<p>I realized a moment ago that git-annex basically automatically deduplicates with file granularity, which is very nice... <em>unless duplicates have varying case</em>, which does happen. For some cameras, if you download files through a cable you get one file name with one case, if you read the card directly with a card reader you get another case (and another filename, by the way).</p>
<p>I invite anyone interested to drop a line here.</p>
<h2>Workaround</h2>
<p>I understand I can align case after the fact with a bash shell command like the one below. Beware: the man page of <code>rename</code> says there exist other versions that don't check for an existing destination file, so in some specific cases (two files with different content whose names differ only in the case of the extension) the line below might cause you to lose some information. Or perhaps in other cases too. Make sure you know what you are doing; I'm not responsible.</p>
<pre><code># Lower-case known extensions; the pattern is anchored so only the trailing extension is rewritten.
for EXT in aac amr arw asf avi bup crw ctg dsc dv jpg lrv m4a m4v mov mp3 mp4 mpe mpg mrk ndf nef njb ogg pdf png pnm ppm ps psd thm tif tiff txt ufraw wav xcf xcfbz2 xmp
do find . -iname "*.${EXT}" -print0 | xargs -0 rename -v "s/\.${EXT}\$/.${EXT,,}/i" ; done
</code></pre>
<p>If you prefer to align to upper-case, replace <code>,,</code> with <code>^^</code>. This is bash syntax.</p>
<h2>Please consider <code>SHA256e</code> backend (and others).</h2>
<p>Anyway the shell command above is a workaround. A case-insensitive hashing backend seems a natural thing to do. It would bring the best of both worlds: deduplicate efficiently while not confusing programs that depend on symlink target having a particular extension.</p>
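<p>Until such a backend exists, duplicates that differ only in extension case can at least be detected by comparing keys with the extension removed. This is only a sketch, not a git-annex feature: the function names are hypothetical, and it assumes *E keys of the form shown elsewhere on this page, where the hex digest contains no <code>.</code> and so everything from the first <code>.</code> onward is the extension.</p>

```shell
# Strip the extension from an E-backend key, leaving BACKEND-sSIZE--DIGEST.
# A key without any "." is returned unchanged.
key_without_ext() { printf '%s\n' "${1%%.*}"; }

# Succeeds when two keys describe the same size and digest,
# regardless of the extension (or its case) recorded in the key.
same_content() {  # usage: same_content KEY1 KEY2
  [ "$(key_without_ext "$1")" = "$(key_without_ext "$2")" ]
}
```

<p>The keys themselves can be obtained with <code>git annex lookupkey</code>. Note that this only reports duplicates; the two files are still stored as separate objects in the annex.</p>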
keys from metadatahttp://git-annex.branchable.com/backends/comment_21_7f9cc5f966b28b7461e8ec42ceeb7165/Ilya S2018-09-07T15:57:37Z2018-09-07T15:57:37Z
"When a file is annexed, a key is generated from its content and/or metadata" -- the 'metadata' here just refers to the file name/size/mtime, not to https://git-annex.branchable.com/git-annex-metadata/ , correct?
comment 22http://git-annex.branchable.com/backends/comment_22_be75c669ed8e127de8b6364961a5c0cb/Ilya S2018-09-07T15:59:06Z2018-09-07T15:59:06Z
Is it possible to configure git-annex to use different backends based on file size? I.e. use a faster hash or even WORM for larger files.
comment 23http://git-annex.branchable.com/backends/comment_23_8c8b08d7e6409756da4f9f3b21d5ad01/joey2018-09-11T17:30:12Z2018-09-11T16:52:20Z
<p>@Ilya, indeed, this page is talking about filesystem metadata.
I've updated it for clarity.</p>
<p>There is not currently a way to switch backend based on file size,
although you can use annex.largefiles to make it check eg smaller files
directly into git rather than annexing them.</p>
comment 24http://git-annex.branchable.com/backends/comment_24_f90b8eeae8d150eca6a91416f20eb223/Ilya_Shlyakhter2018-09-19T16:50:05Z2018-09-19T16:50:05Z
<p>It seems that *E backends ignore file extensions longer than four chars: https://git-annex.branchable.com/bugs/file_extensions_of___62__4_chars_ignored_by___42__E_backends/
Is there some reason for doing it this way?</p>
comment 25http://git-annex.branchable.com/backends/comment_25_f6a697ac069ab018ccacf8f16b09bc7e/joey2019-01-21T15:42:51Z2018-09-24T15:42:24Z
<p>@Ilya_Shlyakhter, it's a heuristic; what constitutes a file extension is
not very well defined (consider ".tar.gz" and ".not-an-extension").
The heuristic has been refined over the years, but will never be perfect.
But, I don't know of any 5+ character file extensions in common use!</p>
stable vs unstable keyshttp://git-annex.branchable.com/backends/comment_26_9e9e0f3631c60b0cdcaa9c6dd4ae0d4e/Ilya_Shlyakhter2019-01-21T15:42:51Z2018-10-26T15:58:26Z
Backend.hs <a href="https://git.kitenet.net/index.cgi/git-annex.git/tree/Backend.hs?h=6.20181011&id=426f0f3f4bff378e7be402d905ed171c6f26e472#n16">classifies</a> keys as "stable" or not, with URL keys being unstable. How is this distinction used? I found only one <a href="https://git.kitenet.net/index.cgi/git-annex.git/tree/Remote/Helper/Chunked.hs?h=6.20181011&id=426f0f3f4bff378e7be402d905ed171c6f26e472#n111">place</a> where it's used, but couldn't quite understand it. If "stable" means "containing a hash of the content", then wouldn't WORM keys be unstable too?
re: stable vs unstable keyshttp://git-annex.branchable.com/backends/comment_27_c311551741a671bcb1f32b80431c1d66/joey2019-01-21T15:42:51Z2018-10-29T18:55:09Z
<p>It's only used to avoid uploading one chunk from one object that the key
points to, and then later upload a chunk from a different object.</p>
<p>While WORM keys could in theory "collide" and the same key point to
different content, that's no different than MD5 or SHA1 keys colliding;
it's a smallish risk, easily quantified, and you take that risk by
choosing to use those keys.</p>
<p>The risk that the content at an url might change varies over time or
something like that, so I think it makes sense to treat URL keys as specially
unstable.</p>
comment 28http://git-annex.branchable.com/backends/comment_28_44e8d93b1f0a88c36543438d6e33d702/Ilya_Shlyakhter2019-01-21T15:42:51Z2018-11-01T17:12:59Z
<p>"The risk that the content at an url might change varies over time or something like that, so I think it makes sense to treat URL keys as specially unstable." -- but, if I understand correctly, a URL key does not actually represent a URL? Rather, a URL can be attached to <em>any</em> key, and if the contents of some URLs claimed by a remote is unstable, such remotes should be marked as untrusted; while if the contents of a URL key is stored in a trusted remote, that contents is not unstable. But URL and WORM keys are both "unstable" in that their contents can't be verified.<br />
[[todo/alternate_keys_for_same_content]] could mitigate that.</p>
comment 29http://git-annex.branchable.com/backends/comment_29_9e14c32713b5607b245972b400286a45/joey2019-01-21T15:42:51Z2018-11-12T20:25:37Z
<p>The scenario that isStableKey is being used to guard against is two
repos downloading the content of an url and each getting different content,
followed by one repo uploading some chunks of its content and then the
other repo "finishing" the upload with chunks of its different content.
That would result in a mishmash of chunks being stored in the remote.</p>
<p>It's true that it could also happen using WORM with an url attached to it.
(Not with other types of keys that verify a checksum.)
Though it seems much less likely, since the file size is at least checked
for WORM, while with URL keys there's often no recorded file size. And,
WORMs don't typically have urls attached (I can't think of a single time
I've ever done that, it just feels like asking for trouble),
while URL keys always do.</p>
<p>If this is a serious concern, I'd suggest you open a todo or bug report
about it, there are far too many comments to wade through here already.
We could think about, perhaps not allowing download of WORM keys from urls
or something like that..</p>
Backend which doesn't store files at all?http://git-annex.branchable.com/backends/comment_30_bea2a3f4f673e6e9d5ee99cf10d39460/annex23842020-06-17T01:18:32Z2020-02-20T14:52:21Z
<p>I'd like to be able to have a "thin" repo on a FAT32 filesystem. Since this precludes hardlinks, is there a way to make a backend that just keeps track of the file's hash so we can detect when it changes? This would obviously need to rely on having copies in other repos for backup purposes. I'm thinking of a mode that behaves more like Unison, which just uses its fingerprint file to detect changes that need to be synced.</p>
<p>There would still be a file in the backend named by SHA256, but instead of storing the content it would store the location of possible local copies of the file. This would obviously need to use a smudge filter. It could be the default backend for thin repos on filesystems that don't support hardlinks.</p>
backends vs special remoteshttp://git-annex.branchable.com/backends/comment_31_32df8ec41dd8a6516b88b1871c998eb9/Ilya_Shlyakhter2020-06-17T01:18:32Z2020-02-20T19:23:04Z
<p>@annex2384 "Backend which doesn't store files at all?" -- are you sure you're thinking of backends and not of <a href="http://git-annex.branchable.com/special_remotes/">special remotes</a>? Backends don't "store files", special remotes do. Backends create keys identifying specific contents.</p>
<p>Not sure I fully understand your use case, but you could write an <a href="http://git-annex.branchable.com/special_remotes/external/">external special remote</a> that, for a given git-annex key, stores "the location of possible local copies of the file" -- e.g. using <code>SETSTATE</code> or <code>SETURIPRESENT</code>.</p>
Improving on the WORM situationhttp://git-annex.branchable.com/backends/comment_32_b0a375dcd0932aa380ce01d693462f5e/lh2022-04-13T22:19:40Z2022-04-13T22:19:40Z
<p>What do you think about a new backend? There are some hash functions out there that aren't necessarily cryptographically secure but are more performant and certainly better than metadata checks at deduplication.</p>
<p>Modern filesystems like BTRFS implement checksumming at the filesystem level for both data and metadata to detect (but not repair, not without some sort of parity data or mirroring) bitrot. Based on BTRFS's nice <a href="https://btrfs.readthedocs.io/en/latest/Checksumming.html">checksumming overview</a>, maybe CRC32C or xxHash would be good options? Their benchmarks suggest they're 60× and 40× faster than SHA256, respectively. According to the <a href="https://kdave.github.io/selecting-hash-for-btrfs/">benchmark source</a>, xxHash offers improved collision resistance over CRC32C, but the latter is more accessible on legacy systems (not much of a concern for git-annex, I think, given this would be a new, opt-in feature anyway). Since those benchmarks, the latest version, <a href="https://github.com/Cyan4973/xxHash">XXH3</a>, has received a stable release, and is apparently able to keep pace with RAM sequential read.</p>
<p>I was even thinking there could be an option to tap into filesystem checksumming, but BTRFS does this at the block level or after compression<a href="https://unix.stackexchange.com/a/411289">¹</a>, so that's off the table.</p>
<p>Please let me know if this is the right place for this, or if I should've opened a forum post.</p>
new backendshttp://git-annex.branchable.com/backends/comment_33_4c5f9c75cbb8a546e8325b206369aba1/Ilya_Shlyakhter2022-04-17T19:02:10Z2022-04-17T19:02:09Z
<blockquote><p>What do you think about a new backend?</p></blockquote>
<p>See the "External backends" section on this page for info on adding your own backends.</p>
comment 34http://git-annex.branchable.com/backends/comment_34_3782e7561607c8ccf1741bd039c2c648/joey2022-04-19T18:05:09Z2022-04-19T16:17:30Z
<a href="http://git-annex.branchable.com/todo/add_xxHash_backend/">add xxHash backend</a> is tracking requirements for adding xxh3.
combine worm with partial hashhttp://git-annex.branchable.com/backends/comment_35_8ac9d37cc2a7b0b361cf686fe27192dd/windfish2024-02-01T11:53:54Z2024-02-01T11:53:54Z
Would it be possible to provide a combined backend of WORM + partial hash? I'd imagine that this would make the backend faster than full hashing while also lowering the probability of erroneously treating two different but WORM-equivalent files as identical.
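<p>To make the idea concrete, here is a rough sketch of what such an identifier might look like. It is purely hypothetical: no such backend exists in git-annex as far as this thread shows, the function name and the 1 MiB default window are made up, and it assumes GNU <code>sha256sum</code>.</p>

```shell
# Hypothetical identifier: file size plus SHA-256 of only the first WINDOW bytes.
partial_id() {  # usage: partial_id FILE [WINDOW_BYTES]
  size=$(wc -c < "$1" | tr -d ' ')
  digest=$(head -c "${2:-1048576}" "$1" | sha256sum | cut -d' ' -f1)
  printf 's%s--%s\n' "$size" "$digest"
}
```

<p>The trade-off is visible directly: two files of equal size that differ only beyond the hashed window get the same identifier, so the window size decides how much of WORM's collision risk is actually removed.</p>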