Please describe the problem.
Any time git-annex reads a file (and presumably hashes it), it is about half as fast as just reading the file or `sha256sum`ing it on my hardware.
The repo I'm reading from is inside a btrfs on top of an HDD but the same happens in a btrfs image inside a tmpfs and inside a tmpfs directly, just to a lesser degree as there is no IO or filesystem overhead.
My CPU is pretty slow but reading a 1.7GiB file normally or even checksumming it is about an order of magnitude faster:
Command | Time |
---|---|
`git-annex fsck file` | 21s |
`sha256sum file` | 5s |
`cat file > /dev/null` | 2s |
(Tested inside a btrfs image in tmpfs with same settings (compress etc.))
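For comparison with the MiB/s figures quoted elsewhere in this report, the timings above can be converted to throughput. This is only a rough sketch: it assumes the file is exactly 1.7GiB and that the rounded times are accurate, so the results are approximate. The helper name `throughput_mib_s` is made up for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def throughput_mib_s(size_bytes: float, seconds: float) -> float:
    """Average throughput in MiB/s for reading size_bytes in the given time."""
    return size_bytes / seconds / MIB


# Size and timings taken from the table above (rounded values).
file_size = 1.7 * GIB
timings = {
    "git-annex fsck file": 21,
    "sha256sum file": 5,
    "cat file > /dev/null": 2,
}

rates = {cmd: throughput_mib_s(file_size, secs) for cmd, secs in timings.items()}

for cmd, rate in rates.items():
    print(f"{cmd:<22} {rate:7.1f} MiB/s")
```

That works out to roughly 83MiB/s for `git-annex fsck`, 348MiB/s for `sha256sum` and 870MiB/s for `cat`.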
This also happens on `add`, `copy`, `get` etc., but it's even worse there because of higher IO overhead, which results in average speeds of ~70MiB/s.
I'm currently in the process of transferring a few terabytes' worth of data from multiple relatively slow drives onto one very fast drive and would like to parallelise the transfer. Unfortunately, this issue seems to scale inversely with the level of parallelism: if I get 70MiB/s from each drive individually at `-J1`, I get ~35MiB/s from both at `-J2`.
I had to resort to `rsync`ing the objects dirs manually, as that's faster than any method of git-annex-internal transfer.
What steps will reproduce the problem?
Compare the runtime of `git-annex fsck` vs. `sha256sum` and `cat`.
What version of git-annex are you using? On what operating system?
git-annex version: 10.20220127
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite S3 WebDAV
dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.29 DAV-1.3.4 feed-1.3.2.0 ghc-8.10.7 http-client-0.6.4.1 persistent-sqlite-2.13.1.0 torrent-10000.1.1 uuid-1.3.15 yesod-1.6.2
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 8
NixOS 21.11
Please provide any additional information below.
# If you can, paste a complete transcript of the problem occurring here.
# If the problem is with the git-annex assistant, paste in .git/annex/daemon.log
# End of transcript or log.
To expand on this, results from my more powerful machine reading a 14G file (all in tmpfs):
git-annex fsck file
sha256sum file
cat file > /dev/null
(5800X with dual channel quad rank 3600MT/s CL16 RAM)
(Sorry, the command used on the more powerful machine was `git annex add`, not `fsck`.)

The read speed here (~320MiB/s) is pretty close to the `openssl speed sha256` result for 16B block size (~360MiB/s) or 64B blocks with SHA256 acceleration disabled (~300MiB/s, `OPENSSL_ia32cap=:~0x20000000`).

What block size is used here, and does git-annex use SHA instructions for hashing?
Cannot reproduce that:
(1 gb file; tmpfs)
git-annex simply uses cryptonite for hashing, which is a thin wrapper around C code and contains optimised routines for some processors, generally copied from the reference C implementations.
The block size is 32 kb. This can be changed in Utility/Metered.hs in defaultChunkSize.
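To get a feel for how much the chunk size matters by itself, here is a minimal sketch in Python (not git-annex's Haskell code) that hashes the same buffer with different read sizes. It assumes Python's `hashlib`, which is typically backed by OpenSSL; the function names are made up for illustration, and interpreter overhead per call dominates at very small chunk sizes, so the numbers say little about cryptonite itself.

```python
import hashlib
import io
import time


def hash_stream(stream, chunk_size):
    """Hash a binary stream incrementally, reading chunk_size bytes at a time."""
    h = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()


def benchmark(data, chunk_sizes):
    """Return {chunk_size: (hexdigest, MiB/s)} for hashing data from memory."""
    results = {}
    for size in chunk_sizes:
        start = time.perf_counter()
        digest = hash_stream(io.BytesIO(data), size)
        # Guard against a zero interval on very small inputs.
        elapsed = max(time.perf_counter() - start, 1e-9)
        results[size] = (digest, len(data) / elapsed / 2**20)
    return results


if __name__ == "__main__":
    data = bytes(8 * 2**20)  # 8 MiB of zeros, held in memory
    for size, (digest, rate) in benchmark(data, [16, 64, 32 * 1024, 2**20]).items():
        print(f"{size:>8} B chunks: {rate:8.1f} MiB/s  {digest[:16]}")
```

The digest is identical for every chunk size; only the throughput changes.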
I found the issue, cryptonite doesn't make use of SHA instructions which both of my machines have. I'm going to open an issue upstream.
Is there an alternative crypto library that can make use of these perhaps? The speedup/efficiency gain is >5x on my 5800X.
Awesome that you were able to reproduce it in cryptonite and file that. Thank you!
I'm inclined to close this, because it seems unlikely there will be a fix for it in git-annex. One fix that would be possible here is for git-annex to use the external sha256sum command. But that would prevent git-annex from displaying progress while hashing, which is why it moved away from it in the first place.
Of course, you could also try using a different kind of hash that is faster for new content.
I don't know of anything else in the haskell space that is likely to be better; cryptonite is basically the standard library for this. There is certainly precedent in it for using optimised instructions, like SSE.
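The progress constraint mentioned above is easy to illustrate: when the hash is fed chunk by chunk in-process, a progress callback can run between chunks, which an external `sha256sum` process does not offer. A minimal sketch in Python, assuming `hashlib`; `hash_with_progress` and its callback signature are hypothetical and not git-annex's actual interface.

```python
import hashlib
import io
from typing import BinaryIO, Callable

CHUNK_SIZE = 32 * 1024  # mirrors the 32 kb block size mentioned earlier


def hash_with_progress(
    stream: BinaryIO,
    total: int,
    on_progress: Callable[[int, int], None],
    chunk_size: int = CHUNK_SIZE,
) -> str:
    """Hash a stream incrementally, reporting bytes processed after each chunk."""
    h = hashlib.sha256()
    done = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
        done += len(chunk)
        on_progress(done, total)
    return h.hexdigest()


if __name__ == "__main__":
    data = bytes(100_000)
    digest = hash_with_progress(
        io.BytesIO(data),
        len(data),
        lambda done, total: print(f"{done * 100 // total}%"),
    )
    print(digest)
```

Any replacement library or external process would need to offer an equivalent hook, or at least accept data in chunks, for this kind of progress display to keep working.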
SHA256 is by far the fastest hash on my Celeron J4105 with acceleration. Second is BLAKE at around half the speed of accelerated SHA256, or over twice the speed of unaccelerated SHA256.
Seems like coreutils simply uses openssl when available, and openssl handles HW acceleration. I just tried using two Haskell OpenSSL wrappers (hopenssl and HsOpenSSL) in my minimal example and it was actually faster than the openssl CLI utility by almost 100MiB/s (though still a bit slower than `sha256sum`).

Would using those be an option?
Upstream does not seem interested in improving performance. Are there alternative libraries that could be used instead?
Would it be possible to use OpenSSL somehow perhaps?
AFAICT git-annex hooks into the hashing routine for things like progress monitoring. Perhaps those could be re-architected to work with an external "black box" library or process?
cryptonite has been forked to crypton, which git-annex can build with as an option. It may be that the maintainer of that will be more open to this kind of optimisation. https://github.com/kazu-yamamoto/crypton/issues
My laptop now has SHA256 in hardware I assume, as I'm seeing similar speed differences. Eg:
I agree that this bug should be left open since even a relatively low-end laptop now has this, git-annex shouldn't leave so much performance on the table.
I've opened an issue on crypton: https://github.com/kazu-yamamoto/crypton/issues/31
If it's rejected from crypton, I'm open to considering another library. I'd rather avoid OpenSSL wrappers because adding a C library dependency on openssl will complicate building git-annex in some situations. It would be better to have a haskell library that, like cryptonite, embeds the necessary C code.
Also, whatever library git-annex uses needs to support incremental hashing, otherwise git-annex has to pay a performance penalty of re-reading a file to hash it after download, rather than hashing while downloading.
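To make the incremental-hashing requirement concrete, here is a minimal sketch in Python (again, not git-annex's code) of hashing while downloading: each received chunk is written to disk and fed to the hash in the same pass, so the file never has to be re-read afterwards. The names `store_and_hash` and `verify_download` are hypothetical, and the chunk iterable stands in for whatever the transfer layer yields.

```python
import hashlib
import io
from typing import BinaryIO, Iterable


def store_and_hash(chunks: Iterable[bytes], dest: BinaryIO) -> str:
    """Write each incoming chunk to dest and update the hash in the same pass."""
    h = hashlib.sha256()
    for chunk in chunks:
        dest.write(chunk)
        h.update(chunk)
    return h.hexdigest()


def verify_download(chunks: Iterable[bytes], dest: BinaryIO, expected_sha256: str) -> bool:
    """Return True if the stored content matches the expected SHA256 hex digest."""
    return store_and_hash(chunks, dest) == expected_sha256


if __name__ == "__main__":
    incoming = [b"hello ", b"world"]  # stand-in for chunks arriving from a remote
    expected = hashlib.sha256(b"hello world").hexdigest()
    with io.BytesIO() as dest:
        print(verify_download(incoming, dest, expected))
```

A library that only offers a one-shot "hash this file" call cannot be used this way, which is the performance penalty described above.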
Thank you for looking into this again.
Would it be possible to make this a build-time option perhaps?
git-annex without SIMD hashing obviously still works fast enough for many purposes, as it's the status quo, but having it would be a greatly appreciated optimisation for many. It'd be great to have the option to enable it wherever possible and simply fall back to non-SIMD where it isn't.
Agreed. Incremental hashing is too important to lose over a general optimisation like this.
https://hackage.haskell.org/package/botan-low is another possibility. There is a significant effort ongoing to build up this library stack in haskell. It does need the botan C library to be installed separately unfortunately (I've suggested they embed it).
I have not benchmarked it yet, but the docs say it supports hardware-accelerated SHA1, SHA2, SHA3 on x86 and also SHA1, SHA2 on arm64.