In neurophysiology we encounter HUGE files (HDF5 .nwb files). Sizes reach hundreds of GBs per file (thus exceeding any possible file system memory cache size). While operating in the cloud or on a fast connection it is possible to fetch the files at speeds up to 100 MBps. Upon successful download such files are then read back by git-annex for checksum validation, often at slower speeds (eg <60 MBps on an EC2 SSD drive). So, ironically, it does not just double, but nearly triples the overall time to obtain a file.
I think ideally,
- (at minimum) for built-in special remotes (such as web), it would be great if git-annex checksummed incrementally as the data comes in (a rough sketch of the idea is below);
- make it possible for external special remotes to provide the desired checksum for the obtained content. git-annex should of course first inform them of the type (backend) of checksum it is interested in, and maybe external remotes could report which checksums they support.
If an example is needed, here is http://datasets.datalad.org/allen-brain-observatory/visual-coding-2p/.git with >50GB files such as ophys_movies/ophys_experiment_576261945.h5 .
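Purely as an illustration of the first point (this is not git-annex code; the function name and hash algorithm are arbitrary choices), checksumming a web download as the bytes arrive could look roughly like this, so the file never has to be re-read for verification:

```python
import hashlib
import urllib.request

def download_and_hash(url, dest, algo="sha256"):
    """Download url to dest, hashing the stream as it arrives."""
    hasher = hashlib.new(algo)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        for chunk in iter(lambda: resp.read(1 << 20), b""):
            hasher.update(chunk)  # checksum computed while the data comes in
            out.write(chunk)      # so the file is written and hashed in one pass
    return hasher.hexdigest()
```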
For external remotes: can pass to the `TRANSFER` request, as the `FILE` parameter, a named pipe, and use `tee` to create a separate stream for checksumming. An external remote could also do its own checksum checking and then set `remote.<name>.annex-verify=false`. Could also make a "wrapper" external remote that delegates all requests to a given external remote but does checksum-checking in parallel with downloading (by creating a named pipe and passing that to the wrapped remote).
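A rough sketch of that idea, assuming the wrapped remote's downloader can be pointed at an arbitrary path (`start_download` and the other names are hypothetical; this is not an actual external remote): create a named pipe, hand it to the downloader as the `FILE`, and tee the stream into the real destination while hashing it.

```python
import hashlib
import os

def retrieve_via_fifo(start_download, destination, fifo_path, algo="sha256"):
    """start_download(fifo_path) is assumed to return a process writing the content to fifo_path."""
    os.mkfifo(fifo_path)
    try:
        downloader = start_download(fifo_path)  # hypothetical, e.g. a subprocess.Popen
        hasher = hashlib.new(algo)
        with open(fifo_path, "rb") as pipe, open(destination, "wb") as out:
            for chunk in iter(lambda: pipe.read(1 << 20), b""):
                hasher.update(chunk)  # checksum in parallel with the download
                out.write(chunk)      # ...while teeing the data to the real file
        downloader.wait()
        return hasher.hexdigest()
    finally:
        os.unlink(fifo_path)
```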
That is an interesting idea, thanks! Not sure it makes it easy for mass consumption though, since it is a feature of an external remote; I am also not sure why it should be in the config. Ideally it should be a property of the remote.
Joey, what do you think with regard to the built-in remotes?
When -J is used, recent versions use a separate pool of worker threads for checksumming than for downloading. So even with -J1, checksumming the previous download will not block the next download.
I've thought about making this the default without -J.. It relies on concurrent-output working well, which it sometimes may not, eg when filenames are not valid unicode, or perhaps on a non-ANSI terminal, and so far it's been worth not defaulting to -J1 to avoid breaking in such edge cases.
Anyway, it seems to me using -J should avoid most of the overhead, except of course for the remaining checksumming after all downloads finish.
Incremental checksumming could be done for some of the built-in remotes, but not others like bittorrent which write out of order. Some transfers can resume, and the checksumming would have to somehow catch up to the resume point, which adds significant complexity.
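For the resume case, the catch-up step would amount to something like this sketch (illustrative only): hash whatever is already in the temp file, then keep feeding the same hasher as the resumed transfer appends data.

```python
import hashlib

def catch_up_hasher(tmpfile, algo="sha256", chunk=1 << 20):
    """Hash the already-downloaded prefix so a resumed transfer can continue the hash."""
    hasher = hashlib.new(algo)
    offset = 0
    with open(tmpfile, "rb") as f:
        for data in iter(lambda: f.read(chunk), b""):
            hasher.update(data)
            offset += len(data)
    return hasher, offset  # resume the download at offset, keep updating hasher
```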
External remotes would need to send the content over a pipe for incremental checksumming, so it would need a protocol extension.
git-annex's remote API does have the concept that a remote can sufficiently verify the content of a file during transfer that additional checksumming is not necessary. Currently only used for git remotes when hard linking an object from a sibling remote. I don't think it actually matters what checksum a remote uses to do such verification, as long as it's cryptographically secure and runs on the local machine.
A protocol extension that let an external remote communicate to git-annex that it had done such verification at the end of transfer is worth thinking about.
Re Ilya's security concerns, as long as the external remote runs the verification on the local machine, it seems there is no added security impact.
I investigated the feasibility of a protocol extension that, when an external special remote enables it, means it always verifies a checksum itself before sending `TRANSFER-SUCCESS RETRIEVE`. Which checksum would be up to the special remote. The question is, would it need to be cryptographically secure, or would a CRC, sha1, or md5 suffice?
By default git-annex prevents downloads from special remotes when the content can't be verified (see annex.security.allow-unverified-downloads), and that is to avoid a class of security holes (CVE-2018-10859). For the purposes of fixing that hole, sha1 and md5 were considered good enough. The attacker does not control the original content, so a preimage attack won't work. The attacker has a gpg encrypted file they want to get decrypted, and they might be able to modify the file (eg appending junk, or messing with the gpg data in some way?) and cause a collision. I think sha1 and md5 are secure enough to avoid this attack. CRC is certainly not good enough. I'd be wary of md4 since its preimage resistance is broken.
So, doing this as a protocol extension would need to document that the hash needs to have preimage resistance, or be generally cryptographically secure. And then, if an external special remote was using sha1 or whatever and it got sufficiently broken, it would be up to the maintainer of it to update it to stop using the protocol extension.
I also looked at special remotes built into git-annex. Tahoe certainly does enough verification of downloads. Bittorrent doesn't, because the torrent file is often downloaded w/o verification (magnet links could maybe be verified enough). Rsync usually uses a good enough checksum, but can fall back to md4 or perhaps no checksum at all, and the attacker might control the rsync server, so the rsync special remote and also ssh remotes still need their own verification. Bup uses sha1 so does enough verification. All the rest don't.
I think the question is, would this protocol extension really get used by any external special remotes? Anyone have an example of one? The alternative is to change the protocol so the downloaded content gets streamed to git-annex in some way, and have git-annex incrementally checksum as content comes in. Which really seems better all around, except probably a lot harder to implement, and using a small amount more CPU in cases where the external special remote is really doing its own checksumming.
The incrementalhash branch has a mostly complete implementation of incremental hashing on download, for retrieving from git-annex remotes. And the Backend interface added for that should be easy to use in some special remotes as well. Let's start with that, and worry about external special remotes only after the low-hanging fruit is picked.
Thank you Joey! In my particular/initial use case having it done only for "native" git annex downloads would already be great.
In the long term I do see us implementing it for the datalad special remote. Having it somehow just "streaming" into git-annex would probably be the "easiest on custom remotes" and would provide the most flexibility, since it would be up to git-annex to use the checksum corresponding to the backend.
But I wonder if "streaming" is really needed -- through `TRANSFER STORE somekey tmpfile` git-annex already knows where the file downloaded so far is, and the most recently written blocks are likely to be in the FS cache, so maybe it would suffice for a special remote to, in addition to PROGRESS, announce how many sequential bytes from the beginning have been downloaded already; then if git-annex incrementally reads from the "growing" file it would just work? (I am not sure if Windows would allow such "collaborative" use of the file, where one process writes and another one reads.) Although I guess it would mean that git-annex would still need to trust the external remote not to later change some earlier bytes in the file, so it might not be sufficiently secure. Thus probably the true streaming/named pipe Ilya suggested would be needed.
Incremental hashing implemented for transfers over ssh and tor.
A good next step would be transfers to/from local git remotes. Currently those use rsync, or cp for CoW. It does not make sense to trust rsync's checksum verification here, because garbage in, garbage out -- rsync doesn't know what the hash should be and will happily transfer a corrupted file. So instead, this would need to stop using rsync and implement our own file-to-file copying with resuming and incremental hashing. Which would not be hard, and gets rid of the dependency on rsync besides (except for talking with really old git-annex-shell).
As for cp, CoW also suffers from GIGO, so I think the file will still need to be read, after it's copied, to make sure it has the expected checksum.
Urk: Using rsync currently protects against potential data loss with URL keys, so the replacement would also need to deal with that. Eg, by comparing the temp file content with the start of the object when resuming.
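A rough sketch of such a replacement (illustrative Python rather than the Haskell git-annex would use; names are made up): only resume if the temp file is a prefix of the source object, and hash incrementally while copying the rest, so no separate checksum pass is needed.

```python
import hashlib

CHUNK = 1 << 20

def copy_with_resume_and_hash(src, tmp, algo="sha256"):
    hasher = hashlib.new(algo)
    resume_at = 0
    # Guard against the data-loss case: only resume if the temp file really is
    # a prefix of the source object; otherwise start over from scratch.
    try:
        with open(src, "rb") as s, open(tmp, "rb") as t:
            for partial in iter(lambda: t.read(CHUNK), b""):
                if s.read(len(partial)) != partial:
                    hasher = hashlib.new(algo)  # mismatch: discard the partial copy
                    resume_at = 0
                    break
                hasher.update(partial)
                resume_at += len(partial)
    except FileNotFoundError:
        pass  # no temp file yet, nothing to resume
    # Copy the remainder, feeding the hasher as we go.
    with open(src, "rb") as s, open(tmp, "r+b" if resume_at else "wb") as t:
        s.seek(resume_at)
        t.seek(resume_at)
        for data in iter(lambda: s.read(CHUNK), b""):
            hasher.update(data)
            t.write(data)
    return hasher.hexdigest()
```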
Incremental hashing implemented for local git remotes.
Next step should be a special remote, such as directory, that uses byteRetriever. Chunking and encryption will complicate them..
Not a bad idea, yoh!
It could be taken further: Keep reading from the file as it grows and hash incrementally. Use inotify to know when more has been written, or poll periodically or when PROGRESS is received. If the hash is wrong at the end, rehash the file. Only remotes that write chunks out of order would pay a time penalty for that extra hashing.
That could be enabled by a protocol extension. Or disabled by a protocol extension for that matter, since remotes that write out of order are probably rare.
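A simplified sketch of that scheme (polling here rather than inotify, and not the actual tailVerify code): follow the temp file as it grows, hash new bytes as they appear, and only rehash the whole file once at the end if the streaming hash came out wrong, e.g. because the remote wrote chunks out of order.

```python
import hashlib
import os
import time

def tail_hash(tmpfile, expected_size, algo="sha256", poll=0.1):
    """Hash tmpfile front to back while another process is still writing it."""
    hasher = hashlib.new(algo)
    pos = 0
    while pos < expected_size:
        size = os.path.getsize(tmpfile) if os.path.exists(tmpfile) else 0
        if size > pos:
            with open(tmpfile, "rb") as f:
                f.seek(pos)
                data = f.read(size - pos)
            hasher.update(data)
            pos += len(data)
        else:
            time.sleep(poll)  # nothing new yet; a real implementation could use inotify
    return hasher.hexdigest()

def verify_download(tmpfile, expected_size, expected_digest, algo="sha256"):
    if tail_hash(tmpfile, expected_size, algo) == expected_digest:
        return True
    # The remote may have written out of order: pay the penalty of one full rehash.
    rehash = hashlib.new(algo)
    with open(tmpfile, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            rehash.update(chunk)
    return rehash.hexdigest() == expected_digest
```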
Some special remotes now support incremental update. So far, limited to ones that use the byteRetriever interface. Others, that use fileRetriever, including external special remotes, still need work.
I've implemented tailVerify; all that remains is to hook up all the special remotes that retrieve to a file to use it.
That's mostly ones using fileRetriever, which will come down to making retrieveChunks use it, probably. Although it may be that a few of them, like Remote.Directory, could avoid using it and feed the verifier themselves, more efficiently.
There are also a couple of special remotes that don't use fileRetriever or retrieveChunks but could also use tailVerify. Looks like only Remote.BitTorrent and Remote.Web. But bittorrent is just the kind of thing that tailVerify will not support well due to random access. And Remote.Web could feed the verifier itself.
Status update: This is implemented, and seems to work in limited testing.
The directory special remote's CoW probing has a (create, delete, create) sequence that prevents incremental hashing from being done. It should be possible to make that remote incrementally hash on its own, while copying, rather than using tailVerify. That would avoid the problem, and be more efficient besides.
The web special remote still needs to be made to do incremental hashing. Besides bittorrent, I think it's the only other one that won't do it now.
One thing I don't like: When the incremental checksumming falls behind the transfer, it has to catch up at the end, reading the remainder of the file. That currently runs in the transfer stage. If concurrency is enabled, it's possible for all jobs to be stuck doing that, rather than having any actually running transfers, and so fail to saturate bandwidth. This is kind of a reversion of a past optimisation that made checksums happen in a separate stage to avoid that problem. Although now it only happens when the transfer runs at the same speed as the disk (or perhaps when resuming an interrupted transfer); otherwise the checksumming won't fall behind. It would be better to move the rest of the incremental checksumming out of the transfer stage (by returning the action inside Verification and running it when the verification is checked), or perhaps to switch to the checksum stage before doing it. (update: this is fixed)
The concurrency problem is fixed now.
All special remotes now do incremental hashing in most cases (on linux, subset on other OS's).
Only thing remaining is: Retrieval from export/import special remotes does not do incremental hashing (except for versioned ones, which sometimes use retrieveKeyFile).
Update: incremental hashing is also now done for all export remotes. Only import (and export+import) remotes don't support incremental hashing now.
Update 2: Now also done for import remotes. All done!