Would it be hard to add a variation to the checksumming backends that changes how the checksum is computed: instead of computing it over the whole file, it would first be computed over file chunks of a given size, and the final checksum would then be computed over the concatenation of the chunk checksums? You'd add a new key field, say cNNNNN, specifying the chunk size (the last chunk might be shorter). Then (1) for large files, checksum computation could be parallelized (a config option could specify the default chunk size for newly added files); and (2) I often have large files on a remote for which I have an md5 for each chunk, but not for the full file; this would let me register the location of these files with git-annex without downloading them, while still using a checksum-based key.
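A minimal sketch of the proposed scheme (hypothetical, not an existing git-annex backend): hash each fixed-size chunk, then hash the concatenation of the chunk digests to get the key's checksum.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # hypothetical chunk size (the cNNNNN key field); 1 MiB here

def chunked_md5(path, chunk_size=CHUNK_SIZE):
    """Hash each fixed-size chunk, then hash the concatenation of the chunk digests."""
    chunk_digests = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # each chunk's digest depends only on that chunk, so for large
            # files this loop could be spread over several workers
            chunk_digests.append(hashlib.md5(chunk).digest())
    # final checksum over the concatenated chunk checksums
    return hashlib.md5(b''.join(chunk_digests)).hexdigest()
```

Because the per-chunk digests are independent, the chunk hashing could run in parallel and the digests be joined in order afterwards.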
Closing, because support for external backends is implemented, so you should be able to roll your own backend for your use case here. done --Joey
Parallelizable checksums exist; I would rather add support for a standard one to git-annex than roll my own. In fact, I already did when I added BLAKE2SP224 etc., which are 8-way parallel and can also be optimised with SIMD instructions on less parallel systems.
addurl --fast does that, but creates a non-checksum key. If I can get an MD5 without downloading, I can use setpresentkey. But often I only have the MD5 for the fixed-size chunks of the file, not for the whole file. Adding a backend variant computable from the MD5s of the chunks would solve the problem. Maybe there are other solutions?

That seems like an unusual use case that would be an unnecessary complication to add to git-annex, but that external backends could be used to implement as needed.
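For that use case, where only per-chunk MD5s are known, a hypothetical external backend could derive the composite checksum without downloading any data, along these lines:

```python
import hashlib

def key_checksum_from_chunk_md5s(chunk_md5_hexdigests):
    """Derive the composite checksum from per-chunk MD5s that are already
    known for a remote file, without downloading any data.
    (Hypothetical scheme; it must combine chunk digests exactly as the
    backend would when hashing a local file.)"""
    digest_bytes = b''.join(bytes.fromhex(h) for h in chunk_md5_hexdigests)
    return hashlib.md5(digest_bytes).hexdigest()
```

The resulting value could then serve as the checksum part of a key for such a backend, with the file itself left on the remote.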
Another theoretical use case (not available for now, but maybe for the future): verify parts of the file with checksums and re-download only those parts/chunks that are bad. For this you need a checksum for each chunk and a "global" checksum in the key that somehow incorporates all the chunk checksums. An example of this is the Tiger Tree Hash used in file sharing.
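A sketch of that idea, assuming per-chunk checksums were recorded when the file was added (git-annex does not do this today): compare each chunk against the stored digest and report which chunks need re-downloading.

```python
import hashlib

def bad_chunks(path, expected_chunk_md5s, chunk_size):
    """Return the indices of chunks whose MD5 does not match the recorded
    per-chunk checksum, so only those byte ranges need re-downloading.
    (Hypothetical; git-annex does not record per-chunk checksums today.)"""
    bad = []
    with open(path, 'rb') as f:
        for i, expected in enumerate(expected_chunk_md5s):
            chunk = f.read(chunk_size)
            if hashlib.md5(chunk).hexdigest() != expected:
                bad.append(i)
    return bad
```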
When I used the SHA256 backend for my downloads, I often thought that the long process of checksumming a movie or an OS installation .iso is not ideal: if the file download is not complete, I get the wrong checksum, and the whole process needs to be repeated.
And in the future git-annex could integrate a FUSE filesystem and literally store just chunks of files, while presenting the files whole in this virtual filesystem view.
"verify with checksums parts of the file and re-download only those parts/chunks, that are bad." -- if I understand correctly, git-annex doesn't checksum chunks, but can tell incompletely downloaded chunks based on size.
My original use case (registering the presence of a chunked file in a remote without downloading it) might be implementable with a setpresentkey option to record the chunked state. The checksums of the chunks would not be used, though.