Please describe the problem.
With an external special remote that handles a custom URL scheme, I receive a "Verification of content failed" on the first git annex get of a file (i.e. when git-annex cannot know a checksum for the file, yet).
Sorry that this is hidden behind a bit of indirection in a datalad extension. What it does is effectively just implement an external special remote that handles cds: URLs and then git annex addurl --fast --verifiable those URLs. I get the same verification error even with --relaxed instead of --fast (though I would like to have the semantics of --fast, i.e. record the checksum on first download and then always check against that).
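For context, an external special remote claims a URL scheme by answering git-annex's CLAIMURL request with CLAIMURL-SUCCESS or CLAIMURL-FAILURE. A minimal sketch of that exchange follows; the handle_claimurl helper and its scheme parameter are hypothetical names for illustration, not the actual datalad-cds code:

```python
def handle_claimurl(line: str, scheme: str = "cds:") -> str:
    """Build the reply to one request line sent by git-annex.

    Hypothetical helper: only the CLAIMURL request is handled here,
    everything else is answered with UNSUPPORTED-REQUEST.
    """
    command, _, url = line.partition(" ")
    if command != "CLAIMURL":
        return "UNSUPPORTED-REQUEST"
    # Claim the url only if it uses the custom scheme of this remote.
    if url.startswith(scheme):
        return "CLAIMURL-SUCCESS"
    return "CLAIMURL-FAILURE"


if __name__ == "__main__":
    print(handle_claimurl("CLAIMURL cds:v1-example"))
    print(handle_claimurl("CLAIMURL https://example.com/file"))
```

A real remote would read such lines from stdin in a loop and also implement the other requests of the protocol (PREPARE, TRANSFER, and so on).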
What steps will reproduce the problem?
Install datalad, and datalad-cds from this PR: https://github.com/matrss/datalad-cds/pull/16. Then:
datalad create test-ds
cd test-ds/
datalad download-cds --lazy --path download.grib '{
"dataset": "reanalysis-era5-pressure-levels",
"sub-selection": {
"variable": "temperature",
"pressure_level": "1000",
"product_type": "reanalysis",
"date": "2017-12-01/2017-12-31",
"time": "12:00",
"format": "grib"
}
}'
git annex get download.grib
What version of git-annex are you using? On what operating system?
git-annex version: 10.20240430
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite S3 WebDAV
dependency versions: aws-0.24.1 bloomfilter-2.0.1.2 crypton-0.34 DAV-1.3.4 feed-1.3.2.1 ghc-9.6.5 http-client-0.7.17 persistent-sqlite-2.13.3.0 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL VURL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
on Ubuntu, installed from a recent version of nixpkgs. Also happens in CI (see PR in datalad-cds) where git-annex is installed from NeuroDebian.
Please provide any additional information below.
# If you can, paste a complete transcript of the problem occurring here.
# If the problem is with the git-annex assistant, paste in .git/annex/daemon.log
$ datalad create test-ds
create(ok): <...> (dataset)
$ cd test-ds/
$ datalad download-cds --lazy --path download.grib '{
"dataset": "reanalysis-era5-pressure-levels",
"sub-selection": {
"variable": "temperature",
"pressure_level": "1000",
"product_type": "reanalysis",
"date": "2017-12-01/2017-12-31",
"time": "12:00",
"format": "grib"
}
}'
save(ok): . (dataset)
cds(ok): <...> (dataset)
$ git annex info download.grib
file: download.grib
size: 0 bytes (+ 1 unknown size)
key: VURL--cds:v1-eyJkYXRhc2V0IjoicmVhbmFs-77566133ebfe9220aefbeed5a58b6972
present: false
$ git annex get download.grib
get download.grib (from cds...)
CDS request is submitted
CDS request is completed
Starting download from CDS
(checksum...)
Verification of content failed
Unable to access these remotes: cds
No other repository is known to contain the file.
failed
get: 1 failed
# End of transcript or log.
Looking at 55bf01b78888b410a8ac07e834ed7104ffa1f4d0 it talks about something like this:
I think at the time, I just punted on the question of how to register the equivalent key when downloading a VURL from other special remotes.
Thing is, git-annex just calls retrieveKeyFile at some point. At that point, it doesn't know if it's retrieving the key from the web or some other special remote that claims urls.
It would not do for a git-annex get that happens to get the VURL key from, e.g., a directory special remote to say: hey, we've not populated the equivalent key log for this VURL yet, let's trust that what we downloaded from the remote is good and populate it now.
So that's why the code went into the web special remote. It seems to me that to move it out of the web special remote, there would need to be a way for git-annex to check if a given special remote claims the url associated with a VURL. E.g., look up the key's url(s) and check claimUrl.
Hmm, if it did that, there would still be a problem that retrieveKeyFile can try to verify the content it downloads, using verifyKeyContent. But when no equivalent key is yet registered for the VURL, retrieveKeyFile would find that anything it downloaded failed to verify. So it would fail. Chicken and egg problem.
But well, it would be easy enough to make this not be treated as a failure. In Backend.VURL, verifyKeyContent returns False when there are no equivalent keys. It could just return True.
That seems like a necessary first step. Just making that change will solve your problem to the extent that it's ok for no verification of the content of VURLs to be done when they originate with your special remote.
So I've done that. Whether it makes sense to do the rest, I am not yet sure.
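The change described above can be illustrated with a small sketch. The real code lives in git-annex's Haskell module Backend.VURL; the Python function below, its name, its parameters, and the assumption that equivalent keys are SHA256 checksums are all made up for illustration:

```python
import hashlib


def verify_vurl_content(content, equivalent_sha256, accept_when_unknown=True):
    """Toy model of verifying a VURL key's content.

    equivalent_sha256 holds the hex digests recorded as equivalent keys.
    accept_when_unknown=False models the old behaviour (fail when no
    equivalent key is registered); True models the changed behaviour.
    """
    if not equivalent_sha256:
        # Nothing to verify against yet, e.g. on the very first download.
        return accept_when_unknown
    digest = hashlib.sha256(content).hexdigest()
    # Content verifies if it matches any recorded equivalent key.
    return digest in equivalent_sha256


if __name__ == "__main__":
    data = b"example content"
    print(verify_vurl_content(data, [], accept_when_unknown=False))
    print(verify_vurl_content(data, []))
```

With accept_when_unknown=False the first download of a VURL key from a remote other than web always fails, which matches the "Verification of content failed" message in the report.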
Thinking some more about the consequences of generalizing this from the web special remote to all special remotes that claim urls, I came up with this scenario:
A special remote claims some urls. But it can also store arbitrary keys that are sent to it by git-annex.
At first, git-annex addurl --verifiable --relaxed is used to download one of the urls that the special remote claims.
Later, that key gets copied back to the special remote.
Then the special remote corrupts the content that was stored on it.
Then, git-annex get is used to download the corrupted content from the special remote. Let's say that the special remote, in this case, sends the object file that was stored in it, rather than looking up the url and retrieving that.
This special remote doesn't do checksum verification itself, so retrieveKeyFile succeeds despite the corruption, and returns UnVerified.
Then git-annex verifies the content. And it fails verification. But, since --relaxed was used when the VURL was generated, it has no size. Which means any content from the special remote should be accepted. Even though it's corrupted!
The web special remote doesn't have this problem because it doesn't store arbitrary git-annex objects. My conclusion is that a special remote that does support storing arbitrary objects in it, and also supports claimUrl, cannot work properly with --relaxed for VURLs. They could still support --fast, but this is making me wonder if learning equivalent key checksums for VURLs really should be generalized beyond the web special remote.
Maybe special remotes where that does make sense should do the same kind of thing that the web special remote does. Then we would be looking at an extension to the external special remote protocol, or some utility command to use to register the content of a VURL key.
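The scenario above can be played through with a toy model. This is only a sketch of the reasoning in this comment, not git-annex's actual logic: the accept_download function, the use of SHA256, and the rule "a key without a size was created with --relaxed, so changed content is accepted and learned" are simplifying assumptions.

```python
import hashlib
from typing import Optional, Set


def accept_download(content: bytes, known_sha256: Set[str],
                    key_size: Optional[int]) -> bool:
    """Decide whether downloaded content for a VURL key is accepted.

    known_sha256: checksums learned so far (the equivalent keys).
    key_size: None for a key made with --relaxed, else the recorded size.
    """
    digest = hashlib.sha256(content).hexdigest()
    if digest in known_sha256:
        return True
    if key_size is None:
        # Assumed rule: relaxed keys may change, so the new checksum
        # is learned and the content accepted, corrupted or not.
        known_sha256.add(digest)
        return True
    return False


if __name__ == "__main__":
    original = b"grib data"
    corrupted = b"grib dat\x00"
    known = {hashlib.sha256(original).hexdigest()}
    print(accept_download(corrupted, set(known), key_size=None))
    print(accept_download(corrupted, set(known), key_size=len(original)))
```

In this model the corrupted object is accepted for the --relaxed key but rejected for a key with a recorded size, which is the asymmetry between --relaxed and --fast that the conclusion above relies on.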