A colleague used a wrong config, which was pointing to the MinIO console rather than the S3 endpoint. When they ran initremote, the console wrongly replied 200 OK to the PUT of the annex-uuid file, and again when they then pushed the data. The MinIO console always redirects to a login page and does not fail on PUT (which is non-compliant). So the dataset recorded all the data as being present in that remote, while there was no trace of any buckets or objects on the S3 side.
steps to reproduce:
git init test_s3
cd test_s3/
git-annex init
export AWS_ACCESS_KEY_ID=john AWS_SECRET_ACCESS_KEY=doe
git annex initremote -d test_remote host="play.min.io" bucket="test_bucket" type=S3 encryption=none autoenable=true port=9443 protocol=https chunk=1GiB requeststyle=path
echo test > test_annexed_file
git-annex add test_annexed_file
git commit -m 'add annexed file'
git-annex copy --fast --to test_remote
I am showing it with the --fast flag here, as that is what datalad uses by default. Without --fast, it fails with `(HeaderException {headerErrorMessage = "ETag missing"}) failed`, which is better.
So to sum it up, the unfortunate circumstances are:
- the initremote PUT of annex-uuid does not check that the annex-uuid file was actually stored in a bucket
- the MinIO console replies 200 OK to all HTTP requests
- datalad uses `push --fast` by default, which records files as pushed without performing a HEAD after the push. I guess that is for performance reasons, but it is dangerous if a server or reverse proxy ends up responding 200 OK to all requests after init.
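A misbehaving endpoint like this can be caught early with git-annex's own verification subcommands; this is a suggested manual check (not something datalad runs by default):

```shell
# Exercise store/retrieve/remove against the remote right after initremote;
# a console endpoint that swallows PUTs should fail these tests.
git annex testremote test_remote

# After a copy, download and checksum each file from the remote,
# instead of trusting the location log.
git annex fsck --from test_remote
```

Running fsck without --fast actually downloads the content, so a remote that only pretends to store data would be flagged immediately.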
Thanks for your help!
It's arguably not git-annex's fault if it was pointed at an API endpoint that behaves enough like S3 to make it seem like it's storing data, but does not actually store it.
With that said, for there to be data loss here, the file would need to be dropped from the local repository, relying on the copy "stored" in S3. So the S3 special remote's checkPresent could be improved to prevent such a bad endpoint from being treated as containing the content of an object.
For S3, checkPresent does pass in the VersionID when git-annex knows one. (Though that doesn't help if the API endpoint ignores that header.) What it does not do is check the response from S3 for a VersionID or ETag. Improving that seems like a possible way to avoid this kind of problem.
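The check described above could look something like the following. This is an illustrative Python sketch under stated assumptions, not git-annex's actual (Haskell) implementation; the helper name is hypothetical. The idea is to treat a 200 response as confirming storage only when it carries the S3 headers a real PUT would return, since a login page served by the console replies 200 without them.

```python
# Hypothetical helper: decide whether a PUT response actually confirms
# that the object was stored, rather than trusting the status code alone.
def put_response_confirms_storage(headers, expected_etag=None):
    """Return True only if the response carries an ETag.

    The MinIO console's login-page redirect replies 200 without any
    S3 response headers, so requiring an ETag catches that case.
    If an expected ETag is known, it is compared as well.
    """
    etag = next((v for k, v in headers.items() if k.lower() == "etag"), None)
    if etag is None:
        return False
    if expected_etag is not None:
        # S3 quotes ETag values; compare without the surrounding quotes.
        return etag.strip('"') == expected_etag.strip('"')
    return True
```

The same pattern would apply to checking `x-amz-version-id` on versioned buckets: absence of the header a genuine S3 endpoint would send is grounds to treat the operation as failed.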
It does check the ETag when the S3 remote is configured with exporttree=true.
As for the idea of checking the annex-uuid write by reading the file back, the difficulty with that is that S3 DEEP_ARCHIVE and similar storage classes can have an hours-long delay to get a file back out of the bucket, even when it is already stored there. Also, the annex-uuid file is not used for exporttree=yes remotes.