Please describe the problem.
I wanted to use the S3 special remote to "crawl" an S3 bucket in importtree=yes mode. The bucket (dandiarchive) supports versioning, so it would be great to enable versioning here as well, so that URLs would use versionId. But unfortunately, adding versioning=yes makes git-annex try to establish versioning on the bucket (even if it is already enabled).
Command to try with (should work for anyone, since the bucket is public):
git annex --debug initremote s3-dandiarchive bucket=dandiarchive type=S3 encryption=none importtree=yes publicurl=https://dandiarchive.s3.amazonaws.com/ fileprefix=dandisets/000027/ signature=anonymous versioning=yes
The debug output shows that git-annex (I use 10.20240927) tries to enable versioning:
(enabling bucket versioning...) [2024-11-07 16:30:37.830416324] (Remote.S3) String to sign: "PUT\n\n\nThu, 07 Nov 2024 21:30:37 GMT\n/dandiarchive/?versioning"
[2024-11-07 16:30:37.830449238] (Remote.S3) Host: "dandiarchive.s3.amazonaws.com"
[2024-11-07 16:30:37.830459034] (Remote.S3) Path: "/"
[2024-11-07 16:30:37.830470676] (Remote.S3) Query string: "versioning"
[2024-11-07 16:30:37.830480666] (Remote.S3) Header: [("Date","Thu, 07 Nov 2024 21:30:37 GMT")]
[2024-11-07 16:30:37.830498329] (Remote.S3) Body: "<?xml version=\"1.0\" encoding=\"UTF-8\"?><VersioningConfiguration xmlns=\"http://s3.amazonaws.com/doc/2006-03-01/\"><Status>Enabled</Status></VersioningConfiguration>"
[2024-11-07 16:30:37.879924822] (Remote.S3) Response status: Status {statusCode = 403, statusMessage = "Forbidden"}
It seems to be easy to check if versioning is enabled:
❯ curl -s "https://dandiarchive.s3.amazonaws.com/?versioning"
<?xml version="1.0" encoding="UTF-8"?>
<VersioningConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Status>Enabled</Status></VersioningConfiguration>
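The same check can be sketched programmatically. This is only an illustration of reading the response shown above, not how git-annex or the aws library does it; the function names `parse_versioning_status` and `bucket_versioning_status` are made up for this example, and it assumes the bucket allows anonymous reads of its versioning configuration, as dandiarchive does.

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used in the VersioningConfiguration response shown above.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def parse_versioning_status(xml_text):
    """Return the text of <Status> (e.g. "Enabled"), or None if it is absent.

    A missing <Status> element is treated here as "versioning not enabled".
    """
    root = ET.fromstring(xml_text)
    status = root.find(S3_NS + "Status")
    return status.text if status is not None else None


def bucket_versioning_status(bucket_url):
    """Fetch <bucket_url>/?versioning anonymously and parse the reply.

    Hypothetical helper: raises urllib.error.HTTPError if the bucket
    does not permit anonymous access to this configuration.
    """
    url = bucket_url.rstrip("/") + "/?versioning"
    with urllib.request.urlopen(url) as resp:
        return parse_versioning_status(resp.read())


# Example (needs network access):
#   bucket_versioning_status("https://dandiarchive.s3.amazonaws.com")
```

Keeping the parsing separate from the HTTP request means the parsing can be exercised against a saved response without touching the network.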
Unfortunately https://hackage.haskell.org/package/aws does not implement the versioning check, so it will need to be added there. And it tends to take some time for new versions of the build dependency to reach everywhere.
https://github.com/aristidb/aws/issues/290
I do think that is the only safe way to go though. I considered making git-annex assume that a bucket where versioning cannot be set is read-only. If git-annex is really never going to write to a bucket, it's safe to assume versioning is enabled. But, unfortunately, ACLs can sometimes prevent changing configs like versioning while still allowing other write operations. Also, an S3 remote might be initialized without permission to write to an existing bucket, but later be used with S3 creds that do allow writing.
Made a pull request to aws https://github.com/aristidb/aws/pull/292
(As a sometime maintainer of the S3 parts of aws, I'll probably accept it if nobody objects to it.)
Wait though... We have signature=anonymous. So git-annex does in fact know that this special remote is read-only. git-annex will never try to write to it (even if the bucket somehow allowed anonymous writes) as long as it's configured with signature=anonymous.
So, it could just avoid trying to set versioning when signature=anonymous, and assume the bucket has versioning enabled.
Hmm, in lockContentS3, when versioning is enabled, it calls checkVersioning, which checks if an S3 version ID has been recorded for the file. What if the bucket did not actually have versioning enabled? Then an import from it would not record an S3 version ID. That would make this, and other places like checkKey that expect versioned buckets to have S3 version IDs, fail in unexpected ways.
So, I guess I'm inclined to not go down this read-only path, and instead wait for aws to get updated and use that.
The checkbucketversioning branch has this implemented, to be merged once aws is released supporting it.