Given that git-annex has interaction with AWS S3 built in, and similar to my whining about ssh:// urls, I wondered if maybe s3:// urls could be supported directly by git-annex. Unfortunately that is not the case, and the messages are a tiny bit misleading (see below): initially annex just says that the configuration disallows access to S3, but when I try to allow it, the download seems to be offloaded to libcurl, which doesn't support the scheme.
The reason I am asking is that lots of our data is on S3. Until now we have relied on our datalad special remote to provide access to s3:// urls so that we could authenticate, but for public buckets it would be overkill to demand datalad. Although we could replace these urls with http ones, I thought it might be better if annex could just download s3:// directly.
$> git annex addurl s3://images.cocodataset.org/annotations/image_info_test2014.zip
addurl s3://images.cocodataset.org/annotations/image_info_test2014.zip Configuration does not allow accessing s3://images.cocodataset.org/annotations/image_info_test2014.zip
Configuration does not allow accessing s3://images.cocodataset.org/annotations/image_info_test2014.zip
failed
git-annex: addurl: 1 failed
$> git -c annex.security.allowed-url-schemes="http https s3" -c annex.security.allowed-http-addresses=all annex addurl s3://images.cocodataset.org/annotations/image_info_test2014.zip
addurl s3://images.cocodataset.org/annotations/image_info_test2014.zip
curl: (1) Protocol "s3" not supported or disabled in libcurl
failed
git-annex: addurl: 1 failed
$> git annex version
git-annex version: 6.20180913+git33-g2cd5a723f-1~ndall+1
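For reference, the same overrides can be set persistently with `git config` instead of being passed via `-c` on every invocation; as the transcript shows, this only gets past git-annex's own check before failing in libcurl:

```
# Persist the scheme whitelist in the repository config (same settings as
# the -c flags above); addurl then still fails at the libcurl stage
git config annex.security.allowed-url-schemes "http https s3"
git config annex.security.allowed-http-addresses all
```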
Is an s3 uri scheme standardized somewhere?
AFAICS this would be identical to http.
Great question. I think the answer is "none", and maybe it is even incorrect to use "s3" in "s3://", because the underlying protocol is probably http(s). I was using `s3://` because that is what `s3cmd` consumes, and it was probably the first command-line helper I used for interaction with S3. The `aws` tool from awscli, from Amazon itself, describes `s3://mybucket/myprefix/myobject` as an "S3Uri", although I have found no mention of it yet in https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html.

Judging from the output of `aws s3 help`, it seems that in neither of those two cases is there any provision to specify a particular versionId (like a pure http url would have), or to identify the key/object by etag (somewhat volatile AFAIK; it could change for the same version if recomputed). So even if s3:// gets supported, there is no standard way to point to a particular version of the file.
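For comparison, a plain https url to S3 can pin an object version via the `versionId` query parameter, which the s3:// form has no standard syntax for (hedged example; the bucket, key, and version id below are all placeholders, and fetching a specific version may require credentials even on public buckets):

```
# Version-pinned http(s) url to an S3 object (placeholder bucket/key/versionId)
git annex addurl "https://mybucket.s3.amazonaws.com/myprefix/myobject?versionId=SOMEVERSIONID"
```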
I suspect that Amazon's tools also allow uris where the bucket is not a domain name and is assumed to be on AWS. The documentation kind of reads that way. That would not fly in git-annex.
Yes, for `s3://` urls there is only a bucket name, not a domain. Since a bucket name is allowed to contain ".", some folks started to use their project domain name as the bucket name (e.g. `openneuro.org`, `images.cocodataset.org`). Then, if you access them directly via url, the full domain name would be e.g. http://images.cocodataset.org.s3.amazonaws.com, which starts causing trouble if you try to access it via https (the wildcard certificate for *.s3.amazonaws.com does not cover bucket names containing dots), and for that we started to provide workarounds in datalad.
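Concretely, I assume the workaround amounts to something like path-style addressing, where the bucket moves from the hostname into the path so the stock certificate matches:

```
# Virtual-hosted style fails TLS validation for dotted bucket names:
#   https://images.cocodataset.org.s3.amazonaws.com/annotations/image_info_test2014.zip
# Path-style keeps the bucket out of the hostname:
curl -fLO https://s3.amazonaws.com/images.cocodataset.org/annotations/image_info_test2014.zip
```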
My current planned solution is to write an external special remote that claims s3:// URLs and downloads them. Then I can use `addurl --fast`. My use case is that, if I run a batch job that reads inputs from s3 and writes outputs to s3, what I get at the end are pointers to s3, and I want to check these results into git-annex. Ideally there'd be a way for me to tell the batch system to use git-annex to send things to s3, but currently that's not possible.
Question: if an external special remote claims a URL that a built-in special remote could handle, does the external special remote take priority?
@Ilya_Shlyakhter it is possible for git-annex to send things to S3 though; in particular, `git annex export` sounds like it may meet your use case (see the sketch below). (Or the planned import tree.)

Remote url claiming iterates through remotes ordered by cost, so the cheapest remote wins. I'm not decided whether that really makes sense; it's just what fell out of the code.
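The export route could look something like this (untested sketch; the remote name `mys3` and the bucket name are made up, and AWS credentials are assumed to be set in the environment):

```
# Create an S3 special remote that can receive an exported tree of files
git annex initremote mys3 type=S3 encryption=none bucket=my-output-bucket exporttree=yes
# Publish the tree of the master branch as plain objects in the bucket
git annex export master --to mys3
```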
As to s3:// stealthily hiding the fact that what look like real urls are forced to be hosted by Amazon -- ugh. Amazon may not want S3 to be a protocol, but it is one, and so that is inappropriate hard-coding.
(master_env_py27_v28) [11:52 AM /data/ilya-work]$ git annex version
git-annex version: 6.20180926-gc906aaf
build flags: Assistant Webapp Pairing S3(multipartupload)(storageclasses) WebDAV Inotify ConcurrentOutput TorrentParser MagicMime Feeds Testsuite
dependency versions: aws-0.17.1 bloomfilter-2.0.1.0 cryptonite-0.23 DAV-1.3.1 feed-0.3.12.0 ghc-8.0.2 http-client-0.5.7.0 persistent-sqlite-2.6.2 torrent-10000.1.1 uuid-1.3.13 yesod-1.4.5
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar hook external
operating system: linux x86_64
supported repository versions: 3 5 6
upgrade supported from repository versions: 0 1 2 3 4 5
local repository version: 5
An external special remote can hook into addurl by implementing CLAIMURL and CHECKURL. So yes, an external special remote could implement s3 urls.
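For anyone wanting to try this, here is a rough sketch of such a remote as a shell script (untested; the replies follow the external special remote protocol, but the `aws s3 cp` retrieval, the lazy CHECKURL reply, and the lack of space-safe parsing are all shortcuts):

```
#!/bin/sh
# git-annex-remote-s3urls -- sketch of an external special remote that
# claims s3:// urls and retrieves them with awscli.
# Caveat: keys/filenames containing spaces are not handled by this
# simple word-splitting parser.
set -f   # disable globbing so urls containing * are split safely

# Protocol handshake: announce the protocol version first.
echo VERSION 1

while read -r line; do
    set -- $line
    case "$1" in
        INITREMOTE)
            echo INITREMOTE-SUCCESS
            ;;
        PREPARE)
            echo PREPARE-SUCCESS
            ;;
        CLAIMURL)
            # Claim every s3:// url; leave everything else alone.
            case "$2" in
                s3://*) echo CLAIMURL-SUCCESS ;;
                *)      echo CLAIMURL-FAILURE ;;
            esac
            ;;
        CHECKURL)
            # Lazy: report the size as unknown, so addurl needs --relaxed.
            # A real implementation could query "aws s3api head-object".
            echo CHECKURL-CONTENTS UNKNOWN
            ;;
        TRANSFER)
            op="$2" key="$3" file="$4"
            if [ "$op" = RETRIEVE ]; then
                # Ask git-annex which s3:// url is recorded for this key;
                # it answers with VALUE lines, terminated by a bare VALUE.
                echo GETURLS "$key" "s3://"
                url=
                while read -r resp && [ "$resp" != VALUE ]; do
                    [ -z "$url" ] && url="${resp#VALUE }"
                done
                # Send awscli output to stderr to keep the protocol
                # channel (stdout) clean.
                if [ -n "$url" ] && aws s3 cp "$url" "$file" >&2; then
                    echo TRANSFER-SUCCESS RETRIEVE "$key"
                else
                    echo TRANSFER-FAILURE RETRIEVE "$key" download failed
                fi
            else
                echo TRANSFER-FAILURE "$op" "$key" retrieval-only remote
            fi
            ;;
        CHECKPRESENT)
            echo CHECKPRESENT-UNKNOWN "$2" cannot check this remote
            ;;
        REMOVE)
            echo REMOVE-FAILURE "$2" removal not supported
            ;;
        *)
            echo UNSUPPORTED-REQUEST
            ;;
    esac
done
```

Installed as `git-annex-remote-s3urls` somewhere on PATH, it would be enabled with `git annex initremote s3urls type=external externaltype=s3urls encryption=none`, after which `git annex addurl --relaxed s3://...` should route through it.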
Looking over this thread, I think it's clear that git-annex should not implement support for these s3 urls itself.