todo/Natively support s3:// urls (for addurl, get, etc) (yoh, git-annex.branchable.com)

comment 1 (joey, 2018-10-03)
<p>Is a s3 uri scheme standardized somewhere?</p>
<p>AFAICS this would be identical to http.</p>
comment 2 (yarikoptic, 2018-10-03)
<p>Great question. I think there is none, and it may even be incorrect to use <code>s3</code> in "s3://", because the underlying protocol is probably http(s).</p>
<p>I was using <code>s3://</code> because:</p>
<ul>
<li>that is what <a href="https://s3tools.org/s3cmd">s3cmd</a> consumes, and it was probably the first command-line helper I used for interacting with S3</li>
<li><code>aws</code> from <a href="http://aws.amazon.com/cli/">awscli</a>, from Amazon itself, describes <code>s3://mybucket/myprefix/myobject</code> as an S3Uri; although I have found no mention of it yet in https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html. Here is what <code>aws s3 help</code> says:</li>
</ul>
<pre><code> S3Uri: represents the location of a S3 object, prefix, or bucket. This
must be written in the form s3://mybucket/mykey where mybucket is the
specified S3 bucket, mykey is the specified S3 key. The path argument
must begin with s3:// in order to denote that the path argument refers
to a S3 object. Note that prefixes are separated by forward slashes.
For example, if the S3 object myobject had the prefix myprefix, the S3
key would be myprefix/myobject, and if the object was in the bucket
mybucket, the S3Uri would be s3://mybucket/myprefix/myobject.
</code></pre>
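<p>As a hedged illustration of that S3Uri form, splitting such a URI into bucket and key could look like this (a minimal sketch; <code>parse_s3_uri</code> is a hypothetical helper, not part of awscli or s3cmd):</p>

```python
from urllib.parse import urlparse

def parse_s3_uri(uri):
    """Split an S3Uri like s3://mybucket/myprefix/myobject into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError("not an s3:// URI: %s" % uri)
    bucket = parsed.netloc           # a bare bucket name, not a DNS hostname
    key = parsed.path.lstrip("/")    # prefixes are separated by forward slashes
    return bucket, key

print(parse_s3_uri("s3://mybucket/myprefix/myobject"))
# ('mybucket', 'myprefix/myobject')
```

Note that the key keeps its internal slashes: the "prefix" is just the leading part of the key, not a separate path component.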
<p>So it seems that neither of those two cases makes any provision for specifying a particular versionId (the way a plain http URL could), or for identifying the key/object by ETag (which is somewhat volatile AFAIK: it could change for the same version if recomputed). So even if s3:// gets supported, there is no standard way to point at a particular version of a file.</p>
comment 3 (joey, 2018-10-04)
<p>I suspect that Amazon's tools also allow uris where
the bucket is not a domain name and is assumed to be
on AWS. The documentation kind of reads that way.
That would not fly in git-annex.</p>
comment 4 (Ilya_Shlyakhter, 2018-10-05)
If the s3 remote claims s3:// URLs, does the bucket name have to be a DNS domain? I thought that when a special remote claims a URL, it can interpret it however it wants?
comment 5 (yarikoptic, 2018-10-05)
<p>Yes, for <code>s3://</code> urls there is only a bucket name, not a domain. Since a bucket name is allowed to contain <code>.</code>, some folks started to use their project's domain name as the bucket name (e.g. <code>openneuro.org</code>, <code>images.cocodataset.org</code>). If you then access such a bucket directly via URL, the full domain name would be e.g. http://images.cocodataset.org.s3.amazonaws.com, which starts causing trouble if you try to access it via https:</p>
<div class="highlight-sh"><pre class="hl">$<span class="hl opt">></span> wget <span class="hl kwb">-S</span> https<span class="hl opt">://</span>images.cocodataset.org.s3.amazonaws.com
<span class="hl kwb">--2018-10-05</span> <span class="hl num">16</span><span class="hl opt">:</span><span class="hl num">19</span><span class="hl opt">:</span><span class="hl num">48</span><span class="hl kwb">--</span> https<span class="hl opt">://</span>images.cocodataset.org.s3.amazonaws.com<span class="hl opt">/</span>
Resolving images.cocodataset.org.s3.amazonaws.com <span class="hl opt">(</span>images.cocodataset.org.s3.amazonaws.com<span class="hl opt">)</span>... <span class="hl num">52.216.18.32</span>
Connecting to images.cocodataset.org.s3.amazonaws.com <span class="hl opt">(</span>images.cocodataset.org.s3.amazonaws.com<span class="hl opt">)</span>|<span class="hl num">52.216.18.32</span>|<span class="hl opt">:</span><span class="hl num">443</span>... connected.
The certificate<span class="hl str">'s owner does not match hostname ‘images.cocodataset.org.s3.amazonaws.com’</span>
</pre></div>
<p>We have started to provide workarounds for this in datalad.</p>
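<p>One such workaround can be sketched as follows (an illustrative sketch only, assuming the classic <code>s3.amazonaws.com</code> endpoint; <code>s3_https_url</code> is a made-up name). The wildcard certificate <code>*.s3.amazonaws.com</code> covers only a single DNS label, so a dotted bucket name breaks HTTPS in virtual-hosted style, while path-style addressing keeps the hostname certificate-safe:</p>

```python
def s3_https_url(bucket, key):
    """Build an HTTPS URL for an S3 object, preferring path-style
    addressing when the bucket name contains a dot (dots would defeat
    the *.s3.amazonaws.com wildcard certificate)."""
    if "." in bucket:
        # path-style: hostname stays s3.amazonaws.com, so the cert matches
        return "https://s3.amazonaws.com/{}/{}".format(bucket, key)
    # virtual-hosted style is fine for dot-free bucket names
    return "https://{}.s3.amazonaws.com/{}".format(bucket, key)

print(s3_https_url("images.cocodataset.org", "mykey"))
# https://s3.amazonaws.com/images.cocodataset.org/mykey
```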
comment 6 (Ilya_Shlyakhter, 2018-10-05)
<p>My current planned solution is to write an external special remote that claims s3:// URLs and downloads them. Then I can use addurl --fast. My use case is that, if I run a batch job that reads inputs from s3 and writes outputs to s3, what I get at the end are pointers to s3, and I want to check those results into git-annex. Ideally there'd be a way for me to tell the batch system to use git-annex to send things to s3, but currently that's not possible.</p>
<p>Question: if an external special remote claims a URL that a built-in special remote could handle, does the external special remote take priority?</p>
comment 7 (yarikoptic, 2018-10-05)
FWIW, the datalad special remote already supports downloading from such s3:// URLs.
comment 8 (joey, 2018-10-08)
<p>@Ilya_Shlyakhter it is possible for git-annex to send things to S3 though,
particularly <code>git-annex export</code> sounds like it may meet your use case.
(Or the planned import tree.)</p>
<p>Remote url claiming iterates through remotes ordered by cost, so the
cheapest remote wins. I'm not decided if that really makes sense, it's
just what fell out of the code.</p>
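<p>The claiming order described above behaves roughly like this toy sketch (not git-annex's actual code; the remotes and costs here are illustrative):</p>

```python
def claiming_remote(remotes, url):
    """remotes: iterable of (name, cost, claims) tuples, where claims is
    a predicate on urls. Remotes are tried cheapest-first, and the first
    remote that claims the url wins."""
    for name, cost, claims in sorted(remotes, key=lambda r: r[1]):
        if claims(url):
            return name
    return None  # no remote claims this url

remotes = [
    ("web",  200.0, lambda u: u.startswith(("http://", "https://"))),
    ("mys3", 250.0, lambda u: u.startswith("s3://")),
]
print(claiming_remote(remotes, "s3://mybucket/mykey"))
# mys3
```

Here the web remote is cheaper, so it is asked first, but since it does not claim s3:// urls the more expensive remote still ends up handling them.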
<p>As to s3:// stealthily hiding the fact that what look like real urls are
forced to be hosted by Amazon -- ugh. Amazon may not want S3 to be a
protocol, but it is one, and so that is inappropriate hard-coding.</p>
comment 9 (Ilya_Shlyakhter, 2018-10-09)
@joey If I understand correctly, addurl requires a standard URL downloadable by curl? Would it be possible to add 'adduri' and 'registeruri' counterparts that would be exactly like addurl/registerurl, except that they would be for custom URIs not expected to be fetchable by curl? There seems to be an odd asymmetry, where the external special remote protocol has SETURIPRESENT/SETURLPRESENT, but the command line only has the URL versions.
comment 10 (Ilya_Shlyakhter, 2018-10-09)
"Remote url claiming iterates through remotes ordered by cost" -- my web remote has lower cost than my dnanexus external special remote; the latter claims dx:// URLs. But git-annex never seems to ask dnanexus to process dx:// URLs, even when I've manually set the URL to be present there. Does the web remote always win, and if it can't handle a URL does git-annex then not try external special remotes?
comment 11 (yarikoptic, 2018-10-10)
Ilya -- what version of git-annex are you using? Upgrading might help, since there were some fixes related to remote priorities in the past month or two (initially detected when using the -J option).
comment 12 (Ilya_Shlyakhter, 2018-10-10)
<pre><code>(master_env_py27_v28) [11:52 AM /data/ilya-work]$ git annex version
git-annex version: 6.20180926-gc906aaf
build flags: Assistant Webapp Pairing S3(multipartupload)(storageclasses) WebDAV Inotify ConcurrentOutput TorrentParser MagicMime Feeds Testsuite
dependency versions: aws-0.17.1 bloomfilter-2.0.1.0 cryptonite-0.23 DAV-1.3.1 feed-0.3.12.0 ghc-8.0.2 http-client-0.5.7.0 persistent-sqlite-2.6.2 torrent-10000.1.1 uuid-1.3.13 yesod-1.4.5
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar hook external
operating system: linux x86_64
supported repository versions: 3 5 6
upgrade supported from repository versions: 0 1 2 3 4 5
local repository version: 5</code></pre>
comment 13 (Ilya_Shlyakhter, 2018-10-11)
So, it was some odd misconfiguration — I initremote’d a new external special remote of the same externaltype, set the old one to annex-ignore, and things seem to work.
Re: comment 9 (joey, 2023-03-01)
<p>An external special remote can hook into addurl by implementing
CLAIMURL and CHECKURL. So yes, an external special remote could implement
s3 urls.</p>
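<p>The relevant exchange in the external special remote protocol is line-based over stdin/stdout. A minimal sketch of CLAIMURL/CHECKURL handling for s3:// urls might look like this (the size probe is stubbed out as UNKNOWN, which the protocol permits; a real remote would query S3 for the object's size):</p>

```python
def handle_request(line):
    """Answer one CLAIMURL/CHECKURL request line from git-annex."""
    if line.startswith("CLAIMURL "):
        url = line[len("CLAIMURL "):]
        # claim only s3:// urls; everything else is left to other remotes
        return "CLAIMURL-SUCCESS" if url.startswith("s3://") else "CLAIMURL-FAILURE"
    if line.startswith("CHECKURL "):
        # a real remote would HEAD the S3 object here to learn its size
        return "CHECKURL-CONTENTS UNKNOWN"
    return "UNSUPPORTED-REQUEST"

print(handle_request("CLAIMURL s3://mybucket/mykey"))
# CLAIMURL-SUCCESS
```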
<p>Looking over this thread, I think it's clear that git-annex should not
implement support for these s3 urls itself.</p>