design/exporting trees to special remotesgit-annexhttp://git-annex.branchable.com/design/exporting_trees_to_special_remotes/git-annexikiwiki2018-02-07T20:01:53Znote that some remotes could support files versioning "natively"http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_1_ea84ee9de604e05b8e02483ba8452186/yoh2017-07-11T21:59:49Z2017-07-11T21:59:49Z
<p>E.g. when exporting to an S3 bucket with versioning turned on, or to OSF (AFAIK). So upon successful upload the special remote could send SETURLPRESENT to signal availability of any particular key (associated with the file).</p>
<p>I have yet to grasp the cases you outlined well enough to see whether there are any other applicable use cases.</p>
<p>I hope that export would be implemented by extending the external special remote protocol? <img src="http://git-annex.branchable.com/smileys/smile4.png" alt=";)" /></p>
couldn't STATE be used for KEY -> FILENAME(s) mapping?http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_2_d414fb575845770e003a3c8ca4a986be/yarikoptic2017-07-11T22:05:49Z2017-07-11T22:05:49Z
<p>Just wondered...
At least in my attempt at a zenodo special remote I did store zenodo's file deposition ID within the state, to be able to request it back later on.
An alternative would be URL(s), I guess. It could be something like exported:UUID/filename.</p>
does it really need to be a new command ("export") or could be the same old "copy"?http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_3_cb063cdc66df79c40039bce247b7170c/yarikoptic2017-07-11T22:14:39Z2017-07-11T22:14:39Z
<p>Or it could just be a mode of operation for a special remote, depending on "exporttree=true" being set. In the one (old) case it would operate on keys associated with the files given on the command line (or just keys for --auto or selected by metadata), whereas with "exporttree=true" it would operate on filenames given on the command line (or files found to be associated with the keys selected by --auto or by metadata)?
Then the same 'copy --to' could be used in both cases, streamlining user experience <img src="http://git-annex.branchable.com/smileys/smile4.png" alt=";)" /></p>
comment 4http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_4_126ee5332ff88b3993d33d59328d4148/joey2017-07-12T16:56:27Z2017-07-12T16:45:51Z
<p>I've added a section with changes to the external special remote protocol.
I included the Key in each of the new protocol commands, although it's not
strictly needed, to allow the implementation to use SETURLPRESENT, SETSTATE,
etc.</p>
<p><code>git annex copy $file --to myexport</code> could perhaps work; the difficulty
though is, what if you've exported branch foo, and then checked out bar,
and so you told it to export one version of the file, and are running
git-annex copy on a different version? It seems that git-annex would have
to cross-check in this and similar commands, to detect such a situation.
Unsure how much more work that would be, both CPU time and implementation
time.</p>
<p>I do think that <code>git annex get</code> could download files from exports easily
enough, but see the "location tracking" section for trust caveats.</p>
<p>I'm not clear about what you're suggesting be done with versioning support
in external special remotes?</p>
special remotes with versioning supporthttp://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_5_fcd9890013371dae6ffcd00561b8c625/yarikoptic2017-07-12T17:30:33Z2017-07-12T17:30:33Z
<p>thanks -- I will check those all out!</p>
<p>Meanwhile a quick one regarding "I'm not clear about what you're suggesting be done with versioning support in external special remotes?".</p>
<p>I meant that in some cases there might be no need for any custom/special tracking per exported file. Upon export we could just register a unique URL for that particular version of the file for the corresponding KEY, so later on it could be 'annex get'ed even if a new version of the file gets uploaded or the file gets removed. So annex could just store the hexsha of the treeish(es) that were exported last, without any explicit additional tracking per file. The URL might be some custom one to be handled by the special remote backend.</p>
<p>E.g. here is a list of versions (and corresponding URLs) for a sample file in an S3 bucket:</p>
<div class="highlight-sh"><pre class="hl">$<span class="hl opt">></span> datalad <span class="hl kwc">ls</span> <span class="hl kwb">-aL</span> s3<span class="hl opt">://</span>datalad-test0-versioned<span class="hl opt">/</span><span class="hl num">3</span>versions-allversioned.txt
Connecting to bucket<span class="hl opt">:</span> datalad-test0-versioned
<span class="hl opt">[</span>INFO <span class="hl opt">]</span> S3 session<span class="hl opt">:</span> Connecting to the bucket datalad-test0-versioned
Bucket info<span class="hl opt">:</span>
Versioning<span class="hl opt">: {</span><span class="hl str">'MfaDelete'</span><span class="hl opt">:</span> <span class="hl str">'Disabled'</span><span class="hl opt">,</span> <span class="hl str">'Versioning'</span><span class="hl opt">:</span> <span class="hl str">'Enabled'</span><span class="hl opt">}</span>
Website<span class="hl opt">:</span> datalad-test0-versioned.s3-website-us-east-1.amazonaws.com
ACL<span class="hl opt">: <</span>Policy<span class="hl opt">:</span> yoh@cs.unm.edu <span class="hl opt">(</span>owner<span class="hl opt">) =</span> FULL_CONTROL<span class="hl opt">></span>
<span class="hl num">3</span>versions-allversioned.txt ... http<span class="hl opt">://</span>datalad-test0-versioned.s3.amazonaws.com<span class="hl opt">/</span><span class="hl num">3</span>versions-allversioned.txt?versionId<span class="hl opt">=</span>Kvuind11HZh._dCPaDAb0OY9dRrQoTMn <span class="hl opt">[</span>OK<span class="hl opt">]</span>
<span class="hl num">3</span>versions-allversioned.txt ... http<span class="hl opt">://</span>datalad-test0-versioned.s3.amazonaws.com<span class="hl opt">/</span><span class="hl num">3</span>versions-allversioned.txt?versionId<span class="hl opt">=</span>b.qCuh7Sg58VIYj8TVHzbRS97EvejzEl <span class="hl opt">[</span>OK<span class="hl opt">]</span>
<span class="hl num">3</span>versions-allversioned.txt ... http<span class="hl opt">://</span>datalad-test0-versioned.s3.amazonaws.com<span class="hl opt">/</span><span class="hl num">3</span>versions-allversioned.txt?versionId<span class="hl opt">=</span>pNsV5jJrnGATkmNrP8.i_xNH6CY4Mo5s <span class="hl opt">[</span>OK<span class="hl opt">]</span>
<span class="hl num">3</span>versions-allversioned.txt_sameprefix ... http<span class="hl opt">://</span>datalad-test0-versioned.s3.amazonaws.com<span class="hl opt">/</span><span class="hl num">3</span>versions-allversioned.txt_sameprefix?versionId<span class="hl opt">=</span>Mvsc4FgJWc6gExwSw1d6wsLrnk6wdDVa <span class="hl opt">[</span>OK<span class="hl opt">]</span>
</pre></div>
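<p>A minimal sketch of the idea above, assuming the URL layout seen in that listing (<code>http://BUCKET.s3.amazonaws.com/NAME?versionId=ID</code>). The helper names <code>versioned_s3_url</code> and <code>register_version</code> are hypothetical, and the key passed in is whatever git-annex supplied; only the <code>SETURLPRESENT Key Url</code> message itself comes from the external special remote protocol.</p>

```python
from urllib.parse import quote, urlencode


def versioned_s3_url(bucket, name, version_id):
    """Build a URL pointing at one specific version of an object.

    Assumes the public URL layout shown in the listing above.
    """
    query = urlencode({"versionId": version_id})
    return f"http://{bucket}.s3.amazonaws.com/{quote(name)}?{query}"


def register_version(key, bucket, name, version_id):
    """Return the protocol line a remote could send after a successful upload."""
    return f"SETURLPRESENT {key} {versioned_s3_url(bucket, name, version_id)}"
```

<p>With such a URL registered per key, a later upload of a newer version under the same name would not invalidate the older one.</p>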
comment 6http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_6_3217c2f852e5d9b1e4be2adff995dd24/joey2017-07-12T18:19:39Z2017-07-12T18:09:00Z
<p>That would almost work without any smarts on the git-annex side.
When it tells the special remote to <code>REMOVEEXPORT</code>, the special remote
could remove the file from the HEAD equivalent but retain the content in its
versioned snapshots, and keep the url to that registered. But, that
doesn't actually work, because the url is registered for that special
remote, not the web special remote. Once git-annex thinks the file has been
removed from the special remote, it will never try to use the url
registered for that special remote.</p>
<p>So, to support versioning-capable special remotes, there would need to be
an additional response to <code>REMOVEEXPORT</code> that says "I removed it from HEAD,
but I still have a copy in this url, which can be accessed using
the web special remote".</p>
side-note about WebDAV&DeltaVhttp://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_7_43a98b4b9d9eb54720a9c92cd8bb3a30/yarikoptic2017-07-12T21:54:49Z2017-07-12T21:54:49Z
<p>DAV = “Distributed Authoring and Versioning”, but versioning was left out of the original RFC. Only some servers/clients implement the DeltaV spec (RFC 3253), which came later to fill that gap.
But in principle, any DeltaV-compliant WebDAV special remote could then be used for "export" while retaining access to all the versions.
References:
- <a href="http://archive.oreilly.com/pub/a/opensource/excerpts/9780596510336/webdav-and-autoversioning.html">WebDAV and Autoversioning - Version Control with Subversion</a>
- <a href="http://www.webdav.org/specs/rfc3253.html">RFC 3253</a></p>
<p>I got interested when I saw that box.com is supported through WebDAV, but I am not sure whether DeltaV is supported at all, and apparently the number of versions stored per file depends on the type of account anyway (with no versions for a free personal one): https://community.box.com/t5/How-to-Guides-for-Managing/How-To-Track-Your-Files-and-File-Versions-Version-History/ta-p/329</p>
regarding setting a URL by custom special remotehttp://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_8_7e512ef81c529b0392071b8a6dfe853b/yarikoptic2017-07-12T22:04:38Z2017-07-12T22:04:38Z
<p>I also wonder if <code>SETURLPRESENT Key Url</code> could also be extended to be <code>SETURLPRESENT Key Url Remote</code>, i.e. that a custom remote could register a URL to Web remote?
In many cases I expect a "custom uploader/exporter" with a public URL then being available, so requiring a custom external remote to fetch it would be a bit of an overkill.</p>
<p>N.B. I already was burnt once on a large scale with our custom remote truthfully replying to CLAIMURL (since it can handle them if needed) to public URLs, thus absorbing them into it instead of relaying responsibility to 'Web' remote. Had to traverse dozens of datasets and duplicate urls from 'datalad' to 'Web' remote.</p>
comments on protocolhttp://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_9_6c588170f0b53c74c3c28ff08ed3509d/yarikoptic2017-07-12T22:09:54Z2017-07-12T22:09:54Z
<ul>
<li><code>TRANSFEREXPORT STORE|RETRIEVE Key File Name</code> -- note that File could also contain spaces etc (not only the Name), so should be encoded somehow?</li>
<li><code>old external special remote programs ... need to handle an ERROR response</code> -- why not just bump the protocol <code>VERSION</code> to e.g. <code>2</code>, so those which implement this would reply with the new version number?</li>
</ul>
export "each revision" -- thinking about quiltdatahttp://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_10_75ba45174d3c4b927113a6908061b742/yarikoptic2017-07-14T20:10:42Z2017-07-14T20:10:42Z
<p>In some cases, if the remote supports versioning, it might be cool to be able to export all versions (from the previously exported point, assuming linear progression).
I have been chatting with the <a href="https://quiltdata.com/">https://quiltdata.com/</a> folks, a project which I just got to know about:</p>
<ol>
<li>They claim/hope to provide infinite storage for public datasets.</li>
<li>They do support a "File" model, so a dataset could simply contain files. If we could (ab)use that -- sounds like a lovely free ride.</li>
<li>They do support versioning. If we could export all the versions -- super lovely.</li>
</ol>
<p>Might also help to establish interoperability between the tools</p>
Git Historyhttp://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_11_0827f5611c8e7e7ffa2633f8c06ae055/xloem2017-08-10T00:25:27Z2017-08-10T00:25:27Z
It would be great to have an option to include git history in the export, such that a special remote could be used both to rebuild a repository and to view the contents.
re: export "each revision"http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_12_1bb8d57383ca733f3a0069ff30181366/joey2017-08-15T18:42:39Z2017-08-15T18:13:38Z
<p>That sounds much more like a regular remote with <code>git annex copy --all</code>.</p>
<p>This entire design is predicated on exporting a single treeish. If you want
to make a single treeish containing all versions of every file ...</p>
re: Git Historyhttp://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_13_f857e15124b70ae1abc74669f63a2d68/joey2017-08-15T18:42:39Z2017-08-15T18:16:31Z
<p>That is entirely out of scope. You're looking for a way to store a git
<em>repository</em> someplace like S3. Such things exist already, and are not
git-annex and I'm not going to replicate them as part of this feature.</p>
re: comments on protocolhttp://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_14_41e8ef74c6e62a4321d2046b2571a246/joey2017-08-28T19:34:44Z2017-08-15T18:18:14Z
<p>In <code>TRANSFEREXPORT STORE|RETRIEVE Key File Name</code>, it should always
be possible for the File to not contain spaces in its name. But it could be
rather painful for git-annex to avoid spaces in some cases (would need to
link or copy the annexed file content). So well spotted.</p>
<p>Hmm, it's actually possible for a Key to contain spaces as well,
at least with the WORM backend.
<span class="createlink">external special remote protocol broken by key with spaces</span></p>
<p>The protocol <code>VERSION</code> is picked by the special remote, it's not
negotiated.</p>
comment 15http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_15_3fc518cee1a11b28da769c0915d33e3b/joey2017-08-28T19:34:44Z2017-08-28T19:00:10Z
<p>Since <span class="createlink">external special remote protocol broken by key with spaces</span>
was fixed, the Key can't contain spaces any longer.</p>
<p>The File could still contain spaces, eg when exporting from a direct mode
repository where the worktree filename contains spaces.</p>
<p>In <code>RENAMEEXPORT</code>, both OldName and NewName could contain spaces.</p>
comment 16http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_16_29f598eda413c0d5e17536d8f9438d31/joey2017-08-28T19:34:44Z2017-08-28T19:32:06Z
<p>I've updated the proposed external special remote protocol to avoid the
whitespace concerns. Not wild about needing a separate EXPORT request,
which will probably get shoved into a global variable in most
implementations. But it does avoid needing to use some kind of encoding,
which would complicate implementations more, I feel.</p>
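<p>A rough sketch of how a remote might handle that two-step exchange, assuming the messages take the form <code>EXPORT Name</code> followed by <code>TRANSFEREXPORT STORE|RETRIEVE Key File</code>, with the Key free of spaces and the free-form value always last on its line. The <code>ExportState</code> class is hypothetical; it plays the role of the "global variable" mentioned above.</p>

```python
class ExportState:
    """Remembers the Name from the latest EXPORT request (assumed layout)."""

    def __init__(self):
        self.export_name = None

    def handle(self, line):
        command, _, rest = line.partition(" ")
        if command == "EXPORT":
            # Everything after the command is the name, spaces included.
            self.export_name = rest
            return None
        if command == "TRANSFEREXPORT":
            if self.export_name is None:
                raise ValueError("TRANSFEREXPORT received before EXPORT")
            # Split off only the first two fields; the local file path is
            # the remainder, so it may contain spaces without any encoding.
            direction, key, local_file = rest.split(" ", 2)
            return (direction, key, self.export_name, local_file)
        raise ValueError(f"unhandled request: {command}")
```

<p>Because each line carries at most one value that may contain whitespace, and that value comes last, no escaping scheme is needed.</p>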
protocol message http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_17_32a3240206c5ffff71a47dffa6950c48/yarikoptic2018-02-06T20:03:27Z2018-02-06T20:03:27Z
<p>joey wrote:</p>
<pre><code>The protocol VERSION is picked by the special remote, it's not negotiated.
</code></pre>
<p><code>VERSION</code> is provided by the special remote to the git-annex process. There is no need to 'negotiate' anything - you could make git-annex understand either of:</p>
<ul>
<li><p>higher <code>VERSION</code>, e.g.</p>
<ul>
<li><code>VERSION 2</code>, which would support some new features that the special remote needs. If the parent git-annex is old or doesn't support that version, it would fail and demand a git-annex upgrade</li>
<li><code>VERSION 6.20171124</code> (where <code>6.20171124</code> is an example of git-annex version) so if git-annex parent process is older than that it could provide a meaningful message that <code>git annex >= 6.20171124</code> is needed</li>
</ul>
</li>
<li><p><code>VERSION 1 feature1 feature2 ...</code> where those features could be the ones needed (e.g. <code>INFO_MSG</code> for <a href="http://git-annex.branchable.com/todo/INFO_message_for_custom_special_remotes/#comment-4dcfb7d4e6db9d5ba8a1bfeb782346b1">recent addition</a>). And if parent git-annex doesn't know/support any particular feature, it could fail and inform user that a new annex with that feature support is needed.</p></li>
</ul>
<p>In either of those cases the custom special remotes page could outline the added features and the versions of git-annex supporting them, so maybe the error messages above could even point to it.</p>
<p>Overall, it is just a minor change to be done on git-annex side while allowing for clear(er) specification, and I do not see any need for actual "negotiation" -- features are either supported or not by the parent process.</p>
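<p>To make the <code>VERSION 1 feature1 feature2 ...</code> variant concrete, here is a sketch of the check the parent process would need. This illustrates the proposal only and is not how git-annex parses <code>VERSION</code>; the function name, the supported sets and the error wording are all hypothetical.</p>

```python
SUPPORTED_VERSIONS = {1}            # assumption: only version 1 is known
SUPPORTED_FEATURES = {"INFO_MSG"}   # hypothetical feature list of the parent


def check_version_line(line):
    """Return None if the remote can be used, else an error message."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "VERSION":
        raise ValueError(f"not a VERSION line: {line!r}")
    version = fields[1]
    if not version.isdigit() or int(version) not in SUPPORTED_VERSIONS:
        return f"unsupported protocol version {version}; upgrade git-annex"
    # Any trailing fields are feature names the remote says it needs.
    missing = [name for name in fields[2:] if name not in SUPPORTED_FEATURES]
    if missing:
        return "missing features: " + ", ".join(missing) + "; upgrade git-annex"
    return None
```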
comment 18http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_18_fe77370699b7ce0acd547fd1e045e254/joey2018-02-07T16:47:19Z2018-02-07T16:43:02Z
<p>Changing VERSION would prevent any older versions of git-annex from working
with that external special remote, since they would reject the unknown
version. (The current parsing of VERSION also happens to preclude adding
some fields after the number.)</p>
<p>Since it seems completely possible to change the protocol in a way
that is backwards compatible both ways, while still letting new features
be used, I'd rather reserve changing VERSION for whatever future thing
needs a full breaking bump.</p>
comment 19http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_19_00d28c758509939974e4583e9b1b9e12/yarikoptic2018-02-07T18:31:45Z2018-02-07T18:31:45Z
<blockquote><p>Changing VERSION would prevent any older versions of git-annex from working with that external special remote, since they would reject the unknown version. (The current parsing of VERSION also happens to preclude adding some fields after the number.)</p></blockquote>
<p>I still do not get it, sorry -- If there is an older git-annex, and a special remote requests some higher VERSION (thus stating that it needs some features older git-annex does not support), IMHO it would be perfectly fine to fail to use that remote since it wouldn't be usable anyways with that older git-annex (i.e. require some special features it does not provide). If special remote does not need any feature not present in version <code>1</code>, it (like all of them ATM) could still keep requesting <code>VERSION 1</code> thus staying compatible with whatever old git-annex is out there.</p>
comment 19http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_19_1eda1af40f6ca82a8bacd19afaa749bc/joey2018-02-07T19:13:22Z2018-02-07T19:04:03Z
<p>What if the remote wants to use some feature like NOTE, but can still
manage to work when an old git-annex does not support it? Hard bumping the
VERSION cannot support that. If the remote requires to be able to use NOTE
and sees it cannot, it can still throw an error.</p>
<p>There are a bunch of requests in the protocol that are optional for the
remote to support; git-annex deals with remotes that don't support them in
better ways than throwing up its hands because the special remote is too
old. It's very good that the protocol allowed adding those extensions
without bumping a version. The protocol is less extensible when it comes
to replies and other messages sent by the special remote, and I want to get
the same extensibility for those.</p>
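<p>The graceful-fallback behaviour described here could look roughly like the sketch below. It assumes git-annex advertises what it supports in a line of the form <code>EXTENSIONS name1 name2 ...</code>; that message format, the <code>Remote</code> class and its methods are hypothetical, used for illustration only.</p>

```python
class Remote:
    """Tracks which optional messages the parent git-annex understands."""

    def __init__(self):
        self.extensions = set()
        self.sent = []  # stands in for lines written to stdout

    def on_extensions(self, line):
        # Assumed format: "EXTENSIONS NAME1 NAME2 ..."
        command, *names = line.split()
        if command == "EXTENSIONS":
            self.extensions = set(names)

    def note(self, text, required=False):
        if "NOTE" in self.extensions:
            self.sent.append(f"NOTE {text}")
        elif required:
            # The remote itself decides that it cannot work without NOTE.
            raise RuntimeError("this remote needs a git-annex that supports NOTE")
        # Otherwise skip the message and keep working with an old git-annex.
```

<p>An old git-annex that never advertises anything simply leaves the set empty, so the remote keeps working without the optional message.</p>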
comment 21http://git-annex.branchable.com/design/exporting_trees_to_special_remotes/comment_21_062098e1f54b874467793e4487a45a9b/yarikoptic2018-02-07T20:01:53Z2018-02-07T20:01:53Z
Ok, gotcha. Shouldn't EXTENSION entries then also be versioned individually somehow? Or, if needed, a new extension could be born by appending a version to its name (e.g. as with all those imap, imap2, imap3, ... ;-))