exporting trees to special remotes

For publishing content from a git-annex repository, it would be useful to be able to export a tree of files to a special remote, using the filenames and content from the tree.

Note that this document was written with the assumption that only git-annex is writing to the special remote. But importing trees from special remotes invalidates that assumption, and needed to add some additional things to deal with it. See that link for details.

configuring a special remote for tree export
exporting a treeish
updating an export
tracking exports
changes to special remote interface
location tracking
recording exported filenames in git-annex branch
export conflicts
when to update export.log for efficient resuming of exports
handling renames efficiently
renames and export conflicts
dropping from exports and copying to exports

configuring a special remote for tree export

If a special remote already has files stored in it, switching it to be a tree export would result in a mix of files named by key and by filename. That's not desirable. So, the user should set up a new special remote when they want to export a tree. (It would also be possible to drop all content from an existing special remote and reuse it, but there does not seem much benefit in doing so.)

Add a new initremote configuration exporttree=yes, that cannot be changed by enableremote:

git annex initremote myexport type=... exporttree=yes

It does not make sense to encrypt an export, so exporttree=yes requires encryption=none.

Note that the particular tree to export is not specified yet. This is because the tree that is exported to a special remote may change.

exporting a treeish

To export a treeish, the user can run:

git annex export $treeish --to myexport

That does all necessary uploads etc to make the special remote contain the tree of files. The treeish can be a tag, a branch, or a tree.

If a file's content is not present, it won't be exported. Re-running the same export later should export files whose content has become present. (This likely means a second pass, and needs location tracking to track which files are in the export.)

Users may sometimes want to export multiple treeishes to a single special remote. For example, exporting several tags. This interface could be complicated to support that, putting the treeishes in subdirectories on the special remote etc. But that's not necessary, because the user can use git commands to graft trees together into a larger tree, and export that larger tree.

If an export is interrupted, running it again should resume where it left off.

updating an export

The user can at any time re-run git-annex export with a new treeish to change what's exported. While some use cases for git annex export involve publishing datasets that are intended to remain immutable, other use cases include eg, making a tree of files available to a computer that can't run git-annex, and in such use cases, the tree needs to be able to be updated.

To efficiently update an export, git-annex can diff the tree that was exported with the new tree. The naive approach is to upload new and modified files and remove deleted files.

With rename detection, if the special remote supports moving files, more efficient updates can be done. It gets complicated; consider two files that swap names.

If the special remote supports copying files, that would also make some updates more efficient.

tracking exports

This lets the user say, "I want to export the master branch", and have git-annex sync and the assistant automatically update the export when master changes.

git-annex export could do this by default (if the user doesn't want the export to track the branch, they could instead export a tree or a tag). Or it could be a --tracking parameter.

How to record the export tracking branch? It could be stored as refs/remotes/myexport/master. This says that the master branch is being exported to myexport, and the ref points to the last treeish that was exported.

But.. master:subdir is a valid treeish, referring to the subdir of the current master tree. This is a useful thing to want to export. But, that's not a legal ref name. So, perhaps better to record the export tracking branch some other way. Perhaps in git config?

changes to special remote interface

This needs some additional methods added to special remotes, and to the external special remote protocol.

Here's the changes to the latter:

EXPORTSUPPORTED
Used to check if a special remote supports exports. The remote responds with either EXPORTSUPPORTED-SUCCESS or EXPORTSUPPORTED-FAILURE
EXPORT Name
Comes immediately before each of the following requests, specifying the name of the exported file. It will be in the form of a relative path, and may contain path separators, whitespace, and other special characters.
TRANSFEREXPORT STORE|RETRIEVE Key File
Requests the transfer of a File on local disk to or from the previously provided Name on the special remote.
Note that it's important that, while a file is being stored, CHECKPRESENTEXPORT not indicate it's present until all the data has been transferred.
The remote responds with either TRANSFER-SUCCESS or TRANSFER-FAILURE, and a remote where exports do not make sense may always fail.
CHECKPRESENTEXPORT Key
Requests the remote to check if the previously provided Name is present in it.
The remote responds with CHECKPRESENT-SUCCESS, CHECKPRESENT-FAILURE, or CHECKPRESENT-UNKNOWN.
REMOVEEXPORT Key
Requests the remote to remove content stored by TRANSFEREXPORT with the previously provided Name.
The remote responds with either REMOVE-SUCCESS or REMOVE-FAILURE.
RENAMEEXPORT Key NewName
Requests the remote rename a file stored on it from the previously provided Name to the NewName.
The remote responds with RENAMEEXPORT-SUCCESS or with RENAMEEXPORT-FAILURE if an efficient rename cannot be done.

To support old external special remote programs that have not been updated to support exports, git-annex will need to handle an ERROR response when using any of the above.

location tracking

Since not all the files in an exported treeish may have content present when the export is done, location tracking will be needed so that getting the files and exporting again transfers their content.

Does a copy of a file exported to a special remote count as a copy of a file as far as ?numcopies goes? Should git-annex get download a file from an export?

The problem is that special remotes with exports are not key/value stores. The content of a file can change, and if multiple repositories can export a special remote, they can be out of sync about what files are exported to it.

Possible solution: Make exporttree=yes cause the special remote to be untrusted, and rely on annex.verify to catch cases where the content of a file on a special remote has changed. This would work well enough except for when the WORM or URL backend is used. So, prevent the user from exporting such keys. Also, force verification on for such special remotes, don't let it be turned off.

The same file contents may be in a treeish multiple times under different filenames. That complicates using location tracking. One file may have been exported and the other not, and location tracking says that the content is present in the export. A sqlite database is needed to keep track of this.

recording exported filenames in git-annex branch

In order to download the content of a key from a file exported to a special remote, the filename that was exported needs to somehow be recorded in the git-annex branch. How to do this? The filename could be included in the location tracking log or a related log file, or the exported tree could be grafted into the git-annex branch (under eg, exported/uuid/). Which way uses less space in the git repository?

Grafting in the exported tree records the necessary data, but the file-to-key map needs to be reversed to support downloading from an export. It would be too expensive to traverse the tree each time to hunt for a key; instead would need a database that gets populated once by traversing the tree.

On the other hand, for updating what's exported, having access to the old exported tree seems perfect, because it and the new tree can be diffed to find what changes need to be made to the special remote.

If the filenames are stored in the location tracking log, the exported tree could be reconstructed, but it would take O(N) queries to git, where N is the total number of keys git-annex knows about; updating exports of small subsets of large repositories would be expensive. So grafting in the exported tree seems the better approach.

export conflicts

What if different repositories can access the same special remote, and different trees get exported to it concurrently?

This would be very hard to untangle, because it's hard to know what content was exported to a file last, and thus what content the file actually has. The location log's timestamps might give a hint, but clocks vary too much to trust it.

Also, if the exported tree is grafted in to the git-annex branch, there would be a merge conflict. Union merging would scramble the exported tree, so even if a smart merge is added, old versions of git-annex would corrupt the exported tree.

To avoid that problem, add a log file export.log that contains the uuid of the remote that was exported to, and the sha1 of the exported tree. To avoid the exported tree being GCed, do graft it in to the git-annex branch, but follow that with a commit that removes the tree again, and only update refs/heads/git-annex after making both commits.

If export.log contains multiple active exports of different trees, there was an export conflict. Short of downloading the whole export to checksum it, or deleting the whole export, what can be done to resolve it?

In this case, git-annex knows both exported trees. Have the user provide a tree that resolves the conflict as they desire (it could be the same as one of the exported trees, or some merge of them or an entirely new tree). The UI to do this can just be another git annex export $tree --to remote. To resolve, diff each exported tree in turn against the resolving tree and delete all files that differ. Then, upload all missing files.

when to update export.log for efficient resuming of exports

When should export.log be updated? Possibilities:

Before performing any work, to set the goal.
After the export is fully successful, to record the current state.
After some mid-point.

Lots of things could go wrong during an export. A file might fail to be transferred or only part of it be transferred; a file's content might not be present to transfer at all. The export could be interrupted part way. Updating the export.log at the right point in time is important to handle these cases efficiently.

If the export.log is updated first, then it's only a goal and does not tell us what's been done already.

If the export.log is updated only after complete success, then the common case of some files not having content locally present will prevent it from being updated. When we resume, we again don't know what's been done already.

If the export.log is updated after deleting any files from the remote that are not the same in the new treeish as in the old treeish, and as long as TRANSFEREXPORT STORE is atomic, then when resuming we can trust CHECKPRESENTEXPORT to only find files that have the correct content for the current treeish. (Unless a conflicting export was made from elsewhere, but in that case, the conflict resolution will have to fix up later.)

handling renames efficiently

To handle two files that swap names, a temp name is required.

Difficulty with a temp name is picking a name that won't ever be used by any exported file.

Interrupted exports also complicate this. While a name could be picked that is in neither the old nor the new tree, an export could be interrupted, leaving the file at the temp name. There needs to be something to clean that up when the export is resumed, even if it's resumed with a different tree.

Could use something like ".git-annex-tmp-content-$key" as the temp name. This hides it from casual view, which is good, and it's not depedent on the tree, so no state needs to be maintained to clean it up. Also, using the key in the name simplifies calculation of complicated renames (eg, renaming A to B, B to C, C to A)

Export can first try to rename all files that are deleted/modified to their key's temp name (falling back to deleting since not all special remotes support rename), and then, in a second pass, rename from the temp name to the new name. Followed by deleting the temp name of all keys whose files are deleted in the diff. That is more renames and deletes than strictly necessary, but it will statelessly clean up an interruped export as long as it's run again with the same new tree.

But, an export of tree B should clean up after an interrupted export of tree A. Some state is needed to handle this. Before starting the export of tree A, record it somewhere. Then when resuming, diff A..B, and delete the temp names of the keys in the diff. (Can't rename here, because we don't know what was the content of a file when an export was interrupted.)

So, before an export does anything, need to record the tree that's about to be exported to export.log, not as an exported tree, but as a goal. Then on resume, the temp files for that can be cleaned up.

renames and export conflicts

What is there's an export conflict going on at the same time that a file in the export gets renamed?

Suppose that there are two git repos A and B, each exporting to the same remote. A and B are not currently communicating. A exports T1 which contains F. B exports T2, which has a different content for F.

Then A exports T3, which renames F to G. If that rename is done on the remote, then A will think it's successfully exported T3, but G will have F's content from T2, not from T1.

When A and B reconnect, the export conflict will be detected. To resolve the export conflict, it says above to:

To resolve, diff each exported tree in turn against the resolving tree and delete all files that differ. Then, upload all missing files.

Assume that the resolving tree is T3. So B's export of T2 is diffed against T3. F differs and is deleted (no change). G differs and is deleted, which fixes up the problem that the wrong content was renamed to G. G is missing so gets uploaded.

So, this works, as long as "delete all files that differ" means it deletes both old and new files. And as long as conflict resolution does not itself stash away files in the temp name for later renaming.

dropping from exports and copying to exports

It might be nice for git annex drop $file --from myexport and git annex copy $myfile --to export to work. However, there are some very difficult issues in supporting those, and they don't really seem necessary to use exports. Re-running git annex export to resume an export handles all the cases that copying to an export would need to. And, deleting a file from a tree and exporting the new tree is the thing to do if a file no longer should be exported.

Here's an example of the kind of problem supporting these needs to deal with:

In repo A, file F with content K is exported
In repo B, file F with content K' is exported, since F changed in the exported treeish.
In repo A, file F is removed from the export, which results in K being removed from the location log for the export.

But... did #3 happen before or after #2? If #3 occurred before #2, then K' is present in the export and the location log is correct. If #3 occurred after #2, and A and B's git-annex branches were not synced, then K' was accidentially removed from the export, and the location log is now wrong.

Is there any reason to allow removeKey from an export? Why would someone want to drop a single file from an export? Why not remove the file from a tree, and export the new tree?

(Alternatively, removeKey could itself update the exported tree, removing the file from it, and update the export log accordingly. This would avoid the problem. But that's complication and it would be rather slow and bloat the git repo with a lot of intermediate trees when dropping multiple keys.)

RSS Atom

note that some remotes could support files versioning "natively"

E.g. when exporting to the S3 bucket with versioning turned on, or OSF (AFAIK). So upon successful upload special remote could SETURLPRESENT to signal availability of any particular key (associated with the file).

Yet to grasp the cases you outlined better to see if I see any other applicable use-ase

I hope that export would be implemented through extending externals special remote protocol?

Comment by yarikoptic — Tue Jul 11 21:59:49 2017

Remove comment

couldn't STATE be used for KEY -> FILENAME(s) mapping?

just wondered... at least in my attempt for zenodo special remote I did store zenodo's file deposition ID within the state to be able to request it back later on alternative -- URL(s) I guess. Could be smth like exported:UUID/filename.

Comment by yarikoptic — Tue Jul 11 22:05:49 2017

Remove comment

does it really need to be a new command ("export") or could be the same old "copy"?

or it could be just a mode of operation for a special remote depending on "exporttree=true" being set, where in one (old) case it would operate based on keys associated with the files pointed on the cmdline (or just keys for --auto or pointed by metadata), whenever when "exporttree=true" -- it would operate on filenames pointed on command line (or files found to be associated with the keys as pointed by --auto or by metadata)? Then the same 'copy --to' could be used in both cases, streamlining user experience

Comment by yarikoptic — Tue Jul 11 22:14:39 2017

Remove comment

comment 4

I've added a section with changes to the external special remote protocol. I included the Key in each of the new protocol commands, although it's not strictly neeed, to allow the implementation to use SETURLPRESENT, SETSTATE, etc.

git annex copy $file --to myexport could perhaps work; the difficulty though is, what if you've exported branch foo, and then checked out bar, and so you told it to export one version of the file, and are running git-annex copy on a different version? It seems that git-annex would have to cross-check in this and similar commands, to detect such a situation. Unsure how much more work that would be, both CPU time and implementation time.

I do think that git annex get could download files from exports easily enough, but see the "location tracking" section for trust caveats.

I'm not clear about what you're suggesting be done with versioning support in external special remotest?

Comment by joey — Wed Jul 12 16:45:51 2017

Remove comment

special remotes with versioning support

thanks -- I will check those all out!

Meanwhile a quick one regarding "I'm not clear about what you're suggesting be done with versioning support in external special remotes?".

I meant that in some cases there might be no need for any custom/special tracking per exported file would be needed -- upon export we could just register a unique URL for that particular version of the file for the corresponding KEY so later on it could be 'annex get'ed even if a new version of the file gets uploaded or removed. So annex could just store those treeish(es) hexsha on what was exported last without any explicit additional tracking per file. URL might be some custom one to be handled by the special remote backend.

E.g. here is a list of versions (and corresponding urls) for a sample file on the s3 bucket

$> datalad ls -aL s3://datalad-test0-versioned/3versions-allversioned.txt 
Connecting to bucket: datalad-test0-versioned
[INFO   ] S3 session: Connecting to the bucket datalad-test0-versioned 
Bucket info:
  Versioning: {'MfaDelete': 'Disabled', 'Versioning': 'Enabled'}
     Website: datalad-test0-versioned.s3-website-us-east-1.amazonaws.com
         ACL: <Policy: yoh@cs.unm.edu (owner) = FULL_CONTROL>
3versions-allversioned.txt            ...  http://datalad-test0-versioned.s3.amazonaws.com/3versions-allversioned.txt?versionId=Kvuind11HZh._dCPaDAb0OY9dRrQoTMn [OK]
3versions-allversioned.txt            ...  http://datalad-test0-versioned.s3.amazonaws.com/3versions-allversioned.txt?versionId=b.qCuh7Sg58VIYj8TVHzbRS97EvejzEl [OK]
3versions-allversioned.txt            ...  http://datalad-test0-versioned.s3.amazonaws.com/3versions-allversioned.txt?versionId=pNsV5jJrnGATkmNrP8.i_xNH6CY4Mo5s [OK]
3versions-allversioned.txt_sameprefix ...  http://datalad-test0-versioned.s3.amazonaws.com/3versions-allversioned.txt_sameprefix?versionId=Mvsc4FgJWc6gExwSw1d6wsLrnk6wdDVa [OK]

Comment by yarikoptic — Wed Jul 12 17:30:33 2017

Remove comment

comment 6

That would almost work without any smarts on the git-annex side. When it tells the special remote to REMOVEEXPORT, the special remote could remove the file from the HEAD equivilant but retain the content in its versioned snapshots, and keep the url to that registered. But, that doesn't actually work, because the url is registered for that special remote, not the web special remote. Once git-annex thinks the file has been removed from the special remote, it will never try to use the url registered for that special remote.

So, to support versioning-capable special remotes, there would need to be an additional response to REMOVEEXPORT that says "I removed it from HEAD, but I still have a copy in this url, which can be accessed using the web special remote".

Comment by joey — Wed Jul 12 18:09:00 2017

Remove comment

side-note about WebDAV&DeltaV

DAV = “Distributed Authoring and Versioning.”, but versioning was forgotten about in the original RFC. Only some servers/clients implement DeltaV spec (RFC 3253) which came later to fill that gap. But in principle, any DeltaV-compliant WebDAV special remote could then be used for "export" while retaining access to all the versions. References: - WebDAV and Autoversioning - Version Control with Subversion - RFC 3253

I have got interested whenever saw that box.com is supported through WebDAV but not sure if DeltaV is anyhow supported and apparently number of versions stored per file is anyways depends on type of the account (and no versions for a free personal one): https://community.box.com/t5/How-to-Guides-for-Managing/How-To-Track-Your-Files-and-File-Versions-Version-History/ta-p/329

Comment by yarikoptic — Wed Jul 12 21:54:49 2017

Remove comment

regarding setting a URL by custom special remote

I also wonder if SETURLPRESENT Key Url could also be extended to be SETURLPRESENT Key Url Remote, i.e. that a custom remote could register a URL to Web remote? In many cases I expect a "custom uploader/exporter" but then public URL being available, so demanding a custom external remote to fetch it would be a bit overkill.

N.B. I already was burnt once on a large scale with our custom remote truthfully replying to CLAIMURL (since it can handle them if needed) to public URLs, thus absorbing them into it instead of relaying responsibility to 'Web' remote. Had to traverse dozens of datasets and duplicate urls from 'datalad' to 'Web' remote.

Comment by yarikoptic — Wed Jul 12 22:04:38 2017

Remove comment

comments on protocol

TRANSFEREXPORT STORE|RETRIEVE Key File Name -- note that File could also contain spaces etc (not only the Name), so should be encoded somehow?
old external special remote programs ... need to handle an ERROR response -- why not just to boost protocol VERSION to e.g. 2 so those which implement this would reply with a new version #?

Comment by yarikoptic — Wed Jul 12 22:09:54 2017

Remove comment

export "each revision" -- thinking about quiltdata

In some cases, if remote supports versioning, might be cool to be able to export all versions (from previously exported point, assuming linear progression). Having a chat with https://quiltdata.com/ folks, project which I just got to know about. 1. They claim/hope to provide infinite storage for public datasets 2. They do support "File" model, so dataset could simply contain files. If we could (ab)use that -- sounds like a lovely free ride 3. They do support versioning. If we could export all the versions -- super lovely.

Might also help to establish interoperability between the tools

Comment by yarikoptic — Fri Jul 14 20:10:42 2017

Remove comment

Git History

It would be great to have an option to include git history in the export, such that a special remote could be used both to rebuild a repository and to view the contents.

Comment by xloem — Thu Aug 10 00:25:27 2017

Remove comment

re: export "each revision"

That sounds much more like a regular remote with git annex copy --all.

This entire design is preducated on exporting a single treeish. If you want to make a single treeish containing all versions of every file ...

Comment by joey — Tue Aug 15 18:13:38 2017

Remove comment

re: Git History

That is entirely out of scope. You're looking for a way to store a git repository someplace like S3. Such things exist already, and are not git-annex and I'm not going to replicate them as part of this feature.

Comment by joey — Tue Aug 15 18:16:31 2017

Remove comment

re: comments on protocol

In TRANSFEREXPORT STORE|RETRIEVE Key File Name, it should always be possible for the File to not contain spaces in its name. But it could be rather painful for git-annex to avoid spaces in some cases (would need to link or copy the annexed file content). So well spotted.

Hmm, it's actually possible for a Key to contain spaces as well, at least with the WORM backend. ?external special remote protocol broken by key with spaces

The protocol VERSION is picked by the special remote, it's not negotiated.

Comment by joey — Tue Aug 15 18:18:14 2017

Remove comment

comment 15

Since ?external special remote protocol broken by key with spaces was fixed, the Key can't contain spaces any longer.

The File could still contain spaces, eg when exporting from a direct mode repository where the worktree filename contains spaces.

In RENAMEEXPORT, both OldName and NewName could contain spaces.

Comment by joey — Mon Aug 28 19:00:10 2017

Remove comment

comment 16

I've updated the proposed external special remote protocol to avoid the whitespace concerns. Not wild about needing a separate EXPORT request, which will probably get shoved into a global variable in most implementations. But it does avoid needing to use some kind of encoding, which would complicate implementations more, I feel.

Comment by joey — Mon Aug 28 19:32:06 2017

Remove comment

protocol message

joey wrote:

The protocol VERSION is picked by the special remote, it's not negotiated.

VERSION is provided to git-annex by the special remote to git-annex process. There is no need to 'negotiate' anything - you could make git-annex understand either of:

higher VERSION, e.g.
- VERSION 2 which would support some new features which that special remote would need. If parent git-annex is old/doesn't support that version - would fail and demand git annex upgrade
- VERSION 6.20171124 (where 6.20171124 is an example of git-annex version) so if git-annex parent process is older than that it could provide a meaningful message that git annex >= 6.20171124 is needed
VERSION 1 feature1 feature2 ... where those features could be the ones needed (e.g. INFO_MSG for recent addition). And if parent git-annex doesn't know/support any particular feature, it could fail and inform user that a new annex with that feature support is needed.

In either of those cases the custom special remotes page could outline added features/versions of git-annex supporting them, so may be even those above error messages could point to it.

Overall, it is just a minor change to be done on git-annex side while allowing for clear(er) specification, and I do not see any need for actual "negotiation" -- features are either supported or not by the parent process.

Comment by yarikoptic — Tue Feb 6 20:03:27 2018

Remove comment

comment 18

Changing VERSION would prevent any older versions of git-annex from working with that external special remote, since they would reject the unknown version. (The current parsing of VERSION also happens to preclude adding some fields after the number.)

Since it seems completely possible to make the protocol be changed in a way that is backwards compatible both ways, while still letting new features to be used, I'd rather reserve changing VERSION for whatever future thing needs a full breaking bump.

Comment by joey — Wed Feb 7 16:43:02 2018

Remove comment

comment 19

Changing VERSION would prevent any older versions of git-annex from working with that external special remote, since they would reject the unknown version. (The current parsing of VERSION also happens to preclude adding some fields after the number.)

I still do not get it, sorry -- If there is an older git-annex, and a special remote requests some higher VERSION (thus stating that it needs some features older git-annex does not support), IMHO it would be perfectly fine to fail to use that remote since it wouldn't be usable anyways with that older git-annex (i.e. require some special features it does not provide). If special remote does not need any feature not present in version 1, it (like all of them ATM) could still keep requesting VERSION 1 thus staying compatible with whatever old git-annex is out there.

Comment by yarikoptic — Wed Feb 7 18:31:45 2018

Remove comment

comment 19

What if the remote wants to use some feature like NOTE, but can still manage to work when an old git-annex does not support it? Hard bumping the VERSION cannot support that. If the remote requires to be able to use NOTE and sees it cannot, it can still throw an error.

There are a bunch of requests in the protocol that are optional for the remote to support; git-annex deals with remotes that don't support them in better ways than throwing up its hands because the special remote is too old. It's very good that the protocol allowed adding those extensions without bumping a version. The protocol is less extensible when it comes replies and other messages sent by the special remote, and I want to get the same extensibility for those.

Comment by joey — Wed Feb 7 19:04:03 2018

Remove comment

comment 21

Ok, gotcha. Shouldn't then EXTENSION entries also be somehow versioned per each one of them? or if needed a new extension would be born by appending a version to its name (e.g. as with all those imap, imap2, imap3, ... ;-))

Comment by yarikoptic — Wed Feb 7 20:01:53 2018

Remove comment

Add a comment