external special remote protocol

Communication between git-annex and a program implementing an external special remote uses this protocol.

starting the program
protocol overview
example session
git-annex request messages and replies
special remote messages
general messages
protocol versions
extensions
signals
long running network connections
claiming custom uri schemes for use with git-annex addurl
readonly mode for http downloads
TODO

starting the program

The external special remote program has a name like git-annex-remote-$bar. When git annex initremote foo type=external externaltype=$bar is run, git-annex finds the appropriate program in PATH.

The program is started by git-annex when it needs to access the special remote, and may be left running for a long period of time. This allows it to perform expensive setup tasks, etc. Note that git-annex may choose to start multiple instances of the program (eg, when multiple git-annex commands are run concurrently in a repository).

protocol overview

Communication is via stdin and stdout. Therefore, the external special remote must avoid doing any prompting, or outputting anything such as progress to stdout. (Such output can be sent to stderr instead.)

The protocol is line based. Messages are sent in either direction, from git-annex to the special remote, and from the special remote to git-annex.

In order to avoid confusing interactions, one or the other has control at any given time, and is responsible for sending requests, while the other only sends replies to the requests.

Each protocol line starts with a command, which is followed by the command's parameters (a fixed number per command), each separated by a single space. The last parameter may contain spaces. Parameters may be empty, but the separating spaces are still required in that case.
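
The line format described above can be parsed roughly as follows. This is only a sketch: `PARAM_COUNTS` and `parse_line` are names made up for this example, and the table lists just a handful of commands.

```python
# Hypothetical table of parameter counts for a few commands; a real
# implementation would list every command it handles.
PARAM_COUNTS = {
    "PREPARE": 0,
    "CHECKPRESENT": 1,   # Key
    "TRANSFER": 3,       # STORE|RETRIEVE Key File
    "VALUE": 1,
    "CREDS": 2,          # User Password
}

def parse_line(line):
    """Split one protocol line into (command, [parameters])."""
    line = line.rstrip("\n")
    command, _, rest = line.partition(" ")
    count = PARAM_COUNTS.get(command)
    if count is None:
        # Unknown command: keep the rest as one opaque parameter.
        return command, ([rest] if rest else [])
    if count == 0:
        return command, []
    # Split on single spaces so empty parameters are preserved, and
    # stop early so the last parameter keeps any spaces it contains.
    params = rest.split(" ", count - 1)
    # Pad in case trailing empty parameters are missing.
    params += [""] * (count - len(params))
    return command, params
```

Splitting with a limit, rather than on all whitespace, is what keeps a File name containing spaces intact.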

example session

The special remote is responsible for sending the first message, indicating the version of the protocol it is using.

VERSION 2

Recent versions of git-annex respond with a message indicating protocol extensions that it supports. Older versions of git-annex do not send this message.

EXTENSIONS INFO ASYNC GETGITREMOTENAME UNAVAILABLERESPONSE

The special remote can respond to that with its own EXTENSIONS message, listing any extensions it wants to use. (It's also fine to reply with UNSUPPORTED-REQUEST.)

EXTENSIONS

Next, git-annex will generally send a message telling the special remote to start up. (Or it might send an INITREMOTE or EXPORTSUPPORTED or LISTCONFIGS, or perhaps other things in the future, so don't hardcode this order.)

PREPARE

The special remote can now ask git-annex for its configuration, as needed, and check that it's valid. git-annex responds with the configuration values:

GETCONFIG directory
VALUE /media/usbdrive/repo
GETCONFIG automount
VALUE true

Once the special remote is satisfied with its configuration and is ready to go, it tells git-annex that it's done with the PREPARE step:

PREPARE-SUCCESS

Now git-annex will make a request. Let's suppose it wants to store a key.

TRANSFER STORE somekey tmpfile

The special remote can then start reading the tmpfile and storing it. While it's doing that, the special remote can send messages back to git-annex to indicate what it's doing, or ask for other information. It will typically send progress messages, indicating how many bytes have been sent:

PROGRESS 10240
PROGRESS 20480

Once the key has been stored, the special remote tells git-annex the result:

TRANSFER-SUCCESS STORE somekey

Now git-annex will send its next request.

Once git-annex is done with the special remote, it will close its stdin. The special remote program can then exit.
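
The session above could be driven by a program shaped roughly like this. It is a minimal sketch, not a complete remote: it handles only the messages shown in the example, answers everything else (including the required INITREMOTE, CHECKPRESENT and REMOVE) with UNSUPPORTED-REQUEST, and assumes a "directory" setting as in the example. The function names and the chunk size are invented for illustration.

```python
import os

CHUNK = 65536  # arbitrary chunk size chosen for this sketch

def send(out, *words):
    """Write one protocol line and flush so git-annex sees it at once."""
    out.write(" ".join(words) + "\n")
    out.flush()

def ask(inp, out, *words):
    """Send a query such as GETCONFIG and return the value of the VALUE reply."""
    send(out, *words)
    reply = inp.readline().rstrip("\n")
    command, _, value = reply.partition(" ")
    if command != "VALUE":
        raise RuntimeError("unexpected reply: " + reply)
    return value

def store(inp, out, key, path):
    """Copy path into the configured directory, reporting progress."""
    directory = ask(inp, out, "GETCONFIG", "directory")
    final = os.path.join(directory, key)
    done = 0
    # Write under a temporary name so the key only appears once complete.
    with open(path, "rb") as src, open(final + ".tmp", "wb") as dst:
        for chunk in iter(lambda: src.read(CHUNK), b""):
            dst.write(chunk)
            done += len(chunk)
            send(out, "PROGRESS", str(done))
    os.replace(final + ".tmp", final)

def run(inp, out):
    """Speak the protocol on the given streams until stdin is closed."""
    send(out, "VERSION", "2")
    for line in iter(inp.readline, ""):
        command, _, rest = line.rstrip("\n").partition(" ")
        if command == "EXTENSIONS":
            send(out, "EXTENSIONS")  # this sketch uses no extensions
        elif command == "PREPARE":
            send(out, "PREPARE-SUCCESS")
        elif command == "TRANSFER":
            direction, key, path = rest.split(" ", 2)
            if direction != "STORE":
                send(out, "TRANSFER-FAILURE", direction, key,
                     "not implemented in this sketch")
                continue
            try:
                store(inp, out, key, path)
                send(out, "TRANSFER-SUCCESS", "STORE", key)
            except (OSError, RuntimeError) as e:
                send(out, "TRANSFER-FAILURE", "STORE", key, str(e))
        else:
            send(out, "UNSUPPORTED-REQUEST")
```

A real program would call run(sys.stdin, sys.stdout) and then exit. Sending PROGRESS after every chunk keeps the sketch short; for very large files a real remote would probably send updates less often.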

git-annex request messages and replies

These are messages git-annex sends to the special remote program.

Once the special remote has finished performing the request, it should send one of the listed replies.

The following requests must all be supported by the special remote.

INITREMOTE
Requests the remote to initialize itself. This is where any one-time setup tasks can be done, for example creating an Amazon S3 bucket.
Note: This may be run repeatedly over time, as a remote is initialized in different repositories, or as the configuration of a remote is changed. (Both git annex initremote and git-annex enableremote run this.) So any one-time setup tasks should be done idempotently.
- INITREMOTE-SUCCESS
  Indicates the INITREMOTE succeeded and the remote is ready to use.
- INITREMOTE-FAILURE ErrorMsg
  Indicates that INITREMOTE failed.
PREPARE
Tells the remote that it's time to prepare itself to be used.
Only a few requests for details about the remote can come before this (EXTENSIONS, INITREMOTE, EXPORTSUPPORTED and LISTCONFIGS, but others may be added later).
- PREPARE-SUCCESS
  Sent as a response to PREPARE once the special remote is ready for use.
- PREPARE-FAILURE ErrorMsg
  Sent as a response to PREPARE if the special remote cannot be used.
TRANSFER STORE|RETRIEVE Key File
Requests the transfer of a key. This is the main thing a special remote does. For STORE, the File contains the content to upload; for RETRIEVE the File is where to store the content you download.
When retrieving, the File may already exist, if its retrieval was interrupted before. That lets the remote resume downloading, if it's able to.
Note that the File should not influence the filename used on the remote; that filename should be based on the Key.
Note that in some cases, the File's name may include whitespace or other special characters.
While the transfer is running, the remote can send any number of PROGRESS messages to indicate its progress. It can also send any of the other special remote messages. Once the transfer is done, it finishes by sending one of these replies:
- TRANSFER-SUCCESS STORE|RETRIEVE Key
  Indicates the transfer completed successfully.
- TRANSFER-FAILURE STORE|RETRIEVE Key ErrorMsg
  Indicates the transfer failed.
CHECKPRESENT Key
Requests the remote to check if a key is present in it.
It's important that, while a key is being transferred to a remote, CHECKPRESENT not indicate it's present in the remote until all the data has been sent.
- CHECKPRESENT-SUCCESS Key
  Indicates that a key has been positively verified to be present in the remote.
- CHECKPRESENT-FAILURE Key
  Indicates that a key has been positively verified to not be present in the remote.
- CHECKPRESENT-UNKNOWN Key ErrorMsg
  Indicates that it is not currently possible to verify if the key is present in the remote. (Perhaps the remote cannot be contacted.)
REMOVE Key
Requests the remote to remove a key's contents.
- REMOVE-SUCCESS Key
  Indicates the key has been removed from the remote. May be returned if the remote didn't have the key at the point removal was requested.
- REMOVE-FAILURE Key ErrorMsg
  Indicates that the key was unable to be removed from the remote.
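
As an illustration of the CHECKPRESENT and REMOVE replies, here is a sketch for a remote that keeps each key as a file in a local directory. The function names are hypothetical, and it assumes the Key is usable as a file name as-is.

```python
import os

def check_present(directory, key):
    """Return the reply line for CHECKPRESENT Key."""
    if not os.path.isdir(directory):
        # E.g. the drive is not mounted: presence cannot be verified
        # either way, so neither SUCCESS nor FAILURE is appropriate.
        return "CHECKPRESENT-UNKNOWN %s %s is not available" % (key, directory)
    if os.path.isfile(os.path.join(directory, key)):
        return "CHECKPRESENT-SUCCESS " + key
    return "CHECKPRESENT-FAILURE " + key

def remove(directory, key):
    """Return the reply line for REMOVE Key."""
    try:
        os.remove(os.path.join(directory, key))
    except FileNotFoundError:
        pass  # already absent; REMOVE-SUCCESS may still be returned
    except OSError as e:
        return "REMOVE-FAILURE %s %s" % (key, e)
    return "REMOVE-SUCCESS " + key
```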

Special remotes can optionally support tree exports and imports, which makes the git-annex-export and git-annex-import commands work with them. See the export and import appendix for additional requests that git-annex will make when using special remotes in this way.

The following requests can optionally be supported. If not supported, the special remote can reply with UNSUPPORTED-REQUEST.

EXTENSIONS List
Sent to indicate protocol extensions which git-annex is capable of using. The list is a space-delimited list of protocol extension keywords. The remote can reply to this with its own EXTENSIONS list. See the section on extensions below for details.
- EXTENSIONS List
  Sent in response to an EXTENSIONS request, to indicate the protocol extensions that the special remote is using.
LISTCONFIGS
Requests the remote to return a list of settings it uses (with GETCONFIG and SETCONFIG). Providing a list makes git annex initremote work better, because it can check the user's input, and can also display a list of settings with descriptions. Note that the user is not required to provide all the settings listed here. A block of responses can be made to this, which must always end with CONFIGEND.
(Do not include config like "encryption" that are common to all external special remotes. Also avoid including a config named "versioning" unless using it as described in the export and import appendix.)
- CONFIG Name Description
  Indicates the name and description of a config setting. The description should be reasonably short. Example: "CONFIG directory store data here"
- CONFIGEND
  Indicates the end of the response block.
GETCOST
Requests the remote to return a use cost. Higher costs are more expensive. (See Config/Cost.hs for some standard costs.)
- COST Int
  Indicates the cost of the remote.
GETORDERED
Asks the remote if it will always write files in order when performing a TRANSFER RETRIEVE. Writing in order lets a proxy stream content from the remote. When this is not implemented, git-annex assumes the remote may write parts of the file out of order.
- ORDERED
  Indicates that files are written in order.
- UNORDERED
  Indicates that files are not written in order.
GETAVAILABILITY
Asks the remote if it is locally or globally available. (I.e., stored in the cloud vs. on a local disk.)
If the remote replies with UNSUPPORTED-REQUEST, its availability is assumed to be global. So, only remotes that are only reachable locally need to worry about implementing this.
This is queried at remote startup, so should avoid doing anything that can take long to run or is expensive. Checking if a directory where the remote stores files is currently mounted is the kind of thing it makes sense to do here.
- AVAILABILITY GLOBAL|LOCAL
  Indicates if the remote is globally or only locally available.
- AVAILABILITY UNAVAILABLE
  Indicates that the remote is not currently available.
  This will prevent some git-annex commands like git-annex sync from trying to use the remote.
  Older versions of git-annex do not support this response, so avoid sending it unless the UNAVAILABLERESPONSE extension is enabled.
CLAIMURL Url
Asks the remote if it wishes to claim responsibility for downloading an url.
- CLAIMURL-SUCCESS
  Indicates that the CLAIMURL url will be handled by this remote.
- CLAIMURL-FAILURE
  Indicates that the CLAIMURL url will not be handled by this remote.
CHECKURL Url
Asks the remote to check if the url's content can currently be downloaded (without downloading it).
- CHECKURL-CONTENTS Size|UNKNOWN Filename
  Indicates that the requested url has been verified to exist.
  The Size is the size in bytes, or use "UNKNOWN" if the size could not be determined.
  The Filename can be empty (in which case a default is used), or can specify a filename that is suggested to be used for this url.
- CHECKURL-MULTI Url1 Size1|UNKNOWN Filename1 Url2 Size2|UNKNOWN Filename2 ...
  Indicates that the requested url has been verified to exist, and contains multiple files, which can each be accessed using their own url. Each triplet of url, size, and filename should be listed, one after the other. Note that since a list is returned, neither the Url nor the Filename can contain spaces.
- CHECKURL-FAILURE ErrorMsg
  Indicates that the requested url could not be accessed.
WHEREIS Key
Asks the remote to provide additional information about ways to access the content of a key stored in it, such as public urls. This will be displayed to the user by eg, git annex whereis. Note that users expect git annex whereis to run fast, without eg, network access.
- WHEREIS-SUCCESS String
  Indicates a location of a key. Typically an url, the string can be anything that it makes sense to display to the user about content stored in the special remote.
- WHEREIS-FAILURE
  Indicates that no location is known for a key. This is not needed when SETURIPRESENT is used, since such uris are automatically displayed by git annex whereis.
GETINFO
Requests the remote to send some information describing its configuration, for display by git annex info. A block of responses can be made to this, which must always end with INFOEND.
- INFOFIELD Name
  Gives the name of an info field. The name can be anything you want to be displayed to the user. Must be immediately followed by INFOVALUE.
- INFOVALUE Value
  Gives the value of an info field.
- INFOEND
  Indicates the end of the response block.

More optional requests may be added, without changing the protocol version, so if an unknown request is seen, don't crash, just reply with UNSUPPORTED-REQUEST.
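
One way to follow that advice is to make UNSUPPORTED-REQUEST the fallback of the request dispatcher. The sketch below also shows the block-style replies to LISTCONFIGS and GETINFO. The function name, the cost of 200 and the info field are arbitrary choices for this example, not values taken from git-annex.

```python
def handle_optional(command, settings):
    """Return the reply lines for an optional request.

    settings maps each config name to a short description.
    """
    if command == "LISTCONFIGS":
        lines = ["CONFIG %s %s" % (name, desc)
                 for name, desc in settings.items()]
        return lines + ["CONFIGEND"]  # the block must always end with this
    if command == "GETCOST":
        return ["COST 200"]  # illustrative value only
    if command == "GETINFO":
        return ["INFOFIELD number of settings",
                "INFOVALUE %d" % len(settings),
                "INFOEND"]
    # Anything unknown, including requests added to the protocol later.
    return ["UNSUPPORTED-REQUEST"]
```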

special remote messages

These messages may be sent by the special remote at any time that it's handling a request.

VERSION Int
Supported protocol version. Current version is 2. Must be sent first thing at startup, as until it sees this git-annex does not know how to talk with the special remote program!
(git-annex does not send a reply to this message, but may give up if it doesn't support the necessary protocol version.)
PROGRESS Int
Indicates the current progress of the transfer. The Int is the number of bytes from the beginning of the file that have been transferred.
May be repeated any number of times during the transfer process, but it's wasteful to update the progress too frequently. Bear in mind that this is used both to display a progress meter for the user, and for annex.stalldetection. So, sending an update on each 1% of the file may not be frequent enough, as it could appear to be a stall when transferring a large file.
This is highly recommended for STORE. (It is optional but good for RETRIEVE; git-annex will fall back to tracking the size of the file as it grows.)
(git-annex does not send a reply to this message.)
DIRHASH Key
Gets a two level hash associated with a Key. Something like "aB/Cd/". This is always the same for any given Key, so can be used for eg, creating hash directory structures to store Keys in. This is the same directory hash that git-annex uses inside .git/annex/objects/
(git-annex replies with VALUE followed by the value.)
DIRHASH-LOWER Key
Gets a two level hash associated with a Key, using only lower-case. Something like "abc/def/". This is always the same for any given Key, so can be used for eg, creating hash directory structures to store Keys in. This is the same directory hash that is used by eg, the directory special remote.
(git-annex replies with VALUE followed by the value.)
SETCONFIG Setting Value
Sets one of the special remote's configuration settings.
Normally this is sent during INITREMOTE, which allows these settings to be stored in the git-annex branch, so will be available if the same special remote is used elsewhere. (If sent after INITREMOTE, the changed configuration will only be available while the remote is running.)
See also GETGITREMOTENAME for a way to access git configuration of the remote.
(git-annex does not send a reply to this message.)
GETCONFIG Setting
Gets one of the special remote's configuration settings, which can have been passed by the user when running git annex initremote, or can have been set by a previous SETCONFIG. Can be run at any time.
It's recommended that special remotes that use this implement LISTCONFIGS.
(git-annex replies with VALUE followed by the value. If the setting is not set, the value will be empty.)
SETCREDS Setting User Password
When some form of user and password is needed to access a special remote, this can be used to securely store them for later use. (Like SETCONFIG, this is normally sent only during INITREMOTE.)
The Setting indicates which value in a remote's configuration can be used to store the creds.
Note that creds are normally only stored in the remote's configuration when it's surely safe to do so, that is, when gpg encryption is used, in which case the creds will be encrypted using it. If creds are not stored in the configuration, they'll only be stored in a local file.
(embedcreds can be set to yes by the user or by SETCONFIG to force the creds to be stored in the remote's configuration).
(git-annex does not send a reply to this message.)
GETCREDS Setting
Gets any creds that were previously stored in the remote's configuration or a file. (git-annex replies with "CREDS User Password". If no creds are found, User and Password are both empty.)
GETUUID
Queries for the UUID of the special remote being used.
(git-annex replies with VALUE followed by the UUID.)
GETGITDIR
Queries for the path to the git directory of the repository that is using the external special remote. (git-annex replies with VALUE followed by the path.)
GETGITREMOTENAME
Gets the name of the git remote that represents this special remote. This can be used, for example, to look up git configuration belonging to that git remote. This name will often be the same as what is passed to git-annex initremote and enableremote, but it is possible for git remotes to be renamed, and this will provide the remote's current name.
(git-annex replies with VALUE followed by the name.)
This message is a protocol extension; it's only safe to send it to git-annex after it sent an EXTENSIONS that included GETGITREMOTENAME.
SETWANTED PreferredContentExpression
Can be used to set the preferred content of a repository. Normally this is not configured by a special remote, but it may make sense in some situations to hint at the kind of content that should be stored in the special remote. Note that if an unparsable expression is set, git-annex will ignore it.
(git-annex does not send a reply to this message.)
GETWANTED
Gets the current preferred content setting of the repository. (git-annex replies with VALUE followed by the preferred content expression.)
SETSTATE Key Value
Can be used to store some form of state for a Key. The state stored can be anything this remote needs to store, in any format. It is stored in the git-annex branch. Note that this means that if multiple repositories are using the same special remote, and store different state, whichever one stored the state last will win. Also, it's best to avoid storing much state, since this will bloat the git-annex branch. Most remotes will not need to store any state.
(git-annex does not send a reply to this message.)
GETSTATE Key
Gets any state that has been stored for the key.
(git-annex replies with VALUE followed by the state.)
SETURLPRESENT Key Url
Records an URL where the Key can be downloaded from.
Note that this does not make git-annex think that the url is present on the web special remote.
Keep in mind that this stores the url in the git-annex branch. This can result in bloat to the branch if the url is large and/or does not delta pack well with other information (such as the names of keys) already stored in the branch.
(git-annex does not send a reply to this message.)
SETURLMISSING Key Url
Records that the key can no longer be downloaded from the specified URL.
(git-annex does not send a reply to this message.)
SETURIPRESENT Key Uri
Records an URI where the Key can be downloaded from. Use with uris that cannot be downloaded with http. (git-annex does not send a reply to this message.)
SETURIMISSING Key Uri
Records that the key can no longer be downloaded from the specified URI.
(git-annex does not send a reply to this message.)
GETURLS Key Prefix
Gets the recorded urls where a Key can be downloaded from. Only urls that start with the Prefix will be returned. The Prefix may be empty to get all urls. (git-annex replies one or more times with VALUE for each url. The final VALUE has an empty value, indicating the end of the url list.)
DEBUG message
Tells git-annex to display the message if --debug is enabled.
(git-annex does not send a reply to this message.)
INFO message
Tells git-annex to display the message to the user.
When git-annex is in --json mode, the message will be emitted immediately in its own json object, with an "info" field.
This message is a protocol extension; it's only safe to send it to git-annex after it sent an EXTENSIONS that included INFO.
(git-annex does not send a reply to this message.)
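
Most of the queries above get a single VALUE reply, but GETURLS gets several, terminated by an empty VALUE. A sketch of reading that list, using the hypothetical helper name get_urls:

```python
def get_urls(inp, out, key, prefix=""):
    """Send GETURLS and collect VALUE replies until the empty one."""
    # The separating space is still sent when the prefix is empty.
    out.write("GETURLS %s %s\n" % (key, prefix))
    out.flush()
    urls = []
    while True:
        line = inp.readline()
        if not line:
            raise EOFError("git-annex closed the connection")
        command, _, value = line.rstrip("\n").partition(" ")
        if command != "VALUE":
            raise RuntimeError("unexpected reply: " + line.strip())
        if value == "":
            return urls  # the empty VALUE marks the end of the list
        urls.append(value)
```

In a real program inp and out would be sys.stdin and sys.stdout.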

general messages

These messages can be sent at any time by either git-annex or the special remote.

ERROR ErrorMsg
Generic error. Can be sent at any time if things get too messed up to continue. When possible, use a more specific reply from the list above.
The special remote program should exit after sending this, as git-annex will not talk to it any further. If the program receives an ERROR from git-annex, it can exit with its own ERROR.

protocol versions

Currently git-annex supports VERSION 1 and VERSION 2. The two protocol versions are actually identical.

Old versions of git-annex that supported only VERSION 1 had a bug in their implementation of the part of the protocol documented in the export and import appendix. The bug could result in content being exported to the wrong file. External special remotes that implement that should use VERSION 2 to avoid talking to the buggy old version of git-annex.

extensions

These protocol extensions are currently supported.

INFO
This allows using the INFO message.
ASYNC
This lets multiple actions be performed at the same time by a single external special remote program, rather than starting multiple programs. See the async appendix for details.
GETGITREMOTENAME
This allows using the GETGITREMOTENAME message.
UNAVAILABLERESPONSE
This allows the AVAILABILITY UNAVAILABLE response to be used in reply to GETAVAILABILITY.
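
A remote might negotiate extensions along these lines. This is a sketch: WANTED, negotiate and info are names invented here, and falling back to DEBUG when INFO was not offered is just one possible choice.

```python
# Extensions this hypothetical remote would like to use.
WANTED = {"INFO", "GETGITREMOTENAME"}

def negotiate(line):
    """Given git-annex's EXTENSIONS line, return (reply, enabled set)."""
    offered = set(line.split()[1:])
    enabled = WANTED & offered
    reply = " ".join(["EXTENSIONS"] + sorted(enabled))
    return reply, enabled

def info(out, enabled, message):
    """Send INFO only if git-annex offered it; otherwise use DEBUG."""
    command = "INFO" if "INFO" in enabled else "DEBUG"
    out.write("%s %s\n" % (command, message))
    out.flush()
```

If git-annex never sends EXTENSIONS at all (older versions), the enabled set simply stays empty and no extension messages are sent.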

signals

The external special remote program should not block SIGINT, or SIGTERM. Doing so may cause git-annex to hang waiting on it to exit. Of course it's ok to catch those signals and do some necessary cleanup before exiting.
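
Catching the signals for cleanup might look like this in a Python remote. It is a sketch: cleanup is a hypothetical hook, and exiting with 128 plus the signal number is a common shell convention rather than anything the protocol requires.

```python
import signal
import sys

def cleanup():
    """Hypothetical hook: close connections, delete partial files, etc."""

def handle_signal(signum, frame):
    cleanup()
    # Exit promptly so git-annex is not left waiting on this process.
    sys.exit(128 + signum)

def install_handlers():
    """Catch SIGINT and SIGTERM instead of blocking or ignoring them."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, handle_signal)
```

install_handlers() would be called once at startup, from the main thread.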

long running network connections

Since an external special remote is started only when git-annex needs to access the remote, and then left running, it's ok to open a network connection in the PREPARE stage, and continue to use that network connection as requests are made.

If you're unable to open a network connection, or the connection closes, perhaps because the network is down, it's ok to fail to perform any requests. Or you can try to reconnect when a new request is made.
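
Both options can be combined: fail the current request, and try to reconnect on the next one. A sketch for REMOVE, where connect and delete are hypothetical callables supplied by the remote's own code:

```python
class LazyConnection:
    """Opens the connection on first use and reopens it after a failure."""

    def __init__(self, connect):
        self._connect = connect  # hypothetical factory returning a connection
        self._conn = None

    def get(self):
        if self._conn is None:
            self._conn = self._connect()
        return self._conn

    def invalidate(self):
        self._conn = None

def remove_key(lazy, key, delete):
    """Return the REMOVE reply; delete(conn, key) does the actual work."""
    try:
        delete(lazy.get(), key)
        return "REMOVE-SUCCESS " + key
    except OSError as e:  # includes ConnectionError and its subclasses
        lazy.invalidate()  # the next request will try to reconnect
        return "REMOVE-FAILURE %s %s" % (key, e)
```

The same pattern should not be copied blindly for CHECKPRESENT: a connection problem there calls for CHECKPRESENT-UNKNOWN, since CHECKPRESENT-FAILURE means the key was verified to be absent.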

Note that the external special remote program may be left running for quite a long time, especially when the git-annex assistant is using it. The assistant will detect when the system connects to a network, and will start a new process the next time it needs to use a remote.

claiming custom uri schemes for use with git-annex addurl

If a special remote has its own uri scheme, or some other way to identify a particular url as being content that is stored in the special remote, and can be downloaded by it, it can implement CLAIMURL and CHECKURL. This lets git-annex addurl be used with such urls.

For example, the ipfs special remote implements CLAIMURL and CHECKURL for "ipfs:ADDRESS" uris. And the bittorrent special remote implements them for http urls ending in ".torrent".

When a special remote has claimed an url, commands like git-annex addurl will use TRANSFER RETRIEVE to request it download the content of a key. To find out what url to download, the special remote can use GETURLS to find out what urls are recorded for the key.

For example, the ipfs special remote sends "GETURLS $KEY ipfs:", in order to get only the "ipfs:" uris.

The special remote can also use SETURIPRESENT or SETURLPRESENT, eg after transferring content to the remote it might know the uri or url that can be used to download it. And SETURIMISSING or SETURLMISSING can be used after removing content from the remote. This information can then be looked up using GETURLS. But it's not necessary to do this in order to simply claim an url, because git-annex addurl takes care of it.

For example, the ipfs special remote sends "SETURIPRESENT $KEY ipfs:ADDRESS" after storing each key in ipfs. It can later look up that uri when downloading the key, and the ipfs uri is also displayed by git-annex whereis.
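
A remote with its own uri scheme might answer CLAIMURL and CHECKURL roughly like this. The "example:" scheme and the lookup_size callable are made up for this sketch.

```python
SCHEME = "example:"  # hypothetical uri scheme handled by this remote

def claim_url(url):
    """Return the reply line for CLAIMURL Url."""
    if url.startswith(SCHEME):
        return "CLAIMURL-SUCCESS"
    return "CLAIMURL-FAILURE"

def check_url(url, lookup_size):
    """Return the reply line for CHECKURL Url.

    lookup_size(url) is a hypothetical callable that returns the size in
    bytes, or None if the size is unknown, and raises LookupError if the
    url cannot be accessed.
    """
    try:
        size = lookup_size(url)
    except LookupError as e:
        return "CHECKURL-FAILURE %s" % e
    size_word = "UNKNOWN" if size is None else str(size)
    # The Filename is left empty so git-annex picks a default; the
    # separating space is still required.
    return "CHECKURL-CONTENTS %s " % size_word
```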

readonly mode for http downloads

Some storage services allow downloading the content of a file using a regular http connection, with no authentication. An external special remote for such a storage service can support a readonly mode of operation.

It works like this:

When a key's content is stored on the remote, use SETURLPRESENT to tell git-annex the public url from which it can be downloaded.
When a key's content is removed from the remote, use SETURLMISSING.
Document that this external special remote can be used in readonly mode.

The user doesn't even need to install your external special remote program to use such a remote! All they need to do is run: git annex enableremote $remotename readonly=true
The readonly=true parameter makes git-annex download content from the urls recorded earlier by SETURLPRESENT.
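
The remote's side of this can be sketched as two small hooks, called after a successful store and a successful removal. BASE_URL and the function names are assumptions made for this example; a real service would have its own way of deriving the public url.

```python
import urllib.parse

BASE_URL = "https://files.example.com/"  # hypothetical public location

def public_url(key):
    """Build the url under which this key's content can be downloaded."""
    return BASE_URL + urllib.parse.quote(key, safe="")

def after_store(out, key):
    """Call once the key's content has been stored on the remote."""
    out.write("SETURLPRESENT %s %s\n" % (key, public_url(key)))
    out.flush()

def after_remove(out, key):
    """Call once the key's content has been removed from the remote."""
    out.write("SETURLMISSING %s %s\n" % (key, public_url(key)))
    out.flush()
```

Neither message gets a reply from git-annex, so the hooks only write and flush.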

TODO

When storing encrypted files stream the file up/down the pipe, rather than using a temp file. Will probably involve less space and disk IO, and makes the progress display better, since the encryption can happen concurrently with the transfer. Also, no need to use PROGRESS in this scenario, since git-annex can see how much data it has sent/received from the remote. However, \n and probably \0 need to be escaped somehow in the file data, which adds complication.
uuid discovery during INITREMOTE.
Hook into webapp. Needs a way to provide some kind of prompt to the user in the webapp, etc.

not useful for "plain directory" special remote?

It says: "Note that the File should not influence the filename used on the remote. The filename used should be derived from the Key."

Thus this interface might not be useful to implement "New special remote suggeston - clean directory". The clean directory special remote would just do that: save $key content on the remote under the filename $file.

Comment by Thomas — Mon Dec 16 20:10:16 2013

comment 2

That's not how special remotes work -- they have nothing to do with the symlinks in the work tree, which are managed by git (or git-annex in direct mode). They only store values for keys.

Comment by joeyh.name — Mon Dec 16 20:42:23 2013

Feature requests

PREPARE-Failure ErrorMsg (matching INITREMOTE-FAILURE ErrorMsg)

Also, i'd like for the following to overwrite existing credentials/configs MYFOLDER="testfolder" MYLOGIN="login" MYPASSWORD="pword" MYURL="http://webdav/" git annex enableremote owncloud type=external externaltype=owncloud --debug

This would also be needed for refreshing oauth.

Comment by TobiasTheViking — Sat Dec 28 13:57:35 2013

comment 4

Added PREPARE-FAILURE

git-annex enableremote causes INITREMOTE to be called, so any credentials can be stored etc. (Note that, as with built-in special remotes, credentials are only stored in the git-annex branch when the remote is encrypted. Otherwise, they're stored locally in a .git/annex/creds/ file.)

Also, I'd recommend using environment variables for passing credentials to initremote/enableremote, because that avoids leaking them in ps.. but it probably doesn't make sense to use environment variables for other settings, but instead pass them as parameters of initremote/enableremote, which can be looked up using GETCONFIG. Only exception might be if the setting needs to vary between different machines.

Comment by joeyh.name — Sun Dec 29 17:50:23 2013

Feature requests

Hook should be able to set default configuration for itself.

For instance, clean flickr hook will only upload some files(notably pictures). The user shouldn't have to manage that.

Other hooks have a maximum filesize(though i guess that doesn't matter once splitting works).

Comment by TobiasTheViking — Tue Dec 31 14:05:07 2013

comment 6

I suppose you're talking about preferred content settings.

I think that it makes sense for hooks to use existing git-annex plumbing etc when it's available. So a hook could just run git annex wanted to manage its preferred content.

The only problem is that a hook does not currently have a way to discover the UUID of the repository! So I've added a GETUUID to cover this and other use cases.

Comment by joeyh.name — Tue Dec 31 17:56:34 2013

comment 7

I think the hook running anything in shell, to interact with git annex, is a mistake.

I see a lot more potential pitfalls and mistakes(especially crossplatform).

It should be the existing git annex plumbing (preferred content) as you say. I just really think it should be configurable in the protocol, instead of a having to run a shell command.

Since you have made this advanced protocol i really see it as a mistake to do anything between the hook and git-annex outside of the protocol, it makes much more sense to have all their interactions happen within the protocol.

IMO anyways.

Comment by TobiasTheViking — Tue Dec 31 18:20:32 2013

comment 8

It makes sense to only implement one interface to things, unless there is a reason such as performance to do otherwise.

Comment by joeyh.name — Tue Dec 31 19:16:16 2013

comment 9

Tobias made some good points:

git-annex may not be in PATH depending on installation method
It would in theory be bad if a special remote ran some git-annex command that used the special remote and ran some git-annex command [...].
git-annex would need to tell the special remote what git repo it was being used with.

So, added GETWANTED and SETWANTED. However, if I find myself recapitulating a lot of git-annex's command-line plumbing stuff in this protocol, I will need to revisit this decision and find a better way. Particularly, I narrowly escaped an intractable dependency loop in 8e3032df2d5c6ddf07e43de4b3bb89cb578ae048.

Comment by joeyh.name — Thu Jan 2 00:15:28 2014

Feature request

The ability to mark a remote as being a "cloud" remote. To silence the "Unable to download files from your other devices. Add a cloud repository" message in the webapp.

Maybe as simple as "SETCONFIG cloud true", if that is a viable implementation.

Comment by TobiasTheViking — Sat Jan 11 15:41:48 2014

comment 11

I've handled the cloud repo check by making external special remotes be assumed by default to be globally available via the cloud. So no need to do anything in most cases. For remotes that are only available locally, the remote can reply with "AVAILABILITY LOCAL" when git-annex sends a GETAVAILABILITY request.

Comment by joeyh.name — Mon Jan 13 18:45:34 2014

Chunk it

TODO: stream the file up/down the pipe, rather than using a temp file

You might want to use chunked transfer, i.e. a series of "EXPECT 65536" followed by that many bytes of binary data and an EOF marker (EXPECT-END or EXPECT 0), instead of escaping three characters (newline, NUL, and the escape prefix) and the additional unnecessary tedious per-character processing that would require.

Comment by Matthias — Mon Jan 20 16:22:09 2014

comment 13

First things first: in the documentation, I think SETCONFIG Setting should be SETCONFIG Setting Value.

Now a few questions:

why have SETCREDS and GETCREDS have both a username and a password? I'd like to use them to store a OAuth token, but because of this I also have to store a dummy value, which seems weird to me. Is it possible to just do SETCREDS oauth_token XXXYYY123456?
about PROGRESS: my remote is sending PROGRESS xxx every 64kb uploaded or downloaded, but no upload/download progress is displayed by git-annex. Is this normal? Should I do it myself, or will it be done by a future version of git-annex?

Joey, thanks a lot for all your work on git-annex

Comment by Schnouki — Mon Feb 10 18:22:55 2014

comment 14

Schnouki, fixed SETCONFIG docs. Note that this is a wiki.

I agree it's a little weird for SETCREDS to have a username and a password. This is just exposing git-annex's existing credential storage which has a tuple of values rather than using, say, a multivalue map. If you only need one it's fine to put in a dummy value for the other one.

Re PROGRESS, it seems I hooked up the progress stuff, so it was visible in the webapp, but forgot to put up a progress display at the command line. Fixed in git.

I look forward to seeing your special remote implementation!

Comment by joeyh.name — Tue Feb 11 01:36:51 2014

comment 15

Joey, thanks for fixing that so quickly. Indeed it works in the webapp; I'll check the CLI version as soon as possible.

I just released the new remote on https://github.com/Schnouki/git-annex-remote-hubic. It's for hubiC, a French personal cloud storage service made by OVH that has free 25GB accounts and only charges €10/month for 10TB (VAT included). It's really experimental (hacked it over the weekend), but it seems to work for me so far.

Comment by Schnouki — Tue Feb 11 13:44:10 2014

Stream encoding

What encoding is used for the stdin/stdout streams used to communicate with remotes?

Comment by sjvdwalt — Tue Aug 25 00:36:24 2015

WHEREIS -- is it better to just report failure to avoid duplicates?

I wonder how I should use this new API (WHEREIS). For my special remote, which supports extracting content from archives, it just seems to duplicate the whereis information. If I make it reply with the same url (which is not "public" per se, i.e. can't be used by annex directly), I just get it duplicated:

$> git annex whereis simple.txt
whereis simple.txt (1 copy) 
    82025765-5cac-4571-91ed-637620ec6fc7 -- [annexed-archives]

  annexed-archives: dl+archive:SHA256E-s173--5df2eeab61ea7d6479533d4e6b07c6bcfae46e040cad8cb1fc579f9f18c90790.tar.gz/a/d/%20%22%27%3Ba%26b%26cd%20%60%7C%20
  annexed-archives: dl+archive:SHA256E-s173--5df2eeab61ea7d6479533d4e6b07c6bcfae46e040cad8cb1fc579f9f18c90790.tar.gz/a/d/%20%22%27%3Ba%26b%26cd%20%60%7C%20
ok

If I "explain" it a bit, the output is still somewhat duplicated:

annexed-archives: file a/d/%20%22%27%3Ba%26b%26cd%20%60%7C%20 within archive SHA256E-s173--5df2eeab61ea7d6479533d4e6b07c6bcfae46e040cad8cb1fc579f9f18c90790.tar.gz
annexed-archives: dl+archive:SHA256E-s173--5df2eeab61ea7d6479533d4e6b07c6bcfae46e040cad8cb1fc579f9f18c90790.tar.gz/a/d/%20%22%27%3Ba%26b%26cd%20%60%7C%20

But if I just reply with "WHEREIS-FAILURE" it becomes more sensible (no duplicates). In that case, though, I feel the documentation for this feature should be adjusted to say that it only complements information already known to annex, rather than being meant to "provide any information about ways to access the content of a key stored in it". Or have I missed the point?

Comment by EbvxpTI_xP9Aod7Mg4cwGhgjrCrdM5s- [me.yahoo.com/a] — Wed Aug 26 14:22:49 2015

re: Stream encoding

@sjvdwalt, git-annex does not specify or expect any character encoding to be used for this protocol. A robust external special remote shouldn't assume any particular character encoding, either.

Lines will be terminated with '\n' (0xA), and words in lines are delimited by an ascii space (0x20). The keywords in the protocol are ascii too of course. Any values can contain an arbitrary sequence of bytes that may or may not be able to be decoded using the current character encoding.

IIRC, character encodings like UTF8 that encode a character to multiple bytes never use bytes in the range 0x00 to 0x7F when doing so. So, every ascii space and newline are unambiguously such, and it's safe to split on them even though no encoding is specified.
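As an illustration, a remote could parse each line as raw bytes and never decode the values. This is only a sketch; the function name `split_message` is made up, and the caller is assumed to know how many parameters the command takes.

```python
from typing import List, Tuple


def split_message(line: bytes, nparams: int) -> Tuple[bytes, List[bytes]]:
    """Split one protocol line into its command and exactly nparams parameters.

    Splitting happens on the ascii space byte only, and at most nparams
    times, so the last parameter keeps any spaces (and any undecodable
    bytes) it contains.
    """
    if line.endswith(b"\n"):
        line = line[:-1]
    if nparams == 0:
        return line, []
    parts = line.split(b" ", nparams)
    command, params = parts[0], parts[1:]
    if len(params) != nparams:
        raise ValueError("expected %d parameters in %r" % (nparams, line))
    return command, params
```

For example, a `TRANSFER` request takes three parameters, so a filename containing spaces or non-UTF8 bytes arrives intact as the third one.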

Comment by joey — Wed Sep 9 20:44:38 2015

re: WHEREIS -- is it better to just report failure to avoid duplicates?

There's no point in implementing WHEREIS if it's going to reply with the same values that are passed to SETURIPRESENT.

Some special remotes may not need to use SETURIPRESENT to work at all, and yet storing data on the remote makes it available from some public url. This is the kind of situation where it makes sense to implement WHEREIS.

Comment by joey — Wed Sep 9 21:03:13 2015

but which of the 3?

Could you say explicitly which of these 3 choices I should make when reacting to WHEREIS in my (archives) case?

report the same url
"spell it out"
WHEREIS-FAILURE

Or should I really not implement it at all? (We are still at VERSION 1, so I thought that not implementing it might lead to some undesired side-effects.)

Comment by EbvxpTI_xP9Aod7Mg4cwGhgjrCrdM5s- [me.yahoo.com/a] — Thu Sep 10 03:16:17 2015

comment 21

As the documentation says, it's fine to not implement WHEREIS if you don't need it.

Comment by joey — Thu Sep 10 16:33:10 2015

Local storage of creds

I have a question about local storage of credentials. I assumed that when creds were stored in the repo (because the remote is encrypted or because embedcreds=yes), they wouldn't be stored locally in .git/annex/creds. But it seems they are stored locally, in plaintext, regardless.

Is there a way to prevent this? Ideally, credentials should not be stored plaintext at all...but maybe there's a technical issue I'm not seeing.

Comment by szrc — Mon Apr 4 03:59:01 2016

comment 23

@szrc, it's pretty expensive to pull encrypted creds out of the git repository and run gpg to decrypt them. Doing so also tends to result in a gpg password prompt.

Rather than do that every time git-annex needs the creds to access the remote, it maintains a local cache file, which has its permissions set so only the local user (and root, naturally) can read it.

Comment by joey — Mon Apr 4 20:17:05 2016

comment 24

Thanks for the response, Joey. It seems to me that many/most operations for which credentials are needed will require a gpg prompt anyway, but I can see why it might be too expensive in some cases.

Anyway, if you ever saw fit to add the option to disable or limit local caching, I would definitely use it -- and I'm guessing I'm not the only one who would prefer not to store credentials in plaintext.

Thanks for all of your great work on git-annex.

Comment by szrc — Tue Apr 5 01:18:53 2016

comment 25

Is there a reason that DIRHASH in type=external uses a different format (mixed-case) from that used by type=directory (lower-case)? Or, could there be e.g. DIRHASH LOWER to select between the formats?

I'm writing an external remote for SMB filesystem access and I'd like its storage to be usable via type=directory in case of emergencies or other reasons, but with different dirhash layouts that's not going to work, I assume...

Comment by grawity — Tue May 3 06:25:02 2016

comment 26

I don't think there's any particularly good reason why DIRHASH uses the mixed case format. However, it can't be changed without busting existing stuff.

So yeah, I've gone ahead and added a DIRHASH_LOWER.

Comment by joey — Tue May 3 17:29:02 2016

comment 27

Thanks. It'll probably be safer for my use case of storing data on Windows network shares than the mixed-case version.

(Speaking of "too late to change", DIRHASH-LOWER with a dash might be more consistent with the existing responses?)

Comment by grawity — Tue May 3 18:02:03 2016

comment 28

Agreed, DIRHASH-LOWER is more consistent, changed it to that.

Comment by joey — Tue May 3 18:09:50 2016

Retrieval progress message helpers

It would be nice if:

- progress could be provided as a percentage rather than in bytes, when that's all that's available
- git-annex could inform the special remote how many bytes a file to be downloaded is, if that information is already available

Comment by xloem — Fri May 20 19:42:23 2016

comment 30

You can find out the size of a key by using git-annex examinekey $key --format='${bytesize}\n' (There's a --batch option to avoid needing to spin up repeated such processes.)

I suppose I could add a PROGRESSPERCENT, but any version of git-annex that didn't support it would fail with a protocol error if a special remote tried to use that. So, the special remote would need to check git-annex version to use it.

So, maybe better to get the size of the key yourself, and convert the percentage to bytes for PROGRESS.
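A small Python sketch of that conversion. The helper name `progress_line` is hypothetical, and the key size is assumed to have been obtained already, for example from `git-annex examinekey` as described above.

```python
def progress_line(key_size: int, percent: float) -> str:
    """Turn a percentage into the byte count that PROGRESS expects."""
    percent = min(max(percent, 0.0), 100.0)  # clamp out-of-range values
    return f"PROGRESS {int(key_size * percent / 100)}"
```

So a backend that only reports "50% done" for a 48796 byte key would send `PROGRESS 24398`.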

Comment by joey — Mon May 23 18:54:35 2016

comment 31

What exactly is the difference between SETURIPRESENT and SETURLPRESENT?

Comment by Ilya_Shlyakhter — Wed Sep 19 11:01:31 2018

comment 32

Some questions about CHECKPRESENT Key: (1) if Key is a URL backend key, should this return true if CHECKURL on the URL would return CHECKURL-CONTENTS? (2) Should the external special remote implementation call GETURLS on the key and return true if CHECKURL would return CHECKURL-CONTENTS for any of the URLs? (3) Calling GETURLS on a URL key returns an empty list; shouldn't it return a one-element list containing the included URL (at least if a CHECKURL call on that URL would return CHECKURL-CONTENTS)?

Comment by Ilya_Shlyakhter — Wed Sep 19 18:19:22 2018

comment 33

@Ilya_Shlyakhter,

(1) CHECKURL is only used by git-annex add/importfeed when adding a new url. So it does not need to be consistent with CHECKPRESENT, though it would probably make sense for it to be in most cases.
(2) I guess you're asking if it should do that in its CHECKPRESENT implementation. CHECKPRESENT needs to use some method to actively verify that the remote currently contains the content of the key. It doesn't necessarily need to use a recorded url.
(3) GETURLS looks at information stored in the git-annex branch. If the url key has been added to the repository with git annex add then its url will be stored there, but if you just generated an url key, it doesn't necessarily have anything stored about it in the git-annex branch.

Comment by joey — Mon Sep 24 15:35:54 2018

question about special remote protocol

What exactly is the difference between SETURIPRESENT and SETURLPRESENT?

Comment by Ilya_Shlyakhter — Wed Sep 26 17:16:13 2018

comment 35

Use SETURLPRESENT when it's a regular http url. This allows git-annex to download from the remote in readonly mode without needing the external special remote program to be installed at all.

Use SETURIPRESENT for things that are not http urls, but that it makes sense for git annex whereis to display.
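In code, the choice might look like the following Python sketch. The helper name `present_message` is made up, and treating https the same as http is an assumption.

```python
from urllib.parse import urlsplit


def present_message(key: str, location: str) -> str:
    """Pick SETURLPRESENT for web urls and SETURIPRESENT for everything else."""
    scheme = urlsplit(location).scheme.lower()
    # Assumption: https counts as a "regular http url" here.
    command = "SETURLPRESENT" if scheme in ("http", "https") else "SETURIPRESENT"
    return f"{command} {key} {location}"
```

A custom scheme such as the `dl+archive:` uris mentioned earlier in this thread would therefore go through SETURIPRESENT.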

Comment by joey — Thu Oct 4 18:17:57 2018

comment 36

What is the intended use case for CHECKURL-MULTI? Are there examples of external special remote implementations that use this response?

Comment by Ilya_Shlyakhter — Fri Oct 19 15:52:21 2018

comment 37

@Ilya, the uses for CHECKURL-MULTI are open-ended; a bittorrent file that lists several files is one example of how it's used.

Comment by joey — Mon Oct 29 19:39:01 2018

PREPARE-LOCAL

"Note that users expect git annex whereis to run fast, without eg, network access"

Currently, git-annex spins up a remote process for every git annex whereis command that involves a file present on the remote (w/o chunking & encryption). As most remotes establish their network connection during the PREPARE phase, the command is slowed down, especially on a bad internet connection. So I propose an extension PREPARE-LOCAL that tells the remote to get all necessary config information but skip the networking.

Alternatively, the remotes could wait to establish network connection until the first transfer command is sent but I think something like PREPARE-LOCAL would be the cleaner solution.

Comment by lykos — Tue Jan 15 15:47:39 2019

Re: PREPARE-LOCAL

There is a difference between a WHEREIS that for some reason itself hit the network, and a single network connection in PREPARE. The language was really talking about the former, which would make whereis on a large number of files painful. Not saying it wouldn't be better to avoid the latter too; if the user is only running whereis on 1 file the overhead is equally as bad.

Hmm, there is that "long running network connections" section that encourages using PREPARE that way, I think the idea was to make it as simple as possible to implement an external remote. All of git-annex's built-in remotes defer anything like that until it's needed.

In a way the real problem here is that WHEREIS is something most remotes will never need to implement, but it's queried of all of them. If only the few remotes that implement it needed to avoid network connections in PREPARE, that would not be much trouble to do.

PREPARE-LOCAL would need to be a protocol extension, so special remotes would have to be modified to request it, and those that are not modified would still have the overhead. Would that be any more likely to happen/easier to do than modifying all special remotes to defer network connections until needed?
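For what it's worth, deferring the connection can be a small change in a remote. A Python sketch, where the class name `LazyConnection` and the `connect` callable are hypothetical:

```python
class LazyConnection:
    """Open the network connection on first use instead of during PREPARE."""

    def __init__(self, connect):
        self._connect = connect  # callable that opens the real connection
        self._conn = None

    def get(self):
        """Return the connection, opening it only the first time it is needed."""
        if self._conn is None:
            self._conn = self._connect()
        return self._conn
```

PREPARE would then only construct the object and read config, while the TRANSFER, CHECKPRESENT and REMOVE handlers call `get()`. A whereis query never touches the network.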

Comment by joey — Wed Jan 16 18:24:08 2019

Re: PREPARE-LOCAL

I've thought up a way to solve this problem, it's in external remote querying transition.

Comment by joey — Thu Jan 17 15:49:05 2019

Empty lines sent by git-annex to an external special remote

I am implementing a special remote using https://github.com/Lykos153/AnnexRemote

I found that annexremote leaves the readline() loop once it receives an empty line from git-annex over stdin (https://github.com/Lykos153/AnnexRemote/blob/master/annexremote/annexremote.py#L410). Given that the protocol description says nothing about empty lines this may or may not be sensible. However, I also found that git-annex does (sometimes?) send empty lines. Here is an excerpt:

% git annex --debug copy --to inm7 down/cope1.feat/report_log.html
...
[2019-05-14 07:45:43.350358695] git-annex-remote-inm7[1] <-- TRANSFER STORE MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html .git/annex/objects/mj/Kx/MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html/MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html
<<INCOMING ''
(repeated a total of 22 times)
...
<<INCOMING 'TRANSFER STORE MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html .git/annex/objects/mj/Kx/MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html/MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html\n'

where '<<INCOMING' reports anything that is read via readline() from the special remote's stdin. So 22 empty lines are read from stdin before the TRANSFER STORE reported by git-annex to have been sent actually appears.

If I drive the remote implementation "by hand" I don't see anything wrong, and no non-protocol output:

% git-annex-remote-inm7
VERSION 1
EXTENSIONS INFO
EXTENSIONS
PREPARE
GETCONFIG dataset_uuid
VALUE dda42d6c-7231-11e9-a901-0050b6902ef0
PREPARE-SUCCESS
CHECKPRESENT MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html
DIRHASH MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html
VALUE mj/Kx/
CHECKPRESENT-FAILURE MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html
TRANSFER STORE MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html .git/annex/objects/mj/Kx/MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html/MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html
DIRHASH MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html
VALUE mj/Kx/
TRANSFER-SUCCESS STORE MD5E-s48796--33a9884cc35c891f693958b9dd7fccd6.html

If git-annex runs it, I see empty lines appearing only after CHECKPRESENT-FAILURE.

What purpose do these empty lines serve? Under what conditions would an empty response from git-annex occur?

Thx!

Comment by michael.hanke — Tue May 14 06:36:11 2019

Re: Empty lines sent by git-annex to an external special remote

Sorry for the noise. The reason was that my special remote was calling SSH internally and the parent's stdin was connected to the SSH process -- which confused all involved parties.
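For anyone hitting the same problem, one way to avoid it in a Python remote is to give every child process its own stdin and stdout. This is a sketch; the helper name `run_helper` is made up.

```python
import subprocess


def run_helper(argv):
    """Run a child process without letting it touch the protocol streams."""
    return subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,  # child sees EOF, not git-annex's messages
        stdout=subprocess.PIPE,    # child output stays off the protocol stdout
        check=False,
    )
```

With ssh specifically, passing `-n` has a similar effect on stdin.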

Comment by michael.hanke — Tue May 14 08:48:27 2019

comment 43

Notice that the --debug flag shows how this protocol is flowing in both directions.

<-- TRANSFER STORE ...
--> TRANSFER-SUCCESS STORE ...

The first line was sent from git-annex to the external special remote program, and the second line was its reply to git-annex.

This is very handy for understanding when something strange seems to be happening with the protocol.
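When there is a lot of debug output, a few lines of Python can separate the two directions. This sketch assumes the line layout seen in the excerpt earlier in this thread (timestamp in brackets, program name, arrow, message); other git-annex versions may format debug output differently, and the function name `classify` is made up.

```python
import re

# Assumed layout: [timestamp] program[n] <-- MESSAGE
DEBUG_LINE = re.compile(r"^\[([^\]]+)\] (\S+) (<--|-->) (.*)$")


def classify(line: str):
    """Return (direction, message) for a protocol debug line, else None."""
    match = DEBUG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    arrow, message = match.group(3), match.group(4)
    direction = "to-remote" if arrow == "<--" else "from-remote"
    return direction, message
```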

Comment by joey — Tue May 14 19:27:11 2019

should there be FSCK and/or CLEAN?

An external remote could crash or get interrupted in the middle of its operation, leaving some corrupt or temporary files behind. How and when should it run diagnostics and clean up after itself?

Comment by yarikoptic — Sat Sep 28 00:31:51 2019

re: should there be FSCK and/or CLEAN?

You never need to worry about cleaning up the files that git-annex asks you to download content to.

git-annex also has a per-key temporary work directory that could perhaps be useful for this interface to expose; it manages cleanup of those temporary directories at the right times.

You can detect when git-annex closes the external program's stdin, and perform whatever shutdown cleanup you want to then. But of course there could be other git-annex processes also running other instances of the same external program, so you'd have to find a way to avoid deleting any files those are using.
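A sketch of that shutdown detection in Python, with hypothetical `handle` and `cleanup` callables supplied by the remote:

```python
def serve(stream, handle, cleanup):
    """Process protocol lines until the stream reaches EOF, then clean up.

    Iterating over the stream ends when git-annex closes the program's
    stdin; the finally clause also runs cleanup if a handler raises.
    """
    try:
        for line in stream:
            handle(line.rstrip("\n"))
    finally:
        cleanup()
```

In a real remote this would be called as `serve(sys.stdin, ...)`. To stay safe with concurrent instances, `cleanup` should only remove files that this particular process created, for example by keeping them in a directory named after its own pid.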

Comment by joey — Sat Sep 28 16:38:14 2019

Triggering which stage and when?

Hi,

I would like to better understand how all the stages interact in a special remote. Once a special remote is initialized, how is the PREPARE stage triggered? I have not managed to trigger it so far.

Moreover, is the program continuously running after initialization? It seems to exit with an 'ok, recording state in git...'

So, if I run the program 'by hand' and talk to my special remote via VERSION, INFO, PREPARE, the remote succeeds in what it is meant to do. I am having trouble making the same happen with git-annex under the hood.

Thank you for your help,

Best Giulia

Comment by giuly.ippoliti — Thu Oct 24 20:30:10 2019

Resuming an interrupted download

It would be helpful to allow special remotes to take advantage of git annex's ability to resume interrupted downloads for large files, especially on slow/unreliable connections. One way to implement this would be to allow the special remote to send a message asking git-annex what offset it intends to read at, then write a sparse file with only the needed data. I notice the testremote suite includes tests for resuming downloads at an offset, so it is possible no other changes would be needed.

Sparse files could be avoided by allowing the special remote to send a command indicating the offset at which the target file starts.

Does that sound like a reasonable design?

Comment by alex — Fri Oct 1 23:27:31 2021

Re: Resuming an interrupted download

It's already possible, and in fact very easy to support resuming downloads. When the filename that you're asked to download to already exists, you can simply check its size and resume downloading to it where the previous download left off.

When you send PROGRESS, the value should be the same as the current size of the file. That is always the case really, but the distinction between file size and amount you've downloaded only matters when resuming.
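Roughly, in Python, resuming could look like this. It is a sketch under assumptions: `fetch_from(offset)` stands for whatever ranged read the remote's backend offers and `report(line)` for however the remote sends a protocol line. Both are hypothetical.

```python
import os


def resume_download(path, fetch_from, report):
    """Append the missing tail of an object to path, reporting progress.

    The offset starts at the current size of the file, so PROGRESS always
    carries the file size rather than the bytes fetched in this run.
    """
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    with open(path, "ab") as out:
        for chunk in fetch_from(offset):
            out.write(chunk)
            offset += len(chunk)
            report(f"PROGRESS {offset}")
    return offset
```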

Another way to support resuming, when talking to an API that does not, is to encourage users of your remote to configure chunking with a small enough chunk size. git-annex will then handle resuming by re-starting on the last incomplete chunk. In this case, you'll be downloading each chunk to a separate file, so you will not need to do anything to support resuming.

If a remote does any kind of out of order downloading (like bittorrent does), it needs to avoid writing to the file out of order, with holes in the middle of it. Such holes would mess up a resume of the download of the same object by another remote.

Comment by joey — Tue Oct 5 16:01:10 2021

Re: Resuming an interrupted download

Thanks, that works great!

Comment by alex — Fri Oct 8 03:06:57 2021

DIRHASH ending in slash?

DIRHASH-LOWER (and I assume the other DIRHASH commands as well) seem to respond with a path ending with a slash. So VALUE abc/def/ instead of the VALUE abc/def example mentioned. git-annex-remote-rclone actually assumes the response ends with a slash. Is this indeed what git-annex guarantees? If so, it should probably be documented in this specification.

Comment by jeroen — Wed Sep 28 11:58:56 2022

Re: DIRHASH ending in slash?

Hmm, so it does. This would not have been my choice for ideal behavior, but I don't want to break things that depend on it now. So I've updated the documentation.
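A remote that does not want to depend on the trailing slash either way can normalize the value before using it. A Python sketch, where the helper name `object_path` and the layout (key stored directly under the dirhash directory) are assumptions:

```python
def object_path(dirhash_value: str, key: str) -> str:
    """Join a DIRHASH reply and a key, with or without a trailing slash."""
    return dirhash_value.rstrip("/") + "/" + key
```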

Comment by joey — Fri Sep 30 17:31:25 2022

Re: Status of the import/export protocol implementation

It's still only a draft. I've been waiting for someone to have a use case that needs it, so I can work with them in implementing it and making sure it all makes sense.

I don't expect that implementing it on the git-annex side will be very hard.

Comment by joey — Fri Apr 7 16:53:17 2023

Multi-line string in WHEREIS-SUCCESS?

Is it possible to somehow make git annex whereis show the response of the special remote to WHEREIS over multiple lines? Just including newlines obviously results in an error, since that ends the WHEREIS-SUCCESS message.

I am implementing a special remote for which the data is fully described by what is essentially a json-encoded request to a third-party API, and I would like to show this json string pretty-printed over multiple lines in the whereis output, instead of as a single line.

Comment by matrss — Wed Feb 21 12:29:14 2024

support for bulk write/read/test remote

Hi,

I'm wondering whether there is any easy way to delay "progress reporting" (a.k.a. "report progress for ALL transfer_store operations ONCE", a.k.a. "bulk transfer") for a special remote?

What I'm trying to achieve: there is an archiver called dar, which I would like to implement a special remote for. It can write many files into a single archive and also supports incremental/differential backups. One can create an archive with this utility by providing a list of files or directories as params.

The problem with the current git annex special remote API is that it does not allow a special remote to defer reporting the transfer result for ALL keys/files (e.g. with transfer_store) and then check and report the result ONCE for ALL files at the end of the process. Ideally, the protocol should have some kind of "write test" command to check the written archive for errors, and only then report the transfer as "successful".

What I was thinking of is to just write all files into a temporary list during transfer_store, and then externally archive this list of files after git annex copy --to dar-remote is done. But it seems git annex will think that writing the files to that remote was successful, while it may not have been (e.g. a file access error happened, or an archive was corrupted, etc).

How can it be achieved? Do we need to extend git annex with another protocol extension? How difficult would it be, and where to start? I suppose there is no way Joey or anyone else will work on it any time soon if there is no workaround, and I have to submit a patch?

P.S.: I've seen the async extension, but it seems to be tied to threads, which most likely won't allow achieving the described goals.

Comment by psxvoid — Tue Apr 2 06:41:25 2024
