motivation

The P2P protocol is a custom protocol that git-annex speaks (mostly) over an ssh connection. This is a design for supporting the P2P protocol over HTTP.

git-annex does not currently support uploading annex objects to git remotes that use http, and that would be a generally very useful addition.

For use cases such as OpenNeuro's javascript client, ssh is too difficult to support, so they currently use a special remote that talks to an http endpoint in order to upload objects. Implementing this design would let them talk to git-annex over http instead.

With the passthrough proxy, this would let clients configure a single http remote that accesses a more complicated network of git-annex repositories.

integration with git

A webserver that is configured to serve a git repository either serves the files in the repository with dumb http, or uses the git-http-backend CGI program for url paths under eg /git/.

To integrate with that, git-annex would need a git-annex-http-backend CGI program that the webserver is configured to run for url paths under /git/.*/annex/.

So, for a remote with a url http://example.com/git/foo, git-annex would use paths under http://example.com/git/foo/annex/ to run its CGI.
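The path derivation described above could be sketched in a few lines. This is only an illustration of the layout from the example; the helper name is hypothetical and not part of git-annex:

```python
def annex_endpoint(remote_url: str) -> str:
    # Hypothetical helper: derive the url path that the
    # git-annex-http-backend CGI would be served under,
    # given the remote's git url.
    return remote_url.rstrip("/") + "/annex/"

print(annex_endpoint("http://example.com/git/foo"))
# http://example.com/git/foo/annex/
```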

But, the CGI interface is a poor match for the P2P protocol.

A particular problem is that LOCKCONTENT would need to be one CGI request, followed by another request for UNLOCKCONTENT. Unless git-annex-http-backend forked a daemon to keep the content locked, it would not be able to retain a file lock across the two requests. While the 10 minute retention lock would paper over that, UNLOCKCONTENT would not be able to delete the retention lock, because there is no way to know whether another LOCKCONTENT was received later. So LOCKCONTENT would always lock content for 10 minutes, which would result in some undesirable behaviors.

Another problem is with proxies and clusters. The CGI would need to open ssh (or http) connections to the proxied repositories and cluster nodes each time it is run. That would add a lot of latency to every request.

And running a git-annex process once per CGI request would add git-annex's own startup time, which is ok but not great, as latency to every request. Also, each time the CGI changed the git-annex branch, it would have to commit on shutdown. Lots of time and space optimisations would be prevented by using the CGI interface.

So, rather than having the CGI program do anything in the repository itself, have it pass each request through to a long-running server. (This does have the downside that files would get double-copied through the CGI, which adds some overhead.) A reasonable way to do that would be to have a webserver speaking an HTTP version of the git-annex P2P protocol, with the CGI just talking to that.

The CGI program then becomes tiny, and just needs to know the url to connect to the git-annex HTTP server.

Alternatively, a remote's configuration could include that url, and then we don't need the complication and overhead of the CGI program at all. Eg:

git config remote.origin.annex-url http://example.com:8080/

So, the rest of this design will focus on implementing that. The CGI program can be added later if desired, to avoid users needing to configure an additional thing.

Note that one nice benefit of having a separate annex-url is that it allows remote.origin.url to point at eg github, while the configured annex-url lets that remote also be used as a git-annex repository.

approach 1: websockets

The client connects to the server over a websocket. From there on, the protocol is encapsulated in websockets.

This seems nice and simple to implement, but not very web native. Anyone wanting to talk to this web server would need to understand the P2P protocol. Just to upload a file, a client would need to deal with AUTH, AUTH-SUCCESS, AUTH-FAILURE, VERSION, PUT, ALREADY-HAVE, PUT-FROM, DATA, INVALID, VALID, SUCCESS, and FAILURE messages. That seems like a lot.

Some requests like LOCKCONTENT do need full duplex communication like websockets provide. But, it might be more web native to only use websockets for that request, and not for everything.

approach 2: web-native API

Another approach is to define a web-native API with endpoints that correspond to each action in the P2P protocol.

Something like this:

> POST /git-annex/v1/AUTH?clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925 HTTP/1.0
< AUTH-SUCCESS ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6

> POST /git-annex/v1/CHECKPRESENT?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
< SUCCESS

> POST /git-annex/v1/PUT-FROM?key=SHA1--foo&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
< PUT-FROM 0

> POST /git-annex/v1/PUT?key=SHA1--foo&associatedfile=bar&put-from=0&clientuuid=79a5a1f4-07e8-11ef-873d-97f93ca91925&serveruuid=ecf6d4ca-07e8-11ef-8990-9b8c1f696bf6 HTTP/1.0
> Content-Type: application/octet-stream
> Content-Length: 20
> foo
> {"valid": true}
< {"stored": true}

(In the last example above "foo" is the content, and it is followed by a line of json. This seems better than needing an entire separate request to indicate validity.)
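The content-plus-json-trailer body could be assembled and split back apart with a few lines of code. A sketch, assuming (as in the example above) that the validity indicator is a single trailing line of json; the function names are illustrative:

```python
import json

def encode_put_body(content: bytes, valid: bool = True) -> bytes:
    # Append the validity indicator as a trailing line of json,
    # as in the PUT example above.
    return content + b"\n" + json.dumps({"valid": valid}).encode()

def decode_put_body(body: bytes):
    # Split the trailing json line back off the object content.
    # The json document itself contains no newline, so the last
    # newline in the body is the separator.
    content, _, trailer = body.rpartition(b"\n")
    return content, json.loads(trailer)

body = encode_put_body(b"foo")
content, validity = decode_put_body(body)
```

Note that this framing only works because the json trailer is guaranteed to be a single line; the object content itself may contain newlines without ambiguity, since the split is on the last newline.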

This needs a more complex spec. But it's easier for others to implement, especially since it does not need a session identifier, so the HTTP server can be stateless.

A full draft protocol for this is being developed at draft1.

HTTP GET

It should be possible to support a regular HTTP get of a key, with no additional parameters, so that annex objects can be served to other clients from this web server.

> GET /git-annex/key/SHA1--foo HTTP/1.0
< foo

Although this would be a special case, not used by git-annex, because the P2P protocol's GET has the complication of offsets, and of the server sending VALID/INVALID after the content, and of needing to know the client's UUID in order to update the location log.
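A client fetching an object this way would only need to construct a simple url. A sketch, using the path layout from the example above; keys are escaped in case they contain url-unsafe characters:

```python
from urllib.parse import quote

def key_url(base: str, key: str) -> str:
    # Build the plain-GET url for an annex object,
    # following the /git-annex/key/ path from the example.
    return base.rstrip("/") + "/git-annex/key/" + quote(key)

print(key_url("http://example.com", "SHA1--foo"))
# http://example.com/git-annex/key/SHA1--foo
```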

Problem: CONNECT

The CONNECT message allows both sides of the P2P protocol to send DATA messages in any order. This seems difficult to encapsulate in HTTP.

Probably this does not need to be implemented, since it's probably not needed for an HTTP remote. CONNECT is used to tunnel the git protocol over the P2P protocol, but for an HTTP remote the git repository can be accessed over HTTP as well.

security

Should support HTTPS and/or be limited to only HTTPS.

Authentication via http basic auth?