Being able to connect repositories peer to peer is nice, but only having tor as an option is quite limiting, especially considering that tor isn't that suitable for large file transfers. It would be nice if git-annex could be taught to use other transports as well. What I have in mind is yggstack or fowl, but there are probably others that could be used as well. (For fowl I had already opened a todo in the past: https://git-annex.branchable.com/todo/Peer_to_peer_connection_purely_over_magic-wormhole/)
What I am thinking would be nice to have for this is:
- Something like `git annex enable-p2p-socket`, which would configure the repository such that `git annex remotedaemon` listens on a unix socket somewhere under .git/annex for incoming p2p connections, which would be authenticated using the pairing process from `git annex p2p` just like when using the tor transport.
- A git remote helper `p2p-annex::<path-to-socket-file>`, which would connect to the unix socket and speak the p2p protocol with it.
With these two things in place it would be possible to use any transport to connect the socket files on two systems, including yggstack, fowl, or just netcat or socat (though unencrypted communication would be a bad idea).
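As a rough illustration of that last point, two such socket files could be bridged over a plain TCP link with socat. This is only a sketch: the function names are hypothetical, the socket paths, host name, and port are placeholders, and the link is unencrypted, so it would only make sense on top of a transport that already provides encryption.

```shell
#!/bin/sh
# Sketch only: paths, host and port are placeholders, not git-annex defaults.
SOCAT=${SOCAT:-socat}   # overridable, e.g. SOCAT=echo for a dry run

# Server side: accept TCP connections on port $1 and forward each one
# to the unix socket at path $2.
expose_socket() {
    "$SOCAT" "TCP-LISTEN:$1,fork,reuseaddr" "UNIX-CONNECT:$2"
}

# Client side: create a unix socket at path $1 and forward each
# connection made to it to host $2, port $3.
import_socket() {
    "$SOCAT" "UNIX-LISTEN:$1,fork" "TCP:$2:$3"
}
```

The server would run something like `expose_socket 12345 <path-to-socket-file>`, and the client `import_socket <local-socket-file> <hostname> 12345`, after which the local socket file behaves like the remote one.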
My understanding is that the current tor p2p support is essentially a special case of the above, using a socket file in /var/lib/tor-annex and requiring a hidden service configuration in torrc on the server-side, while being limited to onion addresses on the client-side. In that sense this would just be a generalization and I think most of the code to support this is already there, and just needs to be wired differently.
This should also make it possible to build e.g. a `git annex enable-yggstack` and `yggstack-annex::<pubkey>.pk.ygg` remote in terms of `enable-p2p-socket` and `p2p-annex::`, even outside of git-annex itself.
What do you think?
`git annex shell p2pstdio`, but without skipping authentication, instead using the `git annex p2p` pairing process. Something like socat could then be used to connect those stdin/stdout's to a unix socket, tcp port, or whatever else.

This was all designed to be generalizable to some degree, but has so far really only been used for tor.
Making it generic may be a good idea. Or it may be that there are really too many complications around how different p2p networks and addresses work and how authentication is done, that would complicate a generic command, but that can be transparently handled when implementing support for a specific p2p transport, as was done for tor.
Working from the client end, the git remote has a url, which needs to be identified as a p2p address to use a p2p transport to talk to it. Currently that is a url starting with "tor-annex:". Like you suggest, the generic one could be "p2p-annex::". Or it could be "p2p-annex::foo+" which causes git-annex to run a command like `git-annex-p2p-foo <bar>` and talk to its stdin and stdout.

That's for outgoing connections. For incoming connections, for tor, the remotedaemon looks to see if the socket file exists and if so it accepts connections from it. (That tor socket is not used for outgoing connections.) It would be easy to generalize this to additional socket filenames. Eg, a remote with uuid U could use `.git/annex/p2p/U` as its socket file.

BTW, that git-annex-p2p-foo command is different from the git remote helper you suggest, which corresponds to git-remote-tor-annex. But, git-remote-tor-annex would easily generalize to a git-remote-p2p-annex git remote helper, if there was a generic p2p-annex url type and a way to connect to it.
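To make the url handling concrete, here is a minimal sketch of how a "p2p-annex::foo+<bar>" url could be split into the helper command and its argument, and how a per-remote socket path could be formed. Both function names are hypothetical, and the url scheme and socket location are only the suggestions made above, not anything git-annex implements.

```shell
#!/bin/sh
# Sketch only: neither function exists in git-annex.

# Print the helper command line for a url of the form
# p2p-annex::<transport>+<address>; fail for any other url.
p2p_helper_for_url() {
    url=$1
    case $url in
        p2p-annex::*+*) ;;
        *) return 1 ;;
    esac
    rest=${url#p2p-annex::}     # strip the scheme prefix
    transport=${rest%%+*}       # text before the first "+"
    addr=${rest#*+}             # text after the first "+"
    [ -n "$transport" ] || return 1
    printf 'git-annex-p2p-%s %s\n' "$transport" "$addr"
}

# Print the suggested socket file for a remote with the given uuid.
p2p_socket_for_uuid() {
    printf '.git/annex/p2p/%s\n' "$1"
}
```

With this, `p2p_helper_for_url 'p2p-annex::foo+bar'` prints `git-annex-p2p-foo bar`, while a "tor-annex::" url is rejected so it can be handled by the existing code path.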
If the P2P protocol's AUTH is provided with an AuthToken, there would need to be an interface to record the one to use for a given p2p connection.
`git-annex p2p` handles setting up AuthTokens, but its approach may or may not make sense for a given p2p protocol. It does look like, if there's a generic way implemented to connect to a given p2p-annex url, `git-annex p2p` would mostly work. But there would need to be a way to generate an address using such a url, like `git-annex enable-tor` does.

Seems pretty close to a workable design to me, but I don't know how well it will match up with these various kinds of P2P networks.
Your comment seems to be wrongly formatted. It was shown correctly in the notification mail, but doesn't show up here.
Just to document what I have tried out, for completeness: with what is already in place it is possible to connect two repositories over yggstack, it is just very awkward.
On one system you can do:
- `sudo mkdir /etc/tor && sudo touch /etc/tor/torrc` (without actually having tor installed)
- `sudo git annex enable-tor $(id -u)`
- `yggstack -genconf > yggstack.conf`
- `echo tor-annex::<pubkey>.pk.ygg:12345` (take the pubkey out of yggstack.conf)
- `socat TCP-LISTEN:12345,fork,reuseaddr UNIX-CONNECT:/var/lib/tor-annex/<uid>_<repo-uuid>/s`
- `yggstack -useconffile yggstack.conf -remote-tcp 12345:127.0.0.1:12345`
- `git annex p2p --gen-addresses`

On the other system do:
- `yggstack -autoconf -socks 127.0.0.1:9050`
- `git annex p2p --link` and paste in the generated address when asked (it should have the form `tor-annex::<pubkey>.pk.ygg:12345:<auth-token>`)

On the server side this simply exposes the p2p socket generated for tor through a different means, and on the client side this works because yggstack can be used similarly enough to tor (doing name resolution through the socks proxy at port 9050 and then connecting to the supplied port).
I really like your proposal of a `p2p-annex::foo+<whatever>` remote; together with a way to tell remotedaemon to start a process exposing the socket it would make for an easily extendable mechanism. Imagine this:

Client side:
- `p2p-annex::foo+<addr>` would start `git-annex-p2p-foo <addr>` and talk to its stdin/stdout.

Server side:
- `annex.start-p2psocket=true` would instruct remotedaemon to listen on .git/annex/p2psocket (I think a hardcoded location is fine, as there only really needs to be one such socket even with multiple networks, and somewhere under .git/annex is a good location to associate it with the repository and will always be writable by the user).
- `annex.expose-p2p-via=foo` could be supplied zero, one, or multiple times, and each of these configurations would instruct remotedaemon to start the external program git-annex-p2ptransport-foo after the p2p socket is ready (this configuration could also just point to a command to execute, but I thought it might be nice to stay with the theme of commonly prefixed programs).

With these things in place a third-party package git-annex-p2p-yggstack could provide a simple set of shell scripts to implement transport over yggstack:
For the server side there would be a `git-annex-p2ptransport-yggstack` along these lines (modulo proper process cleanup of course):

and a `git-annex-p2ptransport-enable-yggstack` like this:

For the client-side it would provide `git-annex-p2p-yggstack` along these lines:

With that package installed one could then do `git annex p2ptransport enable-yggstack` followed by `git annex p2p --gen-addresses`. A `git annex remotedaemon` would now start everything on the server-side, and the client-side could connect using `git annex p2p --link` with the address from `--gen-addresses`.

I think this would be sufficiently flexible for most kinds of p2p transport one could come up with. E.g. a transport over fowl or even plain magic-wormhole (though the transit relay wouldn't appreciate it) could use `p2p-annex::fowl+<code>` where the code is a pre-generated token instead of the usual passphrases used by magic-wormhole. The server side would be a script that repeatedly waits for connections to that code, the client side just connects to it.

Even for more traditional p2p setups (tinc, wireguard, yggdrasil, etc.) where the transport is pre-set up at the system level this would just work if there was a helper for `p2p-annex::tcpip+<hostname>:<port>` (effectively just netcat again).

Configuration, program, and subcommand names etc. are of course open to bike-shedding. Some of the hardcoded ports above should be dynamically chosen, or completely avoided if the transport can do so (yggstack and fowl can't expose unix sockets directly yet, so the digression through the loopback device is needed for now).
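As an illustration of how small such a helper could be, here is a sketch of the `tcpip` case, assuming the helper is called with `<hostname>:<port>` as its only argument and is expected to connect its stdin/stdout to that address. The script name `git-annex-p2p-tcpip` and the function name are hypothetical; socat is used in place of netcat only because it already appears above.

```shell
#!/bin/sh
# Sketch of a hypothetical git-annex-p2p-tcpip helper; not part of git-annex.
SOCAT=${SOCAT:-socat}   # overridable, e.g. SOCAT=echo for a dry run

# Connect stdin/stdout to the address given as "<hostname>:<port>".
p2p_tcpip_connect() {
    addr=$1
    host=${addr%:*}     # everything before the last ":"
    port=${addr##*:}    # everything after the last ":"
    # The port must be a non-empty string of digits.
    case $port in
        ''|*[!0-9]*) echo "invalid address: $addr" >&2; return 1 ;;
    esac
    # An address without ":" leaves host equal to the whole string.
    if [ -z "$host" ] || [ "$host" = "$addr" ]; then
        echo "invalid address: $addr" >&2
        return 1
    fi
    "$SOCAT" STDIO "TCP:$host:$port"
}

# Only connect when installed and invoked under the helper's name.
case ${0##*/} in
    git-annex-p2p-tcpip) p2p_tcpip_connect "$1" ;;
esac
```

Authentication would still be left entirely to the p2p protocol's AUTH step, so this only makes sense where the underlying network is already trusted or encrypted.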
What do you think?
One more thought: the proposed `p2p-annex::foo+<addr>` remote makes one assumption that I don't think holds for all thinkable p2p transports. That assumption is that there is a public address for the server-side that can be trusted to be the expected other side.

For tor and yggstack this does hold: the public address (onion address of the hidden service for tor and the IPv6 derived from the public key of the yggstack peer (potentially resolved from a .pk.ygg DNS entry like above), respectively) ensures that the server side is who they are expected to be. There is no way for a third-party to pretend that they were the server-side, even if they knew the git remote string, because they would need to have the server's private key to do so.

This is not the case for fowl: with fowl one would essentially do `fowl <psk> ...` on both sides to create a tunnel between server and client. If the PSK were fully contained in the remote string then a third-party getting hold of that string could pretend to be the server (when the server side is currently not waiting for a connection itself) and steal the auth token from the client. So under the assumption that the remote string is not a secret this would be a problem.

But this problem can be overcome: with fowl both sides could simply derive the psk from the p2p auth token to establish the connection, essentially like so: `fowl <number derived from auth token>-<auth token> ...`. The git remote string would only need to contain the information to use fowl and some unique identifier for the remote then, so that the right auth token can be taken from .git/annex/creds.

Likewise, for other p2p transports that don't have stable and secure public addresses, necessary information exchange could also happen over magic-wormhole using the auth tokens, or the auth tokens could be used as PSKs between both sides if that's what the transport needs. This would e.g. apply for a hypothetical transport over webrtc data channels, where some kind of "SDP" has to be exchanged between both sides to establish a connection.
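The derivation of the fowl code from the auth token could be sketched roughly as follows. The function name is hypothetical, and using a CRC from `cksum` as the "number derived from auth token" is purely an illustrative choice: whether fowl accepts a code of this shape, and whether exposing such a number is acceptable, are both unverified assumptions.

```shell
#!/bin/sh
# Sketch only: the derivation is an illustrative assumption, not a vetted scheme.

# Print "<number derived from auth token>-<auth token>" for token $1.
fowl_code_from_token() {
    token=$1
    [ -n "$token" ] || return 1
    # cksum prints "<crc> <byte count>" for stdin; keep only the CRC.
    number=$(printf '%s' "$token" | cksum | cut -d ' ' -f 1)
    printf '%s-%s\n' "$number" "$token"
}
```

Both sides would compute the same code from the shared auth token, so nothing secret needs to appear in the git remote string.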
All that to say: I think `p2p-annex::foo+` would indeed be general enough for many conceivable means of transport, if a re-use of the auth tokens in the above fashion would be acceptable. And I can't think of anything against it, yet.

I agree this would be a problem, but how would a third-party get ahold of the string? Remote urls don't usually get stored in the git repository; perhaps you were thinking of some other way.