Being able to connect repositories peer to peer is nice, but having tor as the only option is quite limiting, especially since tor isn't well suited to large file transfers. It would be nice if git-annex could be taught to use other transports as well. What I have in mind is yggstack or fowl (for fowl I had already opened a todo in the past: https://git-annex.branchable.com/todo/Peer_to_peer_connection_purely_over_magic-wormhole/), but there are probably others that could be used too.
What I am thinking would be nice to have for this is:
- Something like `git annex enable-p2p-socket`, which would configure the repository such that `git annex remotedaemon` listens on a unix socket somewhere under .git/annex for incoming p2p connections. These would be authenticated using the pairing process from `git annex p2p`, just like when using the tor transport.
- A git remote helper `p2p-annex::<path-to-socket-file>`, which would connect to that unix socket and speak the p2p protocol over it.
With these two things in place it would be possible to use any transport to connect the socket files on two systems, including yggstack, fowl, or just netcat or socat (though unencrypted communication would be a bad idea).
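For example, assuming the socket lived at .git/annex/p2psocket (the exact path is up for grabs here) and using plain TCP as the stand-in transport (unencrypted, so only for illustration), the wiring could look like:

```
# serving side: expose the p2p unix socket on a TCP port
socat TCP-LISTEN:2222,fork,reuseaddr UNIX-CONNECT:.git/annex/p2psocket

# client side: re-materialize it as a local socket file
socat UNIX-LISTEN:/tmp/annex-p2p.sock,fork TCP:server.example:2222

# client side: point a remote at the socket file
git remote add peer "p2p-annex::/tmp/annex-p2p.sock"
```

Any other tunnel (yggstack, fowl, an ssh port forward) could replace the TCP leg without either repository noticing.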
My understanding is that the current tor p2p support is essentially a special case of the above, using a socket file in /var/lib/tor-annex and requiring a hidden service configuration in torrc on the server-side, while being limited to onion addresses on the client-side. In that sense this would just be a generalization and I think most of the code to support this is already there, and just needs to be wired differently.
This should also make it possible to build e.g. a `git annex enable-yggstack` command and a `yggstack-annex::<pubkey>.pk.ygg` remote in terms of `enable-p2p-socket` and `p2p-annex::`, even outside of git-annex itself.
What do you think?
This could also be built on `git annex shell p2pstdio`, but without skipping authentication, instead using the `git annex p2p` pairing process. Something like socat could then be used to connect those stdin/stdouts to a unix socket, tcp port, or whatever else.

This was all designed to be generalizable to some degree, but has so far really only been used for tor.
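For instance, socat can bridge such a stdio-speaking process to a listening unix socket (illustrative only; the p2pstdio arguments depend on the repository and are elided here):

```
# bridge a stdio p2p process to a unix socket
socat UNIX-LISTEN:/tmp/p2p.sock,fork EXEC:"git-annex-shell p2pstdio ..."
```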
Making it generic may be a good idea. Or it may be that there are too many complications around how different p2p networks and addresses work, and how authentication is done, for a generic command to be workable; such complications can be handled transparently when implementing support for a specific p2p transport, as was done for tor.
Working from the client end, the git remote has an url, which needs to be identified as a p2p address in order to use a p2p transport to talk to it. Currently that is an url starting with "tor-annex:". Like you suggest, the generic one could be "p2p-annex::". Or it could be "p2p-annex::foo+" which causes git-annex to run a command like `git-annex-p2p-foo <bar>` and talk to its stdin and stdout.

That's for outgoing connections. For incoming connections, for tor, the remotedaemon creates the socket file that tor is configured to use for the hidden service, and listens on it to accept connections from tor. (That tor socket is not used for outgoing connections.) It would be easy to generalize this to additional socket filenames. Eg, a remote with uuid U could use `.git/annex/p2p/U` as its socket file.

BTW, that git-annex-p2p-foo command is different from the git remote helper you suggest, which corresponds to git-remote-tor-annex. But git-remote-tor-annex would easily generalize to a git-remote-p2p-annex git remote helper, if there were a generic p2p-annex url type and a way to connect to it.
If the P2P protocol's AUTH is provided with an AuthToken, there would need to be an interface to record the one to use for a given p2p connection. `git-annex p2p` handles setting up AuthTokens, but its approach may or may not make sense for a given p2p protocol. It does look like, if there were a generic way implemented to connect to a given p2p-annex url, `git-annex p2p` would mostly work. But there would need to be a way to generate an address using such an url, like `git-annex enable-tor` does.

Seems pretty close to a workable design to me, but I don't know how well it will match up with these various kinds of P2P networks.
Your comment seems to be wrongly formatted. It was shown correctly in the notification mail, but doesn't show up here.
Just to document what I have tried out, for completeness: with what is already in place it is possible to connect two repositories over yggstack, it is just very awkward.

On one system you can do:

1. `sudo mkdir /etc/tor && sudo touch /etc/tor/torrc` (without actually having tor installed)
2. `sudo git annex enable-tor $(id -u)`
3. `yggstack -genconf > yggstack.conf`
4. `echo tor-annex::<pubkey>.pk.ygg:12345` (take the pubkey out of yggstack.conf)
5. `socat TCP-LISTEN:12345,fork,reuseaddr UNIX-CONNECT:/var/lib/tor-annex/<uid>_<repo-uuid>/s`
6. `yggstack -useconffile yggstack.conf -remote-tcp 12345:127.0.0.1:12345`
7. `git annex p2p --gen-addresses`

On the other system do:

1. `yggstack -autoconf -socks 127.0.0.1:9050`
2. `git annex p2p --link` and paste in the generated address when asked (it should have the form `tor-annex::<pubkey>.pk.ygg:12345:<auth-token>`)

On the server side this simply exposes the p2p socket generated for tor through a different means, and on the client side this works because yggstack can be used similarly enough to tor (doing name resolution through the socks proxy at port 9050 and then connecting to the supplied port).
I really like your proposal of a `p2p-annex::foo+<whatever>` remote; together with a way to tell remotedaemon to start a process exposing the socket, it would make for an easily extendable mechanism. Imagine this:

Client side: a `p2p-annex::foo+<addr>` remote would start `git-annex-p2p-foo <addr>` and talk to its stdin/stdout.

Server side: `annex.start-p2psocket=true` would instruct remotedaemon to listen on .git/annex/p2psocket (I think a hardcoded location is fine, as there only really needs to be one such socket even with multiple networks, and somewhere under .git/annex is a good location: it associates the socket with the repository and will always be writable by the user). `annex.expose-p2p-via=foo` could be supplied zero, one, or multiple times, and each of these configurations would instruct remotedaemon to start the external program git-annex-p2ptransport-foo after the p2p socket is ready (this configuration could also just point to a command to execute, but I thought it might be nice to stay with the theme of commonly prefixed programs).

With these things in place a third-party package git-annex-p2p-yggstack could provide a simple set of shell scripts to implement transport over yggstack:
For the server side there would be a `git-annex-p2ptransport-yggstack` script (modulo proper process cleanup, of course), together with a `git-annex-p2ptransport-enable-yggstack` to set it up. For the client side it would provide a `git-annex-p2p-yggstack`.

With that package installed one could then do `git annex p2ptransport enable-yggstack` followed by `git annex p2p --gen-addresses`. A `git annex remotedaemon` would now start everything on the server side, and the client side could connect using `git annex p2p --link` with the address from `--gen-addresses`.
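The three scripts mentioned above could look roughly like this. This is only a sketch: the yggstack/socat invocations mirror the manual experiment above, the helper naming scheme and socket path are the ones hypothesized in this thread, and none of it is tested.

```
#!/bin/sh
# git-annex-p2ptransport-yggstack (sketch, no proper process cleanup)
# Expose the repository's p2p socket over yggstack, as in the manual
# experiment; loopback port 12345 is assumed to be free.
socat TCP-LISTEN:12345,fork,reuseaddr,bind=127.0.0.1 \
    UNIX-CONNECT:.git/annex/p2psocket &
exec yggstack -useconffile .git/annex/yggstack.conf \
    -remote-tcp 12345:127.0.0.1:12345
```

```
#!/bin/sh
# git-annex-p2ptransport-enable-yggstack (sketch)
# Generate a yggstack config and print the resulting address.
yggstack -genconf > .git/annex/yggstack.conf
# how to extract the public key depends on yggstack's config format:
pubkey=$(sed -n 's/^ *PublicKey: *\([0-9a-f]*\).*/\1/p' .git/annex/yggstack.conf)
echo "p2p-annex::yggstack+$pubkey.pk.ygg:12345"
```

```
#!/bin/sh
# git-annex-p2p-yggstack <pubkey>.pk.ygg:<port> (sketch, no cleanup)
# Relay stdin/stdout to the server's socket via a yggstack socks proxy.
yggstack -autoconf -socks 127.0.0.1:9050 &
host=${1%:*}; port=${1##*:}
exec socat - "SOCKS4A:127.0.0.1:$host:$port,socksport=9050"
```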
I think this would be sufficiently flexible for most kinds of p2p transport one could come up with. E.g. a transport over fowl or even plain magic-wormhole (though the transit relay wouldn't appreciate it) could use `p2p-annex::fowl+<code>`, where the code is a pre-generated token instead of the usual passphrases used by magic-wormhole. The server side would be a script that repeatedly waits for connections to that code; the client side just connects to it.

Even for more traditional p2p setups (tinc, wireguard, yggdrasil, etc.) where the transport is pre-set up at the system level, this would just work if there were a helper for
`p2p-annex::tcpip+<hostname>:<port>` (effectively just netcat again).

Configuration, program, and subcommand names etc. are of course open to bike-shedding. Some of the hardcoded ports above should be dynamically chosen, or avoided completely if the transport can do so (yggstack and fowl can't expose unix sockets directly yet, so the digression through the loopback device is needed for now).
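Such a tcpip helper could indeed be little more than a socat one-liner (the helper name and calling convention are just the ones hypothesized above, and it assumes socat is installed):

```
#!/bin/sh
# git-annex-p2p-tcpip <hostname>:<port>  (hypothetical helper)
# Connect stdin/stdout to the given TCP peer -- effectively netcat.
exec socat - "TCP:$1"
```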
What do you think?
One more thought: the proposed `p2p-annex::foo+<addr>` remote makes one assumption that I don't think holds for all thinkable p2p transports. That assumption is that there is a public address for the server side that can be trusted to be the expected other side.

For tor and yggstack this does hold: the public address (the onion address of the hidden service for tor, and the IPv6 address derived from the public key of the yggstack peer (potentially resolved from a .pk.ygg DNS entry like above), respectively) ensures that the server side is who they are expected to be. There is no way for a third party to pretend that they were the server side, even if they knew the git remote string, because they would need to have the server's private key to do so.
This is not the case for fowl: with fowl one would essentially do `fowl <psk> ...` on both sides to create a tunnel between server and client. If the PSK were fully contained in the remote string, then a third party getting hold of that string could pretend to be the server (when the server side is currently not waiting for a connection itself) and steal the auth token from the client. So under the assumption that the remote string is not a secret, this would be a problem.

But this problem can be overcome: with fowl, both sides could simply derive the psk from the p2p auth token to establish the connection, essentially like so: `fowl <number derived from auth token>-<auth token> ...`. The git remote string would then only need to contain the information to use fowl and some unique identifier for the remote, so that the right auth token can be taken from .git/annex/creds.

Likewise, for other p2p transports that don't have stable and secure public addresses, necessary information exchange could also happen over magic-wormhole using the auth tokens, or the auth tokens could be used as PSKs between both sides if that's what the transport needs. This would e.g. apply to a hypothetical transport over webrtc data channels, where some kind of "SDP" has to be exchanged between both sides to establish a connection.
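A sketch of that derivation (the hashing scheme is made up for illustration; it is not anything fowl or git-annex prescribes):

```shell
# Derive a magic-wormhole-style code "<number>-<authtoken>" from the p2p auth
# token, so both sides can compute the same fowl code without it ever
# appearing in the git remote url.
authtoken=0123456789abcdef   # example; would really come from .git/annex/creds
hash=$(printf %s "$authtoken" | sha256sum | cut -c1-4)
number=$(( 0x$hash % 1000 ))  # small channel number, like wormhole codes use
code="$number-$authtoken"
echo "$code"
# both sides would then run something like: fowl "$code" ...
```

Since both ends already share the auth token, they arrive at the same code independently.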
All that to say: I think `p2p-annex::foo+` would indeed be general enough for many conceivable means of transport, if a re-use of the auth tokens in the above fashion would be acceptable. And I can't think of anything against it, yet.

I agree this would be a problem, but how would a third party get ahold of the string, though? Remote urls don't usually get stored in the git repository; perhaps you were thinking of some other way.
My thinking was that git remote URLs usually aren't sensitive information that inherently grants access to a repository, so a construct where the remote URL contains the credentials is just unexpected. A careless user might e.g. put it into a `type=git` special remote, or treat it in some other way in which one wouldn't treat a password, without considering the implications. I am not aware of a way in which it could be leaked without user intervention, though.

Having separate credentials explicitly named as such just seems safer. But in the end this would be the responsibility of whoever implements the p2p transport, anyway.
Please take a look at iroh. It started as an IPFS implementation in Rust, realized that IPFS is slow and overengineered, and has now pivoted to providing p2p connections over QUIC.

I'm waiting for their FOSDEM talk. But there is also a good presentation on YouTube: A tour of iroh.
I wrote:

But, as implemented, `git-annex remotedaemon` will accept any of the authtokens in its list for any p2p connection. So if there are 2 onion services for the same repository for some reason, there will be 2 authtokens, but either can be used with either.

If there are 2 P2P connections and you decide to stop listening to one of them, that authtoken needs to be removed from the list, otherwise someone could still use it with the other P2P connection. If we think about 2 different P2P protocols, one might turn out to be insecure, so you stop using it. But then, if the insecurity allowed someone else to observe the authtoken that was used with it, and you didn't remove it from the list, they could use it to connect via the other P2P service.
And the user does not know about authtokens, they're an implementation detail currently. So expecting the user to remove them from the list isn't really sufficient.
So it seems better for each P2P address to have its own unique authtoken, that is not accepted for any other address. Or at least each P2P address that needs an authtoken; perhaps some don't. (I don't think it's a problem that for tor each hidden service accepts all listed authtokens though.)
@matrrs wrote:
That single socket wouldn't work if each P2P address has its own unique authtoken, because remotedaemon would have no way to know which P2P address that socket was connected with.
It also could be that some P2P protocol is 100% certain not to need an authtoken for security. That would need a separate socket where remotedaemon does not require AUTH with a valid authtoken. Or, setting up a P2P connection for such a network would need to exchange authtokens, even though there is no security benefit in doing so.
I don't know if I would want to make the determination of whether or not some P2P protocol needs an authtoken. It may be that the security situation of a P2P protocol evolves over time. Consider the case of tor, where it used to be fairly trivially possible to enumerate onion addresses. See for example this paper. (Which is why I made tor use AuthTokens in the first place, IIRC.) Apparently changes were later made to tor to prevent that. I don't know how secure it is considered to be in this area now, though.
If `git-annex p2p` is used to set up the P2P connection, it handles generating the authtokens and exchanging them, fairly transparently to the user. So maybe it would be simplest to always require authtokens.

There is another reason for the authtoken: the socket file may be accessible by other users of the system. This is the case with the tor socket, since tor runs as another user, and so the socket file is made world writable.
Hmm, I don't know if it would generally make sense for remotedaemon to start up external programs that run P2P networks. That might be something that runs system-wide, like tor (often) does. Or the user might expect to run it themselves and only have git-annex use it when it's running.
It seems to me that in your yggstack example, there's no real need for remotedaemon to be responsible for running `git-annex-p2ptransport-yggstack`. You could run that yourself first. Then the remotedaemon can create the socket file and listen to it.

If a tcp connection comes in before the socket file exists, socat handles it by closing that connection, and keeps listening for further connections.
I had suggested using the remote's configuration to determine the socket that remotedaemon listens on.
But it may be that only incoming connections are wanted to be served, without having any remotes configured that use a P2P network. (And there could be multiple remotes that use the same P2P network.)
Instead, I think that remotedaemon should use socket files of the form `.git/annex/p2p/$address`, for each P2P address that loadP2PAddresses returns (except tor ones).

There could be a `git-annex p2p --enable` command, which is passed the P2P address to enable. That is similar to `git-annex enable-tor` in that it would run `storeP2PAddress`, and so configure remotedaemon to listen on the socket file for that address.

It could also generate an AuthToken and output a version of the address with the AuthToken included, similar to `git-annex p2p --gen-addresses`. That would let its output be communicated to the remote users, who can feed it into `git-annex p2p --link`. For that matter, I think that `git-annex p2p --pair` would also work.

The address passed to `git-annex p2p --enable` could be anything, but using a p2p-annex::foo address makes a `git-annex-p2p-foo` command be used when connecting to the address.

I did some necessary groundwork for this in 46ee651c9438a5dfc430b231089d3ac1e0d09e3c.
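For concreteness, an invocation of such an enable command might look something like this (the address syntax is an illustrative guess, not settled design):

```
git-annex p2p --enable p2p-annex::foo+some-network-address
```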
I am about ready to really start implementing this, I think. The design seems to be ready.