Being able to connect repositories peer to peer is nice, but having tor as the only option is quite limiting, especially since tor isn't well suited to large file transfers. It would be nice if git-annex could be taught to use other transports as well. What I have in mind is yggstack or fowl (for fowl I had already opened a todo in the past: https://git-annex.branchable.com/todo/Peer_to_peer_connection_purely_over_magic-wormhole/), but there are probably others that could be used too.
What I am thinking would be nice to have for this is:
- Something like `git annex enable-p2p-socket`, which would configure the repository such that `git annex remotedaemon` listens on a unix socket somewhere under .git/annex for incoming p2p connections. These would be authenticated using the pairing process from `git annex p2p`, just like when using the tor transport.
- A git remote helper `p2p-annex::<path-to-socket-file>`, which would connect to that unix socket and speak the p2p protocol over it.
With these two things in place it would be possible to use any transport to connect the socket files on two systems, including yggstack, fowl, or just netcat or socat (though unencrypted communication would be a bad idea).
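For example, assuming the socket lived at .git/annex/p2psocket (the exact path is up for grabs here) and using plain TCP as the stand-in transport (unencrypted, so only for illustration), the wiring could look like:

```
# serving side: expose the p2p unix socket on a TCP port
socat TCP-LISTEN:2222,fork,reuseaddr UNIX-CONNECT:.git/annex/p2psocket

# client side: re-materialize it as a local socket file
socat UNIX-LISTEN:/tmp/annex-p2p.sock,fork TCP:server.example:2222

# client side: point a remote at the socket file
git remote add peer "p2p-annex::/tmp/annex-p2p.sock"
```

Any other tunnel (yggstack, fowl, an ssh port forward) could replace the TCP leg without either repository noticing.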
My understanding is that the current tor p2p support is essentially a special case of the above, using a socket file in /var/lib/tor-annex and requiring a hidden service configuration in torrc on the server-side, while being limited to onion addresses on the client-side. In that sense this would just be a generalization and I think most of the code to support this is already there, and just needs to be wired differently.
This should also make it possible to build e.g. a `git annex enable-yggstack` command and a `yggstack-annex::<pubkey>.pk.ygg` remote in terms of `enable-p2p-socket` and `p2p-annex::`, even outside of git-annex itself.
What do you think?
This could also be built on `git annex shell p2pstdio`, but without skipping authentication, instead using the `git annex p2p` pairing process. Something like socat could then be used to connect those stdin/stdouts to a unix socket, tcp port, or whatever else.

This was all designed to be generalizable to some degree, but has so far really only been used for tor.
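For instance, socat can bridge such a stdio-speaking process to a listening unix socket (illustrative only; the p2pstdio arguments depend on the repository and are elided here):

```
# bridge a stdio p2p process to a unix socket
socat UNIX-LISTEN:/tmp/p2p.sock,fork EXEC:"git-annex-shell p2pstdio ..."
```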
Making it generic may be a good idea. Or it may be that there are too many complications around how different p2p networks and addresses work, and how authentication is done, for a generic command to be workable; such complications can be handled transparently when implementing support for a specific p2p transport, as was done for tor.
Working from the client end, the git remote has an url, which needs to be identified as a p2p address in order to use a p2p transport to talk to it. Currently that is an url starting with "tor-annex:". Like you suggest, the generic one could be "p2p-annex::". Or it could be "p2p-annex::foo+" which causes git-annex to run a command like `git-annex-p2p-foo <bar>` and talk to its stdin and stdout.

That's for outgoing connections. For incoming connections, for tor, the remotedaemon creates the socket file that tor is configured to use for the hidden service, and listens on it to accept connections from tor. (That tor socket is not used for outgoing connections.) It would be easy to generalize this to additional socket filenames. Eg, a remote with uuid U could use `.git/annex/p2p/U` as its socket file.

BTW, that git-annex-p2p-foo command is different from the git remote helper you suggest, which corresponds to git-remote-tor-annex. But git-remote-tor-annex would easily generalize to a git-remote-p2p-annex git remote helper, if there were a generic p2p-annex url type and a way to connect to it.
If the P2P protocol's AUTH is provided with an AuthToken, there would need to be an interface to record the one to use for a given p2p connection. `git-annex p2p` handles setting up AuthTokens, but its approach may or may not make sense for a given p2p protocol. It does look like, if there were a generic way implemented to connect to a given p2p-annex url, `git-annex p2p` would mostly work. But there would need to be a way to generate an address using such an url, like `git-annex enable-tor` does.

Seems pretty close to a workable design to me, but I don't know how well it will match up with these various kinds of P2P networks.
Your comment seems to be wrongly formatted. It was shown correctly in the notification mail, but doesn't show up here.
Just to document what I have tried out, for completeness: with what is already in place it is possible to connect two repositories over yggstack, it is just very awkward.

On one system you can do:

1. `sudo mkdir /etc/tor && sudo touch /etc/tor/torrc` (without actually having tor installed)
2. `sudo git annex enable-tor $(id -u)`
3. `yggstack -genconf > yggstack.conf`
4. `echo tor-annex::<pubkey>.pk.ygg:12345` (take the pubkey out of yggstack.conf)
5. `socat TCP-LISTEN:12345,fork,reuseaddr UNIX-CONNECT:/var/lib/tor-annex/<uid>_<repo-uuid>/s`
6. `yggstack -useconffile yggstack.conf -remote-tcp 12345:127.0.0.1:12345`
7. `git annex p2p --gen-addresses`

On the other system do:

1. `yggstack -autoconf -socks 127.0.0.1:9050`
2. `git annex p2p --link` and paste in the generated address when asked (it should have the form `tor-annex::<pubkey>.pk.ygg:12345:<auth-token>`)

On the server side this simply exposes the p2p socket generated for tor through a different means, and on the client side this works because yggstack can be used similarly enough to tor (doing name resolution through the socks proxy at port 9050 and then connecting to the supplied port).
I really like your proposal of a `p2p-annex::foo+<whatever>` remote; together with a way to tell remotedaemon to start a process exposing the socket, it would make for an easily extendable mechanism. Imagine this:

Client side: a `p2p-annex::foo+<addr>` remote would start `git-annex-p2p-foo <addr>` and talk to its stdin/stdout.

Server side: `annex.start-p2psocket=true` would instruct remotedaemon to listen on .git/annex/p2psocket (I think a hardcoded location is fine, as there only really needs to be one such socket even with multiple networks, and somewhere under .git/annex is a good location: it associates the socket with the repository and will always be writable by the user). `annex.expose-p2p-via=foo` could be supplied zero, one, or multiple times, and each of these configurations would instruct remotedaemon to start the external program git-annex-p2ptransport-foo after the p2p socket is ready (this configuration could also just point to a command to execute, but I thought it might be nice to stay with the theme of commonly prefixed programs).

With these things in place a third-party package git-annex-p2p-yggstack could provide a simple set of shell scripts to implement transport over yggstack:
For the server side there would be a `git-annex-p2ptransport-yggstack` script (modulo proper process cleanup, of course), together with a `git-annex-p2ptransport-enable-yggstack` to set it up. For the client side it would provide a `git-annex-p2p-yggstack`.

With that package installed one could then do `git annex p2ptransport enable-yggstack` followed by `git annex p2p --gen-addresses`. A `git annex remotedaemon` would now start everything on the server side, and the client side could connect using `git annex p2p --link` with the address from `--gen-addresses`.
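The three scripts mentioned above could look roughly like this. This is only a sketch: the yggstack/socat invocations mirror the manual experiment above, the helper naming scheme and socket path are the ones hypothesized in this thread, and none of it is tested.

```
#!/bin/sh
# git-annex-p2ptransport-yggstack (sketch, no proper process cleanup)
# Expose the repository's p2p socket over yggstack, as in the manual
# experiment; loopback port 12345 is assumed to be free.
socat TCP-LISTEN:12345,fork,reuseaddr,bind=127.0.0.1 \
    UNIX-CONNECT:.git/annex/p2psocket &
exec yggstack -useconffile .git/annex/yggstack.conf \
    -remote-tcp 12345:127.0.0.1:12345
```

```
#!/bin/sh
# git-annex-p2ptransport-enable-yggstack (sketch)
# Generate a yggstack config and print the resulting address.
yggstack -genconf > .git/annex/yggstack.conf
# how to extract the public key depends on yggstack's config format:
pubkey=$(sed -n 's/^ *PublicKey: *\([0-9a-f]*\).*/\1/p' .git/annex/yggstack.conf)
echo "p2p-annex::yggstack+$pubkey.pk.ygg:12345"
```

```
#!/bin/sh
# git-annex-p2p-yggstack <pubkey>.pk.ygg:<port> (sketch, no cleanup)
# Relay stdin/stdout to the server's socket via a yggstack socks proxy.
yggstack -autoconf -socks 127.0.0.1:9050 &
host=${1%:*}; port=${1##*:}
exec socat - "SOCKS4A:127.0.0.1:$host:$port,socksport=9050"
```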
I think this would be sufficiently flexible for most kinds of p2p transport one could come up with. E.g. a transport over fowl or even plain magic-wormhole (though the transit relay wouldn't appreciate it) could use `p2p-annex::fowl+<code>`, where the code is a pre-generated token instead of the usual passphrases used by magic-wormhole. The server side would be a script that repeatedly waits for connections to that code; the client side just connects to it.

Even for more traditional p2p setups (tinc, wireguard, yggdrasil, etc.) where the transport is pre-set up at the system level, this would just work if there were a helper for
`p2p-annex::tcpip+<hostname>:<port>` (effectively just netcat again).

Configuration, program, and subcommand names etc. are of course open to bike-shedding. Some of the hardcoded ports above should be dynamically chosen, or avoided completely if the transport can do so (yggstack and fowl can't expose unix sockets directly yet, so the digression through the loopback device is needed for now).
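Such a tcpip helper could indeed be little more than a socat one-liner (the helper name and calling convention are just the ones hypothesized above, and it assumes socat is installed):

```
#!/bin/sh
# git-annex-p2p-tcpip <hostname>:<port>  (hypothetical helper)
# Connect stdin/stdout to the given TCP peer -- effectively netcat.
exec socat - "TCP:$1"
```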
What do you think?
One more thought: the proposed `p2p-annex::foo+<addr>` remote makes one assumption that I don't think holds for all thinkable p2p transports. That assumption is that there is a public address for the server side that can be trusted to be the expected other side.

For tor and yggstack this does hold: the public address (the onion address of the hidden service for tor, and the IPv6 address derived from the public key of the yggstack peer (potentially resolved from a .pk.ygg DNS entry like above), respectively) ensures that the server side is who they are expected to be. There is no way for a third party to pretend that they were the server side, even if they knew the git remote string, because they would need to have the server's private key to do so.
This is not the case for fowl: with fowl one would essentially do `fowl <psk> ...` on both sides to create a tunnel between server and client. If the PSK were fully contained in the remote string, then a third party getting hold of that string could pretend to be the server (when the server side is currently not waiting for a connection itself) and steal the auth token from the client. So under the assumption that the remote string is not a secret, this would be a problem.

But this problem can be overcome: with fowl, both sides could simply derive the psk from the p2p auth token to establish the connection, essentially like so: `fowl <number derived from auth token>-<auth token> ...`. The git remote string would then only need to contain the information to use fowl and some unique identifier for the remote, so that the right auth token can be taken from .git/annex/creds.

Likewise, for other p2p transports that don't have stable and secure public addresses, necessary information exchange could also happen over magic-wormhole using the auth tokens, or the auth tokens could be used as PSKs between both sides if that's what the transport needs. This would e.g. apply to a hypothetical transport over webrtc data channels, where some kind of "SDP" has to be exchanged between both sides to establish a connection.
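A sketch of that derivation (the hashing scheme is made up for illustration; it is not anything fowl or git-annex prescribes):

```shell
# Derive a magic-wormhole-style code "<number>-<authtoken>" from the p2p auth
# token, so both sides can compute the same fowl code without it ever
# appearing in the git remote url.
authtoken=0123456789abcdef   # example; would really come from .git/annex/creds
hash=$(printf %s "$authtoken" | sha256sum | cut -c1-4)
number=$(( 0x$hash % 1000 ))  # small channel number, like wormhole codes use
code="$number-$authtoken"
echo "$code"
# both sides would then run something like: fowl "$code" ...
```

Since both ends already share the auth token, they arrive at the same code independently.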
All that to say: I think `p2p-annex::foo+` would indeed be general enough for many conceivable means of transport, if a re-use of the auth tokens in the above fashion would be acceptable. And I can't think of anything against it, yet.

I agree this would be a problem, but how would a third party get ahold of the string, though? Remote urls don't usually get stored in the git repository; perhaps you were thinking of some other way.
My thinking was that git remote URLs usually aren't sensitive information that inherently grants access to a repository, so a construct where the remote URL contains the credentials is just unexpected. A careless user might e.g. put it into a `type=git` special remote, or treat it in some other way in which one wouldn't treat a password, without considering the implications. I am not aware of a way in which it could be leaked without user intervention, though.

Having separate credentials explicitly named as such just seems safer. But in the end this would be the responsibility of whoever implements the p2p transport, anyway.
Please take a look at iroh. It started as an IPFS implementation in Rust, realized that IPFS is slow and overengineered, and has now pivoted to providing p2p connections over QUIC.

I'm waiting for their FOSDEM talk. But there is also a good presentation on YouTube: A tour of iroh.
I wrote:

But, as implemented, `git-annex remotedaemon` will accept any of the authtokens in its list for any p2p connection. So if there are 2 onion services for the same repository for some reason, there will be 2 authtokens, but either can be used with either.

If there are 2 P2P connections and you decide to stop listening to one of them, that authtoken needs to be removed from the list, otherwise someone could still use it with the other P2P connection. If we think about 2 different P2P protocols, one might turn out to be insecure, so you stop using it. But then, if the insecurity allowed someone else to observe the authtoken that was used with it, and you didn't remove it from the list, they could use it to connect via the other P2P service.
And the user does not know about authtokens, they're an implementation detail currently. So expecting the user to remove them from the list isn't really sufficient.
So it seems better for each P2P address to have its own unique authtoken, that is not accepted for any other address. Or at least each P2P address that needs an authtoken; perhaps some don't. (I don't think it's a problem that for tor each hidden service accepts all listed authtokens though.)
@matrrs wrote:
That single socket wouldn't work if each P2P address has its own unique authtoken, because remotedaemon would have no way to know which P2P address that socket was connected with.
It also could be that some P2P protocol is 100% certain not to need an authtoken for security. That would need a separate socket where remotedaemon does not require AUTH with a valid authtoken. Or, setting up a P2P connection for such a network would need to exchange authtokens, even though there is no security benefit in doing so.
I don't know if I would want to make the determination of whether or not some P2P protocol needs an authtoken. It may be that the security situation of a P2P protocol evolves over time. Consider the case of tor, where it used to be fairly trivially possible to enumerate onion addresses. See for example this paper. (Which is why I made tor use AuthTokens in the first place, IIRC.) Apparently changes were later made to tor to prevent that. I don't know how secure it is considered to be in this area now, though.
If `git-annex p2p` is used to set up the P2P connection, it handles generating the authtokens and exchanging them, fairly transparently to the user. So maybe it would be simplest to always require authtokens.

There is another reason for the authtoken: the socket file may be accessible by other users of the system. This is the case with the tor socket, since tor runs as another user, and so the socket file is made world writable.
Hmm, I don't know if it would generally make sense for remotedaemon to start up external programs that run P2P networks. That might be something that runs system-wide, like tor (often) does. Or the user might expect to run it themselves and only have git-annex use it when it's running.
It seems to me that in your yggstack example, there's no real need for remotedaemon to be responsible for running `git-annex-p2ptransport-yggstack`. You could run that yourself first. Then the remotedaemon can create the socket file and listen to it.

If a tcp connection comes in before the socket file exists, socat handles it by closing that connection, and keeps listening for further connections.
I had suggested using the remote's configuration to determine the socket that remotedaemon listens on.
But it may be that only incoming connections are wanted to be served, without having any remotes configured that use a P2P network. (And there could be multiple remotes that use the same P2P network.)
Instead, I think that remotedaemon should use socket files of the form `.git/annex/p2p/$address`, for each P2P address that loadP2PAddresses returns (except tor ones).

There could be a `git-annex p2p --enable` command, which is passed the P2P address to enable. That is similar to `git-annex enable-tor` in that it would run `storeP2PAddress`, and so configure remotedaemon to listen on the socket file for that address.

It could also generate an AuthToken and output a version of the address with the AuthToken included, similar to `git-annex p2p --gen-addresses`. That would let its output be communicated to the remote users, who can feed it into `git-annex p2p --link`. For that matter, I think that `git-annex p2p --pair` would also work.

The address passed to `git-annex p2p --enable` could be anything, but using a p2p-annex::foo address makes a `git-annex-p2p-foo` command be used when connecting to the address.

I did some necessary groundwork for this in 46ee651c9438a5dfc430b231089d3ac1e0d09e3c.
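For concreteness, an invocation of such an enable command might look something like this (the address syntax is an illustrative guess, not settled design):

```
git-annex p2p --enable p2p-annex::foo+some-network-address
```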
I am about ready to really start implementing this, I think. The design seems to be ready.