Please describe the problem.
P2phttp can deadlock with multiple concurrent clients talking to it.
What steps will reproduce the problem?
- Create a git-annex repository with a bunch of annexed files served via p2phttp like so:
  `git-annex --debug p2phttp -J2 --bind 127.0.0.1 --wideopen`
- Create multiple different clones of that repository connected via p2phttp, all doing
  `while true; do git annex drop .; git annex get --in origin; done`
- Observe a deadlock after an indeterminate amount of time
This deadlock seems to occur faster the more repos you use. I've tried increasing -J to 3 and had it deadlock with two client repos once, but that seems to happen much less often.
What version of git-annex are you using? On what operating system?
$ git annex version
git-annex version: 10.20250929-g33ab579243742b0b18ffec2ea4ce1e3a827720b4
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV Servant OsPath
dependency versions: aws-0.24.4 bloomfilter-2.0.1.2 crypton-1.0.4 DAV-1.3.4 feed-1.3.2.1 ghc-9.10.2 http-client-0.7.19 persistent-sqlite-2.13.3.1 torrent-10000.1.3 uuid-1.3.16 yesod-1.6.2.1
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external compute mask
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 10
Please provide any additional information below.
I think that --debug output from the p2phttp server would be helpful in narrowing down whether there is a particular operation that causes this hang.
p2phttp has a pool of worker threads, so if a thread stalls out, or potentially crashes in some way that is not handled, it can result in all subsequent operations hanging. 91dbcf0b56ba540a33ea5a79ed52f33e82f4f61b is one recent example of that; I remember there were some similar problems when initially developing it.
-J2 also seems quite low though. With the http server itself using one of those threads, all requests get serialized through the second thread. If there is any situation where request A needs request B to be made and finish before it can succeed, that would deadlock.
I was able to reproduce this fairly quickly with 2 clones, each running the loop on the same 5 files, each of which I made 1 MB in size.
Both hung on get, of different files. The tail of the --debug:
A second hang happened with the loops each running on the same 2 files. This time, one clone was doing "get 1" and the other clone "drop 1 (locking origin...)" when they hung.
A third hang, again with 2 files, and both hung on "drop 1 (locking origin...)"
Here the P2P protocol client inside the http server got a worker thread, but then apparently the http response handler stalled out. That's different from the other 2 debug logs where the protocol client was able to send a response. I think in the other 2 debug logs, the P2P protocol client then stalls getting a worker thread.
Aha! We have two things both calling inAnnexWorker:
For each http request, both of these run asynchronously.
So, with -J2, if two http requests happen at the same time and localConnection wins both races, both worker threads end up stalled waiting for a response from the P2P server side, which is itself blocked waiting for a worker thread. Or perhaps both of the http response handlers win, leading to a similar deadlock.
Maybe it could even happen that the localConnection for one request wins, as well as the response handler for the other request?
(And higher -J numbers would still have the same problem when there are more clients. The docs for -J are also a bit wrong: they say that the http server uses 1 thread itself, but it can really use any number of threads, since localConnection runs inAnnexWorker in an async action.)
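To make that race concrete, here is a toy model of the situation (my own sketch, not git-annex code): the worker pool is approximated by a QSemN with annex.jobs slots, the "client side" stands in for localConnection, and the "server side" stands in for the P2P protocol server / http response handler.

```haskell
-- Toy model only, not git-annex code.
import Control.Concurrent.MVar (newEmptyMVar, takeMVar, putMVar)
import Control.Concurrent.QSemN (QSemN, newQSemN, waitQSemN, signalQSemN)
import Control.Concurrent.Async (concurrently_)
import Control.Exception (bracket_)

-- Acquire a worker slot, run an action, release the slot.
inAnnexWorker :: QSemN -> IO a -> IO a
inAnnexWorker pool = bracket_ (waitQSemN pool 1) (signalQSemN pool 1)

-- One http request: both halves run asynchronously and each wants a slot.
request :: QSemN -> Int -> IO ()
request pool n = do
    response <- newEmptyMVar
    concurrently_
        (inAnnexWorker pool $ do
            -- client side: holds its worker while waiting on the server side
            putStrLn ("request " ++ show n ++ ": client side got a worker")
            takeMVar response)
        (inAnnexWorker pool $ do
            -- server side: needs its own worker to produce the response
            putStrLn ("request " ++ show n ++ ": server side got a worker")
            putMVar response ())

main :: IO ()
main = do
    pool <- newQSemN 2    -- -J2: two worker slots
    -- With two concurrent requests, if both client sides grab the two slots
    -- first, both server sides block on the pool and nothing can finish.
    -- Depending on scheduling this either completes or deadlocks (GHC's
    -- runtime will usually report the latter as threads blocked indefinitely).
    concurrently_ (request pool 1) (request pool 2)
```

Run that a few times: whenever both client sides win the race, the pool is exhausted and neither request can ever complete, which matches the hang seen with -J2 and two clients.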
Anyway, if this analysis is correct, the fix is surely to have 2 worker thread pools, one for the P2P protocol client side and one for the P2P protocol server side.
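For illustration, a minimal sketch of what that split could look like, continuing the toy model above (the WorkerPools type and the helper names are hypothetical, not git-annex's actual types):

```haskell
module WorkerPools where

import Control.Concurrent.QSemN (QSemN, newQSemN, waitQSemN, signalQSemN)
import Control.Exception (bracket_)

-- Hypothetical: one pool per protocol side, so the client side of a request
-- can never starve the server side out of the same resource.
data WorkerPools = WorkerPools
    { clientPool :: QSemN  -- P2P protocol client side (localConnection)
    , serverPool :: QSemN  -- P2P protocol server side (http response handler)
    }

newWorkerPools :: Int -> IO WorkerPools
newWorkerPools n = WorkerPools <$> newQSemN n <*> newQSemN n

inClientWorker, inServerWorker :: WorkerPools -> IO a -> IO a
inClientWorker pools = withSlot (clientPool pools)
inServerWorker pools = withSlot (serverPool pools)

withSlot :: QSemN -> IO a -> IO a
withSlot sem = bracket_ (waitQSemN sem 1) (signalQSemN sem 1)
```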
Tested a modified p2phttp that uses 2 worker pools, one for the P2P client side and one for server side. This means that -J2 actually runs up to 4 threads, although with only 2 capabilities, so the change won't affect CPU load. So I tried with -J2 and 4 clients running the loop.
It still stalls much as before, maybe after a bit longer.
It still seems likely that the two workers used per http request are the root of the problem. When there are more than annex.jobs concurrent requests, each http response handler calls inAnnexWorker, and so one will block. If the localConnection corresponding to that blocked handler did get a worker, then one of the other localConnections must be blocked instead, resulting in a running http response handler whose corresponding localConnection is blocked. The inverse also seems possible.
If 2 worker pools are not the solution, it seems this would instead need to be solved by rearchitecting the http server to not have that separation, or by ensuring that getP2PConnection doesn't return until the localConnection has allocated its worker. I'll try the latter next.
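A hypothetical sketch of that ordering, again in terms of the toy model (the real getP2PConnection and localConnection do not look like this; the point is only the MVar handshake that delays the return until the worker is actually held):

```haskell
module ConnectionHandshake where

import Control.Concurrent.MVar (newEmptyMVar, takeMVar, putMVar)
import Control.Concurrent.QSemN (QSemN, waitQSemN, signalQSemN)
import Control.Concurrent.Async (Async, async)
import Control.Exception (bracket_)

-- Hypothetical stand-in for getP2PConnection: spawn the localConnection
-- side, but do not return until it has actually acquired its worker slot.
getP2PConnectionSketch :: QSemN -> IO () -> IO (Async ())
getP2PConnectionSketch pool localConnection = do
    ready <- newEmptyMVar
    conn <- async $
        bracket_ (waitQSemN pool 1) (signalQSemN pool 1) $ do
            putMVar ready ()     -- signal only once the worker slot is held
            localConnection      -- run the P2P protocol client side
    takeMVar ready               -- caller blocks until the worker is allocated
    return conn
```

With this ordering, an http response handler could not take a worker while its own localConnection was still waiting for one.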
Pushed a preliminary fix in the `p2phttp_deadlock` branch. That has some known problems, documented in the commit. But it does avoid p2phttp locking up like this.
I should have been a bit more clear: I also saw the deadlock sometimes with concurrent gets, sometimes with drops, and sometimes with a mix of both, so there wasn't one particular operation that seemed to be the issue.
This is for Forgejo-aneksajo, where there is still one p2phttp process being started per repository. Since there could potentially be thousands of concurrent processes at any given time, I thought it might be wise to start with the bare minimum by default. Due to how p2phttp and proxying are supposed to interact, I've also realized that the current integration is not working as it should (https://codeberg.org/forgejo-aneksajo/forgejo-aneksajo/issues/96) and that I probably won't be able to make use of a single p2phttp process for all repositories (because of ambiguity with authorization when there are multiple different repositories with differing permissions that proxy for the same remote).
I've landed a complete fix for this. The server no longer locks up when I run the test case for a long time.
Additionally, there was a bug that caused p2phttp to leak lock file descriptors, which gets triggered by the same test case. I've fixed that.
There are two problems I noticed in testing though.
- `git-annex get` sometimes slows down to just bytes per second, or entirely stalls. This is most apparent with `-J10`, but I've seen it happen even when there is no concurrency or other clients. This should probably be treated as a separate bug, but it does cause the test case to eventually hang, unless git-annex is configured to do stall detection. The server keeps responding to other requests though.
- Running `git-annex drop` and interrupting it at the wrong moment while it's locking content on the server seems to cause a P2P protocol worker to not get returned to the worker pool. When it happens enough times, this can cause the server to stop responding to new requests. Which seems closely related to this bug.
Fixed the problem with interrupted `git-annex drop`.
Opened a new bug report about the sometimes stalling `git-annex get`: get from p2phttp sometimes stalls
I think I've fully addressed this bug report now, so will close it.
After fixing the other bug, I have successfully run the test for several hours without any problems.