Recent comments posted to this site:

comment 13

I might have some misunderstandings about what the -J flag does exactly... So far I assumed that it just sets the number of OS threads that are used as a worker pool to handle requests. In Forgejo-aneksajo it is set to -J2 because of that assumption and because there is one p2phttp process per repository (it is started on demand when the repository is accessed over p2phttp and stopped after a while of non-usage), so larger values could multiply pretty fast. Your description sounds like it should actually just be a limit on the number of requests that can be handled concurrently, independent of the size of the worker pool. What I am observing, though, is that htop shows two new threads each time I increment the value by one.

Could there be a fixed-size (small) worker pool, and a higher number of concurrent requests allowed? I agree that limiting the total resource usage makes a lot of sense, but does it have to be tied to the thread count?

Maybe I am overthinking this a bit though, and I should just bump the number up by one or more factors of 2 and see what happens.

Comment by matrss
comment 12

Re serving more requests than workers, the point of limiting the number of workers is that each worker can take a certain amount of resources. Usually the resources are only a file descriptor and a bit of cpu and memory; with proxying they could also include making outgoing connections, running gpg, etc. The worker limit is about being able to control the total amount of resources used.


It would be possible to have an option where p2phttp does not limit the number of workers at all, and the slowloris attack prevention could be left enabled in that mode. Of course then enough clients could overwhelm the server, but maybe that's better for some use cases.

IIRC forgejo-aneksajo runs one p2phttp per repository and proxies requests to them. If so, you need a lower worker limit per p2phttp. I suppose it would be possible to make the proxy enforce its own limit on the number of concurrent p2phttp requests, and then it might make sense to not have p2phttp limit the number of workers.


p2phttp (or a proxy in front of it) could send a 503 response if it is unable to get a worker. That would avoid this slowloris attack prevention problem. It would leave it up to the git-annex client to retry. Which depends on the annex.retry setting currently. It might make sense to have some automatic retrying on 503 in the p2phttp client.
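
To sketch that idea (this is not actual p2phttp code; the WorkerPool type and withWorkerOr503 are made-up names for illustration), a handler could try to take a worker slot without blocking, and answer 503 with a Retry-After hint when none is free:

    {-# LANGUAGE OverloadedStrings #-}
    import Control.Concurrent.STM
    import Control.Monad.Error.Class (throwError)
    import Control.Monad.IO.Class (liftIO)
    import Servant

    -- Count of currently free worker slots.
    type WorkerPool = TVar Int

    withWorkerOr503 :: WorkerPool -> Handler a -> Handler a
    withWorkerOr503 pool action = do
        acquired <- liftIO $ atomically $ do
            free <- readTVar pool
            if free > 0
                then writeTVar pool (free - 1) >> return True
                else return False
        if acquired
            then do
                -- A real version would have to free the slot exception-safely.
                r <- action
                liftIO $ atomically $ modifyTVar' pool (+ 1)
                return r
            else throwError err503 { errHeaders = [("Retry-After", "5")] }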

One benefit of the way it works now is that a git-annex get -J10 will automatically use as many workers as the p2phttp server has available, and if 2 people are both running that, it naturally balances out fairly evenly between them, and keeps the server as busy as it wants to be in an efficient way. Client side retry would not work as nicely; there would need to be retry delays, and it would have to time out at some point.

Comment by joey
comment 11

The kill -s SIGINT was my mistake; I ran the script using dash, and it was dash's builtin kill that does not accept that syntax.

So, your test case was supposed to interrupt it after all. Tested it again with interruption and my fix does seem to have fixed it as best I can tell.

Comment by joey
comment 10

> kill -s SIGINT is not valid syntax (at least not with procps's kill), so kill fails to do anything and a bunch of git-annex processes stack up all trying to get the same files. Probably you meant kill -s INT

That's weird, I checked and both the shell built-in kill that I was using and the kill from procps-ng (Ubuntu's build: procps/noble-updates,now 2:4.0.4-4ubuntu3.2 amd64) installed on my laptop accept -s SIGINT.

Anyway, thank you for investigating! I agree being susceptible to DoS attacks is not great, but better than accidentally DoS'ing ourselves in normal usage...

I wonder, would it be architecturally possible to serve multiple requests concurrently with fewer workers than requests? E.g. do some async/multitasking magic between requests? If that were the case then I suspect this issue wouldn't come up, because all requests would progress steadily instead of waiting for a potentially long time.
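
Purely to illustrate what I mean (this is not anything like p2phttp's actual architecture; Job and startPool are made-up names): transfers could be broken into chunk-sized jobs on a shared queue, serviced by a small fixed pool of threads, so every in-flight request keeps making some progress:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.STM
    import Control.Monad (forever, replicateM_)

    -- One small unit of work, e.g. "send the next chunk of transfer X".
    type Job = IO ()

    -- A fixed number of worker threads all drain one shared queue, so any
    -- number of in-flight requests share the same small pool of threads.
    startPool :: Int -> IO (TQueue Job)
    startPool nworkers = do
        q <- newTQueueIO
        replicateM_ nworkers $ forkIO $ forever $ do
            job <- atomically (readTQueue q)
            job
        return q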

Comment by matrss
comment 9

Disabled the slowloris protection. :-/

I also checked with the original test case, fixed to call kill -s INT, and it also passed. I'm assuming this was never a bug about interruption.

Comment by joey
comment 8

Seems likely that getP2PConnection is run by serveGet, and the worker slot is allocated. Then a ThreadKilled exception arrives before the rest of serveGet's threads are started up. So the worker slot never gets freed. It's even possible that getP2PConnection is itself not cancellation safe.

So, I made all of serveGet be inside an uninterruptibleMask. That did seem to make the test case get past more slowloris cancellations than before. But, it still eventually hung.
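
For reference, here is the general shape of the problem in miniature (generic code, not the actual serveGet/getP2PConnection code): if an async exception can be delivered between acquiring the slot and installing the code that frees it, the slot leaks; bracket masks the acquire so that window is closed:

    import Control.Exception

    data Slot = Slot

    leaky :: IO Slot -> (Slot -> IO ()) -> (Slot -> IO a) -> IO a
    leaky acquire release use = do
        slot <- acquire
        -- A ThreadKilled arriving right here, before finally is installed,
        -- means release never runs and the slot leaks.
        use slot `finally` release slot

    safer :: IO Slot -> (Slot -> IO ()) -> (Slot -> IO a) -> IO a
    safer acquire release use =
        -- bracket acquires with async exceptions masked and guarantees
        -- release runs, closing that window.
        bracket acquire release use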

Given the inversion of control that servant and the streaming response body entail, it seems likely that a ThreadKilled exception could arrive at a point entirely outside the control of git-annex, leaving the P2P connection open with no way to close it.

I really dislike that this slowloris attack prevention is making me need to worry about the server threads getting cancelled at any point. That requires significantly more robust code, if it's even possible.

So, I think disabling the slowloris attack prevention may be the way to go, at least until warp is fixed to allow disabling it only after the Request is received.

Doing so will make p2phttp more vulnerable to DDOS, but as it stands, it's vulnerable to locking up due to entirely legitimate users just running a few git-annex gets. Which is much worse!

Comment by joey
comment 7

786360cdcf7f784847715ec79ef9837ada9fa649 catches an exception that the slowloris attack prevention causes. It does prevent the server locking up... but only sometimes. So the test case gets further, but eventually still locks up.

Since slowloris attack prevention can cancel the thread at any point, it seems likely that there is some other point where a resource is left un-freed.

Comment by joey
comment 6

Developed the below patch to use pauseTimeout after the Request is consumed.

Unfortunately, I then discovered that pauseTimeout does not work!

This leaves only the options of waiting for a fixed version of warp, or disabling slowloris prevention entirely, or somehow dealing with the way that the Response handler gets killed by the timeout.

diff --git a/P2P/Http/Server.hs b/P2P/Http/Server.hs
index b7f773301a..8c8ae96c06 100644
--- a/P2P/Http/Server.hs
+++ b/P2P/Http/Server.hs
@@ -40,14 +40,27 @@ import qualified Servant.Types.SourceT as S
 import qualified Data.ByteString as B
 import qualified Data.ByteString.Lazy as L
 import qualified Data.ByteString.Lazy.Internal as LI
+import qualified Network.Wai.Handler.Warp as Warp
 import Control.Concurrent.Async
 import Control.Concurrent.STM
 import Control.Concurrent
 import System.IO.Unsafe
 import Data.Either

+-- WAI middleware that disables warp's usual Slowloris protection after the
+-- Request is received. This is needed for the p2phttp server because
+-- after a client connects and makes its Request, and when the Request
+-- includes valid authentication, the server waits for a worker to become
+-- available to handle it. During that time, no traffic is being sent,
+-- which would usually trigger the Slowloris protection.
+avoidResponseTimeout :: Application -> Application
+avoidResponseTimeout app req resp = do
+   liftIO $ Warp.pauseTimeout req
+   app req resp
+
 p2pHttpApp :: TMVar P2PHttpServerState -> Application
-p2pHttpApp = serve p2pHttpAPI . serveP2pHttp
+p2pHttpApp st = avoidResponseTimeout $ serve p2pHttpAPI $ serveP2pHttp st

 serveP2pHttp :: TMVar P2PHttpServerState -> Server P2PHttpAPI
 serveP2pHttp st

Comment by joey
comment 5

So, can the slowloris attack prevention just be disabled in p2phttp, without exposing it to problems due to that attack?

Well, the slowloris attack is a DDOS that tries to open as many http connections to the server as possible, and keep them open with as little bandwidth used as possible. It does so by sending partial request headers slowly, so the server is stuck waiting to see the full request.

Given that the p2phttp server is serving large objects, and probably runs with a moderately low -J value (probably < 100), just opening that many connections to the server each requesting an object, and consuming a chunk of the response once per 30 seconds would be enough to work around Warp's protections against the slowloris attack. Which needs little enough bandwidth to be a viable attack.

The client would need authentication to do that though. A slowloris attack, on the other hand, just sends requests; it does not need to successfully authenticate.

So it would be better to disable the slowloris attack prevention only after the request has been authenticated.

warp provides pauseTimeout that can do that, but I'm not sure how to use it from inside a servant application.

Comment by joey
comment 4

Warp's Slowloris attack prevention seems to be causing this problem. I was able to get the test case to not hang by applying Warp.setTimeout 1000000000 to the warp settings.
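
For reference, a minimal sketch of where that plugs in (a toy app, not the actual p2phttp startup code); setTimeout takes seconds:

    {-# LANGUAGE OverloadedStrings #-}
    import Network.HTTP.Types (status200)
    import Network.Wai (Application, responseLBS)
    import Network.Wai.Handler.Warp (defaultSettings, runSettings, setPort, setTimeout)

    app :: Application
    app _req respond = respond $ responseLBS status200 [] "ok"

    main :: IO ()
    main = runSettings (setTimeout 1000000000 $ setPort 8080 defaultSettings) app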

I guess that, when Warp detects what it thinks is a slowloris attack, it kills the handling thread in some unusual way. Which prevents the usual STM exception from being thrown?

This also explains the InvalidChunkHeaders exception, because the http server has hung up on the client before sending the expected headers.

git-annex get is triggering the slowloris attack detection because it connects to the p2phttp server, sends a request, and then is stuck waiting some long period of time for a worker slot to become available.

Warp detects a slowloris attack by examining how much network traffic is flowing. And in this case, no traffic is flowing.

So the reason this test case triggers the problem is because it's using 1 GB files! With smaller files, the transfers happen too fast to trigger the default 30 second timeout.

Comment by joey