Recent comments posted to this site:
kill -s SIGINT is not valid syntax (at least not with procps's kill), so kill fails to do anything and a bunch of git-annex processes stack up all trying to get the same files. Probably you meant kill -s INT.
That's weird: I checked, and both the shell built-in kill that I was using and the kill from procps-ng (Ubuntu's build: procps/noble-updates,now 2:4.0.4-4ubuntu3.2 amd64) installed on my laptop accept -s SIGINT.
Anyway, thank you for investigating! I agree being susceptible to DoS attacks is not great, but better than accidentally DoS'ing ourselves in normal usage...
I wonder, would it be architecturally possible to serve multiple requests concurrently with fewer workers than requests? E.g. do some async/multitasking magic between requests? If that were the case then I suspect this issue wouldn't come up, because all requests would progress steadily instead of waiting for a potentially long time.
Disabled the slowloris protection. 
I also checked with the original test case, fixed to call kill -s INT,
and it also passed. I'm assuming this was never a bug about interruption.
Seems likely that getP2PConnection is run by serveGet, and the worker slot is allocated. Then a ThreadKilled exception arrives before the rest of serveGet's threads are started up. So the worker slot never gets freed. It's even possible that getP2PConnection is itself not cancellation safe.
So, I made all of serveGet be inside an uninterruptibleMask. That did seem to make the test case get past more slowloris cancellations than before. But, it still eventually hung.
Given the inversion of control that servant and a streaming response body entail, it seems likely that a ThreadKilled exception could arrive at a point entirely outside the control of git-annex, leaving the P2P connection open with no way to close it.
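For contrast, when a resource is acquired and released within a single scoped IO action, bracket makes the handling cancellation-safe: the release runs even if ThreadKilled lands mid-body, and an exception during the blocking acquire means nothing was acquired, so nothing leaks. A minimal sketch, with a TSem standing in for the worker slot pool (a simplification, not git-annex's actual types):

    import Control.Concurrent.STM
    import Control.Concurrent.STM.TSem
    import Control.Exception (bracket_)

    -- Acquire a slot, run the action, and free the slot even if the
    -- action's thread is killed part way through.
    withWorkerSlot :: TSem -> IO a -> IO a
    withWorkerSlot sem = bracket_
        (atomically (waitTSem sem))
        (atomically (signalTSem sem))

The trouble described above is that the streaming response body escapes any such scope, so the release has no single place to live.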
I really dislike that this slowloris attack prevention is making me need to worry about the server threads getting cancelled at any point. That requires significantly more robust code, if it's even possible.
So, I think disabling the slowloris attack prevention may be the way to go, at least until warp is fixed to allow disabling it only after the Request is received.
Doing so will make p2phttp more vulnerable to DDOS, but as it stands, it's
vulnerable to locking up due to entirely legitimate users just running
a few git-annex gets. Which is much worse!
786360cdcf7f784847715ec79ef9837ada9fa649 catches an exception that the slowloris attack prevention causes. It does prevent the server locking up... but only sometimes. So the test case gets further, but eventually still locks up.
Since slowloris attack prevention can cancel the thread at any point, it seems likely that there is some other point where a resource is left unfreed.
Developed the below patch to use pauseTimeout after the Request is consumed.
Unfortunately, I then discovered that pauseTimeout does not work!
This leaves only the options of waiting for a fixed version of warp, or disabling slowloris prevention entirely, or somehow dealing with the way that the Response handler gets killed by the timeout.
diff --git a/P2P/Http/Server.hs b/P2P/Http/Server.hs
index b7f773301a..8c8ae96c06 100644
--- a/P2P/Http/Server.hs
+++ b/P2P/Http/Server.hs
@@ -40,14 +40,27 @@ import qualified Servant.Types.SourceT as S
 import qualified Data.ByteString as B
 import qualified Data.ByteString.Lazy as L
 import qualified Data.ByteString.Lazy.Internal as LI
+import qualified Network.Wai.Handler.Warp as Warp
 import Control.Concurrent.Async
 import Control.Concurrent.STM
 import Control.Concurrent
 import System.IO.Unsafe
 import Data.Either
 
+-- WAI middleware that disables warp's usual Slowloris protection after the
+-- Request is received. This is needed for the p2phttp server because
+-- after a client connects and makes its Request, and when the Request
+-- includes valid authentication, the server waits for a worker to become
+-- available to handle it. During that time, no traffic is being sent,
+-- which would usually trigger the Slowloris protection.
+avoidResponseTimeout :: Application -> Application
+avoidResponseTimeout app req resp = do
+	liftIO $ Warp.pauseTimeout req
+	app req resp
+
 p2pHttpApp :: TMVar P2PHttpServerState -> Application
-p2pHttpApp = serve p2pHttpAPI . serveP2pHttp
+p2pHttpApp st = avoidResponseTimeout $ serve p2pHttpAPI $ serveP2pHttp st
 
 serveP2pHttp :: TMVar P2PHttpServerState -> Server P2PHttpAPI
 serveP2pHttp st
So, can the slowloris attack prevention just be disabled in p2phttp, without exposing it to problems due to that attack?
Well, the slowloris attack is a DDOS that tries to open as many http connections to the server as possible, and keep them open with as little bandwidth used as possible. It does so by sending partial request headers slowly, so the server is stuck waiting to see the full request.
Given that the p2phttp server is serving large objects, and probably runs with a moderately low -J value (likely < 100), just opening that many connections to the server each requesting an object, and consuming a chunk of the response once per 30 seconds, would be enough to work around Warp's protections against the slowloris attack. Which needs little enough bandwidth to be a viable attack.
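To put numbers on it (illustrative, assuming a 4 KiB read each time): 100 connections each consuming one 4 KiB chunk every 30 seconds totals around 13 KiB/s for the whole attack.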
The client would need authentication to do that, though. A slowloris attack just sends requests; it does not need to successfully authenticate.
So it would be better to disable the slowloris attack prevention only after the request has been authenticated.
warp provides pauseTimeout, which can do that, but I'm not sure how
to use it from inside a servant application.
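One possible shape (an untested sketch; the key and middleware names are made up): use a WAI middleware to stash the request's pauseTimeout action in the request vault, then let handlers reach it via servant's Vault combinator and run it only once authentication has been checked.

    import qualified Data.Vault.Lazy as Vault
    import Network.Wai (Middleware)
    import Network.Wai.Internal (Request(..))
    import qualified Network.Wai.Handler.Warp as Warp
    import System.IO.Unsafe (unsafePerformIO)

    -- Key under which each request's pauseTimeout action is stashed (sketch).
    pauseTimeoutKey :: Vault.Key (IO ())
    pauseTimeoutKey = unsafePerformIO Vault.newKey
    {-# NOINLINE pauseTimeoutKey #-}

    -- Middleware making pauseTimeout reachable from inside handlers.
    stashPauseTimeout :: Middleware
    stashPauseTimeout app req respond = app
        (req { vault = Vault.insert pauseTimeoutKey
                (Warp.pauseTimeout req) (vault req) })
        respond

A handler behind a Vault :> ... combinator then receives the request's vault and can look up and run the stashed action after authentication succeeds.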
Warp's Slowloris attack prevention seems to be causing this problem.
I was able to get the test case to not hang by applying
Warp.setTimeout 1000000000 to the warp settings.
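For reference, that tweak is just (a sketch; the surrounding p2phttp settings plumbing is elided):

    import qualified Network.Wai.Handler.Warp as Warp

    -- setTimeout is in seconds, so this is roughly 31 years; the
    -- slowloris timeout is effectively never reached.
    settings :: Warp.Settings
    settings = Warp.setTimeout 1000000000 Warp.defaultSettings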
I guess that, when Warp detects what it thinks is a slowloris attack, it kills the handling thread in some unusual way. Which prevents the usual STM exception from being thrown?
This also explains the InvalidChunkHeaders exception, because the http server has hung up on the client before sending the expected headers.
git-annex get is triggering the slowloris attack detection because
it connects to the p2phttp server, sends a request, and then is stuck
waiting some long period of time for a worker slot to become available.
Warp detects a slowloris attack by examining how much network traffic is flowing. And in this case, no traffic is flowing.
So the reason this test case triggers the problem is that it's using 1 GB files! With smaller files, the transfers happen too fast to trigger the default 30 second timeout.
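For example (illustrative numbers): a 1 GB transfer at ~30 MB/s holds its worker slot for over 30 seconds, so any get queued behind it trips the default timeout, while a 10 MB file clears the queue in well under a second.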
The hang happens here:
-- Wait for everything to be transferred before
-- stopping the annexworker. The finalv will usually
-- be written to at the end. If the client disconnects
-- early that does not happen, so catch STM exception.
alltransferred <- isRight
	<$> tryNonAsync (liftIO $ atomically $ takeTMVar finalv)
I think what is happening is that finalv never gets filled, but for whatever reason STM is also not detecting a deadlock, so this does not fail with an exception and waits forever.
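That would make sense: GHC's deadlock detection is driven by the garbage collector, so BlockedIndefinitelyOnSTM is only thrown when no other live thread can still reach the TMVar. A standalone illustration of the mechanism (my assumption about what's going on, not code from git-annex):

    import Control.Concurrent
    import Control.Concurrent.STM

    main :: IO ()
    main = do
        finalv <- newEmptyTMVarIO :: IO (TMVar ())
        -- This thread keeps finalv reachable but never gets around to
        -- writing it (threadDelay maxBound is on the order of 292000
        -- years on a 64 bit system).
        _ <- forkIO $ do
            threadDelay maxBound
            atomically (putTMVar finalv ())
        -- Blocks forever with no BlockedIndefinitelyOnSTM, because
        -- finalv is still reachable from the forked thread.
        atomically (takeTMVar finalv)

So if any server thread still holds a reference to finalv without ever writing to it, the takeTMVar above waits forever.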
kill -s SIGINT is not valid syntax (at least not with procps's kill), so
kill fails to do anything and a bunch of git-annex processes stack up all
trying to get the same files. Probably you meant kill -s INT.
With that said, the busted test case does work in exposing a problem, since the git-annex get processes hang.
This behaves the same:
for x in $(seq 1 5); do git annex get & done
That's without anything interrupting git-annex get at any point.
This error is displayed by some of the git-annex get processes, and once this has happened as many times as the number of jobs, the server is hung:
HttpExceptionRequest Request {
  host = "localhost"
  port = 3001
  secure = False
  requestHeaders = [("Accept","application/octet-stream")]
  path = "/git-annex/8dd1a380-3785-4285-b93d-994e1ccb9fbf/v4/key/SHA256E-s1073741824--52fc7ce3067ad69f3989f7fef817670096f00eab7721884fe606d17b9215d6f5.bin"
  queryString = "?clientuuid=2ab2859b-d423-4427-bac2-553e18c02197&associatedfile=test1.bin"
  method = "GET"
  proxy = Nothing
  rawBody = False
  redirectCount = 10
  responseTimeout = ResponseTimeoutDefault
  requestVersion = HTTP/1.1
  proxySecureMode = ProxySecureWithConnect
}
InvalidChunkHeaders
So this seems very similar to the bug that f2fed42a090e081bf880dcacc9a25bfa8a0f7d8f was supposed to fix. Same InvalidChunkHeaders exception indicating the http server response thread probably crashed.
And I've verified that when this happens, serveGet's waitfinal starts and never finishes, which is why the job slot remains in use.
BTW, InvalidChunkHeaders is an http-client exception, so this might involve a problem at the http layer, i.e. with http-client or warp. Looking in http-client, it is thrown in 3 situations. 2 are when a read from the server yields 0 bytes. The 3rd is when a line is read from the server, and the result cannot be parsed as hexadecimal. So it seems likely that the http server is crashing in the middle of servicing a request. Possibly due to a bug in the http stack.
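Roughly, the chunk-size parse in question looks like this (a simplified sketch of the idea, not http-client's actual code):

    import Numeric (readHex)

    -- A chunked transfer-encoding header line is a hexadecimal size,
    -- optionally followed by extensions. Reading the leading hex digits
    -- of an empty or garbage line (e.g. from a server that crashed mid
    -- response) yields no parse, which http-client surfaces as
    -- InvalidChunkHeaders.
    parseChunkSize :: String -> Maybe Int
    parseChunkSize line = case readHex line of
        ((n, _):_) -> Just n
        []         -> Nothing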
This is extremely similar to the bug that f2fed42a090e081bf880dcacc9a25bfa8a0f7d8f was supposed to fix. But that had a small number of gets hang, without any interruptions being needed to cause it. So I think this is different.
There was also the similar 1c67f2310a7ca3e4fce183794f0cff2f4f5d1efb where an interrupted drop caused later hangs.