Recent comments posted to this site:
After fixing the other bug, I have successfully run the test for several hours without any problems.
Using DebugLocks, I found that the deadlock is in checkvalidity,
the second time it calls putTMVar endv ().
That was added in 7bd616e169827568c4ca6bc6e4f8ae5bf796d2d8 "a bugfix to serveGet, it hung at the end".
Looks like a race between checkvalidity and waitfinal, which both fill endv. waitfinal does not deadlock when endv is already full, but checkvalidity does.
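
To make that concrete, here is a minimal standalone sketch (not the actual checkvalidity/waitfinal code; endv here is just an illustrative TMVar) of the difference between putTMVar, which retries when the TMVar is already full, and the non-blocking tryPutTMVar:

    -- Minimal sketch: one TMVar filled twice, once blocking, once non-blocking.
    import Control.Concurrent.STM

    main :: IO ()
    main = do
      endv <- newEmptyTMVarIO
      -- The first fill succeeds.
      atomically $ putTMVar endv ()
      -- A second blocking putTMVar would retry until some other thread
      -- empties endv; if nothing ever does, the thread hangs, and when the
      -- runtime can prove nothing can wake it, it throws
      -- BlockedIndefinitelyOnSTM instead.
      -- atomically $ putTMVar endv ()
      -- The non-blocking variant just reports that endv was already full.
      ok <- atomically $ tryPutTMVar endv ()
      print ok  -- prints False

So a filler that uses the non-blocking form tolerates endv already being full, while a second blocking putTMVar does not.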
I saw this bug with git-annex built using Haskell packages from current Debian unstable.
On a hunch, I tried a stack build, and it does not stall. However, I am
seeing this from the http server at about the same frequency as the stall,
and occurring during the git-annex get:
thread blocked indefinitely in an STM transaction
And at the same time, this is reported on the client side:
get 27 (from origin...)
HttpExceptionRequest Request {
host = "localhost"
port = 9417
secure = False
requestHeaders = [("Accept","application/octet-stream")]
path = "/git-annex/a697daef-f8c3-4e64-a3e0-65927e36d06b/v4/k
queryString = "?clientuuid=9bc0478c-a0ff-4159-89ab-14c13343beb9&ass
method = "GET"
proxy = Nothing
rawBody = False
redirectCount = 10
responseTimeout = ResponseTimeoutDefault
requestVersion = HTTP/1.1
proxySecureMode = ProxySecureWithConnect
}
IncompleteHeaders
ok
(I assume that it succeeded because it did an automatic retry when the first download was incomplete.)
I also tried using the stack build for the server, and the cabal build for the client, with the same result. With the cabal build for the server and stack build for the client, it stalls as before.
So it's a bug on the server side, and whatever it is causes one of the threads to get killed in a way that causes another STM transaction to deadlock. And the runtime happens to detect the deadlock and resolve it when built with stack.
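
As an illustration of that failure mode, here is a minimal standalone sketch (made-up names, not the server's code) of how killing the only thread that would ever fill a TMVar leaves another thread's STM transaction blocked forever, which the runtime reports as "thread blocked indefinitely in an STM transaction" when it is able to detect that:

    import Control.Concurrent (forkIO, killThread, threadDelay)
    import Control.Concurrent.STM

    main :: IO ()
    main = do
      v <- newEmptyTMVarIO
      -- The only thread that would ever fill v.
      filler <- forkIO $ do
        threadDelay 1000000
        atomically $ putTMVar v ()
      killThread filler
      -- With the filler gone this transaction can never complete; if the
      -- runtime can prove that, it throws BlockedIndefinitelyOnSTM here,
      -- otherwise the thread simply hangs.
      atomically $ takeTMVar v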
Fixed the problem with interrupted git-annex drop.
Opened a new bug report about sometimes stalling git-annex get: get from p2phttp sometimes stalls
I think I've fully addressed this bug report now, so will close it.
The deletion could be handled by a cron job that the user is responsible for setting up, which avoids needing to configure a time limit in git-annex, and also avoids the question of what git-annex command(s) would handle the cleanup.
Agreed, that makes sense.
An alternative way to handle this would be to use the "appendonly" config of git-annex p2phttp (and git-annex-shell has something similar). Then the repository would refuse to drop. And instead you could have a cron job that uses git-annex unused to drop old objects.
While realistically most force drops probably would be unused files, those two things aren't necessarily the same.
I think there are some benefits to that path: it makes explicit to the user that the data they wanted to drop is not immediately going away from the server.
I think I would deliberately want this to be invisible to the user, since I wouldn't want anyone to actively start relying on it.
Which might be important for legal reasons (although the prospect of backups of annexed files makes it hard to be sure if a server has really deleted something anyway).
That's a tradeoff for sure, but the expectation should already be that a hosted service like a Forgejo-aneksajo instance will retain backups at least for disaster recovery purposes. But that's on the admin(s) to communicate, and within a personal setting it doesn't matter at all.
And if the repository had a disk quota, this would make explicit to the user why dropping content from it didn't free up quota.
Actually for that reason I would not count this soft-deleted data towards quotas for my own purposes.
A third approach would be to have a config setting that makes dropped objects be instead moved to a remote. So the drop would succeed, but whereis would indicate that the object was being retained there. Then a cron job on the remote could finish the deletions.
I like this! Considering that such a "trash bin" (special) remote could be initialized with --private (right?), it would be possible to make it fully invisible to the user too, while indeed being much more flexible. I suppose the cron job would then be something like git annex drop --from trash-bin --all --not --accessedwithin=30d, assuming that moving it there counts as "accessing" and no background job on the server accesses it afterwards (maybe an additional matching option for mtime or ctime instead of atime would be useful here?). This feels very much git-annex'y 🙂
A third approach would be to have a config setting that makes dropped objects be instead moved to a remote. So the drop would succeed, but whereis would indicate that the object was being retained there. Then a cron job on the remote could finish the deletions.
This would not be significantly more heavyweight than just moving to a directory, if you used e.g. a directory special remote. And it's also a lot more flexible.
Of course, this would make dropping take longer than usual, depending on how fast the object could be moved to the remote. If it were slow, there would be no way to convey progress back to the user without a lot more complication than this feature warrants.
Open to your thoughts on these alternatives..
The deletion could be handled by a cron job that the user is responsible for setting up, which avoids needing to configure a time limit in git-annex, and also avoids the question of what git-annex command(s) would handle the cleanup.
An alternative way to handle this would be to use the "appendonly" config
of git-annex p2phttp (and git-annex-shell has something similar). Then
the repository would refuse to drop. And instead you could have a cron job
that uses git-annex unused to drop old objects. This would need some way
to only drop unused objects after some period of time.
I think there are some benefits to that path: it makes explicit to the user that the data they wanted to drop is not immediately going away from the server. Which might be important for legal reasons (although the prospect of backups of annexed files makes it hard to be sure if a server has really deleted something anyway). And if the repository had a disk quota, this would make explicit to the user why dropping content from it didn't free up quota.
(I think it would also be possible to (ab)use the annex.secure-erase-command
to instead move objects to the directory. Probably not a good idea,
especially because there's no guarantee that command is only run on
complete annex objects.)
I've landed a complete fix for this. The server no longer locks up when I run the test case for a long time.
Additionally, there was a bug that caused p2phttp to leak lock file descriptors, which gets triggered by the same test case. I've fixed that.
There are two problems I noticed in testing though.
git-annex get sometimes slows down to just bytes per second,
or entirely stalls. This is most apparent with -J10, but I've seen it
happen even when there is no concurrency or other clients.
This should probably be treated as a separate bug, but it does
cause the test case to eventually hang, unless git-annex is configured
to do stall detection. The server keeps responding to
other requests though.
Running git-annex drop and interrupting it at the wrong moment
while it's locking content on the server seems to cause a P2P protocol
worker to not get returned to the worker pool. When it happens enough
times, this can cause the server to stop responding to new requests.
Which seems closely related to this bug.
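
For context, here is a hedged sketch (the Worker and WorkerPool types and withWorker are made up, not git-annex's actual worker pool code) of the exception-safe pattern that avoids this kind of leak: the worker is taken and returned inside bracket, so even an asynchronous exception during the request puts it back; without that, an interrupted request leaks the worker, and once the pool is empty every new request blocks in retry.

    import Control.Concurrent.STM
    import Control.Exception (bracket)

    type Worker = Int            -- stand-in for a P2P protocol worker
    type WorkerPool = TVar [Worker]

    -- Take a worker, run the action, and always return the worker,
    -- even if the action is interrupted by an asynchronous exception.
    withWorker :: WorkerPool -> (Worker -> IO a) -> IO a
    withWorker pool = bracket takeWorker returnWorker
      where
        takeWorker = atomically $ do
          ws <- readTVar pool
          case ws of
            []       -> retry    -- block until a worker is free
            (w:rest) -> writeTVar pool rest >> pure w
        returnWorker w = atomically $ modifyTVar' pool (w:)

    main :: IO ()
    main = do
      pool <- newTVarIO [1, 2]
      withWorker pool $ \w -> putStrLn ("using worker " ++ show w)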
I think that --debug output from the p2phttp server would be helpful in narrowing down whether there is a particular operation that causes this hang.
I should have been a bit clearer: I also saw the deadlock sometimes with concurrent gets, sometimes with drops, and sometimes with a mix of both, so there wasn't one particular operation that seemed to be the issue.
-J2 also seems quite low though.
This is for Forgejo-aneksajo, where there is still one p2phttp process being started per repository. Since there could potentially be thousands of concurrent processes at any given time, I thought it might be wise to start with the bare minimum by default. Due to how p2phttp and proxying are supposed to interact, I've also realized that the current integration is not working as it should (https://codeberg.org/forgejo-aneksajo/forgejo-aneksajo/issues/96) and that I probably won't be able to make use of the single p2phttp process for all repositories (because of ambiguity with authorization when there are multiple different repositories with differing permissions that proxy for the same remote).
Pushed a preliminary fix in the p2phttp_deadlock branch.
That has some known problems, documented in the commit. But it does avoid p2phttp locking up like this.