Recent comments posted to this site:

comment 3

The deletion could be handled by a cron job that the user is responsible for setting up, which avoids needing to configure a time limit in git-annex, and also avoids the question of what git-annex command(s) would handle the clean up.

Agreed, that makes sense.

An alternative way to handle this would be to use the "appendonly" config of git-annex p2phttp (and git-annex-shell has something similar). Then the repository would refuse to drop. And instead you could have a cron job that uses git-annex unused to drop old objects.

While realistically most force drops probably would be unused files, those two things aren't necessarily the same.

I think there are some benefits to that path: it makes explicit to the user that the data they wanted to drop is not immediately going away from the server.

I think I would deliberately want this to be invisible to the user, since I wouldn't want anyone to actively start relying on it.

Which might be important for legal reasons (although the prospect of backups of annexed files makes it hard to be sure if a server has really deleted something anyway).

That's a tradeoff for sure, but the expectation should already be that a hosted service like a Forgejo-aneksajo instance will retain backups at least for disaster recovery purposes. But that's on the admin(s) to communicate, and within a personal setting it doesn't matter at all.

And if the repository had a disk quota, this would make explicit to the user why dropping content from it didn't free up quota.

Actually for that reason I would not count this soft-deleted data towards quotas for my own purposes.

A third approach would be to have a config setting that makes dropped objects be instead moved to a remote. So the drop would succeed, but whereis would indicate that the object was being retained there. Then a cron job on the remote could finish the deletions.

I like this! Considering that such a "trash bin" (special) remote could be initialized with --private (right?), it would be possible to make it fully invisible to the user too, while indeed being much more flexible. I suppose the cron job would then be something like git annex drop --from trash-bin --all --not --accessedwithin=30d, assuming that moving it there counts as "accessing" and no background job on the server accesses it afterwards (maybe an additional matching option for mtime or ctime instead of atime would be useful here?). This feels very much git-annex'y 🙂
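For concreteness, the cron part might look something like the crontab entry below. This is only a sketch: the repository path, the schedule, and the trash-bin remote name are made up, and it relies on --accessedwithin behaving the way I hoped above.

    # Hypothetical daily cleanup, run in the server repository: drop
    # everything from the "trash-bin" remote that has not been accessed
    # within the last 30 days.
    0 3 * * * cd /srv/git/annex-repo.git && git annex drop --from trash-bin --all --not --accessedwithin=30d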

Comment by matrss
comment 2

A third approach would be to have a config setting that makes dropped objects be instead moved to a remote. So the drop would succeed, but whereis would indicate that the object was being retained there. Then a cron job on the remote could finish the deletions.

This would not be significantly more heavyweight than just moving to a directory, if you used eg a directory special remote. And it's also a lot more flexible.
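As a rough illustration (the remote name and directory path are made up for the example), such a remote could be set up on the server with:

    # A directory special remote to hold dropped objects until a cron job
    # cleans them up later. Name and path are examples only.
    git annex initremote trash-bin type=directory directory=/srv/annex-trash encryption=none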

Of course, this would make dropping take longer than usual, depending on how fast the object could be moved to the remote. If it were slow, there would be no way to convey progress back to the user without a lot more complication than this feature warrants.

Open to your thoughts on these alternatives..

Comment by joey
comment 1

The deletion could be handled by a cron job that the user is responsible for setting up, which avoids needing to configure a time limit in git-annex, and also avoids the question of what git-annex command(s) would handle the clean up.
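For example, assuming the dropped objects end up in some directory like /srv/annex-trash (the path and retention period here are purely illustrative), the cron job could be as simple as:

    # Permanently delete anything that has sat in the trash directory
    # for more than 30 days.
    0 4 * * * find /srv/annex-trash -type f -mtime +30 -delete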

An alternative way to handle this would be to use the "appendonly" config of git-annex p2phttp (and git-annex-shell has something similar). Then the repository would refuse to drop. And instead you could have a cron job that uses git-annex unused to drop old objects. This would need some way to only drop unused objects after some period of time.
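A sketch of that kind of cron job, run in the server repository, might be the two commands below. Note it does not have the age check mentioned above, which is the missing piece.

    # Find objects not referred to by any branch, then drop them.
    # Restricting this to objects that became unused some time ago
    # would need additional tooling.
    git annex unused
    git annex dropunused all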

I think there are some benefits to that path: it makes explicit to the user that the data they wanted to drop is not immediately going away from the server. Which might be important for legal reasons (although the prospect of backups of annexed files makes it hard to be sure if a server has really deleted something anyway). And if the repository had a disk quota, this would make explicit to the user why dropping content from it didn't free up quota.

(I think it would also be possible to (ab)use the annex.secure-erase-command to instead move objects to the directory. Probably not a good idea, especially because there's no guarantee that command is only run on complete annex objects.)

Comment by joey
comment 7

I've landed a complete fix for this. The server no longer locks up when I run the test case for a long time.

Additionally, there was a bug that caused p2phttp to leak lock file descriptors, which gets triggered by the same test case. I've fixed that.

There are two problems I noticed in testing though.

git-annex get sometimes slows down to just bytes per second, or entirely stalls. This is most apparent with -J10, but I've seen it happen even when there is no concurrency or other clients. This should probably be treated as a separate bug, but it does cause the test case to eventually hang, unless git-annex is configured to do stall detection. The server keeps responding to other requests though.
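For reference, stall detection can be enabled on the client side with the annex.stalldetection git config; the threshold below is just an example value.

    # Consider a transfer stalled if less than 1mb moves in 30 seconds;
    # setting it to "true" would use git-annex's built-in heuristic instead.
    git config annex.stalldetection "1mb/30s"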

Running git-annex drop and interrupting it at the wrong moment while it's locking content on the server seems to cause a P2P protocol worker to not get returned to the worker pool. When it happens enough times, this can cause the server to stop responding to new requests. Which seems closely related to this bug.

Comment by joey
comment 6

I think that --debug output from the p2phttp server would be helpful in narrowing down if there is a particular operation that causes this hang.

I should have been a bit more clear: I also saw the deadlock sometimes with concurrent gets, sometimes with drops, and sometimes with a mix of both, so there wasn't one particular operation that seemed to be the issue.

-J2 also seems quite low though.

This is for Forgejo-aneksajo, where there is still one p2phttp process being started per repository. Since there could potentially be thousands of concurrent processes at any given time, I thought it might be wise to start with the bare minimum by default. Due to how p2phttp and proxying are supposed to interact, I've also realized that the current integration is not working as it should (https://codeberg.org/forgejo-aneksajo/forgejo-aneksajo/issues/96) and that I probably won't be able to make use of the single p2phttp process for all repositories (because of ambiguity with authorization when there are multiple different repositories with differing permissions that proxy for the same remote).

Comment by matrss
comment 5

Pushed a preliminary fix in the p2phttp_deadlock branch.

That has some known problems, documented in the commit. But it does avoid p2phttp locking up like this.

Comment by joey
comment 4

Tested a modified p2phttp that uses 2 worker pools, one for the P2P client side and one for server side. This means that -J2 actually runs up to 4 threads, although with only 2 capabilities, so the change won't affect CPU load. So I tried with -J2 and 4 clients running the loop.

It still stalls much as before, though perhaps it takes a bit longer to happen.

It still seems likely that the two workers used per http request are the root of the problem. When there are more than annex.jobs concurrent requests, each http response handler calls inAnnexWorker, and so one will block. If the corresponding localConnection successfully gets a worker, that means one of the other localConnections is blocked, resulting in a running http response handler whose corresponding localConnection is blocked. The inverse also seems possible.

If 2 worker pools is not the solution, it seems it would need to instead be solved by rearchitecting the http server to not have that separation. Or to ensure that getP2PConnection doesn't return until the localConnection has allocated its worker. I'll try that next.

Comment by joey
comment 3

Aha! We have two things both calling inAnnexWorker:

  1. localConnection, handling the P2P protocol client side of things inside the http server. (Or in the case of a proxy, other functions that do a similar thing.)
  2. http response handlers, which run the P2P protocol server side.

For each http request, both of these run asynchronously.

So, with -J2, if two http requests happen at the same time, and localConnection wins both races, the two worker threads are both stalled waiting for a response from the P2P server side. Which is blocked waiting for a worker thread. Or perhaps both of the http response handlers win, similar deadlock.

Maybe it could even happen that the localConnection for one request wins, as well as the response handler for the other request?

(And higher -J numbers would still have the same problem, when there are more clients. The docs for -J are also a bit wrong: they say that the http server uses 1 thread itself, but it can really use any number of threads, since localConnection does run inAnnexWorker in an async action.)

Anyway, if this analysis is correct, the fix is surely to have 2 worker thread pools, one for the P2P protocol client side, and one for the P2P protocol server side.

Comment by joey
comment 2

I was able to reproduce this fairly quickly with 2 clones, each running the loop on the same 5 files, each of which I made 1 MB in size.

Both hung on get of different files. The tail of the --debug output:

[2025-11-05 13:14:06.255833094] (P2P.IO) [http server] [ThreadId 914] P2P > DATA 1048576
[2025-11-05 13:14:06.255872078] (P2P.IO) [http client] [ThreadId 912] P2P < DATA 1048576
[2025-11-05 13:14:06.262783513] (P2P.IO) [http server] [ThreadId 914] P2P > VALID
[2025-11-05 13:14:06.262897622] (P2P.IO) [http client] [ThreadId 912] P2P < VALID
[2025-11-05 13:14:06.262956555] (P2P.IO) [http client] [ThreadId 912] P2P > SUCCESS
[2025-11-05 13:14:06.263008765] (P2P.IO) [http server] [ThreadId 914] P2P < SUCCESS
[2025-11-05 13:14:06.264030615] (P2P.IO) [http client] [ThreadId 883] P2P > CHECKPRESENT SHA256E-s1048576--06477b9c41f04aaa5c09af0adbd093506435193c868ef56a5510eff0d3c9fc2b
[2025-11-05 13:14:06.264088566] (P2P.IO) [http server] [ThreadId 916] P2P < CHECKPRESENT SHA256E-s1048576--06477b9c41f04aaa5c09af0adbd093506435193c868ef56a5510eff0d3c9fc2b
[2025-11-05 13:14:06.264183098] (P2P.IO) [http server] [ThreadId 916] P2P > SUCCESS
[2025-11-05 13:14:06.264219447] (P2P.IO) [http client] [ThreadId 883] P2P < SUCCESS
[2025-11-05 13:14:06.265125295] (P2P.IO) [http client] [ThreadId 920] P2P > GET 0 3 SHA256E-s1048576--06477b9c41f04aaa5c09af0adbd093506435193c868ef56a5510eff0d3c9fc2b
[2025-11-05 13:14:06.265177174] (P2P.IO) [http server] [ThreadId 921] P2P < GET 0 3 SHA256E-s1048576--06477b9c41f04aaa5c09af0adbd093506435193c868ef56a5510eff0d3c9fc2b
[2025-11-05 13:14:06.265598603] (P2P.IO) [http server] [ThreadId 921] P2P > DATA 1048576
[2025-11-05 13:14:06.265639962] (P2P.IO) [http client] [ThreadId 920] P2P < DATA 1048576
[2025-11-05 13:14:06.274452543] (P2P.IO) [http server] [ThreadId 921] P2P > VALID
[2025-11-05 13:14:06.274505514] (P2P.IO) [http client] [ThreadId 920] P2P < VALID
[2025-11-05 13:14:06.274551963] (P2P.IO) [http client] [ThreadId 920] P2P > SUCCESS
[2025-11-05 13:14:06.274594385] (P2P.IO) [http server] [ThreadId 921] P2P < SUCCESS
[2025-11-05 13:14:06.276689062] (P2P.IO) [http client] [ThreadId 883] P2P > CHECKPRESENT SHA256E-s1048576--81386bfd2b7880ed397001ea5325ee25cfa69cf46d097b7a69b0a31b5e990f8d
[2025-11-05 13:14:06.276783864] (P2P.IO) [http server] [ThreadId 924] P2P < CHECKPRESENT SHA256E-s1048576--81386bfd2b7880ed397001ea5325ee25cfa69cf46d097b7a69b0a31b5e990f8d
[2025-11-05 13:14:06.276799023] (P2P.IO) [http client] [ThreadId 892] P2P > CHECKPRESENT SHA256E-s1048576--06477b9c41f04aaa5c09af0adbd093506435193c868ef56a5510eff0d3c9fc2b
[2025-11-05 13:14:06.276912961] (P2P.IO) [http server] [ThreadId 924] P2P > SUCCESS
[2025-11-05 13:14:06.276939743] (P2P.IO) [http client] [ThreadId 883] P2P < SUCCESS
[2025-11-05 13:14:06.276944802] (P2P.IO) [http server] [ThreadId 926] P2P < CHECKPRESENT SHA256E-s1048576--06477b9c41f04aaa5c09af0adbd093506435193c868ef56a5510eff0d3c9fc2b
[2025-11-05 13:14:06.277069411] (P2P.IO) [http server] [ThreadId 926] P2P > SUCCESS
[2025-11-05 13:14:06.277111522] (P2P.IO) [http client] [ThreadId 892] P2P < SUCCESS

A second hang happened with the loops each running on the same 2 files. This time, one clone was doing "get 1" and the other clone "drop 1 (locking origin...)" when they hung.

[2025-11-05 13:28:03.931334099] (P2P.IO) [http server] [ThreadId 8421] P2P > SUCCESS
[2025-11-05 13:28:03.931380284] (P2P.IO) [http client] [ThreadId 8424] P2P < SUCCESS
[2025-11-05 13:28:03.932204439] (P2P.IO) [http client] [ThreadId 8424] P2P > UNLOCKCONTENT
[2025-11-05 13:28:03.932251987] (P2P.IO) [http server] [ThreadId 8421] P2P < UNLOCKCONTENT
[2025-11-05 13:28:04.252596865] (P2P.IO) [http client] [ThreadId 8427] P2P > CHECKPRESENT SHA256E-s1048576--4ad843113f3ee799f2ff834a80bb2aaff35d5babd68395339406671c50e99f6a
[2025-11-05 13:28:04.252748136] (P2P.IO) [http server] [ThreadId 8429] P2P < CHECKPRESENT SHA256E-s1048576--4ad843113f3ee799f2ff834a80bb2aaff35d5babd68395339406671c50e99f6a
[2025-11-05 13:28:04.252918516] (P2P.IO) [http server] [ThreadId 8429] P2P > SUCCESS
[2025-11-05 13:28:04.253026869] (P2P.IO) [http client] [ThreadId 8427] P2P < SUCCESS

A third hang, again with 2 files, and both hung on "drop 1 (locking origin...)":

[2025-11-05 13:34:34.413288012] (P2P.IO) [http client] [ThreadId 16147] P2P > CHECKPRESENT SHA256E-s1048576--c644050a65e9e93a43f5b21e1188e4e7a406057d84102c78fce0007ceb875c69
[2025-11-05 13:34:34.413341843] (P2P.IO) [http server] [ThreadId 16172] P2P < CHECKPRESENT SHA256E-s1048576--c644050a65e9e93a43f5b21e1188e4e7a406057d84102c78fce0007ceb875c69
[2025-11-05 13:34:34.413415351] (P2P.IO) [http server] [ThreadId 16172] P2P > SUCCESS
[2025-11-05 13:34:34.413442692] (P2P.IO) [http client] [ThreadId 16147] P2P < SUCCESS
[2025-11-05 13:34:34.414251817] (P2P.IO) [http client] [ThreadId 16176] P2P > GET 0 2 SHA256E-s1048576--c644050a65e9e93a43f5b21e1188e4e7a406057d84102c78fce0007ceb875c69
[2025-11-05 13:34:34.4142963] (P2P.IO) [http server] [ThreadId 16177] P2P < GET 0 2 SHA256E-s1048576--c644050a65e9e93a43f5b21e1188e4e7a406057d84102c78fce0007ceb875c69
[2025-11-05 13:34:34.414731756] (P2P.IO) [http server] [ThreadId 16177] P2P > DATA 1048576
[2025-11-05 13:34:34.414777692] (P2P.IO) [http client] [ThreadId 16176] P2P < DATA 1048576
[2025-11-05 13:34:34.421258237] (P2P.IO) [http server] [ThreadId 16177] P2P > VALID
[2025-11-05 13:34:34.421322858] (P2P.IO) [http client] [ThreadId 16176] P2P < VALID
[2025-11-05 13:34:34.421358204] (P2P.IO) [http client] [ThreadId 16176] P2P > SUCCESS
[2025-11-05 13:34:34.421390053] (P2P.IO) [http server] [ThreadId 16177] P2P < SUCCESS
[2025-11-05 13:34:34.764709623] (P2P.IO) [http client] [ThreadId 16188] P2P > LOCKCONTENT SHA256E-s1048576--4ad843113f3ee799f2ff834a80bb2aaff35d5babd68395339406671c50e99f6a

Here the P2P protocol client inside the http server got a worker thread, but then apparently the http response handler stalled out. That's different from the other 2 debug logs where the protocol client was able to send a response. I think in the other 2 debug logs, the P2P protocol client then stalls getting a worker thread.

Comment by joey
comment 1

I think that --debug output from the p2phttp server would be helpful in narrowing down if there is a particular operation that causes this hang.

p2phttp has a pool of worker threads, so if a thread stalls out, or potentially crashes in some way that is not handled, it can result in all subsequent operations hanging. 91dbcf0b56ba540a33ea5a79ed52f33e82f4f61b is one recent example of that; I remember there were some similar problems when initially developing it.

-J2 also seems quite low though. With the http server itself using one of those threads, all requests get serialized through the second thread. If there is any situation where request A needs request B to be made and finish before it can succeed, that would deadlock.
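For anyone trying to reproduce this, a trace like the ones quoted in the other comments can be captured by starting the server with --debug and a higher -J, along these lines (the job count and log file name are arbitrary):

    # Run the p2phttp server with 4 worker threads, saving the P2P
    # protocol trace (which --debug writes to stderr) to a file.
    git annex p2phttp -J4 --debug 2>p2phttp-debug.log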

Comment by joey