Originally I was trying to reproduce datalad/issues/3653, assuming that multiple files pointed to the same key. That was not the case, but my attempt revealed another bug: annex's inability to "obtain" files in parallel when multiple of them point to the same key:
Setup of original repo:
```
/tmp > mkdir src; (cd src; git init; git annex init; dd if=/dev/zero of=1 count=1024 bs=1024; for f in {2..10}; do cp 1 $f; done ; git annex add *; git commit -m added; )
Initialized empty Git repository in /tmp/src/.git/
init (scanning for unlocked files...)
ok
(recording state in git...)
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00106651 s, 983 MB/s
add 1
ok
add 10
ok
add 2
ok
add 3
ok
add 4
ok
add 5
ok
add 6
ok
add 7
ok
add 8
ok
add 9
ok
(recording state in git...)
[master (root-commit) 63b1163] added
 10 files changed, 10 insertions(+)
 create mode 120000 1
 create mode 120000 10
 create mode 120000 2
 create mode 120000 3
 create mode 120000 4
 create mode 120000 5
 create mode 120000 6
 create mode 120000 7
 create mode 120000 8
 create mode 120000 9
```
And here is what happens when we try to get the same key in parallel:
```
/tmp > git clone src dst; (cd dst; git annex get -J 5 *; )
Cloning into 'dst'...
done.
(merging origin/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
get 2 (from origin...) (checksum...)
git-annex: thread blocked indefinitely in an STM transaction
failed
git-annex: thread blocked indefinitely in an MVar operation
```
It felt like an old issue, but I failed to find a trace of it in a quick lookup.
Reproduced.
After building git-annex with the DebugLocks flag, I got this:
Which points to pickRemote and ensureOnlyActionOn. But pickRemote does no STM actions when there's only 1 remote, so it must really be the latter.
Also, I notice that when 5 files to get are provided, it crashes, but with fewer than 5, it succeeds. Even this trivial case crashes:
```
git annex get -J1 1 2
```
Ok, I see the bug. ensureOnlyActionOn does an STM retry if it finds in the activekeys map that some other thread is operating on the same key. But there is no running STM transaction that will update the map. So, STM detects that the retry would deadlock.
It's not really a deadlock, because once the other thread finishes, it will update the map to remove itself. But STM can't know that. The solution will be to not use STM for waiting on the other thread.
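A minimal standalone illustration of the failure mode (nothing to do with git-annex's actual code): a retry on a TVar that no other thread can ever write makes the GHC runtime throw BlockedIndefinitelyOnSTM, which is the "thread blocked indefinitely in an STM transaction" message above.

```haskell
import Control.Concurrent.STM

main :: IO ()
main = do
    tv <- newTVarIO False
    -- No other thread holds a reference to tv, so nothing can ever
    -- write to it; the runtime detects that this retry can never
    -- wake up and throws BlockedIndefinitelyOnSTM.
    atomically $ do
        done <- readTVar tv
        if done then return () else retry
```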
Hmm, I tried the obvious approach, using an MVar semaphore to wait for the thread, but that just resulted in more STM and MVar deadlocks.
I don't understand why, even after puzzling over it for two hours. I did instrument all calls to atomically, and it looks, unfortunately, like the one in finishCommandActions is deadlocking. If the problem extends beyond ensureOnlyActionOn, it may be much more complicated.
Patch that does not work and I don't know why.
Tried going back to c04b2af3e1a8316e7cf640046ad0aa68826650ed, which is before the separation of perform and cleanup stages. The same code was in onlyActionOn back then. And the test case does not crash.
So, that gives a good commit to start a bisection. Which will probably find the bug was introduced in the separation of perform and cleanup stages, because that added a lot of STM complexity.
(Have to cherry-pick 018b5b81736a321f3eb9762a2afb7124e19dbdf9 onto those old commits to make them build with current libraries.)
Simplified version of the patch above, which converts ensureOnlyActionOn to not use STM at all and is significantly simpler.
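For illustration, a sketch of the general shape such a non-STM ensureOnlyActionOn could take (the Key stand-in and the map layout here are assumptions, not the actual patch):

```haskell
import Control.Concurrent.MVar
import Control.Exception (finally)
import qualified Data.Map as M

type Key = String  -- stand-in; git-annex has a real Key type

-- Map from keys to MVars that get filled when the action on the
-- key finishes.
type ActiveKeys = MVar (M.Map Key (MVar ()))

-- Wait (in plain IO, no STM retry) until no other thread is acting
-- on the key, then run the action while registered as its owner.
ensureOnlyActionOn :: ActiveKeys -> Key -> IO a -> IO a
ensureOnlyActionOn active k action = go
  where
    go = do
        r <- modifyMVar active $ \m -> case M.lookup k m of
            Nothing -> do
                done <- newEmptyMVar
                return (M.insert k done m, Right done)
            Just done -> return (m, Left done)
        case r of
            -- Another thread owns the key; block until it signals
            -- completion, then try to register again.
            Left done -> readMVar done >> go
            Right done -> action `finally` do
                modifyMVar_ active (return . M.delete k)
                putMVar done ()
```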
With this patch, the test case still STM deadlocks. So this seems to be proof that the actual problem is not in ensureOnlyActionOn.
finishCommandActions is reaching the retry case, and STM deadlocks there. The WorkerPool is getting into a state where allIdle is False, and is not leaving it, perhaps due to an earlier STM deadlock. (There seem to be two different ones.)
Also, I notice with --json-error-messages:
So the thread that actually gets to run on the key is somehow reaching a STM deadlock.
Which made me wonder if that thread deadlocks on enteringStage. And it seems so. If Command.Get is changed to use commandStages rather than transferStages, the test case succeeds.
Like finishCommandActions, enteringStage has an STM retry if it needs to wait for something to happen to the WorkerPool. So again it looks like the WorkerPool is getting screwed up.
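Both waits boil down to the same pattern, roughly like this (a hypothetical simplification; the real WorkerPool carries more state): block inside STM until the pool reaches the desired shape, retrying otherwise.

```haskell
import Control.Concurrent.STM

data Stage = TransferStage | VerifyStage deriving (Eq, Show)

-- Stand-in for the WorkerPool: a count of idle slots per stage.
type Pool = TVar [(Stage, Int)]

-- Block until an idle slot for the stage is free, then claim it.
-- If no other thread ever frees a matching slot, this retry is
-- where "blocked indefinitely in an STM transaction" surfaces.
claimSlot :: Pool -> Stage -> STM ()
claimSlot poolv stage = do
    pool <- readTVar poolv
    case lookup stage pool of
        Just n | n > 0 -> writeTVar poolv
            [ (s, if s == stage then c - 1 else c) | (s, c) <- pool ]
        _ -> retry
```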
Not sure if feasible, but maybe a catch-all deadlock breaker could be implemented to mask this and other deadlocks?
The moon landing software had something like this, and it worked pretty well...
Added tracing of changes to the WorkerPool.
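(Roughly along these lines; a hypothetical helper, not necessarily the actual instrumentation: every change to the pool's TVar goes through a wrapper that logs the resulting state.)

```haskell
import Control.Concurrent.STM
import Debug.Trace (traceIO)

-- Apply a change to the pool's TVar and log the new state, so
-- every transition shows up interleaved with the workers' output.
traceChange :: Show pool => String -> TVar pool -> (pool -> pool) -> IO ()
traceChange label poolv f = do
    new <- atomically $ do
        p <- f <$> readTVar poolv
        writeTVar poolv p
        return p
    traceIO (label ++ ": " ++ show new)
```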
Transfer starts for file 1
Transfer complete, verifying starts.
This second thread is being started to process file 2. It starts in TransferStage, but it will be blocked from doing anything by ensureOnlyActionOn.
All files have threads to process them started, so finishCommandActions starts up. It will retry since the threads are still running.
The first thread is done with verification, and the stage is being restored to transfer.
The 0 means that there are 0 spareVals. Normally, the number of spareVals should be the same as the number of IdleWorkers, so it should be 1. It's 0 because the thread is in the process of changing between stages.
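Schematically, that transient window looks like this (hypothetical sketch; the names are stand-ins): the spare value is taken out in one transaction and put back in a later one, so a reader between the two sees spareVals one short of IdleWorkers.

```haskell
import Control.Concurrent.STM

changeStage :: TVar [val] -> IO () -> IO ()
changeStage sparev work = do
    v <- atomically $ do
        vs <- readTVar sparev
        case vs of
            (x:rest) -> writeTVar sparev rest >> return x
            []       -> retry
    work  -- during this window spareVals is one short
    atomically $ modifyTVar' sparev (v :)
```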
The thread should at this point be waiting for an idle TransferStage slot to become available. The second thread still holds that slot. It seems that wait never completes, because a trace I had after that wait never got printed.
It retries again, because of the active worker and also because spareVals is not the same as IdleWorkers.
Deadlock.
Looks like that second thread that got into transfer stage never leaves it, and then the first thread, which wants to restore back to transfer stage, is left waiting forever for it. And so is finishCommandActions.
Aha! The second thread is in fact still in ensureOnlyActionOn. So it's waiting on the first thread to finish. But the first thread can't transition back to TransferStage because the second thread has stolen it.
Now it makes sense.
So... One way to fix this would be to add a new stage, used for threads that are just starting. Then the second thread would be in StartStage, and the first thread would not be prevented from transitioning back to TransferStage. Would need to make sure that, once a thread leaves StartStage, it never transitions back to it.
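A hypothetical sketch of the shape that could take (these constructor names are assumptions, not the actual git-annex types):

```haskell
-- A dedicated stage for freshly spawned threads, so a thread parked
-- in ensureOnlyActionOn does not hold a TransferStage slot that
-- another thread needs back.
data WorkerStage
    = StartStage     -- new: thread spawned, not yet doing real work
    | TransferStage
    | VerifyStage
    | CleanupStage
    deriving (Eq, Show)

-- A thread may leave StartStage, but must never return to it.
validTransition :: WorkerStage -> WorkerStage -> Bool
validTransition _ StartStage = False
validTransition _ _          = True
```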