projects/datalad/bugs-done/parallel get to the files for the same key would fail with thread blocked indefinitely in an STM transactionyohhttp://git-annex.branchable.com/projects/datalad/bugs-done/parallel_get_to_the_files_for_the_same_key_would_fail_with__thread_blocked_indefinitely_in_an_STM_transaction/git-annexikiwiki2023-01-05T17:30:31Zcomment 1http://git-annex.branchable.com/projects/datalad/bugs-done/parallel_get_to_the_files_for_the_same_key_would_fail_with__thread_blocked_indefinitely_in_an_STM_transaction/comment_1_46420a92dbd8655af9b16349da24d0fc/joey2023-01-05T17:30:31Z2019-11-13T16:34:34Z
<p>Reproduced.</p>
<p>After building git-annex with the DebugLocks flag, I got this:</p>
<pre><code>debugLocks, called at ./Annex/Transfer.hs:248:18 in main:Annex.Transfer
debugLocks, called at ./CmdLine/Action.hs:263:26 in main:CmdLine.Action
</code></pre>
<p>Which points to pickRemote and ensureOnlyActionOn. But pickRemote
does no STM actions when there's only 1 remote, so it must really be
the latter.</p>
<p>Also, I notice that when 5 files to get are provided, it crashes, but with
less than 5, it succeeds.
Even this trivial case crashes: <code>git annex get -J1 1 2</code></p>
comment 2http://git-annex.branchable.com/projects/datalad/bugs-done/parallel_get_to_the_files_for_the_same_key_would_fail_with__thread_blocked_indefinitely_in_an_STM_transaction/comment_2_b1f4dc12ad00a4aa73e5bdc7c0a8f489/joey2023-01-05T17:30:31Z2019-11-13T17:07:29Z
<p>Ok, I see the bug. ensureOnlyActionOn does a STM
retry if it finds in the activekeys map some other thread
is operating on the same key.
But, there is no running STM transaction what will update
the map. So, STM detects that the retry would deadlock.</p>
<p>It's not really a deadlock, because once the other thread finishes,
it will update the map to remove itself. But STM can't know that.
The solution will be to not use STM for waiting on the other thread.</p>
<p>Hmm, I tried the obvious approach, using a MVar semaphore to wait for the
thread, but that just resulted in more STM and MVar deadlocks.</p>
<p>I don't understand why after puzzling over it for two hours. I did
instrument all calls to atomically, and it looks, unfortunately, like
the one in finishCommandActions is deadlocking. If the problem extends
beyond ensureOnlyActionOn it may be much more complicated.</p>
<p>Patch that does not work and I don't know why.</p>
<pre><code>diff --git a/CmdLine/Action.hs b/CmdLine/Action.hs
index 87298a95f..bf4bdd589 100644
--- a/CmdLine/Action.hs
+++ b/CmdLine/Action.hs
@@ -268,16 +268,30 @@ ensureOnlyActionOn k a = debugLocks $
go ConcurrentPerCpu = goconcurrent
goconcurrent = do
tv <- Annex.getState Annex.activekeys
- bracket (setup tv) id (const a)
- setup tv = liftIO $ do
+ bracketIO (setup tv) id (const a)
+ setup tv = do
+ mysem <- newEmptyMVar
mytid <- myThreadId
- atomically $ do
+ finishsetup <- atomically $ do
m <- readTVar tv
case M.lookup k m of
- Just tid
- | tid /= mytid -> retry
- | otherwise -> return $ return ()
+ Just (tid, theirsem)
+ | tid /= mytid -> return $ do
+ -- wait for the other
+ -- thread to finish, and
+ -- retry (STM retry would
+ -- deadlock)
+ readMVar theirsem
+ setup tv
+ | otherwise -> return $
+ -- same thread, so no
+ -- blocking
+ return $ return ()
Nothing -> do
- writeTVar tv $! M.insert k mytid m
- return $ liftIO $ atomically $
- modifyTVar tv $ M.delete k
+ writeTVar tv $! M.insert k (mytid, mysem) m
+ return $ return $ do
+ atomically $ modifyTVar tv $
+ M.delete k
+ -- indicate finished
+ putMVar mysem ()
+ finishsetup
diff --git a/Annex.hs b/Annex.hs
index 9eb4c5f39..936399ae7 100644
--- a/Annex.hs
+++ b/Annex.hs
@@ -143,7 +143,7 @@ data AnnexState = AnnexState
, existinghooks :: M.Map Git.Hook.Hook Bool
, desktopnotify :: DesktopNotify
, workers :: Maybe (TMVar (WorkerPool AnnexState))
- , activekeys :: TVar (M.Map Key ThreadId)
+ , activekeys :: TVar (M.Map Key (ThreadId, MVar ()))
, activeremotes :: MVar (M.Map (Types.Remote.RemoteA Annex) Integer)
, keysdbhandle :: Maybe Keys.DbHandle
, cachedcurrentbranch :: (Maybe (Maybe Git.Branch, Maybe Adjustment))
</code></pre>
comment 3http://git-annex.branchable.com/projects/datalad/bugs-done/parallel_get_to_the_files_for_the_same_key_would_fail_with__thread_blocked_indefinitely_in_an_STM_transaction/comment_3_d8ba7cc5a860e9ccaab32c637cc2a7cd/joey2023-01-05T17:30:31Z2019-11-13T19:07:49Z
<p>Tried going back to c04b2af3e1a8316e7cf640046ad0aa68826650ed,
which is before the separation of perform and cleanup stages.
The same code was in onlyActionOn back then. And the test case does not
crash.</p>
<p>So, that gives a good commit to start a bisection. Which will probably
find the bug was introduced in the separation of perform and cleanup stages,
because that added a lot of STM complexity.</p>
<p>(Have to cherry-pick 018b5b81736a321f3eb9762a2afb7124e19dbdf9
onto those old commits to make them build with current libraries.)</p>
comment 4http://git-annex.branchable.com/projects/datalad/bugs-done/parallel_get_to_the_files_for_the_same_key_would_fail_with__thread_blocked_indefinitely_in_an_STM_transaction/comment_4_eecaa7f7b0279c56902c90ed58d1444f/joey2023-01-05T17:30:31Z2019-11-13T21:22:07Z
<p>Simplified version of patch above, that converts ensureOnlyActionOn to not use
STM at all, and is significantly simpler.</p>
<p>With this patch, the test case still STM deadlocks. So this seems to be
proof that the actual problem is not in ensureOnlyActionOn.</p>
<pre><code>diff --git a/Annex.hs b/Annex.hs
index 9eb4c5f39..9baf7755a 100644
--- a/Annex.hs
+++ b/Annex.hs
@@ -143,7 +143,7 @@ data AnnexState = AnnexState
, existinghooks :: M.Map Git.Hook.Hook Bool
, desktopnotify :: DesktopNotify
, workers :: Maybe (TMVar (WorkerPool AnnexState))
- , activekeys :: TVar (M.Map Key ThreadId)
+ , activekeys :: MVar (M.Map Key (ThreadId, MVar ()))
, activeremotes :: MVar (M.Map (Types.Remote.RemoteA Annex) Integer)
, keysdbhandle :: Maybe Keys.DbHandle
, cachedcurrentbranch :: (Maybe (Maybe Git.Branch, Maybe Adjustment))
@@ -154,7 +154,7 @@ data AnnexState = AnnexState
newState :: GitConfig -> Git.Repo -> IO AnnexState
newState c r = do
emptyactiveremotes <- newMVar M.empty
- emptyactivekeys <- newTVarIO M.empty
+ emptyactivekeys <- newMVar M.empty
o <- newMessageState
sc <- newTMVarIO False
return $ AnnexState
diff --git a/CmdLine/Action.hs b/CmdLine/Action.hs
index 87298a95f..a8c2bd205 100644
--- a/CmdLine/Action.hs
+++ b/CmdLine/Action.hs
@@ -22,7 +22,7 @@ import Remote.List
import Control.Concurrent
import Control.Concurrent.Async
import Control.Concurrent.STM
-import GHC.Conc
+import GHC.Conc (getNumProcessors)
import qualified Data.Map.Strict as M
import qualified System.Console.Regions as Regions
@@ -267,17 +267,22 @@ ensureOnlyActionOn k a = debugLocks $
go (Concurrent _) = goconcurrent
go ConcurrentPerCpu = goconcurrent
goconcurrent = do
- tv <- Annex.getState Annex.activekeys
- bracket (setup tv) id (const a)
- setup tv = liftIO $ do
+ mv <- Annex.getState Annex.activekeys
+ bracketIO (setup mv) id (const a)
+ setup mv = do
mytid <- myThreadId
- atomically $ do
- m <- readTVar tv
- case M.lookup k m of
- Just tid
- | tid /= mytid -> retry
- | otherwise -> return $ return ()
- Nothing -> do
- writeTVar tv $! M.insert k mytid m
- return $ liftIO $ atomically $
- modifyTVar tv $ M.delete k
+ m <- takeMVar mv
+ let ready sem = do
+ putMVar mv $! M.insert k (mytid, sem) m
+ return $ do
+ modifyMVar_ mv $ pure . M.delete k
+ putMVar sem ()
+ case M.lookup k m of
+ Nothing -> ready =<< newEmptyMVar
+ Just (tid, sem)
+ | tid /= mytid -> do
+ takeMVar sem
+ ready sem
+ | otherwise -> do
+ putMVar mv m
+ return noop
</code></pre>
comment 5http://git-annex.branchable.com/projects/datalad/bugs-done/parallel_get_to_the_files_for_the_same_key_would_fail_with__thread_blocked_indefinitely_in_an_STM_transaction/comment_5_052acf169f9c2b8b0233adddabb02559/joey2023-01-05T17:30:31Z2019-11-13T21:42:58Z
<p>finishCommandActions is reaching the retry case, and STM deadlocks there.
The WorkerPool is getting into a state where allIdle is False, and is not
leaving it, perhaps due to an earlier STM deadlock. (There seem to be two
different ones.)</p>
<p>Also, I notice with --json-error-messages:</p>
<pre><code>{"command":"get","note":"from origin...\nchecksum...","success":false,"key":"SHA256E-s524288--07854d2fef297a06ba81685e660c332de36d5d18d546927d30daad6d7fda1541","error-messages":["git-annex: thread blocked indefinitely in an STM transaction"],"file":"1"}
</code></pre>
<p>So the thread that actually gets to run on the key is somehow reaching a
STM deadlock.</p>
<p>Which made me wonder if that thread deadlocks on enteringStage.
And it seems so. If Command.Get is changed to use commandStages
rather than transferStages, the test case succeeds.</p>
<p>Like finishCommandActions, enteringStage has a STM retry if it needs to
wait for something to happen to the WorkerPool. So again it looks like
the WorkerPool is getting screwed up.</p>
catch-all deadlock breakerhttp://git-annex.branchable.com/projects/datalad/bugs-done/parallel_get_to_the_files_for_the_same_key_would_fail_with__thread_blocked_indefinitely_in_an_STM_transaction/comment_6_880c962d5ca77c494c984e7f74725265/Ilya_Shlyakhter2023-01-05T17:30:31Z2019-11-13T22:33:59Z
<p>Not sure if feasible, but maybe a <span class="createlink"><a href="http://git-annex.branchable.com/ikiwiki.cgi?do=create&from=projects%2Fdatalad%2Fbugs-done%2Fparallel_get_to_the_files_for_the_same_key_would_fail_with__thread_blocked_indefinitely_in_an_STM_transaction%2Fcomment_6_880c962d5ca77c494c984e7f74725265&page=todo%2Fmore_extensive_retries_to_mask_transient_failures" rel="nofollow">?</a>catch-all deadlock breaker</span> could be implemented to mask this and other deadlocks?</p>
<p>The moon landings software <a href="https://www.ibiblio.org/apollo/hrst/archive/1033.pdf">had something</a> <a href="https://history.nasa.gov/computers/Ch2-6.html">like this</a>, and it worked <a href="https://www.wsj.com/articles/apollo-11-had-a-hidden-hero-software-11563153001">pretty well</a>...</p>
comment 6http://git-annex.branchable.com/projects/datalad/bugs-done/parallel_get_to_the_files_for_the_same_key_would_fail_with__thread_blocked_indefinitely_in_an_STM_transaction/comment_6_e33df0ec76069deb069b94c944d62c76/joey2023-01-05T17:30:31Z2019-11-14T15:20:13Z
<p>Added tracing of changes to the WorkerPool.</p>
<pre><code>joey@darkstar:/tmp/dst>git annex get -J1 1 2 --json
("initial pool",WorkerPool UsedStages {initialStage = TransferStage, stageSet = fromList [TransferStage,VerifyStage]} [IdleWorker TransferStage,IdleWorker VerifyStage] 2)
("starting worker",WorkerPool UsedStages {initialStage = TransferStage, stageSet = fromList [TransferStage,VerifyStage]} [ActiveWorker TransferStage,IdleWorker VerifyStage] 1)
</code></pre>
<p>Transfer starts for file 1</p>
<pre><code>(("change stage from",TransferStage,"to",VerifyStage),WorkerPool UsedStages {initialStage = TransferStage, stageSet = fromList [TransferStage,VerifyStage]} [IdleWorker TransferStage,ActiveWorker VerifyStage] 1)
</code></pre>
<p>Transfer complete, verifying starts.</p>
<pre><code>("starting worker",WorkerPool UsedStages {initialStage = TransferStage, stageSet = fromList [TransferStage,VerifyStage]} [ActiveWorker TransferStage,ActiveWorker VerifyStage] 0)
</code></pre>
<p>This second thread is being started to process file 2.
It starts in TransferStage, but it will be blocked from doing anything
by ensureOnlyActionOn.</p>
<pre><code>("finishCommandActions starts with",WorkerPool UsedStages {initialStage = TransferStage, stageSet = fromList [TransferStage,VerifyStage]} [ActiveWorker TransferStage,ActiveWorker VerifyStage] 0)
("finishCommandActions observes",WorkerPool UsedStages {initialStage = TransferStage, stageSet = fromList [TransferStage,VerifyStage]} [ActiveWorker TransferStage,ActiveWorker VerifyStage] 0)
</code></pre>
<p>All files have threads to process them started, so finishCommandActions starts up.
It will retry since the threads are still running.</p>
<pre><code>(("change stage from",VerifyStage,"to",TransferStage),WorkerPool UsedStages {initialStage = TransferStage, stageSet = fromList [TransferStage,VerifyStage]} [IdleWorker VerifyStage,ActiveWorker TransferStage] 0)
</code></pre>
<p>The first thread is done with verification, and
the stage is being restored to transfer.</p>
<p>The 0 means that there are 0 spareVals. Normally, the number of spareVals
should be the same as the number of IdleWorkers, so it should be 1.
It's 0 because the thread is in the process of changing between stages.</p>
<p>The thread should at this point be waiting for an idle TransferStage
slot to become available. The second thread still has that active.
It seems that wait never completes, because a trace I had after that wait
never got printed.</p>
<pre><code>("finishCommandActions observes",WorkerPool UsedStages {initialStage = TransferStage, stageSet = fromList [TransferStage,VerifyStage]} [IdleWorker VerifyStage,ActiveWorker TransferStage] 0)
</code></pre>
<p>It retries again, because of the active worker and also because spareVals
is not the same as IdleWorkers.</p>
<pre><code>git-annex: thread blocked indefinitely in an STM transaction
</code></pre>
<p>Deadlock.</p>
<p>Looks like that second thread that got into transfer stage
never leaves it, and then the first thread, which wants to
restore back to transfer stage, is left waiting forever for it. And so is
finishCommandActions.</p>
<p>Aha! The second thread is in fact still in ensureOnlyActionOn.
So it's waiting on the first thread to finish. But the first thread can't
transition back to TransferStage because the second thread has stolen it.</p>
<p>Now it makes sense.</p>
<p>So.. One way to fix this would be to add a new stage, which is used for
threads that are just starting. Then the second thread would be in
StartStage, and the first thread would not be prevented from transitioning
back to TransferStage. Would need to make sure that, once a thread leaves
StartStage, it does not ever transition back to it.</p>