Please describe the problem.
Originally reported in [datalad issue #2988](https://github.com/datalad/datalad/issues/2988), where more details can be found.
While trying to `annex get` on a local cluster, over ssh without an ssh-agent, I was prompted for a password multiple times, although the previous version didn't do that:
[[!format sh """
[yhalchen@discovery7 QA]$ git annex version | head -n1 ; git annex get -J2 sub-qa/
git-annex version: 6.20180926-gc906aaf
get sub-qa/ses-20161128/dwi/sub-qa_ses-20161128_acq-DTIX30Xp2Xs4_dwi.nii.gz get sub-qa/ses-20161128/dwi/sub-qa_ses-20161128_acq-DTIX30Xp2_dwi.nii.gz (from origin...)
(from origin...)
yoh@falkor.datalad.org's password: yoh@falkor.datalad.org's password:
"""]]
and when I try a recent build of git-annex "transplanted" from a NeuroDebian .deb package, I observe
[[!format sh """
[yhalchen@discovery7 QA]$ export PATH=/ihome/yhalchen/7.20181105+git22-g4c7236c58/usr/lib/git-annex.linux:$PATH
[yhalchen@discovery7 QA]$ git annex version | head -n1 ; git annex get -J2 sub-qa/
git-annex version: 7.20181105+git22-g4c7236c58-1~ndall+1
get sub-qa/ses-20161128/dwi/sub-qa_ses-20161128_acq-DTIX30Xp2Xs4_dwi.nii.gz get sub-qa/ses-20161128/dwi/sub-qa_ses-20161128_acq-DTIX30Xp2_dwi.nii.gz (transfer already in progress, or unable to take transfer lock)
Unable to access these remotes: origin
Try making some of these repositories available:
	6384a551-a41d-4290-b186-9258befede97 -- bids@rolando:/inbox/BIDS/dbic/QA
	7d9ed214-3e5f-4cc8-ac88-f397145b2d4c -- yoh@falkor:/srv/datasets.datalad.org/www/dbic/QA [origin]
	ba8f2cea-f229-422c-82be-6580e5e07ed5 -- yoh@smaug:/mnt/datasets/datalad/crawl/dbic/QA
failed
get sub-qa/ses-20161128/func/sub-qa_ses-20161128_task-rest_acq-p2Xs4X35mm_bold.nii.gz (from origin...)
(from origin...)
thread blocked indefinitely in an MVar operation
thread blocked indefinitely in an STM transaction
Unable to access these remotes: origin
Unable to access these remotes: origin
Try making some of these repositories available:
	6384a551-a41d-4290-b186-9258befede97 -- bids@rolando:/inbox/BIDS/dbic/QA
	7d9ed214-3e5f-4cc8-ac88-f397145b2d4c -- yoh@falkor:/srv/datasets.datalad.org/www/dbic/QA [origin]
	ba8f2cea-f229-422c-82be-6580e5e07ed5 -- yoh@smaug:/mnt/datasets/datalad/crawl/dbic/QA
failed
Try making some of these repositories available:
	6384a551-a41d-4290-b186-9258befede97 -- bids@rolando:/inbox/BIDS/dbic/QA
	7d9ed214-3e5f-4cc8-ac88-f397145b2d4c -- yoh@falkor:/srv/datasets.datalad.org/www/dbic/QA [origin]
	ba8f2cea-f229-422c-82be-6580e5e07ed5 -- yoh@smaug:/mnt/datasets/datalad/crawl/dbic/QA
failed
git-annex: thread blocked indefinitely in an STM transaction
"""]]
so it smells like recent fixes may have caused some other locking problems.

This is on an NFS-mounted partition, but the behavior remains the same when run under /tmp.

Running a significantly more outdated version I found lying around (6.20170815+gitg22da64d0f-1~ndall+1) shows a single password prompt.
This looks like a known issue: `get -J` cannot be used with password-based authentication. Try without the `-J2`.
@andrew, the bug you linked to is only about multiple password prompting when using `GIT_SSH`, which I don't think is in use here. Anyway, that bug did not involve MVar/STM deadlocks, so this is not a duplicate.
It certainly seems possible that 6ecd55a9fa4c64399b5ede3944bd2aa5f78cf0b1 could have introduced a locking problem causing such a crash, assuming you are running that version (I can't tell from your version string, since the git rev is not one from the git-annex repo).
As Andrew noted, the multiple ssh password prompt issue is expected if you are using `GIT_SSH` or `GIT_SSH_COMMAND`. From the discussion in https://github.com/datalad/datalad/issues/2988 it seems like you might be?

The fact that it says "(transfer already in progress, or unable to take transfer lock)" strongly suggests that you have annex.pidlock set locally. It would be good to know about this or other configuration.
I have not had any luck reproducing the crash so far. I have tried setting `GIT_SSH`, `GIT_ANNEX_USE_GIT_SSH`, and `annex.pidlock`.

I don't see anything in 6ecd55a9fa4c64399b5ede3944bd2aa5f78cf0b1 that could cause an STM deadlock; it uses a TVar and never retries.
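For context on why that rules it out: GHC only reports "blocked indefinitely in an STM transaction" for a transaction that calls `retry` (directly, or via combinators like `check`). A minimal contrast, with illustrative names, not the code in question:

[[!format haskell """
import Control.Concurrent.STM

-- plain reads and writes always complete; this can never deadlock
bump :: TVar Int -> IO ()
bump v = atomically $ modifyTVar' v (+ 1)

-- this blocks (and can be reported as an indefinite block) until some
-- other thread makes the TVar non-zero
waitNonZero :: TVar Int -> IO ()
waitNonZero v = atomically $ readTVar v >>= \n -> check (n /= 0)
"""]]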
The patch for my "custom" version is now (if it was not before) sent out -- it needs adjustment in the standalone patch.
In a comment on that issue I call `git annex` directly, so I think I did not have any custom env settings set... Moreover, as I commented further, nothing changed when I set `GIT_ANNEX_USE_GIT_SSH=1`, which I guess should have counteracted that?

I have now tried with `6.20170815+gitg22da64d0f-1~ndall+1` -- it provides a single prompt. Here is the output again, now also checking for SSH variables:

smth like?
I also no longer see that "transfer already in progress..." -- maybe I changed the config since then:
FWIW, I have now run `git annex test` with TMPDIR on that partition, with … as the result. Here is the full log -- maybe it will give you some hints.
FWIW, running under /tmp with TMPDIR there -- `All 293 tests passed` -- so the aforementioned test failures are FS-related.

Another (complete) attempt at `get` under `/tmp`, which is less noisy but still failing due to locking:

[[!format sh """
[yhalchen@discovery7 tmp]$ git clone yoh@falkor.datalad.org:/srv/datasets.datalad.org/www/dbic/QA/
Cloning into 'QA'...
yoh@falkor.datalad.org's password:
Permission denied, please try again.
yoh@falkor.datalad.org's password:
remote: Counting objects: 31315, done.
remote: Compressing objects: 100% (10068/10068), done.
remote: Total 31315 (delta 15553), reused 26858 (delta 14185)
Receiving objects: 100% (31315/31315), 31.02 MiB | 43.04 MiB/s, done.
Resolving deltas: 100% (15553/15553), done.

[yhalchen@discovery7 tmp]$ cd QA
[yhalchen@discovery7 QA]$ git annex get -J2 sub-*
(merging origin/git-annex into git-annex...)
(recording state in git...)
git-annex: thread blocked indefinitely in an STM transaction

[yhalchen@discovery7 QA]$ export | grep SSH
declare -x SSH_CLIENT="10.31.184.44 54944 22"
declare -x SSH_CONNECTION="10.31.184.44 54944 129.170.232.78 22"
declare -x SSH_TTY="/dev/pts/6"
[yhalchen@discovery7 QA]$ export | grep GIT
[yhalchen@discovery7 QA]$ git annex version | head -n 1
git-annex version: 7.20181105+git22-g4c7236c58-1~ndall+1

[yhalchen@discovery7 QA]$ git annex get --debug -J2 sub-*
[2018-11-12 12:25:05.499642416] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","symbolic-ref","-q","HEAD"]
[2018-11-12 12:25:05.515163636] process done ExitSuccess
[2018-11-12 12:25:05.515289739] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","refs/heads/master"]
[2018-11-12 12:25:05.5303119] process done ExitSuccess
[2018-11-12 12:25:05.530496533] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","ls-files","--cached","-z","--","sub-amit","sub-emmet","sub-qa","sub-sid000143"]
[2018-11-12 12:25:05.547049728] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","git-annex"]
[2018-11-12 12:25:05.561799248] process done ExitSuccess
[2018-11-12 12:25:05.561984198] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","--hash","refs/heads/git-annex"]
[2018-11-12 12:25:05.576759502] process done ExitSuccess
[2018-11-12 12:25:05.577215967] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..b984c211b3b15db79921d77d1c30cf4b06140d0e","--pretty=%H","-n1"]
[2018-11-12 12:25:05.59389018] process done ExitSuccess
[2018-11-12 12:25:05.594415045] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2018-11-12 12:25:05.596445232] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2018-11-12 12:25:05.612916] read: ssh ["-o","BatchMode=true","-S",".git/annex/ssh/yoh@falkor.datalad.org","-o","ControlMaster=auto","-o","ControlPersist=yes","-n","-T","yoh@falkor.datalad.org","true"]
[2018-11-12 12:25:05.712832294] process done ExitFailure 255
git-annex: thread blocked indefinitely in an STM transaction
"""]]
So sshing without letting it prompt for a password has failed. The next thing git-annex should do at that point is take a lock to prevent concurrent prompts (which does not use STM), ask concurrent-output to hide any regions from the display (which does use STM), and run ssh again to prompt.
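A minimal sketch of that sequence, with illustrative names (`promptingSsh` and `promptLock` are stand-ins, not git-annex's actual identifiers):

[[!format haskell """
import Control.Concurrent.MVar (MVar, withMVar)
import System.Process (callProcess)

-- serialize password prompts with an MVar (no STM involved), then
-- re-run ssh so it can prompt on the controlling terminal
promptingSsh :: MVar () -> [String] -> IO ()
promptingSsh promptLock sshArgs =
    withMVar promptLock $ \_ ->
        -- the real code also asks concurrent-output to hide its display
        -- regions here (an STM operation) before ssh writes to the tty
        callProcess "ssh" sshArgs
"""]]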
It's apparently crashing before it can run ssh again. (Although I'm not really sure if it is; it's also possible it overwrote some of the debug output while trying to hide console regions. Please capture a complete typescript.)
It could be that concurrent-output's waitDisplayChange is buggy, with some kind of race. But this does not seem to be behaving like a race condition, does it? It happens every time?
Could you give me a hint on "how"?
Initially I didn't spot the question: yes.
I don't think this bug will be able to be tracked down without rebuilding git-annex one way or another.
I could use the technique in https://www.fpcomplete.com/blog/2018/05/pinpointing-deadlocks-in-haskell to instrument every possible place in git-annex that could deadlock. This would be a lot of work, probably hundreds of lines of code changes.
It seems much easier to either find a way to reproduce the problem outside of this one machine, or to bisect it.
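For reference, the instrumentation technique in that post boils down to wrapping every blocking call so that a stuck one reports its call site. A minimal sketch, with illustrative names (this is not git-annex code):

[[!format haskell """
import Control.Concurrent.MVar (MVar, takeMVar)
import GHC.Stack (HasCallStack, callStack, prettyCallStack)
import System.Timeout (timeout)

-- if the take does not complete within 10 seconds, report the call
-- site of the suspected deadlock, then resume waiting
takeMVarTraced :: HasCallStack => MVar a -> IO a
takeMVarTraced m = do
    r <- timeout 10000000 (takeMVar m)
    case r of
        Just v  -> return v
        Nothing -> do
            putStrLn ("possible deadlock at:\n" ++ prettyCallStack callStack)
            takeMVar m
"""]]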
About to start bisection for the locking issue. Relevant datalad PR, to get a helper script and a Singularity environment to build: https://github.com/datalad/datalad/pull/2995
As for the double password prompt -- should I bisect that too? That one seems not that easy, since if `annex get` is invoked within `script`, the typescript records nothing.

872af2b2f1049e4eecf274ac70caf99a367f3818 allows ruling out a concurrent-output bug. If git-annex with that commit is run with --quiet and still exhibits the problem, it shouldn't involve concurrent-output at all.
woohoo --

[[!format sh """
git bisect run ~/bisect-git-annex ~/git-annex-dev.img ~/bisect-git-annex-lock.sh
"""]]

(on `git bisect start HEAD 6.20180926`, with HEAD being 05bfce7ca8e67612c66e2cdb3c7c0c26fd8c5243) finished with a groundbreaking discovery:

so it is all destructive Ilya's comments!
That is the commit right after 6.20180926, which I chose as "good" since the known-good one was 6.20180926-gc906aaf (installed via a conda build, pl5.22.2.1_0).

So maybe the problem is not in git-annex per se but in its build dependencies, which changed in buster (I've been building standalone builds with buster for a while). I will try to do a manual comparison across a number of builds from datalad.
OK, not sure yet if it is a git-annex change, a ghc change, or both. Here is the point where things changed radically:

going from the 6.20180519+gitgf6f199be3 build, which used ghc 8.0.1-17+b1 and where we had a nice single ssh password prompt:

to the next build I have, 6.20180521+gitg0e3bebd71, which used ghc 8.0.2-11 and with which the locking madness starts to kick in:

I hope that gives enough clues, Joey?

FWIW 1, it was a switch from ghc 8.0.1-17+b1 because that was when (I guess) I switched from a Debian stable (stretch) to a testing (buster) build environment, since build dependencies were no longer satisfiable.

FWIW 2, I have now tested a build on Debian unstable with ghc 8.4.3+dfsg1-4 -- the same issue with locks.
Not sure how I was blind to the fact that `git annex version` outputs dependency information. So here are two nearby versions: this one is good (no lock issues, but double password prompt):

and here is a bad one (lock issues):
I don't see anything in ghc 8.0.2's release notes that points to a breaking change to STM/MVar handling. There were a couple of changes in that version that involve concurrency and MVars, though, that could somehow have led to the problem:

* "Parallel programs should be significantly more reliable on platforms with weak memory consistency guarantees" -- https://ghc.haskell.org/trac/ghc/ticket/12469
* Scheduling bug with forkOS + MVar -- https://ghc.haskell.org/trac/ghc/ticket/12419
One thing that seems worth trying, re the second of those, is to edit CmdLine/Action.hs and delete the setNumCapabilities line.
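For context, `setNumCapabilities` (from `Control.Concurrent`) is the standard GHC API for changing the number of OS threads the runtime uses. A minimal illustration of the kind of call being removed -- not git-annex's exact code:

[[!format haskell """
import Control.Concurrent (setNumCapabilities)

-- a -J2 run would bump the runtime to two capabilities roughly like
-- this; deleting such a call leaves the +RTS -N default in effect,
-- which would sidestep a scheduler bug in that code path
main :: IO ()
main = setNumCapabilities 2
"""]]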
A compiler bug is looking not entirely unlikely..
I've added a DebugLocks build flag to try to track down the source of the deadlock. It's not enabled by default, so you'll need to e.g. modify the Makefile to set it.
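A guess at the shape of such a helper (the real `debugLocks` is in the git-annex source; this is only a sketch of the idea): catch GHC's deadlock exceptions and print the call site before rethrowing, so line number information survives into the crash output.

[[!format haskell """
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Exception
import GHC.Stack (HasCallStack, callStack, prettyCallStack)

debugLocks :: forall a. HasCallStack => IO a -> IO a
debugLocks act = act `catches`
    [ Handler (\(e :: BlockedIndefinitelyOnMVar) -> report e)
    , Handler (\(e :: BlockedIndefinitelyOnSTM) -> report e)
    ]
  where
    -- print where the blocked operation was wrapped, then rethrow
    report :: Exception e => e -> IO a
    report e = do
        putStrLn ("deadlock detected at:\n" ++ prettyCallStack callStack)
        throwIO e
"""]]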
Calls to `debugLocks` are scattered around in several of the places I suspect may be involved: around ssh prompting, transfer locks, and general lock files. If one of them is, it will display line number information when the deadlock happens. It will probably take several iterations of adding more calls to `debugLocks` to narrow in on the code that is involved in the deadlock.

It seems fixed (within 05bfce7ca..900bf3436)! Bisecting now which change possibly fixed it.

My comment in the concurrent-output commit was that I didn't think git-annex could be affected, since it can only call waitDisplayChange once at a time. That analysis still looks right to me; Messages.prompt takes the promptLock before calling it.
So I want to make very sure that fix did fix it, and it's not masked by some other change to the build environment, or by the problem being more intermittent than you thought it was. It would be good if you can revert the concurrent-output change, reproduce, re-introduce the change, and verify it fixes it.
Also it would be good to build git-annex with the DebugLocks build flag described above, and the unfixed concurrent-output, and verify that the debug info when it crashes shows it crashed in Messages.hs or concurrent-output.
FWIW, regarding

[[!format sh """
git bisect start 900bf3436 05bfce7ca^
git bisect run bash -c '! ../bisect-git-annex ../git-annex-dev.img ../bisect-git-annex-lock.sh'
"""]]

so the build environment (Singularity image) remained the same; the bisect script was:

and it did point to that commit... I can rerun it again for the geeks of it whenever I have another guaranteed hour online (processes time out on that host, so I can't reliably run screen).
`git annex test` output on that box/that version: `85 out of 293 tests failed`.
Bisecting git-annex couldn't isolate the problem to any commit if the bug was fixed in concurrent-output. What commit did bisect point to?
Ok, did:
Ok, so we know for sure it's crashing in Messages.prompt, good.
I see you mean that my commit to git-annex that avoids using concurrent-output in some circumstances "fixed" it. I'm not sure how your bisection environment avoids those circumstances; it does not seem to be passing --quiet or --json to git-annex. Perhaps piping the git-annex output causes it to decide it's not on a tty and so avoid using concurrent-output. Perhaps it's a locale setting. Doesn't really matter..
Except, it seems we have still not verified that the actual source of the crash in concurrent-output is really fixed. That needs building with concurrent-output 1.10.8 and then trying to reproduce the crash at the command line.
As of 451145738267b5858132554da2367a78c145b610 git-annex depends on the hopefully fixed concurrent-output, so if you build that version and it works at the command line, we can know for sure.
You'll need to run `cabal update; cabal install --only-dependencies` in the git-annex source dir to get the fixed concurrent-output installed before building git-annex.

Interesting that one side of the deadlock is an MVar and the other side is STM:
This seems consistent with a single call of `waitDisplayChange` being involved. That's the STM side. The MVar side is the `takeMVar` call in `prompt`.

So somehow `waitDisplayChange` is blocking, and of course the `takeMVar` also blocks waiting for it. But with only a single caller of `waitDisplayChange`, my bug fix in concurrent-output can't have been the actual fix. It has to block with a single caller.

Oh, what if `waitDisplayChange` is called when there are no display changes pending, because there are no console regions being displayed? I think it would still block then!

In a bench test I've at least verified that it hangs in that situation, though I have not triggered ghc's deadlock detection. Which is good enough for me: I know I need to fix this problem in concurrent-output, and it also seems my earlier fix to git-annex is enough to avoid triggering the underlying bug, so git-annex won't need to depend on the fixed concurrent-output.
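A sketch of why that blocks, assuming a display-change counter like the one concurrent-output keeps internally (the names are illustrative, not the library's API): if no console region is ever repainted, the counter never moves and the transaction retries forever, which is exactly what GHC reports as an indefinite STM block.

[[!format haskell """
import Control.Concurrent.STM

-- illustrative stand-in for waitDisplayChange: wait until the display
-- "version" differs from the one we started with
waitChange :: TVar Int -> IO ()
waitChange counter = do
    before <- readTVarIO counter
    atomically $ do
        now <- readTVar counter
        check (now /= before)  -- with no regions displayed, never fires
"""]]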
Fixed the underlying bug in concurrent-output. No more changes in git-annex are needed for this, and I'm confident enough that I've gotten to the root cause now.
Note that you can open a new bug report about the test suite failures on that system, if it's really a bug and not some NFS problem or whatever.