We're experiencing a zombie outbreak:
user 25976 0.0 0.0 6808 1240 ? S 17:17 0:00 \_ /opt/git-annex.linux/exe/git --library-path /opt/git-annex.linux//lib/x86_64-linux-gnu: /opt/git-annex.linux/shimmed/git/git annex get --jobs 4 --batch -z
user 26001 88.4 0.2 1076112312 41728 ? Sl 17:17 4:09 \_ /opt/git-annex.linux/exe/git-annex --library-path /opt/git-annex.linux//lib/x86_64-linux-gnu: /opt/git-annex.linux/shimmed/git-annex/git-annex get --jobs 4 --batch -z
user 26278 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
user 26279 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
user 26474 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
user 26475 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
user 26735 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
user 26737 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
...
There are over 1229 of them, and the count is climbing as I type.
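For reference, a quick way to count them without scrolling through ps output (the parent pid below is just the one from the listing above):

    # Count every defunct process; the [d] keeps grep from matching itself.
    ps aux | grep -c '[d]efunct'

    # Or count only the zombie children of the git-annex process from the listing above.
    ps --ppid 26001 -o stat= | grep -c Z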
This is git-annex version 8.20210331-g1fb59a63a.
I've now stopped and restarted the git-annex get, this time without --jobs 4. It appears to have two zombie git processes, but the count is not increasing as it gets more objects.
I'm unsure how to debug what is wrong, so I'm seeking guidance.
Tomorrow I'll start to see if I can reproduce with older/newer versions of git-annex.
A sync --content operation stalled numerous times yesterday because of errors related to too many git processes. This was with -J2 and many small files on Mac OS, using the latest git-annex version from Homebrew.

Pass --debug to git-annex; it will output the git commands it runs along with their pids in [brackets]. Correlate that with the pid of a zombie and tell me what the specific git command is that has the problem.
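For example (the log file name and the pid being grepped for are only illustrative; --debug and --jobs are standard git-annex options):

    # Capture the debug output, which records each git command git-annex spawns along with its pid.
    git annex get --jobs 4 --debug . 2>debug.log

    # In another terminal, list the pids of any zombie processes...
    ps aux | awk '$8 ~ /Z/ {print $2}'

    # ...and look one of those pids up in the log to see which git command it belonged to.
    grep 26278 debug.log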
Using 8.20210223-1~ndall+1;
This is the result of my test framework at the moment, which aims to determine when this issue appeared:
I can add more versions to check if requested.
If you want to completely narrow it down to a commit, use git bisect between 4262ba3c4 and 5e5829a8d. With 400 commits in that range it should take fewer than 9 rebuilds.

Do you have annex.retry or similar config set? That range includes a change that spins up a bunch of child git-annex processes when that is set, which seems like the kind of thing that could be relevant.
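A rough sketch of how that bisection could be driven, with a hypothetical check script (called check-zombies.sh here) that exits non-zero when zombies accumulate:

    # Known-bad commit first, known-good commit second; git then drives the search.
    git bisect start 5e5829a8d 4262ba3c4
    git bisect run ./check-zombies.sh
    git bisect reset

The check script itself might look something like this; the build command, repository path, and two-minute wait are all assumptions, not a tested recipe:

    #!/bin/sh
    # check-zombies.sh: build the checked-out revision and see whether zombies pile up.
    stack build || exit 125                  # exit 125 tells git bisect to skip unbuildable commits
    GIT_ANNEX=$(stack exec -- which git-annex)
    cd "$HOME/zombie-test-repo" || exit 125  # a clone known to reproduce the problem
    "$GIT_ANNEX" get -J5 . >/dev/null 2>&1 &
    pid=$!
    sleep 120                                # give zombies time to accumulate
    zombies=$(ps --ppid "$pid" -o stat= | grep -c Z || true)
    kill "$pid" 2>/dev/null
    echo "zombies: $zombies"
    [ "$zombies" -eq 0 ]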
Since git cat-file is usually started up once and left running for the lifetime of git-annex (and waited for at the end if it exits normally), it's hard to see how it could end up being run repeatedly otherwise.
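(For context, the long-running process in question is git cat-file's batch mode: it reads object names on stdin and keeps answering until stdin closes, so git-annex only needs to start it once. A minimal illustration, assuming the repository has a README at HEAD:)

    # One request per line on stdin; one "<sha> <type> <size>" header plus the contents per reply.
    printf 'HEAD:README\n' | git cat-file --batch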
I found this issue, which was not marked done. I have run into it again: the first time, I could not even ssh to that host, because ~4000 [git] <defunct> processes prevented any other process from starting under the host's policy. I updated git-annex to the standalone build from conda (8.20211012-geb95ed486) and repeated it (on http://datasets.datalad.org/labs/hasson/narratives/derivatives/fmriprep/.git ) with a plain git annex get -J5. I had to interrupt it when I saw it had reached 1k of those. Here is a dump of the annex config variables (the relevant settings can be re-checked as sketched below), so: no retries.

I wish Nick_P had shared the script -- I could try to bisect... otherwise someone needs to write one.
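A quick way to re-check those settings (the dump itself isn't reproduced here):

    # List every annex.* setting git sees for this repository.
    git config --get-regexp '^annex\.'

    # Specifically confirm nothing retry-related is configured.
    git config --get-regexp '^annex\.retry'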
If I bisected it right (I didn't double-check manually), it is due to 8.20210127-107-gdd39e9e25, AKA 8.20210223~41.
bisection script
I have not managed to reproduce it using that script. I let it run for 5 minutes (and 1 GB transferred).
Since that commit points to stall detection, I hacked the code to constantly detect false stalls. But I am still not seeing any zombie git processes with that.
If stall detection is involved, I'd expect that you would see "Transfer seems to have stalled" when reproducing the bug.
Since stall detection could depend on available bandwidth etc, I wonder if the script reproduces the bug reliably enough for the bisection to be correct. It would be helpful to manually verify the bisection result, with a longer test period than the script used. And look for the above message when reproducing it.
I found a case where zombie git processes could be started in theory, but only when git-annex is run without -J, and only a few zombies, I think. And I couldn't find a code path where it actually happened, so it's not the same as this bug. But it did involve setConcurrency, which the bisected commit also involves (via forkState), so it at least shows how that could cause such a problem in theory. Fixed that.
Reproduced the bug. Took the laptop entirely offline, then ran git-annex get -J5 in the repo @yoh provided above. Sometimes it happens immediately; sometimes it takes a few minutes of waiting for DNS timeouts. It looks like one git cat-file is leaked each time a download fails in that situation.
Should be able to take it from here.