We're experiencing a zombie outbreak:
user 25976 0.0 0.0 6808 1240 ? S 17:17 0:00 \_ /opt/git-annex.linux/exe/git --library-path /opt/git-annex.linux//lib/x86_64-linux-gnu: /opt/git-annex.linux/shimmed/git/git annex get --jobs 4 --batch -z
user 26001 88.4 0.2 1076112312 41728 ? Sl 17:17 4:09 \_ /opt/git-annex.linux/exe/git-annex --library-path /opt/git-annex.linux//lib/x86_64-linux-gnu: /opt/git-annex.linux/shimmed/git-annex/git-annex get --jobs 4 --batch -z
user 26278 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
user 26279 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
user 26474 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
user 26475 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
user 26735 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
user 26737 0.0 0.0 0 0 ? Z 17:17 0:00 \_ [git] <defunct>
...
There are over 1229 of them, and the count is climbing as I type.
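For reference, a quick way to count them without scrolling through ps output (the parent pid below is just the one from the listing above):

    # Count every defunct process; the [d] keeps grep from matching itself.
    ps aux | grep -c '[d]efunct'

    # Or count only the zombie children of the git-annex process from the listing above.
    ps --ppid 26001 -o stat= | grep -c Z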
This is git-annex version 8.20210331-g1fb59a63a.
I've now stopped and restarted the git-annex get, this time without --jobs 4. It appears to have two zombie git processes, but the count is not increasing as it gets more objects.
I'm unsure how to debug what is wrong, so I'm seeking guidance.
Tomorrow I'll start to see if I can reproduce with older/newer versions of git-annex.
A sync --content operation stalled numerous times yesterday because of errors related to too many git processes. This was with -J2 and many small files on Mac OS, using the latest git-annex version from Homebrew.

Pass --debug to git-annex; it will output the git commands it runs along with their pids in [brackets]. Correlate that with the pid of a zombie and tell me what the specific git command is that has the problem.
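For example (the log file name and the pid being grepped for are only illustrative; --debug and --jobs are standard git-annex options):

    # Capture the debug output, which records each git command git-annex spawns along with its pid.
    git annex get --jobs 4 --debug . 2>debug.log

    # In another terminal, list the pids of any zombie processes...
    ps aux | awk '$8 ~ /Z/ {print $2}'

    # ...and look one of those pids up in the log to see which git command it belonged to.
    grep 26278 debug.log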
Using 8.20210223-1~ndall+1;
This is the result of my test framework at the moment, which aims to determine when this issue appeared:
I can add more versions to check if requested.
If you want to completely narrow it down to a commit, use git bisect between 4262ba3c4 and 5e5829a8d. With 400 commits in that range it should take fewer than 9 rebuilds.

Do you have annex.retry or similar config set? That range includes a change that spins up a bunch of child git-annex processes when that is set, which seems like the kind of thing that could be relevant.
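A rough sketch of how that bisection could be driven, with a hypothetical check script (called check-zombies.sh here) that exits non-zero when zombies accumulate:

    # Known-bad commit first, known-good commit second; git then drives the search.
    git bisect start 5e5829a8d 4262ba3c4
    git bisect run ./check-zombies.sh
    git bisect reset

The check script itself might look something like this; the build command, repository path, and two-minute wait are all assumptions, not a tested recipe:

    #!/bin/sh
    # check-zombies.sh: build the checked-out revision and see whether zombies pile up.
    stack build || exit 125                  # exit 125 tells git bisect to skip unbuildable commits
    GIT_ANNEX=$(stack exec -- which git-annex)
    cd "$HOME/zombie-test-repo" || exit 125  # a clone known to reproduce the problem
    "$GIT_ANNEX" get -J5 . >/dev/null 2>&1 &
    pid=$!
    sleep 120                                # give zombies time to accumulate
    zombies=$(ps --ppid "$pid" -o stat= | grep -c Z || true)
    kill "$pid" 2>/dev/null
    echo "zombies: $zombies"
    [ "$zombies" -eq 0 ]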
Since git cat-file is usually started up once and left running for the lifetime of git-annex (and waited for at the end if it exits normally), it's hard to see how it could end up being run repeatedly otherwise.
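(For context, the long-running process in question is git cat-file's batch mode: it reads object names on stdin and keeps answering until stdin closes, so git-annex only needs to start it once. A minimal illustration, assuming the repository has a README at HEAD:)

    # One request per line on stdin; one "<sha> <type> <size>" header plus the contents per reply.
    printf 'HEAD:README\n' | git cat-file --batch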
I found this issue, which was not marked done. I have run into it again: the first time, I could not even ssh to that host, because ~4000 [git] <defunct> processes prevented any other process from starting under the host's policy. I updated git-annex to the standalone build from conda (8.20211012-geb95ed486) and repeated it (on http://datasets.datalad.org/labs/hasson/narratives/derivatives/fmriprep/.git ) with a plain git annex get -J5. I had to interrupt it when I saw it had reached 1k of those. Here is a dump of the annex config variables (the relevant settings can be re-checked as sketched below), so: no retries.

I wish Nick_P had shared the script -- I could try to bisect... otherwise someone needs to write one.
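A quick way to re-check those settings (the dump itself isn't reproduced here):

    # List every annex.* setting git sees for this repository.
    git config --get-regexp '^annex\.'

    # Specifically confirm nothing retry-related is configured.
    git config --get-regexp '^annex\.retry'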
If I bisected it right (I didn't double-check manually), it is due to 8.20210127-107-gdd39e9e25, AKA 8.20210223~41.
bisection script
I have not managed to reproduce it using that script. I let it run for 5 minutes (and 1 GB transferred).
Since that commit points to stall detection, I hacked the code to constantly detect false stalls. But I am still not seeing any zombie git processes with that.
If stall detection is involved, I'd expect that you would see "Transfer seems to have stalled" when reproducing the bug.
Since stall detection could depend on available bandwidth etc, I wonder if the script reproduces the bug reliably enough for the bisection to be correct. It would be helpful to manually verify the bisection result, with a longer test period than the script used. And look for the above message when reproducing it.
I found a case where zombie git processes could be started in theory, but only when git-annex is run without -J, and only a few zombies, I think. And I couldn't find a code path where it actually happened, so it's not the same as this bug. But it did involve setConcurrency, which the bisected commit also involves (via forkState), so it at least shows how that could cause such a problem in theory. Fixed that.
Reproduced the bug. Took the laptop entirely offline, then ran git-annex get -J5 in the repo @yoh provided above. Sometimes it happens immediately; sometimes it takes a few minutes of waiting for DNS timeouts. It looks like one git cat-file is leaked each time a download fails in that situation.
Should be able to take it from here.