Please describe the problem.
I have mentioned this issue here and there but I believe I never filed an issue for it. ATM I have a process running which does have a good number (see below). Originally we started to observe this issue after refactoring in datalad (PR4829, but we have not found any possible reason on why it is happening, so decided just to let those batched processes let go without being stuck on them. In that PR testing it was quite reproducible (testing datalad-crawler
extension). In real life it is some % of processes which exhibit the same behavior. E.g. below it is I believe only those handful from over 1k of such processes.
0 REG 0x34 0 226206175 /mnt/scrap/tmp/abcd/testds-fast-full4/sourcedata/.git/annex/journal.lck
1 FIFO 0xc 184736925 pipe
2 REG 0x32 0 111383230 /home/yoh/.tmp/tmpo88ucekd (deleted)
3 a_inode 0xd 0 10388 [eventpoll]
4 a_inode 0xd 0 10388 [timerfd]
5 FIFO 0xc 184711928 pipe
6 FIFO 0xc 184711928 pipe
7 a_inode 0xd 0 10388 [eventfd]
8 FIFO 0xc 184711931 pipe
9 FIFO 0xc 184711931 pipe
10 a_inode 0xd 0 10388 [eventfd]
11 a_inode 0xd 0 10388 [eventpoll]
12 FIFO 0xc 184711936 pipe
13 FIFO 0xc 184711936 pipe
14 a_inode 0xd 0 10388 [eventfd]
15 a_inode 0xd 0 10388 [eventpoll]
16 FIFO 0xc 184711937 pipe
17 FIFO 0xc 184711937 pipe
18 a_inode 0xd 0 10388 [eventfd]
19 a_inode 0xd 0 10388 [eventpoll]
20 FIFO 0xc 184711940 pipe
21 FIFO 0xc 184711940 pipe
22 a_inode 0xd 0 10388 [eventfd]
23 REG 0x34 0 226206176 /mnt/scrap/tmp/abcd/testds-fast-full4/sourcedata/.git/annex/othertmp.lck
24 REG 0x34 1327104 227606027 /mnt/scrap/tmp/abcd/testds-fast-full4/sourcedata/.git/annex/othertmp/jlog3458164-0
25 FIFO 0xc 185127155 pipe
26 FIFO 0xc 184895622 pipe
27 FIFO 0xc 184895623 pipe
28 FIFO 0xc 184733096 pipe
29 FIFO 0xc 184733097 pipe
31 FIFO 0xc 184733098 pipe
cwd DIR 0x34 47782 226206111 /mnt/scrap/tmp/abcd/testds-fast-full4/sourcedata
mem REG 0x2d 110620457 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib64/ld-linux-x86-64.so.2 (path dev=0,50)
mem REG 0x2d 110620869 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex (path dev=0,50)
mem REG 0x2d 110621138 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib/x86_64-linux-gnu/libmagic.so.1 (path dev=0,50)
mem REG 0x2d 110621139 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib/x86_64-linux-gnu/libc.so.6 (path dev=0,50)
mem REG 0x2d 110621140 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib/x86_64-linux-gnu/libgmp.so.10 (path dev=0,50)
mem REG 0x2d 110621145 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib/x86_64-linux-gnu/libm.so.6 (path dev=0,50)
mem REG 0x2d 110621163 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib/x86_64-linux-gnu/libsqlite3.so.0 (path dev=0,50)
mem REG 0x2d 110621175 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib/x86_64-linux-gnu/libbz2.so.1.0 (path dev=0,50)
mem REG 0x2d 110621176 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib/x86_64-linux-gnu/liblzma.so.5 (path dev=0,50)
mem REG 0x2d 110621187 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib/x86_64-linux-gnu/libpthread.so.0 (path dev=0,50)
mem REG 0x2d 110621207 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib/x86_64-linux-gnu/libffi.so.7 (path dev=0,50)
mem REG 0x2d 110621208 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib/x86_64-linux-gnu/libdl.so.2 (path dev=0,50)
mem REG 0x2d 110621211 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib/x86_64-linux-gnu/libz.so.1 (path dev=0,50)
mem REG 0x2d 111381598 /home/yoh/.cache/git-annex/locales/bb375eb6ec0d2706f1723307f068911a/en_US.UTF-8/LC_CTYPE (path dev=0,50)
rtd DIR 0x900 4096 2 /
txt REG 0x32 177928 110620457 /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/lib64/ld-linux-x86-64.so.2
the tail of the log file from above:
$> tail /mnt/scrap/tmp/abcd/testds-fast-full4/sourcedata/.git/annex/othertmp/jlog3458164-0
768_b3b_URL-s129400--s3&c%%NDAR__Central__1%submission__-5b733a46f742db820f7b5552c7c576df.log
f64_54f_URL-s428527--s3&c%%NDAR__Central__1%submission__-4103b9975e7643d71239cf50fd5b767d.log.web
f64_54f_URL-s428527--s3&c%%NDAR__Central__1%submission__-4103b9975e7643d71239cf50fd5b767d.log
9e8_f0f_URL-s428527--s3&c%%NDAR__Central__1%submission__-bc9437b29056960784bab28fb24e1f87.log.web
9e8_f0f_URL-s428527--s3&c%%NDAR__Central__1%submission__-bc9437b29056960784bab28fb24e1f87.log
276_2e8_URL-s425371--s3&c%%NDAR__Central__1%submission__-3012b0043ac8847cf38c2c97a12cfa26.log.web
276_2e8_URL-s425371--s3&c%%NDAR__Central__1%submission__-3012b0043ac8847cf38c2c97a12cfa26.log
dda_614_URL-s425371--s3&c%%NDAR__Central__1%submission__-6d7ea2d5b0d0771f6a44e2e51edef648.log.web
dda_614_URL-s425371--s3&c%%NDAR__Central__1%submission__-6d7ea2d5b0d0771f6a44e2e51edef648.log
b66_7bf_URL-s196404--s3&c%%NDAR__Central__1%submission%
13411 yoh 20 0 17804 6452 1720 S 0.0 0.0 0:13.81 │ ├─ zsh
3136785 yoh 20 0 4176M 4094M 11008 S 21.6 3.2 22:10.12 │ │ └─ python3 datalad-nda/scripts/datalad-nda --pdb add2datalad -i /proc/self/fd/15 -d testds-fast-full4 --fast
3881942 yoh 20 0 8 8 0 R 2.0 0.0 0:00.03 │ │ ├─ python3
3458129 yoh 20 0 6992 3312 2920 S 0.0 0.0 0:00.09 │ │ └─ git --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git/git annex addurl --fast --with-files --json --json-error-messages --batch
3458164 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 22:03.39 │ │ └─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3725766 yoh 20 0 12348 8568 3748 D 0.0 0.0 0:03.74 │ │ ├─ git --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs write-tree
3549190 yoh 20 0 7140 4172 3644 S 0.0 0.0 0:06.23 │ │ ├─ git --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs hash-object -w --stdin-paths --no-filters
3463573 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:00.11 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3462846 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:00.07 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458347 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:00.13 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458216 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 2:12.84 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458215 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:00.20 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458209 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:02.94 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458207 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:30.23 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458181 yoh 20 0 100M 55756 17736 S 0.0 0.0 0:41.01 │ │ ├─ python3 /mnt/scrap/tmp/abcd/datalad/venvs/dev3/bin/git-annex-remote-datalad
3458175 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 5:09.58 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458174 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 3:37.60 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458173 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:00.93 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458172 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:00.91 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458171 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 5:07.11 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458170 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:00.92 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458169 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 4:16.06 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458167 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:49.34 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458166 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:00.93 │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
3458165 yoh 20 0 1.0T 99344 35792 S 0.0 0.1 0:05.84 │ │ └─ git-annex --library-path /home/yoh/.tmp/ga-mgY44yd/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-mgY44yd/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
What version of git-annex are you using? On what operating system?
$> which git-annex
/home/yoh/.tmp/ga-RIFzW89/git-annex.linux//git-annex
$> git-annex version
git-annex version: 8.20200908-gcfc74c2f4
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite S3 WebDAV
dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.26 DAV-1.3.4 feed-1.3.0.1 ghc-8.8.4 http-client-0.6.4.1 persistent-sqlite-2.10.6.2 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.1.0
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP
512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso hook external
operating system: linux x86_64
supported repository versions: 8
upgrade supported from repository versions: 0 1 2 3 4 5 6 7
I had placed that top process into background, and then when I brought it back to see if I could may be strace any of those processes, I ran into https://git-annex.branchable.com/bugs/standalone_runshell_can_race_and_fail_to_remove9612647.cache47git-annex47locales4796dirs/ and although main process and a single annex addurl --batch
was still hanging there, the rest was gone/finished...
Any ideas joey on what could hold them? if needed I could try to give reproducible docker or singularity recipe based on running datalad tests...
That process tree seems to show git-annex with a bunch of child git-annex processes. (Assuming that the first field is the pid, and so those are not threads..)
But git-annex rarely exec's git-annex, and when it does, it's never with the same params as the parent git-annex was run with. And, addurl in this situation normally has some other child processes like
git cat-file --batch
that this one is missing.So it kind of looks like a fork/exec stalled before the exec somehow, I don't know how. strace of a child process in this situation may shed some light on this strange state.
If a fork/exec somehow stalled, it might explain why the jlog file is incompletely written, since the haskell IO manager might have gotten locked up.
It seems at least plasible that the race you then hit is somehow related to it getting into this state. When I fixed that race, I noticed and fixed a condition where runshell would run git-annex with the locales not generated, which would lead to running with the system locales. Which, if there were an incompatibility between them and the bundled libc, might cause some kind of weird behavior.
Or, it could be something like https://gitlab.haskell.org/ghc/ghc/-/issues/4074 But that was fixed in 2010, so unless this git-annex was built with an old haskell library, unlikely that precise bug.
may be that one already exited "peacefully"? I believe I was looking at the moment when we already requested closing that
annex addurl --batch
.I wonder if may be
htop
(I believe I pasted process tree from it) somehow may be groupped incorrectly and those are "independent" hanginggit annex addurl
processes which got "folded"... I don't know..Anyways, to hopefully "shine the light", I have managed to reproduce (with up to date snapshot of annex, didn't get there with systemwide older git-annex) on smaug using initially stalling datalad-crawler tests (so 100% reproducible): just login as
test
user tosmaug
. There is a screen session with one shell there with python's PDB where we were waiting for batched process to exit but it didn't; then 2nd one showsps auxw -H | grep '^test'
. Pleasestrace
or inspect anything you might find pertinent (IIRC, raising debug level for datalad, or disabling code tests coverage collection, was making the stall going away). FTR the command isPATH=~/git-annex/20201005-snapshot/:$PATH DATALAD_LOG_LEVEL=WARNING DATALAD_TESTS_TEMP_KEEP=1 python -m nose -s -v --with-cov --cover-package datalad ../datalad_crawler/dbs/tests/test_files.py ../datalad_crawler/nodes/tests/test_annex.py
and it ends up in pdb (python debugger) if batched annex did not exit although we waited.Looks like this time it was a git-annex info --batch. And none of the child processes look unusual like what you showed before, it's the usual collection of git cat-file.
The git-annex process's stdin is still connected to a pipe, the other end of which datalad still has open for write. So there's no reason for git-annex to exit, it's still waiting for batch requests.
3 seconds? Are you sure this is the same problem? Or that this is a problem at all? git-annex info could take longer than 3 seconds if you ask it for info on a directory, or if it needs to merge the git-annex branch, or maybe in other circumstances.
Getting the feeling the problem may be in your test machinery..
the mystery is that to our knowledge we do close the stdin (and stdout) of git-annex process right before, so pipe should have not remained open, unless something else may be somehow keeps it open... And I cannot reproduce the bug ATM neither locally or on smaug where I did prepare you the reproduced case!
Indeed though I agree that pretty much any timeout could not generally be "sufficient". That is why we need to figure out what is happening/why some times batched processes do not quit. Either the later "reproducer" was exactly the same situation as in the original reported (where it was a few out of 1000s of batched processes) -- I do not know. may be there are multiple issues.
896825 is the process that still had the pipe open for write. It appears to not be the main python process 894971 that started git-annex.
So, my guess would be it inherits the pipe across a fork.
Thank you Joey for the analysis! I am not 100% certain why that nose fork process came about. May be nose (testing framework) causes a fork if some processes are still lingering after a test completes... I guess something for us to keep in mind
Also I - identified the change has caused the issue to disappear while looking through recent datalad changes. Not yet sure if that was anyhow a "fix" but at least I do not observe it any longer - reproduced the issue with elderly 7.20190819+git2-g908476a9b-1~ndall+1, so it confirms that it is not a regression within git-annex (due to some flaky run I thought about possible regression).
I think you can close it, and we (datalad) should keep an eye on nose and real-life if any stalls... so far I only observed datalad abandoning
git-annex
process without waiting long enough (like in the original case I reported here) when it could be just taking longer to complete indeed.