Please describe the problem.
I got a new server for the repronim project (welcome typhon to the family of smaug, drogon, and falkor). I wanted to redo the https://www.datalad.org/test_fs_analysis.html benchmarking, and added a git annex test call to see how its run time would vary across file systems etc. To my surprise, my script never exited.

The current ps tree is (yeah, running as root; I will likely reinstall the thing fully anyways after I am done benchmarking):
root 185696 0.0 0.0 62536 54768 pts/1 S+ Nov15 2:56 python test_fs.py benchmark
root 3655423 0.0 0.0 2484 516 pts/1 S+ Nov18 0:00 /bin/sh -c cd test9; git annex test
root 3655424 0.0 0.0 9728 3340 pts/1 S+ Nov18 0:00 git annex test
root 3655425 0.0 0.0 1076293924 28360 pts/1 Sl+ Nov18 0:00 /usr/bin/git-annex test
root 3696268 0.2 0.0 1074196636 41932 pts/1 Sl+ Nov18 13:14 /usr/bin/git-annex --color=never test
root 3710526 0.0 0.0 1074122856 44736 pts/1 Sl+ Nov18 0:00 /usr/bin/git-annex move --to origin foo
root 3710593 0.0 0.0 0 0 pts/1 Z+ Nov18 0:00 [git] <defunct>
root 3710594 0.0 0.0 0 0 pts/1 Z+ Nov18 0:00 [git] <defunct>
root 3710595 0.0 0.0 0 0 pts/1 Z+ Nov18 0:00 [git] <defunct>
root 3710596 0.0 0.0 0 0 pts/1 Z+ Nov18 0:00 [git] <defunct>
root 3710597 0.0 0.0 2484 516 pts/1 S+ Nov18 0:00 sh -c git-annex test --fakessh -- 'localhost' 'git-annex-shell '"'"'p2pstdio'"'"' '"'"'/mnt/test/test9/.t/75/main0'"'"' '"'"'6129b8ec-34af-47d1-b7e6-3c4fcc0d33e3'"'"' --uuid c129397d-8209-40ea-8347-16a8c3fe69de'
root 3710598 0.0 0.0 1074048960 20408 pts/1 Sl+ Nov18 0:00 git-annex test --fakessh -- localhost git-annex-shell 'p2pstdio' '/mnt/test/test9/.t/75/main0' '6129b8ec-34af-47d1-b7e6-3c4fcc0d33e3' --uuid c129397d-8209-40ea-8347-16a8c3fe69de
root 3710604 0.0 0.0 2484 580 pts/1 S+ Nov18 0:00 /bin/sh -c git-annex-shell 'p2pstdio' '/mnt/test/test9/.t/75/main0' '6129b8ec-34af-47d1-b7e6-3c4fcc0d33e3' --uuid c129397d-8209-40ea-8347-16a8c3fe69de
root 3710605 0.0 0.0 1074122860 18948 pts/1 Sl+ Nov18 2:39 git-annex-shell p2pstdio /mnt/test/test9/.t/75/main0 6129b8ec-34af-47d1-b7e6-3c4fcc0d33e3 --uuid c129397d-8209-40ea-8347-16a8c3fe69de
What version of git-annex are you using? On what operating system?
git-annex version: 10.20221003
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV
dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.26 DAV-1.3.4 feed-1.3.0.1 ghc-8.8.4 http-client-0.6.4.1 persistent-sqlite-2.10.6.2 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.1.0
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
Any ideas? Do you need access to troubleshoot?
Seems like the test it is stuck on is "move (ssh remote)".
Is this replicable?
I wonder if test_fs.py is doing something that causes the problem. There is some complexity in there. Maybe it's happening on one particular FS.

If it happens when not run by test_fs.py, it would be useful to get a debug log, because I think that seeing the P2P protocol dump would help explain what has happened.

There was not a good way to get the test suite to output a full debug log, so I had to implement one. With an updated build of git-annex, you can use:
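(The exact invocation is not preserved here. As an assumption only, supposing the new build simply honours git-annex's usual --debug option for the test command, and narrowing to the stuck test with the tasty -p pattern that git annex test passes through, it would look roughly like this:)

    # Hypothetical invocation; the exact new debug option is not shown above.
    cd test9
    git annex test -p "move (ssh remote)" --debug 2>&1 | tee annex-test-debug.log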
Here's the expected output of that section of the test suite, when it is working properly:
I examined this situation on the machine. I had to ctrl-z the process to get a shell.
Then I checked what the stdin of 3710605 was, and it was still open, and was a pipe from 3710526. So that's why the p2pstdio process was still running; that pipe should be closed when the parent git-annex is done, but the parent was apparently stuck on something.
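(For reference, that kind of check can be done from the shell via /proc; pids as in the tree above, shown only as an illustration, not necessarily the exact commands used:)

    # What is pid 3710605's stdin connected to? A pipe shows up as pipe:[inode].
    ls -l /proc/3710605/fd/0
    # Find the other end by looking for the same pipe inode among the
    # suspected writer's (3710526) open file descriptors.
    ls -l /proc/3710526/fd | grep pipe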
Then I straced 3710605, but it was suspended (oops). So I ran fg. This somehow unstuck everything! The test suite finished up very fast.

Here is what it output for the test that had gotten stuck:
That seems like it must come from here in the test suite:
So it seems that the content was moved back to origin successfully (annexed_present checks that the object file is present before checking the location log), but that the location log didn't get updated. Need to check if that update would have been done by the p2pstdio process or the move process.

Why would a SIGCONT have unstuck it, I wonder?
I have re-run the command to see if the bug replicates..
Note: it took a considerable amount of time (days?) for it to get there. I made a copy, test_fs_testonly.py, where I removed all the other "benchmarks" prior to running annex test -- it might get there faster if it works, so feel free to interrupt and rerun that one. That code is old (circa 2014 ;)) and I should give it some face lift, but I haven't had a chance yet.

The rerun did not get stuck. So it's not something that is directly the fault of the script running git-annex test, I think. Well, it could be a specific filesystem or mount option that the script sets up, for example. Or if it's something of that kind, it's also intermittent.
This will limit the test suite to only running that test:
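(The command itself is not preserved above; presumably it uses the tasty -p pattern option that git annex test passes through to the test suite, something along the lines of:)

    # Run only the test that hung, selected by name with the tasty pattern option.
    git annex test -p "move (ssh remote)"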
I have that repeating locally in a loop, in case it's just something that happens every 1000 test runs or something like that.
(I've patched git-annex to speed up the above command 400%.)
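A loop of that sort might look like the following sketch (matching the description above, not the exact script used):

    # Repeat the narrowed-down test until it fails or hangs; the counter shows
    # how many iterations it took to reproduce.
    i=0
    while git annex test -p "move (ssh remote)"; do
        i=$((i+1))
        echo "=== completed iteration $i ==="
    done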
Very good news! Reproduced the hang on kite, after 500 iterations of the test suite.
strace of the git-annex move process:

Strace of the child:
Which looks like it's blocked on taking a lock of /home/joey/tmp/.t/3/main0/.git/annex/restage.lck.
Oh interesting, the same process has restage.lck open twice.
There are other processes that have the same restage.lck open too. In the process tree below, pid 1710382 does, and so does pid 1710483.
(Strange that pid 1710368 is not a child process of the main git-annex test run. I'm unsure why that happens.)
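One way to see which processes have that restage.lck open, and which of them actually holds the POSIX (fcntl) lock, is via lsof and /proc/locks (path from the strace above; shown only as an illustration):

    # All processes that have the lock file open:
    lsof /home/joey/tmp/.t/3/main0/.git/annex/restage.lck
    # /proc/locks lists every record lock together with the holder's pid;
    # match the file by its inode number.
    stat -c 'inode: %i' /home/joey/tmp/.t/3/main0/.git/annex/restage.lck
    cat /proc/locks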
Stracing 1710483, it is also stuck on fcntl F_SETLKW. This looks like it may be the main deadlock, because the parent is waiting on git update-index, which is waiting on git-annex filter-process, which is stuck waiting to take the lock... which its parent is holding!

(There are also several other process trees, running git-annex test --fakessh, that have gotten deparented from the main test run and are hanging around. Those also have restage.lck open, though in their cases it is a since-deleted file.)

I think this is a bug in restagePointerFiles, which uses streamRestageLog while refreshIndex is running. The restage.log was added 2 months ago, in 6a3bd283b8af53f810982e002e435c0d7c040c59.
I suspect that git-annex filter-process is running populatePointerFile via Command.Smudge.getMoveRaceRecovery, which needs to write to the restage.log. That would explain why the deadlock only happens sometimes, because it involves a race that triggers that code.

Reading the whole restage.log into memory first would probably avoid the bug, but is not ideal.
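To make the shape of the deadlock concrete: the parent takes the lock and then, while still holding it, waits on a child that needs the same lock. The following is not git-annex code, and it uses flock(1) rather than the fcntl F_SETLKW lock git-annex actually takes, but the cycle is the same:

    # Demonstration only: the parent holds an exclusive lock on a file and then
    # waits for a child that tries to take the same lock. The child can never
    # get it; timeout(1) keeps this demo from hanging the way the real test
    # suite did.
    lockfile=$(mktemp)
    (
        flock -x 9
        timeout 3 flock -x "$lockfile" true || echo "child blocked on the lock"
    ) 9>"$lockfile"
    rm -f "$lockfile"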
I've developed a fix. Gonna re-run the reproducer long enough to be sure the bug is squashed..
The retest seemed to hang again, but I'm not sure I was testing the right binary. Running it again.
This patch exposes when the lock is held and so illustrates the bug reliably and quickly:
With that patch, the unfixed git-annex hangs immediately. With my fix, it does not hang, which seems to show I fixed it correctly..
Got a clean test run with the right binary for sure, and I've confirmed this bug is really fixed. Test ran 20k iterations successfully.