Please describe the problem.
In a daily testing of git-annex on client systems, on smaug
we observed tests stalling and being eventually killed once in a while consistently at conflict resolution (mixed locked and unlocked file):
step:
(git)smaug:/mnt/datasets/datalad/ci/git-annex-ci-client-jobs[master]builds/2022
$> datalad foreach-dataset --o-s relpath git grep 'Elapsed time: 3600 seconds'
06/result-smaug-720/handle-result.yaml-164-4355efb7-success/result-smaug-720/git-annex.log: conflict resolution (nonannexed symlink): + echo 'Elapsed time: 3600 seconds'
06/result-smaug-720/handle-result.yaml-164-4355efb7-success/result-smaug-720/git-annex.log:Elapsed time: 3600 seconds
11/result-smaug-900/handle-result.yaml-437-402e214e-success/result-smaug-900/git-annex.log: present False: + echo 'Elapsed time: 3600 seconds'
11/result-smaug-900/handle-result.yaml-437-402e214e-success/result-smaug-900/git-annex.log:Elapsed time: 3600 seconds
08/result-smaug-787/handle-result.yaml-276-ca542974-success/result-smaug-787/git-annex.log: conflict resolution (nonannexed symlink): + echo 'Elapsed time: 3600 seconds'
08/result-smaug-787/handle-result.yaml-276-ca542974-success/result-smaug-787/git-annex.log:Elapsed time: 3600 seconds
10/result-smaug-853/handle-result.yaml-373-7e7a9411-success/result-smaug-853/git-annex.log: conflict resolution (mixed locked and unlocked file): + echo 'Elapsed time: 3600 seconds'
10/result-smaug-853/handle-result.yaml-373-7e7a9411-success/result-smaug-853/git-annex.log:Elapsed time: 3600 seconds
10/result-smaug-854/handle-result.yaml-375-d95326d3-success/result-smaug-854/git-annex.log: conflict resolution (nonannexed symlink): + echo 'Elapsed time: 3600 seconds'
10/result-smaug-854/handle-result.yaml-375-d95326d3-success/result-smaug-854/git-annex.log:Elapsed time: 3600 seconds
so far seemed to happen only on smaug among "test clients".
What version of git-annex are you using? On what operating system?
in above grep
the versions are:
(git)smaug:/mnt/datasets/datalad/ci/git-annex-ci-client-jobs[master]builds/2022
$> datalad foreach-dataset --o-s relpath git grep -l 'Elapsed time: 3600 seconds' | xargs grep '^git-annex version:' | sort -n
06/result-smaug-720/handle-result.yaml-164-4355efb7-success/result-smaug-720/git-annex.log:git-annex version: 10.20220525+git47-g4bf796225-1~ndall+1
08/result-smaug-787/handle-result.yaml-276-ca542974-success/result-smaug-787/git-annex.log:git-annex version: 10.20220724+git77-ga24ae0814-1~ndall+1
10/result-smaug-853/handle-result.yaml-373-7e7a9411-success/result-smaug-853/git-annex.log:git-annex version: 10.20220927+git27-g82dab0749-1~ndall+1
10/result-smaug-854/handle-result.yaml-375-d95326d3-success/result-smaug-854/git-annex.log:git-annex version: 10.20220927+git27-g82dab0749-1~ndall+1
11/result-smaug-900/handle-result.yaml-437-402e214e-success/result-smaug-900/git-annex.log:git-annex version: 10.20221103+git9-gad7dfdb37-1~ndall+1
full logs are available on smaug at the paths listed above
I notice that just the init test is taking 346 seconds to run. Then the add test is taking 499 seconds. Those are both tests that do very little work, and here run in 1s or so. The only thing "wrong" with the conflict resolution tests may be that they do 10x as much work.
In the same log, the quickcheck tests, which are entirely CPU bound, ran at normal speed. So probably it's disk IO slowing the other tests down.
Also,
result-smaug-900/handle-result.yaml-437-402e214e-success/result-smaug-900/git-annex.log
times out on another test. So I'm sure the problem is not a specific test stalling.And in
result-smaug-899/handle-result.yaml-435-3e89687a-success/result-smaug-899/git-annex.log
, it succeeds in 856 seconds. That still seems slow, my laptop can run it in 250 seconds with 4 cores (even when using -J16). That makes me think that the IO load (or whatever) on smaug is varying and slowing down the test by different amounts.Smaug has 16 cores. The test suite automatically parallelizes, so there will be 16 worker processes. But each worker process has to itself run the init and add tests, which are prereqs for most other tests. So those tests are being run 16 times. Which just by itself, at the 346 and 499 second run times, will take more than an hour. Also if disk IO is starved, a lot of workers is only going to thrash it more.
So
git-annex test -J1
or so seems like a good idea.It might also be better to not have a single timeout for the whole test suite, but time out if it stalls for long enough without generating new output.
I wondered if running the add and init tests once per worker was causing a diminishing returns at higher -J levels. Since at -J16, each worker runs the add and init tests followed by only 2 or 3 other tests, the time spent on the add and init tests becomes more and more significant.
Timings from my laptop (with 4 cores):
So a small diminishing returns at -J16.
After some improvements:
That won't solve smaug being so overloaded though. If anything, it will just make it get to some other test before it times out..