speed up "standalone build" and/or tests

As part of the datalad/git-annex builds setup, there is work to enable testing on some target deployments. There is a problem, though (see git-annex/pull/102#issuecomment-1054561966): the overall sweep of git annex test of the standalone build on a target system takes over 5,000 seconds (over an hour!).

Clearly that system is "unique" in its "performance" (most likely due to the nfs4'ed filesystem the tests run under, not due to "Standalone", but I can't change the issue title at this moment). Here is the timing spread we see from other runs on CI:

(git)smaug:/mnt/datasets/datalad/ci/git-annex/builds/2022/02[master]git
$> git grep 'All 9.. tests passed (' | sort -t '(' -n -k 3 | head -n 5
cron-20220222/build-ubuntu.yaml-586-85dd9355-failed/1_test-annex (normal, ubuntu-latest).txt:2022-02-22T02:57:08.4002055Z All 994 tests passed (239.44s)
cron-20220222/build-ubuntu.yaml-586-85dd9355-failed/test-annex (normal, ubuntu-latest)/7_Run tests.txt:2022-02-22T02:57:08.4002052Z All 994 tests passed (239.44s)
pr-102/build-ubuntu.yaml-600-12c9bb39-failed/1_test-annex (normal, ubuntu-latest).txt:2022-02-25T18:47:54.5577129Z All 994 tests passed (240.31s)
pr-102/build-ubuntu.yaml-600-12c9bb39-failed/test-annex (normal, ubuntu-latest)/7_Run tests.txt:2022-02-25T18:47:54.5577125Z All 994 tests passed (240.31s)
cron-20220225/build-ubuntu.yaml-595-85dd9355-failed/1_test-annex (normal, ubuntu-latest).txt:2022-02-25T03:01:16.8138190Z All 994 tests passed (240.87s)

$> git grep 'All 9.. tests passed (' | sort -t '(' -n -k 3 | tail -n 5
cron-20220227/build-macos.yaml-604-85dd9355-failed/test-annex (normal, macos-latest)/8_Run tests.txt:2022-02-27T02:32:31.2444240Z All 994 tests passed (1284.60s)
cron-20220203/build-macos.yaml-578-85dd9355-success/3_test-annex (normal, macos-latest).txt:2022-02-03T02:27:38.5945050Z All 991 tests passed (1292.09s)
cron-20220203/build-macos.yaml-578-85dd9355-success/test-annex (normal, macos-latest)/8_Run tests.txt:2022-02-03T02:27:38.5945040Z All 991 tests passed (1292.09s)
manual-20220222/build-macos.yaml-599-85dd9355-failed/3_test-annex (normal, macos-latest).txt:2022-02-22T18:44:42.5220730Z All 994 tests passed (1340.62s)
manual-20220222/build-macos.yaml-599-85dd9355-failed/test-annex (normal, macos-latest)/8_Run tests.txt:2022-02-22T18:44:42.5220720Z All 994 tests passed (1340.62s)

So we have runs as short as 4 minutes (cron-20220222/build-ubuntu.yaml-586-85dd9355-failed) and as "long" as about 22 minutes (manual-20220222/build-macos.yaml-599-85dd9355-failed).

But at >5k seconds, it made me wonder whether it is worth looking into making operation on such a system a bit more "snappy". FWIW, here are the slowest tests:

[d31548v@discovery7 git-annex]$ grep 's)' tests.out | sort -t '(' -n -k 2 | tail -n 10
    conflict resolution:                                  OK (55.24s)
    adjusted branch merge regression:                     OK (56.63s)
    conflict resolution movein regression:                OK (60.02s)
    transition propagation:                               OK (60.61s)
OK (61.00s)
OK (70.30s)
OK (78.37s)
    union merge regression:                               OK (91.15s)
    transition propagation:                               OK (97.68s)
All 994 tests passed (5576.99s)
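
When the log needs more processing than grep and sort allow, the same ranking can be sketched in Python. This is a minimal sketch, assuming tasty-style result lines ending in `OK (12.34s)`; the function name `slowest_tests` is made up for illustration. Result lines that carry no test name (as in the listing above) are kept with an empty name, and the "All N tests passed" summary line is skipped.

```python
import re

# Matches result lines such as
#   "    union merge regression:      OK (91.15s)"
# as well as bare lines such as "OK (61.00s)" that carry no test name.
RESULT_RE = re.compile(
    r"^\s*(?:(?P<name>.*?):\s+)?OK \((?P<secs>\d+(?:\.\d+)?)s\)\s*$"
)


def slowest_tests(lines, n=10):
    """Return the n slowest (seconds, name) pairs, slowest first."""
    timings = []
    for line in lines:
        m = RESULT_RE.match(line)
        if m:
            timings.append((float(m.group("secs")), m.group("name") or ""))
    return sorted(timings, reverse=True)[:n]
```

Feeding it the lines of a saved log, for example `slowest_tests(open("tests.out"))`, gives a list that can be grouped or summed further.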

done --Joey

comment 1

Since you mentioned standalone builds, I suspect you are running something like: "cd git-annex.linux; ./git-annex test"

I have verified that this is much, much slower than running a git-annex that is not from the standalone build: on the order of 1+ hours vs 5 minutes.

There is a simple workaround: run "git-annex.linux/git-annex" from somewhere outside the standalone build. That will make it much faster. Here it takes about 5 minutes.

With that said, there's still a lot of room to speed up git-annex test, and the main thing would probably be to parallelize its tests. That can be done, but needs tasty-1.2, which finally made it into Debian stable, so git-annex should be able to depend on it now. I would not be surprised if it can be sped up 10x that way, because tests often have to wait one or more seconds after writing a file due to time stamp issues etc. I have created a paralleltest branch with a start on that, not yet working.

Comment by joey — Tue Mar 1 18:06:32 2022

comment 2

I have fixed the problem I identified, which was due to git-annex test adding the cwd to PATH. That caused it to run git-annex.linux/git, so runshell was being run repeatedly and unnecessarily.

Now it will run git-annex.linux/bin/git and avoid the repeated runshell overhead, so it will be about as fast as git-annex not run from the standalone tarball.
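
The effect of PATH ordering can be illustrated with a small sketch. It does not reproduce git-annex's code: it builds a throwaway directory laid out like the two paths named above (a top-level `git` and a `bin/git`) and uses Python's `shutil.which` to show which one a PATH search picks. The helper names and the placeholder files are assumptions made for illustration, and a POSIX system is assumed.

```python
import os
import shutil
import stat
import tempfile


def make_executable(path):
    """Create a placeholder executable file at path."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)


def resolve_git(path_dirs):
    """Return the 'git' that a PATH search over path_dirs would pick."""
    return shutil.which("git", path=os.pathsep.join(path_dirs))


def demo():
    with tempfile.TemporaryDirectory() as top:
        bindir = os.path.join(top, "bin")
        make_executable(os.path.join(top, "git"))     # stands in for git-annex.linux/git
        make_executable(os.path.join(bindir, "git"))  # stands in for git-annex.linux/bin/git

        # Top directory (the cwd in the bug) ahead of bin/: the wrapper wins.
        before = resolve_git([top, bindir])
        # Only bin/ on PATH: the bundled binary is found directly.
        after = resolve_git([bindir])
        return os.path.relpath(before, top), os.path.relpath(after, top)
```

Calling `demo()` returns the two resolved paths relative to the temporary directory: the first is the top-level wrapper, the second the one under `bin/`.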

Comment by joey — Tue Mar 1 19:58:37 2022

still slow

I have run the tests with a "bleeding edge" build of git-annex on ndoli (a less busy node of discovery) and unfortunately it is still quite slow: it took about 50 minutes.

more details

> time git annex test 2>&1 | tee 10.20220222+git62-gce523f756-1~ndall+1-tests.log
...
OK (71.94s)
    preferred content:                                    OK (34.75s)
    required_content:                                     OK (17.34s)
    add subdirs:                                          OK (12.63s)
    addurl:                                               OK (10.71s)

All 840 tests passed (2670.47s)

real    48m52.802s
user    3m23.073s
sys     4m53.776s

[d31548v@ndoli tmp]$ git annex version | head
git-annex version: 10.20220222+git62-gce523f756-1~ndall+1
...

[d31548v@ndoli tmp]$ pwd
/dartfs-hpc/rc/lab/C/CANlab/labdata/data/tmp

So there is probably more to the story (NFS), or perhaps it needs nfs4_*etfacl? Well, I also ran it under a folder which should be proper POSIX, and it took 5075.48s (with the same 10.20220222+git62-gce523f756-1~ndall+1).

Comment by yarikoptic — Tue Mar 8 00:27:59 2022

comment 4

One thing I notice is that the test suite reports it took 44 minutes (2670 seconds), but time reports 49 minutes. Those additional 5 minutes must be the test suite cleaning up the test directories. Which fits with NFS. That is 5 minutes to effectively rm -rf maybe 20k directories/files.
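
The gap can be checked against the `time` output quoted in the previous comment. The sketch below only redoes that arithmetic; the variable names are illustrative. It also shows that user plus sys CPU time is a small share of the wall-clock time, which would be consistent with the run mostly waiting on the filesystem rather than computing.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XmY.YYYs' reading from time(1) to seconds."""
    return minutes * 60 + seconds


real = to_seconds(48, 52.802)      # wall-clock time reported by `time`
user = to_seconds(3, 23.073)       # CPU time in user mode
sys_time = to_seconds(4, 53.776)   # CPU time in kernel mode
suite = 2670.47                    # runtime reported by the test suite itself

# Time spent outside what the test suite counted (attributed above to cleanup).
unaccounted = real - suite
# Share of the wall-clock time in which the CPU was actually busy.
cpu_fraction = (user + sys_time) / real

print(f"unaccounted: {unaccounted:.1f}s ({unaccounted / 60:.1f} min)")
print(f"CPU share of wall-clock time: {cpu_fraction:.0%}")
```

The unaccounted time comes out at roughly 262 seconds, a bit under four and a half minutes, and the CPU share at about 17%.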

Also, you originally said it took more than 1 hour (or perhaps more than 5000 seconds, which would be 1.4 hours). So it seems that my fix did have a significant impact on speed.

What I see benchmarking locally is that the standalone tarball takes 1016 seconds (down from 3600+), while a bare git-annex binary takes 614 seconds. The remaining difference is probably due to the small overhead (100 failed opens) discussed in this old todo, multiplied by the thousands of times the test suite runs git-annex.

Parallelizing the test suite seems like the only way to get a substantial speedup.

Comment by joey — Tue Mar 8 18:16:40 2022

comment 5

Re that 5 minutes to clean up, I tried making it clean up test directories in the background while running other tests. Interestingly, that slowed it down here by 33 seconds (5%). Due to disk IO contention I suppose? (SSD)

Makes me wonder how much benefit parallelism would be..

Comment by joey — Tue Mar 8 19:43:01 2022

comment 6

@yoh It would be interesting if you could check the speed on NFS when bypassing the standalone build's overhead by running git-annex.linux/shimmed/git-annex/git-annex test

That will not use the bundled libraries and programs, but if the test system is reasonably similar to the build system it would still work.

It's good you're testing the standalone build works, but I think this test is about testing NFS really, so you could leave the standalone build testing to other test runs than this one, if that is a significant speedup.

Comment by joey — Tue Mar 8 19:47:12 2022

comment 7

As an experiment, I tried opening 5 terminals and manually running each of the 5 main groups of tests in parallel, each command in a different directory:

git-annex test -p Tests.QuickCheck
git-annex test -p Tests.Remote
git-annex test -p 'Tests.Unit Tests v8 locked'
git-annex test -p 'Tests.Unit Tests v8 unlocked'
git-annex test -p 'Tests.Unit Tests v8 adjusted unlocked branch'

They took, respectively, 34, 58, 159, 154, and 220 seconds. Compared to a sequential runtime of 444 seconds, this shows it can be sped up well by parallelism at least in some cases. Seems likely that splitting up the slower blocks further and having 8 groups of tests could make it faster yet.
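
A quick check of those numbers, as a sketch with illustrative variable names: when all five groups run at once, the wall-clock time is set by the slowest group. The sum of the group times is larger than the sequential runtime, which suggests (assuming the 444 seconds covers the same tests) that each group ran somewhat slower while competing with the others.

```python
group_times = [34, 58, 159, 154, 220]  # seconds per group, run in parallel
sequential = 444                       # seconds for the normal sequential run

wall = max(group_times)        # the parallel run finishes with the slowest group
speedup = sequential / wall    # speedup observed in this experiment
total = sum(group_times)       # combined time spent across all five terminals
slowdown = total / sequential  # how much more total time the groups needed

print(f"wall-clock: {wall}s, speedup: {speedup:.2f}x")
print(f"combined group time: {total}s ({slowdown:.2f}x the sequential run)")
```

This gives a speedup of about 2x, with the slowest group (the adjusted unlocked branch tests) as the limiting factor, which is why splitting the slower blocks further looks promising.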

@yoh, it would be interesting if you could try this on the NFS system and see if it speeds it up enough.

Tasty does not seem to have a way for parallel forks of the test program to report back their status in a way that will be combined together. That does seem like something that could perhaps be added to it in a nice clean way.

But, a quick hack is also possible: Have git-annex test fork off one child process per each of these groups, and serialize the output. Using --color=always when at the console and using concurrent-output to stream one of the currently running tests while buffering the rest for later display should make this almost indistinguishable from the "right" way. Would also need to detect some tasty options and fall back to running it normally.
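
The shape of that quick hack can be sketched in Python. This is not the git-annex implementation (which is Haskell and would use concurrent-output); it is a simplified stand-in that buffers every child's output and prints it in order once all children are done, rather than streaming one child live. All function names are made up for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_group(job):
    """Run one child process; return (exit code, combined stdout+stderr)."""
    argv, cwd = job
    proc = subprocess.run(
        argv, cwd=cwd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return proc.returncode, proc.stdout


def run_groups(jobs):
    """Run all jobs concurrently; results come back in the order of jobs."""
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        return list(pool.map(run_group, jobs))


def report(jobs, out=sys.stdout):
    """Write each child's buffered output in turn; return True if all passed."""
    results = run_groups(jobs)
    for _, output in results:
        out.write(output)
    return all(code == 0 for code, _ in results)
```

Each job is an `(argv, cwd)` pair, so the groups from the experiment above could be expressed as, for example, `(["git-annex", "test", "-p", "Tests.Remote"], scratch_dir)` with a separate, hypothetical `scratch_dir` per group.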

This approach would avoid the problem I hit on the paralleltest branch of needing to rewrite all the testing code to not run chdired into the repo in order to make it able to be run in parallel. That would be a lot of work, and would also make it harder to write new tests, since it would be easy to make a mistake that caused a test to write outside the test repo.

Comment by joey — Thu Mar 10 17:58:07 2022

comment 8

I have implemented parallelism as described in comment 7.

Currently there are 5 child processes, and the test runtime dropped from 444 to 334 seconds on my laptop. Splitting up the test groups further, so that there are more child processes, will probably improve that more. It remains to be seen whether it helps much on NFS.

The git-annex test output is currently a mess; it needs to be serialized. I ran out of time to do that today, but the speed improvement is worth temporarily ugly output.

Comment by joey — Mon Mar 14 18:52:09 2022

comment 9

I've finished up parallelizing git-annex test.

Splitting up the test groups further and improved scheduling sped it up more. On my laptop, it's dropped from 444 to 334 to now 289 seconds.

Also, the -J option is now supported by git-annex test, so you can experiment to find the number of jobs where it runs fastest in your particular situation. The default is one job per CPU core.

My guess is that on NFS, it's not CPU bound but is network latency bound, and so a rather high -J value like -J10 may behave better.
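
One way to run that experiment is to time the test suite at several job counts and compare. The sketch below is an assumption-laden helper, not part of git-annex: the function names are invented, and the default command simply follows the `-J10` form mentioned above. Output from the tests is discarded; only the elapsed time and exit code are kept.

```python
import subprocess
import sys
import time


def time_command(argv, cwd=None):
    """Run argv, discarding its output; return (elapsed seconds, exit code)."""
    start = time.monotonic()
    proc = subprocess.run(
        argv, cwd=cwd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.monotonic() - start, proc.returncode


def sweep_jobs(job_counts, make_argv=lambda n: ["git-annex", "test", f"-J{n}"]):
    """Time one run per job count; return {job count: (seconds, exit code)}."""
    results = {}
    for n in job_counts:
        results[n] = time_command(make_argv(n))
        print(f"-J{n}: {results[n][0]:.1f}s", file=sys.stderr)
    return results


def fastest(results):
    """Return the job count of the quickest successful run, or None."""
    ok = {n: secs for n, (secs, code) in results.items() if code == 0}
    return min(ok, key=ok.get) if ok else None
```

For instance, `fastest(sweep_jobs([1, 2, 4, 10]))` would report which of those job counts finished soonest. A single run per value is noisy, especially on a shared NFS server, so repeating the sweep is advisable before drawing conclusions.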

Comment by joey — Wed Mar 16 17:55:52 2022
