Please describe the problem.
With the recent RFing of the scanning for unlocked/annexed files (I guess), a sweep of datalad tests on OSX started to take about 3h 30min instead of the prior 1h 46min. So pretty much twice as long. Besides possibly affecting user experience, I am afraid that would cause too many ripples through our CI setup, which might now run out of time.
Logs etc are at https://github.com/datalad/git-annex/actions/workflows/build-macos.yaml
The first red is ok, just a fluke, but then they all fail due to a change in an output log string (for which there is a fix, but somehow the behavior on OSX seems different; yet to check).
What version of git-annex are you using? On what operating system?
Currently 8.20210428+git282-gd39dfed2a and first got slow with 8.20210428+git228-g13a6bfff4 and was ok with 8.20210428+git202-g9a5981a15
Oh, fresh runs of datalad tests fail because I guess you implemented that "don't display message unless it takes awhile" I spotted in one of the comments but didn't have a chance to follow up on - I guess from now on we won't be able to test our passing through of messages from git-annex (unless there is some setting with which we could disable that feature during tests).
That is unrelated to this issue of slow down though
I assume these tests are creating lots of clones of repositories. Are they also doing lots of merges or switching between branches?
The init scan time has already been optimised as much as seems feasible. The update scan time can still be optimised, see ?speed up keys db update with git streaming.
Also worth saying: When I have a choice between a bug fix and a performance change, I kind of have to pick the bug fix. ?indeterminite preferred content state for duplicated file was a longstanding bug that could cause very expensive misbehavior. The only way to fix it that would not have an impact on performance would be to remove include= and exclude= from the preferred content expression syntax, which would prevent a lot of current uses of preferred content.
What about deferring the scan until the first command that uses `include=`/`exclude=` gets run, if `annex.supportunlocked=false`? Also, could the scan be limited to the files that match one of the include/exclude globs?

Turns out `git-annex init` got a lot slower than it had to; it was doing the same kind of scan twice. I think that probably explains much of the slowdown you saw. (Although it still needs to do somewhat more work than before, it also does it more efficiently than before.)

Also I've optimised the update scan's use of git, around a 20% speedup, although I don't know if your test case speed was impacted much by that scan anyway.
Looking at the CI log, I don't see any past runs that took 1h 46min. A month ago they were taking 2h 6min. Let's see if the changes I'm pushing now drop it back to that.
For completeness
I was looking at the OSX ones. And the timing was for the one right before the slowdown: it was this one, but even the previous one was 1h40min.
🤞
FWIW: lots of `git annex init` in new git repos. Also a good number of clones AFAIK. FWIW, we used to do more clones instead of just recreating new repos per test, but then we switched to just creating a new repo. We still have a good number of clones though. Not that many merges or branch switches AFAIK.
I do not see 1h40min anywhere on the page you linked. It says 2h 6m at the top. Oh I see, you are clicking around to get to https://github.com/datalad/git-annex/runs/2661363385?check_suite_focus=true, and that number is in there next to "Run datalad tests".
My improvements yesterday did not improve your test suite time any. But they certainly sped it up a lot in my benchmarks. So I think what you think is taking more time, eg the scanning at init, does not really have much to do with whatever is really taking more time. And if your test suite is mostly not cloning repos but is initializing new empty repos, then the scanning at init was and still is effectively a noop, it will not have gotten more expensive.
It might be that the incremental updating git-annex now does when it sees changes to the index is making your test suite run a bit longer. But twice as long? That is not a very expensive process.
Also notice that the git-annex test part of the CI job used to take 15m54s and is now taking 17m29s. That's doing some cloning and some adding of files etc, and it didn't double in run time.
I did find another nice optimisation this morning c831a562f550e6ff93e081f555eced3a8a0d524f, so we can see if that improves things in the next run.
I suspect you're going to have to look at specific parts of your test suite and/or the CI system to identify what's slower though.
Thank you Joey for all the OPTs. Looking at it again: the significant impact seems to be specific to the OSX builds: Linux and Windows builds got just a bit slower (if at all). So maybe the situation is not that dire. Two possibilities:
I will try to do some timings locally. Also I should finally get those logs (and built packages) into some more convenient form using tinuous, which John has been developing -- that would allow for easy greps instead of all this jumping through CIs etc.
I can't think of anything OSX specific in the recent changes.
I have added debugging of when reconcileStaged wakes up and possibly wastes some time, eg:
All the new work happens in between those two debugs[1], so you could check if the time sink is there or elsewhere.
(Note that the last release takes 2 seconds longer for that init than it does now..)
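For example, something along these lines could show whether the time sink falls inside that window (a sketch only; the exact text of the new debug messages isn't quoted above, so the grep pattern is an assumption):

```sh
mkdir scratch && cd scratch && git init -q
# --debug output goes to stderr; keep only the reconcileStaged-related lines
git annex init --debug 2>&1 | grep -i reconcilestaged
```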
[1] With the exception of a single call to `git write-tree`, but that should be very fast.

Finished fetching all the builds (logs + artifacts, i.e. built installers and packages) to smaug. FWIW the `joey` user on smaug should be able to access them at /mnt/datasets/datalad/ci/git-annex as well (it is all git/git-annex with submodules, in case you happen to want to get a "clone" ;))

grep confirms the timing for datalad tests on OSX
Interestingly the top one is in the 10k range, but prior ones were ok
and absence of the drastic effect confirmed for linux and windows
Now I will try to do timing on a local OSX to see if it is indeed just some coincidence with the OSX VM getting slower, or if it is "real".
I think I have localized the slowdown to a single particular test in datalad which operates on a very heavy tree with tiny files. Good and bad runs:
you can see from the timestamps (I guess github prepends the time stamp AFTER getting the full line) that there is over 1h30m spent there on `test_files_split(<class 'datalad.support.annexrepo.AnnexRepo'>,)`. Here is the actual test etc for posterity. Yet to pinpoint more specifically what is going on, but most likely some interplay with command line length invocation limits (specific to OSX) etc.

So the good news is that it is not some widespread drastic slow-down effect as far as I can see.
ok -- I think it (or at least a part of it, the datalad test is still running) boils down to `git-annex add` now doing some `O(n-staged * n-cmdline-paths)` lookup/operation (instead of just `O(n-cmdline-paths)` before) whenever we have a series of `annex add --json cmdline-paths ...` invocations. This is reflected by the fact that where before we had about ~30 sec per invocation of `annex add`, now we have 30, 284, 528, 720.

In any case, in datalad we should finally switch to using `annex add --batch`; filed an issue. But I guess maybe it could also be addressed on the git-annex side, since it sounds like some suboptimal data structure is used for some paths' matching.

This is starting to make some sense. If you're running git-annex add N times, adding M files each time, then each run will now diff the changes made to the index by the previous run.
And the first part of diffing the index is generating a tree from all the files in it, which is to some extent `O(N*M)` (though IDK, git may have optimisations involving subtrees or such). So the combined N git-annex add runs come to `O(N*N*M)`.

On linux, `git write-tree` with 100,000 files in the index runs in under 1 second, so the `N*M` is not too bad. And then there's the overhead of git-annex processing the resulting diff, which takes more time but is what I've been optimising.

Perhaps on OSX something is making the write-tree significantly slower. Or something is making it run the command more times with fewer files per run. Although IIRC OSX has close to the same maximum command line length as linux.
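For reference, a rough way to check that claim on a given machine (my own sketch, not from this thread; the 100 x 1000 layout is just an arbitrary way to get 100,000 files into the index):

```sh
set -eu
mkdir write-tree-bench && cd write-tree-bench
git init -q
# stage 100,000 small unique files
for d in $(seq 100); do
    mkdir "d$d"
    for f in $(seq 1000); do echo "$d-$f" > "d$d/f$f"; done
done
git add .
# this is the step the index diffing pays for on every run
time git write-tree
```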
Or maybe the index diffing is diffing from the wrong start point. One way I can think of where this would happen is if it somehow misdetects the index as being locked.
A --debug trace of the one of the later git-annex add runs in that test case would probably shed some useful light.
Yes, --batch should avoid the problem...
The latter would be my guess... We seem to get 2 vs 5 splits on Linux vs OSX... our `datalad.utils.CMD_MAX_ARG` (logic is here) gets set to 1048576 on linux and 524288 on OSX. So it "matches", and the "OSX mystery" is resolved!

Meanwhile, confirming that using `add --batch` mitigates it. With an ad-hoc patch adding `add --batch` to datalad, I get a 187.8053s run for our test using `annex add --batch` and bleeding-edge annex 57b567ac8, and 183.1654s (so about the same) using annex 9a5981a15 from 20210525 (which takes over an hour with splitting); and then with our old splitting and that old git-annex 9a5981a15 I get a 188.3987s run.

The test is `@known_failure_windows` ;)) so we do not see that "regression" on windows runs (which was yet another part of the mystery to me, why OSX only ;))

Trying to repro on linux, batches of 1000 files each time:
There's some growth here, but it seems linear to the number of new files in the git index.
Doing the same with the last git-annex release:
0:10.97elapsed, 0:11.24elapsed, 0:11.65elapsed, 0:11.90elapsed, 0:12.25elapsed, 0:12.86elapsed, 0:12.74elapsed, 0:12.84elapsed, 0:13.26elapsed
So close to the same, the slowdown feels minimal.
Then on OSX:
Well, this does not seem exponential but it's certainly growing faster than linux.
Then tried on OSX with 100 chunks of 100 files each:
There's more overhead because git-annex has to start up and diff the index trees, which takes longer than processing the 100 files in the chunk. But this is not taking hours to run, so either the test case involves a lot more than 10k files, or there is something else involved.
thank you Joey for timing it up! May be relevant: in our test scenario we have 100 directories with 100 files in each one of those and each directory or file name is 100 chars long (not sure if this is relevant either). So doing 5 subsequent adds (on OSX) should result in ~20k paths specified on the command line for each invocation.
so I did this little replication script which would do a similar drill (just without the long filenames for this one):
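A minimal sketch of what such a drill looks like (my reconstruction, not the original script: the 100 x 100 layout is taken from the test description above, while the split into 5 `add` invocations and the unique file contents are assumptions):

```sh
set -eu
mkdir repro && cd repro
git init -q && git annex init
# 100 directories x 100 files with unique content, added over 5 separate
# `git annex add` invocations (20 new directories per invocation)
for run in $(seq 5); do
    dirs=""
    for i in $(seq 20); do
        d="d${run}_$i"
        mkdir "$d"
        for f in $(seq 100); do echo "$run-$i-$f" > "$d/f$f"; done
        dirs="$dirs $d"
    done
    /usr/bin/time git annex add --json $dirs >/dev/null
done
```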
with "older" 8.20210429-g9a5981a15 on OSX - stable 30 sec per batch (matches what I observed from running our tests)
and the newer 8.20210429-g57b567ac8 -- I got the same-ish nice timing without significant growth! -- damn it
so I looked into our test generation again and realized -- we are not populating unique files. They are all empty!
and now confirming with this slightly adjusted script which just touches them:
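Roughly, the only change relative to the sketch above is creating the files empty instead of with unique content (again my reconstruction, not the original script):

```sh
set -eu
mkdir repro-empty && cd repro-empty
git init -q && git annex init
# same 5 x 20 x 100 drill, but every file is empty, so they all share one
# key -- the case that triggers the slowdown
for run in $(seq 5); do
    dirs=""
    for i in $(seq 20); do
        d="d${run}_$i"
        mkdir "$d"
        for f in $(seq 100); do touch "$d/f$f"; done
        dirs="$dirs $d"
    done
    /usr/bin/time git annex add --json $dirs >/dev/null
done
```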
This reproduces the case I had observed.
so -- sorry I have not spotted that peculiarity right from the beginning!
FWIW the complete timing was
     21.65 real   8.47 user  10.89 sys
    139.96 real  61.51 user  78.18 sys
    253.47 real 112.01 user 142.78 sys
    370.43 real 161.94 user 211.15 sys
    481.03 real 210.78 user 274.31 sys
so it goes pretty closely with what I observed with the 4 splits (30, 284, 528, 720).

The file contents all being the same is the crucial thing. On linux, adding 1000 dup files at a time (all in the same directory), I get:
run 1: 0:08
run 2: 0:42
run 3: 1:14
run 4: 1:46
After run 4, adding 1000 files with all different content takes 0:11, so not appreciably slowed down; it only affects adding dups, and only when there are a lot of them.
This feels like quite an edge case, and also not really a new problem, since unlocked files would have already had the same problem before recent changes.
I thought this might be an inefficiency in sqlite's index, similar to how hash tables can scale poorly when a lot of things end up in the same bucket. But disabling the index did not improve performance.
Aha -- the slowdown is caused by `git-annex add` looking to see what other annexed files use the same content, so that it can populate any unlocked files that didn't have the content present before. With all these locked files now recorded in the db, it has to check each file in turn, and there's the `O(N^2)`.
If the database recorded when files were unlocked or not, that could be avoided, but tracking that would add a lot of complexity for what is just an edge case. And probably slow things down generally by some amount due to the db being larger.
It seems almost cheating, but it could remember the last few keys it's added, and avoid trying to populate unlocked files when adding those keys again. This would slow down the usual case by some tiny amount (eg an IORef access) but avoid the `O(N^2)` in this edge case. Though it wouldn't fix all edge cases, eg when the files it's adding rotate through X different contents, and X is larger than the number of keys it remembers.

Found an optimisation that sped it up 50%, but this edge case is still `O(N^2)`, so shrug.

Could this be skipped if `annex.supportunlocked=false`? One not-quite-edge case where that happens is when empty files are used as placeholders for outputs of failed steps throughout a large workflow.
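For a repository that never needs unlocked files, that setting would look something like this (a sketch; I'm assuming it is set before `git annex init`, and not covering what an already-initialized repo would need):

```sh
# disable support for unlocked files, then initialize the repo
git config annex.supportunlocked false
git annex init
```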
Could it be an `any` vs the current `all` operation/check (in the case of `add` at least), if it is needed just to test whether there is already a path with a given key, and assuming that the DB doesn't require a full rewrite of all `path` entries when a new path is added for the key?

@Ilya sure, it can be skipped when annex.supportunlocked=false. I've implemented that. (And also for several other cases that have similar behavior, like dropping a key.)
@yarik no, it needs to check all files. Consider what happens when `foo` is an annexed link to key K, which is not present, and you copy `.git/annex/objects/../K` from some other repo and run `git annex add K` (or reinject, or get, it's all the same) -- `foo` now points to the content of K. The same needs to also hold true for unlocked files, and so it has to check if `foo` is an unlocked pointer to K and populate the file with the content. Repeat for all the other 9999 files..

but that should only need to be done iff K became present/known when it was not present before. If K is already known (e.g. was just added for another file, or maybe was added in a previous "commit"), no such checks are needed, since those files could be expected to already be populated. Right?
Some thoughts leading to a workable plan:
It's easy to detect this edge case because getAssociatedFiles will be returning a long list of files. So it could detect say 10 files in the list and start doing something other than the usual, without bothering the usual case with any extra work.
Git starts to get slow anyway in the 1 million to 10 million file range. So we can assume less than that many files are being added. And there need to be a fairly large number of duplicates of a key for speed to become a problem when adding that key. Around 1000 based on above benchmarks, but 100 would be safer.
If it's adding 10 million files, there can be at most 10000 keys that have >= 1000 duplicates (10 million / 1000). No problem to remember 10000 keys; a key is less than 128 bytes long, so that would take 1250 kb, plus the overhead of the Map. Might as well remember 12 mb worth of keys, to catch 100 duplicates.

It would be even better to use a bloom filter, which could remember many more, and I thought I had a way, but the false positive case seems the wrong way around. If the bloom filter remembers keys that have already had their associated files populated, then a false positive would prevent doing that for a key where it hasn't been done.
It would make sense to do this not only in populateUnlockedFiles but in Annex.Content.moveAnnex and Annex.Content.removeAnnex. Although removeAnnex would need a different bloom filter, since a file might have been populated and then somehow get removed in the same git-annex call.
Oh, there's a much better solution: If the annex object file already exists when ingesting a new file, skip populating other associated files. They will have already been populated. moveAnnex has to check if the annex object file already exists anyway, so this will have zero overhead.
(Maybe that's what yarik was getting at in comment #30)
Implemented that, and here are the results, re-running my prior benchmark:
run 1: 0:03.14
run 2: 0:03.24
run 3: 0:03.35
run 4: 0:03.45
run 9: 0:03.65
That also shows the actual overhead of the diffing of the index, as its size grows, is quite small.
yes, awesome, thank you Joey!
FWIW, here is the little trail of our tests' timings (non-adjusted, against the released version)
so we are indeed back to normal (if not better ;)) !