Please describe the problem.
I was trying to see/measure the effect of using annex.supportunlocked to overcome the significant IO caused by git annex init while working on some glorious collection of thousands of git-annex repos with thousands of files in them. But I came to the realization that the setting seems to have no effect (I still see "scanning for unlocked", and the timing is the same), and that a prior version of git-annex is considerably faster:
$> builtin cd /tmp; rm -rf 6e387cd5-fa41-495a-928e-49a0c0b0a35f; git clone https://github.com/dandizarrs/6e387cd5-fa41-495a-928e-49a0c0b0a35f >&/dev/null && cd 6e387cd5-fa41-495a-928e-49a0c0b0a35f && time git annex init; git annex version | head -n 1
0/ 1/ 2/ 3/ 4/ 5/ 6/
init (scanning for annexed files...)
Remote origin not usable by git-annex; setting annex-ignore
https://github.com/dandizarrs/6e387cd5-fa41-495a-928e-49a0c0b0a35f/config download failed: Not Found
ok
(recording state in git...)
git annex init 10.23s user 1.48s system 109% cpu 10.701 total
git-annex version: 10.20220624-g769be12
and with that config annex.supportunlocked set before annex init:
$> builtin cd /tmp; rm -rf 6e387cd5-fa41-495a-928e-49a0c0b0a35f; git clone https://github.com/dandizarrs/6e387cd5-fa41-495a-928e-49a0c0b0a35f >&/dev/null && cd 6e387cd5-fa41-495a-928e-49a0c0b0a35f && git config annex.supportunlocked false && time git annex init; git annex version | head -n 1
0/ 1/ 2/ 3/ 4/ 5/ 6/
init (scanning for annexed files...)
Remote origin not usable by git-annex; setting annex-ignore
https://github.com/dandizarrs/6e387cd5-fa41-495a-928e-49a0c0b0a35f/config download failed: Not Found
ok
(recording state in git...)
git annex init 9.94s user 1.22s system 109% cpu 10.219 total
git-annex version: 10.20220624-g769be12
so it took over 10 seconds (conda env built version), whereas if I use a roughly 1.5 year old (neurodebian standalone) version, it takes under 3 seconds:
$> builtin cd /tmp; rm -rf 6e387cd5-fa41-495a-928e-49a0c0b0a35f; git clone https://github.com/dandizarrs/6e387cd5-fa41-495a-928e-49a0c0b0a35f >&/dev/null && cd 6e387cd5-fa41-495a-928e-49a0c0b0a35f && time git annex init; git annex version | head -n 1
0/ 1/ 2/ 3/ 4/ 5/ 6/
init (merging origin/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
Remote origin not usable by git-annex; setting annex-ignore
ok
(recording state in git...)
git annex init 1.57s user 0.66s system 103% cpu 2.149 total
git-annex version: 8.20210223-1~ndall+1
while even "scanning for unlocked files".
Since difference is quite substantial I have decided to file this issue.
Notice that it says "scanning for annexed files", not "unlocked files" anymore.
The reason it's slower is explained by 428c91606b. That change was necessary to fix the bad, longstanding bug "indeterminite preferred content state for duplicated file". It has to scan every annexed file, not only the unlocked ones, in order to collect the data needed to avoid that bug.
And the scan got optimised as well as possible at the time, I think; see the old issue you filed, @yoh: https://git-annex.branchable.com/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/
FTR, on the target server where I spotted that long-running annex init process, which is likely a bit "overloaded" ATM and bottlenecking on IO, the timing is quite sad: it took 23m. But since user/sys times are not as bad, I guess it is indeed due to IO bottlenecking...
The last time I worked on optimising this, in "speed up keys db update with git streaming", it looked like writing to sqlite was probably a large part of the remaining time. Processing 100k new files took 41 seconds, and processing 100k removed files took only 17 seconds. And the only difference as far as what git-annex is doing in those cases is adding to sqlite vs removing from sqlite.
This comment suggests using multi-value sqlite inserts as perhaps a way to speed it up. Perhaps.
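For a rough idea of what that could look like, here is a sketch using persistent-sqlite against a hypothetical two-column schema (the Associated entity and its column names are just for illustration; git-annex's real keys db differs): a hand-rolled multi-value INSERT sends one statement per batch instead of one statement per row.

```haskell
{-# LANGUAGE DataKinds, DerivingStrategies, EmptyDataDecls, FlexibleContexts,
             FlexibleInstances, GADTs, GeneralizedNewtypeDeriving,
             MultiParamTypeClasses, OverloadedStrings, QuasiQuotes,
             StandaloneDeriving, TemplateHaskell, TypeFamilies,
             UndecidableInstances #-}

import qualified Data.Text as T
import Database.Persist
import Database.Persist.Sqlite
import Database.Persist.TH

-- Toy stand-in for the keys db's associated-files table.
share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Associated
    key T.Text
    file T.Text
|]

-- One statement per row: roughly what a per-file insert loop does.
oneByOne :: [(T.Text, T.Text)] -> SqlPersistM ()
oneByOne = mapM_ (\(k, f) -> insert_ (Associated k f))

-- A single multi-value INSERT for a whole batch. Batches must stay small
-- enough to respect sqlite's bound-parameter limit.
multiValue :: [(T.Text, T.Text)] -> SqlPersistM ()
multiValue [] = pure ()
multiValue rows = rawExecute sql params
  where
    sql = "INSERT INTO associated (key, file) VALUES "
          <> T.intercalate "," (replicate (length rows) "(?,?)")
    params = concatMap (\(k, f) -> [toPersistValue k, toPersistValue f]) rows

main :: IO ()
main = runSqlite ":memory:" $ do
    runMigration migrateAll
    let rows = [ (T.pack ("key" ++ show i), T.pack ("file" ++ show i))
               | i <- [1 :: Int .. 400] ]
    oneByOne rows
    multiValue rows
```

persistent's insertMany_ offers a similar batching opportunity, depending on how the backend implements it.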
I see that repo's master branch has only 86k files in it. On my laptop, git-annex init --no-autoenable takes 24 seconds, and is mostly CPU bound. (Autoenabling adds another 6 seconds to that.)

A local git clone of the repo, by the way, takes 7 seconds; so git-annex init is 4x as slow as the clone.

After stubbing out sqlite writes, git-annex init --no-autoenable took 6 seconds. (2 seconds of that is building up the database write queue.) This is consistent with my earlier benchmarking that sqlite writes are most of the work here. If you would like to try to reproduce that on the slow machine, to see how much of the overhead is sqlite writes, here is a quick and dirty patch.

It probably goes without saying, but that patch should never reach a production system's git-annex. It will break any repository it's used in, for anything except testing this one thing.
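The patch itself is not reproduced above. As a rough, standalone sketch of the same measurement idea, assuming a toy schema and fake data rather than git-annex's real keys db code, the following times one population run with real persistent-sqlite writes and one with the write replaced by a no-op:

```haskell
{-# LANGUAGE DataKinds, DerivingStrategies, EmptyDataDecls, FlexibleContexts,
             FlexibleInstances, GADTs, GeneralizedNewtypeDeriving,
             MultiParamTypeClasses, OverloadedStrings, QuasiQuotes,
             StandaloneDeriving, TemplateHaskell, TypeFamilies,
             UndecidableInstances #-}

import Control.Monad (forM_)
import qualified Data.Text as T
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import Database.Persist
import Database.Persist.Sqlite
import Database.Persist.TH

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Associated
    key T.Text
    file T.Text
|]

-- Populate a database with n fake key/file rows, optionally stubbing out the
-- actual sqlite write to see how much of the runtime the writes account for.
populate :: Bool -> T.Text -> Int -> IO ()
populate stubbed db n = runSqlite db $ do
    runMigration migrateAll
    forM_ [1 .. n] $ \i -> do
        let row = Associated (T.pack ("key" ++ show i)) (T.pack ("file" ++ show i))
        if stubbed
            then pure ()      -- stub: skip the database write entirely
            else insert_ row  -- real write

timed :: String -> IO () -> IO ()
timed label act = do
    t0 <- getCurrentTime
    act
    t1 <- getCurrentTime
    putStrLn (label ++ ": " ++ show (diffUTCTime t1 t0))

main :: IO ()
main = do
    timed "real writes   " (populate False "real.db" 100000)
    timed "stubbed writes" (populate True "stubbed.db" 100000)
```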
Sure thing; converted the PR to draft just to be sure. But its build failed; anyways.
Looks like that PR build did succeed later. I don't know where the resulting artifact would have ended up.
I didn't have time yet to look at that. As for "where": I don't even know myself how to look up those older github actions runs in the github interface; they all are fetched to smaug into the con/tinuous archive of build logs/artifacts. So for this PR:
Figured out what to look up on github:
- go to Actions for the entire repo
- filter only for pull requests
- select the most recent run for that PR and architecture
- build artifacts are at the bottom of that page
With --debug, it shows when it enters and leaves commitDb.
commitDb is run around 70 times in this repository. Making it queue more changes into each commitDb, so that it runs an order of magnitude fewer times, did speed it up by around 1 second (from 24).
Summing up the time spent in commitDb according to --debug, it is around 6.5 seconds, out of the 24 second run. That is close to the actual time sqlite is taking.
But per my comment above, with commitDb stubbed out, the whole run takes only 6 seconds. (I've re-verified that to be sure.) Why is it so much faster then? After doing some profiling and looking at flame graphs, I think it comes down to allocations:
(For future reference, here is a flame graph measuring runtimes of unmodified git-annex: https://tmp.joeyh.name/git-annex-flame-1.svg
It shows the majority of time in commitDb as expected, and that it's due to addAssociatedFile. The resulting upsert spends about 1/4 of its time reading and 3/4 writing. All unsurprising. But also a sizable amount of time is spent in GC, and in IDLE, whatever that is.)
Without it stubbed out:
With it stubbed out:
Here is a flame graph measuring allocations, not runtime, of unmodified git-annex:
https://tmp.joeyh.name/git-annex-flame-3.svg
By far the main allocation is by mkSqlBackend, which is surprising because that represents a database handle. Why would it need to spend tons of memory making database handles?
Starting to think this could be a bug in persistent or persistent-sqlite. I've tried disabling the lowlevel persistent parts of Database.Handle in case there was a bug in there, and using persistent more typically, but that didn't change the mkSqlBackend allocations. Next step might be a simpler test case.
OTOH, it may be that, because of the way persistent uses the SqlBackend as a dictionary of functions, calls to individual functions in it confuse the profiler, and allocation done to build SQL statements is being misattributed?
An idea from looking at the flame graphs is that if reconcileStaged knew it was run on an empty keys db (or was doing a full scan of the tree, not an incremental update), it could avoid doing the upsert, and do a blind insert instead.
This will need some care to be implemented safely...
I benchmarked it, and using insertUnique is no faster, but using insert is. This would be a 15% speed up.
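For reference, here is what the three variants look like in persistent, sketched against a hypothetical schema (the Associated entity, its unique constraint, and these helper names are illustrative only; reconcileStaged and the real keys db differ): the incremental path has to insert-or-update because the row may already exist, while a scan into a known-empty database can do a blind insert.

```haskell
{-# LANGUAGE DataKinds, DerivingStrategies, EmptyDataDecls, FlexibleContexts,
             FlexibleInstances, GADTs, GeneralizedNewtypeDeriving,
             MultiParamTypeClasses, OverloadedStrings, QuasiQuotes,
             StandaloneDeriving, TemplateHaskell, TypeFamilies,
             UndecidableInstances #-}

import Control.Monad (void)
import qualified Data.Text as T
import Database.Persist
import Database.Persist.Sqlite
import Database.Persist.TH

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Associated
    key T.Text
    file T.Text
    UniqueKeyFile key file
|]

-- Incremental update: the key/file pair may already be recorded, so upsert.
addUpsert :: T.Text -> T.Text -> SqlPersistM ()
addUpsert k f = void $
    upsertBy (UniqueKeyFile k f) (Associated k f) [AssociatedFile =. f]

-- Insert only when not already present (benchmarked as no faster than upsert).
addIfNew :: T.Text -> T.Text -> SqlPersistM ()
addIfNew k f = void $ insertUnique (Associated k f)

-- Blind insert: safe when the database is known to be empty, and faster.
addBlind :: T.Text -> T.Text -> SqlPersistM ()
addBlind k f = insert_ (Associated k f)

main :: IO ()
main = runSqlite ":memory:" $ do
    runMigration migrateAll
    addBlind "key1" "file1"
    addUpsert "key1" "file1"  -- updates the row that already exists
    addIfNew "key1" "file1"   -- no-op, since the row already exists
```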
Implemented the two optimisations discussed above, and init in that repository dropped from 24 seconds to 19 seconds, a 21% speedup.
I think that's as fast as reconcileStaged is likely to get without some deep optimisation of the persistent library.
Then I realized that git-annex init does not really need to scan for associated files. That can be done later, when running a command that needs to access the keys database. Indeed, when git-annex is used in a clone of an annexed repo without explicitly running git-annex init, that's what it already did. I've implemented that, so now git-annex init takes 3 seconds or so. The price will be paid later, the first time running a git-annex add or git-annex unlock or git-annex get.

I wrote a standalone test program that uses persistent-sqlite to populate a database with the same data. It runs in 4.12 seconds, which is sufficiently close to the 6 seconds I measured before for the actual sqlite runtime in git-annex init.

So, reconcileStaged is not slow only due to persistent-sqlite overhead, but due to something else as well. Weird!
However, I also dumped the database to a sql script. Running that script with sqlite3 takes only 2.39 seconds. So, persistent-sqlite does have some performance overhead. Opened an issue: https://github.com/yesodweb/persistent/issues/1441
The test program:
(Feed on stdin a series of pairs of lines of key and filename.)
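The original program is not reproduced here; a minimal sketch of that kind of test program, assuming a simple two-column schema rather than git-annex's actual keys db layout, would look like this:

```haskell
{-# LANGUAGE DataKinds, DerivingStrategies, EmptyDataDecls, FlexibleContexts,
             FlexibleInstances, GADTs, GeneralizedNewtypeDeriving,
             MultiParamTypeClasses, OverloadedStrings, QuasiQuotes,
             StandaloneDeriving, TemplateHaskell, TypeFamilies,
             UndecidableInstances #-}

import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Database.Persist
import Database.Persist.Sqlite
import Database.Persist.TH

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Associated
    key T.Text
    file T.Text
|]

-- Pair up consecutive input lines: a key on one line, its filename on the next.
pairs :: [T.Text] -> [(T.Text, T.Text)]
pairs (k:f:rest) = (k, f) : pairs rest
pairs _ = []

main :: IO ()
main = do
    rows <- pairs . T.lines <$> TIO.getContents
    runSqlite "populate-test.db" $ do
        runMigration migrateAll
        mapM_ (\(k, f) -> insert_ (Associated k f)) rows
```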
https://github.com/yesodweb/persistent/issues/1441 is having some useful discussion. There are changes that would speed it up 2x. The best path may be improving the sqlite backend to optimise insert_.

https://github.com/yesodweb/persistent/issues/1441 got fixed in persistent-2.14.4.1. git-annex will get that speed improvement when it's built with that version. I'd anticipate it improving from 19 seconds to around 16 seconds.
(I further opened https://github.com/yesodweb/persistent/issues/1457 that would speed up some other parts of git-annex if implemented.)
The only reason to leave this bug open is that reconcileStaged is still somehow slower than only the persistent-sqlite overhead can explain.
Implemented support for https://github.com/yesodweb/persistent/issues/1457 in git-annex, which does speed up sqlite inserts 2x. That will affect the scan in question, since that inserts to the keys database. It also will speed up some unrelated parts of git-annex.