Please describe the problem.
I was trying to see/measure the effect of using annex.supportunlocked to overcome the significant IO caused by git annex init while working on some glorious collection of thousands of git-annex repos with thousands of files in them. But I came to the realization that the setting seems to have no effect (I still see "scanning for unlocked", and the timing is the same), and that a prior version of git-annex is considerably faster:
$> builtin cd /tmp; rm -rf 6e387cd5-fa41-495a-928e-49a0c0b0a35f; git clone https://github.com/dandizarrs/6e387cd5-fa41-495a-928e-49a0c0b0a35f >&/dev/null && cd 6e387cd5-fa41-495a-928e-49a0c0b0a35f && time git annex init; git annex version | head -n 1
0/ 1/ 2/ 3/ 4/ 5/ 6/
init (scanning for annexed files...)
Remote origin not usable by git-annex; setting annex-ignore
https://github.com/dandizarrs/6e387cd5-fa41-495a-928e-49a0c0b0a35f/config download failed: Not Found
ok
(recording state in git...)
git annex init 10.23s user 1.48s system 109% cpu 10.701 total
git-annex version: 10.20220624-g769be12
and with that config annex.supportunlocked set before annex init:
$> builtin cd /tmp; rm -rf 6e387cd5-fa41-495a-928e-49a0c0b0a35f; git clone https://github.com/dandizarrs/6e387cd5-fa41-495a-928e-49a0c0b0a35f >&/dev/null && cd 6e387cd5-fa41-495a-928e-49a0c0b0a35f && git config annex.supportunlocked false && time git annex init; git annex version | head -n 1
0/ 1/ 2/ 3/ 4/ 5/ 6/
init (scanning for annexed files...)
Remote origin not usable by git-annex; setting annex-ignore
https://github.com/dandizarrs/6e387cd5-fa41-495a-928e-49a0c0b0a35f/config download failed: Not Found
ok
(recording state in git...)
git annex init 9.94s user 1.22s system 109% cpu 10.219 total
git-annex version: 10.20220624-g769be12
so it took over 10 seconds (conda env built version), whereas if I use a roughly 1.5 year old (neurodebian standalone) version, it takes under 3 seconds:
$> builtin cd /tmp; rm -rf 6e387cd5-fa41-495a-928e-49a0c0b0a35f; git clone https://github.com/dandizarrs/6e387cd5-fa41-495a-928e-49a0c0b0a35f >&/dev/null && cd 6e387cd5-fa41-495a-928e-49a0c0b0a35f && time git annex init; git annex version | head -n 1
0/ 1/ 2/ 3/ 4/ 5/ 6/
init (merging origin/git-annex into git-annex...)
(recording state in git...)
(scanning for unlocked files...)
Remote origin not usable by git-annex; setting annex-ignore
ok
(recording state in git...)
git annex init 1.57s user 0.66s system 103% cpu 2.149 total
git-annex version: 8.20210223-1~ndall+1
while even "scanning for unlocked files".
Since difference is quite substantial I have decided to file this issue.
Notice that it says "scanning for annexed files", not "unlocked files" anymore.
The reason it's slower is explained by 428c91606b. That change was necessary to fix the bad, longstanding bug "indeterminite preferred content state for duplicated file". It has to scan every annexed file, not only the unlocked ones, in order to collect the data needed to avoid that bug.
And the scan got optimised as well as possible at the time, I think; see the old issue you filed, @yoh: https://git-annex.branchable.com/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/
FTR, on the target server where I spotted that long-running annex init process, which is likely a bit "overloaded" ATM and bottlenecking on IO, the timing is quite sad: it took 23m. But since user/sys times are not as bad, I guess it is indeed due to IO bottlenecking...
The last time I worked on optimising this, in "speed up keys db update with git streaming", it looked like writing to sqlite was probably a large part of the remaining time. Processing 100k new files took 41 seconds, and processing 100k removed files took only 17 seconds. And the only difference as far as what git-annex is doing in those cases is adding to sqlite vs removing from sqlite.
This comment suggests using multi-value sqlite inserts as perhaps a way to speed it up. Perhaps.
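For a rough idea of what that could look like, here is a sketch using persistent-sqlite against a hypothetical two-column schema (the Associated entity and its column names are just for illustration; git-annex's real keys db differs): a hand-rolled multi-value INSERT sends one statement per batch instead of one statement per row.

```haskell
{-# LANGUAGE DataKinds, DerivingStrategies, EmptyDataDecls, FlexibleContexts,
             FlexibleInstances, GADTs, GeneralizedNewtypeDeriving,
             MultiParamTypeClasses, OverloadedStrings, QuasiQuotes,
             StandaloneDeriving, TemplateHaskell, TypeFamilies,
             UndecidableInstances #-}

import qualified Data.Text as T
import Database.Persist
import Database.Persist.Sqlite
import Database.Persist.TH

-- Toy stand-in for the keys db's associated-files table.
share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Associated
    key T.Text
    file T.Text
|]

-- One statement per row: roughly what a per-file insert loop does.
oneByOne :: [(T.Text, T.Text)] -> SqlPersistM ()
oneByOne = mapM_ (\(k, f) -> insert_ (Associated k f))

-- A single multi-value INSERT for a whole batch. Batches must stay small
-- enough to respect sqlite's bound-parameter limit.
multiValue :: [(T.Text, T.Text)] -> SqlPersistM ()
multiValue [] = pure ()
multiValue rows = rawExecute sql params
  where
    sql = "INSERT INTO associated (key, file) VALUES "
          <> T.intercalate "," (replicate (length rows) "(?,?)")
    params = concatMap (\(k, f) -> [toPersistValue k, toPersistValue f]) rows

main :: IO ()
main = runSqlite ":memory:" $ do
    runMigration migrateAll
    let rows = [ (T.pack ("key" ++ show i), T.pack ("file" ++ show i))
               | i <- [1 :: Int .. 400] ]
    oneByOne rows
    multiValue rows
```

persistent's insertMany_ offers a similar batching opportunity, depending on how the backend implements it.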
I see that repo's master branch has only 86k files in it. On my laptop, git-annex init --no-autoenable takes 24 seconds, and is mostly CPU bound. (Autoenabling adds another 6 seconds to that.)

A local git clone of the repo, by the way, takes 7 seconds; so git-annex init is 4x as slow as the clone.

After stubbing out sqlite writes, git-annex init --no-autoenable took 6 seconds. (2 seconds of that is building up the database write queue.) This is consistent with my earlier benchmarking that sqlite writes are most of the work here. If you would like to try to reproduce that on the slow machine, to see how much of the overhead is sqlite writes, here is a quick and dirty patch.

It probably goes without saying, but that patch should never reach a production system's git-annex. It will break any repository it's used in, for anything except testing this one thing.
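The patch itself is not reproduced above. As a rough, standalone sketch of the same measurement idea, assuming a toy schema and fake data rather than git-annex's real keys db code, the following times one population run with real persistent-sqlite writes and one with the write replaced by a no-op:

```haskell
{-# LANGUAGE DataKinds, DerivingStrategies, EmptyDataDecls, FlexibleContexts,
             FlexibleInstances, GADTs, GeneralizedNewtypeDeriving,
             MultiParamTypeClasses, OverloadedStrings, QuasiQuotes,
             StandaloneDeriving, TemplateHaskell, TypeFamilies,
             UndecidableInstances #-}

import Control.Monad (forM_)
import qualified Data.Text as T
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import Database.Persist
import Database.Persist.Sqlite
import Database.Persist.TH

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Associated
    key T.Text
    file T.Text
|]

-- Populate a database with n fake key/file rows, optionally stubbing out the
-- actual sqlite write to see how much of the runtime the writes account for.
populate :: Bool -> T.Text -> Int -> IO ()
populate stubbed db n = runSqlite db $ do
    runMigration migrateAll
    forM_ [1 .. n] $ \i -> do
        let row = Associated (T.pack ("key" ++ show i)) (T.pack ("file" ++ show i))
        if stubbed
            then pure ()      -- stub: skip the database write entirely
            else insert_ row  -- real write

timed :: String -> IO () -> IO ()
timed label act = do
    t0 <- getCurrentTime
    act
    t1 <- getCurrentTime
    putStrLn (label ++ ": " ++ show (diffUTCTime t1 t0))

main :: IO ()
main = do
    timed "real writes   " (populate False "real.db" 100000)
    timed "stubbed writes" (populate True "stubbed.db" 100000)
```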
Sure thing; converted the PR to draft just to be sure. But its build failed; anyways.
Looks like that PR build did succeed later. I don't know where the resulting artifact would have ended up.
I didn't have time yet to look at that. As for "where": I don't even know myself how to look up those older github actions runs in the github interface; they all are fetched to smaug into the con/tinuous archive of build logs/artifacts. So for this PR:
Figured out what to look up on github:
- go to Actions for the entire repo
- filter only for pull requests
- select the most recent run for that PR and architecture
- build artifacts are at the bottom of that page
With --debug, it shows when it enters and leaves commitDb.
commitDb is run around 70 times in this repository. Making it queue more changes into each commitDb, so that it runs an order of magnitude fewer times, did speed it up by around 1 second (from 24).
Summing up the time spent in commitDb according to --debug, it is around 6.5 seconds, out of the 24 second run. That is close to the actual time sqlite is taking.
But per my comment above, with commitDb stubbed out, the whole run takes only 6 seconds. (I've re-verified that to be sure.) Why is it so much faster then? After doing some profiling and looking at flame graphs, I think it comes down to allocations:
(For future reference, here is a flame graph measuring runtimes of unmodified git-annex: https://tmp.joeyh.name/git-annex-flame-1.svg
It shows the majority of time in commitDb as expected, and that it's due to addAssociatedFile. The resulting upsert spends about 1/4 of its time reading and 3/4 writing. All unsurprising. But also a sizable amount of time is spent in GC, and in IDLE, whatever that is.)
Without it stubbed out:
With it stubbed out:
Here is a flame graph measuring allocations, not runtime, of unmodified git-annex:
https://tmp.joeyh.name/git-annex-flame-3.svg
By far the main allocation is by mkSqlBackend, which is surprising because that represents a database handle. Why would it need to spend tons of memory making database handles?
Starting to think this could be a bug in persistent or persistent-sqlite. I've tried disabling the lowlevel persistent parts of Database.Handle in case there was a bug in there, and using persistent more typically, but that didn't change the mkSqlBackend allocations. Next step might be a simpler test case.
OTOH, it may be that, because of the way persistent uses the SqlBackend as a dictionary of functions, calls to individual functions in it confuse the profiler, and allocation done to build SQL statements is being misattributed?
An idea from looking at the flame graphs is that if reconcileStaged knew it was run on an empty keys db (or was doing a full scan of the tree, not an incremental update), it could avoid doing the upsert, and do a blind insert instead.
This will need some care to be implemented safely...
I benchmarked it, and using insertUnique is no faster, but using insert is. This would be a 15% speed up.
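For reference, here is what the three variants look like in persistent, sketched against a hypothetical schema (the Associated entity, its unique constraint, and these helper names are illustrative only; reconcileStaged and the real keys db differ): the incremental path has to insert-or-update because the row may already exist, while a scan into a known-empty database can do a blind insert.

```haskell
{-# LANGUAGE DataKinds, DerivingStrategies, EmptyDataDecls, FlexibleContexts,
             FlexibleInstances, GADTs, GeneralizedNewtypeDeriving,
             MultiParamTypeClasses, OverloadedStrings, QuasiQuotes,
             StandaloneDeriving, TemplateHaskell, TypeFamilies,
             UndecidableInstances #-}

import Control.Monad (void)
import qualified Data.Text as T
import Database.Persist
import Database.Persist.Sqlite
import Database.Persist.TH

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Associated
    key T.Text
    file T.Text
    UniqueKeyFile key file
|]

-- Incremental update: the key/file pair may already be recorded, so upsert.
addUpsert :: T.Text -> T.Text -> SqlPersistM ()
addUpsert k f = void $
    upsertBy (UniqueKeyFile k f) (Associated k f) [AssociatedFile =. f]

-- Insert only when not already present (benchmarked as no faster than upsert).
addIfNew :: T.Text -> T.Text -> SqlPersistM ()
addIfNew k f = void $ insertUnique (Associated k f)

-- Blind insert: safe when the database is known to be empty, and faster.
addBlind :: T.Text -> T.Text -> SqlPersistM ()
addBlind k f = insert_ (Associated k f)

main :: IO ()
main = runSqlite ":memory:" $ do
    runMigration migrateAll
    addBlind "key1" "file1"
    addUpsert "key1" "file1"  -- updates the row that already exists
    addIfNew "key1" "file1"   -- no-op, since the row already exists
```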
Implemented the two optimisations discussed above, and init in that repository dropped from 24 seconds to 19 seconds, a 21% speedup.
I think that's as fast as reconcileStaged is likely to get without some deep optimisation of the persistent library.
Then I realized that git-annex init does not really need to scan for associated files. That can be done later, when running a command that needs to access the keys database. Indeed, when git-annex is used in a clone of an annexed repo without explicitly running git-annex init, that's what it already did. I've implemented that, so now git-annex init takes 3 seconds or so. The price will be paid later, the first time running a git-annex add or git-annex unlock or git-annex get.

I wrote a standalone test program that uses persistent-sqlite to populate a database with the same data. It runs in 4.12 seconds, which is sufficiently close to the 6 seconds I measured before for the actual sqlite runtime in git-annex init.

So, reconcileStaged is not slow only due to persistent-sqlite overhead, but due to something else as well. Weird!
However, I also dumped the database to a sql script. Running that script with sqlite3 takes only 2.39 seconds. So, persistent-sqlite does have some performance overhead. Opened an issue: https://github.com/yesodweb/persistent/issues/1441
The test program:
(Feed on stdin a series of pairs of lines of key and filename.)
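The original program is not reproduced here; a minimal sketch of that kind of test program, assuming a simple two-column schema rather than git-annex's actual keys db layout, would look like this:

```haskell
{-# LANGUAGE DataKinds, DerivingStrategies, EmptyDataDecls, FlexibleContexts,
             FlexibleInstances, GADTs, GeneralizedNewtypeDeriving,
             MultiParamTypeClasses, OverloadedStrings, QuasiQuotes,
             StandaloneDeriving, TemplateHaskell, TypeFamilies,
             UndecidableInstances #-}

import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import Database.Persist
import Database.Persist.Sqlite
import Database.Persist.TH

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
Associated
    key T.Text
    file T.Text
|]

-- Pair up consecutive input lines: a key on one line, its filename on the next.
pairs :: [T.Text] -> [(T.Text, T.Text)]
pairs (k:f:rest) = (k, f) : pairs rest
pairs _ = []

main :: IO ()
main = do
    rows <- pairs . T.lines <$> TIO.getContents
    runSqlite "populate-test.db" $ do
        runMigration migrateAll
        mapM_ (\(k, f) -> insert_ (Associated k f)) rows
```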
https://github.com/yesodweb/persistent/issues/1441 is having some useful discussion. There are changes that would speed it up 2x. The best path may be improving the sqlite backend to optimise insert_.

https://github.com/yesodweb/persistent/issues/1441 got fixed in persistent-2.14.4.1. git-annex will get that speed improvement when it's built with that version. I'd anticipate it improving from 19 seconds to around 16 seconds.
(I further opened https://github.com/yesodweb/persistent/issues/1457 that would speed up some other parts of git-annex if implemented.)
The only reason to leave this bug open is that reconcileStaged is still somehow slower than only the persistent-sqlite overhead can explain.
Implemented support for https://github.com/yesodweb/persistent/issues/1457 in git-annex, which does speed up sqlite inserts 2x. That will affect the scan in question, since that inserts to the keys database. It also will speed up some unrelated parts of git-annex.