Please describe the problem.
Hi!
I've got a directory special remote set up, with importtree. After scanning through it (import filename... ok), it stops reading from the filesystem and spends hours just reading from cidsdb. Verified with strace: there is no reading from the filesystem at that point.
What steps will reproduce the problem?
Here are my prep steps:
git init
git config annex.thin true
git annex init 'local hub'
git annex wanted . "include=* and exclude=testdata/*"
touch mtree
git annex add mtree
git annex sync
git annex adjust --unlock-present
git annex initremote source type=directory directory=/acrypt/git-annex/bind-ro/testdata importtree=yes encryption=none
git annex enableremote source directory=/acrypt/git-annex/bind-ro/testdata
git config remote.source.annex-readonly true
git config remote.source.annex-tracking-branch main:testdata # (says to put the files in the "$REPO" directory)
git config annex.securehashesonly true
git config annex.genmetadata true
git config annex.diskreserve 100M
git annex sync
It is that last sync that runs into this problem.
What version of git-annex are you using? On what operating system?
10.20230407 on Debian bullseye
Please provide any additional information below.
There are about 150,000 files in that tree. This problem occurs after git-annex is done scanning the tree (verified with strace and lsof). This problem does not occur with the same number of files in a local/standard git-annex repo (as opposed to the directory special remote).
.git/annex/cidsdb/db is only 51M so it is certainly entirely cached. git-annex is entirely CPU-bound at this point.
I can rerun the sync with an unchanged import directory. It still takes 107 minutes, the majority of which is spent reading cidsdb. Only the first minute or two are spent scanning the source area.
I have tested this on a source directory that's 2.2G and another that's 1.1T, both with about 150,000 files. After the first import, the subsequent syncs are similar in performance. In other words, this behavior appears to be related to the number of files, not the size of files.
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
Yes, and I hope to use it for a project to archive family photos and videos to BD-R (that's what this is about here)
Hm. I have tried to reproduce this with a git-annex instrumented to log each time it accesses the cidsdb.
I have not tried with the full 150k files, only with 10000, but have no reason to think it wouldn't have the same cidsdb access pattern with more files.
What I found is that there was no access to the cidsdb after it finished the "import source" part of the sync.
But, if I then run git-annex sync a second time, it behaves much more like you describe. It does not display "import source" for any files, because no files have changed. And it hits the cidsdb twice per file. One is the necessary call to getContentIdentifierKeys, which is how it determines it's seen a file before and doesn't need to import it a second time.
The other hit is recordContentIdentifier, which happens for each recorded cid, due to updateFromLog. That seems unnecessary, because the previous sync already recorded all the cids.
So I think a 2x speedup should be possible for repeated syncs when most of the files have not changed, but either I am missing something needed to reproduce your problem, or half the overhead is due to git-annex determining whether it has already imported files before, which is unavoidable.
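For reference, those two per-file database hits correspond to SQL roughly like the following. This is only a sketch; the real statements are generated by persistent from Database/ContentIdentifier.hs, and the table and column names here are assumptions:

    -- getContentIdentifierKeys: which keys has this (remote, cid) pair been imported as?
    SELECT key FROM content_identifiers WHERE remote = ? AND cid = ?;

    -- recordContentIdentifier: record a cid from the git-annex branch
    -- (a no-op when the row already exists, which is why the second sync's
    -- calls to it seem unnecessary)
    INSERT OR IGNORE INTO content_identifiers (remote, cid, key) VALUES (?, ?, ?);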
Importing trees from special remotes is simply not going to be able to scale to huge numbers of files. Using a git repository is always gonna perform better. (But of course that hits its own scalability limits eventually.) This is a more general mechanism, and so many of the approaches git uses to speed things up are not available.
Here's the script I used.
Here's the output when run with just 10 files, which shows that cidsdb access stops after it finishes "import source":
Here's the patch I used for git-annex to instrument it:
I was able to eliminate that extra work. Now the first sync does not write to the cidsdb but only to the git-annex branch, and the second sync does the necessary work of updating the cidsdb from the git-annex branch.
I'm not sure that extra work is what the bug reporter was complaining about though.
See speed up import tree for a few ideas of things that would speed it up.
Hi Joey,
Would it be helpful for you if I ran the instrumented version here?
Also I suspect that the time spent reading cidsdb scales exponentially rather than linearly with the number of files. Or perhaps there's a magic spot somewhere (an in-memory cache limit?) that causes the excessive time. Anyhow, I'm happy to do whatever I can to help.
Thanks!
As far as how reads from the cidsdb scale, I think that sqlite databases like this do slow down somewhat when tables get massive, but I don't really know the details. And it's of course possible that something could be improved in the schema or queries.
I've been working on that todo I linked above, and the speed gain is impressive when there are few or no changed files in the remote. With 20,000 unchanged files, re-running git-annex import[1] sped up from 125.95 to 3.84 seconds. With 40,000 unchanged files, it sped up from 477 to 8.13 seconds. I haven't tried with 150000 files yet but the pattern is clear.
Well, I think I've certainly solved that problem. But I don't know if there's something else that is making the initial sync slower than it needs to be.
[1] More accurately, re-running it a second time, both to get a warm cache result, and because the first time, it is busy updating the cidsdb with the files that were imported earlier, as described in comment #2.
Initial sync: 39:01.37
Then git-annex import master:testdata --from source with no changed files: 8:36.30 (This is writing the cidsdb, since the initial sync no longer does; see comment #2)
Subsequent imports with no changed files: 0:37.21
Subsequent syncs with no changed files: 3:18.61
Subsequent syncs with no changed files, when not on an adjusted branch: 0:38.40
That 37 second run time for an import with no changed files seems to indicate that my optimisation so far was successful at eliminating all cidsdb reading. It's bottlenecked on git operations now.
I also tried an initial import of 150000 files to see how much of that initial sync was other work than the import (updating adjusted branch, checking out files, etc).
Initial import: 38:24.36
So the initial sync is mostly spending time on imports, not other stuff. It could perhaps be sped up some. Of course it does support -J2, which may speed it up through parallelism.
Seems like a sync with no changed files on an adjusted branch is doing unnecessary work to update the adjusted branch when nothing has changed.
That's great news! I will rerun my test against master and see what it does.
If you happen to be able to point me at the sqlite queries that are being run, I could see if I could tease out any optimizations there.
I recompiled git-annex with the latest master, and I'm seeing the same behavior as before.
I attached strace to the process for 10 seconds and ran some statistics:
fd 14 corresponds to the db file. These preads are all 4k, which matches the Sqlite page size on this database.
There were about a dozen writes; they were to an [eventfd] fd. I'm not sure what that is used for here. The rest were mainly calls to futex, reads from the timerfd 4, and epolls.
If it were doing one pread64 per file, at a rate of roughly 10,000 per second, it should query for every file in about 15 seconds. But as I write this, the process has used 84 minutes of CPU time. Either it is making many queries for each file, or there is some pathology in sqlite causing the queries to explode into a vast number of individual reads.
It seems to me there is some meaningful difference between my situation and yours.
One thing I have noticed is that it is getting permission denied on about a dozen files out of the test set. However, I don't believe this could have been the case with my original test set (consisting of about 150,000 photos).
I am happy to do any kind of debugging I can that would be helpful.
I note that the sqlite default cache is about 2MB; I don't know whether a larger cache would be helpful. I suspect doing fewer syscalls would help some, but the underlying problem of numerous queries would still exist.
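If it's useful, the page size and cache size can be checked, and the cache enlarged for a single session, from the sqlite3 shell against .git/annex/cidsdb/db. This is just a diagnostic sketch on my part, not something git-annex does itself:

    PRAGMA page_size;           -- 4096 here, matching the 4k preads seen in strace
    PRAGMA cache_size;          -- the default works out to roughly 2MB
    PRAGMA cache_size = -65536; -- request ~64MB of page cache for this connection only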
@jgoerzen if you want to take a look at the sql, see Database/ContentIdentifier.hs. getContentIdentifierKeys is the query that it's running on each file.
I'm not really sure right now if the persistent schema in there actually creates an index that is used for that query. persistent's documentation of indexes is lacking, and I may have misunderstood that uniqueness constraints result in indexes being created.
Dumping the database shows this, which really doesn't seem to have an index after all:
May need some raw sql to add it, like:
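    -- hypothetical; the table and column names would need to match the actual dump above
    CREATE INDEX IF NOT EXISTS content_identifiers_cid_remote
        ON content_identifiers (cid, remote);

(That's only a guess at the right statement; it could be run against .git/annex/cidsdb/db with the sqlite3 command line tool.)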
Also, I re-ran the 150000 file sync benchmark with getContentIdentifierKeys disabled, and it took 29:56.78, so 25% faster.
That gives me the idea for an optimisation -- it could check if the database is empty at start and if so, avoid calling that at all. (It also maintains a map in memory which will still allow it to detect duplicate files.) Speeding up initial imports of a lot of files, but not later imports of a lot of files, is kind of a cop out, but..
What's the last thing output before it enters this state? I know it's at some point after the last "import filename... ok" but I haven't a clue what it could be.
For what it's worth, I straced the importing stage, and there are no reads from the database FDs at all. Presumably sqlite realizes it's empty and doesn't need to keep re-reading to respond to the queries done in that stage. Each file imported took 113 syscalls (these are small files, so only 1 syscall is needed to read the content).
I also attached a strace immediately after it finished the importing stage, and straced all the way to the end (importing 1000 files). There was not a single pread64. FD #14 was already closed at that point, and the cidsdb was never accessed again.
Which matches my reading of the code, that it never accesses the cidsdb after that point. Which makes what you're seeing very strange indeed.
Something is missing to let me reproduce this. Can you reproduce it with the script I posted in comment #1?
Also straced an entire import of 1000 files, and this is all the preads from fd 14 (there was a similar number on 3 other fds):
So I ran your script on the same dataset that I was using.
I was rather surprised that it rocketed through in 15 minutes. I then ran a subsequent git annex sync, which after a few minutes started showing the same pread64 behavior as before.
Perhaps your improvements have been helping here?
I went back to my original full script and thought: let's time each git-annex call.
In this script:
First sync (this is before the initremote so it is doing pretty much nothing):
real 0m0.075s user 0m0.027s sys 0m0.045s
Second sync (this is after initremote/enableremote):
real 15m1.417s user 7m10.770s sys 5m59.405s
[ that correlates closely with how long it took to run your script ]
Third sync (this is after the clone/init/adjust):
real 0m6.606s user 0m1.908s sys 0m4.632s
Fourth sync (git annex sync $DRIVE): real 0m1.066s user 0m0.634s sys 0m0.647s
Fifth sync (after the group/wanted): real 113m32.988s user 55m21.012s sys 58m40.752s
Sixth sync: real 0m1.180s user 0m0.819s sys 0m0.514s
Seventh sync (git annex sync --content): real 112m29.594s user 55m13.650s sys 57m52.259s
Eighth sync (back at the original repo): real 110m3.099s user 51m32.468s sys 58m53.885s
That is a significant move of the goal posts. I don't know if that even involves importtree being slow anymore.
You've apparently got a lot of stuff going on in the directories used by that script, NTFS, acrypt, some kind of removable drive, ...?
I tried to adapt the script to a simple test case that I can run, but had to leave all that stuff out, since I don't know anything about it. The script I ran is below.
I ran it with 10000 files, and the timings of the syncs were:
So I have not reproduced the (new) problem; syncs #5 and #8 are not slow.
Your 5th sync taking a long time is particularly strange, because the only difference between it and the 4th sync is that git annex wanted is configured. But git-annex sync does not do anything with that configuration unless it is syncing content.
That made me think you might have annex.synccontent set in gitconfig globally. It seems to me that would also make the 3rd sync slow, but I suppose you could have a complex git config that sets it for some repos but not other ones, or something like that.
Otherwise I'm out of ideas about how to reproduce your problem.
I'll reiterate that it would be helpful to see the last output of the sync command before it does whatever is being slow.
My apologies for forgetting that.
So initially I thought it was after the last "import source... ok". I'm no longer convinced that it happens after the last one. There is clearly a phase before this starts, and a separate phase exhibiting this behavior.
The last output from one of these runs was:
And indeed there were about 2 dozen files out of that source set that it didn't have the permission to read.
I'm sorry it looked like an expansion of scope. This is not really on NTFS; it's still on ZFS as it has been from the beginning. I had been testing with one of the repos on NTFS previously, but in an effort to rule that out, I've been 100% on ZFS since before reporting this issue. "acrypt" is just the name of my ZFS pool; the name refers to the fact that it sits atop LUKS (I don't believe this to be relevant, since the CPU time is in git-annex, not kcryptd).
There's only one import in the sync, and your output shows it completed (with error).
The only other phase of sync that could be run after that and take a lot of time is content syncing. You would have to have annex.synccontent set somewhere for sync to do that. Do you?
It would make a lot of sense for --content syncing to be what remains slow. That has to scan over all the files and when it decides that it does not need to copy the content anywhere, that's a tight loop with no output.
In my repo with 10000 files that was set up by the latest test case, git-annex sync takes 13 seconds, and with --content it takes 61 seconds.
I optimised a numcopies/mincopies lookup away, and that got it down to 28 seconds.
The cidsdb does not get accessed by the --content scan in my testing, although there may be other situations where it does.
I did try making the permissions of a file not allow it to be read, and so reproduced the "Failed to import some files from source." That didn't change the sync time appreciably (with or without content syncing).
Another expensive thing sync has to do, due to the adjusted branch being checked out, is Command.Sync.updateBranches. In my repo with 10000 files, when not using --content, that is responsible for most of the run time of a sync when there is nothing new to import or pull.
That happens because git-annex adjust --unlock-present was used. So sync needs to update the branch if there have been any changes to the content that is present. Currently, when in such an adjusted branch, it re-builds the branch every time it's run.
It would be possible to avoid that when there have been no changes to the content that is present, but it would need to deal with eg a previous sync command having transferred some content before it was interrupted. So it would need some kind of flag file to be written when content changes.
(With the master branch checked out instead, git-annex sync --content is now as fast as git-annex find --not --in source.)
I've fixed that; now git-annex sync avoids updating the adjusted branch when there have been no changes to available content.
I ran the second test case with 150000 files, and here's how long the syncs took:
Sync 2 took longer than your results in comment #11, but consistent with my laptop being slower. And I think 33 minutes to import 150k files is fine.
Still not seeing sync 5 take as long as it did for you. Nowhere in the same ballpark. 13 minutes seems ok for a sync --content that has to scan 150000 files.
The 7th sync is seeming too slow to me. For you it took equally long as the 5th sync, and both are --content syncs. So maybe I'm seeing the same problem but only on the 7th for some reason?
For me it seemed to take a long time after outputting "list source ok". At that point strace showed only a lot of futex(). And the cpu was pegged. And it had the cidsdb open. Hmmm.. This is feeling a bit like the problem you originally reported.
Interrupted the 7th sync and ran again...
The "list source" takes more than 15 minutes. It's bottlenecked on checking git ignores. Bottleneck that I didn't notice with a smaller number of files. Fixed that by making sure the export db was populated, which it usually is, but not in the 7th sync's situation. Now "list source" completes in less than 2 minutes.
And.. after that, it was back to the tight futex() loop.. And this time I had instrumented the cidsdb, and it was importKeys calling getContentIdentifierKeys.
Here's the kicker: It's only running getContentIdentifierKeys 15 times per second. So that will take 166 minutes for all 150000 files.
Each call to getContentIdentifierKeys is taking 0.05 seconds. So, this bug is back to the original problem of being bottlenecked on the cidsdb. And it is smelling like a lack of indexes. Yay!
The sqlite schema is:
According to the sqlite docs:
But, that index is not used for a cid query:
Aha, it does use the UNIQUE index for queries going the other way:
Even a query using the "remote" column seems to use the UNIQUE index:
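To make the difference concrete, here is roughly what EXPLAIN QUERY PLAN shows for the two kinds of lookups. The table and index names below are what I'd expect persistent to generate, so treat the exact spellings as approximate:

    -- the lookup getContentIdentifierKeys does: no usable index, so a full table scan
    EXPLAIN QUERY PLAN
    SELECT key FROM content_identifiers WHERE cid = ? AND remote = ?;
    -- SCAN content_identifiers

    -- the reverse lookup by key is satisfied by the UNIQUE index
    EXPLAIN QUERY PLAN
    SELECT cid FROM content_identifiers WHERE key = ?;
    -- SEARCH content_identifiers USING INDEX <the UNIQUE index> (key=?)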
Here's a thread on mastodon about this, with some ideas about why sqlite3 behaves this way: https://octodon.social/@joeyh/110515161416516029. Also, see this commit where I added a second index to a database with the same columns as an existing index, only in a different order, and it did speed up a query: ca2a527e93ca22548983a7285fc6e0a892ca44a4. Oh, and there was a uniqueness constraint with the cid first, but it was broken and got removed in c6e693b25de49d4d3b2fedb49ffb42f04f5fd544. So probably it got slower at that point (Dec 2020).
So probably most other tables used by git-annex do have usable indexes, for most queries. But not this one. Gonna need to manually add an index.
(Or possibly redo the schema to have UNIQUE ("key","cid","remote"); it may be that sqlite only creates an index for up to 2 columns or something like that.)
After modifying the database this way, I re-ran the slow sync and it's now able to run getContentIdentifierKeys 2858 times a second. What an improvement!
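For comparison, with an index that leads with cid in place, that same query plan changes from a scan to a search (again schematic output, using the hypothetical index name from above):

    EXPLAIN QUERY PLAN
    SELECT key FROM content_identifiers WHERE cid = ? AND remote = ?;
    -- SEARCH content_identifiers USING INDEX content_identifiers_cid_remote (cid=? AND remote=?)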
Here's the final timing from the second test case with 150000 files after all of the above optimisation:
(git-annex get from an importtree remote was also sped up a lot; it makes the same cid queries as sync #7)
I'm satisfied with this, considering this bug done, unless it's somehow still being slow for you.
WOW! That was it!
I ran the index-creation command on my cidsdb/db, then reran git-annex sync - the very long one. Down to 1m48s! A HUGE improvement!
Thanks Joey!