Please describe the problem.
Hi!
I've got a directory special remote set up, with importtree. After scanning through it (import filename... ok), it stops reading from the filesystem and spends hours just reading from cidsdb. Verified with strace: there is no reading from the filesystem at that point.
What steps will reproduce the problem?
Here are my prep steps:
git init
git config annex.thin true
git annex init 'local hub'
git annex wanted . "include=* and exclude=testdata/*"
touch mtree
git annex add mtree
git annex sync
git annex adjust --unlock-present
git annex initremote source type=directory directory=/acrypt/git-annex/bind-ro/testdata importtree=yes encryption=none
git annex enableremote source directory=/acrypt/git-annex/bind-ro/testdata
git config remote.source.annex-readonly true
git config remote.source.annex-tracking-branch main:testdata # (says to put the files in the "$REPO" directory)
git config annex.securehashesonly true
git config annex.genmetadata true
git config annex.diskreserve 100M
git annex sync
It is that last sync that runs into this problem.
What version of git-annex are you using? On what operating system?
10.20230407 on Debian bullseye
Please provide any additional information below.
There are about 150,000 files in that tree. This problem occurs after git-annex is done scanning the tree (verified with strace and lsof). This problem does not occur with the same number of files in a local/standard git-annex repo (as opposed to the directory special remote).
.git/annex/cidsdb/db is only 51M so it is certainly entirely cached. git-annex is entirely CPU-bound at this point.
I can rerun the sync with an unchanged import directory. It still takes 107 minutes, the majority of which is spent reading cidsdb. Only the first minute or two are spent scanning the source area.
I have tested this on a source directory that's 2.2G and another that's 1.1T, both with about 150,000 files. After the first import, the subsequent syncs are similar in performance. In other words, this behavior appears to be related to the number of files, not the size of files.
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
Yes, and I hope to use it for a project to archive family photos and videos to BD-R (that's what this is about here)
Hm. I have tried to reproduce this with a git-annex instrumented to log each time it accesses the cidsdb.
I have not tried with the full 150k files, only with 10000, but have no reason to think it wouldn't have the same cidsdb access pattern with more files.
What I found is that there was no access to the cidsdb after it finished the "import source" part of the sync.
But, if I then run git-annex sync a second time, it behaves much more like you describe. It does not display "import source" for any files, because no files have changed. And it hits the cidsdb twice per file. One is the necessary call to getContentIdentifierKeys, which is how it determines it's seen a file before and doesn't need to import it a second time.
The other hit is recordContentIdentifier, which happens for each recorded cid, due to updateFromLog. That seems unnecessary, because the previous sync already recorded all the cids.
So I think a 2x speedup should be possible for repeated syncs when most of the files have not changed, but either I am missing something needed to reproduce your problem, or half the overhead is due to git-annex determining whether it has already imported files before, which is unavoidable.
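For reference, those two per-file database hits correspond to SQL roughly like the following. This is only a sketch; the real statements are generated by persistent from Database/ContentIdentifier.hs, and the table and column names here are assumptions:

    -- getContentIdentifierKeys: which keys has this (remote, cid) pair been imported as?
    SELECT key FROM content_identifiers WHERE remote = ? AND cid = ?;

    -- recordContentIdentifier: record a cid from the git-annex branch
    -- (a no-op when the row already exists, which is why the second sync's
    -- calls to it seem unnecessary)
    INSERT OR IGNORE INTO content_identifiers (remote, cid, key) VALUES (?, ?, ?);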
Importing trees from special remotes is simply not going to be able to scale to huge numbers of files. Using a git repository is always gonna perform better. (But of course that hits its own scalability limits eventually.) This is a more general mechanism, and so many of the approaches git uses to speed things up are not available.
Here's the script I used.
Here's the output when run with just 10 files, which shows that cidsdb access stops after it finishes "import source":
Here's the patch I used for git-annex to instrument it:
I was able to eliminate that extra work. Now the first sync does not write to the cidsdb but only to the git-annex branch, and the second sync does the necessary work of updating the cidsdb from the git-annex branch.
I'm not sure that extra work is what the bug reporter was complaining about though.
See speed up import tree for a few ideas of things that would speed it up.
Hi Joey,
Would it be helpful for you if I ran the instrumented version here?
Also I suspect that the time spent reading cidsdb scales exponentially rather than linearly with the number of files. Or perhaps there's a magic spot somewhere (an in-memory cache limit?) that causes the excessive time. Anyhow, I'm happy to do whatever I can to help.
Thanks!
As far as how reads from the cidsdb scale, I think that sqlite databases like this do slow down somewhat when tables get massive, but I don't really know the details. And it's of course possible that something could be improved in the schema or queries.
I've been working on that todo I linked above, and the speed gain is impressive when there are few or no changed files in the remote. With 20,000 unchanged files, re-running git-annex import[1] sped up from 125.95 to 3.84 seconds. With 40,000 unchanged files, it sped up from 477 to 8.13 seconds. I haven't tried with 150000 files yet but the pattern is clear.
Well, I think I've certainly solved that problem. But I don't know if there's something else that is making the initial sync slower than it needs to be.
[1] More accurately, re-running it a second time, both to get a warm cache result, and because the first time, it is busy updating the cidsdb with the files that were imported earlier, as described in comment #2.
Initial sync: 39:01.37
Then git-annex import master:testdata --from source with no changed files: 8:36.30 (This is writing the cidsdb, since the initial sync no longer does; see comment #2)
Subsequent imports with no changed files: 0:37.21
Subsequent syncs with no changed files: 3:18.61
Subsequent syncs with no changed files, when not on an adjusted branch: 0:38.40
That 37 second run time for an import with no changed files seems to indicate that my optimisation so far was successful at eliminating all cidsdb reading. It's bottlenecked on git operations now.
I also tried an initial import of 150000 files to see how much of that initial sync was other work than the import (updating adjusted branch, checking out files, etc).
Initial import: 38:24.36
So the initial sync is mostly spending time on imports, not other stuff. It could perhaps be sped up some. Of course it does support -J2, which may speed it up through parallelism.
Seems like a sync with no changed files on an adjusted branch is doing unnecessary work to update the adjusted branch when nothing has changed.
That's great news! I will rerun my test against master and see what it does.
If you happen to be able to point me at the sqlite queries that are being run, I could see if I could tease out any optimizations there.
I recompiled git-annex with the latest master, and I'm seeing the same behavior as before.
I attached strace to the process for 10 seconds and ran some statistics:
fd 14 corresponds to the db file. These preads are all 4k, which matches the Sqlite page size on this database.
There were about a dozen writes; they were to an [eventfd] fd. I'm not sure what that is used for here. The rest were mainly calls to futex, reads from the timerfd 4, and epolls.
If it were doing one pread64 per file, at a rate of roughly 10,000 per second, it should query for every file in about 15 seconds. But as I write this, the process has used 84 minutes of CPU time. Either it is making many queries for each file, or there is some pathology in sqlite causing the queries to explode into a vast number of individual reads.
It seems to me there is some meaningful difference between my situation and yours.
One thing I have noticed is that it is getting permission denied on about a dozen files out of the test set. However, I don't believe this could have been the case with my original test set (consisting of about 150,000 photos).
I am happy to do any kind of debugging I can that would be helpful.
I note that the sqlite default cache is about 2MB; I don't know whether a larger cache would be helpful. I suspect doing fewer syscalls would help some, but the underlying problem of numerous queries would still exist.
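If it's useful, the page size and cache size can be checked, and the cache enlarged for a single session, from the sqlite3 shell against .git/annex/cidsdb/db. This is just a diagnostic sketch on my part, not something git-annex does itself:

    PRAGMA page_size;           -- 4096 here, matching the 4k preads seen in strace
    PRAGMA cache_size;          -- the default works out to roughly 2MB
    PRAGMA cache_size = -65536; -- request ~64MB of page cache for this connection only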
@jgoerzen if you want to take a look at the sql, see Database/ContentIdentifier.hs. getContentIdentifierKeys is the query that it's running on each file.
I'm not really sure right now if the persistent schema in there actually creates an index that is used for that query. persistent's documentation of indexes is lacking, and I may have misunderstood that uniqueness constraints result in indexes being created.
Dumping the database shows this, which really doesn't seem to have an index after all:
May need some raw sql to add it, like:
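    -- hypothetical; the table and column names would need to match the actual dump above
    CREATE INDEX IF NOT EXISTS content_identifiers_cid_remote
        ON content_identifiers (cid, remote);

(That's only a guess at the right statement; it could be run against .git/annex/cidsdb/db with the sqlite3 command line tool.)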
Also, I re-ran the 150000 file sync benchmark with getContentIdentifierKeys disabled, and it took 29:56.78, so 25% faster.
That gives me the idea for an optimisation -- it could check if the database is empty at start and if so, avoid calling that at all. (It also maintains a map in memory which will still allow it to detect duplicate files.) Speeding up initial imports of a lot of files, but not later imports of a lot of files, is kind of a cop out, but..
What's the last thing output before it enters this state? I know it's at some point after the last "import filename... ok" but I haven't a clue what it could be.
For what it's worth, I straced the importing stage, and there are no reads from the database FDs at all. Presumably sqlite realizes it's empty and doesn't need to keep re-reading to respond to the queries done in that stage. Each file imported took 113 syscalls (these are small files, so only 1 syscall is needed to read the content).
I also attached a strace immediately after it finished the importing stage, and straced all the way to the end (importing 1000 files). There was not a single pread64. FD #14 was already closed at that point, and the cidsdb was never accessed again.
Which matches my reading of the code, that it never accesses the cidsdb after that point. Which makes what you're seeing very strange indeed.
Something is missing to let me reproduce this. Can you reproduce it with the script I posted in comment #1?
Also straced an entire import of 1000 files, and this is all the preads from fd 14 (there was a similar number on 3 other fds):
So I ran your script on the same dataset that I was using.
I was rather surprised that it rocketed through in 15 minutes. I then ran a subsequent git annex sync, which after a few minutes started showing the same pread64 behavior as before.
Perhaps your improvements have been helping here?
I went back to my original full script and thought: let's time each git-annex call.
In this script:
First sync (this is before the initremote so it is doing pretty much nothing):
real 0m0.075s user 0m0.027s sys 0m0.045s
Second sync (this is after initremote/enableremote):
real 15m1.417s user 7m10.770s sys 5m59.405s
[ that correlates closely with how long it took to run your script ]
Third sync (this is after the clone/init/adjust):
real 0m6.606s user 0m1.908s sys 0m4.632s
Fourth sync (git annex sync $DRIVE): real 0m1.066s user 0m0.634s sys 0m0.647s
Fifth sync (after the group/wanted): real 113m32.988s user 55m21.012s sys 58m40.752s
Sixth sync: real 0m1.180s user 0m0.819s sys 0m0.514s
Seventh sync (git annex sync --content): real 112m29.594s user 55m13.650s sys 57m52.259s
Eighth sync (back at the original repo): real 110m3.099s user 51m32.468s sys 58m53.885s
That is a significant move of the goal posts. I don't know if that even involves importtree being slow anymore.
You've apparently got a lot of stuff going on in the directories used by that script, NTFS, acrypt, some kind of removable drive, ...?
I tried to adapt the script to a simple test case that I can run, but had to leave all that stuff out, since I don't know anything about it. The script I ran is below.
I ran it with 10000 files, and the timings of the syncs were:
So I have not reproduced the (new) problem; syncs #5 and #8 are not slow.
Your 5th sync taking a long time is particularly strange, because the only difference between it and the 4th sync is that git annex wanted is configured. But git-annex sync does not do anything with that configuration unless it is syncing content.
That made me think you might have annex.synccontent set in gitconfig globally. It seems to me that would also make the 3rd sync slow, but I suppose you could have a complex git config that sets it for some repos but not other ones, or something like that.
Otherwise I'm out of ideas about how to reproduce your problem.
I'll reiterate that it would be helpful to see the last output of the sync command before it does whatever is being slow.
My apologies for forgetting that.
So initially I thought it was after the last "import source... ok". I'm no longer convinced that it happens after the last one. There is clearly a phase before this starts, and a separate phase exhibiting this behavior.
The last output from one of these runs was:
And indeed there were about 2 dozen files out of that source set that it didn't have the permission to read.
I'm sorry it looked like an expansion of scope. This is not really on NTFS; it's still on ZFS as it has been from the beginning. I had been testing with one of the repos on NTFS previously, but in an effort to rule that out, I've been 100% on ZFS since before reporting this issue. "acrypt" is just the name of my ZFS pool; the name refers to the fact that it sits atop LUKS (I don't believe this to be relevant, since the CPU time is in git-annex, not kcryptd).
There's only one import in the sync, and your output shows it completed (with error).
The only other phase of sync that could be run after that and take a lot of time is content syncing. You would have to have annex.synccontent set somewhere for sync to do that. Do you?
It would make a lot of sense for --content syncing to be what remains slow. That has to scan over all the files and when it decides that it does not need to copy the content anywhere, that's a tight loop with no output.
In my repo with 10000 files that was set up by the latest test case, git-annex sync takes 13 seconds, and with --content it takes 61 seconds.
I optimised a numcopies/mincopies lookup away, and that got it down to 28 seconds.
The cidsdb does not get accessed by the --content scan in my testing, although there may be other situations where it does.
I did try making the permissions of a file not allow it to be read, and so reproduced the "Failed to import some files from source." That didn't change the sync time appreciably (with or without content syncing).
Another expensive thing sync has to do, due to the adjusted branch being checked out, is Command.Sync.updateBranches. In my repo with 10000 files, when not using --content, that is responsible for most of the run time of a sync when there is nothing new to import or pull.
That happens because git-annex adjust --unlock-present was used. So sync needs to update the branch if there have been any changes to the content that is present. Currently, when in such an adjusted branch, it re-builds the branch every time it's run.
It would be possible to avoid that when there have been no changes to the content that is present, but it would need to deal with eg a previous sync command having transferred some content before it was interrupted. So it would need some kind of flag file to be written when content changes.
(With the master branch checked out instead, git-annex sync --content is now as fast as git-annex find --not --in source.)
I've fixed that; now git-annex sync avoids updating the adjusted branch when there have been no changes to available content.
I ran the second test case with 150000 files, and here's how long the syncs took:
Sync 2 took longer than your results in comment #11, but consistent with my laptop being slower. And I think 33 minutes to import 150k files is fine.
Still not seeing sync 5 take as long as it did for you. Nowhere in the same ballpark. 13 minutes seems ok for a sync --content that has to scan 150000 files.
The 7th sync is seeming too slow to me. For you it took equally long as the 5th sync, and both are --content syncs. So maybe I'm seeing the same problem but only on the 7th for some reason?
For me it seemed to take a long time after outputting "list source ok". At that point strace showed only a lot of futex(). And the cpu was pegged. And it had the cidsdb open. Hmmm.. This is feeling a bit like the problem you originally reported.
Interrupted the 7th sync and ran again...
The "list source" takes more than 15 minutes. It's bottlenecked on checking git ignores. Bottleneck that I didn't notice with a smaller number of files. Fixed that by making sure the export db was populated, which it usually is, but not in the 7th sync's situation. Now "list source" completes in less than 2 minutes.
And.. after that, it was back to the tight futex() loop.. And this time I had instrumented the cidsdb, and it was importKeys calling getContentIdentifierKeys.
Here's the kicker: It's only running getContentIdentifierKeys 15 times per second. So that will take 166 minutes for all 150000 files.
Each call to getContentIdentifierKeys is taking 0.05 seconds. So, this bug is back to the original problem of being bottlenecked on the cidsdb. And it is smelling like a lack of indexes. Yay!
The sqlite schema is:
According to the sqlite docs:
But, that index is not used for a cid query:
Aha, it does use the UNIQUE index for queries going the other way:
Even a query using the "remote" column seems to use the UNIQUE index:
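To make the difference concrete, here is roughly what EXPLAIN QUERY PLAN shows for the two kinds of lookups. The table and index names below are what I'd expect persistent to generate, so treat the exact spellings as approximate:

    -- the lookup getContentIdentifierKeys does: no usable index, so a full table scan
    EXPLAIN QUERY PLAN
    SELECT key FROM content_identifiers WHERE cid = ? AND remote = ?;
    -- SCAN content_identifiers

    -- the reverse lookup by key is satisfied by the UNIQUE index
    EXPLAIN QUERY PLAN
    SELECT cid FROM content_identifiers WHERE key = ?;
    -- SEARCH content_identifiers USING INDEX <the UNIQUE index> (key=?)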
Here's a thread on mastodon about this, with some ideas about why sqlite3 behaves this way: https://octodon.social/@joeyh/110515161416516029. Also, see this commit where I added a second index to a database with the same columns as an existing index, only in a different order, and it did speed up a query: ca2a527e93ca22548983a7285fc6e0a892ca44a4. Oh, and there was a uniqueness constraint with the cid first, but it was broken and got removed in c6e693b25de49d4d3b2fedb49ffb42f04f5fd544. So probably it got slower at that point (Dec 2020).
So probably most other tables used by git-annex do have usable indexes, for most queries. But not this one. Gonna need to manually add an index.
(Or possibly redo the schema to have UNIQUE ("key","cid","remote"); it may be that sqlite only creates an index for up to 2 columns or something like that.)
After modifying the database this way, I re-ran the slow sync and it's now able to run getContentIdentifierKeys 2858 times a second. What an improvement!
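For comparison, with an index that leads with cid in place, that same query plan changes from a scan to a search (again schematic output, using the hypothetical index name from above):

    EXPLAIN QUERY PLAN
    SELECT key FROM content_identifiers WHERE cid = ? AND remote = ?;
    -- SEARCH content_identifiers USING INDEX content_identifiers_cid_remote (cid=? AND remote=?)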
Here's the final timing from the second test case with 150000 files after all of the above optimisation:
(git-annex get from an importtree remote was also sped up a lot; it makes the same cid queries as sync #7)
I'm satisfied with this, considering this bug done, unless it's somehow still being slow for you.
WOW! That was it!
I ran the index-creation command on my cidsdb/db, then reran git-annex sync - the very long one. Down to 1m48s! A HUGE improvement!
Thanks Joey!