For a heavy tree this scan indeed takes quite a while. And it is simply wasteful when not just the tree but the entire repository contains no unlocked files, which is the majority of cases for us. So I wondered if this operation could be avoided.
E.g. the following idea came to mind: git-annex could add some flag/beacon file (e.g. `has-no-unlocked`) within its `git-annex` branch, which it would set upon initializing the `git-annex` branch, and which it would remove as soon as any path gets unlocked. Then, upon initial initialization of the cloned worktree, it could avoid this "Scanning" step entirely if, according to the `git-annex` branch, the repository is still known to not carry any unlocked file. It would not provide an ultimate solution, which would be "tree specific", but at least it would provide a remedy for the majority of the cases (as in ours) where there are no unlocked files.
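A check of such a beacon would be cheap; a minimal sketch of what it might look like, assuming the hypothetical `has-no-unlocked` file name from above:

```sh
# Hypothetical check: if the proposed beacon file exists in the
# git-annex branch, the scan could be skipped entirely.
# ("has-no-unlocked" is the name suggested above, nothing real.)
if git cat-file -e git-annex:has-no-unlocked 2>/dev/null; then
    echo "repository known to have no unlocked files; skipping scan"
fi
```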
I think I've now improved this as much as it can reasonably be sped up, so done. --Joey
Hmm, this does a `git ls-tree -r` and parses it looking for files that are not symlinks. Each such file has to pass through `cat-file --batch` to see if it is unlocked.

So I think this should be reasonably fast unless the repo has a lot of non-annexed files. Does your repo, or is it simply so large that `git ls-tree -r` is very expensive?
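For a sense of the data flow, here is a rough shell approximation of that scan (git-annex does this internally via its own git plumbing interface; pointer files begin with an `/annex/objects/` path):

```sh
# List all non-symlink files in HEAD (symlinks have mode 120000),
# stream their blobs through one cat-file --batch process, and count
# blobs whose content starts with a git-annex pointer path.
git ls-tree -r HEAD \
    | awk '$1 != "120000" { print $3 }' \
    | git cat-file --batch \
    | grep -ac '^/annex/objects/'
```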
Benchmarking here, a repo with 100,000 annexed files (all locked): the git ls-tree ran in 3 seconds; the init took 17 seconds overall, with most time needed to set up the git-annex branch etc.
One speedup I notice: scanUnlockedFiles uses `catKey`, which first checks `catObjectMetaData` to determine if the file is so large that it's clearly not a pointer file (and so avoid catting such a large file). If the file size were known, it could instead use `catKey'`, which would double the speed of processing non-annexed files, as well as actual locked files. To get the size, `git ls-tree` has a `--long` switch. (git still has to do some work to get the size, since tree objects don't contain it, but it should be much less expensive than a round trip through `catObjectMetaData`. In my benchmark, adding `--long` doubled the `git ls-tree` time.) Implementing this will need adding support for parsing the `--long` output, so I've not done it quite yet.
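For reference, `--long` inserts the object size between the object name and the path; sample output shape (hashes abbreviated here):

```sh
$ git ls-tree -r --long HEAD
120000 blob 59bd79dbbbd8...     178	file1.dat
100644 blob 6c6fb106f3a1...      55	README
```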
The idea of putting a flag in the git-annex branch means that every time the smudge filter gets run, it would have to check the git-annex branch to see if it needs to unset that flag. I do not want to make the smudge filter even slower than it is, and trading a one-time (per-repository) cost for an ongoing cost does not seem good.
The flag would also be a source of problems in complex situations. Eg, someone could be operating on the repo w/o git-annex installed, merge in a branch that contains unlocked files, and push the result, leading to the flag in the git-annex branch being left incorrectly set, and so causing breakage for someone who clones the repo later on.
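Concretely, that scenario needs nothing more than stock git (branch names hypothetical):

```sh
# On a machine without git-annex installed: the merge brings in
# unlocked files, but nothing clears the hypothetical beacon in the
# git-annex branch before everything is pushed back out.
git merge topic-with-unlocked-files
git push origin master git-annex
```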
Also, a transient unlock/modify/add of a file leaves no unlocked files in the worktree, but the unlock would clear the flag, so any performance benefit would need all users of the repo to avoid such workflows.
So viable approaches seem to be to optimise the scanning as much as possible, and possibly to add some config to git-annex that avoids doing the scanning, which could be used if you happen to know your repo doesn't contain unlocked files.
That catKey optimisation actually only helps if the tree has a lot of files that are not annex symlinks, but are either unlocked or not annexed. If most of the files are locked, that would actually make the scan somewhere around twice as slow as it currently is. So not a worthwhile optimisation.
Update: Now that the scan also scans for locked files to make the associated files include information about them, the catKey optimisation did make sense. Unfortunately, that does mean this scan got a little bit slower still, since it has to use git ls-tree --long.
I don't see much else there that could be optimised. Possibly the ls-tree parser could be made faster, but it's already using attoparsec, so there are unlikely to be many gains.
If we zoom out a bit, and also consider Ilya's desire to skip installing smudge/clean filters, maybe what you guys want is a way to tell git-annex that a particular repo does not use unlocked files, so git-annex can avoid expensive stuff needed to support them.
So, I've added an `annex.supportunlocked` config that can be set to false before running `git-annex init`, and it will disable the smudge filters and skip this expensive scan.

(The config will also let you measure the actual time this scan takes, which I'm still curious about, because I've not seen it being a majority of the init time, even in a large repository. It seems likely to me that init is doing other expensive things right after this scan, in particular setting up the git-annex branch, and that those may be what was really seeming slow to you.)
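Usage in a fresh clone looks like this (paths are placeholders):

```sh
git clone /path/to/repo.git mywork
cd mywork
git config annex.supportunlocked false
git annex init   # skips the smudge filters and the unlocked file scan
```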
The config as implemented does not prevent later adding unlocked files, and if you do that, git-annex will get confused and not make the content accessible until you change the config back and re-run git-annex init, or you lock the file.
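Recovery in that case is as just described: either lock the offending file, or re-enable support and re-run init:

```sh
git annex lock path/to/file
# or:
git config annex.supportunlocked true
git annex init
```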
While the config could grow in scope to preventing the addition of unlocked files, that seems like probably a bad idea (and a whole lot of work). For one thing, `git add` with a largefiles configuration would then need to prevent adding the file unlocked.. but it can't add it locked, and the smudge filter can't prevent git from adding the file -- so it would need to ignore the largefiles configuration and add the file to git. Which could be pretty surprising.

And really, there's no way for git-annex to enforce that a repo doesn't contain unlocked files. Even if this became a repo-global config (probably not a good idea), someone could merge new unlocked files from a repo that does contain them, using just git. Or add them using an old version of git-annex. (Or just convert symlinks to unlocked files manually, it's not hard..) If you set this config, you're telling git-annex you don't expect there to be any unlocked files, and are ok with some minor breakage if there are some.
Combining the config with `git-annex adjust --lock` would somewhat avoid such problems, although it's still possible to add files unlocked when in such a branch.

You are talking about a regular `git config` configuration variable? Then FWIW it is of no use for the original use case of a user cloning an existing git-annex repo, since I would have no clue whether that repository contains any unlocked files or not. It could indeed come in handy for timing, when one wants to investigate.

I am yet to appreciate how the "repo-global config" situation would be different from handling any other git-annex manipulation within the `git-annex` branch (e.g. of annex config there), where we already rely on git-annex to "do the right thing" (deciding on key availability information based on timestamping etc).

Found a way to speed up git-annex init's setup of the git-annex branch when run in a clone of an existing repo. In a 100,000 file repo, it improved from 17s to 10s. I have a feeling that might have been what was really making it seem slow to you.
Thanks for adding `annex.supportunlocked`!

It seems to me that this is a repo property that you'd want to be consistent across clones, i.e. a candidate for a `git-annex-config` setting.

"there's no way for git-annex to enforce that a repo doesn't contain unlocked files" -- maybe have `git-annex-fsck` check that there are none if `annex.supportunlocked` is set?

"git add with a largefiles configuration... would need to ignore the largefiles configuration and add the file to git" -- which the `annex.gitaddtoannex` config setting already does, so maybe just document that `annex.supportunlocked=false` implies `annex.gitaddtoannex=false`? Could you then uninstall (or skip installing) the smudge/clean filters?

"someone could merge new unlocked files from a repo that does, using just git" -- they can't do that inadvertently if the repo has a `git-annex-config` setting disallowing the adding of unlocked files. If they manually override the repo setting, or "convert symlinks to unlocked files manually", they're doing something odd you can't plan for anyway.

Just to be clear, annex.supportunlocked=false does prevent installing smudge/clean filters.
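One way to verify: compare the filter configuration a normal init installs with what is left out under `annex.supportunlocked=false` (the values shown are what a normal init sets and may vary by version):

```sh
# Normally installed by git-annex init; absent when
# annex.supportunlocked=false was set before init:
git config filter.annex.smudge   # git-annex smudge -- %f
git config filter.annex.clean    # git-annex smudge --clean -- %f
```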
"annex.supportunlocked=false does prevent installing smudge/clean filters" -- then, do I understand correctly that, if it were a
git-annex-config
setting andannex.supportunlocked=false
impliedannex.gitaddtoannex=false
, this would prevent inadvertently adding unlocked files? Then, assuming "typical" repo usage, it would also prevent inadvertently merging unlocked files, since you'd have to add them first.This got slower again since it now also has to scan locked files as well as unlocked. And so disabling smudge filters doesn't avoid it either.
It should be possible to speed this up by streaming the ls-tree through cat-file, like is done in `CmdLine.Seek` (`catObjectStreamLsTree`). Speedup probably on the order of 2-3x.
Implemented streaming through git. In a repo with 100000 unlocked files, version 8.20210429 took 46 seconds, now reduced to 36 seconds.
When the files are locked, of course the old version was faster due to being able to skip all symlinks, 2 seconds. The new version takes slightly less time than it does for unlocked files, 35 seconds.
Now the git query and processing is only a few seconds of the total run time, writing information about all the files to sqlite is most of the rest, and may also be possible to speed up.
There was an unnecessary check of the current time per sql insert, removing that sped it up by 3 seconds in my benchmark.
Also tried increasing the number of inserts per sqlite transaction from 1k to 10k. Memory use increased to 90 MB, but there was no measurable speed increase.
I don't see much else that can speed up the sqlite part without going deep into the weeds of populating sqlite databases without using SQL, or using multi-value inserts (like described here). Both would prevent using persistent to abstract SQL away, and would only be usable in this case, not speeding up git-annex generally, so I'm not too enthused.
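A multi-value insert batches many rows into one SQL statement; an illustrative `sqlite3` invocation with a made-up schema (not git-annex's actual keys db layout):

```sh
# Hypothetical table and columns, purely to illustrate the technique:
sqlite3 keys.db "INSERT INTO associated (key, file) VALUES
    ('SHA256E-s10--aaa', 'a.dat'),
    ('SHA256E-s20--bbb', 'b.dat');"
```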
The lengthy scan would only happen once -- when the worktree is first checked out -- and would be incremental from then on, right? But it would slow things down every time a new checkout of a large repo happens? Maybe the scan could be done lazily, invoked when the results are first needed? Also, if you just need to know which keys are used more than once, maybe a Bloom filter of the keys used in the worktree would suffice, instead of a full keys-to-paths SQL db?
The scan could be done lazily, but there are situations that use the database where unexpectedly taking a much longer time than usual would be a real problem. For example "git add".
The bloom filter idea does not work.
For unlocked files, certainly. When `annex.supportunlocked=false`, it sounded like the only situations that use the database are `drop --auto`, or a matching expression with `--includesamecontent`/`--excludesamecontent`? (And maybe `git-annex whereused`.) Personally I would prefer an unexpected delay in these rare cases to a delay in the more common case of checking out or switching branches.

It can't know if there are unlocked files without doing this scan.
Except for when annex.supportunlocked=false, but then that config option would have the side effect of making git-annex slower at some point after init, with the situations where it does so being hard to enumerate and probably growing. This would be a hard behavior to explain to the user.
And there are numerous other points than the ones you listed where git-annex accesses the keys db and would trigger a deferred scan. Eg, anytime it might need to update a pointer file. Eg, when `git annex get` is run. Avoiding using the keys db when annex.supportunlocked=false in all such cases, in order to avoid the scan, would be effectively the same complexity as continuing to support v5 repos, which I've NAKed before.
Benchmarking with 100,000 files, git-annex init took 88 seconds.
After fixing it not to use reconcileStaged, it took 37 seconds.
(Keeping reconcileStaged and removing scanAnnexedFiles it took 47 seconds. That makes sense; reconcileStaged is an incremental updater and is not able to use SQL as efficiently as scanAnnexedFiles.)
Also the git clone of that 100,000 file repo itself, from another repo on the same SSD, takes 9 seconds. git-annex init taking 4x as long as a fast local git clone to do a scan is not bad.
This is EOT for me, but I will accept patches if someone wants to make git-annex faster.
(Also see "display when reconcileStaged is taking a long time".)
Note that the bug report the previous comment links to is not actually about the overhead of this scan.