Avoid lengthy "Scanning for unlocked files ..."

For a heavy tree it takes indeed quite a while. And it is just wasteful if not just tree, but the entire repository does not have any unlocked files which is majority of the cases. So I wondered if this operation could be avoided.

E.g. following idea came to mind: git-annex could add some flag/beacon file (e.g. has-no-unlocked) within its git-annex branch which it would set upon initializing the git-annex branch, and which it would remove as soon as any path gets unlocked. Then upon initial initialization of the clone'd worktree it could avoid this "Scanning" entirely if according to git-annex branch repository is still known to now carry any unlocked file. It would not provide an ultimate solution which would be "tree specific" but at least it would provide remedy for majority of the cases (in our case) where there is no unlocked files.

I think I've improved this all that it can reasonably be sped up, so done. --Joey

RSS Atom

comment 1

Hmm, this does a git ls-tree -r and parses it looking for files that are not symlinks. Each such file has to pass through cat-file --batch to see if it is unlocked.

So I think this should be reasonably fast unless the repo has a lot of non-annexed files. Does your repo, or is it simply so large that ls-tree -r is very expensive?

Benchmarking here, a repo with 100,000 annexed files (all locked): the git ls-tree ran in 3 seconds; the init took 17 seconds overall, with most time needed to set up the git-annex branch etc.

One speedup I notice that scanUnlockedFiles uses catKey, which first checks catObjectMetaData to determine if the file is so large that it's clearly not a pointer file (and so avoid catting such a large file). If the file size was known, it could instead use catKey', which would double the speed of processing non-annexed files, as well as actual locked files. To get the size, git ls-tree has a --long switch. (git still has to do some work to get the size since tree objects don't contain it, but it should be much less expensive than a round trip through catObjectMetaData. In my benchmark it doubled the git ls-tree time to add --long.) Implementing this will need adding support for parsing the --long output so I've not done it quite yet.

Comment by joey — Tue Mar 16 18:29:26 2021

Remove comment

install clean/smudge filter only when needed

On a related note, maybe some way to not install git-annex's clean/smudge filter when all annexed files are locked could be found? Maybe, a flag could be stored on the git-annex branch to indicate that a repo is locked-files-only? My repos often are. Or maybe a flag in .gitattributes could indicate this.

Comment by Ilya_Shlyakhter — Thu Mar 18 15:47:02 2021

Remove comment

comment 3

The idea of putting a flag in the git-annex branch means that every time the smudge filter gets run, it would have to check the git-annex branch to see if it needs to unset that flag. I do not want to make the smudge filter even slower than it is, and trading off a one-time (per repository) cost with an ongoing cost does not seem good.

The flag would also be a source of problems in complex situations. Eg, someone could be operating on the repo w/o git-annex installed, merge in a branch that contains unlocked files, and push the result. Leading to the flag in the git-annex branch being left incorrectly set, and so causing breakage to someone who clones the repo later on.

Also, a transient unlock/modify/add of a file leaves no unlocked files in the worktree, but the unlock would set the flag, so any performance benefit would need all users of the repo to avoid such workflows.

So viable approaches seem to be to optimise the scanning as much as possible, and possibly to add some config to git-annex that avoids doing the scanning, which could be used if you happen to know your repo doesn't contain unlocked files.

Comment by joey — Tue Mar 23 15:54:18 2021

Remove comment

comment 4

That catKey optimisation actually only helps if the tree has a lot of files that are not annex symlinks, but are either unlocked or not annexed. If most of the files are locked, that would actually make the scan somewhere around twice as slow as it currently is. So not a worthwhile optimisation.

Update: Now that the scan also scans for locked files to make the associated files include information about them, the catKey optimisation did make sense. Unfortunately, that does mean this scan got a little bit slower still, since it has to use git ls-tree --long.

I don't see much else there that could be optimised. Possibly the ls-tree parser could be made faster but it's already using attoparsec so unlikely to be many gains.

Comment by joey — Tue Mar 23 17:02:49 2021

Remove comment

comment 5

If we zoom out a bit, and also consider Ilya's desire to skip installing smudge/clean filters, maybe what you guys want is a way to tell git-annex that a particular repo does not use unlocked files, so git-annex can avoid expensive stuff needed to support them.

So, I've added an annex.supportunlocked config that can be set to false before running git-annex init, and it will disable the smudge filters and skip this expensive scan.

(The config will also let you measure the actual time this scan takes, which I'm still curious about, because I've not seen it being a majority of the init time, even in a large repository. Seems likely to me that init is doing other expensive things right after this scan, in particular setting up the git-annex branch, and that it may be those are what was really seeming slow to you.)

The config as implemented does not prevent later adding unlocked files, and if you do that, git-annex will get confused and not make the content accessible until you change the config back and re-run git-annex init, or you lock the file.

While the config could grow in scope to preventing adding unlocked files, it seems like probably a bad idea (and a whole lot of work). For one thing git add with a largefiles configuration would then need to prevent adding the file unlocked.. but it can't add it locked, and the smudge filter can't prevent git from adding the file -- so it would need to ignore the largefiles configuration and add the file to git. Which could be pretty suprising.

And really there's no way for git-annex to enforce that a repo doesn't contain unlocked files. Even if this became a repo-global config (probably not a good idea), someone could merge new unlocked files from a repo that does, using just git. Or add them using an old version of git-annex. (Or just convert symlinks to unlocked files manually, it's not hard..) If setting this config, you're telling git-annex you don't expect there to be any unlocked file, and are ok with some minor breakage if there are some.

Combining the config with git-annex adjust --lock would somewhat avoid such problems, although it's still possible to add files unlocked when in such a branch.

Comment by joey — Tue Mar 23 17:10:22 2021

Remove comment

comment 6

I've added an annex.supportunlocked config that can be set

you are talking about a regular git config configuration variable? then FWIW it is of no use for the original use case of a user cloning an existing git-annex repo since I would have no clue if that repository contains any unlocked files or not. It could indeed come handy for timing when desired to investigate.

I am yet to appreciate how "repo-global config" situation would be different from handling any other git-annex manipulation within git-annex branch (e.g. of annex config there) where we already rely on git-annex to "do the right thing" (deciding on taking keys availability information based on timestamping etc).

Comment by yarikoptic — Tue Mar 23 18:43:30 2021

Remove comment

comment 6

Found a way to speed up git-annex init's setup of the git-annex branch when run in a clone of an existing repo. In a 100,000 file repo, it improved from 17s to 10s. I have a feeling that might have been what was really making it seem slow to you.

Comment by joey — Tue Mar 23 19:15:23 2021

Remove comment

annex.supportunlocked

Thanks for adding annex.supportunlocked!

It seems to me that this is a repo property that you'd want to be consistent across clones, i.e. a candidate for a git-annex-config setting.

"there's no way for git-annex to enforce that a repo doesn't contain unlocked files" -- maybe, have git-annex-fsck check that there are none if annex.supportunlocked is set?

"git add with a largefiles configuration... would need to ignore the largefiles configuration and add the file to git" -- which the annex.gitaddtoannex config setting already does, so maybe just document that annex.supportunlocked=false implies annex.gitaddtoannex=false? Could you then uninstall (or skip installing) the smudge/clean filters?

"someone could merge new unlocked files from a repo that does, using just git" -- they can't do that inadvertently if the repo has a git-annex-config setting disallowing the adding of unlocked files. If they manually override the repo setting, or "convert symlinks to unlocked files manually", they're doing something odd you can't plan for anyway.

Comment by Ilya_Shlyakhter — Tue Mar 23 19:30:39 2021

Remove comment

comment 9

Just to be clear, annex.supportunlocked=false does prevent installing smudge/clean filters.

Comment by joey — Tue Mar 23 19:48:36 2021

Remove comment

annex.supportunlocked=false

"annex.supportunlocked=false does prevent installing smudge/clean filters" -- then, do I understand correctly that, if it were a git-annex-config setting and annex.supportunlocked=false implied annex.gitaddtoannex=false, this would prevent inadvertently adding unlocked files? Then, assuming "typical" repo usage, it would also prevent inadvertently merging unlocked files, since you'd have to add them first.

Comment by Ilya_Shlyakhter — Tue Mar 23 20:02:16 2021

Remove comment

comment 11

This got slower again since it now also has to scan locked files as well as unlocked. And so disabling smudge filters doesn't avoid it either.

It should be possible to speed this up by streaming the ls-tree through cat-file, like is done in CmdLine.Seek (catObjectStreamLsTree). Speedup probably in the order of 2-3x.

Comment by joey — Mon May 31 14:39:17 2021

Remove comment

comment 12

Implemented streaming through git. In a repo with 100000 unlocked files, version 8.20210429 took 46 seconds, now reduced to 36 seconds.

When the files are locked, of course the old version was faster due to being able to skip all symlinks, 2 seconds. The new version takes slightly less time than it does for unlocked files, 35 seconds.

Now the git query and processing is only a few seconds of the total run time, writing information about all the files to sqlite is most of the rest, and may also be possible to speed up.

Comment by joey — Mon May 31 16:30:11 2021

Remove comment

comment 13

There was an unnecessary check of the current time per sql insert, removing that sped it up by 3 seconds in my benchmark.

Also tried increasing the number of inserts per sqlite transaction from 1k to 10k. Memory use increased to 90 mb, but no measurable speed increase.

I don't see much else that can speed up the sqlite part, without going deep into the weeds of populating sqlite databases without using sql, or using multi-value inserts (like described here. Both would prevent using persistent to abstract sql away, and would only be usable in this case, not speeding up git-annex generally, so not too enthused.

Comment by joey — Mon May 31 18:40:59 2021

Remove comment

startup scan for files

"it now also has to scan locked files as well as unlocked" -- can you remind, what necessitated this change?

Comment by Ilya_Shlyakhter — Mon May 31 20:50:36 2021

Remove comment

comment 15

?indeterminite preferred content state for duplicated file

Comment by joey — Mon May 31 21:49:35 2021

Remove comment

keys-to-paths db

The lengthy scan would only happen once -- when the worktree is first checked out -- and would be incremental from then on, right? But it would slow things down every time a new checkout of a large repo happens? Maybe the scan could be done lazily, invoked when the results are first needed? Also, if you just need to know which keys are used more than once, maybe a Bloom filter of the keys used in the worktree would suffice, instead of a full keys-to-paths SQL db?

Comment by Ilya_Shlyakhter — Mon May 31 23:15:21 2021

Remove comment

comment 17

The scan could be done lazily, but there are situations that use the database where unexpectedly taking a much longer time than usual would be a real problem. For example "git add".

The bloom filter idea does not work.

Comment by joey — Fri Jun 4 17:37:30 2021

Remove comment

deferring the keys-to-files scan

The scan could be done lazily, but there are situations that use the database where unexpectedly taking a much longer time than usual would be a real problem

For unlocked files, certainly. When annex.supportunlocked=false, it sounded like the only situation that uses the database is drop --auto, or a matching expression with --includesamecontent/--excludesamecontent? (And maybe ?git-annex whereused). Personally I would prefer an unexpected delay in these rare cases, to a delay in the more common case of checking out or switching branches.

Comment by Ilya_Shlyakhter — Mon Jun 7 16:11:00 2021

Remove comment

comment 19

It can't know if there are unlocked files without doing this scan.

Except for when annex.supportunlocked=false, but then that config option would have the side effect of making git-annex slower at some point after init, with the situations where it does hard to enumerate and probably growing. This would be a hard behavior to explain to the user.

And there are numerous other points than the ones you listed where git-annex accesses the keys db and would trigger a deferred scan. Eg, anytime it might need to update a pointer file. Eg, when git annex get is run. Avoiding using the keys db when annex.supportunlocked=false in all such cases in order to avoid the scan would be effectively the same complexity as continuing to support v5 repos, which I've NAKed before.

Comment by joey — Mon Jun 7 17:07:26 2021

Remove comment

comment 20

Turns out git-annex init was running both scanAnnexedFiles and reconcileStaged, which after recent changes to the latter, both do close to the same scan when run in a fresh clone. So double work!

Benchmarking with 100,000 files, git-annex init took 88 seconds.
Fixed not to use reconcileStaged it took 37 seconds.

(Keeping reconcileStaged and removing scanAnnexedFiles it took 47 seconds. That makes sense; reconcileStaged is an incremental updater and is not able to use SQL as efficiently as scanAnnexedFiles.)

Also the git clone of that 100,000 file repo itself, from another repo on the same SSD, takes 9 seconds. git-annex init taking 4x as long as a fast local git clone to do a scan is not bad.

This is EOT for me, but I will accept pathes if someone wants to make git-annex faster.

(Also see ?display when reconcileStaged is taking a long time)

Comment by joey — Mon Jun 7 19:22:03 2021

Remove comment

comment 21

Note that the bug report the previous comment links to is not actually about the overhead of this scan.