I am running addurl (via datalad addurls) on a git/annex repo with a very large number of files.
Here is what I see in htop (git annex is 8.20200908-gcfc74c2f4, I believe ;)):
CPU% MEM% TIME+ Command
0.0 0.0 0:12.62 │ └─ zsh
2.0 3.1 0:59.03 │ ├─ python3 datalad-nda/scripts/datalad-nda --pdb add2datalad -i /proc/self/fd/15 -d testds-fast-full2 --fast
0.0 0.0 0:00.09 │ │ ├─ git --library-path /home/yoh/.tmp/ga-RIFzW89/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-RIFzW89/git-annex.linux/shimmed/git/git annex addurl --fast --with-files --json --json-error-messages --batch
4.6 0.1 0:40.33 │ │ │ └─ git-annex --library-path /home/yoh/.tmp/ga-RIFzW89/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-RIFzW89/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
45.2 0.2 10:27.06 │ │ │ ├─ git --library-path /home/yoh/.tmp/ga-RIFzW89/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-RIFzW89/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. check-ignore -z --stdin --verbose --non-matching
3.3 0.0 0:27.42 │ │ │ ├─ python3 /mnt/scrap/tmp/abcd/datalad/venvs/dev3/bin/git-annex-remote-datalad
0.0 0.0 0:00.00 │ │ │ ├─ git --library-path /home/yoh/.tmp/ga-RIFzW89/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-RIFzW89/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch-check=%(objectname) %(objecttype) %(objectsize)
2.0 0.1 0:24.46 │ │ │ ├─ git --library-path /home/yoh/.tmp/ga-RIFzW89/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-RIFzW89/git-annex.linux/shimmed/git/git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
0.0 0.1 0:02.17 │ │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-RIFzW89/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-RIFzW89/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
0.0 0.1 0:00.00 │ │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-RIFzW89/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-RIFzW89/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
0.0 0.1 0:04.55 │ │ │ ├─ git-annex --library-path /home/yoh/.tmp/ga-RIFzW89/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-RIFzW89/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
0.0 0.1 0:02.35 │ │ │ └─ git-annex --library-path /home/yoh/.tmp/ga-RIFzW89/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-RIFzW89/git-annex.linux/shimmed/git-annex/git-annex addurl --fast --with-files --json --json-error-messages --batch
0.0 0.0 0:00.08 │ │ ├─ git --library-path /home/yoh/.tmp/ga-RIFzW89/git-annex.linux//lib/x86_64-linux-gnu: /home/yoh/.tmp/ga-RIFzW89/git-annex.linux/shimmed/git/git annex addurl --fast --with-files --json --json-error-messages --batch
So git check-ignore --batch is the busiest process in terms of CPU (30-60%).
I do not mind it being CPU heavy, but I am afraid that its use is not async, so while it is "processing", annex waits before proceeding to the next entry.
If that is not the case, please just close this.
If annex does wait, then since it is a --batch operation, I wonder whether the invocation of git check-ignore could somehow be delayed and done in a single shot at the end of the batched process, or something like that? Or maybe it relates to the batch_async WiP TODO, i.e. internally annex would make it an async call.
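To make the concern concrete, here is a minimal Python sketch of a caller talking to one long-running `git check-ignore -z --stdin --verbose --non-matching` process (the flags seen in the htop output above). This is an illustration only, not git-annex's actual Haskell code; the names `parse_records` and `IgnoreChecker` are hypothetical, and the sketch assumes git flushes its answer after each path it reads from stdin. Each query writes one path and then blocks until the four NUL-terminated fields come back, which is the synchronous wait I am asking about.

```python
import subprocess


def parse_records(data: bytes):
    """Parse `git check-ignore -z --verbose` output.

    Each record is four NUL-terminated fields:
    source, line number, pattern, pathname.
    With --non-matching, a path that matches no pattern is reported
    with the first three fields empty.
    """
    fields = data.split(b"\0")
    # The final NUL leaves one empty trailing element; drop it.
    if fields and fields[-1] == b"":
        fields.pop()
    if len(fields) % 4 != 0:
        raise ValueError("incomplete check-ignore record")
    records = []
    for i in range(0, len(fields), 4):
        _source, _linenum, pattern, path = fields[i:i + 4]
        records.append({
            "path": path.decode(),
            "pattern": pattern.decode() or None,
            # A matched pattern starting with "!" is a negation:
            # the path is explicitly NOT ignored.
            "ignored": bool(pattern) and not pattern.startswith(b"!"),
        })
    return records


class IgnoreChecker:
    """Hypothetical wrapper around one long-running check-ignore process."""

    def __init__(self, repo="."):
        self.proc = subprocess.Popen(
            ["git", "-C", repo, "check-ignore", "-z", "--stdin",
             "--verbose", "--non-matching"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def is_ignored(self, path: str) -> bool:
        self.proc.stdin.write(path.encode() + b"\0")
        self.proc.stdin.flush()
        # The caller blocks here until git has answered for this path.
        buf = b""
        while buf.count(b"\0") < 4:
            byte = self.proc.stdout.read(1)
            if not byte:
                raise RuntimeError("git check-ignore exited unexpectedly")
            buf += byte
        return parse_records(buf)[0]["ignored"]

    def close(self):
        self.proc.stdin.close()
        self.proc.stdout.close()
        self.proc.wait()


# Usage (inside a git repository; the path is made up):
#   checker = IgnoreChecker(".")
#   checker.is_ignored("some/file.dat")
#   checker.close()
```

With a single such process, every entry in the batch pays for one full round trip before the next entry can be handled.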
You can use --force to disable the ignores check.
It might be a good idea to have an explicit option to do it, since --force might also have other effects.
With -J, git-annex will run more check-ignore processes, and so won't bottleneck on one, but IDK how much if any that actually speeds it up. Since batch async is implemented already, you can try and find out now!
Implemented the --no-check-gitignore option, but I guess you might want to use --force to avoid depending on a new git-annex.
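As a rough sketch of how a caller such as datalad might choose between these options, the following Python helper builds the addurl command line from the flags visible in the htop output above. The function name `addurl_batch_argv` and its parameters are hypothetical; in particular, whether the installed git-annex supports `--no-check-gitignore` is left to the caller to determine, since that depends on the version in use.

```python
def addurl_batch_argv(skip_ignore_check=False,
                      has_no_check_gitignore=False,
                      jobs=None):
    """Build an argv list for a batched `git annex addurl` call.

    skip_ignore_check: ask git-annex not to consult the gitignore rules.
    has_no_check_gitignore: caller-supplied flag saying the installed
        git-annex knows --no-check-gitignore; otherwise fall back to
        --force, which may have other effects as well.
    jobs: optional number of concurrent jobs, passed as -JN.
    """
    argv = ["git", "annex", "addurl", "--fast", "--with-files",
            "--json", "--json-error-messages", "--batch"]
    if skip_ignore_check:
        argv.append("--no-check-gitignore" if has_no_check_gitignore
                    else "--force")
    if jobs is not None:
        argv.append(f"-J{jobs}")
    return argv
```

For example, `addurl_batch_argv(skip_ignore_check=True, jobs=4)` falls back to `--force` and adds `-J4`, so several check-ignore processes would run if the check were still enabled.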
git check-ignore call in general in my use-cases. Originally I was wondering more about some kind of "optimization". async --batch could indeed help; we just might need to change our code to support it (I have not yet looked into the code). Should I reopen this?
Skipping checking ignores seems like a likely approach for you, because any url you want to add seems likely to have an extension that is not ignored, and it's very unlikely that the filename that addurl picks will be ignored either.
I can see two further ways to optimise it...
The ignores check currently runs in the same stage as the download. Making the download happen in perform stage and the ignores check and git add happen in cleanup would improve parallelization because it would not block other downloads on the ignores check.
Or git-annex could skip checking the ignores before git add and instead rely on git add to check them; if git add fails, it would just clean up the annex link from the working tree. But the problem with that is that it batches up calls to git add, so if git add failed, it would be difficult to know which file(s) it rejected for being ignored in order to clean them up. Also, I guess git add probably does an equivalent amount of work as git check-ignore to check the ignores, so at most this would halve that work and eliminate a little overhead in talking to git check-ignore.
Oh hmm, that suggests a third possibility: Run git add -f, since we've already checked the ignores, which would probably avoid it doing the ignore checking work.
(Another problem with skipping git check-ignore is that it would need to download the url before git add checks if it's ignored ... but in fact it already does download the file before check-ignore, because it needs to determine if youtube-dl should be used, which needs the content of the url, and when youtube-dl is used, a different filename will be used.)
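The first option above (moving the ignores check out of the download stage) can be modelled with a toy Python sketch. This is not how git-annex's stages are implemented; `add_urls`, `download`, and `is_ignored` are hypothetical names, and the callables are supplied by the caller. The point is only that the downloads run in a worker pool while the ignores check runs afterwards in the calling thread, so a slow check does not hold up a download slot.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def add_urls(urls, download, is_ignored, jobs=4):
    """Toy two-stage pipeline.

    download(url) -> path   runs concurrently (the "perform" stage).
    is_ignored(path) -> bool runs in the calling thread as each
        download finishes (the "cleanup" stage).
    Returns (added, skipped) as sorted lists of paths.
    """
    added, skipped = [], []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Queue every download up front; workers pick them up in parallel.
        futures = [pool.submit(download, url) for url in urls]
        # Handle results as they complete; the ignores check happens here,
        # outside the worker threads, so downloads keep going meanwhile.
        for fut in as_completed(futures):
            path = fut.result()
            (skipped if is_ignored(path) else added).append(path)
    return sorted(added), sorted(skipped)
```

A stand-in such as `download=lambda url: url.rsplit("/", 1)[-1]` is enough to try it out without touching the network.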
Implemented the git add -f optimisation. Unsure how much of an optimisation it really is tho.
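The `git add -f` idea can be sketched in Python as follows. This only illustrates the shape of the approach and is not the git-annex implementation; `forced_add_commands`, `is_ignored`, and `chunk_size` are hypothetical. Paths that already passed the ignores check are batched into `git add -f --` calls, on the assumption that forcing the add lets git skip its own ignore matching.

```python
def forced_add_commands(paths, is_ignored, chunk_size=100):
    """Return argv lists for batched `git add -f` calls.

    Only paths for which is_ignored(path) is False are included, so
    -f is used purely to avoid a second ignore check, not to add
    files that should have been ignored.
    """
    accepted = [p for p in paths if not is_ignored(p)]
    return [["git", "add", "-f", "--"] + accepted[i:i + chunk_size]
            for i in range(0, len(accepted), chunk_size)]
```

Each returned list can be passed to something like `subprocess.run(argv, check=True)` from inside the repository.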