Please describe the problem.
running git annex get -J8
on OSX (git annex 8.20200226 installed via brew)
- distribution: 10.14.6/x86_64
- encoding:
  - default: utf-8
  - filesystem: utf-8
  - locale.prefered: UTF-8
- max_path_length: 272
- name: Darwin
- release: 18.7.0
- type: posix
- version: Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64
results in git annex failing to get some files with errors about too many open files:
get .datalad/metadata/objects/0b/cn-65a581e2e22f95210db4d750cbcd1c.xz (from origin...) (checksum...) ok
get .datalad/metadata/objects/16/cn-020f8a40eed616291d855f13793e6e.xz (from origin...) (checksum...) ok
get .datalad/metadata/objects/0e/ds-2e262d6ab40f1f0fa399e80866f732 (from origin...) (checksum...) ok
get .datalad/metadata/objects/0f/cn-550a7aca2cc94645e63cda96476d40.xz
git-annex: git: createProcess: runInteractiveProcess: pipe: resource exhausted (Too many open files)
failed
get .datalad/metadata/objects/0f/cn-3c0d085acbb8ea3af430f826828b69.xz (from origin...) (checksum...) ok
get .datalad/metadata/objects/1d/ds-8e8fbd9d2834f108c7bd5ddb4294eb (from origin...) (checksum...) ok
get .datalad/metadata/objects/0b/ds-cd8a472d84af1980703baca7ad4910 (from origin...) (checksum...) ok
get .datalad/metadata/objects/1b/ds-9f695168fd28140c49a45523b9ce5f (from origin...) (checksum...) ok
get .datalad/metadata/objects/21/ds-92598325890e1d079e1a0a8e4fd9df
git-annex: git: createProcess: runInteractiveProcess: pipe: resource exhausted (Too many open files)
failed
get .datalad/metadata/objects/0f/ds-550a7aca2cc94645e63cda96476d40 (from origin...) (checksum...) ok
get .datalad/metadata/objects/06/cn-2c3eade47bd2d9052658c6a9d10a57.xz (from origin...) (checksum...) ok
get .datalad/metadata/objects/21/cn-92598325890e1d079e1a0a8e4fd9df.xz (from origin...) (checksum...) ok
get .datalad/metadata/objects/1b/cn-f55a26461f99a8ac1dde6b618e3425.xz (from origin...) (checksum...)
git-annex: .git/annex/othertmp.lck: openFd: resource exhausted (Too many open files)
failed
get .datalad/metadata/objects/26/ds-0ad917bee8d05db1dd27d0ad50c1bb
git-annex: git: createProcess: runInteractiveProcess: pipe: resource exhausted (Too many open files)
A full bundle of logs from a somewhat older version (20191230) and this newer one, along with the full output of datalad wtf, is available from http://www.onerussian.com/tmp/git-annex-J8-logs.tar.gz.
A possibly related issue I found, not yet marked done, is from 6 years ago.
I tried to replicate this on some other OSX boxes a few weeks back but could not reproduce it, so it might be specific to this machine. FWIW, ulimit reported "unlimited".
I will try to troubleshoot it further and provide more details whenever I get access to that machine again. Meanwhile I am just reporting it, and would welcome additional suggestions on what to look at (besides adding --debug and running lsof in a loop in parallel) to troubleshoot it.
The obvious question to ask, which I can't really imagine making any progress without an answer to: What files did git-annex have open?
I did notice that of the two git-annex logs, one got 19 files before failing, while the other got 27. It seems unlikely that, if git-annex, or an external remote, or git, or whatever is somehow leaking file handles, it would leak different numbers at different times. Which leads to the second question: What else on the system has files open and how many?
OSX has a global limit of 12k open files, and a per-process limit of 10k.
git-annex get on linux needs to open around 16 files per file it downloads. So if git-annex were somehow leaking every single open FD, it would successfully download over 600 files before hitting the per-process limit. If every subprocess git-annex forks also leaked every open FD, it would of course vary by remote, but with a regular git clone on the local filesystem, the number of files opened per get is still only 62, so still over an order of magnitude less.
Seems much more likely that the system is unhappy for some other reason.
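To make that arithmetic explicit, here is a back-of-the-envelope sketch (illustrative only, just restating the figures quoted above):

```haskell
-- Per-process FD limit on OSX and the per-file counts quoted above.
perProcessLimit, fdsPerFile, fdsPerFileWithSubprocesses :: Int
perProcessLimit = 10000
fdsPerFile = 16                  -- git-annex alone, per downloaded file
fdsPerFileWithSubprocesses = 62  -- counting every forked subprocess (local clone)

-- How many files could be downloaded before hitting the limit,
-- if every single FD opened for a file were leaked.
filesBeforeLimit :: Int -> Int
filesBeforeLimit perFile = perProcessLimit `div` perFile

main :: IO ()
main = do
  print (filesBeforeLimit fdsPerFile)                 -- 625
  print (filesBeforeLimit fdsPerFileWithSubprocesses) -- 161
```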
Michael has reported a similar issue on Linux. I was initially also "skeptical". But the reason really is that each git-annex process takes HUNDREDS of open files (dynamic libraries etc), and parallel execution of get adds a good number of pipes on top (I counted ~3000 for a get -J8 process). I thought to investigate more before reporting and then randomly ran into this not-so-old report from myself ;)
A quick demo: with -J5, 3 seconds after the initial call I get over 2k open files used by annex (according to grep; maybe some managed to escape matching). The number (from lsof) was fluctuating as the process kept going, but was not really steadily growing.
I'm seeing a lot of git cat-file processes, not a lot of any other process.
Each -J increment adds 3 threads for the different command stages (start, perform, cleanup). Each thread might need a git cat-file run with either of two different parameters, and on either of two different index files. (Both are needed for unlocked files, only one for locked files.)
So, 5x3x2x2=60 copies of git cat-file max for -J5. And experimentally, that's exactly how many I see in the worst case repo where all files are unlocked. (Plus 4 which I think are owned by the controlling thread or something). Using your test case, I am seeing 44. So I don't think there's a subprocess leak here.
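As a sketch of that arithmetic (not git-annex code, just the breakdown above restated):

```haskell
-- Upper bound on concurrent `git cat-file` processes, per the
-- breakdown above: 3 stage threads per job, 2 parameter variants,
-- 2 index files (both only needed for unlocked files).
maxCatFiles :: Int -> Int
maxCatFiles jobs = jobs * stages * paramVariants * indexFiles
  where
    stages        = 3
    paramVariants = 2
    indexFiles    = 2

main :: IO ()
main = print (maxCatFiles 5)  -- 60
```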
IIUC, what you show is lsof of things open by git-annex and any git processes that happen to open a file with "annex" in its name, which comes to around 3000.
Now, lsof is, for one thing, showing a file that two different threads have open as being opened twice.
Those are different threads of the same process, which has certainly not opened ld.so repeatedly.
So, you should be using lsof -Ki or something. With that, I see around 1019 files open between git-annex and git. git-annex by itself has only 246.
(Interestingly, the majority of those seem to be sqlite. I'm unsure why sqlite is opening the same database 30 times. A single thread often has the same database opened repeatedly. Might be that the sqlite database layer has a too-large connection pool. There are also a lot of FIFOs, which I think also belong to sqlite, unless they're something internal to the ghc runtime.)
Looking at Michael's bug report, it looks like they were running with -J8. I don't see that exceeding the default ulimit of 1024. If they were really running at -J32, it would. It's also not clear to me how datalad's --jobs interacts with git-annex's -J: does it pass through, or do you run multiple git-annex processes? People in that bug report are referring to multiple git-annex processes, which git-annex -J does not result in.
All these -J5 etc. values seem a bit high. I doubt that more than -J2 makes a lot of sense given the command stages optimisation, which makes it use 6 threads and balance the work better than it used to. The only time it really would is if you're getting from several different remotes that each bottleneck on a different resource.
ATM we just run a single annex get with the -J option (FWIW -- in --batch mode IIRC). Things might change in the future to balance across different submodules.
I could do some timing later on, but I did see benefits: I could not go over 40-60MBps in a single download process (e.g. from S3), but parallel ones (even as high as 8 or 10) could easily carry that throughput in parallel, thus scaling up quite nicely. If interested -- you could experiment on smaug, to which you have access, to possibly observe similar effects.
The sqlite open files are a red herring: that happened only when using a remote in a local directory. Anyway, I've fixed that.
The open files I'm seeing now in my artificial test case (two local repos with 1000 unlocked files, git-annex get between them, lsof -Ki run after that's moved 500 files, while the git-annex process is suspended):
Which seems fine, 28 file handles per -J increment.
If you have something worse than that, show me the lsof.
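Purely as an illustration, and mixing two separate measurements (the ~28 handles per -J increment above and the 246-FD baseline from the earlier lsof -Ki run), a rough linear model of open files versus -J might look like this sketch:

```haskell
-- Rough linear model: baseline FDs for the git-annex process itself
-- plus ~28 handles per -J increment (the figure observed above).
estimatedOpenFiles :: Int -> Int -> Int
estimatedOpenFiles baseline jobs = baseline + 28 * jobs

main :: IO ()
main = mapM_ (print . estimatedOpenFiles 246) [2, 8, 32]
-- with the 246-FD baseline quoted earlier: 302, 470, 1142
```

Under that rough model, -J8 stays well under a default ulimit of 1024 while -J32 would exceed it, consistent with the earlier comment about Michael's report.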
So there are now up to 60 git processes, where each one has about 20 open files, totaling up to 1200 open files... so we are getting into the thousands.
In my current attempt on the laptop, here is a pstree with counts per each process and a total at the bottom: 883 open files for a -J5 invocation, with each git cat-file taking between 14 and 29. Looking at the one with 29 open files, and at another with only 16, we can see why it is fluctuating, although I have no clue why those are opened by git cat-file: connections to the remote (although -- why??), looking at .git/annex/tmp?
But the overall problem, it seems to me, is this heavy growth of external processes due to multiple external git invocations per annex get thread, with each process then consuming a small, but still in the tens, number of open files.
git cat-file processes grew a bit (to 42), with total open files at 1034, but the count seems stable -- one more piece of support that there are no leaking processes or file descriptors, just sheer growth of subprocesses leading to a large number of open files.
Thinking about this over the weekend, I had two ideas:
The worker pool has an AnnexState for each thread. If those could be partitioned so that e.g. the perform stage is always run by the same threads, then when only one stage needs cat-file, the overall number of cat-file processes would be reduced by 1/3rd.
This might be the least resource-intensive approach. But, as threads transition between stages, their AnnexState necessarily does too, and the cleanup stage might need some state change made in the perform stage, so swapping out the perform AnnexState for a cleanup one seems hard to accomplish.
Could have a pool of cat-files, and just have worker threads block until one is available. This would let it be pinned to the -J number, or even to a smaller number.
Seems likely that only 2 or 3 in the cat-file pool will maximise concurrency, because it's not a major bottleneck most of the time, and when it is, the actual bottleneck is probably disk IO and so won't be helped by more (more would likely only increase unnecessary seeks).
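As an illustration of that second idea only (this is not git-annex's actual implementation; `CatFileHandle`, `Pool`, `newPool`, `takeHandle`, and `withCatFile` are invented names), a minimal STM-based bounded pool where worker threads block until a handle is free might look like:

```haskell
import Control.Concurrent.STM
import Control.Exception (bracket)

-- Hypothetical stand-in for a running `git cat-file --batch` handle.
data CatFileHandle = CatFileHandle

-- A bounded pool: handles are started lazily up to a cap and then
-- reused; workers block (STM retry) until one is free.
data Pool = Pool
  { poolFree       :: TVar [CatFileHandle]  -- idle handles
  , poolStartsLeft :: TVar Int              -- how many more may be started
  }

newPool :: Int -> IO Pool
newPool cap = Pool <$> newTVarIO [] <*> newTVarIO cap

-- Take an idle handle, or start a new one if still under the cap;
-- otherwise block until another worker returns one.
-- (Error handling around `start` is elided for brevity.)
takeHandle :: Pool -> IO CatFileHandle -> IO CatFileHandle
takeHandle pool start = do
  acquire <- atomically $ do
    free <- readTVar (poolFree pool)
    case free of
      (h:rest) -> do
        writeTVar (poolFree pool) rest
        return (return h)
      [] -> do
        n <- readTVar (poolStartsLeft pool)
        if n > 0
          then do
            writeTVar (poolStartsLeft pool) (n - 1)
            return start
          else retry
  acquire

returnHandle :: Pool -> CatFileHandle -> IO ()
returnHandle pool h = atomically $ modifyTVar' (poolFree pool) (h:)

-- Run an action with a pooled handle, always giving it back.
withCatFile :: Pool -> IO CatFileHandle -> (CatFileHandle -> IO a) -> IO a
withCatFile pool start = bracket (takeHandle pool start) (returnHandle pool)

main :: IO ()
main = do
  pool <- newPool 2
  withCatFile pool (return CatFileHandle) (\_ -> putStrLn "got a handle")
```

The change described below differs in that it keeps a small cap per distinct cat-file type rather than one pool for everything, but the blocking behaviour is the same idea.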
Implemented the cat-file pool. Capped at 2 cat-files of each distinct type, so it will start a max of 8 no matter the -J level.
(Although cat-file can also be run in those repositories, so there will be more in that case.)
While testing, I noticed git-annex drop -Jn starts n git check-attr processes, so the same thing ought to be done with them. Leaving this bug open for that, but I do think that the problem you reported should be fixed now.