(git)smaug:/mnt/btrfs/datasets/datalad/crawl/dbic/QA[master]git
$> git annex get -J 16 --debug --json-progress .
[2018-09-25 09:40:26.76476951] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","ls-files","--cached","-z","--","."]
[2018-09-25 09:40:26.769184442] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","git-annex"]
[2018-09-25 09:40:26.771844846] process done ExitSuccess
[2018-09-25 09:40:26.771993108] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","--hash","refs/heads/git-annex"]
[2018-09-25 09:40:26.775598151] process done ExitSuccess
[2018-09-25 09:40:26.776318887] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..db88e4b0bf3d633885a4e202e0c113942856c47f","--pretty=%H","-n1"]
[2018-09-25 09:40:26.780382394] process done ExitSuccess
[2018-09-25 09:40:26.780553211] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..4f0e8cd9a43b003228c3898133b1e200f9bdfc22","--pretty=%H","-n1"]
[2018-09-25 09:40:26.785109549] process done ExitSuccess
[2018-09-25 09:40:26.785992031] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2018-09-25 09:40:26.787017719] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2018-09-25 09:40:26.838015294] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2018-09-25 09:40:26.838108997] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2018-09-25 09:40:26.838146759] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2018-09-25 09:40:26.838125814] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2018-09-25 09:40:26.840522185] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2018-09-25 09:40:26.840544359] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2018-09-25 09:40:26.840612031] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2018-09-25 09:40:26.843542095] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2018-09-25 09:40:26.843885927] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2018-09-25 09:40:26.844793008] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2018-09-25 09:40:26.84691911] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2018-09-25 09:40:26.848525571] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2018-09-25 09:40:26.876936065] read: ssh ["-O","stop","-S","bids@rolando.cns","-o","ControlMaster=auto","-o","ControlPersist=yes","localhost"]
[2018-09-25 09:40:26.89165743] process done ExitSuccess
{"command":"get","wanted":[{"here":false,"uuid":"6384a551-a41d-4290-b186-9258befede97","description":"bids@rolando:/inbox/BIDS/dbic/QA [origin]"}],"note":"transfer already in progress, or unable to take transfer lock\nUnable to access these remotes: origin\nTry making some of these repositories available:\n\t6384a551-a41d-4290-b186-9258befede97 -- bids@rolando:/inbox/BIDS/dbic/QA [origin]\n","skipped":[],"success":false,"key":"SHA256E-s1633--1bc7a100b82cd31d084aae5b25b8d90af128324d10d92cfc9e2196d28829d375.tsv","file":".heudiconv/qa/ses-20171106/info/dicominfo_ses-20171106.tsv"}
[2018-09-25 09:40:26.89294943] read: ssh ["-o","BatchMode=true","-S",".git/annex/ssh/bids@rolando.cns","-o","ControlMaster=auto","-o","ControlPersist=yes","-n","-T","bids@rolando.cns","true"]
[2018-09-25 09:40:26.892976736] read: ssh ["-o","BatchMode=true","-S",".git/annex/ssh/bids@rolando.cns","-o","ControlMaster=auto","-o","ControlPersist=yes","-n","-T","bids@rolando.cns","true"]
[2018-09-25 09:40:26.892951965] read: ssh ["-o","BatchMode=true","-S",".git/annex/ssh/bids@rolando.cns","-o","ControlMaster=auto","-o","ControlPersist=yes","-n","-T","bids@rolando.cns","true"]
[2018-09-25 09:40:26.901388917] read: ssh ["-o","BatchMode=true","-S",".git/annex/ssh/bids@rolando.cns","-o","ControlMaster=auto","-o","ControlPersist=yes","-n","-T","bids@rolando.cns","true"]
[2018-09-25 09:40:26.901683012] read: ssh ["-o","BatchMode=true","-S",".git/annex/ssh/bids@rolando.cns","-o","ControlMaster=auto","-o","ControlPersist=yes","-n","-T","bids@rolando.cns","true"]
[2018-09-25 09:40:28.479898885] process done ExitSuccess
[2018-09-25 09:40:28.480093679] chat: ssh ["bids@rolando.cns","-S",".git/annex/ssh/bids@rolando.cns","-o","ControlMaster=auto","-o","ControlPersist=yes","-T","git-annex-shell 'p2pstdio' '/inbox/BIDS/dbic/QA' '--debug' 'ba8f2cea-f229-422c-82be-6580e5e07ed5' --uuid 6384a551-a41d-4290-b186-9258befede97"]
[2018-09-25 09:40:28.505377708] process done ExitSuccess
[2018-09-25 09:40:28.505676376] chat: ssh ["bids@rolando.cns","-S",".git/annex/ssh/bids@rolando.cns","-o","ControlMaster=auto","-o","ControlPersist=yes","-T","git-annex-shell 'p2pstdio' '/inbox/BIDS/dbic/QA' '--debug' 'ba8f2cea-f229-422c-82be-6580e5e07ed5' --uuid 6384a551-a41d-4290-b186-9258befede97"]
[2018-09-25 09:40:28.508740142] process done ExitSuccess
[2018-09-25 09:40:28.508761163] process done ExitSuccess
[2018-09-25 09:40:28.508782029] process done ExitSuccess
[2018-09-25 09:40:28.50891246] chat: ssh ["bids@rolando.cns","-S",".git/annex/ssh/bids@rolando.cns","-o","ControlMaster=auto","-o","ControlPersist=yes","-T","git-annex-shell 'p2pstdio' '/inbox/BIDS/dbic/QA' '--debug' 'ba8f2cea-f229-422c-82be-6580e5e07ed5' --uuid 6384a551-a41d-4290-b186-9258befede97"]
[2018-09-25 09:40:28.508944895] chat: ssh ["bids@rolando.cns","-S",".git/annex/ssh/bids@rolando.cns","-o","ControlMaster=auto","-o","ControlPersist=yes","-T","git-annex-shell 'p2pstdio' '/inbox/BIDS/dbic/QA' '--debug' 'ba8f2cea-f229-422c-82be-6580e5e07ed5' --uuid 6384a551-a41d-4290-b186-9258befede97"]
[2018-09-25 09:40:28.509074882] chat: ssh ["bids@rolando.cns","-S",".git/annex/ssh/bids@rolando.cns","-o","ControlMaster=auto","-o","ControlPersist=yes","-T","git-annex-shell 'p2pstdio' '/inbox/BIDS/dbic/QA' '--debug' 'ba8f2cea-f229-422c-82be-6580e5e07ed5' --uuid 6384a551-a41d-4290-b186-9258befede97"]
[2018-09-25 09:40:29.936552742] P2P < AUTH-SUCCESS 6384a551-a41d-4290-b186-9258befede97
[2018-09-25 09:40:29.93677054] P2P > VERSION 1
[2018-09-25 09:40:29.937971781] P2P < VERSION 1
[2018-09-25 09:40:29.938155534] P2P > GET 0 .heudiconv/qa/ses-20171127/info/filegroup_ses-20171127.json SHA256E-s429026--12edfd6e2c58f05803cb5bff0ff3665f46229b5593354324fbecdd5b131506fc.json
[2018-09-25 09:40:29.940554579] P2P < GET 0 .heudiconv/qa/ses-20171127/info/filegroup_ses-20171127.json SHA256E-s429026--12edfd6e2c58f05803cb5bff0ff3665f46229b5593354324fbecdd5b131506fc.json[2018-09-25 09:40:29.967840409] P2P < AUTH-SUCCESS 6384a551-a41d-4290-b186-9258befede97
[2018-09-25 09:40:29.967989255] P2P > VERSION 1
[2018-09-25 09:40:29.969234907] P2P < VERSION 1
[2018-09-25 09:40:29.96933411] P2P > GET 0 .heudiconv/qa/ses-20171204/info/dicominfo_ses-20171204.tsv SHA256E-s1633--7aee5bc34c95def9972b1f99f1a9ad976a3acb9da9f18e8bfa8888f579c111f6.tsv
[2018-09-25 09:40:29.971575677] P2P < GET 0 .heudiconv/qa/ses-20171204/info/dicominfo_ses-20171204.tsv SHA256E-s1633--7aee5bc34c95def9972b1f99f1a9ad976a3acb9da9f18e8bfa8888f579c111f6.tsv[2018-09-25 09:40:29.972619373] P2P > DATA 429026[2018-09-25 09:40:29.971577514] P2P < DATA 429026
[2018-09-25 09:40:29.981195211] P2P < AUTH-SUCCESS 6384a551-a41d-4290-b186-9258befede97
[2018-09-25 09:40:29.981346807] P2P > VERSION 1
[2018-09-25 09:40:29.982599704] P2P < VERSION 1
[2018-09-25 09:40:29.982851645] P2P > GET 0 .heudiconv/qa/ses-20171113/info/filegroup_ses-20171113.json SHA256E-s429026--2f0daf77e8ff5d902c00f88343b3cad51a2c7abc1e4702d450e5aea8dafc4ca6.json
[2018-09-25 09:40:29.98372615] P2P < AUTH-SUCCESS 6384a551-a41d-4290-b186-9258befede97
[2018-09-25 09:40:29.983855252] P2P > VERSION 1
[2018-09-25 09:40:29.985098551] P2P < GET 0 .heudiconv/qa/ses-20171113/info/filegroup_ses-20171113.json SHA256E-s429026--2f0daf77e8ff5d902c00f88343b3cad51a2c7abc1e4702d450e5aea8dafc4ca6.json[2018-09-25 09:40:29.985099291] P2P < VERSION 1
[2018-09-25 09:40:29.985268868] P2P > GET 0 .heudiconv/qa/ses-20171113/info/dicominfo_ses-20171113.tsv SHA256E-s1633--76397f694088587dfbe2cbea5d35341c7588dbcb262a0ff91f94a07892c83409.tsv
[2018-09-25 09:40:29.986356073] P2P < AUTH-SUCCESS 6384a551-a41d-4290-b186-9258befede97
[2018-09-25 09:40:29.9876026] P2P < GET 0 .heudiconv/qa/ses-20171113/info/dicominfo_ses-20171113.tsv SHA256E-s1633--76397f694088587dfbe2cbea5d35341c7588dbcb262a0ff91f94a07892c83409.tsv[2018-09-25 09:40:29.986449339] P2P > VERSION 1
[2018-09-25 09:40:29.987893014] P2P < VERSION 1
[2018-09-25 09:40:29.988057111] P2P > GET 0 .heudiconv/qa/ses-20171106/info/filegroup_ses-20171106.json SHA256E-s429026--cd00d181cf477405db3b5243fbd5d7969efffccf16b49599a1cd1fde32879a66.json
[2018-09-25 09:40:29.990480076] P2P < GET 0 .heudiconv/qa/ses-20171106/info/filegroup_ses-20171106.json SHA256E-s429026--cd00d181cf477405db3b5243fbd5d7969efffccf16b49599a1cd1fde32879a66.json{"byte-progress":32752,"action":{"command":"get","note":"from origin...","key":"SHA256E-s429026--12edfd6e2c58f05803cb5bff0ff3665f46229b5593354324fbecdd5b131506fc.json","file":".heudiconv/qa/ses-20171127/info/filegroup_ses-20171127.json"},"total-size":429026,"percent-progress":"7.63%"}
[2018-09-25 09:40:30.109171443] P2P > VALID{"byte-progress":429026,"action":{"command":"get","note":"from origin...","key":"SHA256E-s429026--12edfd6e2c58f05803cb5bff0ff3665f46229b5593354324fbecdd5b131506fc.json","file":".heudiconv/qa/ses-20171127/info/filegroup_ses-20171127.json"},"total-size":429026,"percent-progress":"100%"}
[2018-09-25 09:40:30.109206914] P2P < VALID
[2018-09-25 09:40:30.109391344] P2P > SUCCESS
[2018-09-25 09:40:30.111586675] P2P < SUCCESS{"command":"get","note":"from origin...\nchecksum...","success":true,"key":"SHA256E-s429026--12edfd6e2c58f05803cb5bff0ff3665f46229b5593354324fbecdd5b131506fc.json","file":".heudiconv/qa/ses-20171127/info/filegroup_ses-20171127.json"}
on the remote I see 5 sleeping annex-shells:
[bids@rolando ~] > ps auxw | grep git-anne[x] | nl
1 bids 17254 0.0 0.0 1074095936 12788 ? Ssl 09:40 0:00 /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2/exe/git-annex-shell --library-path /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu/gconv:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu/audit:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//etc/ld.so.conf.d:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//lib64:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//lib/x86_64-linux-gnu:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu: /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2/shimmed/git-annex-shell/git-annex-shell p2pstdio /inbox/BIDS/dbic/QA --debug ba8f2cea-f229-422c-82be-6580e5e07ed5 --uuid 6384a551-a41d-4290-b186-9258befede97
2 bids 17259 0.0 0.0 1074095936 12004 ? Ssl 09:40 0:00 /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2/exe/git-annex-shell --library-path /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu/gconv:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu/audit:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//etc/ld.so.conf.d:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//lib64:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//lib/x86_64-linux-gnu:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu: /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2/shimmed/git-annex-shell/git-annex-shell p2pstdio /inbox/BIDS/dbic/QA --debug ba8f2cea-f229-422c-82be-6580e5e07ed5 --uuid 6384a551-a41d-4290-b186-9258befede97
3 bids 17260 0.0 0.0 1074095936 12012 ? Ssl 09:40 0:00 /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2/exe/git-annex-shell --library-path /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu/gconv:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu/audit:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//etc/ld.so.conf.d:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//lib64:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//lib/x86_64-linux-gnu:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu: /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2/shimmed/git-annex-shell/git-annex-shell p2pstdio /inbox/BIDS/dbic/QA --debug ba8f2cea-f229-422c-82be-6580e5e07ed5 --uuid 6384a551-a41d-4290-b186-9258befede97
4 bids 17261 0.0 0.0 1074095936 12012 ? Ssl 09:40 0:00 /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2/exe/git-annex-shell --library-path /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu/gconv:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu/audit:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//etc/ld.so.conf.d:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//lib64:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//lib/x86_64-linux-gnu:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu: /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2/shimmed/git-annex-shell/git-annex-shell p2pstdio /inbox/BIDS/dbic/QA --debug ba8f2cea-f229-422c-82be-6580e5e07ed5 --uuid 6384a551-a41d-4290-b186-9258befede97
5 bids 17262 0.0 0.0 1074095936 12008 ? Ssl 09:40 0:00 /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2/exe/git-annex-shell --library-path /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu/gconv:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu/audit:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//etc/ld.so.conf.d:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//lib64:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//lib/x86_64-linux-gnu:/afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2//usr/lib/x86_64-linux-gnu: /afs/dbic.dartmouth.edu/usr/pkg/git-annex/git-annex_v2/shimmed/git-annex-shell/git-annex-shell p2pstdio /inbox/BIDS/dbic/QA --debug ba8f2cea-f229-422c-82be-6580e5e07ed5 --uuid 6384a551-a41d-4290-b186-9258befede97
local version is 6.20180913+git33-g2cd5a723f-1~ndall+1 and remote 6.20180808-ga1327779a (it is like Sep 13, current standalone build). There was/is no other transfer in progress when calling it, but some previously failed runs. That file gets just fine later on if run get on it alone
$> git annex get .heudiconv/qa/ses-20171113/info/filegroup_ses-20171113.json
get .heudiconv/qa/ses-20171113/info/filegroup_ses-20171113.json (from origin...)
(checksum...) ok
(recording state in git...)
so to me smells like some race condition due to high -J value.
All this debug output means nothing to me, what operation is it stalling on?
Are two files in the set of files you operated on both pointing at this key?
Do you mean that command I have pasted in the beginning
git annex get -J 16 --debug --json-progress .
? if something else, I cannot add beyond that command and--debug
output.Ha, good question... seems to be not?:
the only thing is that I have interrupted some previous stalled
get
invocations so may be it is somehow related?tried to replicate from other hosts, but without success, so might be something specific about that one (original repo is on NFS (mount options rw,relatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.232.11.5,mountvers=3,mountport=1009,mountproto=udp,local_lock=none,addr=10.232.11.5), CentOS Linux release 7.4.1708 (Core), git annex 6.20180808-ga1327779a, OpenSSH_7.4p1, OpenSSL 1.0.2k-fips). This time it is also without any message about being unable to lock:
and nothing more...
indeed though there is a good number of files pointing to the same annex keys
so only 1187 unique keys for 1439 files.
FWIW, some anecdotal evidences/observations:
After upgrade to 6.20180913+git149-g23bd27773-1~ndall+1 (resolving that
get
issue over ssh with elderly annexes) I thought that I started to see thatget
does not stall in a fresh git clone of the repository whileget
was still stalling in the git clone where I have tried to get (and stalled) with 6.20180913+git52-gdb1644808-1~ndall+1. Removing files under.git/annex/tmp
seemed to makeget
on the next run behave better... But then I still started to observe stalling from time to time, although it felt that it started to happen much more rarely.Despite -J flag, and starting a large number of
git-annex-shell 'p2pstdio'
, transmission seems to happen one file at a time, nothing is really parallelized. I see a single file being downloaded reported bydatalad
(which processes --json --json-progress output) or just observing output ofgit annex get -J16 .
(or even just -J2) having only a single % line running at the bottom.On one run I got a new beasty error reported:
hGetChar: illegal operation
:may be it is somehow related.
Raising -J to some obscene value (32) causes a collection of the
Unable to access these remotes: origin
, some include thatLost connection (fd:323: hGetChar: illegal operation (handle is closed))
, some not. Someone could argue that it would be expected...Overall: - I think that stalling situation did improve, although not entirely resolved. - Not sure if annex is actually doing any parallel transfer over ssh
Finally I've boiled down to a small reproducible case for the stall with the bleeding edge 6.20180913+git149-g23bd27773-1~ndall+1 build:
So it gets stuck after having 2 out of 3 files transmitted. With -J8 (in above it is -J2) it stalls after the first one. Similarly stalls (there is more --debug output) if I do not reuse the ssh socket:
all of those 3 files point to different keys
.git/tmp/annex
is clean (I thought it could have mattered, seems not) What other information could I provide?It seems to be stalled waiting for data to come in from the pipe from ssh. Both the protocol debug and the emailed strace look like that.
I think it's quite likely that this is something in ssh being unhappy with a large number of connections. ssh also seems to be closing some of the connections while git-annex is still using them.
It would be worth disabling annex.sshcaching to at least eliminate that part of ssh.
It would probably make sense to look at the ssh server's logs and see if it's throttling connections.
FWIW, I tried a git-annex get -J64 from localhost using ssh and it worked without any problems.
NB that server we are accessing via ssh was even rebooted today, no change
doing
git -c annex.sshcaching=no annex get
didn't make any differenceHere is some logs from the server with comments from myself (on
#
-prefixed lines):First, you have not disabled sshcaching, because the value to disable it is not "no" but "false". So please try again with sshcaching disabled..
Looks like something on the client side or in the middle is closing the TCP connection, and I think not in a clean shutdown way, but hard.
I'm fairly sure the git-annex is not stopping the ssh process it used to open that connection though. The "hGetChar: illegal operation" is because it's still trying to read from ssh after it exited. So it seems we need to find out why the ssh process on the client side is exiting (assuning the same thing happens with sshcaching disabled). strace?
You mentioned NFS.. Which repo is on NFS, the client or the server?
=false
instead of=no
seems to make no difference:in a blind attempt I have also set them within
~/.gitconfig
on the server side -- also to no effectI've provided a URL to strace log in an email from 3 Oct 2018 (msg id 20181003182901.GM3684@hopa.kiewit.dartmouth.edu). Now will collect one more with the
=false
and follow up to that email with url to this fresh trace (do not want to share them publicly happen some sensitive information is recorded/leaks)Server. I've emailed mount options in a private communication on Fri, 5 Oct 2018 (message id 20181005201504.GH3978@hopa.kiewit.dartmouth.edu)
Interestingly, the P2P debug in http://git-annex.branchable.com/bugs/annex_get_-J_16_via_ssh_stalls_/#comment-34b66436c1a2f2d9479f948655375bb5 shows two ssh processes started, but then git-annex sends "VERSION 1" three times. And the remote only replies to two of those.
Could that indicate two threads are somehow talking to the same ssh? I've instrumented --debug with more information to find out, so please try again with fcca7adaffe86673dade3790f80e22ee426b51e7.
Hmm, more likely that's debug info from the remote side coming over ssh stderr channel. Still the added instrumentation may provide a clue..
So what should I/we look for?
Trimming that to only P2P protocol and reordering by connection:
So the second connection received the GET request, but did not respond to it, or its response was not received.
Maybe try replaying the same protocol on the server manually. Eg:
If it doesn't reply with DATA, then it's something to do with that one particular key. If it does, the problem must involve concurrency or the ssh connection somehow.
I have found one possible way this protocol trace could happen, although it seems unlikely to be the actual cause here.. But possibly. Dealt with it in 0b053b96119c1a2d6c2d03ba7f98f032e89b8b66 which would need to be deployed to the server side to test.
isn't the fact that it works without -j parallelization hints that it is about concurrency and not just the key? But for the sake of completeness, I did that "manual"
GET 0 sub-qa/s...
snippet and indeed was bombarded with the data.ok, will build/test, thanks!
the 7.20181031-13-g4d24b712b (7.20181031+git7-g39bff661b did fine) is failing some tests, get/see build logs at http://neuro.debian.net/_files/_buildlogs/git-annex/7.20181031+git13-g4d24b712b/
isn't there some CI running tests for git-annex or my failures are standalone build specific?
I've seen the test fail that way, where apparently all gets from the remote repo are failing, on the autobuilders once or twice, but not reproducably, so have not been able to track it down yet.
(I was running the test suite hundreds of times overnight in a loop to reproduce and fix other intermittent test failures just before the v7 release, and never reproduced that failure pattern.)
If you can reproduce that test failure we can work on debugging it on a different bug report.
Anyway, since it replied to DATA for that key, the fix in 0b053b96119c1a2d6c2d03ba7f98f032e89b8b66 almost certianly can't be for the same problem as this bug.
I'm fairly sure that the problem with no DATA response is not due to an implementation bug in the P2P protocol code, since I've read over it all and only found the one way it could fail to send DATA, which should be reproducible w/o concurrency.
(It would be good to prove that it always sends DATA now, and it should be possible to prove things like that about the P2P protocol code, by running it in a different free monad interpreter that tests every possible input. But I don't have a framework to do that yet.)
Anyway, it would be good to check this at the server end, by stracing git-annex-shell. Can you wrap it with a shell script that straces it? Eg:
Had to call that script explicitly since my PATH is tuned to point to the directory with this "wrapper" so it would call recursively and bail with operation not permitted (to strace already straced thing)
anyways, sending straces via email
Straces confirm the hanging git-annex-shell never gets to the point of sending "DATA". It does receive the "GET".
annex.pidlock is set. That's unusual. It seems likely it's got something to do with the problem, because pidlocks and concurrency don't really mix; at best one git-annex process can hold the pid lock, so other processes will need to wait, and could also time out and fail to take the pid lock.
Yeah: Setting annex.pidlock, I can reproduce the hang!
The strace of the non-hanging process shows it took the pid lock and dropped it, twice. So the other process should have been able to take the pid lock too.
The hanging process does something around taking the pid lock just before it hangs.
It checks the side lock, which is locked (this is
trySideLock
):Then it checks the pid lock (this is
tryLock
whenlinkToLock
fails due to the side lock not being held):And gives up without overwriting it:
And then hangs waiting on stdin, it looks lile:
Seems it was not in
waitLock
, because that should loop andtryLock
again once per second.So, maybe some part of git-annex is calling
tryLock
, and when it fails, it somehow ends up blocked waiting for more protocol commands?My next step now that I can reproduce it is to manually take the pid lock, and run git-annex-shell p2pstdio, and see if it's indeed reading more commands from stdin despite not responding to GET, or is sending an error or what.
Great to hear that bottom issue is revealed!
FWIW regarding
waited for ~50 minutes,
git annex get
process is still hanging without any additional output providedRe
Unfortunately this time I was not as fortunate to reliably reproduce it (yet) for me I believe it happened for the first time ever although I could be wrong. How do you know which test(s) fail and how to restart only a failing test (e.g. by name?) -- I find it difficult to parse logged output to figure out test names/boundaries since intermixed with git output etc which also includes error messages on conflicts etc.
tryLockExclusive is returning Nothing before the hang, when runTransfer tries to lock the transfer log file. So runTransfer returns without running the transfer action.
Aha! ReadContent tries to run a transfer, and it calls sinkfile with sender in the transfer action. Never run, and so DATA never gets sent and indeed git-annex-shell is waiting for the next command.
Using alwaysRunTransfer would solve this; that's what Command.SendKey does. And it would let concurrent downloads from a annex.pidlock remote work. However, I'm not entirely happy with that, because as this shows any bug in there that prevents the transfer action being run will lead to a protocol stall. Seems it ought to detect when the transfer action didn't run, and error out, somehow.
Note that StoreContent also runs a transfer, but that does not involve running a callback. So, might be ok. But, it may not consume the DATA from the sender, so I need to look at it too.
Preliminary fix in place, but have not yet done anything more to guarantee that other failures don't prevent it from sending DATA.
I now have a comprehensive fix in place, including fixing similar bad behavior when uploading different objects concurrently to a remote with annex.pidlock=yes or the same object concurrently to a remote not using pidlock.
git-annex-shell
) and not on local (git annex get
)?Yes, git-annex-shell needs to be upgraded for the problem you reported. The other related problems I found also need git-annex to be upgraded to be fixed.