Please describe the problem.
TL;DR - it seems that stale .git/annex/index.lck, .git/annex/journal.lck and .git/annex/precommit.lck files can cause a new process to hang on git commit
or git annex add
without reporting any problem. The installed git-annex version is 7.20181211-g1cdb7a2. But that does not seem to be the complete explanation - even after removing them, things still get stuck.
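For reference, a quick way to check for such leftover lock files and their timestamps (these are the paths named above; whether they are safe to remove depends on no other git-annex process still running against the repository):

$ ls -la .git/annex/*.lck        # stale locks show up with old modification times
$ fuser -v .git/annex/*.lck      # shows any process still holding them, if fuser is available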
I think we started a git annex add
invocation over a remote ssh session, and the connection eventually dropped before git annex add
finished. The repository is large - 47k files.
So we ended up with some files annexed and staged, some just annexed (symlinks to keys), and some not yet tracked by git, still in their original form.
I thought I would first commit at least what was already staged, but git commit
pretty much hung there for an hour. It is an NFS mount on an HPC, so IO is not stellar, but an hour with little to no CPU consumed suggests it simply got stuck.
I checked ps and saw that it was sitting in the pre-commit
hook with two zombie children. I interrupted it and ran the hook manually with --debug:
fmriprep001]$ git annex --debug pre-commit .
[2019-03-18 18:08:36.049933] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","symbolic-ref","-q","HEAD"]
[2019-03-18 18:08:36.060284] process done ExitSuccess
[2019-03-18 18:08:36.060364] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","refs/heads/master"]
[2019-03-18 18:08:36.081517] process done ExitSuccess
[2019-03-18 18:08:36.081776] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","diff","--cached","--name-only","-z","--diff-filter=ACMRT","--","."]
[2019-03-18 18:08:44.780003] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","diff","--name-only","--diff-filter=T","-z","--cached","--","."]
[2019-03-18 18:08:44.817133] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","symbolic-ref","-q","HEAD"]
[2019-03-18 18:08:44.827942] process done ExitSuccess
[2019-03-18 18:08:44.827996] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","refs/heads/master"]
[2019-03-18 18:08:44.837793] process done ExitSuccess
# just hanging there, no notable CPU utilization
# ps output -- I think I made a mistake and grepped for annex rather than git
5765 0.0 0.0 127208 3800 pts/145 Ss Mar07 0:00 /bin/bash
21340 0.0 0.0 113940 1400 pts/145 S+ 18:08 0:00 git annex --debug pre-commit .
21341 2.0 0.0 1074154224 19164 pts/145 Sl+ 18:08 0:01 /ihome/crc/install/datalad/python3.7/bin/git-annex --debug pre-commit .
21353 0.0 0.0 0 0 pts/145 Z+ 18:08 0:00 [git] <defunct>
21397 0.0 0.0 0 0 pts/145 Z+ 18:08 0:00 [git] <defunct>
$ git annex version
git-annex version: 7.20181211-g1cdb7a2
build flags: Assistant Webapp Pairing S3(multipartupload)(storageclasses) WebDAV Inotify TorrentParser MagicMime Feeds Testsuite
dependency versions: aws-0.20 bloomfilter-2.0.1.0 cryptonite-0.25 DAV-1.3.2 feed-1.0.0.0 ghc-8.2.2 http-client-0.5.13.1 persistent-sqlite-2.8.1.2 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.0
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar hook external
operating system: linux x86_64
supported repository versions: 5 7
upgrade supported from repository versions: 0 1 2 3 4 5 6
Retrying the add of everything also seems to lead to a stall:
fmriprep001]$ git annex --debug add *
[2019-03-18 18:19:54.527228] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","symbolic-ref","-q","HEAD"]
[2019-03-18 18:19:54.537078] process done ExitSuccess
[2019-03-18 18:19:54.537377] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","refs/heads/master"]
[2019-03-18 18:19:54.547586] process done ExitSuccess
[2019-03-18 18:19:54.547873] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","ls-files","--others","--exclude-standard","-z","--","fmriprep","freesurfer"]
[2019-03-18 18:19:55.557577] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","check-attr","-z","--stdin","annex.backend","annex.numcopies","annex.largefiles","--"]
[2019-03-18 18:19:55.558288] read: git ["--version"]
[2019-03-18 18:19:55.560309] process done ExitSuccess
add freesurfer/sub-000457/mri/lh.ribbon.mgz
and that file seems to be freshly touched; according to git status it was initially not staged for a commit:
fmriprep001]$ ls -l freesurfer/sub-000457/mri/lh.ribbon.mgz
... Mar 18 18:19 freesurfer/sub-000457/mri/lh.ribbon.mgz -> ../../../.git/annex/objects/61/87/MD5E-s103445--f9f251a9f80317fbcf46cf11fa7be520.mgz/MD5E-s103445--f9f251a9f80317fbcf46cf11fa7be520.mgz
Here is the ps output:
# ps auxw -H | grep annex -- showing only relevant portion, no other annex processes detected;
# should have grepped for git! :-/
26125 0.0 0.0 113940 1404 pts/145 S+ 18:19 0:00 git annex --debug add fmriprep freesurfer
26126 0.4 0.0 1074154224 108492 pts/145 Sl+ 18:19 0:00 /ihome/crc/install/datalad/python3.7/bin/git-annex --debug add fmriprep freesurfer
26138 0.0 0.0 115916 3432 pts/145 S+ 18:19 0:00 git --git-dir=.git --work-tree=. --literal-pathspecs check-attr -z --stdin annex.backend annex.numcopies annex.largefiles --
Looking at the list of open files for process 26126, I see .git/annex/journal.lck,
which dates back to Feb 27
(3 weeks ago!), and altogether there are 3 lock files from that date:
.git/annex/index.lck, .git/annex/journal.lck and .git/annex/precommit.lck.
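A sketch of how such an open-file listing can be obtained on Linux (26126 is the git-annex PID from the ps output above; lsof is an alternative if /proc inspection is inconvenient):

$ ls -l /proc/26126/fd             # each entry is a symlink to a file the process has open
$ lsof -p 26126 | grep .git/annex  # alternative, if lsof is installed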
FWIW, the NFS mount options are: type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.201.0.82,local_lock=none,addr=10.201.2.178)
I removed them and then called git commit,
which seemed to proceed normally and got as far as reporting (recording state in git...),
but then appeared to get stuck as well:
31711 0.0 0.0 1460028 3876 pts/145 S+ 18:27 0:00 git commit -m Annexed portion of the results, probably was disconnected, so not all Also there were some stale lock files present .git/annex/index.lck .git/annex/journal.lck .git/annex/precommit.lck which caused precommit and add to hang, removed them
31788 0.0 0.0 113184 1412 pts/145 S+ 18:27 0:00 /bin/sh .git/hooks/pre-commit
31789 0.0 0.0 113940 1404 pts/145 S+ 18:27 0:00 git annex pre-commit .
31791 1.1 0.0 1074154220 19240 pts/145 Dl+ 18:27 0:04 /ihome/crc/install/datalad/python3.7/bin/git-annex pre-commit .
31802 0.0 0.0 0 0 pts/145 Z+ 18:27 0:00 [git] <defunct>
31889 0.0 0.0 0 0 pts/145 Z+ 18:27 0:00 [git] <defunct>
31901 1.6 0.0 113964 1608 pts/145 S+ 18:27 0:06 git --git-dir=.git --work-tree=. --literal-pathspecs hash-object -w --stdin-paths --no-filters
1075 0.0 0.0 113944 1624 pts/145 S+ 18:29 0:00 git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch
1076 0.0 0.0 113944 1408 pts/145 S+ 18:29 0:00 git --git-dir=.git --work-tree=. --literal-pathspecs cat-file --batch-check=%(objectname) %(objecttype) %(objectsize)
There are also two .git/annex/misctmp/jlog*
files, one from Feb 28 and one from today. Today's is bigger and is in use by 31901, but it does not seem to have changed in the last 15 minutes.
What could be the reason, and is there any way to mitigate it?
Since this is NFS, does it have annex.pidlock set?
If so, a stale pid lock (.git/annex/pidlock) can cause a stall for up to annex.pidlocktimeout seconds.
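For reference, these settings can be checked and adjusted per repository; the 300-second value below is only an illustrative number, not a recommendation:

$ git config annex.pidlock             # empty output means it is not set
$ git config annex.pidlock true        # use pid-based locking (meant for filesystems, such as some NFS setups, without working POSIX locks)
$ git config annex.pidlocktimeout 300  # the stall described above lasts at most this many seconds before git-annex gives up on a stale pid lock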
The hang points seem consistent with it preparing to commit the journal to the git-annex branch, and indeed with something related to locking. An strace would tell for sure.
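One way to capture that, assuming the hung git-annex process is still running (31791 is the stuck git-annex pre-commit PID from the ps output above):

$ strace -f -tt -p 31791 -o /tmp/annex-strace.log   # attach to the running process, follow its children, timestamp each syscall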
Removing the .lck files and setting
annex.pidlock=true
worked! I got just one complaint from git:

I found that file (via
find * -lname *MD5E-s55076--a7352bf6fd37902fc1dfb6228f2e8a53*
FTR); it was not yet known to git, so I am now running annex add
on it (takes a while -- I guess the repo has indeed gotten large, plus NFS) -- will report back when done/if there is more.

Managed to log in to that node again; after hours (>3) of "running",
annex add
is still there, with the current process tree being

and all the locks there, and I believe being generated by this
annex add
run:

The process 15565 is slowly increasing its seconds count (now 19)... Is that a good sign (no lockup, just very slow IO) or not (it is locked up, and all the accumulated time corresponds just to the periodic querying of the lock)? If strace -- which particular process? Or should I interrupt and run the entire
annex add
under strace?

Although, the haskell RTS uses a select loop with a timeout and so might use a small amount of CPU while otherwise blocked waiting on a lock..
and that 15566 is a thread:
Hard to see for sure from this strace tail end, but the fd number sequence suggests it first took the pid lock:
lrwx------ 1 USER GROUP 64 Mar 19 17:03 11 -> /dev/shm/cd6281f9a8c5f62e1da78a60ca9ca1b4D_fmriprep001_.git_annex_pidlock.lck
And since it then seems to have proceeded to write to this file, the pid lock is apparently not the hang point.
lrwx------ 1 USER GROUP 64 Mar 19 17:03 12 -> /zfs1/GROUP/USER/R03/WM-R03/DIAMOND/fmriprep001/.git/annex/misctmp/jlog15565-2
So stageJournal is running, after openjlog. That does not involve any further locks, and would not normally take a long time unless .git/annex/journal/ has an unusual number of files in it. It seems to have stalled for some reason. It should have started a git update-index process and fed data into it.
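If the repository is still in this state, the size of the journal is easy to check:

$ ls .git/annex/journal | wc -l   # number of pending journal files waiting to be staged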
Do you know if this process had a child git update-index running at this point? It could be that child is what is stalled, somehow.
The number of lines in the jlog15565-2 file would also be a clue to how far it got before it got stuck.
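Both questions could be answered with something like the following (15565 is the stuck git-annex PID from the process tree mentioned above):

$ ps -o pid,stat,cmd --ppid 15565        # look for a git update-index child and its state
$ wc -l .git/annex/misctmp/jlog15565-2   # how many entries were written before the stall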
But I think a full strace may be needed to find the sticking point.
The precommit.lck is no longer used; it has not been needed since v8. But this bug does not really appear to be about that lock at all.
It's a pity there was no follow-up to my last questions. At this point I can either close the bug report, or mark it moreinfo or something, but it seems pretty unlikely we will get to the bottom of it.
I guess let's just close it for now -- I managed to miss your reply back then (maybe I didn't check "email replies to me"), so the trail went cold. I hope I will get back to exercising such workflows soon, so we might return to this in a new manifestation if it happens again. Cheers