In the datalad test suite, a test involving one of our special remotes hangs after 7245a9ed5 (Improve shutdown process for external special remotes and external backends, 2020-11-02). The hang depends on the remote program creating an SSH socket but not cleaning it up on failure. That's being addressed on our end (https://github.com/datalad/datalad/pull/5151).
While that's clearly bad handling on the part of datalad's special remote, I'm opening this bug report in case the above stall indicates that 7245a9ed5 brings an increased susceptibility to hanging that's worth worrying about. Based on 7245a9ed5's commit message, it sounds like it resolved some hanging on Windows (and generally sounds like a good change), so it probably makes sense to keep that change at the expense of hangs that are restricted to buggy special remotes. But I don't have a good enough understanding of the issue to know whether it's limited to misbehaving special remotes, or whether there's perhaps a way to avoid both hangs.
In case it's helpful, here's a demo that tweaks the example special remote to trigger a hang:
set -eu
ga_srcdir="$(realpath "${1?requires git-annex source directory as first argument}")"
host="${2?requires ssh host as second argument}"
cd "$(mktemp -d "${TMPDIR:-/tmp}"/gx-testremote-XXXXXXX)"
top="$PWD"
mkdir bin
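# Extract the stock example special remote from the git-annex source tree;
# the sed below tweaks it to open an ssh socket.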
git -C "$ga_srcdir" cat-file -p cfae74ae25a791a8dbf8b476fe17d5c6feacd8b0 \
>bin/git-annex-remote-foo
# Open a socket in setupcreds().
cmd="ssh -fN -o ControlMaster=auto -o ControlPersist=15m -o ControlPath=$top/socket $host"
sed -i "s|setupcreds () {|&\\n\\t$cmd|" bin/git-annex-remote-foo
chmod +x bin/git-annex-remote-foo
export PATH="$top/bin:$PATH"
git annex version
git init a
(
cd a
git annex init main
# This will trigger an INITREMOTE-FAILURE because MYPASSWORD and
# MYLOGIN are not set. This hangs after 7245a9ed5.
git annex initremote d --debug \
type=external externaltype=foo directory="$top/d" encryption=none
)
$ sh /tmp/gx-failed-dir-hang.sh ~/src/haskell/git-annex smaug
git-annex version: 8.20201103
build flags: Assistant Pairing Inotify MagicMime Feeds Testsuite S3 WebDAV
dependency versions: aws-0.20 bloomfilter-2.0.1.0 cryptonite-0.25 DAV-1.3.4 feed-1.2.0.1 ghc-8.6.5 http-client-0.6.4 persistent-sqlite-2.10.5.2 uuid-1.3.13
[... 40 lines ...]
[2020-11-12 11:40:01.886013862] /tmp/gx-testremote-3MudMuQ/bin/git-annex-remote-foo[1] --> SETCONFIG directory /tmp/gx-testremote-3MudMuQ/d
[2020-11-12 11:40:02.451619481] /tmp/gx-testremote-3MudMuQ/bin/git-annex-remote-foo[1] --> INITREMOTE-FAILURE You need to set MYPASSWORD and MYLOGIN environment variables when running initremote.
That statement was based on me being able to trigger a stall locally in the same test on my Buster system. The changes from datalad's pr 5151 resolved the hang on my end, as well as the hang from the stripped-down demo I added here.
However, with the changes from pr 5151 merged, it turns out the stall on the Ubuntu 18.04 CI is still there [1]. So, it's unclear what's going on. Perhaps 7245a9ed5 isn't even what triggered the hang on the CI, though the stall seems to have first appeared 14 days ago on a build using 7e24b2587 [2], so the timing is at least consistent with 7245a9ed5 being the culprit.
[1] https://github.com/datalad/git-annex/runs/1404305092?check_suite_focus=true
[2] https://github.com/datalad/datalad-extensions/runs/1345445794?check_suite_focus=true
The ssh process you run this way keeps its inherited stderr open, and that file descriptor is a pipe connected to git-annex. git-annex is waiting to read any errors that the special remote might output to stderr, in order to relay them to the user. Yes, this was changed by the referenced commit.
Since git-annex has already waited on the process, the process is dead by the time it waits on the stderr relayer thread. The only purpose of waiting, rather than closing the handle, is to see anything the process might have output in its dying breath.
Also, this is very similar to the problem fixed in aa492bc65904a19f22ffdfc20d7a5e7052e2f54d and in cb74cefde782e542ad609b194792deabe55b1f5a, also involving ssh. Those were solved with a rather ugly workaround: waiting up to 2 seconds for any late stderr to arrive, though it only delays when the handle is kept open like this.
There are surely a ton of places where this could potentially happen. Not only stderr: it's entirely possible that a process git-annex expects to read stdout from might spawn a daemon that keeps the inherited stdout open, then exit, leaving git-annex waiting forever to read from the pipe.
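To make that failure mode concrete, here is a minimal standalone sketch of the pattern (not git-annex code; the backgrounded sleep stands in for ssh's ControlPersist daemon):

-- Spawn a child whose stderr is a pipe; the child backgrounds a process
-- that inherits the pipe's write end and then exits immediately.
import System.IO
import System.Process

main :: IO ()
main = do
    (_, _, Just herr, p) <- createProcess
        (shell "sleep 60 & exit 0") { std_err = CreatePipe }
    _ <- waitForProcess p        -- the direct child is already dead here
    err <- hGetContents herr     -- lazy; forcing it below is what blocks
    -- This blocks until the backgrounded sleep finally exits and the write
    -- end of the pipe is closed; EOF does not arrive for a long time.
    putStrLn ("child stderr: " ++ err)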
I'm doubtful this is a bug everywhere that git-annex reads all the stdout from a pipe, because it seems to me a great many programs would have the same problem if a program they were piping stdout from behaved that way, and I've never seen anything concern itself with this potential problem. This is why proper daemons close their handles, certainly their stdout handle. stderr is maybe a slightly special case.
processTranscript is one example of another place in git-annex that waits to consume all stderr from a process and would be hung by such a daemon. There are about a dozen in all:
git grep 'std_err =' | egrep 'CreatePipe|UseHandle' | grep -v nullh
I suppose they could all be audited, and maybe something could be abstracted out to deal with them all. One approach would be the following.
The versions of this currently being used to work around the past occurrences of the problem wait for the process to exit, and then wait 2 seconds before canceling the action, to give it time to read anything that was buffered in the pipe.
That's annoyingly racy: what if the system somehow massively slows down for 2 seconds just after the process exits, so that buffered output is lost? Doing this in a lot of places, not just a few targeted places where maybe we didn't care too much about stderr anyway, does not feel safe.
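For illustration, a rough sketch of that workaround's shape, using the async library (the names and structure here are illustrative, not the actual git-annex code):

import Control.Concurrent (threadDelay)
import Control.Concurrent.Async (race, wait, withAsync)
import System.IO (Handle, hGetContents)
import System.Process (ProcessHandle, waitForProcess)

-- Read a process's stderr in a helper thread, but once the process has
-- exited, give that thread at most 2 more seconds before giving up on it.
drainStderrWithGracePeriod :: Handle -> ProcessHandle -> IO String
drainStderrWithGracePeriod herr p =
    withAsync readAll $ \reader -> do
        _ <- waitForProcess p
        -- Here is the race described above: if the system stalls for more
        -- than 2 seconds at this point, stderr still buffered in the pipe
        -- is silently dropped when the reader thread is canceled.
        r <- race (threadDelay 2000000) (wait reader)
        return (either (const "") id r)
  where
    -- Force the whole string so the read really completes inside the async.
    readAll = do
        s <- hGetContents herr
        length s `seq` return s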
There's GHC.IO.Handle.hWaitForInput, which does not consume input from a Handle but checks whether any is available to read. Digging into what that really means: as well as checking the Handle's internal buffers, it calls an fdReady C function that boils down to a poll(). So that seems like it could be used to avoid the race. I think it would be safest to use it with a timeout of 0, which avoids a case where, if interrupted for too long, it can return False when there's still input. It should be possible to use that to implement this:
Wait for the process to exit, and then check hWaitForInput, and as long as there's input left, sleep briefly and loop. Once all buffered input has been consumed by the action, hClose the handle to get the action to finish reading from it.
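Roughly, the idea is the following (just a sketch of the proposal; as it turns out further down in this thread, hWaitForInput does not behave well while another thread is reading the same handle, so this is not what was finally used):

import Control.Concurrent (threadDelay)
import Control.Exception (catch, throwIO)
import System.IO (Handle, hClose, hWaitForInput)
import System.IO.Error (isEOFError)
import System.Process (ProcessHandle, waitForProcess)

-- After the process exits, keep checking whether buffered input remains;
-- only close the handle once the reading action has had a chance to
-- consume it all, so nothing written in the dying breath is lost.
closeAfterDrained :: ProcessHandle -> Handle -> IO ()
closeAfterDrained p h = do
    _ <- waitForProcess p
    loop
  where
    loop = do
        -- A timeout of 0 only asks whether input is available right now.
        -- hWaitForInput may fail with an EOF error once the pipe is at EOF,
        -- which for our purposes means there is nothing left to wait for.
        ready <- hWaitForInput h 0
            `catch` \e -> if isEOFError e then return False else throwIO e
        if ready
            then threadDelay 10000 >> loop  -- let the reading action catch up
            else hClose h  -- the blocked reader now sees EOF (or an exception)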
One problem with this is that, eg, it would throw an exception if it closed the handle out from under hGetLine before it had read anything. So it's not entirely safe to drop this in just anywhere. Maybe it should catch the EOF exception?
That should be fine for things like hGetLine, but would still be unsafe for hGetContents, which does not like having the handle it's reading closed out from under it ("hGetContents: illegal operation (delayed read on closed handle)"). So maybe it's best to specialize it.
I have a partial implementation on the ssh-hates-me branch.
Thanks for the analysis. (I wish I had some sort of intelligent input to give rather than just being the reporter of several of these ssh issues :/)
hWaitForInput does not seem to behave as documented. It does not time out, so it blocks forever in this situation. This is with a 0 timeout and with larger timeouts. Unsure why; it looked like it should work. Looking at the ghc bug tracker, it seems to have some open bugs. (9af0000e0ff28c8eaae0ee1ce6838063d0a5ebef has my bench test demonstrating this problem.)
hGetLineUntilExitOrEOF could still be implemented, but rather than using hGetLine, it would need to use hGetBufNonBlocking to do one-byte-at-a-time reads, with a small sleep when there is nothing to read, until it gets an empty read at a point after the process has exited.
I can't see a way to implement it without the sleep, and that is problematic: too long a sleep will slow down reading, and too short a sleep will waste CPU. It might be possible to use hGetBufSome and cancel it once the process exits, and only then switch to hGetBufNonBlocking to read the rest of the buffer. But it's not clear what would happen if hGetBufSome got as far as reading from the handle but didn't quite manage to finish before being canceled. Seems likely any data it did read would then be lost.
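For the record, such a byte-at-a-time loop would look roughly like this (a sketch only, with an arbitrary 10ms sleep; not the implementation that ended up in git-annex):

import Control.Concurrent (threadDelay)
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek)
import System.IO (Handle, hGetBufNonBlocking)
import System.Process (ProcessHandle, getProcessExitCode)

-- Read one line, returning Nothing on EOF or once the process has exited
-- and nothing more is buffered. Bytes are treated as Latin-1/ASCII chars.
hGetLineUntilExitOrEOF :: ProcessHandle -> Handle -> IO (Maybe String)
hGetLineUntilExitOrEOF p h = allocaBytes 1 $ \buf -> go (buf :: Ptr Word8) []
  where
    go b acc = do
        -- Check for exit *before* reading, so that an empty read afterwards
        -- really means nothing the process wrote is still sitting unread.
        exited <- getProcessExitCode p
        n <- hGetBufNonBlocking h b 1
        if n == 0
            then case exited of
                -- Nothing buffered and the process is gone: treat as EOF,
                -- returning any partial line accumulated so far.
                Just _ | null acc  -> return Nothing
                       | otherwise -> return (Just (reverse acc))
                -- Process still running: sleep briefly and poll again.
                -- (Tuning this sleep is exactly the problem described above.)
                Nothing -> threadDelay 10000 >> go b acc
            else do
                w <- peek b
                let c = chr (fromIntegral w)
                if c == '\n'
                    then return (Just (reverse acc))
                    else go b (c : acc)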
Think I'll have to abandon the hGetLineUntilExitOrEOF idea.
It seems the problem with hWaitForInput happens when running it concurrently with hGetLine on the same handle; when I run it sequentially, it does seem to work. Ok.. I was able to implement an hGetLineUntilExitOrEOF that seems to work.
I've converted it to use that, and the test case for this bug is fixed. Also, the sleep-based workaround in aa492bc65904a19f22ffdfc20d7a5e7052e2f54d is removed, and it uses hGetLineUntilExitOrEOF instead.
However, I still want to do the audit of other things that pipe stderr. Some of them don't do line-based input, so they will need a different function than hGetLineUntilExitOrEOF. Leaving this bug open until that audit is done.
Thanks. The hanging datalad CI builds are now resolved.
https://github.com/datalad/git-annex/actions/runs/371469598
Finished auditing and fixing anywhere in git-annex this could possibly happen.