I'm looking into another SSH-based hang in the DataLad tests. I can't trigger the hang on my local system (Debian Buster), but in a Ubuntu Xenial VM I can. The hang bisects to 1f2e2d15e (async exception safety, 2020-06-03). It disappears on the parent of 1f2e2d15e or if I revert that commit on top of master (currently 3b6754e2a).
I was able to reduce the hang down to a git annex get
from an rsync
remote. Here is a script that triggers the hang via a Xenial Docker
container. Sorry for the length; given the system interaction, it's
the simplest reproducer I've managed to come up with.
cd "$(mktemp -d ${TMPDIR:-/tmp}/ga-XXXXXXX)"
cat >demo.sh <<'EOF'
git annex version
cd "$(mktemp -d /tmp/ga-XXXXXXX)"
remdir="$PWD/store"
mkdir "$remdir"
git init repo
(
cd repo
git annex init
git annex initremote r type=rsync rsyncurl="localhost:$remdir" encryption=none
echo 0 >f0
git annex add f0
git commit -m'f0'
git annex copy --to=r f0
git annex drop f0
)
git clone repo clone
(
cd clone
git annex init clone
git annex enableremote r
git annex get --debug f0
)
EOF
cat >Dockerfile <<'EOF'
FROM ubuntu:xenial
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get install -y --no-install-recommends \
curl rsync openssh-client openssh-server ca-certificates
RUN apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
RUN cd /root && curl -fSsL \
https://downloads.kitenet.net/git-annex/autobuild/amd64/git-annex-standalone-amd64.tar.gz \
| tar xz
ENV PATH="/root/git-annex.linux:$PATH"
RUN git config --system user.name "u"
RUN git config --system user.email "u@e"
RUN mkdir -p /root/.ssh
RUN mkdir -p /var/run/sshd
RUN ssh-keygen -f /root/.ssh/id_rsa -N ""
RUN cat /root/.ssh/id_rsa.pub >>/root/.ssh/authorized_keys
RUN echo "Host localhost\nStrictHostKeyChecking no\n" >>/root/.ssh/config
COPY demo.sh /root/demo.sh
CMD /usr/sbin/sshd && sh /root/demo.sh
EOF
docker build -t ga-rsync-hang:latest .
docker run -it --rm ga-rsync-hang:latest
Output, where the last line stalled:
[... 50 lines ...]
git-annex version: 8.20200618-g3b6754e2a
build flags: Assistant Webapp Pairing S3 WebDAV Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite
dependency versions: aws-0.20 bloomfilter-2.0.1.0 cryptonite-0.25 DAV-1.3.3 feed-1.0.1.0 ghc-8.6.5 http-client-0.5.14 persistent-sqlite-2.9.3 torrent-10000.1.1 uuid-1.3.13 yesod-1.6.0
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs hook external
operating system: linux x86_64
supported repository versions: 8
upgrade supported from repository versions: 0 1 2 3 4 5 6 7
[... 32 lines ...]
[2020-07-07 14:28:55.252423613] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","symbolic-ref","-q","HEAD"]
[2020-07-07 14:28:55.256082009] process done ExitSuccess
[2020-07-07 14:28:55.256316845] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","refs/heads/master"]
[2020-07-07 14:28:55.26056207] process done ExitSuccess
[2020-07-07 14:28:55.260917245] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","ls-files","-z","--cached","--","f0"]
get f0 [2020-07-07 14:28:55.26481346] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","git-annex"]
[2020-07-07 14:28:55.268786541] process done ExitSuccess
[2020-07-07 14:28:55.268997933] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","--hash","refs/heads/git-annex"]
[2020-07-07 14:28:55.272176401] process done ExitSuccess
[2020-07-07 14:28:55.272550418] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..ea06186dfc3dab39316da534144d5d6ef3d6090c","--pretty=%H","-n1"]
[2020-07-07 14:28:55.275989985] process done ExitSuccess
[2020-07-07 14:28:55.27634403] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
[2020-07-07 14:28:55.276782737] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
[2020-07-07 14:28:55.279315206] read: git ["config","--null","--list"]
[2020-07-07 14:28:55.281990769] process done ExitSuccess
(from r...)
[2020-07-07 14:28:55.288194172] read: rsync ["-e","'ssh' '-S' '.git/annex/ssh/localhost' '-o' 'ControlMaster=auto' '-o' 'ControlPersist=yes' '-T'","--progress","--inplace","localhost:/tmp/ga-075MZKs/store/0c5/66e/'SHA256E-s2--9a271f2a916b0b6ee6cecb2426f0b3206ef074578be55d9bc94f6f3fe3ab86aa/SHA256E-s2--9a271f2a916b0b6ee6cecb2426f0b3206ef074578be55d9bc94f6f3fe3ab86aa'",".git/annex/tmp/SHA256E-s2--9a271f2a916b0b6ee6cecb2426f0b3206ef074578be55d9bc94f6f3fe3ab86aa"]
SHA256E-s2--9a271f2a916b0b6ee6cecb2426f0b3206ef074578be55d9bc94f6f3fe3ab86aa
0 0% 0.00kB/s 0:00:00
2 100% 1.95kB/s 0:00:00 (xfr#1, to-chk=0/1)
^C
Replacing "FROM ubuntu:xenial" with "FROM ubuntu:bionic" (a later release) resolves the hang, so perhaps there is some interaction with an older rsync or openssh version. Here are the versions that are present in Xenial:
OpenSSH_7.2p2 Ubuntu-4ubuntu2.10, OpenSSL 1.0.2g 1 Mar 2016
rsync version 3.1.1 protocol version 31
Copyright (C) 1996-2014 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
append, ACLs, xattrs, iconv, symtimes, prealloc
Thinking it's an older version of openssh or rsync, I tried with an older version of Debian. Using "FROM debian:stretch-slim" doesn't hang. Here are the versions there:
OpenSSH_7.4p1 Debian-10+deb9u7, OpenSSL 1.0.2u 20 Dec 2019
rsync version 3.1.2 protocol version 31
Copyright (C) 1996-2015 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
append, ACLs, xattrs, iconv, symtimes, prealloc
However, going back farther, "FROM debian:jessie-slim" does hang. The versions there:
OpenSSH_6.7p1 Debian-5+deb8u8, OpenSSL 1.0.1t 3 May 2016
rsync version 3.1.1 protocol version 31
Copyright (C) 1996-2014 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
append, ACLs, xattrs, iconv, symtimes, prealloc
So perhaps there's some interaction with openssh before 7.4p1 or rsync before 3.1.2. Those are pretty old versions, and, in the case of Jessie, the release is now EOL. But I figured it was at least worth writing up given that the hang isn't triggered until a recent commit in git-annex.
fixed made it cancel the threads 2 seconds after the process exited if still running. So there will be a bit of a slow down when using that broken ssh version, but it will work. --Joey
Hmm, my guess is that the significant and unmentioned change in that commit is, it waits for one of the stdout or stderr to be fully processed before waiting on the process to terminate. So if it somehow fails to detect EOF (maybe because old rsync and ssh somehow end up with the handle not being closed when rsync exits eg if a background ssh process inherited them?) it would block forever.
If so, this patch would avoid the hang (and if this doesn't fix it, the problem must be somehow due to using with CreateProcess).
If that patch does avoid the hang, this one would be worth a try as it tries to accomplish the goal of the patch in a better way:
I can try these if necessary but don't have docker handy, so maybe you're in a better position to?
Okay, short summary:
The demo script above no longer hangs as of 4c9ad1de4 (optimisation: stream keys through git cat-file --buffer, 2020-07-10).
Going to the parent of 4c9ad1de4 (1df9e72), the hang is still there.
The first patch advances the output by one line but still hangs, with the line at the time of the hang being "process done ExitSuccess".
The second patch removes the hang (again, this is from the parent of 4c9ad1de4).
The above is all based on working in a Xenial VM, building from source with stack. The Docker-based approach above provides a reproducer, but it depends on the autobuild downloads. It'd need quite a bit more work to have the full stack setup there. However, running the above script without Docker caching, I've confirmed the most recent autobuild, which has 4c9ad1de4, no longer hangs.
It seems at least plausible that 4c9ad1de4 affected the stall because it looks to deal with git-annex-get. However, perhaps that's just resolving the issue for this particular case, and it still makes sense to apply the second the patch?
If rsync is truely what's hanging, then commit 4c9ad1de4 can't have changed anything to do with that. Seems more likely that the hang did not actually involve rsync (but the debug output makes it seem like it is), or that the hang is not fully deterministic.
Whatever it's hanging on in the first patch is apparently not involving rsync, because rsync has exited. And it doesn't make sense that the first patch would hang while the second one doesn't, they're doing identical things except for extra background thread cleanup in the second patch.
As this stands, I don't feel confident enough that the second patch really fixes this to close this. Also I'm a little bit doubtful it's really correct, it seems possible it might cancel one of the threads before all output is processed by them. The first patch actually seems better.
Fair enough.
Perhaps. I haven't hit any divergent results yet, but that of course doesn't mean that much.
In my view, it'd be okay to close this issue, as well as the other Xenial/old system stall. These are related to SSH-based stalls that DataLad is hitting on Travis, and it would of course be nice to resolve them, but it probably isn't worth investing much more time in. And on DataLad's end, we should probably upgrade our CI's distribution from Xenial to Bionic anyway.
That being said, if you think of specific debugging information that might be useful for me to provide or if you want access to the Xenial qcow2 image (a bloated 25G :/), please let me know.
Thanks for your feedback on this issue and the other one.
I think this is worth following up still, I plan to reproduce it eventually and debug.
Reproducer script did not work for me.
But in the meantime, 4c9ad1de4 has gotten into the built that script uses, so whatever about that commit fixes or masks the problem, it seems.
But, the other stall's reproducer script did still work. And I was able to confirm that git-annex was blocked while ssh was a zombie process.. and killing sshd got git-annex unstuck. So apparently sshd is inheriting a handle keeping it open... at least sometimes.
It's really weird my first patch didn't work. The only difference between reverting commit 1f2e2d15e and my first patch is the use of withCreateProcess. So um.. withCreateProcess is somehow blocking on ssh to close stderr? Looking at its code, it calls cleanupProcess, which has this comment:
Ok, that begins to make sense.. The threads are not killed off, so are accessing the handle, so closing the handle blocks. And this explains why it got to ExitSuccess and then hung.
My second patch, by using withAsync, avoided that. But like I said, that patch could kill the threads before they have read and processed all the output. For example, if the process output an error and exited, that could cause the error message to not get displayed to the user.
So hmm, this is not reproducible, but we don't know why a recent change hid it, and the only patch I know of that fixes it is to revert commit 1f2e2d15e, which is part of the work being done for ?more extensive retries to mask transient failures. And while that todo is currently stalled and I may not use this work to resolve it at all, it seems a pity to revert that commit.
Also, before 1f2e2d15e, it would have started a thread to read from stderr, and that thread could keep running after outputFilter returned. And nothing would ever stop that thread, which would block forever due to ssh's behavior. So, before that commit, I think there was also a bug, a FD leak (and small memory leak).
This is getting kind of long, but the only way forward I see out of all this, aside from somehow determining how 4c9ad1de4 masks the problem, would be to make the stderr waiter thread only be allowed to run for a "little while" after the process exits. Similar to how cb74cefde78 fixed the other hang.
(Or well, git-annex could detect the ssh versions that have this behavior and refuse to use them..)
Just to see this in action, in the Xenial VM, I checked out the parent of 4c9ad1de4 (1df9e72a7) to get to a state that hangs with the demo script. I then applied the fix (aa492bc65) on top. The hang is avoided, and I see "git-annex: AsyncCancelled".
Thanks.
I'm a bit surprised that the AsyncCancelled exception gets displayed.
Something in git-annex must be catching that, and displaying it. The "git-annex:" says it's displayed with topLevelWarning, and the only thing that calls that to display an exception is accountCommandAction, but that uses tryNonAsync so it's supposed to not catch async exceptions like this one.
(Also why would canceling a worker thread throw an exception all the way out to accountCommandAction? And if it did, the overall command would then fail.)
Hmm. Well, I doubt it provides any useful information, but just in case, here's the output produced by the
git annex get --debug f0
call:And I should have checked the result the command rather than just noting that it didn't hang. f0 is not transferred.
Wait, are you saying the command exited nonzero, or that the transfer didn't happen, or both?
Well, I ran the reproducer script, substituting a git-annex tarball with the fix (and adding an output of the exit code and dropping the --debug), and the end was:
So, I guess this patch is ok, at least with the git-annex I've built. (Using haskell from debian unstable.) It seems possible that some older ghc, async, etc might have handled that AsyncCancelled differently and resulted in the behavior you're seeing.
Also tried with https://downloads.kitenet.net/git-annex/autobuild/i386-ancient/git-annex-standalone-i386-ancient.tar.gz and it's ok too.
Hmm, come to think of it,
wait errt
is racingcancel errt
, so what if cancel kills the errt thread, and beforerace
noticed cancel is done, the wait thread sees errt, died of an an AsyncCancelled exception and rethrows that? Result could be the exception leaking out, perhaps.It might be that async now avoids that, and some other version didn't, or you could be getting unlucky with the race.
So, I've added something to catch that exception in case it leaks.
I applied the change from c59a51a0651b and 4466c1001d88. Now I don't see "git-annex: AsyncCancelled", f0 is transferred, and the exit status is 0, so it looks to be working as intended. Thank you.
As described in comment 8, this is all in the Xenial VM on top of 1df9e72a7829, because the hang goes away with the next commit (4c9ad1de46dd). For posterity, here is a diff of the changes on top 1df9e72a7829, applied from aa492bc65904, c59a51a0651b, and 4466c1001d88.
diff