Running git annex get -J20 on 1000 files from an ssh remote on localhost, I've observed it hang three times.

get 99 (from origin...) (checksum...) ok
get 992 (from origin...) (checksum...) ok
get 991 (from origin...) (checksum...) ok
get 993 (from origin...) (checksum...) ok
get 995 (from origin...) (checksum...) ok
get 1000 (from origin...) 
get 1 (from origin...) 
get 10 (from origin...) 
get 108 (from origin...) 
get 105 (from origin...) 
[some more]

It seems it's trying to receive content of the last files listed, but has hung somehow in the P2P protocol and not received the data. Those are the only files not present. --Joey

The particular set of files it stalls on seems somewhat deterministic; the sets have been the same at least twice.

Looking at --debug, it does not seem to get to the point of sending a P2P request for the keys of the files that it stalls on.

So, a bug setting up the P2P ssh connection, it seems.

Interestingly, the debug log shows it only ran git-annex-shell p2pstdio 6 times, despite the concurrency of 20. So, the other 14 must have stalled setting up the connection. Suggests the bug is in the connection pool code.

The hang does not involve the connection pool code itself; a call to Annex.Ssh.sshCommand is hanging. So, this likely affected git-annex before p2pstdio, although p2pstdio's timing of its calls to sshCommand may be exposing the problem.

The hang is in prepSocket; all the threads enter makeconnection near the same time, and so all of them try to warm up the ssh connection at the same time. And somehow many of those executions of ssh hang. (Arguably there should be locking to prevent multiple threads doing this, but the actual overhead of multiple threads doing it may be smaller than the overhead of such added locking.)
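To illustrate the locking idea mentioned in passing above, here is a minimal sketch in Python rather than git-annex's Haskell. The class name ConnectionWarmer and its warm_up callback are hypothetical and only stand in for the ssh warm-up; this is not how prepSocket is implemented. It shows 20 threads arriving at once with only the first one actually doing the warm-up.

```python
import threading

class ConnectionWarmer:
    """Hypothetical helper: run an expensive warm-up at most once across threads."""

    def __init__(self, warm_up):
        self._warm_up = warm_up        # stand-in for "run ssh to set up the socket"
        self._lock = threading.Lock()
        self._done = False

    def ensure_warm(self):
        # Threads queue on the lock; only the first one runs the warm-up.
        # If the warm-up raises, _done stays False and the next thread retries.
        with self._lock:
            if not self._done:
                self._warm_up()
                self._done = True

calls = []
warmer = ConnectionWarmer(lambda: calls.append("warmed"))
threads = [threading.Thread(target=warmer.ensure_warm) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The cost is that the other 19 threads wait on the lock while the first one warms up, which is the overhead being weighed above.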

Converted makeconnection to not use processTranscript, and that does seem to avoid the hang.

Why is makeconnection's use of processTranscript hanging? processTranscript tries to read the process's output (ssh has none), and waiting for the output to get read is for some reason hanging forever, despite the ssh process becoming a zombie. I actually rewrote the part of processTranscript that reads the process's output, to be a much simpler use of async, and that new implementation has the same problem. It's hanging, not throwing an exception. Most puzzling!

Hmm.. Perhaps the write handle is staying open even after ssh exits? processTranscript sets up a pipe for the process to write to, and the ssh process may inherit the FDs for that pipe (other than as stdout and stderr). If the write handle remains open, reading from the pipe would block. Since the ssh mux server is involved, and the handle might be passed to it or something, that seems at least possible as the cause. The Windows version of createProcess does not use that pipe, and switching to that version does avoid the hang. Yep! Setting the pipe's FDs to close on exec did avoid the hang.
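The mechanism can be reproduced outside git-annex. The following is a minimal sketch in Python for a POSIX system, not the actual fix: the function name reader_would_block is made up, and a backgrounded sleep stands in for the ssh mux server. When the write end of the pipe leaks into the long-lived grandchild, the reader sees no EOF even though the direct child has exited; when the write end is close-on-exec, EOF arrives as soon as the parent closes its copy.

```python
import os
import select
import subprocess

def reader_would_block(leak_write_end: bool) -> bool:
    """Return True if reading the pipe would still block after the direct child exits."""
    r, w = os.pipe()  # Python creates both ends close-on-exec by default
    # The shell exits at once; the backgrounded sleep stands in for a
    # long-lived helper (such as an ssh mux server) that inherits open FDs.
    leaked = (w,) if leak_write_end else ()
    child = subprocess.Popen(["sh", "-c", "sleep 2 &"], pass_fds=leaked)
    child.wait()
    os.close(w)  # the parent drops its own copy of the write end
    # EOF makes the read end readable; no EOF means some process still holds w.
    ready, _, _ = select.select([r], [], [], 0.5)
    os.close(r)
    return not ready
```

Calling reader_would_block(True) models the hang (the write end is passed down explicitly with pass_fds), and reader_would_block(False) models the behaviour after setting the pipe's FDs to close on exec.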

done --Joey