Please describe the problem.
Originally detected while testing a newer (7.20200204+git1-gb2c529b88-1~ndall+1) git-annex-standalone build against DataLad: one of our tests started to hang. Even with a different way of invoking external commands, which was recently added to DataLad, it looks like `git annex enableremote` simply does not return:
```
[DEBUG ] Async run ['git-annex', 'enableremote', '--debug', '-c', 'remote.target1.annex-ssh-options=-o ControlMaster=auto -S /home/travis/.cache/datalad/sockets/3d284eaf', 'target2']
[DEBUG ] Launching process ['git-annex', 'enableremote', '--debug', '-c', 'remote.target1.annex-ssh-options=-o ControlMaster=auto -S /home/travis/.cache/datalad/sockets/3d284eaf', 'target2']
[DEBUG ] Process 16621 started
[DEBUG ] Waiting for process 16621 to complete
[Level 5] Read 46 bytes from 16621[stdout]:
[Level 9] stdout| (merging target2/git-annex into git-annex...)
         |
[Level 5] Read 28 bytes from 16621[stdout]:
[Level 9] stdout| (recording state in git...)
         |
No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
```
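For reference, the hanging call from the log above boils down to roughly this invocation when run by hand (a sketch; the socket path is whatever that travis run happened to use and will differ per run):

```sh
git-annex enableremote --debug \
    -c 'remote.target1.annex-ssh-options=-o ControlMaster=auto -S /home/travis/.cache/datalad/sockets/3d284eaf' \
    target2
```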
Unfortunately ATM I do not have any further information, besides that we all failed to reproduce it locally, and the issue does not happen with an older build of git-annex-standalone (e.g. 7.20190819+git2-g908476a9b-1~ndall+1).
These strange options forcing use of unusual ssh sockets stick out.
You don't indicate what kind of remote is being enabled. --debug output is often useful.
The travis log is kind of enormous to take in, when all I want to know is the type of the remote, and maybe a --debug output. Grepping it seemed to suggest that target2 might be a regular git remote that has been fed into initremote, but I'm not sure?
thank you for the quick follow up joey. I got confused, since all `--debug` output from running `enableremote` is present in the log extract I have shown, AFAIK... Started digging: if `--debug` precedes `-c` -- it is not in effect; filed a dedicated issue. I repositioned `--debug` after `-c`. Results will appear later in the travis log and I will update here.
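(The reordered call, sketched with the same option value as in the log above:)

```sh
# same call, but with -c placed before --debug so that --debug takes effect
git-annex enableremote \
    -c 'remote.target1.annex-ssh-options=-o ControlMaster=auto -S /home/travis/.cache/datalad/sockets/3d284eaf' \
    --debug target2
```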
Re type of remote: it is actually a regular git remote. Before running `enableremote` on my laptop it looks like:

and then after `enableremote`:

here is the `--debug` output (well, logged by datalad, so there is some decoration) for the stuck `enableremote`:
Since this is a regular git remote, all enableremote does is the usual uuid probing. And that's where it hangs, sshing to the remote to run git-annex-shell configlist.
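Concretely, judging by the ssh invocations quoted further down in this thread, that probe is morally equivalent to running something like this by hand (a sketch; hostname and repository path are the ones seen in this report and will differ elsewhere):

```sh
# ssh to the remote and ask git-annex-shell for the repository config
# (including annex.uuid), which is what the uuid probing boils down to
ssh datalad-test -n -T \
    "git-annex-shell 'configlist' '/tmp/datalad_temp_test_publish_depends_q8i1lpv' '--debug'"
```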
One thing I notice is that you went to some trouble to pass annex-ssh-options with `-c` for target1. I assume for some important reason, probably having to do with making ssh work in this environment. But this remote is target2, so it doesn't use those options when it sshes to it.
yes, we establish control paths (so that all of us -- git, git-annex, and datalad -- can use them and thus not re-prompt for a password in each tool), and I also spotted that we are missing a control path for target2 -- filed an issue. But the curious point is that it doesn't stall with an older git-annex. I will rerun on travis with the same level of logging, but an older git-annex, to see whether that interaction is somehow different.
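A minimal sketch of that pattern, i.e. priming a shared multiplexed connection on a known socket path so that later git/git-annex/datalad calls can reuse it (the hostname and ControlPersist value here are illustrative; the socket path is the one from the log above):

```sh
# open (or reuse) a master connection backed by the shared socket; later ssh
# invocations pointed at the same -S path piggyback on it without re-authenticating
ssh -o ControlMaster=auto -o ControlPersist=15m \
    -S /home/travis/.cache/datalad/sockets/3d284eaf target1 true
```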
FWIW -- two excerpts: good session, bad session. Command to obtain:

Still too much detail and too many differences, unfortunately. But at least it shows that the same invocation of ssh proceeds normally in the good case (the first one is the only one in the bad session; in the good one, more follow):
and there is close to no chance for `"git-annex-shell 'configlist' '/tmp/datalad_temp_test_publish_depends_q8i1lpv' '--debug'"` to actually be the one which got stuck? (`ssh` is more likely, of course, but I am really not sure why...). Note that if you download the full log (urls are above), in that "bad" session:

```
$> grep configlist log.txt
[Level 9] stderr| [2020-02-26 20:21:35.087032521] read: ssh ["localhost","-S",".git/annex/ssh/localhost","-o","ControlMaster=auto","-o","ControlPersist=yes","-o","ControlMaster=auto","-S","/home/travis/.cache/datalad/sockets/2e18ea6b","-n","-T","git-annex-shell 'configlist' '/tmp/datalad_temp_test_publish_depends7k9yct_j' '--debug'"]
[Level 9] stderr| [2020-02-26 20:21:36.785462409] read: ssh ["datalad-test","-S",".git/annex/ssh/datalad-test","-o","ControlMaster=auto","-o","ControlPersist=yes","-n","-T","git-annex-shell 'configlist' '/tmp/datalad_temp_test_publish_dependsymsjz245' '--debug'"]
```
so -- there is a prior session (probably to another remote) with configlist, and that one works out just fine... just that it goes to localhost, not datalad-test (set to 127.0.0.1 in /etc/hosts).

It seems very unlikely that git-annex-shell configlist would get stuck, all it does is run git config --list.
Not impossible of course. It would be worth checking on the server if ssh has run the process.
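A rough sketch of that check on the target host (assuming a systemd journal for sshd; on syslog-based systems /var/log/auth.log would be the place to look instead):

```sh
# did sshd accept a connection and open a session for the probe?
journalctl -u ssh --since "1 hour ago" | grep -E 'Accepted|session opened'

# is git-annex-shell (still) running there, or did it already exit?
ps auxw | grep '[g]it-annex-shell'
```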
What git-annex version is it that does not exhibit the problem?
I know for sure that `8.20200226+git16-ge156a2b74-1~ndall+1` doesn't -- that is the one we are still distributing in neurodebian. In the past we tested other intermediate versions (like 7.20191114+git43-ge29663773) which also didn't exhibit it AT THAT TIME (different datalad, and possibly different git which gets bundled).

I will try to do that later.
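For completeness, a quick way to confirm which build a given run actually picked up (a sketch; `git-annex-standalone` is the NeuroDebian package name mentioned in this thread):

```sh
which git-annex
git annex version | head -n 1           # reports the running git-annex version
apt-cache policy git-annex-standalone   # which package version is installed/pinned
```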
Wait, the initial bug report said 7.20200204 had the problem, but you say the newer 8.20200226 does not? Either 8.20200226 was a typo of the wrong version, or it seems this bug has been fixed.
By `8.20200226+git16-ge156a2b74-1~ndall+1` I meant `7.20190819+git2-g908476a9b-1~ndall+1`, which we have in neurodebian ATM. `8.20200226+git16-ge156a2b74-1~ndall+1` is a bad one!

Added a bunch of `ps auxw -H` calls in the code and monitoring of the process for when it calls enableremote on target2. Here is the diff of `ps auxw -H` right before running `enableremote` and then right after the outside process notes that it has run (and probably hanged): full travis log. Unfortunately the "after" call has truncated command lines, but you can see new processes:
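(The listings themselves are only in the linked travis log; roughly, the before/after snapshots amount to the following sketch, with hypothetical file names and the long `-c` option omitted for brevity:)

```sh
ps auxw -H > ps-before.txt                   # process tree before the call
git-annex enableremote --debug target2 &     # the call that ends up hanging
sleep 600                                    # wait until it is clearly stuck
ps auxw -H > ps-after.txt                    # process tree while it still hangs
diff -u ps-before.txt ps-after.txt
```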
And no `configlist` among them, so it does seem to be ssh stalling for some reason.

It seems to me that the process tree you pasted there comes from the host where git-annex is running, and not the host that it sshes to in order to run git-annex-shell.
What I was suggesting is, if ssh $host git-annex-shell seems to be hanging, it then makes sense to go look at $host and see if git-annex-shell is running there, or has already run there (eg looking at the ssh log). That bisects the problem space to one side or the other.
It should be the same host:
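(A quick way to confirm that, given the /etc/hosts entry mentioned earlier; the expected output in the comment is an assumption based on this thread:)

```sh
getent hosts datalad-test   # expected: 127.0.0.1   datalad-test   (per /etc/hosts)
```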
FWIW more details of my struggle are available on datalad/pull/4265#issuecomment-597416284. The last suspicion was that maybe, while on travis, that process simply doesn't return fully if it opens a new socket whose path we provide but for which ssh is not yet "serving". Another side topic (might formalize later on), which might or might not be related, is that git-annex uses `ControlPersist=yes`, thus without any timeout. FWIW, when datalad starts one, we use `ControlPersist=15m` (we might want to reduce it, but at least it is not forever). Setting it to `ControlPersist=1m` (not `yes`) by git-annex, and hoping that ssh would close that eventually, might unstuck us... didn't try.
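In plain ssh terms, the difference being described (a sketch; socket path and hostname are placeholders):

```sh
# what git-annex passes: the master stays around until explicitly stopped
ssh -o ControlMaster=auto -o ControlPersist=yes -S /path/to/socket datalad-test true

# what datalad passes: the master exits on its own after 15 idle minutes
ssh -o ControlMaster=auto -o ControlPersist=15m -S /path/to/socket datalad-test true
```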
I'm having a hard time seeing what I can do about this. It's not clear that git-annex is at fault at all, I don't know how to reproduce it, etc.
You should be able to bisect what changed the behavior, I suppose.
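One possible shape for such a bisection (a sketch; it assumes a git-annex source checkout whose release tags match the version numbers, a stack-based build, and some `reproduce.sh` script that can trigger the hang, none of which are given in this thread):

```sh
git bisect start
git bisect bad  8.20200226      # tag of a build known to hang
git bisect good 7.20190819      # tag of a build known not to hang
git bisect run sh -c 'stack build || exit 125; timeout 900 ./reproduce.sh'
```

Here `exit 125` makes git bisect skip commits that fail to build, and `timeout` turns a hang into a non-zero (i.e. "bad") result.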
Re ControlPersist=yes, git-annex runs ssh -O stop on a socket file on exit, unless some other git-annex process is also using that socket. So they don't hang around stale. By passing options to make ssh use other sockets, you bypass that.
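For reference, the same check/stop can be done by hand against one of the datalad-managed sockets (socket path taken from the log above; the hostname is illustrative):

```sh
ssh -S /home/travis/.cache/datalad/sockets/3d284eaf -O check datalad-test   # is a master alive on this socket?
ssh -S /home/travis/.cache/datalad/sockets/3d284eaf -O stop  datalad-test   # ask that master to shut down
```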
I do see in https://github.com/datalad/datalad/pull/4265 that you seem to have reproduced the hang without running git-annex at all? Is that right? Seems to say not a git-annex bug. Perhaps some innocuous change in behavior by git-annex, that tickles whatever the real problem is.
overall - yes. It still might be just our "Runner" issue. I am not 100% sure of that, since a reimplementation (WitlessRunner, now used in many places in the master version), which shouldn't have similar drawbacks and should not stall, did stall as well, and we haven't dug deeper yet. Another "anecdotal" observation was that I saw the "python -m nose" which runs our tests not exit cleanly, waiting for some process to be done, until I Ctrl-C it... didn't analyze that situation either. I will try to find some time to dig into this deeper and/or bisect (which isn't easy/fun, since we haven't managed to reproduce it locally yet). Thank you for the details and comments about sockets -- I am yet to fully analyze that situation as well. I just know that I do have some hanging around, and they might have been initiated by a git-annex process which we start pointing to a not-yet-initiated socket.
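A quick way to see whether such leftovers are what keeps the test runner from exiting (a sketch; the process patterns are guesses based on this thread):

```sh
# after the test run looks finished but "python -m nose" does not return,
# check for background processes that may still be holding its pipes open
pgrep -af 'ssh.*ControlMaster'   # lingering ssh masters started with shared sockets
pgrep -af 'git-annex'            # any git-annex child still running
```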
Seems likely this was the same hang as https://git-annex.branchable.com/bugs/Recent_hang_with_rsync_remote_with_older_systems___40__Xenial__44___Jessie__41__/
If you can still reproduce this let me know, but I'm closing it on that basis.