SSH Client hangs on execute command

Bug #1577632 reported by Franklin Naval on 2016-05-03
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
tempest
Undecided
Franklin Naval

Bug Description

It appears that the Tempest ssh client is hanging while trying to execute a command on a virtual machine.

This change appears to have broken the ssh_client.execute() functionality: https://github.com/openstack/tempest/commit/68c197e890c3847e52704216b83c873c2313b4e2

I confirmed this by testing the ssh.py file prior to the change and the hang no longer occurs.

Steps to reproduce:
1. With devstack, pull down neutron-lbaas repo
2. export TEMPEST_CONFIG_DIR=/opt/stack/tempest/etc
3. tox -e scenartio

Result:
Here's an example log where the hang occurs:
2016-05-03 02:36:49.929 12750 DEBUG tempest.common.utils.linux.remote_client [-] Remote command: set -eu -o pipefail; PATH=$PATH:/sbi
n; while true; do sudo nc -ll -p 80 -e sh /tmp/script1; done > /dev/null & exec_command /opt/stack/neutron-lbaas/.tox/scenario/local/
lib/python2.7/site-packages/tempest/common/utils/linux/remote_client.py:43
2016-05-03 02:36:49.929 12750 INFO tempest.lib.common.ssh [-] Creating ssh connection to '172.24.4.75' as 'cirros' with public key au
thentication
2016-05-03 02:36:49.935 12750 INFO paramiko.transport [-] Connected (version 2.0, client dropbear_2012.55)
2016-05-03 02:36:50.055 12750 INFO paramiko.transport [-] Authentication (publickey) successful!
2016-05-03 02:36:50.057 12750 INFO tempest.lib.common.ssh [-] ssh connection to cirros@172.24.4.75 successfully created
<<<HANGS HERE INDEFINITELY>>>>

Note: This is preventing new code from being submitted into the repo. So please make this a high priority.

summary: - SSH Client hangs on connect to compute virtual machine
+ SSH Client hangs on execute command
Matthew Treinish (treinish) wrote :

Do you have a link to the actual failure logs on the log server available? That change has been running everywhere for sometime, so I want to check other system info when things hang

Although, looking at the command you're running that will always hang it's an infinite loop running in the fg, that will be blocking. I'm not sure this is actually a bug in tempest, but when we fixed the bug in the patch you mentioned ssh started behaving properly and blocking until it actually exited.

Changed in tempest:
status: New → Incomplete
Franklin Naval (franknaval) wrote :

Unfortunately the failure and the logs are on a virtual server that I'm running devstack on.

I've run the tests a few times and now begin to see the hanging occur again - but it's inconsistent whereas the hang occurred consistently with the newer ssh.py.

I'll try to gather more info by debugging further using testr and pdb as outlined here: https://wiki.openstack.org/wiki/Testr#Debugging_.28pdb.29_Tests

Franklin Naval (franknaval) wrote :

So, I debugged this further:

* paramiko version 1.16.0 vs. 1.17.0 - still hanging
* open SSH version: "OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6, OpenSSL 1.0.1f 6 Jan 2014” and "OpenSSH_6.9p1 Ubuntu-2ubuntu0.1, OpenSSL 1.0.2d 9 Jul 2015” - still hanging
* Ubuntu OS as I trie both 14.04 and 15.10 - still hanging
* changed virtual machine location datacenter/hypervisor - still hanging

* replaced the Tempest SSH client with another ssh client - WORKS!

So, I think it might be the Tempest SSH client that is having issues - I suspect that the transport never closes causing a deadlock.

Here's the base ssh client that I used to replace the Tempest one: https://gist.github.com/blakev/697acbc242a683d4eb38 (thanks to blakev!)

Please take a look.

Franklin Naval (franknaval) wrote :

I did this to fix the issue:
https://review.openstack.org/#/c/164828/47..52/neutron_lbaas/tests/tempest/v2/scenario/base.py

Seems like the ssh client's execute command may have an issue with the class being imported or closing.

Changed in tempest:
assignee: nobody → Franklin Naval (franknaval)
status: Incomplete → Fix Committed
Alexander Gubanov (ogubanov) wrote :

I've faced with the same problem and found that root cause deadlock in paramiko.
Here is gdb trace from out CI http://paste.openstack.org/raw/560777/

Last tempest test "tempest.scenario.test_server_basic_ops.TestServerBasicOps.test_server_basic_ops"
calls remote ssh exec_command which uses paramiko recv_exit_status method that hangs and keep waiting event from another thread (which is died and never happened) forever.
Seems, it should have some timeout for receive.

Also, I found warnings in paramiko docs about method "recv_exit_status"
http://docs.paramiko.org/en/2.0/api/channel.html#paramiko.channel.Channel.recv_exit_status
http://docs.paramiko.org/en/2.0/api/channel.html#paramiko.channel.Channel.exit_status_ready

and unresolved issues with the
https://github.com/paramiko/paramiko/search?q=recv_exit_status&type=Issues&utf8=%E2%9C%93

Do we have this issue on tempest gates ?

Michael Polenchuk (mpolenchuk) wrote :

100% always reproduces (hangs) with opendaylight on board:

2017-03-01 10:37:38.215 750 INFO tempest.lib.common.ssh [-] Creating ssh connection to '172.16.0.134:22' as 'cirros' with public key authentication
2017-03-01 10:37:53.243 750 INFO paramiko.transport [-] Connected (version 2.0, client dropbear_2015.67)
2017-03-01 10:37:53.351 750 INFO paramiko.transport [-] Authentication (publickey) successful!
2017-03-01 10:37:53.373 750 INFO tempest.lib.common.ssh [-] ssh connection to cirros@172.16.0.134 successfully created
2017-03-01 10:37:53.488 750 DEBUG tempest.common.utils.linux.remote_client [-] Remote command: set -eu -o pipefail; PATH=$PATH:/sbin; curl http://169.254.169.254/latest/meta-data/public-ipv4 exec_command tempest/common/utils/linux/remote_client.py:90
2017-03-01 10:37:53.488 750 INFO tempest.lib.common.ssh [-] Creating ssh connection to '172.16.0.134:22' as 'cirros' with public key authentication
2017-03-01 10:37:53.490 750 INFO paramiko.transport [-] Connected (version 2.0, client dropbear_2015.67)
2017-03-01 10:37:53.598 750 INFO paramiko.transport [-] Authentication (publickey) successful!
2017-03-01 10:37:53.621 750 INFO tempest.lib.common.ssh [-] ssh connection to cirros@172.16.0.134 successfully created
2017-03-01 10:40:02.446 750 INFO tempest.lib.common.rest_client [req-f562cab5-4786-4276-9492-75ce0875daf6 ] Request (TestServerBasicOps:_run_cleanups): 202 DELETE http://172.16.0.3:8774/v2.1/os-floating-ips/1213269b-3c1c-4368-ba02-86a0c7b7a714 1.517s

Changed in tempest:
status: Fix Committed → Confirmed
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers