SSH Client hangs on execute command

Bug #1577632 reported by Franklin Naval
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
tempest
Expired
Undecided
Unassigned

Bug Description

It appears that the Tempest ssh client is hanging while trying to execute a command on a virtual machine.

This change appears to have broken the ssh_client.execute() functionality: https://github.com/openstack/tempest/commit/68c197e890c3847e52704216b83c873c2313b4e2

I confirmed this by testing the ssh.py file prior to the change and the hang no longer occurs.

Steps to reproduce:
1. With devstack, pull down neutron-lbaas repo
2. export TEMPEST_CONFIG_DIR=/opt/stack/tempest/etc
3. tox -e scenartio

Result:
Here's an example log where the hang occurs:
2016-05-03 02:36:49.929 12750 DEBUG tempest.common.utils.linux.remote_client [-] Remote command: set -eu -o pipefail; PATH=$PATH:/sbi
n; while true; do sudo nc -ll -p 80 -e sh /tmp/script1; done > /dev/null & exec_command /opt/stack/neutron-lbaas/.tox/scenario/local/
lib/python2.7/site-packages/tempest/common/utils/linux/remote_client.py:43
2016-05-03 02:36:49.929 12750 INFO tempest.lib.common.ssh [-] Creating ssh connection to '172.24.4.75' as 'cirros' with public key au
thentication
2016-05-03 02:36:49.935 12750 INFO paramiko.transport [-] Connected (version 2.0, client dropbear_2012.55)
2016-05-03 02:36:50.055 12750 INFO paramiko.transport [-] Authentication (publickey) successful!
2016-05-03 02:36:50.057 12750 INFO tempest.lib.common.ssh [-] ssh connection to cirros@172.24.4.75 successfully created
<<<HANGS HERE INDEFINITELY>>>>

Note: This is preventing new code from being submitted into the repo. So please make this a high priority.

summary: - SSH Client hangs on connect to compute virtual machine
+ SSH Client hangs on execute command
Revision history for this message
Matthew Treinish (treinish) wrote :

Do you have a link to the actual failure logs on the log server available? That change has been running everywhere for sometime, so I want to check other system info when things hang

Although, looking at the command you're running that will always hang it's an infinite loop running in the fg, that will be blocking. I'm not sure this is actually a bug in tempest, but when we fixed the bug in the patch you mentioned ssh started behaving properly and blocking until it actually exited.

Changed in tempest:
status: New → Incomplete
Revision history for this message
Franklin Naval (franknaval) wrote :

Unfortunately the failure and the logs are on a virtual server that I'm running devstack on.

I've run the tests a few times and now begin to see the hanging occur again - but it's inconsistent whereas the hang occurred consistently with the newer ssh.py.

I'll try to gather more info by debugging further using testr and pdb as outlined here: https://wiki.openstack.org/wiki/Testr#Debugging_.28pdb.29_Tests

Revision history for this message
Franklin Naval (franknaval) wrote :

So, I debugged this further:

* paramiko version 1.16.0 vs. 1.17.0 - still hanging
* open SSH version: "OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6, OpenSSL 1.0.1f 6 Jan 2014” and "OpenSSH_6.9p1 Ubuntu-2ubuntu0.1, OpenSSL 1.0.2d 9 Jul 2015” - still hanging
* Ubuntu OS as I trie both 14.04 and 15.10 - still hanging
* changed virtual machine location datacenter/hypervisor - still hanging

* replaced the Tempest SSH client with another ssh client - WORKS!

So, I think it might be the Tempest SSH client that is having issues - I suspect that the transport never closes causing a deadlock.

Here's the base ssh client that I used to replace the Tempest one: https://gist.github.com/blakev/697acbc242a683d4eb38 (thanks to blakev!)

Please take a look.

Revision history for this message
Franklin Naval (franknaval) wrote :

I did this to fix the issue:
https://review.openstack.org/#/c/164828/47..52/neutron_lbaas/tests/tempest/v2/scenario/base.py

Seems like the ssh client's execute command may have an issue with the class being imported or closing.

Changed in tempest:
assignee: nobody → Franklin Naval (franknaval)
status: Incomplete → Fix Committed
Revision history for this message
Alexander Gubanov (ogubanov) wrote :

I've faced with the same problem and found that root cause deadlock in paramiko.
Here is gdb trace from out CI http://paste.openstack.org/raw/560777/

Last tempest test "tempest.scenario.test_server_basic_ops.TestServerBasicOps.test_server_basic_ops"
calls remote ssh exec_command which uses paramiko recv_exit_status method that hangs and keep waiting event from another thread (which is died and never happened) forever.
Seems, it should have some timeout for receive.

Also, I found warnings in paramiko docs about method "recv_exit_status"
http://docs.paramiko.org/en/2.0/api/channel.html#paramiko.channel.Channel.recv_exit_status
http://docs.paramiko.org/en/2.0/api/channel.html#paramiko.channel.Channel.exit_status_ready

and unresolved issues with the
https://github.com/paramiko/paramiko/search?q=recv_exit_status&type=Issues&utf8=%E2%9C%93

Do we have this issue on tempest gates ?

Revision history for this message
Michael Polenchuk (mpolenchuk) wrote :

100% always reproduces (hangs) with opendaylight on board:

2017-03-01 10:37:38.215 750 INFO tempest.lib.common.ssh [-] Creating ssh connection to '172.16.0.134:22' as 'cirros' with public key authentication
2017-03-01 10:37:53.243 750 INFO paramiko.transport [-] Connected (version 2.0, client dropbear_2015.67)
2017-03-01 10:37:53.351 750 INFO paramiko.transport [-] Authentication (publickey) successful!
2017-03-01 10:37:53.373 750 INFO tempest.lib.common.ssh [-] ssh connection to cirros@172.16.0.134 successfully created
2017-03-01 10:37:53.488 750 DEBUG tempest.common.utils.linux.remote_client [-] Remote command: set -eu -o pipefail; PATH=$PATH:/sbin; curl http://169.254.169.254/latest/meta-data/public-ipv4 exec_command tempest/common/utils/linux/remote_client.py:90
2017-03-01 10:37:53.488 750 INFO tempest.lib.common.ssh [-] Creating ssh connection to '172.16.0.134:22' as 'cirros' with public key authentication
2017-03-01 10:37:53.490 750 INFO paramiko.transport [-] Connected (version 2.0, client dropbear_2015.67)
2017-03-01 10:37:53.598 750 INFO paramiko.transport [-] Authentication (publickey) successful!
2017-03-01 10:37:53.621 750 INFO tempest.lib.common.ssh [-] ssh connection to cirros@172.16.0.134 successfully created
2017-03-01 10:40:02.446 750 INFO tempest.lib.common.rest_client [req-f562cab5-4786-4276-9492-75ce0875daf6 ] Request (TestServerBasicOps:_run_cleanups): 202 DELETE http://172.16.0.3:8774/v2.1/os-floating-ips/1213269b-3c1c-4368-ba02-86a0c7b7a714 1.517s

Changed in tempest:
status: Fix Committed → Confirmed
Revision history for this message
Martin Kopec (mkopec) wrote :

Considering that the last update was made more than 3 years ago, the problem could be in paramiko, ssh client, tempest or neutron-lbaas and the code of any of the mentioned projects has changed significantly over the time I'll mark this as Incomplete now. Feel free to reopen this if you experience the issue again (with current steps to reproduce).

Changed in tempest:
assignee: Franklin Naval (franknaval) → nobody
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for tempest because there has been no activity for 60 days.]

Changed in tempest:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.