gate-tempest-dsvm-ironic-pxe_ssh-dib failing : test_baremetal_server_ops ssh timeout

Bug #1571492 reported by Ian Wienand
26
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Ironic
Fix Released
Critical
Lucas Alvares Gomes
diskimage-builder
Fix Released
Critical
Unassigned
networking-fortinet
New
Undecided
Unassigned
tempest
Fix Released
High
Lucas Alvares Gomes

Bug Description

gate-tempest-dsvm-ironic-pxe_ssh-dib is constantly failing in test_baremetal_server_ops; unable to ssh into the host.

Here are the last 5 runs all having the same failure [1,2,3,4,5]

I checked the ironic whiteboard at [6] which currently states that everything is OK

My knowledge here is obviously rather limited, but I didn't see any smoking gun errors in the logs. Comparing ironic & nova logs to a last known good run everything seemed in order. So maybe it's in networking but obviously this needs triage.

[1] http://logs.openstack.org/16/303416/4/check/gate-tempest-dsvm-ironic-pxe_ssh-dib/a5812e2/console.html
[2] http://logs.openstack.org/53/219453/4/check/gate-tempest-dsvm-ironic-pxe_ssh-dib/77c1fc7/console.html
[3] http://logs.openstack.org/16/302516/1/check/gate-tempest-dsvm-ironic-pxe_ssh-dib/1583ef7/console.html
[4] http://logs.openstack.org/10/306510/1/check/gate-tempest-dsvm-ironic-pxe_ssh-dib/12de8f2/console.html
[5] http://logs.openstack.org/98/305498/1/check/gate-tempest-dsvm-ironic-pxe_ssh-dib/0f57bec/console.html
[6] https://etherpad.openstack.org/p/IronicWhiteBoard

Revision history for this message
Ian Wienand (iwienand) wrote :

hmm, the ironic gate [1] doesn't look that healthy either.

[1] https://review.openstack.org/#/q/project:openstack/ironic

Revision history for this message
Ian Wienand (iwienand) wrote :

This seems to affect the ironic gate too, I don't seem to see any existing bugs

summary: - gate-tempest-dsvm-ironic-pxe_ssh-dib failing : est_baremetal_server_ops
+ gate-tempest-dsvm-ironic-pxe_ssh-dib failing : test_baremetal_server_ops
+ ssh timeout
Revision history for this message
Dmitry Tantsur (divius) wrote :

Timeout happens at verify_partitions:

2016-04-14 20:07:12.806 | Captured traceback:
2016-04-14 20:07:12.806 | ~~~~~~~~~~~~~~~~~~~
2016-04-14 20:07:12.806 | Traceback (most recent call last):
2016-04-14 20:07:12.807 | File "tempest/test.py", line 113, in wrapper
2016-04-14 20:07:12.807 | return f(self, *func_args, **func_kwargs)
2016-04-14 20:07:12.807 | File "/opt/stack/new/ironic/ironic_tempest_plugin/tests/scenario/test_baremetal_basic_ops.py", line 123, in test_baremetal_server_ops
2016-04-14 20:07:12.807 | self.verify_partition(vm_client, 'ephemeral0', '/mnt', eph_size)
2016-04-14 20:07:12.807 | File "/opt/stack/new/ironic/ironic_tempest_plugin/tests/scenario/test_baremetal_basic_ops.py", line 72, in verify_partition
2016-04-14 20:07:12.807 | device = client.exec_command(cmd).rstrip('\n')
2016-04-14 20:07:12.807 | File "tempest/common/utils/linux/remote_client.py", line 44, in exec_command
2016-04-14 20:07:12.808 | return self.ssh_client.exec_command(cmd)
2016-04-14 20:07:12.808 | File "tempest/lib/common/ssh.py", line 135, in exec_command
2016-04-14 20:07:12.808 | ready = poll.poll(self.channel_timeout)
2016-04-14 20:07:12.808 | File "/usr/local/lib/python2.7/dist-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
2016-04-14 20:07:12.808 | raise TimeoutException()
2016-04-14 20:07:12.808 | fixtures._fixtures.timeout.TimeoutException

The first guess is that the instance is not reachable.

Changed in ironic:
status: New → Confirmed
importance: Undecided → Critical
Changed in diskimage-builder:
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/307048

Changed in ironic:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
status: Confirmed → In Progress
Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by Lucas Alvares Gomes (<email address hidden>) on branch: master
Review: https://review.openstack.org/307048
Reason: vdrok reproduced locally and the polling code here is hanging https://github.com/openstack/tempest-lib/blob/master/tempest_lib/common/ssh.py#L127-L160

non-polling works fine

Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → nobody
status: In Progress → Confirmed
Revision history for this message
Vladyslav Drok (vdrok) wrote :

After some attempts to make it work without polling, I've had not much success, it still hangs sometimes. Maybe there is a locking problem in paramiko's buffered_pipe:

Traceback (most recent call last):
  File "tempest/test.py", line 113, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "/opt/stack/ironic/ironic_tempest_plugin/tests/scenario/test_baremetal_basic_ops.py", line 124, in test_baremetal_server_ops
    self.verify_partition(vm_client, 'ephemeral0', '/mnt', eph_size)
  File "/opt/stack/ironic/ironic_tempest_plugin/tests/scenario/test_baremetal_basic_ops.py", line 87, in verify_partition
    num_bytes = client.exec_command(cmd).rstrip('\n')
  File "tempest/common/utils/linux/remote_client.py", line 44, in exec_command
    return self.ssh_client.exec_command(cmd)
  File "tempest/lib/common/ssh.py", line 145, in exec_command
    if channel.recv_ready():
  File "/usr/local/lib/python2.7/dist-packages/paramiko/channel.py", line 580, in recv_ready
    return self.in_buffer.read_ready()
  File "/usr/local/lib/python2.7/dist-packages/paramiko/buffered_pipe.py", line 105, in read_ready
    self._lock.acquire()

Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :
Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :

This appears to fix locally: https://review.openstack.org/#/c/307720/

And this dependent patch is testing it: https://review.openstack.org/#/c/307705/

Changed in ironic:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tempest (master)

Change abandoned by Lucas Alvares Gomes (<email address hidden>) on branch: master
Review: https://review.openstack.org/307720
Reason: Actually this is not needed. The error I'm seem is that after exec_command the channel is not closed as the paramiko documentation says [0] and that causes the ssh to hang

[0] http://docs.paramiko.org/en/1.16/api/channel.html#paramiko.channel.Channel.exec_command

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/307952

Changed in tempest:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
status: New → In Progress
Changed in tempest:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/307952
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=68c197e890c3847e52704216b83c873c2313b4e2
Submitter: Jenkins
Branch: master

commit 68c197e890c3847e52704216b83c873c2313b4e2
Author: Lucas Alvares Gomes <email address hidden>
Date: Tue Apr 19 18:18:05 2016 +0100

    SSH: Do not rely on the .closed attribute

    This patch is updating the logic in the SSH module to not rely on the
    "closed" attribute of the channel object when polling the data. The
    "closed" attribute may not set to True as part of the exec_command()
    execution stack and in that case would cause the exec_command() (from
    the tempest SSH module) to hang.

    This patch also uses the context manager when creating the channel so
    it gets properly closed at the end of the execution scope.

    Closes-Bug: #1571492
    Change-Id: I5169faf25c36cfb494c9c1fafb14fb4f7f82cd9e

Changed in tempest:
status: In Progress → Fix Released
Changed in ironic:
status: Confirmed → Fix Released
Revision history for this message
Franklin Naval (franknaval) wrote :

This also appears to be happening with the neutron-lbaas tests.
Reverting this change made the ssh hang disappear: https://github.com/openstack/tempest/commit/68c197e890c3847e52704216b83c873c2313b4e2

related to: https://bugs.launchpad.net/tempest/+bug/1577632

Changed in diskimage-builder:
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.