PowerVM driver uses stale SSH connection if hypervisor is rebooted

Bug #1177104 reported by Matt Riedemann
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Matt Riedemann

Bug Description

While trying to run this (on grizzly 2013.1):

tempest/tempest/tests/compute/images/test_images_oneserver.py:ImagesOneServerTestJSON.test_create_second_image_when_first_image_is_being_saved

We get this:

2013-05-06 14:53:51.344 5682 DEBUG nova.utils [-] Running cmd (SSH): lshwres -r mem --level sys -F configurable_sys_mem,curr_avail_sys_mem,sys_firmware_mem ssh_execute /usr/lib/python2.6/site-packages/nova/utils.py:288
2013-05-06 14:53:51.344 5682 ERROR nova.manager [-] Error during ComputeManager.update_available_resource: SSH session not active
2013-05-06 14:53:51.344 5682 TRACE nova.manager Traceback (most recent call last):
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/manager.py", line 244, in periodic_tasks
2013-05-06 14:53:51.344 5682 TRACE nova.manager task(self, context)
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 3898, in update_available_resource
2013-05-06 14:53:51.344 5682 TRACE nova.manager nodenames = set(self.driver.get_available_nodes())
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/virt/driver.py", line 849, in get_available_nodes
2013-05-06 14:53:51.344 5682 TRACE nova.manager stats = self.get_host_stats(refresh=True)
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/virt/powervm/driver.py", line 97, in get_host_stats
2013-05-06 14:53:51.344 5682 TRACE nova.manager return self._powervm.get_host_stats(refresh=refresh)
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/virt/powervm/operator.py", line 170, in get_host_stats
2013-05-06 14:53:51.344 5682 TRACE nova.manager self._update_host_stats()
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/virt/powervm/operator.py", line 174, in _update_host_stats
2013-05-06 14:53:51.344 5682 TRACE nova.manager memory_info = self._operator.get_memory_info()
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/virt/powervm/operator.py", line 852, in get_memory_info
2013-05-06 14:53:51.344 5682 TRACE nova.manager output = self.run_vios_command(cmd)
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/virt/powervm/operator.py", line 934, in run_vios_command
2013-05-06 14:53:51.344 5682 TRACE nova.manager check_exit_code=check_exit_code)
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/nova/utils.py", line 297, in ssh_execute
2013-05-06 14:53:51.344 5682 TRACE nova.manager stdin_stream, stdout_stream, stderr_stream = ssh.exec_command(cmd)
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/paramiko/client.py", line 364, in exec_command
2013-05-06 14:53:51.344 5682 TRACE nova.manager chan = self._transport.open_session()
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/paramiko/transport.py", line 661, in open_session
2013-05-06 14:53:51.344 5682 TRACE nova.manager return self.open_channel('session')
2013-05-06 14:53:51.344 5682 TRACE nova.manager File "/usr/lib/python2.6/site-packages/paramiko/transport.py", line 730, in open_channel
2013-05-06 14:53:51.344 5682 TRACE nova.manager raise SSHException('SSH session not active')
2013-05-06 14:53:51.344 5682 TRACE nova.manager SSHException: SSH session not active

This happened because the test had been run earlier and the hypervisor locked up. The tester rebooted the hypervisor, but running the test again results in the SSHException, because the powervm operator.py code uses a cached paramiko SSH client connection:

https://github.com/openstack/nova/blob/master/nova/virt/powervm/operator.py#L467

The code should check that the cached connection is still active before using it, rather than assuming it's OK and then failing because it's bad.
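
A minimal sketch of such a check follows. It is an illustration only, not the code from the merged fix. It relies on two paramiko calls, SSHClient.get_transport() and Transport.is_active(); the connect argument and the Fake* classes are hypothetical stand-ins so the example runs without paramiko or a real hypervisor. Note that is_active() only reflects what paramiko has already detected, so a connection that died very recently may still report as active.

```python
def check_connection(ssh, connect):
    """Return a usable SSH client, reconnecting if the cached one is stale.

    ssh     -- cached paramiko.SSHClient-like object, or None
    connect -- hypothetical zero-argument callable that opens and returns
               a new client (stands in for the driver's own connect helper)
    """
    transport = ssh.get_transport() if ssh is not None else None
    if transport is not None and transport.is_active():
        return ssh  # cached connection still looks alive, reuse it

    # Stale or missing connection: drop the old client and open a new one.
    if ssh is not None:
        try:
            ssh.close()
        except Exception:
            pass  # the old connection is already dead, nothing to clean up
    return connect()


# Stand-ins (assumptions for this demo, not part of nova or paramiko).
class FakeTransport(object):
    def __init__(self, active):
        self._active = active

    def is_active(self):
        return self._active


class FakeClient(object):
    def __init__(self, active):
        self._transport = FakeTransport(active)
        self.closed = False

    def get_transport(self):
        return self._transport

    def close(self):
        self.closed = True


# Simulate the hypervisor reboot: the cached client's transport is dead.
stale = FakeClient(active=False)
fresh = check_connection(stale, lambda: FakeClient(active=True))
```

Calling a check like this immediately before each SSH command would let the periodic update_available_resource task recover after a hypervisor reboot instead of raising SSHException on every run.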

Tags: powervm
Matt Riedemann (mriedem)
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/28451

Changed in nova:
status: New → In Progress
Matt Riedemann (mriedem)
tags: added: powervm
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/28451
Committed: http://github.com/openstack/nova/commit/75545ba974afb1ac76186a8dad359ae0aa751c01
Submitter: Jenkins
Branch: master

commit 75545ba974afb1ac76186a8dad359ae0aa751c01
Author: Matt Riedemann <email address hidden>
Date: Thu May 9 19:15:25 2013 -0700

    Check cached SSH connection in PowerVM driver

    The PowerVM operator code caches an SSH connection to the hypervisor
    which can become invalid if the connection to the hypervisor is not
    terminated cleanly, e.g. the hypervisor is rebooted while the compute
    node is connected to it.

    This code checks an existing SSH connection object to see if its
    transport is still alive and if not, re-establishes the connection
    before attempting to run SSH commands on the hypervisor.

    Also added some basic unit tests to cover the new check_connection
    function in nova.virt.powervm.common.

    Fixes bug 1177104

    Change-Id: I08079cf0d9e60e1d8902d32d684d979b06f7f287

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/grizzly)

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/29110

Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-1
status: Fix Committed → Fix Released
Alan Pevec (apevec)
tags: removed: grizzly-backport-potential
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-1 → 2013.2