Suspended instances cannot resume after hypervisor reboot

Bug #1052696 reported by Rafi Khardalian
This bug affects 4 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Undecided
Assigned to: Rafi Khardalian
Milestone: (none shown)

Bug Description

With the libvirt driver, suspended instances cannot be resumed after the hypervisor has been rebooted. The problem is simple to reproduce:

1. Create a new tenant (VLAN manager network model should be configured) and launch an instance within this tenant.

2. Suspend the instance.

3. Reboot the hypervisor on which the suspended instance rests. Assume for the sake of discussion it comes back up without a problem and restarts the compute service.

4. Resume the instance. It will fail.

Resume fails because we expect the physical system to be in a state it is no longer in. After the reboot, neither the networking (bridge, VLAN, iptables rules, etc.) nor the block device connections are in place. The resume() method calls _create_domain(), which does not rebuild any of these dependencies. We should call _create_domain_and_network() instead, so that resume makes no assumptions about the state of the hypervisor.
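The difference between the two code paths can be illustrated with a toy model. This is only a sketch, not nova's libvirt driver: FakeHost, ToyDriver, resume_old() and resume_fixed() are hypothetical names, and the method bodies only imitate what the description says _create_domain() and _create_domain_and_network() do.

```python
# Toy model only: every class and method body here is hypothetical and
# merely imitates the behaviour described in this bug report.


class FakeHost:
    """Tracks which host-side resources currently exist."""

    def __init__(self):
        self.bridges = set()
        self.volumes = set()

    def reboot(self):
        # A reboot discards bridges, VLANs, iptables rules and volume
        # connections; nothing recreates them automatically.
        self.bridges.clear()
        self.volumes.clear()


class ToyDriver:
    def __init__(self, host):
        self.host = host

    def _create_domain(self, instance):
        # Starts the guest and assumes its dependencies already exist.
        if instance["bridge"] not in self.host.bridges:
            raise RuntimeError("bridge %s is missing" % instance["bridge"])
        for volume in instance["volumes"]:
            if volume not in self.host.volumes:
                raise RuntimeError("volume %s is not connected" % volume)
        return "running"

    def _create_domain_and_network(self, instance):
        # Rebuilds the dependencies (a no-op if they are already present),
        # then starts the guest.
        self.host.bridges.add(instance["bridge"])
        self.host.volumes.update(instance["volumes"])
        return self._create_domain(instance)

    def resume_old(self, instance):
        # Behaviour before the fix: trust the host state.
        return self._create_domain(instance)

    def resume_fixed(self, instance):
        # Behaviour proposed in this report: rebuild, then start.
        return self._create_domain_and_network(instance)


if __name__ == "__main__":
    host = FakeHost()
    driver = ToyDriver(host)
    vm = {"bridge": "br100", "volumes": ["vol-1"]}
    driver._create_domain_and_network(vm)  # initial launch
    host.reboot()                          # step 3 of the reproduction
    try:
        driver.resume_old(vm)
    except RuntimeError as exc:
        print("old resume failed:", exc)
    print("fixed resume:", driver.resume_fixed(vm))
```

Because the rebuild step is idempotent in this model, the fixed path also works when the host was never rebooted and everything is still in place.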

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/13251

Changed in nova:
assignee: nobody → Rafi Khardalian (rkhardalian)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/13251
Committed: http://github.com/openstack/nova/commit/99b5e96795b8475f14d53bbc3845e7bace730963
Submitter: Jenkins
Branch: master

commit 99b5e96795b8475f14d53bbc3845e7bace730963
Author: Rafi Khardalian <email address hidden>
Date: Tue Sep 4 13:37:46 2012 +0000

    Allow VMs to be resumed after a hypervisor reboot

    Fixes bug 1052696.

    Update the compute manager to pass network_info and block_device_info
    to the driver.resume() and update all virtualization drivers to accept
    the new arguments.

    For libvirt, change resume() to use _create_domain_and_network()
    rather than _create_domain(). This eliminates the assumption that the
    network and block device connections remained in place from the period
    between the VM being suspended and resumed. Instead, all the
    networking and block connections will be rebuilt on resume (in case
    they are missing) as is the case after a hypervisor reboot.

    Change-Id: I6e19ec42f7e929678abce8f276c0a6e91f1fa8af
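The interface change described in the commit message can be pictured roughly as follows. This is a hedged sketch and not the committed code: ToyComputeManager, RecordingDriver and the two lookup callables are hypothetical, and the resume() signature shown is an assumption based only on the wording "pass network_info and block_device_info to the driver.resume()".

```python
# Sketch under stated assumptions; none of these names come from nova.


class RecordingDriver:
    """Hypothetical virt driver that records what resume() receives."""

    def __init__(self):
        self.calls = []

    def resume(self, instance, network_info, block_device_info=None):
        # A driver that does not need the extra arguments can ignore them;
        # a libvirt-like driver would use them to rebuild host state.
        self.calls.append((instance, network_info, block_device_info))


class ToyComputeManager:
    """Hypothetical manager that gathers state before calling the driver."""

    def __init__(self, driver, network_lookup, block_device_lookup):
        self.driver = driver
        self._network_lookup = network_lookup
        self._block_device_lookup = block_device_lookup

    def resume_instance(self, instance):
        # Look up what the instance needs, then hand it to the driver so
        # the driver does not depend on whatever survived on the host.
        network_info = self._network_lookup(instance)
        block_device_info = self._block_device_lookup(instance)
        self.driver.resume(instance, network_info, block_device_info)


if __name__ == "__main__":
    driver = RecordingDriver()
    manager = ToyComputeManager(
        driver,
        network_lookup=lambda inst: [{"bridge": "br100", "vlan": 100}],
        block_device_lookup=lambda inst: {"block_device_mapping": []},
    )
    manager.resume_instance("vm1")
    print(driver.calls)
```

Passing the information in from the manager keeps every driver on the same call signature, even if only some of them act on it.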

Changed in nova:
status: In Progress → Fix Committed
tags: added: folsom-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/folsom)

Fix proposed to branch: stable/folsom
Review: https://review.openstack.org/16877

Thierry Carrez (ttx)
Changed in nova:
milestone: none → grizzly-2
status: Fix Committed → Fix Released
Mark McLoughlin (markmc) wrote :

From the review:

  zu: I would prefer abandoned because this doesn't meet the guidelines in my opinion.
  rmk: I disagree with keeping a bug like this around in a stable release but I'm abandoning by request.
  vishy: We should re-examine this. This actually is a problem because security group rules don't get recreated on reboot. So basically doing a suspend / reboot host / resume means you could have an instance without security group rules.

tags: removed: folsom-backport-potential
Thierry Carrez (ttx)
Changed in nova:
milestone: grizzly-2 → 2013.1
Sean Dague (sdague)
no longer affects: nova/folsom