Block storage connections are NOT restored on system reboot

Bug #1036902 reported by Rafi Khardalian
This bug affects 2 people
Affects                    Status         Importance   Assigned to       Milestone
OpenStack Compute (nova)   Fix Released   Undecided    Rafi Khardalian
  Essex                    Fix Released   Undecided    Unassigned
nova (Ubuntu)              Fix Released   Undecided    Unassigned
  Precise                  Fix Released   Undecided    Unassigned

Bug Description

Block storage connections are not properly restored in a number of cases, and the libvirt driver is affected in particular. The most common case is a VM with block storage attached via iSCSI whose physical host is rebooted. When the host comes back up and starts nova-compute, the iSCSI connections are NOT recreated for the instances slated to be resumed (assuming resume_guests_state_on_host_boot is set).

Nova properly updates a VM's libvirt XML to reference attached block devices, as follows:

<source dev='/dev/disk/by-path/ip-10.255.171.82:3260-iscsi-iqn.2010-10.org.openstack:volume-00000005-lun-1'/>

However, any attempt to recover the instance (via hard_reboot or otherwise) fails, because the /dev/disk path does not exist until the iSCSI connections are re-established. In effect, a user cannot recover a VM in this state unless iscsiadm/tgt is used to manually reconnect to all the targets.
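
For anyone stuck doing the manual reconnect, here is a rough Python sketch (not part of nova) that pulls the iSCSI portal and target IQN out of the by-path device names in a domain's libvirt XML and builds the matching iscsiadm login command. The function names and the regular expression are my own assumptions based on the single path shown above. It also assumes the node records still exist in the open-iscsi database; if they do not, discovery would be needed first.

```python
import re
import xml.etree.ElementTree as ET

# Pattern inferred from the one by-path example in this report (IPv4 portal).
BY_PATH_RE = re.compile(
    r"^/dev/disk/by-path/ip-(?P<portal>[^-]+:\d+)"
    r"-iscsi-(?P<iqn>.+)-lun-(?P<lun>\d+)$"
)


def iscsi_sources(domain_xml):
    """Return (portal, iqn, lun) for every iSCSI by-path <source dev=...>."""
    root = ET.fromstring(domain_xml)
    targets = []
    for source in root.iter("source"):
        match = BY_PATH_RE.match(source.get("dev", ""))
        if match:
            targets.append(
                (match.group("portal"), match.group("iqn"), int(match.group("lun")))
            )
    return targets


def login_command(portal, iqn):
    """Build (but do not run) the iscsiadm command that logs in to one target."""
    return ["iscsiadm", "-m", "node", "-T", iqn, "-p", portal, "--login"]


# Minimal hypothetical domain XML wrapping the <source> element quoted above.
EXAMPLE_XML = """<domain><devices><disk type='block' device='disk'>
<source dev='/dev/disk/by-path/ip-10.255.171.82:3260-iscsi-iqn.2010-10.org.openstack:volume-00000005-lun-1'/>
</disk></devices></domain>"""

for portal, iqn, _lun in iscsi_sources(EXAMPLE_XML):
    print(" ".join(login_command(portal, iqn)))
```

The sketch only prints the commands; running them as root on the compute host (for example via subprocess) is left to the operator.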

This issue impacts both the latest Folsom and Essex branches. I'll be submitting patches to fix both versions.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/11375

Changed in nova:
assignee: nobody → Rafi Khardalian (rkhardalian)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/essex)

Fix proposed to branch: stable/essex
Review: https://review.openstack.org/11387

tags: added: essex-backport
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/11375
Committed: http://github.com/openstack/nova/commit/9fffd28cee6669089159047b2bbb5e0539ab4299
Submitter: Jenkins
Branch: master

commit 9fffd28cee6669089159047b2bbb5e0539ab4299
Author: Rafi Khardalian <email address hidden>
Date: Tue Aug 14 13:42:22 2012 +0000

    Restore libvirt block storage connections on reboot.

    Fixes bug 1036902.

    There are a number of cases where block storage connections are not
    properly restored, impacting libvirt in particular. The most common
    case is a VM which has block storage attached via iSCSI, whereby the
    physical system is rebooted. When the system comes back up and
    starts nova-compute, the iSCSI connections are NOT recreated for the
    instances slated to be resumed (assuming
    resume_guests_state_on_host_boot is set).

    The patch changes the compute manager to pass block_storage_info via
    driver.reboot() and driver.resume_state_on_host_boot(). The fix is
    actually only present in the libvirt driver. However, all the other
    drivers were updated to accept the additional, optional function
    arg.

    With the changes in place, iSCSI connections for libvirt are
    re-established after a hypervisor reboot with
    resume_guests_state_on_host_boot=True and on every hard_reboot.
    The latter is intended so that users have a last ditch option for
    recovering their VMs without administrative involvement.

    Change-Id: Idf5d53f21991a359bec6ce26ae9fe3bd61800ce3
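
The shape of the change described in the commit message above can be sketched roughly as follows. This is an illustration only, not the merged nova code: the class names, the dictionary layout, and the helper method are hypothetical, and the parameter name block_storage_info is simply taken from the commit message text. It shows an optional argument accepted by every driver, with only the libvirt-like driver acting on it.

```python
class ComputeDriver:
    """Hypothetical base class; not nova's real driver interface."""

    def reboot(self, instance, reboot_type, block_storage_info=None):
        raise NotImplementedError

    def resume_state_on_host_boot(self, instance, block_storage_info=None):
        raise NotImplementedError


class OtherDriver(ComputeDriver):
    """Accepts the new optional argument but does nothing with it."""

    def reboot(self, instance, reboot_type, block_storage_info=None):
        return []

    def resume_state_on_host_boot(self, instance, block_storage_info=None):
        return []


class LibvirtLikeDriver(ComputeDriver):
    """Re-establishes volume connections before bringing the guest back."""

    def __init__(self):
        self.connected = []  # targets "reconnected" so far (stand-in for iSCSI logins)

    def _reconnect_volumes(self, block_storage_info):
        # Assumed layout: {"volumes": [{"target": "<iqn>"}, ...]}
        volumes = (block_storage_info or {}).get("volumes", [])
        targets = [volume["target"] for volume in volumes]
        self.connected.extend(targets)
        return targets

    def reboot(self, instance, reboot_type, block_storage_info=None):
        # Only a hard reboot reconnects, matching the commit message.
        if reboot_type == "HARD":
            return self._reconnect_volumes(block_storage_info)
        return []

    def resume_state_on_host_boot(self, instance, block_storage_info=None):
        return self._reconnect_volumes(block_storage_info)


def reboot_instance(driver, instance, reboot_type, block_storage_info):
    """Stand-in for the compute manager passing the info through to the driver."""
    return driver.reboot(instance, reboot_type, block_storage_info=block_storage_info)


INFO = {"volumes": [{"target": "iqn.2010-10.org.openstack:volume-00000005"}]}
libvirt_like = LibvirtLikeDriver()
print(reboot_instance(libvirt_like, "vm-1", "HARD", INFO))
```

Keeping the argument optional means drivers that do not need it keep working unchanged, which is the reason the commit gives for touching the other drivers at all.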

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/essex)

Reviewed: https://review.openstack.org/11387
Committed: http://github.com/openstack/nova/commit/09217abddc07bd4fbaca6c300075df2c68ffedf7
Submitter: Jenkins
Branch: stable/essex

commit 09217abddc07bd4fbaca6c300075df2c68ffedf7
Author: Rafi Khardalian <email address hidden>
Date: Mon Aug 13 20:53:43 2012 +0000

    Restore libvirt block storage connections on reboot.

    Fixes bug 1036902 -- Backported version for stable/essex.

    There are a number of cases where block storage connections are not
    properly restored, impacting libvirt in particular. The most common
    case is a VM which has block storage attached via iSCSI, whereby the
    physical system is rebooted. When the system comes back up and
    starts nova-compute, the iSCSI connections are NOT recreated for the
    instances slated to be resumed (assuming
    resume_guests_state_on_host_boot is set).

    The patch changes the compute manager to pass block_storage_info via
    driver.reboot() and driver.resume_state_on_host_boot(). The fix is
    actually only present in the libvirt driver. However, all the other
    drivers were updated to accept the additional, optional function
    arg.

    With the changes in place, iSCSI connections for libvirt are
    re-established after a hypervisor reboot with
    resume_guests_state_on_host_boot=True and on every hard_reboot.
    The latter is intended so that users have a last ditch option for
    recovering their VMs without administrative involvement.

    Change-Id: I8ab3a138b559ee0aa1535a928282e9c372ec5651

tags: added: in-stable-essex
Dave Walker (davewalker)
Changed in nova (Ubuntu):
status: New → Fix Released
Changed in nova (Ubuntu Precise):
status: New → Confirmed
Adam Gandelman (gandelman-a) wrote : Verification report.

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have been invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/11375
Stable review: https://review.openstack.org/11387

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Adam Gandelman (gandelman-a) wrote :

Test coverage log.

tags: added: verification-done
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2012.1.3+stable-20120827-4d2a4afe-0ubuntu1

---------------
nova (2012.1.3+stable-20120827-4d2a4afe-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot, fixes FTBFS in -proposed. (LP: #1041120)
  * Resynchronize with stable/essex (4d2a4afe):
    - [5d63601] Inappropriate exception handling on kvm live/block migration
      (LP: #917615)
    - [ae280ca] Deleted floating ips can cause instance delete to fail
      (LP: #1038266)

nova (2012.1.3+stable-20120824-86fb7362-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot. (LP: #1041120)
  * Dropped, superseded by new snapshot:
    - debian/patches/CVE-2012-3447.patch: [d9577ce]
    - debian/patches/CVE-2012-3371.patch: [25f5bd3]
    - debian/patches/CVE-2012-3360+3361.patch: [b0feaff]
  * Resynchronize with stable/essex (86fb7362):
    - [86fb736] Libvirt driver reports incorrect error when volume-detach fails
      (LP: #1029463)
    - [272b98d] nova delete lxc-instance umounts the wrong rootfs (LP: #971621)
    - [09217ab] Block storage connections are NOT restored on system reboot
      (LP: #1036902)
    - [d9577ce] CVE-2012-3361 not fully addressed (LP: #1031311)
    - [e8ef050] pycrypto is unused and the existing code is potentially insecure
      to use (LP: #1033178)
    - [3b4ac31] cannot umount guestfs (LP: #1013689)
    - [f8255f3] qpid_heartbeat setting in ineffective (LP: #1030430)
    - [413c641] Deallocation of fixed IP occurs before security group refresh
      leading to potential security issue in error / race conditions
      (LP: #1021352)
    - [219c5ca] Race condition in network/deallocate_for_instance() leads to
      security issue (LP: #1021340)
    - [f2bc403] cleanup_file_locks does not remove stale sentinel files
      (LP: #1018586)
    - [4c7d671] Deleting Flavor currently in use by instance creates error
      (LP: #994935)
    - [7e88e39] nova testsuite errors on newer versions of python-boto (e.g.
      2.5.2) (LP: #1027984)
    - [80d3026] NoMoreFloatingIps: Zero floating ips available after repeatedly
      creating and destroying instances over time (LP: #1017418)
    - [4d74631] Launching with source groups under load produces lazy load error
      (LP: #1018721)
    - [08e5128] API 'v1.1/{tenant_id}/os-hosts' does not return a list of hosts
      (LP: #1014925)
    - [801b94a] Restarting nova-compute removes ip packet filters (LP: #1027105)
    - [f6d1f55] instance live migration should create virtual_size disk image
      (LP: #977007)
    - [4b89b4f] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [6e873bc] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [7b215ed] Use default qemu-img cluster size in libvirt connection driver
    - [d3a87a2] Listing flavors with marker set returns 400 (LP: #956096)
    - [cf6a85a] nova-rootwrap hardcodes paths instead of using
      /sbin:/usr/sbin:/usr/bin:/bin (LP: #1013147)
    - [2efc87c] affinity filters don't work if scheduler_hints is None
      (LP: #1007573)
  ...

Changed in nova (Ubuntu Precise):
status: Confirmed → Fix Released
Clint Byrum (clint-fewbar) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Thierry Carrez (ttx)
Changed in nova:
milestone: none → folsom-rc1
status: Fix Committed → Fix Released
Mark McLoughlin (markmc) wrote :

https://review.openstack.org/11387 was merged into stable/essex

Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-rc1 → 2012.2