Reboot with bad volume fails ungracefully

Bug #1148614 reported by Rick Harris
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Critical
Assigned to: Rick Harris

Bug Description

If a user has an instance that has a Cinder volume attached and, for whatever reason, that volume becomes inoperable, a subsequent reboot operation may cause the instance to go into a permanent halted state.

This affects the `xenapi` driver for sure; it's unknown whether a similar issue exists in the other virt-drivers.

Steps to replicate:

1. Build an instance
2. Attach a cinder-volume (using lvm+iscsi driver)
3. Sever the iSCSI connection by running `killall -s9 tgtd` on the cinder volume server
4. Reboot instance
5. Verify that instance goes to halted and can't be started

Proposed solution:

The proposed solution has a few different steps:

1. Detect that reboot failed due to bad-volumes being attached
2. Detect exactly which volumes are bad
3. Detach these volumes in the virt-layer so that the VM operation can be retried
4. Raise an exception to notify the compute-manager layer that a driver operation had the *side-effect* of detaching a set of 'bad' volumes so that any compute level cleanups (destroy BDM, Cinder volume detach) can be made
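
The four steps above can be sketched roughly as follows. This is a toy model, not the nova code: `BadVolumesDetached`, `FakeDriver`, and `manager_reboot` are hypothetical names invented for illustration, and a bad volume is simulated with a plain set rather than a real XenServer VBD.

```python
class BadVolumesDetached(Exception):
    """Hypothetical exception: the driver detached volumes as a side effect."""

    def __init__(self, volume_ids):
        super().__init__("detached bad volumes: %s" % ", ".join(volume_ids))
        self.volume_ids = list(volume_ids)


class FakeDriver:
    """Stand-in for a virt driver; tracks attached volumes and which are bad."""

    def __init__(self, attached, bad):
        self.attached = set(attached)
        self.bad = set(bad)
        self.state = "running"

    def _start(self):
        # In this model a VM operation fails while a bad volume is attached.
        if self.attached & self.bad:
            self.state = "halted"
            raise RuntimeError("VM operation failed")
        self.state = "running"

    def reboot(self):
        try:
            self._start()
            return
        except RuntimeError:
            # Steps 1 and 2: work out which attached volumes are bad.
            bad = sorted(self.attached & self.bad)
            if not bad:
                raise  # the failure had some other cause
        # Step 3: detach in the virt layer, then retry the operation.
        self.attached -= set(bad)
        self._start()
        # Step 4: report the side effect to the caller.
        raise BadVolumesDetached(bad)


def manager_reboot(driver, cleaned_up):
    """Compute-manager side: reboot, then clean up any detached volumes."""
    try:
        driver.reboot()
    except BadVolumesDetached as exc:
        # Placeholder for the real cleanups (destroy BDM, Cinder detach).
        cleaned_up.extend(exc.volume_ids)
```

Note that in this sketch the exception signals a successful reboot with a side effect, not a failure, so the manager treats it as a cue for cleanup rather than as an error.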

Note:

The current method of detecting which volume is 'bad' indirectly makes use of a 120 sec timeout within the XenServer code. An upstream patch from Citrix that lets us 'fail-fast' here would speed up error recovery dramatically.

For example, on a given network, we might want to say that a connection hung for more than 10 secs is inaccessible rather than having to wait a full two minutes.
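
To illustrate the 'fail-fast' idea only: the sketch below is not the XenServer mechanism and not the proposed Citrix patch. It is a hypothetical helper, `portal_reachable`, that gives up on a TCP connection to an iSCSI target after a configurable timeout (10 secs by default) instead of waiting two minutes. It assumes the target listens on the standard iSCSI port 3260, and it only checks that the portal accepts connections, which is a weaker test than checking an established session.

```python
import socket


def portal_reachable(host, port=3260, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; the timeout is in seconds.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```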

Changed in nova:
assignee: nobody → Rick Harris (rconradharris)
importance: Undecided → Critical
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/23662

Sina Sadeghi (sina-sa)
Changed in nova:
status: In Progress → Confirmed
tags: added: cloud-archive
Changed in nova:
status: Confirmed → In Progress
Changed in nova:
milestone: none → grizzly-rc1
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/23662
Committed: http://github.com/openstack/nova/commit/40feb35898ed0a6d57b1f481c165e683796b045c
Submitter: Jenkins
Branch: master

commit 40feb35898ed0a6d57b1f481c165e683796b045c
Author: Rick Harris <email address hidden>
Date: Wed Mar 6 05:28:41 2013 +0000

    xenapi: Fix reboot with hung volumes

    If a volume becomes inoperable (e.g. the ISCSI connection is severed)
    and the user goes to reboot, the instance may enter a permanently halted
    state.

    The root cause is that a VBD that points to 'bad' volume prevents VM
    operations ('reboot', 'start') from completing under XenServer.

    The work around is to detect which volumes are bad, detach in the
    virt-layer, retry the operation (or in the case of reboot, just 'start'
    the halted instance), and then notify the compute manager via a
    callback so it can detach the volume in Cinder.

    Fixes bug 1148614

    Change-Id: Id4e8e84bb5748cfa267c2a418f9405fd86829e8f
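
The commit message describes notifying the compute manager via a callback rather than an exception. A minimal sketch of that shape, assuming hypothetical callables (`start_vm`, `find_bad_volumes`, `detach`, `bad_volumes_callback`) that are not taken from the actual patch:

```python
def reboot_with_callback(start_vm, find_bad_volumes, detach,
                         bad_volumes_callback):
    """Start the VM; on failure detach bad volumes, retry, then notify.

    All four arguments are hypothetical callables supplied by the caller.
    Returns the list of volumes that were detached (empty if none).
    """
    try:
        start_vm()
        return []
    except RuntimeError:
        bad = find_bad_volumes()
        if not bad:
            raise  # nothing to detach, so the failure is unrelated
    for volume in bad:
        detach(volume)  # virt-layer detach only
    start_vm()  # retry: just 'start' the halted instance
    bad_volumes_callback(bad)  # lets the manager detach in Cinder
    return bad
```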

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: grizzly-rc1 → 2013.1