After the detach volume timeout, the disk is lost after soft reboot

Bug #1942766 reported by wlfightup
This bug affects 1 person
Affects                   Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)  Triaged  Low         Unassigned

Bug Description

Description
==================
When a volume detach times out and the virtual machine is then soft rebooted, the disk whose detach timed out is missing from the guest, yet the nova database still shows it as attached to the VM.

Steps to reproduce
==================
1. Create a Windows VM, attach a disk to it, and generate heavy I/O on the disk.
2. Detach the disk; the detach times out.
3. Soft reboot the VM.
4. The disk is missing from the guest.

Cause Analysis
==================

The detach currently removes the disk from the persistent XML first. When the subsequent live detach times out, the disk is already gone from the persistent XML.
If the virtual machine is soft rebooted at this point, it loses the disk because the disk is no longer in the persistent XML.

def _detach_with_retry(...):
    ...
    # The persistent config is detached first, before the live detach.
    if persistent_dev:
        try:
            self._detach_from_persistent(
                guest, instance_uuid, persistent_dev, get_device_conf_func,
                device_name)
        ...
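
The ordering problem described above can be illustrated with a toy model. This is a minimal sketch, not nova or libvirt code: `ToyGuest`, `detach_persistent_first`, `LiveDetachTimeout`, and `nova_db_attachments` are hypothetical names, and the assumption that a soft reboot brings the guest back from its persistent definition is taken from the analysis in this report.

```python
class LiveDetachTimeout(Exception):
    """Raised by the toy model when the live detach does not complete."""


class ToyGuest:
    """Hypothetical stand-in for a domain with persistent and live configs."""

    def __init__(self, devices):
        self.persistent = set(devices)  # devices in the persistent XML
        self.live = set(devices)        # devices in the running domain
        self.busy = set()               # devices under heavy I/O

    def detach_persistent(self, dev):
        self.persistent.discard(dev)

    def detach_live(self, dev):
        if dev in self.busy:
            # Model the guest never completing the unplug request.
            raise LiveDetachTimeout(dev)
        self.live.discard(dev)

    def soft_reboot(self):
        # Assumption from the report: the guest restarts with whatever
        # the persistent definition contains.
        self.live = set(self.persistent)


def detach_persistent_first(guest, dev):
    """Mimic the order described above: persistent config first, then live."""
    guest.detach_persistent(dev)
    guest.detach_live(dev)


nova_db_attachments = {"vdb"}  # what the database still records
guest = ToyGuest({"vda", "vdb"})
guest.busy.add("vdb")

try:
    detach_persistent_first(guest, "vdb")
except LiveDetachTimeout:
    pass  # the detach failed, so the database keeps the attachment

guest.soft_reboot()
print(sorted(guest.live))           # ['vda']
print(sorted(nova_db_attachments))  # ['vdb']
```

After the simulated soft reboot the guest no longer has `vdb`, while the database record still lists it, which matches the symptom reported here.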

wlfightup (wlfightup)
description: updated
Balazs Gibizer (balazs-gibizer) wrote :

The description makes sense; this is how detach is implemented now.

One way to improve this is to change the order: first try to detach from the live domain, since that fails more frequently than the detach from the persistent domain.
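
A rough sketch of that reordering, using plain sets rather than real libvirt objects. `detach_live_first`, `LiveDetachTimeout`, and the `live_detach_ok` flag are hypothetical names used only for illustration; this is not the actual nova change.

```python
class LiveDetachTimeout(Exception):
    """Raised by the sketch when the live detach does not complete."""


def detach_live_first(persistent, live, dev, live_detach_ok):
    """Detach from the live domain first, then from the persistent config.

    Returns the new (persistent, live) device sets. If the live detach
    times out, nothing is changed and the exception propagates.
    """
    if not live_detach_ok:
        raise LiveDetachTimeout(dev)
    live = live - {dev}
    # Only reached once the live detach has succeeded.
    persistent = persistent - {dev}
    return persistent, live


persistent = {"vda", "vdb"}
live = {"vda", "vdb"}

try:
    persistent, live = detach_live_first(
        persistent, live, "vdb", live_detach_ok=False)
except LiveDetachTimeout:
    pass  # persistent config was never touched

# Assumption from the report: a soft reboot restarts the guest from the
# persistent definition.
after_soft_reboot = set(persistent)
print(sorted(after_soft_reboot))  # ['vda', 'vdb']
```

With this order a timed-out live detach leaves the persistent definition intact, so the disk survives a soft reboot. The remaining risk is a failure of the persistent detach after a successful live detach, which the comment above describes as the less frequent case.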

Balazs Gibizer (balazs-gibizer) wrote :

Lee noted on IRC that a hard reboot would fix the instance as nova re-creates the persistent domain from the DB during hard reboot. So marking this as Low priority.
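
The recovery path can be pictured the same way. In this toy sketch, `hard_reboot` and `db_attachments` are hypothetical names; the only behaviour modelled is the one stated above, that the persistent domain is rebuilt from the database record.

```python
def hard_reboot(db_attachments):
    """Rebuild the persistent and live device sets from the database record."""
    persistent = set(db_attachments)
    live = set(persistent)
    return persistent, live


# State after the failed detach and a soft reboot: the guest lost vdb,
# but the database still lists it.
db_attachments = {"vda", "vdb"}
persistent, live = {"vda"}, {"vda"}

persistent, live = hard_reboot(db_attachments)
print(sorted(live))  # ['vda', 'vdb']
```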

tags: added: compute libvirt
Changed in nova:
status: New → Triaged
importance: Undecided → Low
Jorhson Deng (jorhson) wrote :

Apart from hard rebooting the instance, are there any other ways to improve the success rate of a live volume detach? If there is continuous I/O on the volume, the detach will probably fail.
Also, after the first failed detach, if we try to detach the volume again later, libvirt reports that it cannot find the device, because QEMU already detached the volume the first time.
