OpenStack Compute (nova)

After the detach volume timeout, the disk is lost after soft reboot

Bug #1942766 reported by wlfightup on 2021-09-06

This bug affects 1 person

Affects                     Status    Importance  Assigned to  Milestone
OpenStack Compute (nova)    Triaged   Low         Unassigned

Bug Description

Description
==================
When a disk detach times out and the virtual machine is then soft rebooted, the disk whose detach timed out is lost from the guest, but the nova database still shows it as attached to the VM.

Steps to reproduce
==================
1. Create a Windows VM, attach a disk to it, and generate heavy I/O on the disk.
2. Detach the disk; the detach times out.
3. Soft reboot the VM.
4. The disk is lost.

Cause Analysis
==================

Detach currently removes the device from the persistent XML first. When the live detach of the disk then times out, the device is already gone from the persistent XML.
If the virtual machine is soft rebooted at this point, it loses the disk because the device is missing from the persistent XML.

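def _detach_with_retry(...):
    ...
    if persistent_dev:
        try:
            # The device is removed from the persistent domain first,
            # before the live detach is attempted.
            self._detach_from_persistent(
                guest, instance_uuid, persistent_dev, get_device_conf_func,
                device_name)
        ...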
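
The ordering problem described above can be illustrated with a small toy model. This is only a sketch: `ToyGuest`, `DetachTimeout`, and `detach_persistent_first` are hypothetical names invented for the illustration and are not nova or libvirt APIs. It assumes, as the report states, that a soft reboot restarts the guest from the persistent definition.

```python
class DetachTimeout(Exception):
    """Raised by the toy model when the guest does not release a device."""


class ToyGuest:
    """Hypothetical stand-in for a libvirt domain; not nova's Guest class."""

    def __init__(self, devices, busy=()):
        self.persistent = set(devices)  # devices in the persistent XML
        self.live = set(devices)        # devices in the running domain
        self.busy = set(busy)           # devices under heavy I/O

    def detach_persistent(self, dev):
        self.persistent.discard(dev)

    def detach_live(self, dev):
        if dev in self.busy:
            raise DetachTimeout(dev)
        self.live.discard(dev)

    def soft_reboot(self):
        # Assumption from the report: the guest comes back up from the
        # persistent definition.
        self.live = set(self.persistent)


def detach_persistent_first(guest, dev):
    """Ordering described in the report: persistent config, then live."""
    guest.detach_persistent(dev)
    guest.detach_live(dev)  # may time out after the persistent change


nova_db = {"vda", "vdb"}  # attachments nova still records
guest = ToyGuest(nova_db, busy={"vdb"})
try:
    detach_persistent_first(guest, "vdb")
except DetachTimeout:
    pass  # the detach is reported as failed; the DB record is kept
guest.soft_reboot()

lost = nova_db - guest.live
print(sorted(lost))  # prints ['vdb']
```

After the timed-out detach and the soft reboot, `vdb` is still recorded in `nova_db` but is no longer present in the guest, which matches the symptom in the report.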

wlfightup (wlfightup) on 2021-09-06:

description: updated

Balazs Gibizer (balazs-gibizer) wrote on 2021-09-07:

The description makes sense; this is how detach is implemented now.

One way to improve this is to change the order: first try to detach from the live domain, as that fails more frequently than the detach from the persistent domain.
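
A minimal sketch of the reordering suggested here, using a toy model rather than nova code. `ToyGuest`, `DetachTimeout`, and `detach_live_first` are hypothetical names; the sketch assumes that a soft reboot restarts the guest from the persistent definition and that the persistent detach is only attempted once the live detach has succeeded.

```python
class DetachTimeout(Exception):
    """Raised by the toy model when the guest does not release a device."""


class ToyGuest:
    """Hypothetical stand-in for a libvirt domain; not nova's Guest class."""

    def __init__(self, devices, busy=()):
        self.persistent = set(devices)  # devices in the persistent XML
        self.live = set(devices)        # devices in the running domain
        self.busy = set(busy)           # devices under heavy I/O

    def detach_persistent(self, dev):
        self.persistent.discard(dev)

    def detach_live(self, dev):
        if dev in self.busy:
            raise DetachTimeout(dev)
        self.live.discard(dev)

    def soft_reboot(self):
        # Assumption: the guest comes back up from the persistent definition.
        self.live = set(self.persistent)


def detach_live_first(guest, dev):
    """Suggested ordering: live domain first, persistent config second."""
    guest.detach_live(dev)        # if this raises, nothing else has changed
    guest.detach_persistent(dev)


guest = ToyGuest({"vda", "vdb"}, busy={"vdb"})
try:
    detach_live_first(guest, "vdb")
    detached = True
except DetachTimeout:
    detached = False
guest.soft_reboot()

print(detached, sorted(guest.live))  # prints False ['vda', 'vdb']
```

In this model a timed-out live detach leaves the persistent definition untouched, so the disk survives a soft reboot and the guest stays consistent with what the database records.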

Balazs Gibizer (balazs-gibizer) wrote on 2021-09-07:

Lee noted on IRC that a hard reboot would fix the instance as nova re-creates the persistent domain from the DB during hard reboot. So marking this as Low priority.
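
The recovery path mentioned here can be sketched in the same toy terms. The function `hard_reboot` below is a hypothetical illustration, not nova's implementation; it only models the statement that the persistent domain is rebuilt from the database record during a hard reboot.

```python
def hard_reboot(db_devices):
    """Toy model: rebuild the persistent definition from the DB record,
    then start the guest from that definition."""
    persistent = set(db_devices)
    live = set(persistent)
    return persistent, live


db_devices = {"vda", "vdb"}    # nova still records vdb as attached
persistent_before = {"vda"}    # state left behind by the timed-out detach

persistent_after, live_after = hard_reboot(db_devices)
print(sorted(live_after))  # prints ['vda', 'vdb']
```

Because the definition is regenerated from the database rather than reused, the stale persistent state in `persistent_before` has no effect on the guest after the hard reboot.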

tags:	added: compute libvirt
Changed in nova:
status:	New → Triaged
importance:	Undecided → Low

Jorhson Deng (jorhson) wrote on 2021-12-16:

Apart from hard rebooting the instance, are there any other ways to improve the success rate of a live volume detach? If there is continuous I/O on the volume, the detach will probably fail.
And after the first failed detach, if we try to detach the volume again later, libvirt reports that it cannot find the device, because QEMU already detached the volume the first time.
