DiskNotFound exception is raised when attaching a disk just after VM creation

Bug #2012056 reported by Ilya Popov
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Incomplete
Undecided
Unassigned

Bug Description

Description
===========
Faced a problem while running tempest.api.compute.admin.test_volume_swap.TestVolumeSwap.test_volume_swap.

The test creates two volumes, then creates a VM and attaches one volume to it:

https://github.com/openstack/tempest/blob/master/tempest/api/compute/admin/test_volume_swap.py#L105

The error appears in two ways:

1. exception.DiskNotFound

2023-03-13 17:20:41.406 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 2171, in swap_volume
2023-03-13 17:20:41.406 7 ERROR oslo_messaging.rpc.server raise exception.DiskNotFound(location=disk_dev)
2023-03-13 17:20:41.406 7 ERROR oslo_messaging.rpc.server nova.exception.DiskNotFound: No disk at vdb

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L2270

2. exception.VolumeRebaseFailed

2023-03-13 23:15:30.319 7 ERROR oslo_messaging.rpc.server raise exception.VolumeRebaseFailed(reason=str(exc))
2023-03-13 23:15:30.319 7 ERROR oslo_messaging.rpc.server nova.exception.VolumeRebaseFailed: Volume rebase failed: invalid argument: disk 'vdb' not found in domain
2023-03-13 23:15:30.319 7 ERROR oslo_messaging.rpc.server

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L2255

It looks like race condition problem because:

- both failure modes described above have the same root cause; in the second case execution gets a bit further because the race condition event arrives a bit later
- if we add a sleep in tempest, the error disappears and is no longer reproduced:

        server = self.create_test_server(wait_until='ACTIVE')
        time.sleep(2)

        # Attach "volume1" to server
        self.attach_volume(server, volume1)
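A fixed sleep papers over the race but does not guarantee the disk is ready. A more robust workaround would be to poll until the condition actually holds; a minimal sketch of such a helper (the `disk_present` check in the usage comment is hypothetical, not an existing tempest API):

```python
import time


def wait_for(predicate, timeout=30.0, interval=1.0):
    """Poll predicate() until it returns True or the timeout expires.

    Returns True if the predicate succeeded within the timeout,
    False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Hypothetical usage in the test, instead of time.sleep(2):
#   assert wait_for(lambda: disk_present(server, 'vdb'), timeout=30)
```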

Environment
===========

stable/xena
libvirt/KVM/QEMU

Revision history for this message
Ilya Popov (ilya-p) wrote :

Looks like it is in the QEMU/libvirt area, as I didn't find any additional calls to QEMU from the nova libvirt driver, but there is an interesting message in the nova-compute log just after the disk was attached:

2023-03-16 12:04:46.105 7 WARNING nova.virt.libvirt.driver [req-1acb0047-73a3-485b-b4cf-a20f1210e7c0 - - - - -] Received event <DeviceRemovedEvent: 1678957486.1054864, 92b8a8d2-287d-4d6c-b65f-9bc675ecd74c => virtio-disk1> from libvirt but the driver is not waiting for it; ignored.

Looks like QEMU removes the attached disk on its own if the attachment was made right at the time the VM was created and started. If we wait a couple of seconds and then attach the disk, everything goes well.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Ilya: do you have a link to the logs of a failed tempest run? Is it the step 'self.attach_volume(server, volume1)' in https://github.com/openstack/tempest/blob/3c7eebaaf35c9e8a3f00c76cd1741457bdec9fab/tempest/api/compute/admin/test_volume_swap.py#L119 that fails?

tags: added: libvirt volumes
Changed in nova:
status: New → Incomplete
Revision history for this message
Ilya Popov (ilya-p) wrote :

Hello, Balazs. Thanks for your attention to this bug.

Here is the link to log: https://pastebin.com/kmca5WUG

It fails waiting forever for the new volume to reach the "available" status here:

https://github.com/openstack/tempest/blob/3c7eebaaf35c9e8a3f00c76cd1741457bdec9fab/tempest/api/compute/admin/test_volume_swap.py#L123

It waits for the volume status to change after the volume swap is executed via update_attached_volume.

In the tempest log we see volume 8a8fe339-c533-4199-b8db-89ffd0136aa8 transition from "in-use" to "detaching" and finally back to "in-use".

In nova-compute we see one of the two errors described above, and there is no record of volume 8a8fe339-c533-4199-b8db-89ffd0136aa8 in the domain XML anymore. That's why nova can't find vdb: it searches the domain definition.
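To illustrate why the lookup fails: the driver essentially searches the domain XML for a `<disk>` whose `<target dev='...'>` matches the requested device, and once QEMU has removed the disk that entry is gone. A minimal standalone sketch of that lookup (the sample domain XML below is invented for illustration, not taken from the failing environment):

```python
import xml.etree.ElementTree as ET

# Invented minimal domain XML: only vda is present, mirroring the state
# after the vdb disk has been removed from the domain.
SAMPLE_DOMAIN_XML = """
<domain>
  <devices>
    <disk type='file' device='disk'>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def find_disk(domain_xml, dev):
    """Return the <disk> element whose <target dev=...> matches, or None."""
    root = ET.fromstring(domain_xml)
    for disk in root.findall('./devices/disk'):
        target = disk.find('target')
        if target is not None and target.get('dev') == dev:
            return disk
    return None
```

With this sample XML, looking up 'vdb' returns None, which corresponds to the "No disk at vdb" error above.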

Revision history for this message
Ilya Popov (ilya-p) wrote :

One additional interesting thing:

If we set metadata on the cirros image to use the scsi bus type, like this: --property hw_disk_bus='scsi' --property hw_scsi_model='virtio-scsi'

we don't face the error anymore.

IMHO it looks like QEMU removes vdb (maybe because of some internal error) and this event is registered by nova:

2023-03-16 12:04:46.105 7 WARNING nova.virt.libvirt.driver [req-1acb0047-73a3-485b-b4cf-a20f1210e7c0 - - - - -] Received event <DeviceRemovedEvent: 1678957486.1054864, 92b8a8d2-287d-4d6c-b65f-9bc675ecd74c => virtio-disk1> from libvirt but the driver is not waiting for it; ignored.

In this case nova doesn't execute the volume swap, and the volume finally reverts to the in-use state from the OpenStack point of view, but there is no such disk in the domain XML anymore.

I see this error on only 2 installations; it isn't reproduced on every test environment I tried.

There were some issues reported several years ago that look more or less close to this one:

https://bugzilla.redhat.com/show_bug.cgi?id=666130
https://bugzilla.redhat.com/show_bug.cgi?id=696596
