DiskNotFound exception is raised when attaching a disk just after VM creation

Bug #2012056 reported by Ilya Popov
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Incomplete
Undecided
Unassigned

Bug Description

Description
===========
Faced a problem while running tempest.api.compute.admin.test_volume_swap.TestVolumeSwap.test_volume_swap.

The test creates two volumes, then creates a VM and attaches one volume to it:

https://github.com/openstack/tempest/blob/master/tempest/api/compute/admin/test_volume_swap.py#L105

The error appears in two ways:

1. exception.DiskNotFound

2023-03-13 17:20:41.406 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 2171, in swap_volume
2023-03-13 17:20:41.406 7 ERROR oslo_messaging.rpc.server raise exception.DiskNotFound(location=disk_dev)
2023-03-13 17:20:41.406 7 ERROR oslo_messaging.rpc.server nova.exception.DiskNotFound: No disk at vdb

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L2270

2. exception.VolumeRebaseFailed

2023-03-13 23:15:30.319 7 ERROR oslo_messaging.rpc.server raise exception.VolumeRebaseFailed(reason=str(exc))
2023-03-13 23:15:30.319 7 ERROR oslo_messaging.rpc.server nova.exception.VolumeRebaseFailed: Volume rebase failed: invalid argument: disk 'vdb' not found in domain
2023-03-13 23:15:30.319 7 ERROR oslo_messaging.rpc.server

https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L2255

It looks like race condition problem because:

- both failure modes described above have the same root cause; in the second case execution gets a bit further because the race condition event arrives a bit later
- if we add a sleep in tempest, the error disappears and is no longer reproduced:

        server = self.create_test_server(wait_until='ACTIVE')
        time.sleep(2)

        # Attach "volume1" to server
        self.attach_volume(server, volume1)
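A fixed sleep papers over the race but does not guarantee the disk is ready. A more robust workaround would be to poll until the condition actually holds; a minimal sketch of such a helper (the `disk_present` check in the usage comment is hypothetical, not an existing tempest API):

```python
import time


def wait_for(predicate, timeout=30.0, interval=1.0):
    """Poll predicate() until it returns True or the timeout expires.

    Returns True if the predicate succeeded within the timeout,
    False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Hypothetical usage in the test, instead of time.sleep(2):
#   assert wait_for(lambda: disk_present(server, 'vdb'), timeout=30)
```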

Environment
===========

stable/xena
libvirt/KVM/QEMU

Revision history for this message
Ilya Popov (ilya-p) wrote :

Looks like it is in the QEMU/libvirt area, as I didn't find any additional calls to QEMU from the nova libvirt driver, but there is an interesting message in the nova-compute log just after the disk was attached:

2023-03-16 12:04:46.105 7 WARNING nova.virt.libvirt.driver [req-1acb0047-73a3-485b-b4cf-a20f1210e7c0 - - - - -] Received event <DeviceRemovedEvent: 1678957486.1054864, 92b8a8d2-287d-4d6c-b65f-9bc675ecd74c => virtio-disk1> from libvirt but the driver is not waiting for it; ignored.

Looks like QEMU removes the attached disk on its own if the attachment was made right at the time the VM was created and started. If we wait a couple of seconds and then attach the disk, everything goes well.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Ilya: do you have a link to the logs of a failed tempest run? Is it the step 'self.attach_volume(server, volume1)' in https://github.com/openstack/tempest/blob/3c7eebaaf35c9e8a3f00c76cd1741457bdec9fab/tempest/api/compute/admin/test_volume_swap.py#L119 that fails?

tags: added: libvirt volumes
Changed in nova:
status: New → Incomplete
Revision history for this message
Ilya Popov (ilya-p) wrote :

Hello, Balazs. Thanks for your attention to this bug.

Here is the link to log: https://pastebin.com/kmca5WUG

It fails waiting forever for the new volume to reach the "available" status here:

https://github.com/openstack/tempest/blob/3c7eebaaf35c9e8a3f00c76cd1741457bdec9fab/tempest/api/compute/admin/test_volume_swap.py#L123

It waits for the volume status to change after the volume swap is executed via update_attached_volume.

In the tempest log we see volume 8a8fe339-c533-4199-b8db-89ffd0136aa8 transition from "in-use" to "detaching" and finally back to "in-use".

In nova-compute we see one of the two errors described above, and there is no record of volume 8a8fe339-c533-4199-b8db-89ffd0136aa8 in the domain XML anymore. That's why nova can't find vdb: it searches the domain definition.
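To illustrate why the lookup fails: the driver essentially searches the domain XML for a `<disk>` whose `<target dev='...'>` matches the requested device, and once QEMU has removed the disk that entry is gone. A minimal standalone sketch of that lookup (the sample domain XML below is invented for illustration, not taken from the failing environment):

```python
import xml.etree.ElementTree as ET

# Invented minimal domain XML: only vda is present, mirroring the state
# after the vdb disk has been removed from the domain.
SAMPLE_DOMAIN_XML = """
<domain>
  <devices>
    <disk type='file' device='disk'>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def find_disk(domain_xml, dev):
    """Return the <disk> element whose <target dev=...> matches, or None."""
    root = ET.fromstring(domain_xml)
    for disk in root.findall('./devices/disk'):
        target = disk.find('target')
        if target is not None and target.get('dev') == dev:
            return disk
    return None
```

With this sample XML, looking up 'vdb' returns None, which corresponds to the "No disk at vdb" error above.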

Revision history for this message
Ilya Popov (ilya-p) wrote :

One additional interesting thing:

If we set metadata on the cirros image to use the scsi bus type, like this: --property hw_disk_bus='scsi' --property hw_scsi_model='virtio-scsi'

we don't face the error anymore.

IMHO it looks like QEMU removes vdb (maybe because of some internal error) and this event is registered by nova:

2023-03-16 12:04:46.105 7 WARNING nova.virt.libvirt.driver [req-1acb0047-73a3-485b-b4cf-a20f1210e7c0 - - - - -] Received event <DeviceRemovedEvent: 1678957486.1054864, 92b8a8d2-287d-4d6c-b65f-9bc675ecd74c => virtio-disk1> from libvirt but the driver is not waiting for it; ignored.

In this case nova doesn't execute the volume swap, and the volume finally reverts to the in-use state from the OpenStack point of view, but there is no such disk in the domain XML anymore.

I see this error on only 2 installations; it isn't reproduced on every test environment I tried.

There were some issues reported several years ago that look more or less close to this one:

https://bugzilla.redhat.com/show_bug.cgi?id=666130
https://bugzilla.redhat.com/show_bug.cgi?id=696596
