stable/victoria tests are very unstable because of tempest test_live_block_migration_with_attached_volume

Bug #2009156 reported by Alexey Stupnikov
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

The stable/victoria merge process is effectively blocked because the tempest test test_live_block_migration_with_attached_volume (run as tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_live_block_migration_with_attached_volume and tempest.api.compute.admin.test_live_migration.LiveAutoBlockMigrationV225Test.test_live_block_migration_with_attached_volume) fails sporadically. As a result, the nova-grenade-multinode and nova-live-migration jobs are unstable: one of them fails almost every time.

I have used the failed runs from review https://review.opendev.org/c/openstack/nova/+/833436 to analyze the problem.

The picture is always the same:

1. Tempest deletes the volume attachment [1]
2. nova-compute fails to detach the volume and logs an error [2]

It looks like a communication problem between nova-compute and libvirt.

[1]
2023-02-28 12:43:25.610324 | controller | 2023-02-28 12:40:08,991 6532 INFO [tempest.lib.common.rest_client] Request (LiveAutoBlockMigrationV225Test:_run_cleanups): 202 DELETE http://213.32.76.87/compute/v2.1/servers/e49ccfac-97fc-4b8d-a15a-6a319d8c9a9b/os-volume_attachments/e0d103c8-cc39-4ccc-a91e-a415ff68bdee 0.217s

[2]
Feb 28 12:45:26.707716 np0033291896 nova-compute[15752]: ERROR oslo_messaging.rpc.server File "/opt/stack/old/nova/nova/virt/libvirt/guest.py", line 472, in _do_wait_and_retry_detach
Feb 28 12:45:26.707716 np0033291896 nova-compute[15752]: ERROR oslo_messaging.rpc.server device=alternative_device_name, reason=reason)
Feb 28 12:45:26.707716 np0033291896 nova-compute[15752]: ERROR oslo_messaging.rpc.server nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.
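For illustration only, here is a minimal sketch (not the actual nova code at guest.py line 472) of the retry-and-raise pattern behind the traceback above: the driver asks the guest to detach the device, polls whether it is still present in the live config, and raises DeviceDetachFailed once the retries run out. The callables, retry count and interval are assumptions.

import time

class DeviceDetachFailed(Exception):
    """Raised when the guest never completes the hot-unplug."""

def wait_and_retry_detach(request_detach, still_attached, device_name,
                          attempts=8, interval=5):
    # request_detach and still_attached are caller-supplied callables,
    # illustrative stand-ins for the libvirt driver plumbing.
    for _ in range(attempts):
        request_detach()              # e.g. send device_del via libvirt
        time.sleep(interval)
        if not still_attached():
            return                    # the guest released the device
    # This is the branch the CI jobs keep hitting:
    raise DeviceDetachFailed(
        "Device detach failed for %s: "
        "Unable to detach the device from the live config." % device_name)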

Tags: testing
Alexey Stupnikov (astupnikov) wrote:

Notes about failed tests:

Feb 28 2:05 PM
nova-grenade-multinode
https://zuul.opendev.org/t/openstack/build/4ed95ecdc74d4171a252a10dbd7394cd

2023-02-28 12:43:25.610324 | controller | 2023-02-28 12:40:08,991 6532 INFO [tempest.lib.common.rest_client] Request (LiveAutoBlockMigrationV225Test:_run_cleanups): 202 DELETE http://213.32.76.87/compute/v2.1/servers/e49ccfac-97fc-4b8d-a15a-6a319d8c9a9b/os-volume_attachments/e0d103c8-cc39-4ccc-a91e-a415ff68bdee 0.217s
req-5258fab6-17b6-4229-a468-b8f631adb22f

Feb 28 12:45:26.707716 np0033291896 nova-compute[15752]: ERROR oslo_messaging.rpc.server File "/opt/stack/old/nova/nova/virt/libvirt/guest.py", line 472, in _do_wait_and_retry_detach
Feb 28 12:45:26.707716 np0033291896 nova-compute[15752]: ERROR oslo_messaging.rpc.server device=alternative_device_name, reason=reason)
Feb 28 12:45:26.707716 np0033291896 nova-compute[15752]: ERROR oslo_messaging.rpc.server nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.

Feb 28 11:30 PM
nova-live-migration
https://zuul.opendev.org/t/openstack/build/cbf3a07faab5440eb8a91becef2655c7
tempest.api.compute.admin.test_live_migration.LiveMigrationTest.test_live_block_migration_with_attached_volume

2023-02-28 21:58:26.301717 | controller | 2023-02-28 21:55:10,018 99895 INFO [tempest.lib.common.rest_client] Request (LiveMigrationTest:_run_cleanups): 202 DELETE https://10.209.98.43/compute/v2.1/servers/b94a1419-155e-41ca-84da-fe2bdc10e5fe/os-volume_attachments/64154b1e-47e2-4845-a2a3-6fbbf6b034cf 0.266s
req-66c2cbd5-7cc1-4095-8bdc-b7b47adb5460

Feb 28 22:00:25.255650 np0033301385 nova-compute[90639]: WARNING nova.virt.block_device [None req-66c2cbd5-7cc1-4095-8bdc-b7b47adb5460 tempest-LiveMigrationTest-202272944 tempest-LiveMigrationTest-202272944-project] [instance: b94a1419-155e-41ca-84da-fe2bdc10e5fe] Guest refused to detach volume 64154b1e-47e2-4845-a2a3-6fbbf6b034cf: nova.exception.DeviceDetachFailed: Device detach failed for vdb: Unable to detach the device from the live config.
For domain instance-0000000d there are no errors/failures in the libvirt logs, only successful requests to remove the device:

2023-02-28 21:59:15.232+0000: 59397: debug : qemuDomainObjBeginJobInternal:9470 : Started job: modify (async=none vm=0x7f9474014b30 name=instance-0000000d)
2023-02-28 21:59:15.233+0000: 59397: debug : qemuDomainObjEnterMonitorInternal:9869 : Entering monitor (mon=0x7f9474054dd0 vm=0x7f9474014b30 name=instance-0000000d)
2023-02-28 21:59:15.233+0000: 59397: debug : qemuMonitorDelDevice:2848 : devalias=virtio-disk1
2023-02-28 21:59:15.233+0000: 59397: debug : qemuMonitorDelDevice:2850 : mon:0x7f9474054dd0 vm:0x7f9474014b30 fd:34
2023-02-28 21:59:15.233+0000: 59397: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f9474054dd0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-398"}
 fd=-1
2023-02-28 21:59:15.233+0000: 59392: info : qemuMonitorIOWrite:450 : QEMU_MONITOR_IO_WRITE: mon=0x7f9474054dd0 buf={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-398"}
 len=79 ret=79 errno=0
2023-02-28 21:59:15.234+0000: 59392: debug : qemuMonitorJSONIOProcessLine:220 : ...


Alexey Stupnikov (astupnikov) wrote:

I don't see any related errors/problems in the libvirt logs: it looks like libvirt was able to remove the volume attachment without any issues:

2023-02-28 21:59:15.232+0000: 59397: debug : qemuDomainObjBeginJobInternal:9470 : Started job: modify (async=none vm=0x7f9474014b30 name=instance-0000000d)
2023-02-28 21:59:15.233+0000: 59397: debug : qemuDomainObjEnterMonitorInternal:9869 : Entering monitor (mon=0x7f9474054dd0 vm=0x7f9474014b30 name=instance-0000000d)
2023-02-28 21:59:15.233+0000: 59397: debug : qemuMonitorDelDevice:2848 : devalias=virtio-disk1
2023-02-28 21:59:15.233+0000: 59397: debug : qemuMonitorDelDevice:2850 : mon:0x7f9474054dd0 vm:0x7f9474014b30 fd:34
2023-02-28 21:59:15.233+0000: 59397: info : qemuMonitorSend:993 : QEMU_MONITOR_SEND_MSG: mon=0x7f9474054dd0 msg={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-398"}
 fd=-1
2023-02-28 21:59:15.233+0000: 59392: info : qemuMonitorIOWrite:450 : QEMU_MONITOR_IO_WRITE: mon=0x7f9474054dd0 buf={"execute":"device_del","arguments":{"id":"virtio-disk1"},"id":"libvirt-398"}
 len=79 ret=79 errno=0
2023-02-28 21:59:15.234+0000: 59392: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"return": {}, "id": "libvirt-398"}]
2023-02-28 21:59:15.234+0000: 59392: info : qemuMonitorJSONIOProcessLine:239 : QEMU_MONITOR_RECV_REPLY: mon=0x7f9474054dd0 reply={"return": {}, "id": "libvirt-398"}
2023-02-28 21:59:15.234+0000: 59397: debug : qemuDomainObjExitMonitorInternal:9892 : Exited monitor (mon=0x7f9474054dd0 vm=0x7f9474014b30 name=instance-0000000d)
2023-02-28 21:59:20.234+0000: 59397: debug : qemuDomainObjEnterMonitorInternal:9869 : Entering monitor (mon=0x7f9474054dd0 vm=0x7f9474014b30 name=instance-0000000d)
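Note that QEMU's {"return": {}} reply to device_del only acknowledges the unplug request; the guest still has to cooperate before the disk actually disappears from the live config. As a hedged illustration (not part of the bug report), one way to check whether the device really left the live definition, assuming the libvirt-python bindings, a local qemu:///system connection, and the domain name from the logs above:

import libvirt

conn = libvirt.open('qemu:///system')           # connection URI is an assumption
dom = conn.lookupByName('instance-0000000d')    # domain name taken from the logs

# XMLDesc(0) on a running domain returns the *live* config; if the guest
# ignored the hot-unplug, the disk with alias 'virtio-disk1' is still there
# even though QEMU acknowledged the device_del command.
live_xml = dom.XMLDesc(0)
print('still attached:', 'virtio-disk1' in live_xml)
conn.close()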

Balazs Gibizer (balazs-gibizer) wrote (last edit):

This is very similar to the volume detach bugs we have; see https://gitlab.com/libvirt/libvirt/-/issues/309. To work around that, in tempest we added a waiter before doing volume attach/detach on the guest, so that the guest has fully booted first: https://review.opendev.org/q/topic:wait_until_sshable_pingable. These workarounds are not part of the victoria-last version of tempest as far as I can see.
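For illustration only, a minimal sketch of the "wait until the guest is SSHable" idea described above (not the actual tempest waiter from the linked topic): poll the instance's SSH port until the guest OS has booted far enough to handle ACPI hot-plug/unplug events. Host, port and timing values are assumptions.

import socket
import time

def wait_until_sshable(host, port=22, timeout=300, interval=5):
    # Return True once something is listening on the SSH port, i.e. the
    # guest has booted far enough to service device_del requests.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False

# Only attach/detach volumes after wait_until_sshable(...) returns True; a
# freshly started guest may silently ignore the hot-unplug request otherwise.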

Balazs Gibizer (balazs-gibizer) wrote:

I will mark this invalid as we cannot really fix this in nova.

Changed in nova:
status: New → Invalid