Bug #1957060 “centos 9 featureset 1 ovb fails multiple tempest c...” : Bugs : tripleo

Revision history for this message

chandan kumar (chkumar246) wrote on 2022-01-11:

#1

On FS02 deployment,
tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops is also failing [1].

``
{0} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops [148.564774s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
Traceback (most recent call last):

File "/usr/lib/python3.9/site-packages/tempest/common/utils/__init__.py", line 70, in wrapper
return f(*func_args, **func_kwargs)

File "/usr/lib/python3.9/site-packages/tempest/scenario/test_network_basic_ops.py", line 436, in test_network_basic_ops
self._check_public_network_connectivity(should_connect=True)

File "/usr/lib/python3.9/site-packages/tempest/scenario/test_network_basic_ops.py", line 212, in _check_public_network_connectivity
self.check_vm_connectivity(

File "/usr/lib/python3.9/site-packages/tempest/scenario/manager.py", line 945, in check_vm_connectivity
self.assertTrue(self.ping_ip_address(ip_address,

File "/usr/lib64/python3.9/unittest/case.py", line 688, in assertTrue
raise self.failureException(msg)

AssertionError: False is not true : Public network connectivity check failed
Timed out waiting for 10.0.0.166 to become reachable

```
By scanning the compute logs [2], I am not able to find any thing suspecious.

Logs:
[1]. https://logserver.rdoproject.org/13/36713/23/check/periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-master/ee2e466/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz

[2]. https://logserver.rdoproject.org/13/36713/23/check/periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-master/ee2e466/logs/overcloud-novacompute-0/var/log/extra/podman/containers/nova_compute/log/nova/nova-compute.log.txt.gz

Revision history for this message

Alan Bishop (alan-bishop) wrote on 2022-01-11:

#2

I would focus on the apparent network issue. The repeating error noted in the description ("Process execution error in _get_host_uuid: Unexpected error while running command.") is a red herring caused by nova's get_volume_connector() [1] calling os-brick's get_connector_properties() function.

[1] https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L1682
[2] https://opendev.org/openstack/os-brick/src/branch/master/os_brick/initiator/connector.py#L196

The os-brick code loops over every connection protocol, some of which are likely not available in the underlying hardware. In this case, the log message is due to the nvmeof protocol not being supported. There's a similar message just before it that states, "No Fibre Channel support detected on system."

Revision history for this message

Marios Andreou (marios-b) wrote on 2022-01-11:

#3

newer logs from today for fs1 at https://logserver.rdoproject.org/67/36267/45/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/42e75f7/logs/undercloud/var/log/tempest/stestr_results.html.gz

but there is a subset of the tests listed in the description failing in this case

Revision history for this message

Marios Andreou (marios-b) wrote on 2022-01-12:

#5

did some digging in logs at [1] but can't find something yet ... per comment #2 above i focused initially on neutron container logs ... there are plenty of errors but they are also in a 'good' centos8 job - I used [2] for comparison

I checked the compute and controller logs too... one interesting thing I found was that controller2 rabbit has some issue [3]:

* 2022-01-05 15:32:28.925763+00:00 [error] <0.26873.0> closing AMQP connection <0.26873.0> (172.17.0.181:58096 -> 172.17.0.124:5672 - mod_wsgi:9:cf648468-d776-4edc-8b4b-bea38fed48ae):
2022-01-05 15:32:28.925763+00:00 [error] <0.26873.0> missed heartbeats from client, timeout: 60s

and it repeats several times (but only once in the 'good' log :/)

[1] https://logserver.rdoproject.org/67/36267/44/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/a1a7959/logs/undercloud/var/log/extra/errors.txt.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/621b120/logs/undercloud/var/log/extra/errors.txt.gz
[3] https://logserver.rdoproject.org/67/36267/44/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/a1a7959<email address hidden>

Revision history for this message

melanie witt (melwitt) wrote on 2022-01-13:

#6

Download full text (13.9 KiB)

Taking an example from comment 0 [1], this test failed because the guest refused to detach a volume from its live domain config.

Test result:

test_update_attached_volume_with_nonexistent_volume_in_body[id-7dcac15a-b107-46d3-a5f6-cb863f4e454a,negative]

traceback-1: {{{
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/tempest/common/waiters.py", line 312, in wait_for_volume_resource_status
raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: volume b2ce2e09-c9a9-46d3-9fb4-99f6ff188e53 failed to reach available status (current in-use) within the required time (300 s).
}}}

Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/tempest/common/waiters.py", line 380, in wait_for_volume_attachment_remove_from_server
raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: Volume b2ce2e09-c9a9-46d3-9fb4-99f6ff188e53 failed to detach from server 59b55be0-0ac9-42a1-a084-592b78fd1b50 within the required time (300 s) from the compute API perspective

nova-compute.log excerpts [2]:

First detach attempt failure:

2022-01-05 12:58:42.735 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Failed to detach device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. Libvirt did not report any error but the device is still in the config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2399

Retry of detach attempt try 2 out of 8:

2022-01-05 12:58:42.736 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] (2/8): Attempting to detach device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2381

This repeats more times until attempt 7 fails and attempt 8 is started:

2022-01-05 13:01:12.869 2 ERROR nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Waiting for libvirt event about the detach of device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 is timed out.
2022-01-05 13:01:12.873 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Failed to detach device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. Libvirt did not report any error but the device is still in the config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2399
2022-01-05 13:01:12.873 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498...

Taking an example from comment 0 [1], this test failed because the guest refused to detach a volume from its live domain config.

Test result:

test_update_attached_volume_with_nonexistent_volume_in_body[id-7dcac15a-b107-46d3-a5f6-cb863f4e454a,negative]

traceback-1: {{{
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/tempest/common/waiters.py", line 312, in wait_for_volume_resource_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: volume b2ce2e09-c9a9-46d3-9fb4-99f6ff188e53 failed to reach available status (current in-use) within the required time (300 s).
}}}

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/tempest/common/waiters.py", line 380, in wait_for_volume_attachment_remove_from_server
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: Volume b2ce2e09-c9a9-46d3-9fb4-99f6ff188e53 failed to detach from server 59b55be0-0ac9-42a1-a084-592b78fd1b50 within the required time (300 s) from the compute API perspective

nova-compute.log excerpts [2]:

First detach attempt failure:

2022-01-05 12:58:42.735 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Failed to detach device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. Libvirt did not report any error but the device is still in the config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2399

Retry of detach attempt try 2 out of 8:

2022-01-05 12:58:42.736 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] (2/8): Attempting to detach device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2381

This repeats more times until attempt 7 fails and attempt 8 is started:

2022-01-05 13:01:12.869 2 ERROR nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Waiting for libvirt event about the detach of device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 is timed out.
2022-01-05 13:01:12.873 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Failed to detach device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. Libvirt did not report any error but the device is still in the config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2399
2022-01-05 13:01:12.873 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] (8/8): Attempting to detach device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2381
2022-01-05 13:01:12.874 2 DEBUG nova.virt.libvirt.guest [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] detach device xml: <disk type="block" device="disk">
  <driver name="qemu" type="raw" cache="none" io="native"/>
  <source dev="/dev/sda"/>
  <target dev="vdb" bus="virtio"/>
  <serial>b2ce2e09-c9a9-46d3-9fb4-99f6ff188e53</serial>
  <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
</disk>
 detach_device /usr/lib/python3.9/site-packages/nova/virt/libvirt/guest.py:440

And that also fails and since the total 8/8 attempts were failed, a traceback is logged:

2022-01-05 13:01:37.885 2 ERROR nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Waiting for libvirt event about the detach of device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 is timed out.
2022-01-05 13:01:37.889 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Failed to detach device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. Libvirt did not report any error but the device is still in the config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2399
2022-01-05 13:01:37.890 2 ERROR nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Run out of retry while detaching device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. Device is still attached to the guest.
2022-01-05 13:01:37.891 2 WARNING nova.virt.block_device [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] [instance: 59b55be0-0ac9-42a1-a084-592b78fd1b50] Guest refused to detach volume b2ce2e09-c9a9-46d3-9fb4-99f6ff188e53: nova.exception.DeviceDetachFailed: Device detach failed for vdb: Run out of retry while detaching device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. Device is still attached to the guest.
2022-01-05 13:01:37.959 2 DEBUG ovsdbapp.backend.ovs_idl.vlog [-] [POLLIN] on fd 28 __log_wakeup /usr/lib64/python3.9/site-packages/ovs/poller.py:263
2022-01-05 13:01:38.083 2 DEBUG oslo_concurrency.lockutils [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Lock "59b55be0-0ac9-42a1-a084-592b78fd1b50" "released" by "nova.compute.manager.ComputeManager.detach_volume.<locals>.do_detach_volume" :: held 200.511s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:400
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Exception during message handling: nova.exception.DeviceDetachFailed: Device detach failed for vdb: Run out of retry while detaching device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. Device is still attached to the guest.
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/exception_wrapper.py", line 71, in wrapped
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     _emit_versioned_exception_notification(
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     self.force_reraise()
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     raise self.value
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/exception_wrapper.py", line 63, in wrapped
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/utils.py", line 1437, in decorated_function
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 211, in decorated_function
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     compute_utils.add_instance_fault_from_exc(context,
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     self.force_reraise()
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     raise self.value
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 200, in decorated_function
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 7232, in detach_volume
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     do_detach_volume(context, volume_id, instance, attachment_id)
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 391, in inner
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 7229, in do_detach_volume
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     self._detach_volume(context, bdm, instance,
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 7180, in _detach_volume
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     driver_bdm.detach(context, instance, self.volume_api, self.driver,
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/block_device.py", line 462, in detach
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     self._do_detach(context, instance, volume_api, virt_driver,
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/block_device.py", line 394, in _do_detach
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     self.driver_detach(context, instance, volume_api, virt_driver)
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/block_device.py", line 326, in driver_detach
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     volume_api.roll_detaching(context, volume_id)
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     self.force_reraise()
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     raise self.value
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/block_device.py", line 314, in driver_detach
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     virt_driver.detach_volume(context, connection_info, instance, mp,
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 2591, in detach_volume
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     self._detach_with_retry(
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 2331, in _detach_with_retry
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     self._detach_from_live_with_retry(
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 2410, in _detach_from_live_with_retry
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server     raise exception.DeviceDetachFailed(
2022-01-05 13:01:38.156 2 ERROR oslo_messaging.rpc.server nova.exception.DeviceDetachFailed: Device detach failed for vdb: Run out of retry while detaching device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. Device is still attached to the guest.

The above ^ causes the test to fail as the volume never detaches to return to available status.

[1] https://logserver.rdoproject.org/67/36267/44/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/a1a7959/logs/undercloud/var/log/tempest/stestr_results.html.gz
[2] https://logserver.rdoproject.org/67/36267/44/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/a1a7959/logs/overcloud-novacompute-0/var/log/containers/nova/nova-compute.log.1.gz

Revision history for this message

Artom Lifshitz (notartom) wrote on 2022-01-13:

#7

The more I look at this, the more I suspect these are all caused by the guest OS not coming up correctly.

Going by the first (chronologically) test that failed, it's test_verify_created_server_ephemeral_disk starting at 2022-01-05 07:56:44,753. We're also lucky in the sense that the test doesn't do anything with volumes or other device attachments, it's a pure "boot server and wait for it to become validatable before doing other things" test.

Well, the server appears to boot fine, but when the validation code attempts to get its console output, it keeps coming up empty, and this fails the test. The server boot request ID is req-ce7cf376-140c-411b-969d-cbad243dab59, and it appears to work. From the generated XML, its libvirt instance name is instance-0000000e. Its qemu instance log doesn't show anything interesting:

<snip>
-msg timestamp=on
char device redirected to /dev/pts/0 (label charserial0)
2022-01-05T13:07:16.488524Z qemu-kvm: terminating on signal 15 from pid 20662 (/usr/sbin/virtqemud)
2022-01-05 13:07:16.685+0000: shutting down, reason=destroyed

This fits with Mel's previous comment - they both imply the guest OS not actually running properly. This is most likely caused by use of the q35 machine type, though to my knowledge it's the first time we see it happen without volumes being attached and/or detached on a "plain" instance.

We'll need to get the qemu folks involved, and we can figure out the best way to either reproduce this consistently, or to gather the required information from a failing CI run.

Revision history for this message

Artom Lifshitz (notartom) wrote on 2022-01-13:

#8

Yeah, this is looking identical to [1], which itself is a duplicate of [2].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2024662
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2007129

Revision history for this message

Marios Andreou (marios-b) wrote on 2022-01-14 (last edit on 2022-01-14):

#9

adding here for tracking ... chkumar moved the test_network_basic_ops to skiplist with https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/824119

the tests failing in the more recent runs, like in comment #3 above are pointing to this being duplicate for LP https://bugs.launchpad.net/tripleo/+bug/1957031

I don't think we'll root cause this today so we will move the remaining tests to skiplist too with https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/824689

results from latest periodic run @ [1] has these failing:

        * tearDownClass (tempest.api.compute.servers.test_server_actions
        * tearDownClass (tempest.api.compute.servers.test_server_rescue_negative
        * tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON
        * tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON
        * tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON
        * tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main-cs9/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/9001ddc/logs/undercloud/var/log/tempest/stestr_results.html.gz

Revision history for this message

David Vallee Delisle (valleedelisle) wrote on 2022-01-14:

#10

After nailing a few issues regarding cold migration broken because ssh client is now calling sftp-server instead of scp, and also some selinux issue, I finally can reproduce some of these volumes errors.

The one I'm working on right now is trying to detach a volume after rebuilding the server. So far, it looks like the volume was used as the boot device after the rebuild. I still need to confirme, validate and reproduce more, but we're onto something here.

Revision history for this message

David Vallee Delisle (valleedelisle) wrote on 2022-01-18:

#11

I believe the volume detach problem is a bug with libvirt 7.9.0 / 7.10.0, when used in combination with qemu 6.2.0.

https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg00542.html

Workaround is to add capability_filters = [ "device.json" ] to qemu.conf on the compute. There's no puppet or THT parameter for this unfortunately so I added this after the deployment:

ansible -i /home/stack/overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m shell -ba "grep -q '^capability_filter' /var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/qemu.conf || (echo 'capability_filters = [ \"device.json\" ]' >> /var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/qemu.conf && systemctl restart tripleo_nova_virtqemud)" Compute

This is until danpb's patch is merged.

I confirmed that it works in my environment.

Revision history for this message

Marios Andreou (marios-b) wrote on 2022-01-20:

#12

The fix in comment #11 looks like it merged @ https://gitlab.com/qemu-project/qemu/-/commit/64b4529a432507ee84a924be69a03432639e87ba

I ran the job with a revert on the skips and results look good at https://review.rdoproject.org/r/c/testproject/+/36267/48#message-8f96aaed486c8bfe6bf637d5bc70eb1c887563e8

e.g. @ https://logserver.rdoproject.org/67/36267/48/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/56bf173/logs/undercloud/var/log/tempest/stestr_results.html.gz

        * test_attach_volume_shelved_or_offload_server[id-13a940b6-3474-4c3c-b03f-29b89112bfee,slow] pass
        * test_attach_detach_volume[id-52e9045a-e90d-4c0d-9087-79d657faffff,slow] pass
        * test_rebuild_server_with_volume_attached[id-b68bd8d6-855d-4212-b59b-2e704044dace,slow,volume] pass
        * test_rescued_vm_detach_volume[id-f56e465b-fe10-48bf-b75d-646cda3a8bc9,negative,volume] pass

So lets merge that https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/825220

Note that we already merged a related revert at https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/824965

Ronelle Landy (rlandy) on 2022-01-21

Changed in tripleo:
status:	Triaged → Fix Released

tripleo

centos 9 featureset 1 ovb fails multiple tempest compute tests

Bug Description

Other bug subscribers

Remote bug watches