centos 9 featureset 1 ovb fails multiple tempest compute tests

Bug #1957060 reported by Marios Andreou
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

At [1][2] the periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master job (centos9 featureset 1) gets as far as tempest but fails multiple compute tests:

  tempest.api.compute.admin.test_volumes_negative
  tempest.api.compute.servers.test_server_actions
  tempest.api.compute.servers.test_server_rescue_negative
  tempest.api.compute.volumes.test_attach_volume_negative
  tempest.api.compute.admin.test_create_server.ServersWithSpecificFlavorTestJSON
  tempest.api.compute.admin.test_volumes_negative.VolumesAdminNegativeTest
  tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON
  tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesUnderV243Test
  tempest.api.compute.servers.test_create_server.ServersTestBootFromVolume
  tempest.api.compute.servers.test_create_server.ServersTestJSON
  tempest.api.compute.servers.test_create_server.ServersTestManualDisk
  tempest.api.compute.servers.test_device_tagging.TaggedAttachmentsTest
  tempest.api.compute.servers.test_device_tagging.TaggedBootDevicesTest
  tempest.api.compute.servers.test_device_tagging.TaggedBootDevicesTest_v242
  tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON
  tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON
  tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON
  tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON
  tempest.api.compute.volumes.test_attach_volume_negative.AttachVolumeNegativeTest
  tempest.api.volume.test_volumes_snapshots.VolumesSnapshotTestJSON
  tempest.scenario.test_network_basic_ops.TestNetworkBasicOps

There is a repeating error in the compute log at [3] that may be related:

        2022-01-03 12:56:38.526 2 WARNING os_brick.initiator.connectors.nvmeof [req-02eb6c77-807e-4d93-87d9-a2411db486e6 97f5a5419fd24cbb8928c4d11f8fa5fc c0cf6736afe8437ca21454929cac2680 - default default] Process execution error in _get_host_uuid: Unexpected error while running command.
 Command: blkid overlay -s UUID -o value
 Exit code: 2
 Stdout: ''
 Stderr: '': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

[1] https://logserver.rdoproject.org/67/36267/44/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/a1a7959/logs/undercloud/var/log/tempest/stestr_results.html.gz
[2] https://logserver.rdoproject.org/67/36267/44/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/ba83163/logs/undercloud/var/log/tempest/stestr_results.html.gz
[3] https://logserver.rdoproject.org/67/36267/44/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/ba83163/logs/overcloud-novacompute-0/var/log/containers/nova/nova-compute.log.1.gz

Revision history for this message
chandan kumar (chkumar246) wrote :

On the FS02 deployment,
tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops is also failing [1].

```
{0} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops [148.564774s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):

      File "/usr/lib/python3.9/site-packages/tempest/common/utils/__init__.py", line 70, in wrapper
    return f(*func_args, **func_kwargs)

      File "/usr/lib/python3.9/site-packages/tempest/scenario/test_network_basic_ops.py", line 436, in test_network_basic_ops
    self._check_public_network_connectivity(should_connect=True)

      File "/usr/lib/python3.9/site-packages/tempest/scenario/test_network_basic_ops.py", line 212, in _check_public_network_connectivity
    self.check_vm_connectivity(

      File "/usr/lib/python3.9/site-packages/tempest/scenario/manager.py", line 945, in check_vm_connectivity
    self.assertTrue(self.ping_ip_address(ip_address,

      File "/usr/lib64/python3.9/unittest/case.py", line 688, in assertTrue
    raise self.failureException(msg)

    AssertionError: False is not true : Public network connectivity check failed
Timed out waiting for 10.0.0.166 to become reachable

```
By scanning the compute logs [2], I am not able to find anything suspicious.

Logs:
[1]. https://logserver.rdoproject.org/13/36713/23/check/periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-master/ee2e466/logs/undercloud/var/log/tempest/tempest_run.log.txt.gz

[2]. https://logserver.rdoproject.org/13/36713/23/check/periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-master/ee2e466/logs/overcloud-novacompute-0/var/log/extra/podman/containers/nova_compute/log/nova/nova-compute.log.txt.gz

Revision history for this message
Alan Bishop (alan-bishop) wrote :

I would focus on the apparent network issue. The repeating error noted in the description ("Process execution error in _get_host_uuid: Unexpected error while running command.") is a red herring caused by nova's get_volume_connector() [1] calling os-brick's get_connector_properties() function [2].

[1] https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L1682
[2] https://opendev.org/openstack/os-brick/src/branch/master/os_brick/initiator/connector.py#L196

The os-brick code loops over every connection protocol, some of which are likely not available on the underlying hardware. In this case, the log message is due to the nvmeof protocol not being supported. There's a similar message just before it that states, "No Fibre Channel support detected on system."
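In other words, a failed probe for an unavailable transport is logged as a warning and skipped rather than treated as fatal. A minimal sketch of that pattern, with hypothetical names (this is not os-brick's actual implementation):

```python
# Minimal sketch of the pattern described above (hypothetical names, not
# os-brick's real code): probe each transport independently and downgrade a
# failed probe to a warning, since most hosts do not support every transport
# (Fibre Channel, NVMe-oF, ...).
import logging

logging.basicConfig(level=logging.WARNING)
LOG = logging.getLogger(__name__)


def _probe_fibre_channel():
    raise RuntimeError("No Fibre Channel support detected on system.")


def _probe_nvmeof():
    # Stands in for the failing `blkid overlay -s UUID -o value` call behind
    # the _get_host_uuid warning quoted in the bug description.
    raise RuntimeError("Process execution error in _get_host_uuid")


def get_connector_properties_sketch():
    """Collect whatever transport properties this host can provide."""
    props = {}
    for name, probe in [("fibre_channel", _probe_fibre_channel),
                        ("nvmeof", _probe_nvmeof)]:
        try:
            props[name] = probe()
        except Exception as exc:
            # Benign when the transport simply isn't available on this host;
            # this is why the warning in the description is a red herring.
            LOG.warning("%s probe failed, skipping: %s", name, exc)
    return props


if __name__ == "__main__":
    print(get_connector_properties_sketch())  # {} on a host with neither transport
```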

Revision history for this message
Marios Andreou (marios-b) wrote :

Did some digging in the logs at [1] but can't find anything yet ... per comment #2 above I focused initially on the neutron container logs ... there are plenty of errors but they are also present in a 'good' centos8 job - I used [2] for comparison

I checked the compute and controller logs too... one interesting thing I found was that controller2's rabbit has some issues [3]:

        * 2022-01-05 15:32:28.925763+00:00 [error] <0.26873.0> closing AMQP connection <0.26873.0> (172.17.0.181:58096 -> 172.17.0.124:5672 - mod_wsgi:9:cf648468-d776-4edc-8b4b-bea38fed48ae):
2022-01-05 15:32:28.925763+00:00 [error] <0.26873.0> missed heartbeats from client, timeout: 60s

and it repeats several times (but only once in the 'good' log :/)

[1] https://logserver.rdoproject.org/67/36267/44/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/a1a7959/logs/undercloud/var/log/extra/errors.txt.gz
[2] https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/621b120/logs/undercloud/var/log/extra/errors.txt.gz
[3] https://logserver.rdoproject.org/67/36267/44/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/a1a7959<email address hidden>

Revision history for this message
melanie witt (melwitt) wrote :

Taking an example from comment 0 [1], this test failed because the guest refused to detach a volume from its live domain config.

Test result:

test_update_attached_volume_with_nonexistent_volume_in_body[id-7dcac15a-b107-46d3-a5f6-cb863f4e454a,negative]

traceback-1: {{{
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/tempest/common/waiters.py", line 312, in wait_for_volume_resource_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: volume b2ce2e09-c9a9-46d3-9fb4-99f6ff188e53 failed to reach available status (current in-use) within the required time (300 s).
}}}

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/tempest/common/waiters.py", line 380, in wait_for_volume_attachment_remove_from_server
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: Volume b2ce2e09-c9a9-46d3-9fb4-99f6ff188e53 failed to detach from server 59b55be0-0ac9-42a1-a084-592b78fd1b50 within the required time (300 s) from the compute API perspective

nova-compute.log excerpts [2]:

First detach attempt failure:

2022-01-05 12:58:42.735 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Failed to detach device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. Libvirt did not report any error but the device is still in the config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2399

Retry of detach attempt try 2 out of 8:

2022-01-05 12:58:42.736 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] (2/8): Attempting to detach device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2381

This repeats more times until attempt 7 fails and attempt 8 is started:

2022-01-05 13:01:12.869 2 ERROR nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Waiting for libvirt event about the detach of device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 is timed out.
2022-01-05 13:01:12.873 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498b7f2596 - default default] Failed to detach device vdb with device alias virtio-disk1 from instance 59b55be0-0ac9-42a1-a084-592b78fd1b50 from the live domain config. Libvirt did not report any error but the device is still in the config. _detach_from_live_with_retry /usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py:2399
2022-01-05 13:01:12.873 2 DEBUG nova.virt.libvirt.driver [req-e020770a-2ca4-4f89-a4e3-d955214ae522 639091a8b4c8483899343f5019c5545f 605a59a0db6e43e8b4e3a5498...
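The excerpts above show nova's retry pattern: issue the detach, wait a bounded time for libvirt to report the device removed, and try again up to a fixed number of attempts before giving up. A minimal sketch of that loop (hypothetical names and timings, not nova's actual _detach_from_live_with_retry):

```python
# Minimal sketch of the retry loop visible in the log excerpts above
# (hypothetical names/timings, not nova's actual code): request the detach,
# wait for the "device removed" event from libvirt, retry a fixed number of
# times. In this bug the guest never completes the detach, so every attempt
# times out and the volume stays attached, which is what tempest then reports.
import threading

MAX_ATTEMPTS = 8
EVENT_TIMEOUT = 20.0  # seconds to wait for the device-removed event per attempt


def detach_with_retry(request_detach, device_removed: threading.Event) -> bool:
    """request_detach() asks libvirt to detach the device; device_removed is
    set by an event callback once the guest actually releases it."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        print(f"({attempt}/{MAX_ATTEMPTS}): attempting to detach device")
        request_detach()
        if device_removed.wait(EVENT_TIMEOUT):
            return True
        print("Libvirt did not report any error but the device is still "
              "in the config; retrying")
    return False  # caller reports the detach as failed


if __name__ == "__main__":
    # A guest that never acknowledges the detach: the event is never set, so
    # this returns False after roughly MAX_ATTEMPTS * EVENT_TIMEOUT seconds.
    unresponsive_guest = threading.Event()
    print(detach_with_retry(lambda: None, unresponsive_guest))
```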

Revision history for this message
Artom Lifshitz (notartom) wrote :

The more I look at this, the more I suspect these are all caused by the guest OS not coming up correctly.

Going by the first (chronologically) test that failed, it's test_verify_created_server_ephemeral_disk, starting at 2022-01-05 07:56:44,753. We're also lucky in the sense that the test doesn't do anything with volumes or other device attachments; it's a pure "boot server and wait for it to become validatable before doing other things" test.

Well, the server appears to boot fine, but when the validation code attempts to get its console output, it keeps coming up empty, and this fails the test. The server boot request ID is req-ce7cf376-140c-411b-969d-cbad243dab59, and it appears to succeed. From the generated XML, its libvirt instance name is instance-0000000e. Its qemu instance log doesn't show anything interesting:

<snip>
-msg timestamp=on
char device redirected to /dev/pts/0 (label charserial0)
2022-01-05T13:07:16.488524Z qemu-kvm: terminating on signal 15 from pid 20662 (/usr/sbin/virtqemud)
2022-01-05 13:07:16.685+0000: shutting down, reason=destroyed

This fits with Mel's previous comment - they both imply the guest OS is not actually running properly. This is most likely caused by the use of the q35 machine type, though to my knowledge it's the first time we've seen it happen without volumes being attached and/or detached on a "plain" instance.

We'll need to get the qemu folks involved, and we can figure out the best way to either reproduce this consistently, or to gather the required information from a failing CI run.
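The empty-console-output symptom described above is essentially a poll-with-timeout giving up. A minimal sketch of that kind of check, with a hypothetical get_console_output() standing in for the compute API call (this is not tempest's actual validation code):

```python
# Minimal sketch of the validation pattern that fails here (hypothetical
# get_console_output(), not tempest's real code): poll the server's console
# output until something non-empty shows up or a deadline passes. With the
# guest OS never really booting, the output stays empty and the check times out.
import time


def wait_for_console_output(get_console_output, timeout=300.0, interval=5.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        output = get_console_output()
        if output and output.strip():
            return output
        time.sleep(interval)
    raise TimeoutError(f"console output still empty after {timeout}s")
```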

Revision history for this message
Artom Lifshitz (notartom) wrote :

Yeah, this is looking identical to [1], which itself is a duplicate of [2].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2024662
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2007129

Revision history for this message
Marios Andreou (marios-b) wrote (last edit ):

Adding here for tracking ... chkumar moved test_network_basic_ops to the skiplist with https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/824119

The tests failing in the more recent runs, like in comment #3 above, point to this being a duplicate of LP https://bugs.launchpad.net/tripleo/+bug/1957031

I don't think we'll root cause this today, so we will move the remaining tests to the skiplist too with https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/824689

Results from the latest periodic run @ [1] have these failing:

        * tearDownClass (tempest.api.compute.servers.test_server_actions
        * tearDownClass (tempest.api.compute.servers.test_server_rescue_negative
        * tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON
        * tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON
        * tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON
        * tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON

[1] https://logserver.rdoproject.org/openstack-periodic-integration-main-cs9/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/9001ddc/logs/undercloud/var/log/tempest/stestr_results.html.gz

Revision history for this message
David Vallee Delisle (valleedelisle) wrote :

After nailing down a few issues (cold migration broken because the ssh client now calls sftp-server instead of scp, plus an SELinux issue), I can finally reproduce some of these volume errors.

The one I'm working on right now is trying to detach a volume after rebuilding the server. So far, it looks like the volume was used as the boot device after the rebuild. I still need to confirm, validate and reproduce more, but we're onto something here.

Revision history for this message
David Vallee Delisle (valleedelisle) wrote :

I believe the volume detach problem is a bug with libvirt 7.9.0 / 7.10.0, when used in combination with qemu 6.2.0.

https://lists.gnu.org/archive/html/qemu-devel/2022-01/msg00542.html

The workaround is to add capability_filters = [ "device.json" ] to qemu.conf on the compute node. There's no puppet or THT parameter for this, unfortunately, so I added it after the deployment:

ansible -i /home/stack/overcloud-deploy/overcloud/tripleo-ansible-inventory.yaml -m shell -ba "grep -q '^capability_filter' /var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/qemu.conf || (echo 'capability_filters = [ \"device.json\" ]' >> /var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/qemu.conf && systemctl restart tripleo_nova_virtqemud)" Compute

This is until danpb's patch is merged.

I confirmed that it works in my environment.
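For readability, here is a hypothetical Python equivalent of the ansible one-liner above, using the same file path, setting, and service name as the command (run as root on each compute node):

```python
# Hypothetical Python equivalent of the ansible shell one-liner above:
# idempotently append the capability_filters workaround to the
# container-generated qemu.conf and restart virtqemud. Path, setting and
# service name are taken from the command in this comment.
import pathlib
import subprocess

QEMU_CONF = pathlib.Path(
    "/var/lib/config-data/puppet-generated/nova_libvirt/etc/libvirt/qemu.conf")
WORKAROUND = 'capability_filters = [ "device.json" ]\n'

conf = QEMU_CONF.read_text()
if "capability_filter" not in conf:
    QEMU_CONF.write_text(conf + WORKAROUND)
    subprocess.run(["systemctl", "restart", "tripleo_nova_virtqemud"], check=True)
```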

Revision history for this message
Marios Andreou (marios-b) wrote :

The fix in comment #11 looks like it merged @ https://gitlab.com/qemu-project/qemu/-/commit/64b4529a432507ee84a924be69a03432639e87ba

I ran the job with a revert on the skips and results look good at https://review.rdoproject.org/r/c/testproject/+/36267/48#message-8f96aaed486c8bfe6bf637d5bc70eb1c887563e8

e.g. @ https://logserver.rdoproject.org/67/36267/48/check/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-master/56bf173/logs/undercloud/var/log/tempest/stestr_results.html.gz

        * test_attach_volume_shelved_or_offload_server[id-13a940b6-3474-4c3c-b03f-29b89112bfee,slow] pass
        * test_attach_detach_volume[id-52e9045a-e90d-4c0d-9087-79d657faffff,slow] pass
        * test_rebuild_server_with_volume_attached[id-b68bd8d6-855d-4212-b59b-2e704044dace,slow,volume] pass
        * test_rescued_vm_detach_volume[id-f56e465b-fe10-48bf-b75d-646cda3a8bc9,negative,volume] pass

So let's merge that: https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/825220

Note that we already merged a related revert at https://review.opendev.org/c/openstack/openstack-tempest-skiplist/+/824965

Ronelle Landy (rlandy)
Changed in tripleo:
status: Triaged → Fix Released