qemu 6.1 in CentOS Stream 8/9 regression in q35, unable to add more than 15 pcie-root-ports

Bug #1950916 reported by Ronelle Landy
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: yoga-1

Bug Description

tripleo-ci-centos-8-standalone (along with other check/gate/periodic tests) is failing tempest:

tempest.scenario.test_network_basic_ops.TestNetworkBasicOps

tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern

The failure reports show connectivity issues:

For tempest.scenario.test_network_basic_ops.TestNetworkBasicOps:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/common/utils/__init__.py", line 70, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_network_basic_ops.py", line 436, in test_network_basic_ops
    self._check_public_network_connectivity(should_connect=True)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/test_network_basic_ops.py", line 214, in _check_public_network_connectivity
    message, server, mtu=mtu)
  File "/usr/lib/python3.6/site-packages/tempest/scenario/manager.py", line 951, in check_vm_connectivity
    msg=msg)
  File "/usr/lib/python3.6/site-packages/unittest2/case.py", line 705, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true : Public network connectivity check failed
Timed out waiting for 192.168.24.187 to become reachable

For tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern:

2021-11-14 05:26:11,087 201102 ERROR [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.164 after 4 attempts. Proxy client: no proxy client
2021-11-14 05:26:11.087 201102 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2021-11-14 05:26:11.087 201102 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/tempest/lib/common/ssh.py", line 112, in _get_ssh_connection
2021-11-14 05:26:11.087 201102 ERROR tempest.lib.common.ssh sock=proxy_chan)
2021-11-14 05:26:11.087 201102 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/paramiko/client.py", line 349, in connect
2021-11-14 05:26:11.087 201102 ERROR tempest.lib.common.ssh retry_on_signal(lambda: sock.connect(addr))
2021-11-14 05:26:11.087 201102 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/paramiko/util.py", line 283, in retry_on_signal
2021-11-14 05:26:11.087 201102 ERROR tempest.lib.common.ssh return function()
2021-11-14 05:26:11.087 201102 ERROR tempest.lib.common.ssh File "/usr/lib/python3.6/site-packages/paramiko/client.py", line 349, in <lambda>
2021-11-14 05:26:11.087 201102 ERROR tempest.lib.common.ssh retry_on_signal(lambda: sock.connect(addr))
2021-11-14 05:26:11.087 201102 ERROR tempest.lib.common.ssh socket.timeout: timed out

tempest.lib.exceptions.SSHTimeout: Connection to the 192.168.24.164 via SSH timed out.
User: cirros, Password: None

Example log is included below:

https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_3c4/periodic/opendev.org/openstack/tripleo-ci/master/tripleo-ci-centos-8-standalone/3c4b9f5/logs/undercloud/var/log/tempest/stestr_results.html

Revision history for this message
Ronelle Landy (rlandy) wrote :

logs/undercloud/var/log/extra/errors.txt shows:

2021-11-14 05:26:11.097 ERROR /var/log/containers/nova/nova-api.log: 11 ERROR oslo.messaging._drivers.impl_rabbit [-] [03e67fe0-daf8-4ac3-8bda-beae53b27e78] AMQP server on standalone.ctlplane.localdomain:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova Traceback (most recent call last):
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova File "/usr/bin/nova-conductor", line 10, in <module>
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova sys.exit(main())
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova File "/usr/lib/python3.6/site-packages/nova/cmd/conductor.py", line 46, in main
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova topic=rpcapi.RPC_TOPIC)
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova File "/usr/lib/python3.6/site-packages/nova/service.py", line 256, in create
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova periodic_interval_max=periodic_interval_max)
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova File "/usr/lib/python3.6/site-packages/nova/service.py", line 116, in __init__
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova self.manager = manager_class(host=self.host, *args, **kwargs)
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 119, in __init__
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova self.compute_task_mgr = ComputeTaskManager()
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova File "/usr/lib/python3.6/site-packages/nova/conductor/manager.py", line 244, in __init__
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova self.report_client = report.SchedulerReportClient()
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 188, in __init__
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova self._client = self._create_client()
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova File "/usr/lib/python3.6/site-packages/nova/scheduler/client/report.py", line 231, in _create_client
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova client = self._adapter or utils.get_sdk_adapter('placement')
2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova File "/usr/lib/python3.6/site-packages/nova/utils.py", line 984, in get_sdk_adapter
2021-11-14 04:48:11.612 ERROR /var/log/con...


Changed in tripleo:
milestone: none → yoga-1
importance: Undecided → Critical
status: New → Triaged
tags: added: ci promotion-blocker
Revision history for this message
Ronelle Landy (rlandy) wrote :

Extracting the error:

2021-11-14 04:48:11.612 ERROR /var/log/containers/nova/nova-conductor.log: 7 ERROR nova openstack.exceptions.NotSupported: The placement service for 192.168.24.3:regionOne exists but does not have any supported versions.

Revision history for this message
Ronelle Landy (rlandy) wrote :

https://trunk.rdoproject.org/centos8-master/report.html

dlrn openstack/neutron is also failing since 11/13

Ronelle Landy (rlandy)
summary: - master check/gate/periodic tests are failing tempest - connectivity
- issues
+ master and wallaby check/gate/periodic tests are failing tempest -
+ connectivity issues
Revision history for this message
Ronelle Landy (rlandy) wrote : Re: master and wallaby check/gate/periodic tests are failing tempest - connectivity issues

https://logserver.rdoproject.org/56/36256/18/check/periodic-tripleo-ci-centos-8-scenario001-standalone-wallaby/beef770/logs/undercloud/var/log/extra/podman/containers/neutron_api/log/neutron/server.log.txt.gz

2021-11-14 20:09:03.802 29 ERROR neutron_lib.callbacks.manager [req-a0e7bce2-e36b-4eef-bfac-eb0deff38b82 368880513bcf4433b237d4a88fb84ffb 5be0afd7bbfb4cfdaaeccca5d14e055c - default default] Error during notification for neutron.services.logapi.logging_plugin.LoggingPlugin._clean_security_group_logs-72805 security_group, after_delete: TypeError: _clean_security_group_logs() got an unexpected keyword argument 'context'
2021-11-14 20:09:03.802 29 ERROR neutron_lib.callbacks.manager Traceback (most recent call last):
2021-11-14 20:09:03.802 29 ERROR neutron_lib.callbacks.manager File "/usr/lib/python3.6/site-packages/neutron_lib/callbacks/manager.py", line 197, in _notify_loop
2021-11-14 20:09:03.802 29 ERROR neutron_lib.callbacks.manager callback(resource, event, trigger, **kwargs)
2021-11-14 20:09:03.802 29 ERROR neutron_lib.callbacks.manager TypeError: _clean_security_group_logs() got an unexpected keyword argument 'context'
2021-11-14 20:09:03.802 29 ERROR neutron_lib.callbacks.manager

Revision history for this message
Amol Kahat (amolkahat) wrote :

Tempest test found the router:
2021-11-15 01:57:27,120 158054 INFO [tempest.lib.common.rest_client] Request (TestNetworkBasicOps:test_network_basic_ops): 201 POST http://192.168.24.3:9696/v2.0/routers 3.188s
2021-11-15 01:57:27,121 158054 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"router": {"name": "tempest-TestNetworkBasicOps-router-1232673168", "admin_state_up": true, "project_id": "08157c21409d440dba084ccabed97461", "external_gateway_info": {"network_id": "aa047c87-117f-404e-98ff-5ccb675ff93a"}}}
    Response - Headers: {'content-type': 'application/json', 'content-length': '690', 'x-openstack-request-id': 'req-bf5c0242-4cea-4480-9dc7-f75cefbb98e2', 'date': 'Mon, 15 Nov 2021 01:57:27 GMT', 'connection': 'close', 'status': '201', 'content-location': 'http://192.168.24.3:9696/v2.0/routers'}
        Body: b'{"router": {"id": "93922895-a67b-4e2a-abee-ddff06583470", "name": "tempest-TestNetworkBasicOps-router-1232673168", "tenant_id": "08157c21409d440dba084ccabed97461", "admin_state_up": true, "status": "ACTIVE", "external_gateway_info": {"network_id": "aa047c87-117f-404e-98ff-5ccb675ff93a", "external_fixed_ips": [{"subnet_id": "34588ca7-5b76-4dee-9909-3926138ce219", "ip_address": "192.168.24.181"}], "enable_snat": true}, "description": "", "availability_zones": [], "availability_zone_hints": [], "routes": [], "flavor_id": null, "tags": [], "created_at": "2021-11-15T01:57:24Z", "updated_at": "2021-11-15T01:57:25Z", "revision_number": 3, "project_id": "08157c21409d440dba084ccabed97461"}}'
2021-11-15 01:57:27,480 158054 INFO [tempest.lib.common.rest_client] Request (TestNetworkBasicOps:test_network_basic_ops): 200 GET http://192.168.24.3:9696/v2.0/subnets?project_id=08157c21409d440dba084ccabed97461&cidr=10.100.0.0%2F28 0.358s
2021-11-15 01:57:27,481 158054 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}

- https://logserver.rdoproject.org/openstack-periodic-integration-main/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-scenario007-standalone-master/8664fc6/logs/undercloud/var/log/tempest/stestr_results.html.gz

At the same time, on the neutron side, the router is missing:
2021-11-15 01:57:26.855 129917 DEBUG neutron.agent.l3.agent [req-bf5c0242-4cea-4480-9dc7-f75cefbb98e2 c7a8f9be62a04c3d8a1407a13c46b26a 08157c21409d440dba084ccabed97461 - - -] Got routers updated notification :['93922895-a67b-4e2a-abee-ddff06583470'] routers_updated /usr/lib/python3.6/site-packages/neutron/agent/l3/agent.py:582
2021-11-15 01:57:26.857 129917 INFO neutron.agent.l3.agent [-] Starting processing update 93922895-a67b-4e2a-abee-ddff06583470, action 3, priority 1, update_id bd1a5495-227d-45a6-91d8-303df87f84fb. Wait time elapsed: 0.001
2021-11-15 01:57:26.858 129917 INFO neutron.agent.l3.agent [-] Starting router update for 93922895-a67b-4e2a-abee-ddff06583470, action 3, priority 1, update_id bd1a5495-227d-45a6-91d8-303df87f84fb. Wait time elapsed: 0.001
2021-11-15 01:57:26.858 129917 DEBUG neutron.common.utils [-] Time...


Revision history for this message
Ronelle Landy (rlandy) wrote :

<slaweq> but I don't think that this is an issue really
<slaweq> it seems for me that vms created there aren't started
<slaweq> if You check console log of any of them VMs there, it is always empty
<slaweq> I booted one VM manually and its console log is empty also
<slaweq> I think that someone from the compute team should take a look into that first

Revision history for this message
Ronelle Landy (rlandy) wrote :

This failure started on 11/13, impacting only wallaby and master; train, ussuri and victoria are still running, as are all the component lines.

Revision history for this message
Ronelle Landy (rlandy) wrote :

https://91ddfc31996f32105a65-e86ceb7af3919e0f43bac254438f726c.ssl.cf5.rackcdn.com/817548/1/gate/tripleo-ci-centos-8-standalone/822b8ee/logs/undercloud/var/log/containers/nova/nova-compute.log

shows:

2021-11-15 11:59:02.487 7 WARNING os_brick.initiator.connectors.nvmeof [req-8e3ef8ff-57bc-462f-99b0-3887fc839c88 72284933f018441c9ac44e33f762c911 fa49b82a9af1415ca9bf83c9e8baa080 - default default] Process execution error in _get_host_uuid: Unexpected error while running command.
Command: blkid overlay -s UUID -o value
Exit code: 2
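
For context: inside the containerized nova-compute the root filesystem device is the "overlay" mount rather than a real block device, so this probe cannot succeed, and blkid exits 2 when the requested token is not found. A quick reproduction sketch (run inside any overlayfs-rooted container):

    $ blkid overlay -s UUID -o value   # 'overlay' is not a block device
    $ echo $?
    2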

Revision history for this message
Artom Lifshitz (notartom) wrote :

Using zuul@38.102.83.147 as an investigation environment, I looked at the failed test_snapshot_pattern test, which has the following traceback:

    Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/tempest/common/utils/__init__.py", line 70, in wrapper
        return f(*func_args, **func_kwargs)
      File "/usr/lib/python3.6/site-packages/tempest/scenario/test_snapshot_pattern.py", line 63, in test_snapshot_pattern
        server=server)
      File "/usr/lib/python3.6/site-packages/tempest/scenario/manager.py", line 1068, in create_timestamp
        username=username)
      File "/usr/lib/python3.6/site-packages/tempest/scenario/manager.py", line 726, in get_remote_client
        linux_client.validate_authentication()
      File "/usr/lib/python3.6/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 31, in wrapper
        return function(self, *args, **kwargs)
      File "/usr/lib/python3.6/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 114, in validate_authentication
        self.ssh_client.test_connection_auth()
      File "/usr/lib/python3.6/site-packages/tempest/lib/common/ssh.py", line 216, in test_connection_auth
        connection = self._get_ssh_connection()
      File "/usr/lib/python3.6/site-packages/tempest/lib/common/ssh.py", line 128, in _get_ssh_connection
        password=self.password)
    tempest.lib.exceptions.SSHTimeout: Connection to the 192.168.24.152 via SSH timed out.
    User: cirros, Password: None

That test's boot request was:

    2021-11-15 12:02:30,178 315743 INFO [tempest.lib.common.rest_client] Request (TestSnapshotPattern:test_snapshot_pattern): 202 POST http://192.168.24.3:8774/v2.1/servers 2.413s

With request ID req-301843dd-1ad1-4548-a5d4-d1f10f57b5bb:

        Response - Headers: {'date': 'Mon, 15 Nov 2021 12:02:27 GMT', 'server': 'Apache', 'content-length': '400', 'location': 'http://192.168.24.3:8774/v2.1/servers/e0021d5d-daa0-40bb-9be6-f2d203dc222e', 'openstack-api-version': 'compute 2.1', 'x-openstack-nova-api-version': '2.1', 'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', 'x-openstack-request-id': 'req-301843dd-1ad1-4548-a5d4-d1f10f57b5bb', 'x-compute-request-id': 'req-301843dd-1ad1-4548-a5d4-d1f10f57b5bb', 'connection': 'close', 'content-type': 'application/json', 'status': '202', 'content-location': 'http://192.168.24.3:8774/v2.1/servers'}

The instance ID is e0021d5d-daa0-40bb-9be6-f2d203dc222e:

2021-11-15 12:02:31.607 7 INFO nova.compute.claims [req-301843dd-1ad1-4548-a5d4-d1f10f57b5bb 091e701229054f2793ed6f17dc4cf675 8ee1e30c6e7f4fa0a346af8667b600db - default default] [instance: e0021d5d-daa0-40bb-9be6-f2d203dc222e] Claim successful on node standalone.localdomain

As far as nova-compute knows, it built just fine:

2021-11-15 12:02:45.126 7 INFO nova.compute.manager [req-301843dd-1ad1-4548-a5d4-d1f10f57b5bb 091e701229054f2793ed6f17dc4cf675 8ee1e30c6e7f4fa0a346af8667b600db - default default] [instance:
e0021d5d-daa0-40bb-9be6-f2d203dc222e] Took 13.59 seconds to build instance.

Nothing suspicious in the instance qemu log at qemu/instance-00000001.log either:

[...]
char device redirected to /dev/pts/0 (label charserial...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart/+/818043

Revision history for this message
Marios Andreou (marios-b) wrote : Re: master and wallaby check/gate/periodic tests are failing tempest - connectivity issues

ykarel's patch in comment #13 is trying to pin to the 'older' (working) versions of libvirt/qemu

10:55 < marios> ykarel: so you think it is to do with some conflicting versions of those packages ? coming from different repos?
10:56 < ykarel> marios, i just think it's those newer versions that caused the issue, the proposed fix is just a workaround to clear gates until an actual fix is available
10:57 < ykarel> based on some of my local tests this should clear the ci
10:57 < ykarel> just waiting for ci result, but i believe it will pass

hopefully that can unblock us until the issue with the newer versions is resolved; waiting for ykarel to confirm his tests, then we can try to merge it

Revision history for this message
yatin (yatinkarel) wrote :

https://review.opendev.org/c/openstack/tripleo-quickstart/+/818043 is a temporary patch to exclude libvirt/qemu from the AppStream repo to unblock the CI.

Based on some local tests, excluding qemu-kvm alone was enough, but it's fine to exclude libvirt too, considering the workaround is temporary.

Good version: qemu-kvm-6.0.0-33.el8s.x86_64
Bad version: qemu-kvm-6.1.0-4.module_el8.6.0+983+a7505f3f.x86_64

Also, checking further, it seems the issue is specific to the q35 machine type, which has been set for TripleO deployments[1] since wallaby. Just to test, I tried manually changing the machine type to the victoria equivalent and the instance booted fine. Also NOTE: the issue is not seen in devstack/packstack/puppet jobs, as those don't have the q35 machine type set.

I think someone from Compute will have more insights on why the latest qemu-kvm has issues with q35 machine type instances. The workaround can be reverted once the issue is root-caused and fixed properly.

[1] https://github.com/openstack/tripleo-heat-templates/commit/4efd15e15a2db4307bed625cf85973ef093a2b6e
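
For reference, a minimal sketch of how this machine type is wired up (the heat parameter name is an assumption based on [1]; the nova.conf rendering is illustrative):

    # TripleO default since wallaby, per [1]:
    parameter_defaults:
      NovaHWMachineType: x86_64=q35

    # which renders on the computes as:
    [libvirt]
    hw_machine_type = x86_64=q35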

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

After digging into this issue, I have some thoughts. Some facts:

- The new version of qemu-kvm pushed to CentOS Stream 8 breaks oooq jobs, as detailed in previous comments.
- If I understood correctly from conversation in #oooq, all reviews in tripleo repos create new containers from the current-tripleo dlrn repo + the current state of deps, OS, and other dependencies.
- In this case, recreating containers on each review made all jobs fail, not only promotion ones but also check and gate ones.
- As per conversations, this is a design decision, where rebuilding and serving containers (from a provider job) was preferred over pulling and doing a controlled update (only the packages being reviewed) because of problems with rate limiting in the registry.

And now, just my opinion (as an external non-CI-team guy):

- Rebuilding the containers on check/gate jobs breaks the idea of promoting content, and limits it to only dlrn content.
- One of the advantages of including containers in the promotions was to extend the isolation of changes beyond the dlrn content to the whole content of the containers. This can still have issues with packages installed on the hosts themselves, but provides better protection against OS or deps changes.

I understand that this was evaluated when making the decision to rebuild containers on each review, so this behaviour was expected, but it may still be good to think about alternatives (using a different registry?); there may be a viable one.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@Alfredo, before we rebuilt containers, they were pulled from docker.io and updated on the host with the same deps and OS packages. It's not a question of rebuilding them or pulling from a registry, but of how they are updated after the host gets all of them.
We provide a list of repos that should be active when updating the containers, so that they can include the built packages and the newest tripleo code. This was always the way, and Stream is known for its CI breakages.
We have an option to disable OS repos (or any others) when updating the containers, but I'm afraid it wouldn't be supported in the community.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

note: there should be no libvirt on hosts, only in containers

Ronelle Landy (rlandy)
summary: master and wallaby check/gate/periodic tests are failing tempest -
- connectivity issues
+ updated libvirt/qemu to 6.1 from CentOS appstream
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote : Re: master and wallaby check/gate/periodic tests are failing tempest - updated libvirt/qemu to 6.1 from CentOS appstream

Please attach these things to the bug:

- The buggy libvirt guest XML
- The full QEMU command-line of the above guest XML (/var/log/libvirt/qemu/instance-yyyyyyy.log)

So, I was told that with QEMU 6.0 the issue goes away (i.e. the console log is created); however, with QEMU 6.1 the console log is not created.

If we know the buggy QEMU version and the working QEMU version, the QEMU folks can `git bisect` QEMU to find the problem commit.
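
For illustration, a sketch of such a bisect (the repository URL and release tags are upstream QEMU's; the per-step reproduction is an assumption):

    git clone https://gitlab.com/qemu-project/qemu.git && cd qemu
    git bisect start
    git bisect bad v6.1.0     # known-bad version from this bug
    git bisect good v6.0.0    # known-good version
    # at each step: build, boot a q35 guest that reproduces the failure,
    # then mark the result until git prints the first bad commit:
    git bisect good           # or: git bisect bad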

Revision history for this message
Alfredo Moralejo (amoralej) wrote :

My understanding is that, before moving to rebuilding all containers, the package update was done in a way that updated only the packages in the temporary repo built by dlrn in the same job and the tripleo packages in delorean-current.

There was some logic in the tooling to not update deps and OS packages in the container. I remember we hit some issues with some of the tooling, but I'm pretty sure the update was done in a controlled way.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart/+/818043
Committed: https://opendev.org/openstack/tripleo-quickstart/commit/1d5262a32a2cd8d2b6a3e4659812641f49e17658
Submitter: "Zuul (22348)"
Branch: master

commit 1d5262a32a2cd8d2b6a3e4659812641f49e17658
Author: yatinkarel <email address hidden>
Date: Tue Nov 16 10:41:05 2021 +0530

    Exclude libvirt/qemu from AppStream repo

    libvirt/qemu are rebased to a newer version in
    the AppStream repo than advanced-virtualization, and
    likely due to that, jobs started failing. Somehow
    jobs are passing in releases before wallaby even
    with the newer version.

    Until it gets root-caused and fixed, temporarily
    exclude libvirt/qemu from the AppStream repo.

    Related-Bug: #1950916
    Change-Id: Ia6a9e01ca2adbde1e7a0b7cce9fe5842a0d54b1b
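
The exclusion itself is a dnf repo-level package exclude; a minimal sketch of the effect on the repo configuration (repo id and file name are illustrative, not the patch's actual implementation):

    # /etc/yum.repos.d/CentOS-Stream-AppStream.repo (sketch)
    [appstream]
    name=CentOS Stream 8 - AppStream
    ...
    excludepkgs=libvirt* qemu-kvm*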

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ci (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/818139

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ci (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ci/+/818139
Committed: https://opendev.org/openstack/tripleo-ci/commit/dd077fbaaf5cd90345e2daa8fbe510dec80059d9
Submitter: "Zuul (22348)"
Branch: master

commit dd077fbaaf5cd90345e2daa8fbe510dec80059d9
Author: Ronelle Landy <email address hidden>
Date: Tue Nov 16 15:47:28 2021 -0500

    Exclude libvirt/qemu from AppStream: container

    libvirt/qemu are rebased to a newer version in
    the AppStream repo than advanced-virtualization, and
    likely due to that, jobs started failing. Somehow
    jobs are passing in releases before wallaby even
    with the newer version.

    This patch is a temporary fix to
    exclude libvirt/qemu from the AppStream repo.

    Change-Id: Iac4425d655df2f519f5b467e3754dad08ac9b1f5
    Related-Bug: #1950916

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)
Revision history for this message
Ronelle Landy (rlandy) wrote : Re: master and wallaby check/gate/periodic tests are failing tempest - updated libvirt/qemu to 6.1 from CentOS appstream

A second fix was required for container builds: https://review.opendev.org/c/openstack/tripleo-ci/+/818139

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Along with using q35, these jobs are also using accel=tcg ([libvirt]virt_type=qemu) and might be seeing increased memory pressure during tempest runs due to https://bugs.launchpad.net/nova/+bug/1949606 (QEMU >= 5.0.0 with -accel tcg uses a tb-size of 1GB, causing OOM issues in CI). Others have already ruled that out as an underlying cause, but I think we should keep it in mind, especially when combined with q35.
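
For reference, the combination described above corresponds roughly to this nova.conf on the computes (a sketch; the tb-size cap is a QEMU command-line property, not something these jobs set):

    [libvirt]
    virt_type = qemu              # plain QEMU, i.e. accel=tcg, no KVM
    hw_machine_type = x86_64=q35

    # QEMU's TCG translation buffer can be capped on its own command
    # line, e.g. (size in MiB, value illustrative):
    #   qemu-system-x86_64 -machine q35 -accel tcg,tb-size=128 ...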

Revision history for this message
yatin (yatinkarel) wrote :

<< nvm it looks like this is https://bugzilla.redhat.com/show_bug.cgi?id=2006409 with another workaround being tested here https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/818229

The workaround(NovaLibvirtNumPciePorts: 12) works fine[1] with qemu-kvm-6.1.0 + q35 instances for CentOS8 stream too.

[1] https://logserver.rdoproject.org/72/36772/6/check/tripleo-ci-centos-8-standalone/b6c7cfd/logs/undercloud/var/log/tempest/stestr_results.html.gz
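
A sketch of the workaround as a TripleO environment entry; it maps to nova's [libvirt]num_pcie_ports, keeping guests below the 15-port threshold where the regression bites:

    parameter_defaults:
      NovaLibvirtNumPciePorts: 12

    # equivalent nova.conf on the computes:
    [libvirt]
    num_pcie_ports = 12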

Revision history for this message
Alan Pevec (apevec) wrote :

An existing RHEL 8 rhbz is https://bugzilla.redhat.com/show_bug.cgi?id=2007129;
the new one filed by oVirt is likely a dup.

summary: - master and wallaby check/gate/periodic tests are failing tempest -
- updated libvirt/qemu to 6.1 from CentOS appstream
+ qemu 6.1 in CentOS Stream 8/9 regression in q35, unable to add more than
+ 15 pcie-root-ports
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/822344
Committed: https://opendev.org/openstack/tripleo-quickstart-extras/commit/37a64ea0ca7d5f409dd46b38a534a4a863041ce9
Submitter: "Zuul (22348)"
Branch: master

commit 37a64ea0ca7d5f409dd46b38a534a4a863041ce9
Author: Amol Kahat <email address hidden>
Date: Mon Dec 20 22:32:50 2021 +0530

    set NovaLibvirtNumPciePorts to 12

    The new qemu-kvm-6.1.0 version is buggy and was excluded in master
    and wallaby (Ia6a9e01ca2adbde1e7a0b7cce9fe5842a0d54b1b,
    Iac4425d655df2f519f5b467e3754dad08ac9b1f5).
    The new version impacted q35 machines on CentOS Stream 8 and 9.

    qemu-kvm-6.1.0 + NovaLibvirtNumPciePorts: 12 + q35 instances
    on CentOS 8 and Stream work fine.

    Related-Bug: #1950916
    Signed-off-by: Amol Kahat <email address hidden>
    Change-Id: I39c00edd97e321963d1ced74dcf5a2a27fa032cc

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by "Takashi Kajinami <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/818229
Reason: This looks like a stale DNM patch. Abandoning it to clean up our queue.

Revision history for this message
Alan Pevec (apevec) wrote :

The CS9 fix is qemu-kvm-6.2.0-1.el9.

Alan Pevec (apevec)
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ci (master)

Change abandoned by "Ghanshyam <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ci/+/818904
Reason: The TripleO project is retiring now; for details, please see https://review.opendev.org/c/openstack/governance/+/905145 or reach out to the OpenStack TC.

