[centos-9-stream] jobs fail as nova-compute gets stuck at libvirt connect since systemd-252-16.el9

Bug #2029335 reported by yatin
This bug affects 2 people
Affects  | Status       | Importance | Assigned to | Milestone
Ironic   | Invalid      | Undecided  | Unassigned  |
devstack | Fix Released | Undecided  | Unassigned  |
neutron  | New          | Medium     | yatin       |

Bug Description

CentOS 9-stream jobs fail as below:-
2023-08-01 04:38:23.344538 | controller | + functions:wait_for_compute:494 : echo 'Didn'\''t find service registered by hostname after 60 seconds'
2023-08-01 04:38:23.344586 | controller | Didn't find service registered by hostname after 60 seconds
2023-08-01 04:38:23.347431 | controller | + functions:wait_for_compute:495 : openstack --os-cloud devstack-admin --os-region RegionOne compute service list
2023-08-01 04:38:24.703419 | controller | +--------------------------------------+----------------+--------------+----------+---------+-------+----------------------------+
2023-08-01 04:38:24.703477 | controller | | ID | Binary | Host | Zone | Status | State | Updated At |
2023-08-01 04:38:24.703483 | controller | +--------------------------------------+----------------+--------------+----------+---------+-------+----------------------------+
2023-08-01 04:38:24.703488 | controller | | f00443c2-4813-4f38-b13e-8694c6cabe58 | nova-conductor | np0034822320 | internal | enabled | up | 2023-08-01T04:38:23.000000 |
2023-08-01 04:38:24.703492 | controller | | ed6b12c6-25ee-46a8-b3ae-2ccf523ae39e | nova-scheduler | np0034822320 | internal | enabled | up | 2023-08-01T04:38:15.000000 |
2023-08-01 04:38:24.703497 | controller | | de8e03bc-140f-4ee9-ba3a-dc3f11807ec6 | nova-conductor | np0034822320 | internal | enabled | up | 2023-08-01T04:38:21.000000 |
2023-08-01 04:38:24.703501 | controller | +--------------------------------------+----------------+--------------+----------+---------+-------+----------------------------+
2023-08-01 04:38:24.915144 | controller | + functions:wait_for_compute:497 : return 124
2023-08-01 04:38:24.918022 | controller | + lib/nova:is_nova_ready:1 : exit_trap
2023-08-01 04:38:24.920886 | controller | + ./stack.sh:exit_trap:550 : local r=124

This happens because nova-compute gets stuck at startup while connecting to libvirt: qemu:///system, so it never registers as a compute service.
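The symptom in the log above is that nova-compute is missing from the `compute service list` output. A minimal sketch of checking for that, assuming the table text has already been stripped of the Zuul timestamp/host prefix (the function name and the trimmed SAMPLE table are illustrative, not part of devstack):

```python
def registered_binaries(table_text):
    """Return the set of values found in the 'Binary' column of a CLI table."""
    binary_col = None
    binaries = set()
    for raw in table_text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # skip "+----+" borders and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if binary_col is None:
            # Rows before the header (the row naming the 'Binary' column) are ignored.
            if "Binary" in cells:
                binary_col = cells.index("Binary")
            continue
        if binary_col < len(cells):
            binaries.add(cells[binary_col])
    return binaries


# Trimmed, illustrative sample shaped like the output in the log above.
SAMPLE = """\
+----+----------------+--------------+
| ID | Binary         | Host         |
+----+----------------+--------------+
| 1  | nova-conductor | np0034822320 |
| 2  | nova-scheduler | np0034822320 |
+----+----------------+--------------+
"""

if __name__ == "__main__":
    print("nova-compute" in registered_binaries(SAMPLE))  # False
```

With the failing job's output, only nova-conductor and nova-scheduler are found, which matches the "Didn't find service registered by hostname" message.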

Example logs:-
- https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e50/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-centos-9-stream/e5086a4/controller/logs/devstacklog.txt
- https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e50/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-centos-9-stream/e5086a4/controller/logs/screen-n-cpu.txt
- https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e50/periodic/opendev.org/openstack/neutron/master/neutron-ovn-tempest-ovs-master-centos-9-stream/e5086a4/controller/logs/services.txt
libvirtd status shows:- Main PID: 52485 (code=exited, status=0/SUCCESS)

This is a regression introduced in systemd by https://github.com/systemd/systemd/commit/ff32060f2ed37b68dc26256b05e2e69013b0ecfe, which is included in systemd-252-16.el9 in CentOS 9-stream.

It has been reverted upstream in systemd as part of https://github.com/systemd/systemd/pull/28000

This is a known issue in RHEL 9; the backport of the systemd revert is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=2225667

The workaround for now can be any of:-
- Downgrade systemd to a known-good version, i.e. one older than systemd-252-16.el9 - to avoid the regression
- Restart libvirtd - to trigger respawn of processes
- Kill the dnsmasq processes - to trigger respawn of processes
- Configure libvirtd to not use --timeout 120 so the process doesn't exit
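For the last workaround, a rough sketch of stripping the timeout option from a libvirtd environment file. It assumes the option is passed through a `LIBVIRTD_ARGS=` line in a sysconfig-style file (for example /etc/sysconfig/libvirtd); that variable name and path are assumptions, not something stated in this report, and this is not necessarily how the devstack change does it. The function only transforms text; writing the file back and restarting libvirtd are left out.

```python
import re


def drop_libvirtd_timeout(config_text):
    """Remove a '--timeout <seconds>' option from LIBVIRTD_ARGS lines.

    Assumption: the timeout is configured via a LIBVIRTD_ARGS= line.
    Commented-out lines and all other settings are left untouched.
    """
    out = []
    for line in config_text.splitlines():
        if line.lstrip().startswith("LIBVIRTD_ARGS="):
            # Matches both "--timeout 120" and "--timeout=120".
            line = re.sub(r"\s*--timeout[ =]\d+", "", line)
        out.append(line)
    trailing_newline = "\n" if config_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline


if __name__ == "__main__":
    print(drop_libvirtd_timeout('LIBVIRTD_ARGS="--timeout 120"\n'), end="")
    # LIBVIRTD_ARGS=""
```

libvirtd would need a restart afterwards for the new arguments to take effect.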

Builds:- https://zuul.openstack.org/builds?job_name=tempest-full-centos-9-stream&job_name=devstack-platform-centos-9-stream&job_name=neutron-ovn-tempest-ovs-master-centos-9-stream&job_name=neutron-ovn-tempest-ovs-release-fips&branch=master&skip=0

yatin (yatinkarel) wrote :

Proposed workaround:- https://review.opendev.org/c/openstack/devstack/+/890280, the job is passing with it now:-
devstack-platform-centos-9-stream https://zuul.opendev.org/t/openstack/build/04d4007c56634b71bbf929399f338996 : SUCCESS in 1h 21m 36s (non-voting)

Riccardo Pittau (rpittau) wrote :
OpenStack Infra (hudson-openstack) wrote : Related fix merged to devstack (master)

Reviewed: https://review.opendev.org/c/openstack/devstack/+/890280
Committed: https://opendev.org/openstack/devstack/commit/113689ee4694de20c019735fdace447225aa18f7
Submitter: "Zuul (22348)"
Branch: master

commit 113689ee4694de20c019735fdace447225aa18f7
Author: yatinkarel <email address hidden>
Date: Wed Aug 2 12:58:45 2023 +0530

    Woraround systemd issue on CentOS 9-stream

    systemd-252-16.el9 introduced a regression
    where libvirtd process exits after 120s of
    inactivity.
    Add a workaround to unset 120s timeout for
    libvirtd, the workaround can be removed once
    the fix is available in systemd rpm.

    Related-Bug: #2029335
    Change-Id: Id6db6c17518b54d5fef7c381c509066a569aff6d

yatin (yatinkarel) wrote :

New build of systemd https://kojihub.stream.centos.org/koji/buildinfo?buildID=35705 available with the required revert, next compose will have it included.

Changed in neutron:
importance: Undecided → Medium
assignee: nobody → yatin (yatinkarel)
yatin (yatinkarel) wrote :

<< New build of systemd https://kojihub.stream.centos.org/koji/buildinfo?buildID=35705 available with the required revert, next compose will have it included.

The repos now include the fix:- http://mirror.rackspace.com/centos-stream/9-stream/BaseOS/x86_64/os/Packages/systemd-252-17.el9.x86_64.rpm
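Going by the versions named in this report (regression introduced in systemd-252-16.el9, revert shipped in systemd-252-17.el9), a node can be classified from its systemd package name. This is a simplified sketch: the helper names are made up for illustration, and the comparison only looks at the numeric version and release of el9 builds rather than doing a full RPM version comparison.

```python
import re

FIRST_BAD = (252, 16)    # systemd-252-16.el9 introduced the regression (per this report)
FIRST_FIXED = (252, 17)  # systemd-252-17.el9 carries the revert (per this report)


def parse_systemd_nvr(nvr):
    """Extract (version, release) integers from a name like 'systemd-252-16.el9'."""
    match = re.match(r"^systemd-(\d+)-(\d+)\.el9", nvr)
    if match is None:
        raise ValueError("not an el9 systemd package name: %r" % nvr)
    return int(match.group(1)), int(match.group(2))


def is_affected(nvr):
    """True if the build falls in the range this report identifies as broken."""
    return FIRST_BAD <= parse_systemd_nvr(nvr) < FIRST_FIXED


if __name__ == "__main__":
    print(is_affected("systemd-252-16.el9"))             # True
    print(is_affected("systemd-252-17.el9.x86_64.rpm"))  # False
```

A check like this could gate the libvirtd timeout workaround so that it is only applied on nodes that still carry the affected build.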

yatin (yatinkarel) wrote :
Changed in devstack:
status: New → Fix Released
Jay Faulkner (jason-oldos) wrote :

Marking invalid, impacted Ironic but fix was in devstack.

Changed in ironic:
status: New → Invalid