tripleo

Periodic wallaby jobs failing in vexxhost

Bug #2023764 reported by Arx Cruz on 2023-06-14

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	tripleo	Fix Released	Critical	Unassigned	tripleo antelope-1

Bug Description

Starting today, a number of Wallaby periodic jobs are failing, all of them running on vexxhost.

Checking some of the logs, the jobs that get as far as tempest hit SSH timeouts or SSH connection errors, while others fail to connect to some service during deploy.

You can see here that most of them are failing:

https://review.rdoproject.org/zuul/builds?pipeline=openstack-periodic-integration-stable1&result=FAILURE&skip=0

Example logs from failing jobs:

https://logserver.rdoproject.org/73/37973/78/check/periodic-tripleo-ci-centos-9-scenario007-multinode-oooq-container-wallaby/ae6d443/job-output.txt

https://logserver.rdoproject.org/73/37973/78/check/periodic-tripleo-ci-centos-9-standalone-wallaby/d29c1b1/logs/undercloud/var/log/tempest/stestr_results.html.gz

https://logserver.rdoproject.org/73/37973/78/check/periodic-tripleo-ci-centos-9-scenario010-kvm-standalone-wallaby/604cf09/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

https://logserver.rdoproject.org/73/37973/78/check/periodic-tripleo-ci-centos-9-scenario007-multinode-oooq-container-wallaby/ae6d443/logs/undercloud/var/log/tempest/stestr_results.html.gz

https://logserver.rdoproject.org/73/37973/78/check/periodic-tripleo-ci-centos-9-scenario003-standalone-wallaby/382aa08/logs/undercloud/var/log/tempest/stestr_results.html.gz

https://logserver.rdoproject.org/73/37973/78/check/periodic-tripleo-ci-centos-9-standalone-on-multinode-ipa-wallaby/564b46b/logs/undercloud/var/log/tempest/stestr_results.html.gz

Arx Cruz (arxcruz) wrote on 2023-06-14:

#1

I'm rerunning these jobs on IBM Cloud to check whether the issue is specific to vexxhost.

Arx Cruz (arxcruz) wrote on 2023-06-14:

#2

Some packages changed between the last good run and the failing run, as can be seen here: https://www.diffchecker.com/ob2CD4jb/

Ronelle Landy (rlandy) on 2023-06-14

Changed in tripleo:
importance:	Undecided → Critical
milestone:	zed-1 → antelope-1

Arx Cruz (arxcruz) wrote on 2023-06-14:

#3

Here's what is happening: when we run a test, the VM is spawned and goes to ACTIVE, but it then gets stuck in cloud-init initialization:

currently loaded modules: 8139cp 8390 9pnet 9pnet_virtio ahci drm drm_kms_helper e1000 failover fb_sys_fops hid hid_generic ip_tables isofs libahci mii ne2k_pci net_failover nls_ascii nls_iso8859_1 nls_utf8 pcnet32 qemu_fw_cfg syscopyarea sysfillrect sysimgblt ttm usbhid virtio_blk virtio_gpu virtio_input virtio_net virtio_rng virtio_scsi x_tables
Initializing random number generator... done.
Starting acpid: OK
Starting network: udhcpc: started, v1.29.3
udhcpc: sending discover
udhcpc: sending select for 10.100.0.3
udhcpc: lease of 10.100.0.3 obtained, lease time 43200
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.100.0.1"
OK
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 9.85. request failed
failed 2/20: up 58.99. request failed
failed 3/20: up 108.09. request failed
failed 4/20: up 157.19. request failed
failed 5/20: up 206.29. request failed

Meanwhile, the test tries to SSH into the VM and can't, because the SSH service on the VM has not started yet.

Locally, when I ran the test manually, I was able to ping the VM but not SSH into it, which is consistent with the SSH service not having started.
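
For anyone reproducing this by hand, the two symptoms above (metadata retries in the guest console, and a VM that pings but refuses SSH) can be checked with a small script. This is only a sketch: the function names `metadata_failures` and `tcp_port_open` are made up for illustration, and the regular expression assumes the console lines look exactly like the "failed N/20" lines quoted above.

```python
import re
import socket

# Matches console lines such as: "failed 3/20: up 108.09. request failed"
# (format assumed from the console output quoted in this comment).
_RETRY_RE = re.compile(r"^failed (\d+)/(\d+): up (\d+(?:\.\d+)?)\. request failed$")


def metadata_failures(console_log: str) -> list[tuple[int, int, float]]:
    """Return (attempt, max_attempts, uptime_seconds) for each failed metadata request."""
    hits = []
    for line in console_log.splitlines():
        match = _RETRY_RE.match(line.strip())
        if match:
            hits.append((int(match.group(1)), int(match.group(2)), float(match.group(3))))
    return hits


def tcp_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A VM that answers ping while `tcp_port_open(vm_ip)` returns False, and whose console shows a growing list from `metadata_failures(...)`, matches the behaviour described here. The console text itself would have to be fetched separately, for example from the server's console log.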

Alan Pevec (apevec) wrote on 2023-06-14 (last edit on 2023-06-14):

#4

Thanks to Yatin's investigation, this was narrowed down to a CS9 kernel regression:
LAST GOOD kernel-5.14.0-319.el9 RPM timestamp: 2023-05-27
FIRST BAD kernel-5.14.0-325.el9 RPM timestamp: 2023-06-12

Changelog diff is huge: https://gitlab.com/redhat/centos-stream/rpms/kernel/-/compare/kernel-5.14.0-319.el9...kernel-5.14.0-325.el9
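
To check quickly whether a given node is on a suspect kernel, the two build numbers above can be turned into a small classifier. This is a rough sketch: `classify_kernel` is a hypothetical helper, it only understands 5.14.0-N.el9 release strings, and it assumes that builds up to 319 are fine and that builds from 325 onward still carry the change. Builds 320 to 324 were not tested in this report, so they are reported as such.

```python
import re

# Build numbers taken from the comment above.
LAST_GOOD_BUILD = 319   # kernel-5.14.0-319.el9
FIRST_BAD_BUILD = 325   # kernel-5.14.0-325.el9

_RELEASE_RE = re.compile(r"^(?:kernel-)?5\.14\.0-(\d+)\.el9")


def classify_kernel(release: str) -> str:
    """Classify a CS9 kernel release string against the known good/bad builds."""
    match = _RELEASE_RE.match(release)
    if not match:
        return "unknown"          # not a 5.14.0-N.el9 kernel
    build = int(match.group(1))
    if build <= LAST_GOOD_BUILD:
        return "not affected"     # assumption: older builds predate the change
    if build >= FIRST_BAD_BUILD:
        return "affected"         # assumption: later builds keep the change
    return "untested"             # builds between last good and first bad
```

On a running node, `classify_kernel(platform.release())` (after `import platform`) gives the result for the booted kernel.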

yatin (yatinkarel) wrote on 2023-06-15:

#5

To be more specific, the change [1][2] in kernel-5.14.0-325.el9 triggered this issue. pyroute2 0.6.10 contains the fix [3] for it. The same problem was seen earlier in Fedora [4] and Ubuntu Kinetic [5]. I tested a manual backport of [3] in the impacted environment and the issue no longer occurs. It seems we need to backport the fix into the pyroute2 package.

[1] net: rtnetlink: add bulk delete support flag (Ivan Vecera) [2193176]
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a6cec0bcd34264be8887791594be793b3f12719f
[3] https://github.com/svinota/pyroute2/commit/1eb08312de30a083bcfddfaa9c1d5e124b6368df
[4] https://bugzilla.redhat.com/show_bug.cgi?id=2094986
[5] https://bugs.launchpad.net/ubuntu/+source/pyroute2/+bug/1995469
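
A quick way to see whether an environment already has an upstream pyroute2 release with the fix is to compare the installed version with 0.6.10. The sketch below uses hypothetical helper names and a deliberately simple version parser. It cannot detect a distribution package that carries the fix as a backport on an older version number, so a False result only means "older than 0.6.10", not "definitely broken".

```python
from importlib import metadata

# First upstream pyroute2 release stated in this comment to contain the fix.
FIXED_VERSION = (0, 6, 10)


def parse_version(text: str) -> tuple:
    """Parse the first three numeric components of a version string."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break             # stop at suffixes such as "rc1"
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)           # pad short versions like "0.6"
    return tuple(parts)


def has_upstream_fix(version_text: str) -> bool:
    """True if the version is at least 0.6.10 (backported packages are not detected)."""
    return parse_version(version_text) >= FIXED_VERSION


def installed_pyroute2_version():
    """Return the installed pyroute2 version string, or None if it is not installed."""
    try:
        return metadata.version("pyroute2")
    except metadata.PackageNotFoundError:
        return None
```

For example, `has_upstream_fix(installed_pyroute2_version() or "0")` reports on the current Python environment.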

yatin (yatinkarel) wrote on 2023-06-15:

#6

pyroute2 is being updated in RDO to include the backport: https://review.rdoproject.org/r/q/topic:bug%252F2023764

chandan kumar (chkumar246) wrote on 2023-06-19:

#7

CentOS 9 Wallaby was promoted on Jun 18, 2023. All the affected jobs are passing now, so we can move this bug to Fix Released.

Changed in tripleo:
status:	Triaged → Fix Released


Remote bug watches

redhat-bugs #2094986 [CLOSED ERRATA]