Periodic wallaby jobs failing in vexxhost

Bug #2023764 reported by Arx Cruz
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
tripleo   Fix Released   Critical     Unassigned

Bug Description

Arx Cruz (arxcruz) wrote :

I'm rerunning these jobs on IBM Cloud to check whether the issue is specific to vexxhost.

Arx Cruz (arxcruz) wrote :

Some packages changed between the last good run and the first failing run, as can be seen here: https://www.diffchecker.com/ob2CD4jb/

Ronelle Landy (rlandy)
Changed in tripleo:
importance: Undecided → Critical
milestone: zed-1 → antelope-1
Arx Cruz (arxcruz) wrote :

Here's what is happening: when we run a test, the VM is spawned and goes to ACTIVE, but it then gets stuck in cloud-init initialization:

currently loaded modules: 8139cp 8390 9pnet 9pnet_virtio ahci drm drm_kms_helper e1000 failover fb_sys_fops hid hid_generic ip_tables isofs libahci mii ne2k_pci net_failover nls_ascii nls_iso8859_1 nls_utf8 pcnet32 qemu_fw_cfg syscopyarea sysfillrect sysimgblt ttm usbhid virtio_blk virtio_gpu virtio_input virtio_net virtio_rng virtio_scsi x_tables
Initializing random number generator... done.
Starting acpid: OK
Starting network: udhcpc: started, v1.29.3
udhcpc: sending discover
udhcpc: sending select for 10.100.0.3
udhcpc: lease of 10.100.0.3 obtained, lease time 43200
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.100.0.1"
OK
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 9.85. request failed
failed 2/20: up 58.99. request failed
failed 3/20: up 108.09. request failed
failed 4/20: up 157.19. request failed
failed 5/20: up 206.29. request failed

Meanwhile, the test tries to SSH into the VM and fails, because the SSH service on the VM has not started yet.

Locally, when I ran the test manually, I was able to ping the VM but not SSH into it, which indicates that the SSH service did not start.
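
The "ping works, SSH does not" observation above can be reproduced with a plain TCP probe of the SSH port. This is only a minimal sketch, not part of the test suite: the function name tcp_port_open and the default timeout are assumptions made for illustration.

```python
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: a False result while ping still works suggests the
    service behind the port (here sshd) is not listening yet.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

Calling tcp_port_open with the VM's reachable address and port 22 returning False, while ping succeeds, would match the behaviour described in this comment.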

Alan Pevec (apevec) wrote :

Thanks to the investigation by Yatin, this was narrowed down to a CS9 (CentOS Stream 9) kernel regression:
LAST GOOD kernel-5.14.0-319.el9 RPM timestamp: 2023-05-27
FIRST BAD kernel-5.14.0-325.el9 RPM timestamp: 2023-06-12

Changelog diff is huge: https://gitlab.com/redhat/centos-stream/rpms/kernel/-/compare/kernel-5.14.0-319.el9...kernel-5.14.0-325.el9
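
To check which side of the LAST GOOD / FIRST BAD boundary a given node is on, the build number can be pulled out of the kernel version string. The sketch below is illustrative only: classify_kernel and its labels are hypothetical, and treating every build at or below 319 as good and every build at or above 325 as bad is an assumption, since only those two builds are reported here.

```python
import re

LAST_GOOD_BUILD = 319  # kernel-5.14.0-319.el9, reported above
FIRST_BAD_BUILD = 325  # kernel-5.14.0-325.el9, reported above

# Matches "kernel-5.14.0-325.el9" as well as `uname -r` output such as
# "5.14.0-325.el9.x86_64".
_CS9_KERNEL = re.compile(r"^(?:kernel-)?5\.14\.0-(\d+)\.el9")


def classify_kernel(version: str) -> str:
    """Classify a CS9 kernel version string relative to the reported range."""
    match = _CS9_KERNEL.match(version)
    if match is None:
        return "unknown"  # not a 5.14.0-N.el9 kernel
    build = int(match.group(1))
    if build <= LAST_GOOD_BUILD:
        return "good"  # assumption: older builds behave like 319
    if build >= FIRST_BAD_BUILD:
        return "bad"  # assumption: newer builds behave like 325
    return "untested"  # builds 320-324 are not covered by this report
```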

yatin (yatinkarel) wrote :

To be more specific, [1][2] in kernel-5.14.0-325.el9 triggered this issue. pyroute2 0.6.10 contains the fix [3] for it. It was spotted in Fedora [4] and Kinetic [5] before. I tested a manual backport of [3] in the impacted environment and the issue is no longer seen. It seems we need to backport it in the pyroute2 package.

[1] net: rtnetlink: add bulk delete support flag (Ivan Vecera) [2193176]
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a6cec0bcd34264be8887791594be793b3f12719f
[3] https://github.com/svinota/pyroute2/commit/1eb08312de30a083bcfddfaa9c1d5e124b6368df
[4] https://bugzilla.redhat.com/show_bug.cgi?id=2094986
[5] https://bugs.launchpad.net/ubuntu/+source/pyroute2/+bug/1995469
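
A quick way to see whether an environment already has a pyroute2 release containing [3] is to compare the installed version against 0.6.10. This is a rough sketch with hypothetical helper names. Note that a distribution package carrying a backport of [3] may still report an older version, so a False result does not prove the fix is missing.

```python
import re
from importlib import metadata

FIXED_IN = (0, 6, 10)  # first upstream pyroute2 release containing the fix


def version_tuple(version: str) -> tuple:
    """Turn the first three numeric fields of a version into an int tuple."""
    parts = re.findall(r"\d+", version)[:3]
    parts += ["0"] * (3 - len(parts))  # pad short versions such as "0.6"
    return tuple(int(p) for p in parts)


def has_upstream_fix(version: str) -> bool:
    """True if the version is at least the first fixed upstream release."""
    return version_tuple(version) >= FIXED_IN


def installed_pyroute2_version():
    """Return the installed pyroute2 version string, or None if absent."""
    try:
        return metadata.version("pyroute2")
    except metadata.PackageNotFoundError:
        return None
```

For example, has_upstream_fix(installed_pyroute2_version() or "0") gives a first indication. Pre-release suffixes are ignored by this simple comparison.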

yatin (yatinkarel) wrote :

pyroute2 is being updated in RDO to include the backport: https://review.rdoproject.org/r/q/topic:bug%252F2023764

chandan kumar (chkumar246) wrote :

CentOS 9 wallaby was promoted on Jun 18, 2023. All the affected jobs are passing now. We can move this bug to Fix Released.

Changed in tripleo:
status: Triaged → Fix Released