Comment 2 for bug 1885286

wes hayutin (weshayutin) wrote :

[07:51:25] <zbr> mwhahaha: clarkb: hi! regarding centos 8.2/openvswitch issue -- what do you think we should do?
[07:51:51] <mwhahaha> need to reproduce it outside of ci :/
[07:52:06] <mwhahaha> i'm wondering if we can hit it with a reproducer of some sort
[07:52:33] <clarkb> zbr: the two ideas I had last week were to keep trying to catch a failed node with a hold, then hope a reboot makes the node usable enough to get logs and debug further. Or possibly spin up some test nodes in inap (where we seem to see it happen more, which could be due to cpu speeds tripping the race or similar) and try to reproduce it there
[07:54:36] <mwhahaha> my thought might be that it's https://bugzilla.redhat.com/show_bug.cgi?id=1757933 but we didn't see any traces on the console on friday
[07:54:38] <openstack> mwhahaha: Error: Error getting bugzilla.redhat.com bug #1757933: NotPermitted
[07:55:24] <mwhahaha> meh it's private. but there's a kernel deadlock issue with 8.2 that pops up with iptables
[07:56:13] <clarkb> do you have any non ovs jobs? that could be a way to narrow it down
[07:56:33] <mwhahaha> no
[07:57:03] <mwhahaha> the build jobs don't do ovs outside of whatever zuul does and I don't think they retry_limit in the same way
[08:00:58] <clarkb> mwhahaha: zbr: if there is a specific job that seems to hit this frequently we can set a hold with a limit of like 10 or something across all changes and projects and see if we get a crash held that way
[08:01:13] <mwhahaha> didn't look like it, seemed to be any
[08:01:18] <clarkb> I don't have enough context to know what the job may be but can set up the hold and then help cross check what we get