Increase in RETRY_LIMIT errors in zuul.openstack.org is preventing jobs from passing check/gate

Bug #1885286 reported by Ronelle Landy
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Ronelle Landy

Bug Description

We are tracking an increase in RETRY_LIMIT showing up in jobs running in check and gate in zuul.openstack.org.

The RETRY_LIMIT jobs fail out in pre - which often leave no logs. The errors have not been traced to any particular job or provider.

This is possibly related to the CentOS 8.2 update - looking at the time line: https://zuul.opendev.org/t/openstack/builds?result=RETRY_LIMIT&project=openstack%2Ftripleo-heat-templates

Ronelle Landy (rlandy)
Changed in tripleo:
milestone: none → victoria-1
assignee: nobody → Ronelle Landy (rlandy)
importance: Undecided → Critical
status: New → Triaged
tags: added: ci promotion-blocker
Ronelle Landy (rlandy)
description: updated
Revision history for this message
Ronelle Landy (rlandy) wrote :

Possibly we have two different failures but the latest set are failing about 30 min in.
Previously we were looking at failures around 10-15 mins.

Current speculation is container prep

Revision history for this message
wes hayutin (weshayutin) wrote :

[07:51:25] <zbr> mwhahaha: clarkb: hi! regarding centos 8.2/openvswitch issue -- what do you think we should we do?
[07:51:51] <mwhahaha> need to reproduce it outside of ci :/
[07:52:06] <mwhahaha> i'm wondering if we can hit it with a reproducer of some sort
[07:52:33] <clarkb> zbr: the two ideas I had last week were to continue to try and catch a failed node with a hold then hope that we catch one and a reboot makes the node useable enough to get logs and further debug. Or possibly spin up some test nodes in inap (where we seem to see it happen more which could be due to cpu speeds tripping the race or similar) and try to reproduce
[07:54:36] <mwhahaha> my thought might be that it's https://bugzilla.redhat.com/show_bug.cgi?id=1757933 but we didn't see any traces on the console on friday
[07:54:38] <openstack> mwhahaha: Error: Error getting bugzilla.redhat.com bug #1757933: NotPermitted
[07:55:24] <mwhahaha> meh it's private. but there' s a kernel deadlock issue with 8.2 that pops up with iptables
[07:56:13] <clarkb> do you have any non ovs jobs? that could be a way to narrow it down
[07:56:33] <mwhahaha> no
[07:57:03] <mwhahaha> the build jobs don't do ovs outside of whatever zuul does and I don't think they retry_limit in the same way
[08:00:58] <clarkb> mwhahaha: zbr: if there is a specific job that seems to hit this frequently we can set a hold with a limit of like 10 or something across all changes and projects and see if we get a crash held that way
[08:01:13] <mwhahaha> didn't look like it, seemed to be any
[08:01:18] <clarkb> I don't have enough context to know what the job may be but can set up the hold and then help cross check what we get

tags: added: alert
Revision history for this message
Alex Schultz (alex-schultz) wrote :
Download full text (7.9 KiB)

We got a node from infra. It shows that when we restart legacy networking, networkmanager is restarting ens3 and attempting to DHCP this interface. This is the primary interface for the node and is already configured statically by glean[0]. This causes the node to become unavailable and is only available on reboot. This appears to be an issue because there is no DHCP on the network that ens3 is on so the system just hangs without a network. On reboot, the static address is reconfigured and the node was available again.

[0] https://opendev.org/opendev/glean/src/branch/master/glean/cmd.py

Here is the messages block for the network restart

Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 systemd[1]: Starting LSB: Bring up/down networking...
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 network[48127]: WARN : [network] You are using 'network' service provided by 'network-scripts', which are now deprecated.
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 network[48140]: You are using 'network' service provided by 'network-scripts', which are now deprecated.
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 network[48127]: WARN : [network] 'network-scripts' will be removed in one of the next major releases of RHEL.
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 network[48141]: 'network-scripts' will be removed in one of the next major releases of RHEL.
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 network[48127]: WARN : [network] It is advised to switch to 'NetworkManager' instead for network management.
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 network[48142]: It is advised to switch to 'NetworkManager' instead for network management.
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593612925.4888] device (ens3): state change: activated -> deactivating (reason 'connection-removed', sys-iface-state: 'managed')
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593612925.4893] manager: NetworkManager state is now DISCONNECTING
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 dbus-daemon[738]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.3' (uid=0
pid=1023 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593612925.4912] audit: op="connections-reload" pid=48177 uid=0 result="success"
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 dbus-daemon[738]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 systemd[1]: Started Network Manager Script Dispatcher Service.
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593612925.5158] device (ens3): state change: deactivating -> disconnected (reason 'connection-removed', sys-iface-state: 'managed')
Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593612925.5205] settings: (ens3): cr...

Read more...

Revision history for this message
Alex Schultz (alex-schultz) wrote :
Download full text (12.4 KiB)

So from the logs it looks like network manager is losing the connection. On boot glean creates the network-scripts/ifcfg-ens3 config. When NetworkManager starts up, it creates a connection called 'System ens3'

Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 systemd[1]: Started Network Manager Script Dispatcher Service.
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2798] device (lo): carrier: link connected
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2800] manager: (lo): new Generic device (/org/freedesktop/NetworkManager/Devices/1)
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2812] manager: (ens3): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2822] device (ens3): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 kernel: IPv6: ADDRCONF(NETDEV_UP): ens3: link is not ready
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2873] device (ens3): carrier: link connected
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2934] device (ens3): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'managed')
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2970] policy: auto-activating connection 'System ens3' (21d47e65-8523-1a06-af22-6f121086f085)
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2978] device (ens3): Activation: starting connection 'System ens3' (21d47e65-8523-1a06-af22-6f121086f085)
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2979] device (ens3): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2983] manager: NetworkManager state is now CONNECTING
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2985] device (ens3): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.2990] device (ens3): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.3043] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.3083] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.3086] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Jul 1 13:49:29 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593611369.3091] manager:...

Revision history for this message
Alex Schultz (alex-schultz) wrote :

The behavior is the same as if you just rm -rf /etc/sysconfig/network-script/ifcfg-ens3 and systemctl restart network. Somewhere ifcfg-ens3 is going missing

Revision history for this message
Alex Schultz (alex-schultz) wrote :

It's os-net-config

[2020/07/01 08:13:29 PM] [ERROR] Unable to read mac address: ''
Traceback (most recent call last):
  File "/usr/bin/os-net-config", line 10, in <module>
    sys.exit(main())
  File "/usr/lib/python3.6/site-packages/os_net_config/cli.py", line 337, in main
    provider.add_object(obj)
  File "/usr/lib/python3.6/site-packages/os_net_config/__init__.py", line 64, in add_object
    self.add_bridge(obj)
  File "/usr/lib/python3.6/site-packages/os_net_config/impl_ifcfg.py", line 807, in add_bridge
    data = self._add_common(bridge)
  File "/usr/lib/python3.6/site-packages/os_net_config/impl_ifcfg.py", line 475, in _add_common
    mac = utils.interface_mac(base_opt.primary_interface_name)
  File "/usr/lib/python3.6/site-packages/os_net_config/utils.py", line 116, in interface_mac
    with open(_SYS_CLASS_NET + '/%s/address' % name, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: "/sys/class/net/''/address"

Revision history for this message
Alex Schultz (alex-schultz) wrote :
Download full text (6.3 KiB)

os-net-config switches all the interfaces to dhcp

root@centos-8-inap-mtl01-0017550599 scripts]# bash run_os_net_config.sh
+ '[' -n '{"network_config": [{"addresses": [{"ip_netmask": "192.168.24.1/24"}, {"ip_netmask": "192.168.24.1/32"}, {"ip_netmask": "192.168.24.1/32"}], "dns_servers": ["127.0.0.1", "1.1.1.1"], "domain": [], "members": [{"mtu": 1500, "name": "interface_name", "primary": true, "type": "interface"}], "mtu": 1500, "name": "br-ctlplane", "ovs_extra": ["br-set-external-id br-ctlplane bridge-id br-ctlplane"], "routes": [], "type": "ovs_bridge", "use_dhcp": false}]}' ']'
+ '[' -z '' ']'
+ trap configure_safe_defaults EXIT
++ date +%Y-%m-%dT%H:%M:%S
+ DATETIME=2020-07-01T20:13:27
+ '[' -f /etc/os-net-config/config.json ']'
+ mv /etc/os-net-config/config.json /etc/os-net-config/config.json.2020-07-01T20:13:27
+ mkdir -p /etc/os-net-config
+ echo '{"network_config": [{"addresses": [{"ip_netmask": "192.168.24.1/24"}, {"ip_netmask": "192.168.24.1/32"}, {"ip_netmask": "192.168.24.1/32"}], "dns_servers": ["127.0.0.1", "1.1.1.1"], "domain": [], "members": [{"mtu": 1500, "name": "interface_name", "primary": true, "type": "interface"}], "mtu": 1500, "name": "br-ctlplane", "ovs_extra": ["br-set-external-id br-ctlplane bridge-id br-ctlplane"], "routes": [], "type": "ovs_bridge", "use_dhcp": false}]}'
++ type -t network_config_hook
+ '[' '' = function ']'
+ sed -i 's/: "bridge_name/: "'\'''\''/g' /etc/os-net-config/config.json
+ sed -i 's/interface_name/'\'''\''/g' /etc/os-net-config/config.json
+ set +e
+ os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes
[2020/07/01 08:13:29 PM] [INFO] Using config file at: /etc/os-net-config/config.json
[2020/07/01 08:13:29 PM] [INFO] Ifcfg net config provider created.
[2020/07/01 08:13:29 PM] [INFO] Not using any mapping file.
[2020/07/01 08:13:29 PM] [INFO] Finding active nics
[2020/07/01 08:13:29 PM] [INFO] br-ctlplane is not an active nic
[2020/07/01 08:13:29 PM] [INFO] ens3 is an active nic
[2020/07/01 08:13:29 PM] [INFO] lo is not an active nic
[2020/07/01 08:13:29 PM] [INFO] br-ex is not an active nic
[2020/07/01 08:13:29 PM] [INFO] ovs-system is not an active nic
[2020/07/01 08:13:29 PM] [INFO] No DPDK mapping available in path (/var/lib/os-net-config/dpdk_mapping.yaml)
[2020/07/01 08:13:29 PM] [INFO] Active nics are ['ens3']
[2020/07/01 08:13:29 PM] [INFO] nic1 mapped to: ens3
[2020/07/01 08:13:29 PM] [INFO] adding bridge: br-ctlplane
[2020/07/01 08:13:29 PM] [ERROR] Unable to read mac address: ''
Traceback (most recent call last):
  File "/usr/bin/os-net-config", line 10, in <module>
    sys.exit(main())
  File "/usr/lib/python3.6/site-packages/os_net_config/cli.py", line 337, in main
    provider.add_object(obj)
  File "/usr/lib/python3.6/site-packages/os_net_config/__init__.py", line 64, in add_object
    self.add_bridge(obj)
  File "/usr/lib/python3.6/site-packages/os_net_config/impl_ifcfg.py", line 807, in add_bridge
    data = self._add_common(bridge)
  File "/usr/lib/python3.6/site-packages/os_net_config/impl_ifcfg.py", line 475, in _add_common
    mac = utils.interface_mac(base_opt.primary_interface_name)
  File "/usr/lib/python3.6/site-packages/os_net_config/utils.py...

Read more...

Revision history for this message
Alex Schultz (alex-schultz) wrote :
Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
milestone: victoria-3 → wallaby-1
Revision history for this message
Marios Andreou (marios-b) wrote :

marios|ruck: I am going to close this bug out.

As far as I know we aren't seeing this any more.

Move back to in progress and add some pointers and context if you have information to the contrary please thank you.

Changed in tripleo:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.