OVB on centos8 fails because of networking failures

Bug #1866202 reported by Sagi (Sergey) Shnaidman
This bug affects 1 person
Affects   Status         Importance   Assigned to    Milestone
tripleo   Fix Released   Critical     wes hayutin
Revision history for this message
Luke Short (ekultails) wrote :

I've been doing some troubleshooting in this bug, too: https://bugs.launchpad.net/tripleo/+bug/1866204

Revision history for this message
Luke Short (ekultails) wrote :

I think this boils down to EL 8 not supporting the legacy eth* interface naming. The logs below show the network service trying to bring up the ens3 device, but the activation error it reports refers to eth0. I do not see eth0 referenced in the ens3 configuration (or vice versa).

http://paste.openstack.org/show/790356/
https://logserver.rdoproject.org/66/25666/7/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/4fc1241/logs/overcloud-controller-0/var/log/journal.txt.gz
https://logserver.rdoproject.org/66/25666/7/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/4fc1241/logs/overcloud-controller-0/etc/sysconfig/network-scripts/
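
For anyone triaging similar jobs, the ens3/eth0 mismatch can be pulled out of the journal text mechanically. The following is only a rough sketch: it assumes the failure line has the same shape as the "Bringing up interface ... Error: Connection activation failed" line in the journal linked above, and the function name find_name_mismatches is made up for illustration.

```python
import re

# Matches the "network" service failure line as it appears in this job's journal.
# The exact wording is taken from that one log; other versions may differ.
FAILURE_RE = re.compile(
    r"Bringing up interface (?P<profile>\S+):\s+"
    r"Error: Connection activation failed: "
    r".*\(device (?P<device>\S+) not available"
)


def find_name_mismatches(journal_lines):
    """Return (profile, device) pairs where the profile being brought up
    and the device named in the error are different interfaces."""
    mismatches = []
    for line in journal_lines:
        match = FAILURE_RE.search(line)
        if match and match.group("profile") != match.group("device"):
            mismatches.append((match.group("profile"), match.group("device")))
    return mismatches
```

Feeding it the lines of the decompressed journal.txt should list every profile that failed against a differently named device; a successful "Bringing up interface eth0: [ OK ]" line is ignored.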

Revision history for this message
Alan Pevec (apevec) wrote :

> EL 8 not supporting the legacy eth* interface naming

How did it work in rhel8 jobs?
We've added net.ifnames=0 in c8 images to match what rhel8 images had:
https://softwarefactory-project.io/r/#/c/17697/
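
A quick way to confirm on a node that the argument actually reached the kernel is to look at the command line it booted with. This is a minimal sketch, not part of the linked change; the helper names are hypothetical, and it only assumes the standard /proc/cmdline file.

```python
from pathlib import Path


def predictable_names_disabled(cmdline: str) -> bool:
    """True if net.ifnames=0 is present as a standalone kernel argument."""
    return "net.ifnames=0" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the kernel command line of the running system."""
    return Path(path).read_text()
```

On a node, predictable_names_disabled(read_cmdline()) gives the answer for the running kernel. It only checks the kernel argument; it does not inspect anything under /etc/sysconfig/network-scripts.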

Revision history for this message
wes hayutin (weshayutin) wrote :

GRUB_CMDLINE_LINUX="console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto"

https://logserver.rdoproject.org/66/25666/7/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/4fc1241/logs/overcloud-controller-0/etc/sysconfig/grub.gz

Mar 05 16:58:01 localhost kernel: Command line: BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-147.5.1.el8_1.x86_64 root=UUID=e1f55697-6b08-41d8-a6df-6a6ce21dc875 ro console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto

device (eth0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')

Mar 05 16:58:08 overcloud-controller-0 kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Mar 05 16:58:08 overcloud-controller-0 NetworkManager[1285]: <info> [1583427488.7311] device (eth0): Activation: starting connection 'System eth0' (5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03)
Mar 05 16:58:08 overcloud-controller-0 NetworkManager[1285]: <info> [1583427488.7349] device (eth0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Mar 05 16:58:08 overcloud-controller-0 NetworkManager[1285]: <info> [1583427488.7363] manager: NetworkManager state is now CONNECTING
Mar 05 16:58:08 overcloud-controller-0 NetworkManager[1285]: <info> [1583427488.7454] device (eth0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
info> [1583427488.7922] policy: set 'System eth0' (eth0) as default for IPv4 routing and DNS
Mar 05 16:58:08 overcloud-controller-0 dbus-daemon[1182]: [system] Activating via systemd: service name='org.freedesktop.resolve1' unit='dbus-org.freedesktop.resolve1.service' requested by ':1.8' (uid=0 pid=1285 comm="/usr/sbin/NetworkManager --no-daemon " label="system_u:system_r:NetworkManager_t:s0")
Mar 05 16:58:08 overcloud-controller-0 NetworkManager[1285]: <info> [1583427488.7993] device (eth0): Activation: successful, device activated.
reason="No suitable device found for this connection (device eth0 not available because profile is not compatible with device (mismatching interface name))."
Mar 05 16:58:39 overcloud-controller-0 network[1489]: Bringing up interface ens3: Error: Connection activation failed: No suitable device found for this connection (device eth0 not available because profile is not compatible with device (mismatching interface name)).
Mar 05 16:58:39 overcloud-controller-0 network[1489]: [FAILED]
Mar 05 16:58:39 overcloud-controller-0 NetworkManager[1285]: <info> [1583427519.0678] audit: op="connections-load" args="/etc/sysconfig/network-scripts/ifcfg-eth0" pid=1671 uid=0 result="success"
Mar 05 16:58:39 overcloud-controller-0 network[1489]: Bringing up interface eth0: [ OK ]

Mar 05 16:58:39 overcloud-controller-0 cloud-init[1714]: Cloud-init v. 18.5 running 'init' at Thu, 05 Mar 2020 16:58:39 +0000. Up 39.98 seconds.
Mar 05 16:58:39 overcloud-controller-0 cloud-init[1714]: ci-info: +++++++++++++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++++++++++++
Mar 05 16:58:39 overcloud-controller-0 cloud-init[1714]: ci-info: +--------+------+------------------------------+---------------+--------+-------------------+
Mar 05 16:58:39 overcloud-co...


Revision history for this message
wes hayutin (weshayutin) wrote :

I see the following errors around the time the NIC (eth1) is added to br-ex.

## CENTOS-8 ##

Mar 05 16:58:08 overcloud-controller-0 ovs-vsctl[1378]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=8.0.0

https://logserver.rdoproject.org/66/25666/7/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/4fc1241/logs/overcloud-controller-0/var/log/journal.txt.gz

Mar 05 17:12:00 overcloud-controller-0 NetworkManager[1285]: <info> [1583428320.4203] audit: op="connections-load" args="/etc/sysconfig/network-scripts/ifcfg-eth1" pid=10569 uid=0 result="success"
Mar 05 17:12:00 overcloud-controller-0 NetworkManager[1285]: <info> [1583428320.4446] audit: op="connections-load" args="/etc/sysconfig/network-scripts/ifcfg-eth1" pid=10580 uid=0 result="success"
Mar 05 17:12:00 overcloud-controller-0 NetworkManager[1285]: <info> [1583428320.4795] audit: op="connections-load" args="/etc/sysconfig/network-scripts/ifcfg-eth1" pid=10596 uid=0 result="success"
Mar 05 17:12:00 overcloud-controller-0 network[10217]: net_mlx5: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
Mar 05 17:12:00 overcloud-controller-0 network[10217]: net_mlx5: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx5)
Mar 05 17:12:00 overcloud-controller-0 network[10217]: PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
Mar 05 17:12:00 overcloud-controller-0 network[10217]: PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)
Mar 05 17:12:00 overcloud-controller-0 ovs-vsctl[10613]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br-ex eth1 -- add-port br-ex eth1

## RHEL 8 ##

Feb 25 02:15:45 overcloud-controller-0 ovs-vsctl[1360]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=7.16.1

Feb 25 02:35:50 overcloud-controller-0 NetworkManager[1284]: <info> [1582598150.8678] audit: op="connections-load" args="/etc/sysconfig/network-scripts/ifcfg-eth1" pid=6646 uid=0 result="success"
Feb 25 02:35:50 overcloud-controller-0 NetworkManager[1284]: <info> [1582598150.8828] audit: op="connections-load" args="/etc/sysconfig/network-scripts/ifcfg-eth1" pid=6654 uid=0 result="success"
Feb 25 02:35:50 overcloud-controller-0 NetworkManager[1284]: <info> [1582598150.9095] audit: op="connections-load" args="/etc/sysconfig/network-scripts/ifcfg-eth1" pid=6665 uid=0 result="success"
Feb 25 02:35:50 overcloud-controller-0 NetworkManager[1284]: <info> [1582598150.9284] audit: op="connections-load" args="/etc/sysconfig/network-scripts/ifcfg-eth1" pid=6676 uid=0 result="success"
Feb 25 02:35:50 overcloud-controller-0 NetworkManager[1284]: <info> [1582598150.9605] audit: op="connections-load" args="/etc/sysconfig/network-scripts/ifcfg-eth1" pid=6692 uid=0 result="success"
Feb 25 02:35:50 overcloud-controller-0 ovs-vsctl[6709]: ovs|00001|vsctl|INFO|Called as ovs-vsctl -t 10 -- --if-exists del-port br-ex eth1 -- add-port br-ex eth1

Revision history for this message
wes hayutin (weshayutin) wrote :

centos-8 openvswitch-2.12.0-1.el8.x86_64

rhel-8 openvswitch2.11-2.11.0-0.20190129gitd3a10db.el8fdb.x86_64

Not sure whether that makes a difference.

This is pretty noisy though
https://logserver.rdoproject.org/66/25666/7/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/4fc1241/logs/overcloud-controller-0/var/log/extra/network-bridges.gz

wes hayutin (weshayutin)
tags: added: promotion-blocker
Revision history for this message
yatin (yatinkarel) wrote :

Some updates: the cause turned out to be a stale ifcfg-ens3 file in network-scripts on the overcloud nodes. To confirm this, we cleaned it up before running NetworkConfig in a test patch (https://review.opendev.org/#/c/711755/); the deployment then moved forward and we saw a green run for CentOS8. Test results can be seen in https://review.rdoproject.org/r/#/c/25332/.

Green runs:-
https://logserver.rdoproject.org/32/25332/53/check/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset001-master/820ce3c/
https://logserver.rdoproject.org/32/25332/53/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/d02fa66/

Fix for the ens3 issue:-
ens3 can be cleaned up in the base image used to prepare overcloud-full or in some disk image element.
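
To illustrate what "cleaning up" would have to detect, here is a minimal sketch that lists ifcfg files whose device does not exist on the system. It is not the fix that was eventually merged; the function names are hypothetical, and it assumes the files use a plain DEVICE=<name> line as in the network-scripts directory linked earlier.

```python
from pathlib import Path


def ifcfg_device(path):
    """Return the DEVICE= value from an ifcfg file, or None if it has none."""
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if line.startswith("DEVICE="):
            return line.split("=", 1)[1].strip().strip("\"'")
    return None


def find_stale_ifcfg(scripts_dir, present_ifaces):
    """List ifcfg-* files whose device is not among present_ifaces.

    Falls back to the file name suffix when no DEVICE= line is found.
    The loopback file is never reported.
    """
    stale = []
    for path in sorted(Path(scripts_dir).glob("ifcfg-*")):
        device = ifcfg_device(path) or path.name[len("ifcfg-"):]
        if device != "lo" and device not in present_ifaces:
            stale.append(path)
    return stale
```

On a real node, present_ifaces could be taken from the entries of /sys/class/net. The sketch only reports candidates; whether to delete them is left to the caller.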

Though we saw two green runs for c8 ovb, a couple of random issues were noticed across multiple runs:-

1) Mar 07 17:13:27 overcloud-novacompute-0 network[10026]: Bringing up interface eth4: Error: Connection activation failed: IP configuration could not be reserved (no available address, timeout, etc.)
Mar 07 17:13:27 overcloud-novacompute-0 network[10026]: Hint: use 'journalctl -xe NM_CONNECTION=84d43311-57c8-8986-f205-9c78cd6ef5d2 + NM_DEVICE=eth4' to get more details.
Mar 07 17:13:27 overcloud-novacompute-0 network[10026]: [FAILED]
Logs:- https://logserver.rdoproject.org/32/25332/52/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/7d36dbc/logs/overcloud-novacompute-0/var/log/journal.txt.gz

2) Baremetal nodes go to the deploy_failed state randomly:-
Logs:-
https://logserver.rdoproject.org/32/25332/53/check/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset001-master-test1/2edc8af/logs/undercloud/var/log/extra/baremetal_list.txt.gz

3) Tempest failures:-
Logs:- https://logserver.rdoproject.org/32/25332/53/check/periodic-tripleo-ci-centos-8-ovb-1ctlr_1comp-featureset001-master-test2/0c36de4/logs/undercloud/var/log/tempest/stestr_results.html.gz

4) Baremetal node provisioning failed due to the qemu scientific notation bug:- https://bugs.launchpad.net/oslo.utils/+bug/1864529
This is already fixed in oslo.utils, but the required patch is not yet available in current-tripleo. I did not understand why it failed only randomly with the same overcloud-full image.

5) Overcloud deployment just gets stuck
Logs did not get collected for this, but I noticed the same thing in another undercloud job, where the VM console log was collected and it turned out to be a kernel panic: http://paste.openstack.org/show/790477/. The same could be true here. The kernel panic logs show "atop Tainted"; I don't know whether the atop binary from el7 can cause that on CentOS8, but that is being fixed at https://review.opendev.org/711894.
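
For item 4 above, the general shape of the problem is a size parser that does not expect an exponent. The sketch below is an illustration only: the linked oslo.utils bug is not quoted in this report, so the input format ("1e+03 MiB") is an assumption, and this is not the oslo.utils patch.

```python
import re

# Binary unit multipliers; the set of units is assumed for illustration.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

# Accepts plain numbers ("512"), decimals ("1.5") and exponents ("1e+03").
_SIZE_RE = re.compile(
    r"^\s*(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*(B|KiB|MiB|GiB|TiB)\s*$"
)


def parse_size(text):
    """Convert a human-readable size string to a whole number of bytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])
```

A pattern without the optional exponent group would reject "1e+03 MiB" outright, which is the kind of failure that only shows up once a reported size crosses into exponent notation.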

I think all these random issues need to be diagnosed and fixed separately, as each needs different expertise.

Revision history for this message
yatin (yatinkarel) wrote :

<< Fix for the ens3 issue:-
<< ens3 can be cleaned up in the base image used to prepare overcloud-full or in some disk image element.
Reported a bug against CentOS: https://bugs.centos.org/view.php?id=17133. This may take some time, so in the meantime we can customize that image on our side and use it to clear this issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-image-elements (master)

Fix proposed to branch: master
Review: https://review.opendev.org/712487

Changed in tripleo:
assignee: nobody → Alex Schultz (alex-schultz)
status: Triaged → In Progress
Changed in tripleo:
assignee: Alex Schultz (alex-schultz) → wes hayutin (weshayutin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-puppet-elements (master)

Fix proposed to branch: master
Review: https://review.opendev.org/712521

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-image-elements (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/712487
Reason: This doesn't actually execute because it's a dependency on the os-net-config element but we don't use that element. I'm actually kinda concerned that this element isn't executed at all. Anyway https://review.opendev.org/#/c/712521/ will handle this

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/712571

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-image-elements (master)

Reviewed: https://review.opendev.org/712487
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=d641e019b90f9318a38d67b3910cb4e4b1c09ec0
Submitter: Zuul
Branch: master

commit d641e019b90f9318a38d67b3910cb4e4b1c09ec0
Author: Alex Schultz <email address hidden>
Date: Wed Mar 11 10:00:44 2020 -0600

    Cleanup stale interface if exists

    There is a bug in the CentOS 8 image where the ens3 interface file
    exists. We should clean that up if it exists to prevent issues when
    booting. We manage the interfaces later with os-net-config so we don't
    want them to exist in the image.

    Change-Id: I95d851f194a524caad639188e7df8f041fa2a248
    Closes-Bug: #1866202

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/712571
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=2589b8f22ad2d263bcecbb8278e73f0822e7eba3
Submitter: Zuul
Branch: master

commit 2589b8f22ad2d263bcecbb8278e73f0822e7eba3
Author: Alex Schultz <email address hidden>
Date: Wed Mar 11 15:20:22 2020 -0600

    Add interface-names to centos8 images

    Back in Bug #1841441, we disabled the net.ifnames because of the
    RHEL7->RHEL8 changes to interface names. Now that we have centos8, we
    need to ensure this action is also run on those images.

    Depends-On: https://review.opendev.org/#/c/712487/
    Change-Id: Ice40fec0eacefd9778614996fb3417b78cdd17d3
    Related-Bug: #1866202
    Related-Bug: #1841441

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-image-elements (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/712947

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/712949

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-image-elements (stable/train)

Reviewed: https://review.opendev.org/712947
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=8c91b4651e9dff695866da970961e3a9b352433e
Submitter: Zuul
Branch: stable/train

commit 8c91b4651e9dff695866da970961e3a9b352433e
Author: Alex Schultz <email address hidden>
Date: Wed Mar 11 10:00:44 2020 -0600

    Cleanup stale interface if exists

    There is a bug in the CentOS 8 image where the ens3 interface file
    exists. We should clean that up if it exists to prevent issues when
    booting. We manage the interfaces later with os-net-config so we don't
    want them to exist in the image.

    Change-Id: I95d851f194a524caad639188e7df8f041fa2a248
    Closes-Bug: #1866202
    (cherry picked from commit d641e019b90f9318a38d67b3910cb4e4b1c09ec0)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/train)

Reviewed: https://review.opendev.org/712949
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=e64cccf5e6cb6bd923efbc6b82eca95cf0b9a119
Submitter: Zuul
Branch: stable/train

commit e64cccf5e6cb6bd923efbc6b82eca95cf0b9a119
Author: Alex Schultz <email address hidden>
Date: Wed Mar 11 15:20:22 2020 -0600

    Add interface-names to centos8 images

    Back in Bug #1841441, we disabled the net.ifnames because of the
    RHEL7->RHEL8 changes to interface names. Now that we have centos8, we
    need to ensure this action is also run on those images.

    Depends-On: https://review.opendev.org/#/c/712947/
    Change-Id: Ice40fec0eacefd9778614996fb3417b78cdd17d3
    Related-Bug: #1866202
    Related-Bug: #1841441
    (cherry picked from commit 2589b8f22ad2d263bcecbb8278e73f0822e7eba3)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-puppet-elements (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/712521
Reason: https://review.opendev.org/#/c/712571/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-image-elements 11.0.2

This issue was fixed in the openstack/tripleo-image-elements 11.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/728405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by Sagi Shnaidman (<email address hidden>) on branch: master
Review: https://review.opendev.org/728405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-image-elements 10.6.2

This issue was fixed in the openstack/tripleo-image-elements 10.6.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/728405
Committed: https://opendev.org/openstack/tripleo-quickstart-extras/commit/15b7b6ce5d244d876b78b84677d10448bd86825a
Submitter: "Zuul (22348)"
Branch: master

commit 15b7b6ce5d244d876b78b84677d10448bd86825a
Author: yatinkarel <email address hidden>
Date: Fri May 15 15:18:48 2020 +0530

    Install atop in CentOS8 from RDO CentOS8 repo

    Currently atop in CentOS8 jobs is installed from
    Epel 7 repo which is wrong. Since atop is available
    in RDO build deps repo let's use that when running
    on CentOS8[1].

    Also seen kernel panic related to atop as described
    in related bug.

    [1] https://review.rdoproject.org/r/#/q/topic:rdo-centos8-remove-epel

    Related-Bug: #1866202
    Change-Id: I4f605615fb1bdc720194244cb43be14648033271

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/798412
Committed: https://opendev.org/openstack/tripleo-quickstart-extras/commit/89b425b81b23a50ba7977018450d7e2825711d12
Submitter: "Zuul (22348)"
Branch: master

commit 89b425b81b23a50ba7977018450d7e2825711d12
Author: yatinkarel <email address hidden>
Date: Tue Jun 29 10:18:33 2021 +0530

    Move atop installation after repo setup

    [1] switched atop installation from RDO repos but
    repos are getting setup later, so let's move atop
    installation after repo setup.

    Related-Bug: #1866202
    Change-Id: I378e58eacd17d96c88352f06d1d42a1df765558b
