config failing with unable to find interface (Mellanox family interface not visible in linux)

Bug #1860347 reported by Wendy Mitchell
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Jim Somerville

Bug Description

Brief Description
-----------------
Configuration of labs with Mellanox interfaces fails. The interfaces are not visible in the Linux output (ifconfig -a).

Severity
--------
Major

Steps to Reproduce
------------------
The Mellanox interfaces are not visible in the Linux output, although the hardware does appear to be present.

This prevents install/configuration from completing successfully.
During the configure controller step on these labs, errors are reported indicating that the interface ens785f0 cannot be found, e.g.:

{"log":"E0117 20:18:30.697446 1 common.go:226] controller/host \"msg\"=\"user data error\" \"error\"=\"unable to find interface UUID for port: ens785f0\" \"request\"={\"Namespace\":\"deployment\",\"Name\":\"controller-0\"}\n","stream":"stderr","time":"2020-01-17T20:18:30.704588677Z"}
{"log":"E0117 20:19:31.405278 1 common.go:226] controller/host \"msg\"=\"user data error\" \"error\"=\"unable to find interface UUID for port: ens785f0\" \"request\"={\"Namespace\":\"deployment\",\"Name\":\"controller-0\"}\n","stream":"stderr","time":"2020-01-17T20:19:31.405496902Z"}
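
When many such entries accumulate, it can help to pull the affected port names out of the log automatically. The following is a minimal sketch, assuming each log entry is a single-line JSON object with a "log" field as in the error above; the function name `missing_ports` and the `SAMPLE` constant are illustrative only and are not part of any StarlingX tool.

```python
import json
import re

# Sample entry modeled on the error above, joined onto one line.
SAMPLE = r'''{"log":"E0117 20:18:30.697446 1 common.go:226] controller/host \"msg\"=\"user data error\" \"error\"=\"unable to find interface UUID for port: ens785f0\" \"request\"={\"Namespace\":\"deployment\",\"Name\":\"controller-0\"}\n","stream":"stderr","time":"2020-01-17T20:18:30.704588677Z"}'''

# Captures the port name that follows the error text, up to the closing quote.
PORT_RE = re.compile(r'unable to find interface UUID for port: (\S+?)"')


def missing_ports(lines):
    """Return the set of port names reported in 'unable to find interface' errors."""
    ports = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON log entries
        if not isinstance(entry, dict):
            continue
        match = PORT_RE.search(entry.get("log", ""))
        if match:
            ports.add(match.group(1))
    return ports


if __name__ == "__main__":
    print(missing_ports([SAMPLE]))
```

Returning a set collapses the repeated retries (the same error is logged roughly once a minute) into one entry per port.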

The Mellanox interfaces are not visible. Only these 2 interfaces are reported in the Linux output.

$ system host-port-list controller-0
+--------------------------------------+----------+----------+--------------+--------+-----------+-------------+----------------------------------------+
| uuid | name | type | pci address | device | processor | accelerated | device type |
+--------------------------------------+----------+----------+--------------+--------+-----------+-------------+----------------------------------------+
| 2b181f2c-f742-41f4-b718-c9d62da8067e | enp3s0f0 | ethernet | 0000:03:00.0 | 0 | 0 | True | I350 Gigabit Network Connection [1521] |
| f02d98df-86dc-4386-9707-b12aa7dff50a | enp3s0f3 | ethernet | 0000:03:00.3 | 0 | 0 | True | I350 Gigabit Network Connection [1521] |
+--------------------------------------+----------+----------+--------------+--------+-----------+-------------+----------------------------------------+

controller-0:~$ ifconfig -a
cali675b11b2b81: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet6 fe80::ecee:eeff:feee:eeee prefixlen 64 scopeid 0x20<link>
        ether ee:ee:ee:ee:ee:ee txqueuelen 0 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

cali8c7155b8c25: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet6 fe80::ecee:eeff:feee:eeee prefixlen 64 scopeid 0x20<link>
        ether ee:ee:ee:ee:ee:ee txqueuelen 0 (Ethernet)
        RX packets 40971661 bytes 14589139995 (13.5 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 40971661 bytes 14589139995 (13.5 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

calid2619c1ab5f: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet6 fe80::ecee:eeff:feee:eeee prefixlen 64 scopeid 0x20<link>
        ether ee:ee:ee:ee:ee:ee txqueuelen 0 (Ethernet)
        RX packets 40971661 bytes 14589139995 (13.5 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 40971661 bytes 14589139995 (13.5 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
        ether 02:42:70:69:02:fc txqueuelen 0 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

enp3s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 128.224.151.80 netmask 255.255.254.0 broadcast 128.224.151.255
        inet6 2620:10a:a001:a103:a6bf:1ff:fe00:690 prefixlen 64 scopeid 0x0<global>
        inet6 fe80::a6bf:1ff:fe00:690 prefixlen 64 scopeid 0x20<link>
        ether a4:bf:01:00:06:90 txqueuelen 1000 (Ethernet)
        RX packets 11602842 bytes 3055925470 (2.8 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 423747 bytes 35184409 (33.5 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device memory 0x91920000-9193ffff

enp3s0f3: flags=4098<BROADCAST,MULTICAST> mtu 1500
        ether a4:bf:01:00:06:91 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
        device memory 0x91900000-9191ffff

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 40971661 bytes 14589139995 (13.5 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 40971661 bytes 14589139995 (13.5 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo:1: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 192.168.204.2 netmask 255.255.255.0
        loop txqueuelen 1000 (Local Loopback)

lo:5: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 192.168.206.2 netmask 255.255.255.0
        loop txqueuelen 1000 (Local Loopback)

tunl0: flags=193<UP,RUNNING,NOARP> mtu 1440
        inet 172.16.192.64 netmask 255.255.255.255
        tunnel txqueuelen 1000 (IPIP Tunnel)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Expected Behavior
------------------
The lab install/configuration is expected to succeed, as it did with the previous load (2019-12-13_19-03-42).

Actual Behavior
----------------
Configuration fails because the expected interface is not visible in Linux.

Reproducibility
---------------
yes

System Configuration
--------------------
Any system with Mellanox NICs
Lab-name:
wcp_61-62
wcp_63-66

Branch/Pull Time/Commit
-----------------------
master as of 20200117T023005Z
(also seen with a master load from Jan 10)

Last Pass
---------
The same configuration for lab wcp 63-66 worked on the following build
2019-12-13_19-03-42

Note: The output for the December load (for lab wcp63-66) was as follows.
The two Mellanox ports are missing on the latest master load.
[sysadmin@controller-0 ~(keystone_admin)]$ system host-port-list controller-0
+--------------------------------------+----------+----------+--------------+--------+-----------+-------------+----------------------------------------+
| uuid | name | type | pci address | device | processor | accelerated | device type |
+--------------------------------------+----------+----------+--------------+--------+-----------+-------------+----------------------------------------+
| 6f857eaa-f2ca-4d38-9849-c0ca37d8184f | eno1 | ethernet | 0000:03:00.0 | 0 | 0 | False | I350 Gigabit Network Connection [1521] |
| bf93c6d6-4127-44eb-bce8-460327d43082 | eno2 | ethernet | 0000:03:00.3 | 0 | 0 | False | I350 Gigabit Network Connection [1521] |
| 12bc274e-3248-458b-ac6a-f051e7f8201a | ens785f0 | ethernet | 0000:05:00.0 | 0 | 0 | False | MT27710 Family [ConnectX-4 Lx] [1015] |
| 0c72b5e3-e8b2-4abb-bb2b-fe15e34e4d0b | ens785f1 | ethernet | 0000:05:00.1 | 0 | 0 | False | MT27710 Family [ConnectX-4 Lx] [1015] |
+--------------------------------------+----------+----------+--------------+--------+-----------+-------------+----------------------------------------+
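
A quick way to compare two such listings is to check which ports report a Mellanox device type. The sketch below assumes the pipe-delimited layout shown by `system host-port-list` in this report and uses the string "ConnectX" as the marker; the function names and the two sample constants are illustrative, with the rows copied from the outputs in this report.

```python
DECEMBER = """
| uuid | name | type | pci address | device | processor | accelerated | device type |
| 6f857eaa-f2ca-4d38-9849-c0ca37d8184f | eno1 | ethernet | 0000:03:00.0 | 0 | 0 | False | I350 Gigabit Network Connection [1521] |
| bf93c6d6-4127-44eb-bce8-460327d43082 | eno2 | ethernet | 0000:03:00.3 | 0 | 0 | False | I350 Gigabit Network Connection [1521] |
| 12bc274e-3248-458b-ac6a-f051e7f8201a | ens785f0 | ethernet | 0000:05:00.0 | 0 | 0 | False | MT27710 Family [ConnectX-4 Lx] [1015] |
| 0c72b5e3-e8b2-4abb-bb2b-fe15e34e4d0b | ens785f1 | ethernet | 0000:05:00.1 | 0 | 0 | False | MT27710 Family [ConnectX-4 Lx] [1015] |
"""

JANUARY = """
| uuid | name | type | pci address | device | processor | accelerated | device type |
| 2b181f2c-f742-41f4-b718-c9d62da8067e | enp3s0f0 | ethernet | 0000:03:00.0 | 0 | 0 | True | I350 Gigabit Network Connection |
| | | | | | | | [1521] |
| f02d98df-86dc-4386-9707-b12aa7dff50a | enp3s0f3 | ethernet | 0000:03:00.3 | 0 | 0 | True | I350 Gigabit Network Connection |
| | | | | | | | [1521]
"""


def parse_ports(table_text):
    """Parse pipe-delimited port-list output into a list of row dicts."""
    header, rows = None, []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, prompts, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first pipe-delimited line is the header
        elif len(cells) == len(header) and cells[header.index("name")]:
            # Rows with an empty name are wrapped continuation lines; skip them.
            rows.append(dict(zip(header, cells)))
    return rows


def mellanox_ports(table_text, marker="ConnectX"):
    """Return the names of ports whose device type contains the marker."""
    return [row["name"] for row in parse_ports(table_text)
            if marker in row["device type"]]


if __name__ == "__main__":
    print("December:", mellanox_ports(DECEMBER))
    print("January: ", mellanox_ports(JANUARY))
```

Matching on the device type rather than the port name avoids depending on interface naming, which differs between the two listings (eno1 versus enp3s0f0 for the same PCI address).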

Timestamp/Logs
--------------

Test Activity
-------------
Install/configuration

Wendy Mitchell (wmitchellwr) wrote :
tags: added: stx.retestneeded
Ghada Khalil (gkhalil) wrote :

Based on the load info, this may be related to the kernel upversion that merged on Jan 2:
https://review.opendev.org/#/c/695355/

Perhaps there is a compatibility issue with the Mellanox drivers.

Assigning to Jim Somerville to investigate since he has access to the WR labs that include the mlx NICs.

description: updated
Changed in starlingx:
importance: Undecided → High
importance: High → Undecided
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Jim Somerville (jsomervi)
description: updated
description: updated
summary: - Config controller failing with unable to find interface (Mellanox family
- interface not visible in linux)
+ config failing with unable to find interface (Mellanox family interface
+ not visible in linux)
Jim Somerville (jsomervi) wrote :

From dmesg:

[ 15.807582] compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git
[ 15.851641] mlx5_core: Unknown symbol page_pool_destroy (err 0)
[ 15.851881] mlx5_core: Unknown symbol page_pool_create (err 0)
[ 15.851950] mlx5_core: Unknown symbol page_pool_alloc_pages (err 0)
[ 15.851992] mlx5_core: Unknown symbol __page_pool_put_page (err 0)
[ 15.863838] mlx5_core: Unknown symbol page_pool_destroy (err 0)
[ 15.864080] mlx5_core: Unknown symbol page_pool_create (err 0)
[ 15.864152] mlx5_core: Unknown symbol page_pool_alloc_pages (err 0)
[ 15.864193] mlx5_core: Unknown symbol __page_pool_put_page (err 0)
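
To summarize which symbols a module failed to resolve, the dmesg output can be grouped per module. This is a minimal sketch, assuming the "module: Unknown symbol name (err N)" line format shown above; the function name `unresolved_symbols` and the `DMESG` sample constant are illustrative.

```python
import re
from collections import defaultdict

# Sample lines copied from the dmesg excerpt above.
DMESG = """\
[ 15.807582] compat.git: mlnx_ofed/mlnx-ofa_kernel-4.0.git
[ 15.851641] mlx5_core: Unknown symbol page_pool_destroy (err 0)
[ 15.851881] mlx5_core: Unknown symbol page_pool_create (err 0)
[ 15.851950] mlx5_core: Unknown symbol page_pool_alloc_pages (err 0)
[ 15.851992] mlx5_core: Unknown symbol __page_pool_put_page (err 0)
[ 15.863838] mlx5_core: Unknown symbol page_pool_destroy (err 0)
"""

# Matches "[ <timestamp>] <module>: Unknown symbol <symbol>".
UNKNOWN_SYMBOL_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<module>\w+): Unknown symbol (?P<symbol>\w+)"
)


def unresolved_symbols(dmesg_text):
    """Map each module name to the sorted list of symbols it could not resolve."""
    found = defaultdict(set)
    for line in dmesg_text.splitlines():
        match = UNKNOWN_SYMBOL_RE.match(line.strip())
        if match:
            found[match["module"]].add(match["symbol"])
    return {module: sorted(symbols) for module, symbols in found.items()}


if __name__ == "__main__":
    for module, symbols in unresolved_symbols(DMESG).items():
        print(module, "->", ", ".join(symbols))
```

Here every unresolved symbol belongs to the page pool API, which points at the kernel configuration rather than at the driver binary itself.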

Ghada Khalil (gkhalil) wrote :

Marking for stx.4.0, 3.0 & 2.0 / high priority -- given it is suspected this was introduced by the kernel upversion, which was cherry-picked to the previous releases.

tags: added: stx.2.0 stx.3.0 stx.4.0 stx.distro.other
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Jim Somerville (jsomervi) wrote :

The out-of-tree Mellanox driver detects the presence of page pool support in the new kernel, and thus wants to use it. However, page pools are not enabled by default in the new kernel (CONFIG_PAGE_POOL), and the config option is also hidden, i.e. it is not user selectable. The built-in Mellanox driver selects it, but we don't use the built-in driver.

My current approach is to patch the kernel to make it user selectable, and then change our config fragment to select it. The other possible approach is to just make it default y in the kernel, though I prefer the first approach as it clearly calls out the change to the option.
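
Whether a given kernel build has page pools enabled can be checked from its config file. The following is a minimal sketch, assuming the standard kernel .config line formats ("CONFIG_X=y" and "# CONFIG_X is not set"); the function name `config_value` and both sample fragments are hypothetical and are not taken from the StarlingX kernel config.

```python
def config_value(config_text, option):
    """Return the value of a kernel config option, or None if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None  # option does not appear in the config at all


# Hypothetical fragments for illustration only.
WITHOUT_PAGE_POOL = """\
CONFIG_NET=y
# CONFIG_PAGE_POOL is not set
"""

WITH_PAGE_POOL = """\
CONFIG_NET=y
CONFIG_PAGE_POOL=y
"""

if __name__ == "__main__":
    for label, text in (("without", WITHOUT_PAGE_POOL), ("with", WITH_PAGE_POOL)):
        print(label, "->", config_value(text, "CONFIG_PAGE_POOL"))
```

On a running system the text would typically be read from a file such as /boot/config-<kernel release>; that path is an assumption and may differ by distribution.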

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/703514

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/704180

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/704180
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=424ba94a9aa9e64fe1c0d2099b63e7d979b492cd
Submitter: Zuul
Branch: master

commit 424ba94a9aa9e64fe1c0d2099b63e7d979b492cd
Author: Jim Somerville <email address hidden>
Date: Fri Jan 24 12:36:06 2020 -0500

    Mellanox Driver: Disable use of kernel page pool functionality

    Problem: The out-of-tree Mellanox driver detects the presence
    of page pool support in the new kernel, and thus wants to use it.
    However, page pools are not configured to be on in the new kernel
    by default (CONFIG_PAGE_POOL), and not only that, the config
    option is hidden ie. it is not user selectable. The built-in
    Mellanox driver selects it, but we don't use the built-in driver.
    The out-of-tree driver does compile but not all pieces of it
    will load properly, specifically the mlx5 pieces which rely on
    page pool functionality being enabled in the kernel.

    Solution: Simply disable kernel page pool use in the
    out-of-tree Mellanox driver, making it work the same way as
    it did with the older kernel.

    Change-Id: If7e7155867d539352fcd0ea3acd5a17dd9d9579f
    Closes-Bug: 1860347
    Signed-off-by: Jim Somerville <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on integ (master)

Change abandoned by Jim Somerville (<email address hidden>) on branch: master
Review: https://review.opendev.org/703514
Reason: We've gone with the other solution which keeps the out-of-tree mellanox driver away from using page pools.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/705066

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/705112

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (r/stx.3.0)

Reviewed: https://review.opendev.org/705112
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=3b9601b70a07f894a8d1cd3906e917a3871c170e
Submitter: Zuul
Branch: r/stx.3.0

commit 3b9601b70a07f894a8d1cd3906e917a3871c170e
Author: Jim Somerville <email address hidden>
Date: Fri Jan 24 12:36:06 2020 -0500

    Mellanox Driver: Disable use of kernel page pool functionality

    Problem: The out-of-tree Mellanox driver detects the presence
    of page pool support in the new kernel, and thus wants to use it.
    However, page pools are not configured to be on in the new kernel
    by default (CONFIG_PAGE_POOL), and not only that, the config
    option is hidden ie. it is not user selectable. The built-in
    Mellanox driver selects it, but we don't use the built-in driver.
    The out-of-tree driver does compile but not all pieces of it
    will load properly, specifically the mlx5 pieces which rely on
    page pool functionality being enabled in the kernel.

    Solution: Simply disable kernel page pool use in the
    out-of-tree Mellanox driver, making it work the same way as
    it did with the older kernel.

    Change-Id: If7e7155867d539352fcd0ea3acd5a17dd9d9579f
    Closes-Bug: 1860347
    Signed-off-by: Jim Somerville <email address hidden>
    (cherry picked from commit 424ba94a9aa9e64fe1c0d2099b63e7d979b492cd)

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (r/stx.2.0)

Reviewed: https://review.opendev.org/705066
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=7e27429ba3703ef02e21947810c1822f95b6da59
Submitter: Zuul
Branch: r/stx.2.0

commit 7e27429ba3703ef02e21947810c1822f95b6da59
Author: Jim Somerville <email address hidden>
Date: Fri Jan 24 12:36:06 2020 -0500

    Mellanox Driver: Disable use of kernel page pool functionality

    Problem: The out-of-tree Mellanox driver detects the presence
    of page pool support in the new kernel, and thus wants to use it.
    However, page pools are not configured to be on in the new kernel
    by default (CONFIG_PAGE_POOL), and not only that, the config
    option is hidden ie. it is not user selectable. The built-in
    Mellanox driver selects it, but we don't use the built-in driver.
    The out-of-tree driver does compile but not all pieces of it
    will load properly, specifically the mlx5 pieces which rely on
    page pool functionality being enabled in the kernel.

    Solution: Simply disable kernel page pool use in the
    out-of-tree Mellanox driver, making it work the same way as
    it did with the older kernel.

    Change-Id: If7e7155867d539352fcd0ea3acd5a17dd9d9579f
    Closes-Bug: 1860347
    Signed-off-by: Jim Somerville <email address hidden>
    (cherry picked from commit 424ba94a9aa9e64fe1c0d2099b63e7d979b492cd)

Ghada Khalil (gkhalil)
tags: added: in-r-stx20 in-r-stx30
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/705861

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/705861
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=a3267c2016e1805f05e72e9063b2db8a227891c2
Submitter: Zuul
Branch: f/centos8

commit 77b632e28f27ab53a840f098fcfbba3db2714a1f
Author: Don Penney <email address hidden>
Date: Wed Feb 5 11:28:32 2020 -0500

    Fix containerd build failure

    The 20200205T023000Z CENGN build failed on containerd due to a build
    ordering issue. In the failed build, containerd was built ahead of
    rpm, and the mock build environment for the containerd build ran with
    the stock CentOS version of RPM. Unfortunately, it appears this
    version of RPM fails when trying to build the debuginfo for a golang
    package. There are currently two other golang packages in StarlingX,
    but these have debuginfo disabled in the spec.

    Adding a version-specific dependency in the containerd spec to ensure
    the newer RPM is installed resolves the issue.

    Change-Id: Ia7c85751012bbd0c3b83a2496bd7424e123eef93
    Closes-Bug: 1862038
    Co-Authored-By: Scott Little <email address hidden>
    Signed-off-by: Don Penney <email address hidden>

commit 7b7959e9b5cc9a68a6fcffba44bca2f84643b133
Author: Al Bailey <email address hidden>
Date: Tue Jan 28 07:49:23 2020 -0600

    Update pylint target for python3 and upper constraints

    This change imposes the upper constraint in tox to protect from
    future releases causing random breakage.

    Ex: A new version of python-libvirt was released Jan 23 2020
    which will not install on python2.

    This change also enables the python3 target for pylint which will
    allow the upper constraint to be changed to a more recent version
    when all the tox files are aligned.

    Change-Id: I9056778085d32b3401df60c20d67cff0a21dfe97
    Story: 2004515
    Task: 38496
    Signed-off-by: Al Bailey <email address hidden>

commit 424ba94a9aa9e64fe1c0d2099b63e7d979b492cd
Author: Jim Somerville <email address hidden>
Date: Fri Jan 24 12:36:06 2020 -0500

    Mellanox Driver: Disable use of kernel page pool functionality

    Problem: The out-of-tree Mellanox driver detects the presence
    of page pool support in the new kernel, and thus wants to use it.
    However, page pools are not configured to be on in the new kernel
    by default (CONFIG_PAGE_POOL), and not only that, the config
    option is hidden ie. it is not user selectable. The built-in
    Mellanox driver selects it, but we don't use the built-in driver.
    The out-of-tree driver does compile but not all pieces of it
    will load properly, specifically the mlx5 pieces which rely on
    page pool functionality being enabled in the kernel.

    Solution: Simply disable kernel page pool use in the
    out-of-tree Mellanox driver, making it work the same way as
    it did with the older kernel.

    Change-Id: If7e7155867d539352fcd0ea3acd5a17dd9d9579f
    Closes-Bug: 1860347
    Signed-off-by: Jim Somerville <email address hidden>

commit 7165b3539c75009311d3d4360a15b6ee4c7a4573
Author: Lin Shuicheng <email address hidden>
Date: Sun Jan 19 01:59:42 2020 +0000

   ...


tags: added: in-f-centos8
Yang Liu (yliu12) wrote :

The system with CX4 NICs (wcp63-66) was installed successfully with the 02-06 load.

tags: removed: stx.retestneeded
Wendy Mitchell (wmitchellwr) wrote :

Verified on wcp 61-62 with build 2020-02-10_00-10-00.

$ uname -r
3.10.0-1062...

The interfaces are now listed by
$ ifconfig -a
and the corresponding ports are also listed by the system command
$ system host-port-list controller-0
+--------------------------------------+----------+----------+--------------+--------+-----------+-------------+------------------------------------+
| uuid | name | type | pci address | device | processor | accelerated | device type |
+--------------------------------------+----------+----------+--------------+--------+-----------+-------------+------------------------------------+
| 082bb5fe-4e81-4b99-adc1-76d1eedfd4f8 | enp3s0f0 | ethernet | 0000:03:00.0 | 0 | 0 | True | I350 Gigabit Network Connection |
| | | | | | | | [1521] |
| | | | | | | | |
| 12627a00-79d8-456a-b65d-0df3a793d054 | enp3s0f3 | ethernet | 0000:03:00.3 | 0 | 0 | True | I350 Gigabit Network Connection |
| | | | | | | | [1521] |
| | | | | | | | |
| 3a1f6433-2fff-43bb-80d1-ffc5e96dbc5c | ens785f0 | ethernet | 0000:05:00.0 | 0 | 0 | True | MT27710 Family [ConnectX-4 Lx] |
| | | | | | | | [1015] |
| | | | | | | | |
| 8383b789-1f5e-48a4-beb7-117dd7fe1ac7 | ens785f1 | ethernet | 0000:05:00.1 | 0 | 0 | True | MT27710 Family [ConnectX-4 Lx] |
| | | | | | | | [1015] |
| | | | | | | | |
| eecb12eb-ae97-46bb-8e78-edc1e510fc91 | ens801f0 | ethernet | 0000:81:00.0 | 0 | 1 | True | MT27710 Family [ConnectX-4 Lx] |
| | | | | | | | [1015] |
| | | | | | | | |
| 2dabd28c-e65c-45ca-ba35-c0756fe4a4e8 | ens801f1 | ethernet | 0000:81:00.1 | 0 | 1 | True | MT27710 Family [ConnectX-4 Lx] |
| | | | | | | | [1015] ...
