Mellanox CX5 stops pinging with rx_wqe_err (mlx5_core)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
The Ubuntu-power-systems project | Fix Released | Critical | Canonical Kernel Team |
linux (Ubuntu) | Fix Released | Critical | Unassigned |
Cosmic | Fix Released | Critical | Unassigned |
Bug Description
== SRU Justification ==
The requested commit fixes a regression introduced by mainline commit
3a2f70331226, in v4.18-rc1. The commit is only needed in Cosmic. Due to
the regression, a Mellanox CX5 stops pinging with rx_wqe_err (mlx5_core).
== Fix ==
37fdffb217a4 ("net/mlx5: WQ, fixes for fragmented WQ buffers API")
== Regression Potential ==
Low. This commit has been cc'd to stable, so it has had additional
upstream review.
== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.
== Comment: #0 - Michael Ranweiler - 2018-10-18 11:34:40 ==
---Problem Description---
On the affected system, run:
ethtool -S enP48p1s0f0 | grep wqe_err
rx_wqe_err: 1
rx0_wqe_err: 0
rx1_wqe_err: 0
rx2_wqe_err: 0
rx3_wqe_err: 1
rx4_wqe_err: 0
rx5_wqe_err: 0
rx6_wqe_err: 0
rx7_wqe_err: 0
rx8_wqe_err: 0
rx9_wqe_err: 0
rx10_wqe_err: 0
rx11_wqe_err: 0
rx12_wqe_err: 0
rx13_wqe_err: 0
rx14_wqe_err: 0
rx15_wqe_err: 0
The non-zero counters show that the rx side is hitting the issue.
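The check above can be narrowed to only the counters that matter. The following is a minimal sketch, not part of the original report: the helper name `nonzero_wqe_err` is hypothetical, and it assumes the `name: value` layout shown in the ethtool output above.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -S <iface>` output on stdin and print
# only the *_wqe_err counters whose value is non-zero, as name=value.
nonzero_wqe_err() {
    awk -F': *' '/wqe_err/ && $2 + 0 > 0 {
        gsub(/^[ \t]+/, "", $1)   # drop any leading indentation
        print $1 "=" ($2 + 0)
    }'
}

# Example invocation (interface name taken from the report):
#   ethtool -S enP48p1s0f0 | nonzero_wqe_err
```

No output means every wqe_err counter is still zero.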
---Additional Hardware Info---
Mellanox CX5 Ethernet 100G
lspci
0030:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
Machine Type = P9
---Debugger---
A debugger is not configured
---Steps to Reproduce---
Using a CX5 Ethernet 100G card
lspci
0030:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
Configure an IP address:
ifconfig enP48p1s0f0 33.33.33.33 netmask 255.255.255.0 up
Then configure an IP address on the partner system and run a flood ping (ping -f) from it:
ping -f 33.33.33.33
PING 33.33.33.33 (33.33.33.33) 56(84) bytes of data.
.......
--- 33.33.33.33 ping statistics ---
5413 packets transmitted, 5373 received, 0% packet loss, time 934ms
rtt min/avg/max/mdev = 0.015/0.
# ping 33.33.33.33
PING 33.33.33.33 (33.33.33.33) 56(84) bytes of data.
^C
--- 33.33.33.33 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1071ms
Then, on the receiving system, run:
ethtool -S enP48p1s0f0 | grep wqe_err
rx_wqe_err: 1
rx0_wqe_err: 0
rx1_wqe_err: 0
rx2_wqe_err: 0
rx3_wqe_err: 1
rx4_wqe_err: 0
rx5_wqe_err: 0
rx6_wqe_err: 0
rx7_wqe_err: 0
rx8_wqe_err: 0
rx9_wqe_err: 0
rx10_wqe_err: 0
rx11_wqe_err: 0
rx12_wqe_err: 0
rx13_wqe_err: 0
rx14_wqe_err: 0
rx15_wqe_err: 0
The rx_wqe_err counter will be non-zero.
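The failure in the steps above shows up as 100% packet loss on a plain ping after the flood ping. As a rough sketch (the helper name `ping_loss_percent` is hypothetical, and it assumes the iputils summary line format shown above), the loss figure can be extracted like this:

```shell
#!/bin/sh
# Hypothetical helper: read ping output on stdin and print the packet loss
# percentage from the summary line ("..., 100% packet loss, ...").
ping_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example invocation (address taken from the report):
#   ping -c 2 33.33.33.33 | ping_loss_percent
```

A value of 100 after a flood ping, together with a non-zero rx_wqe_err counter on the receiver, matches the behaviour described in this report.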
This is fixed by this patch:
https:/
== Comment: #1 - Carol L. Soto - 2018-10-18 11:46:00 ==
I cloned the Cosmic git tree and loaded the kernel on a system.
With kernel 4.18.12 I can recreate the issue.
lspci | grep Mell | grep ConnectX-5
0000:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.1 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
:~# ethtool -S enp1s0f0 | grep wqe_err
rx_wqe_err: 2
rx0_wqe_err: 1
rx1_wqe_err: 1
rx2_wqe_err: 0
rx3_wqe_err: 0
rx4_wqe_err: 0
rx5_wqe_err: 0
rx6_wqe_err: 0
rx7_wqe_err: 0
rx8_wqe_err: 0
rx9_wqe_err: 0
rx10_wqe_err: 0
...
Let me check if the proposed patch needs backport or not.
== Comment: #3 - Carol L. Soto - 2018-10-18 13:34:46 ==
I was able to apply the proposed patch as is to the Cosmic git tree without issues (no backport needed),
using kernel 4.18.12+.
With the proposed patch applied, I do not see any wqe_err and ping does not stop.
ethtool -S enp1s0f0 | grep wqe_err
rx_wqe_err: 0
rx0_wqe_err: 0
rx1_wqe_err: 0
rx2_wqe_err: 0
rx3_wqe_err: 0
rx4_wqe_err: 0
rx5_wqe_err: 0
rx6_wqe_err: 0
rx7_wqe_err: 0
rx8_wqe_err: 0
rx9_wqe_err: 0
rx10_wqe_err: 0
...
tags: | added: architecture-ppc64le bugnameltc-172460 severity-critical targetmilestone-inin--- |
Changed in ubuntu: | |
assignee: | nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) |
affects: | ubuntu → linux (Ubuntu) |
Changed in ubuntu-power-systems: | |
importance: | Undecided → Critical |
assignee: | nobody → Canonical Kernel Team (canonical-kernel-team) |
Changed in linux (Ubuntu): | |
importance: | Undecided → Critical |
assignee: | Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury) |
status: | New → In Progress |
Changed in ubuntu-power-systems: | |
status: | New → In Progress |
Changed in linux (Ubuntu Cosmic): | |
status: | New → In Progress |
importance: | Undecided → Critical |
assignee: | nobody → Joseph Salisbury (jsalisbury) |
description: | updated |
Changed in linux (Ubuntu Cosmic): | |
status: | In Progress → Fix Committed |
tags: |
added: verification-done-cosmic removed: verification-needed-cosmic |
tags: |
added: verification-done-bionic removed: verification-needed-bionic |
tags: | added: cscc |
tags: |
added: targetmilestone-inin1810 removed: targetmilestone-inin--- |
I built a test kernel with commit 37fdffb217a45609edccbb8b407d031143f551c0. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1799393
Can you test this kernel and see if it resolves this bug?
Note about installing test kernels:
• If the test kernel is prior to 4.15(Bionic) you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15(Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.
Thanks in advance!
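The package rule in the note above can be expressed as a small version check. This is only a sketch: the function name `test_kernel_packages` is hypothetical, and the `linux-image-unsigned` name is an assumption pieced together from the fragments of the note, not something this report states in one place.

```shell
#!/bin/sh
# Hypothetical helper: given a kernel version such as "4.18.12", print the
# .deb package families to install, using the 4.15 (Bionic) cutoff from the
# note above.
test_kernel_packages() {
    major=${1%%.*}              # text before the first dot
    rest=${1#*.}                # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "${minor:-0}" -ge 15 ]; }; then
        # Assumed package name: linux-image-unsigned (see lead-in).
        echo "linux-modules linux-modules-extra linux-image-unsigned"
    else
        echo "linux-image linux-image-extra"
    fi
}

# Example invocation:
#   test_kernel_packages 4.18.12
```

For the 4.18.12 test kernel discussed in this bug, the check selects the second rule (4.15 or newer).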