Mellanox CX5 stops pinging with rx_wqe_err (mlx5_core)

Bug #1799393 reported by bugproxy on 2018-10-23
This bug affects 1 person
Affects                           Status        Importance  Assigned to            Milestone
The Ubuntu-power-systems project  Fix Released  Critical    Canonical Kernel Team
linux (Ubuntu)                    Fix Released  Critical    Joseph Salisbury
linux (Ubuntu) Cosmic             Fix Released  Critical    Joseph Salisbury

Bug Description

== SRU Justification ==
The requested commit fixes a regression introduced by mainline commit
3a2f70331226 in v4.18-rc1. The commit is only needed in Cosmic. Due to
the regression, a Mellanox CX5 stops pinging with rx_wqe_err (mlx5_core).

== Fix ==
37fdffb217a4 ("net/mlx5: WQ, fixes for fragmented WQ buffers API")

== Regression Potential ==
Low. This commit has been cc'd to stable, so it has had additional
upstream review.

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

== Comment: #0 - Michael Ranweiler - 2018-10-18 11:34:40 ==

---Problem Description---
On the system, if you run
ethtool -S enP48p1s0f0 | grep wqe_err
     rx_wqe_err: 1
     rx0_wqe_err: 0
     rx1_wqe_err: 0
     rx2_wqe_err: 0
     rx3_wqe_err: 1
     rx4_wqe_err: 0
     rx5_wqe_err: 0
     rx6_wqe_err: 0
     rx7_wqe_err: 0
     rx8_wqe_err: 0
     rx9_wqe_err: 0
     rx10_wqe_err: 0
     rx11_wqe_err: 0
     rx12_wqe_err: 0
     rx13_wqe_err: 0
     rx14_wqe_err: 0
     rx15_wqe_err: 0

This shows that the RX side is hitting errors.

---Additional Hardware Info---
Mellanox CX5 Ethernet 100G
lspci
0030:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

Machine Type = P9

---Debugger---
A debugger is not configured

---Steps to Reproduce---
Using a CX5 Ethernet 100G card
lspci
0030:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

Configure an IP address on the interface:
ifconfig enP48p1s0f0 33.33.33.33 netmask 255.255.255.0 up
Then configure an IP address on the partner system and run ping -f from it:
ping -f 33.33.33.33
PING 33.33.33.33 (33.33.33.33) 56(84) bytes of data.
........................................^C
--- 33.33.33.33 ping statistics ---
5413 packets transmitted, 5373 received, 0% packet loss, time 934ms
rtt min/avg/max/mdev = 0.015/0.019/0.669/0.010 ms, ipg/ewma 0.172/0.020 ms
# ping 33.33.33.33
PING 33.33.33.33 (33.33.33.33) 56(84) bytes of data.
^C
--- 33.33.33.33 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1071ms

Then, on the receiving system, run:
ethtool -S enP48p1s0f0 | grep wqe_err
     rx_wqe_err: 1
     rx0_wqe_err: 0
     rx1_wqe_err: 0
     rx2_wqe_err: 0
     rx3_wqe_err: 1
     rx4_wqe_err: 0
     rx5_wqe_err: 0
     rx6_wqe_err: 0
     rx7_wqe_err: 0
     rx8_wqe_err: 0
     rx9_wqe_err: 0
     rx10_wqe_err: 0
     rx11_wqe_err: 0
     rx12_wqe_err: 0
     rx13_wqe_err: 0
     rx14_wqe_err: 0
     rx15_wqe_err: 0
You will see that the rx_wqe_err counter is non-zero.
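
The counter check in the steps above can be scripted. The following is a minimal sketch, not part of the original report: the function name nonzero_wqe_err is hypothetical, and it only assumes the "name: value" layout of the ethtool -S output shown above. It prints the wqe_err counters whose value is non-zero and prints nothing when all of them are zero.

```shell
# Hypothetical helper: read `ethtool -S <iface>` output on stdin and print
# only the wqe_err counters whose value is non-zero.
nonzero_wqe_err() {
    awk -F': *' '
        /wqe_err/ && ($2 + 0) != 0 {
            gsub(/^[ \t]+/, "", $1)   # strip the leading indentation
            print $1 ": " $2
        }'
}

# Usage on the receiving system (interface name taken from the report):
#   ethtool -S enP48p1s0f0 | nonzero_wqe_err
```

An empty result after a ping flood would suggest that the RX queues are not reporting WQE errors.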

This is fixed by this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=37fdffb217a45609edccbb8b407d031143f551c0

== Comment: #1 - Carol L. Soto - 2018-10-18 11:46:00 ==
I did a git clone of the cosmic tree and loaded the kernel on a system.

With kernel 4.18.12 I can recreate it.

lspci | grep Mell | grep ConnectX-5
0000:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.1 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
:~# ethtool -S enp1s0f0 | grep wqe_err
     rx_wqe_err: 2
     rx0_wqe_err: 1
     rx1_wqe_err: 1
     rx2_wqe_err: 0
     rx3_wqe_err: 0
     rx4_wqe_err: 0
     rx5_wqe_err: 0
     rx6_wqe_err: 0
     rx7_wqe_err: 0
     rx8_wqe_err: 0
     rx9_wqe_err: 0
     rx10_wqe_err: 0
...

Let me check whether the proposed patch needs a backport.

== Comment: #3 - Carol L. Soto - 2018-10-18 13:34:46 ==
I was able to apply the proposed patch as is to the cosmic git tree without any issue (no backport needed),
using kernel 4.18.12+.

With the proposed patch I do not see any wqe_err and ping does not stop.
ethtool -S enp1s0f0 | grep wqe_err
     rx_wqe_err: 0
     rx0_wqe_err: 0
     rx1_wqe_err: 0
     rx2_wqe_err: 0
     rx3_wqe_err: 0
     rx4_wqe_err: 0
     rx5_wqe_err: 0
     rx6_wqe_err: 0
     rx7_wqe_err: 0
     rx8_wqe_err: 0
     rx9_wqe_err: 0
     rx10_wqe_err: 0
...

bugproxy (bugproxy) on 2018-10-23
tags: added: architecture-ppc64le bugnameltc-172460 severity-critical targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Changed in ubuntu-power-systems:
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
importance: Undecided → Critical
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
status: New → In Progress
Changed in ubuntu-power-systems:
status: New → In Progress
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commit 37fdffb217a45609edccbb8b407d031143f551c0. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1799393

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15(Bionic) you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15(Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.

Thanks in advance!
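
The package rule in the note above can be sketched as a small shell function. This is an illustration only: the function name test_kernel_packages is hypothetical, and it assumes the version string starts with "major.minor" (for example the output of uname -r).

```shell
# Hypothetical helper: print the .deb packages to install for a test kernel,
# following the 4.15 (Bionic) threshold stated in the note above.
test_kernel_packages() {
    version=$1
    major=${version%%.*}        # text before the first dot
    rest=${version#*.}          # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "${minor:-0}" -ge 15 ]; }; then
        echo "linux-modules linux-modules-extra linux-image-unsigned"
    else
        echo "linux-image linux-image-extra"
    fi
}

# Usage:
#   test_kernel_packages "$(uname -r)"
```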

------- Comment From <email address hidden> 2018-10-23 19:47 EDT-------
(In reply to comment #8)
> I built a test kernel with commit 37fdffb217a45609edccbb8b407d031143f551c0.
> The test kernel can be downloaded from:
> http://kernel.ubuntu.com/~jsalisbury/lp1799393
>
> Can you test this kernel and see if it resolves this bug?
>
> Note about installing test kernels:
> • If the test kernel is prior to 4.15(Bionic) you need to install the
> linux-image and linux-image-extra .deb packages.
> • If the test kernel is 4.15(Bionic) or newer, you need to install the
> linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.
>
> Thanks in advance!

Hi
I was able to verify this with this kernel
4.18.0-10-generic #12~lp1799393 SMP Tue Oct 23 19:04:13 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

I did a ping flood and I can see that I am not getting wqe_err right away like before.
#netstat -in
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
enP2p1s0  1500     4295      0      9      0      271      0      0      0 BMRU
enp1s0f0  1500  5608322      0      0      0  5606566      0      0      0 BMRU
lo       65536       12      0      0      0       12      0      0      0 LRU
virbr0    1500        0      0      0      0        0      0      0      0 BMU
# ethtool -S enp1s0f0 | grep rx_wqe_err
rx_wqe_err: 0

Thanks.
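
The interface-level check in this comment can also be scripted. The sketch below is not from the original comment: the function name iface_rx_errors is hypothetical, and it assumes the netstat -in header names shown above (Iface, RX-ERR, RX-DRP). It finds the columns by header name, so it does not depend on their position.

```shell
# Hypothetical helper: read `netstat -in` output on stdin and print the
# RX-ERR and RX-DRP values for the interface named in $1.
iface_rx_errors() {
    awk -v ifc="$1" '
        # Header row: remember the position of every column name.
        $1 == "Iface" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data row for the requested interface.
        $1 == ifc && ("RX-ERR" in col) && ("RX-DRP" in col) {
            print "RX-ERR=" $(col["RX-ERR"]) " RX-DRP=" $(col["RX-DRP"])
        }'
}

# Usage (interface name taken from the comment above):
#   netstat -in | iface_rx_errors enp1s0f0
```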

Changed in linux (Ubuntu Cosmic):
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Joseph Salisbury (jsalisbury)
description: updated
Changed in linux (Ubuntu Cosmic):
status: In Progress → Fix Committed
Frank Heimes (frank-heimes) wrote :

Changing to Fix Committed since SRU was applied

Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-cosmic' to 'verification-done-cosmic'. If the problem still exists, change the tag 'verification-needed-cosmic' to 'verification-failed-cosmic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-cosmic
bugproxy (bugproxy) on 2018-11-16
tags: added: verification-done-cosmic
removed: verification-needed-cosmic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.18.0-12.13

---------------
linux (4.18.0-12.13) cosmic; urgency=medium

  * linux: 4.18.0-12.13 -proposed tracker (LP: #1802743)

  * [FEAT] Guest-dedicated Crypto Adapters (LP: #1787405)
    - s390/zcrypt: Add ZAPQ inline function.
    - s390/zcrypt: Review inline assembler constraints.
    - s390/zcrypt: Integrate ap_asm.h into include/asm/ap.h.
    - s390/zcrypt: fix ap_instructions_available() returncodes
    - KVM: s390: vsie: simulate VCPU SIE entry/exit
    - KVM: s390: introduce and use KVM_REQ_VSIE_RESTART
    - KVM: s390: refactor crypto initialization
    - s390: vfio-ap: base implementation of VFIO AP device driver
    - s390: vfio-ap: register matrix device with VFIO mdev framework
    - s390: vfio-ap: sysfs interfaces to configure adapters
    - s390: vfio-ap: sysfs interfaces to configure domains
    - s390: vfio-ap: sysfs interfaces to configure control domains
    - s390: vfio-ap: sysfs interface to view matrix mdev matrix
    - KVM: s390: interface to clear CRYCB masks
    - s390: vfio-ap: implement mediated device open callback
    - s390: vfio-ap: implement VFIO_DEVICE_GET_INFO ioctl
    - s390: vfio-ap: zeroize the AP queues
    - s390: vfio-ap: implement VFIO_DEVICE_RESET ioctl
    - KVM: s390: Clear Crypto Control Block when using vSIE
    - KVM: s390: vsie: Do the CRYCB validation first
    - KVM: s390: vsie: Make use of CRYCB FORMAT2 clear
    - KVM: s390: vsie: Allow CRYCB FORMAT-2
    - KVM: s390: vsie: allow CRYCB FORMAT-1
    - KVM: s390: vsie: allow CRYCB FORMAT-0
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-1
    - KVM: s390: vsie: allow guest FORMAT-1 CRYCB on host FORMAT-2
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-2
    - KVM: s390: device attrs to enable/disable AP interpretation
    - KVM: s390: CPU model support for AP virtualization
    - s390: doc: detailed specifications for AP virtualization
    - KVM: s390: fix locking for crypto setting error path
    - KVM: s390: Tracing APCB changes
    - s390: vfio-ap: setup APCB mask using KVM dedicated function
    - [Config:] Enable CONFIG_S390_AP_IOMMU and set CONFIG_VFIO_AP to module.

  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts

  * CVE-2018-18955: nested user namespaces with more than five extents
    incorrectly grant privileges over inode (LP: #1801924) // CVE-2018-18955
    - userns: also map extents in the reverse map to kernel IDs

  * kdump fail due to an IRQ storm (LP: #1797990)
    - SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
    - SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
    - SAUCE: x86/quirks: Scan all busses for early PCI quirks

  * crash in ENA driver on removing an interface (LP: #1802341)
    - SAUCE: net: ena: fix crash during ena_remove()

  * Ubuntu 18.04.1 - [s390x] Kernel panic while stressing network bonding
    (LP: #1797367)
    - s390/qeth: reduce hard-coded access to ccw channels
    - s390/qeth: sanitize strings in debug messages

  * Add checksum offload and T...

Changed in linux (Ubuntu Cosmic):
status: Fix Committed → Fix Released
Andrew Cloke (andrew-cloke) wrote :

Marking main "Linux" series as "Fix Released" as this issue is already in the disco kernel.

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
bugproxy (bugproxy) on 2019-02-14
tags: added: verification-done-bionic
removed: verification-needed-bionic
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-03-01 00:25 EDT-------
I had already verified this bug, but I re-verified it at this level (4.18.0-16-generic).
uname -r
4.18.0-16-generic
# netstat -in
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
enP2p1s0  1500     9161      0      9      0      379      0      0      0 BMRU
enp1s0f0  1500  5459302      0      0      0  5455281      0      0      0 BMRU
lo       65536       12      0      0      0       12      0      0      0 LRU
virbr0    1500        0      0      0      0        0      0      0      0 BMU
# ethtool -S enp1s0f0 | grep rx_wqe
rx_wqe_err: 0
