Mellanox CX5 stops pinging with rx_wqe_err (mlx5_core)

Bug #1799393 reported by bugproxy
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
The Ubuntu-power-systems project | Fix Released | Critical | Canonical Kernel Team |
linux (Ubuntu) | Fix Released | Critical | Joseph Salisbury |
Cosmic | Fix Released | Critical | Joseph Salisbury |

Bug Description

== SRU Justification ==
The requested commit fixes a regression introduced by mainline commit
3a2f70331226, in v4.18-rc1. The commit is only needed in Cosmic. Due to
the regression, a Mellanox CX5 stops pinging with rx_wqe_err (mlx5_core).

== Fix ==
37fdffb217a4 ("net/mlx5: WQ, fixes for fragmented WQ buffers API")

== Regression Potential ==
Low. This commit has been cc'd to stable, so it has had additional
upstream review.

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

== Comment: #0 - Michael Ranweiler - 2018-10-18 11:34:40 ==

---Problem Description---
On the system, run:
ethtool -S enP48p1s0f0 | grep wqe_err
     rx_wqe_err: 1
     rx0_wqe_err: 0
     rx1_wqe_err: 0
     rx2_wqe_err: 0
     rx3_wqe_err: 1
     rx4_wqe_err: 0
     rx5_wqe_err: 0
     rx6_wqe_err: 0
     rx7_wqe_err: 0
     rx8_wqe_err: 0
     rx9_wqe_err: 0
     rx10_wqe_err: 0
     rx11_wqe_err: 0
     rx12_wqe_err: 0
     rx13_wqe_err: 0
     rx14_wqe_err: 0
     rx15_wqe_err: 0

This shows that the RX side is hitting the issue.

---Additional Hardware Info---
Mellanox CX5 Ethernet 100G
lspci
0030:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

Machine Type = P9

---Debugger---
A debugger is not configured

---Steps to Reproduce---
Using a CX5 Ethernet 100G card
lspci
0030:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]

Configure an IP address:
ifconfig enP48p1s0f0 33.33.33.33 netmask 255.255.255.0 up
Then configure an IP address on the partner system and try ping -f from it:
ping -f 33.33.33.33
PING 33.33.33.33 (33.33.33.33) 56(84) bytes of data.
........................................^C
--- 33.33.33.33 ping statistics ---
5413 packets transmitted, 5373 received, 0% packet loss, time 934ms
rtt min/avg/max/mdev = 0.015/0.019/0.669/0.010 ms, ipg/ewma 0.172/0.020 ms
# ping 33.33.33.33
PING 33.33.33.33 (33.33.33.33) 56(84) bytes of data.
^C
--- 33.33.33.33 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1071ms

Then, on the receiving system, run:
ethtool -S enP48p1s0f0 | grep wqe_err
     rx_wqe_err: 1
     rx0_wqe_err: 0
     rx1_wqe_err: 0
     rx2_wqe_err: 0
     rx3_wqe_err: 1
     rx4_wqe_err: 0
     rx5_wqe_err: 0
     rx6_wqe_err: 0
     rx7_wqe_err: 0
     rx8_wqe_err: 0
     rx9_wqe_err: 0
     rx10_wqe_err: 0
     rx11_wqe_err: 0
     rx12_wqe_err: 0
     rx13_wqe_err: 0
     rx14_wqe_err: 0
     rx15_wqe_err: 0
You will see rx_wqe_err with a non-zero counter.
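
The check above can be narrowed to just the counters that are non-zero. The following is a minimal sketch, not part of the original report: the function name nonzero_wqe_err is hypothetical, and it assumes the "name: value" layout shown in the ethtool -S output above.

```shell
#!/bin/sh
# Hypothetical helper: print every *_wqe_err counter whose value is non-zero.
# Reads `ethtool -S <iface>` style output ("name: value") on stdin.
nonzero_wqe_err() {
    awk -F': *' '/wqe_err/ && $2 + 0 != 0 {
        gsub(/^[ \t]+/, "", $1)   # drop the leading indentation ethtool adds
        print $1, $2
    }'
}

# Usage (interface name taken from the report; adjust for your system):
#   ethtool -S enP48p1s0f0 | nonzero_wqe_err
```

On an affected system this should list the aggregate rx_wqe_err counter together with the per-ring counters that incremented; on a fixed kernel it should print nothing.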

This is fixed by this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=37fdffb217a45609edccbb8b407d031143f551c0

== Comment: #1 - Carol L. Soto - 2018-10-18 11:46:00 ==
I cloned the cosmic git tree and loaded the kernel on a system.

With kernel 4.18.12 I can recreate it.

lspci | grep Mell | grep ConnectX-5
0000:01:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0000:01:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.0 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
0030:01:00.1 Infiniband controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
:~# ethtool -S enp1s0f0 | grep wqe_err
     rx_wqe_err: 2
     rx0_wqe_err: 1
     rx1_wqe_err: 1
     rx2_wqe_err: 0
     rx3_wqe_err: 0
     rx4_wqe_err: 0
     rx5_wqe_err: 0
     rx6_wqe_err: 0
     rx7_wqe_err: 0
     rx8_wqe_err: 0
     rx9_wqe_err: 0
     rx10_wqe_err: 0
...

Let me check if the proposed patch needs backport or not.

== Comment: #3 - Carol L. Soto - 2018-10-18 13:34:46 ==
I was able to apply the proposed patch as is to the cosmic git tree with no issues (no need to backport),
using kernel 4.18.12+.

With the proposed patch I do not see wqe err and ping does not stop.
ethtool -S enp1s0f0 | grep wqe_err
     rx_wqe_err: 0
     rx0_wqe_err: 0
     rx1_wqe_err: 0
     rx2_wqe_err: 0
     rx3_wqe_err: 0
     rx4_wqe_err: 0
     rx5_wqe_err: 0
     rx6_wqe_err: 0
     rx7_wqe_err: 0
     rx8_wqe_err: 0
     rx9_wqe_err: 0
     rx10_wqe_err: 0
...

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-172460 severity-critical targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu):
importance: Undecided → Critical
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
status: New → In Progress
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → In Progress
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commit 37fdffb217a45609edccbb8b407d031143f551c0. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1799393

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15(Bionic) you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15(Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.
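
The rule in the note above can be sketched as a small shell helper. This is only an illustration: the function name test_kernel_packages is hypothetical, it compares only the major.minor part of the version against 4.15, and it prints the package families named in the note rather than actual .deb file names.

```shell
#!/bin/sh
# Hypothetical helper: print the .deb package families to install for a
# test kernel version such as "4.18.0-10-generic".
# Only major.minor is compared against 4.15 (Bionic).
test_kernel_packages() {
    major=${1%%.*}            # text before the first dot, e.g. "4"
    rest=${1#*.}              # text after the first dot, e.g. "18.0-10-generic"
    minor=${rest%%[!0-9]*}    # leading digits of the remainder, e.g. "18"
    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "$minor" -ge 15 ]; }; then
        echo "linux-modules linux-modules-extra linux-image-unsigned"
    else
        echo "linux-image linux-image-extra"
    fi
}

# Usage:
#   test_kernel_packages "$(uname -r)"
```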

Thanks in advance!

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-10-23 19:47 EDT-------
(In reply to comment #8)
> I built a test kernel with commit 37fdffb217a45609edccbb8b407d031143f551c0.
> The test kernel can be downloaded from:
> http://kernel.ubuntu.com/~jsalisbury/lp1799393
>
> Can you test this kernel and see if it resolves this bug?
>
> Note about installing test kernels:
> • If the test kernel is prior to 4.15(Bionic) you need to install the
> linux-image and linux-image-extra .deb packages.
> • If the test kernel is 4.15(Bionic) or newer, you need to install the
> linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.
>
> Thanks in advance!

Hi
I was able to verify this with this kernel
4.18.0-10-generic #12~lp1799393 SMP Tue Oct 23 19:04:13 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

I did a ping flood and I can see that I am not getting wqe_err right away like before.
#netstat -in
Kernel Interface table
Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
enP2p1s0 1500 4295 0 9 0 271 0 0 0 BMRU
enp1s0f0 1500 5608322 0 0 0 5606566 0 0 0 BMRU
lo 65536 12 0 0 0 12 0 0 0 LRU
virbr0 1500 0 0 0 0 0 0 0 0 BMU
# ethtool -S enp1s0f0 | grep rx_wqe_err
rx_wqe_err: 0

Thanks.

Changed in linux (Ubuntu Cosmic):
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Joseph Salisbury (jsalisbury)
description: updated
Joseph Salisbury (jsalisbury) wrote :
Changed in linux (Ubuntu Cosmic):
status: In Progress → Fix Committed
Frank Heimes (fheimes) wrote :

Changing to Fix Committed since SRU was applied

Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-cosmic' to 'verification-done-cosmic'. If the problem still exists, change the tag 'verification-needed-cosmic' to 'verification-failed-cosmic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-cosmic
bugproxy (bugproxy)
tags: added: verification-done-cosmic
removed: verification-needed-cosmic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.18.0-12.13

---------------
linux (4.18.0-12.13) cosmic; urgency=medium

  * linux: 4.18.0-12.13 -proposed tracker (LP: #1802743)

  * [FEAT] Guest-dedicated Crypto Adapters (LP: #1787405)
    - s390/zcrypt: Add ZAPQ inline function.
    - s390/zcrypt: Review inline assembler constraints.
    - s390/zcrypt: Integrate ap_asm.h into include/asm/ap.h.
    - s390/zcrypt: fix ap_instructions_available() returncodes
    - KVM: s390: vsie: simulate VCPU SIE entry/exit
    - KVM: s390: introduce and use KVM_REQ_VSIE_RESTART
    - KVM: s390: refactor crypto initialization
    - s390: vfio-ap: base implementation of VFIO AP device driver
    - s390: vfio-ap: register matrix device with VFIO mdev framework
    - s390: vfio-ap: sysfs interfaces to configure adapters
    - s390: vfio-ap: sysfs interfaces to configure domains
    - s390: vfio-ap: sysfs interfaces to configure control domains
    - s390: vfio-ap: sysfs interface to view matrix mdev matrix
    - KVM: s390: interface to clear CRYCB masks
    - s390: vfio-ap: implement mediated device open callback
    - s390: vfio-ap: implement VFIO_DEVICE_GET_INFO ioctl
    - s390: vfio-ap: zeroize the AP queues
    - s390: vfio-ap: implement VFIO_DEVICE_RESET ioctl
    - KVM: s390: Clear Crypto Control Block when using vSIE
    - KVM: s390: vsie: Do the CRYCB validation first
    - KVM: s390: vsie: Make use of CRYCB FORMAT2 clear
    - KVM: s390: vsie: Allow CRYCB FORMAT-2
    - KVM: s390: vsie: allow CRYCB FORMAT-1
    - KVM: s390: vsie: allow CRYCB FORMAT-0
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-1
    - KVM: s390: vsie: allow guest FORMAT-1 CRYCB on host FORMAT-2
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-2
    - KVM: s390: device attrs to enable/disable AP interpretation
    - KVM: s390: CPU model support for AP virtualization
    - s390: doc: detailed specifications for AP virtualization
    - KVM: s390: fix locking for crypto setting error path
    - KVM: s390: Tracing APCB changes
    - s390: vfio-ap: setup APCB mask using KVM dedicated function
    - [Config:] Enable CONFIG_S390_AP_IOMMU and set CONFIG_VFIO_AP to module.

  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts

  * CVE-2018-18955: nested user namespaces with more than five extents
    incorrectly grant privileges over inode (LP: #1801924) // CVE-2018-18955
    - userns: also map extents in the reverse map to kernel IDs

  * kdump fail due to an IRQ storm (LP: #1797990)
    - SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
    - SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
    - SAUCE: x86/quirks: Scan all busses for early PCI quirks

  * crash in ENA driver on removing an interface (LP: #1802341)
    - SAUCE: net: ena: fix crash during ena_remove()

  * Ubuntu 18.04.1 - [s390x] Kernel panic while stressing network bonding
    (LP: #1797367)
    - s390/qeth: reduce hard-coded access to ccw channels
    - s390/qeth: sanitize strings in debug messages

  * Add checksum offload and T...

Changed in linux (Ubuntu Cosmic):
status: Fix Committed → Fix Released
Andrew Cloke (andrew-cloke) wrote :

Marking main "Linux" series as "Fix Released" as this issue is already in the disco kernel.

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
bugproxy (bugproxy)
tags: added: verification-done-bionic
removed: verification-needed-bionic
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-03-01 00:25 EDT-------
I had already verified this bugzilla, but I reverified with this level (4.18.0-16-generic).
uname -r
4.18.0-16-generic
# netstat -in
Kernel Interface table
Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
enP2p1s0 1500 9161 0 9 0 379 0 0 0 BMRU
enp1s0f0 1500 5459302 0 0 0 5455281 0 0 0 BMRU
lo 65536 12 0 0 0 12 0 0 0 LRU
virbr0 1500 0 0 0 0 0 0 0 0 BMU
# ethtool -S enp1s0f0 | grep rx_wqe
rx_wqe_err: 0

Brad Figg (brad-figg)
tags: added: cscc
bugproxy (bugproxy)
tags: added: targetmilestone-inin1810
removed: targetmilestone-inin---