[UBUNTU 20.04] [HPS] Kernel panic with "refcount_t: underflow" in mlx5 driver
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu on IBM z Systems | Fix Released | High | Skipper Bug Screeners |
linux (Ubuntu) | Fix Released | High | Skipper Bug Screeners |
Focal | Fix Released | High | Canonical Kernel Team |
Bug Description
SRU Justification:
==================
[ Impact ]
* The mlx5 driver is causing a Kernel panic with
"refcount_t: underflow".
* This issue occurs during a recovery when the PCI device
is isolated and thus doesn't respond.
[ Fix ]
* This issue was fixed upstream with
aaf2e65cac7f
"net/mlx5: Fix handling of entry refcount when command
is not issued to FW" (upstream since 6.1-rc1)
* But to apply aaf2e65cac7f, a backport of
b898ce7bccf1
"net/mlx5: cmdif, Avoid skipping reclaim pages if FW is
not accessible" is required as a prerequisite (upstream since 5.10)
[ Test Plan ]
* An Ubuntu Server for s390x 20.04 LPAR or z/VM installation
is needed that has Mellanox cards (RoCE Express 2.1)
assigned, configured and enabled and that runs a 5.4
kernel with mlx5 driver.
* Create some network traffic on (one of) the RoCE devices
(interface ens???[d?]) for testing (e.g. with stress-ng).
* Make sure the module/driver mlx5 is loaded and in use.
* Trigger a recovery (via the Support Element)
that will render the adapter (ports) unresponsive
for a moment and should provoke a similar situation.
* Alternatively the interface itself can be removed for
a moment and re-added again (but this may break further
things on top).
* Due to the lack of RoCE Express 2.1 hardware,
the verification needs to be done by IBM.
[ Where problems could occur ]
* The modifications are limited to the Mellanox mlx5 driver
only - no other network driver is affected.
* The prerequisite commit (b898ce7bccf1) can have a bad
impact on (re-)claiming pages if FW is not accessible,
which could cause page leaks if done wrong.
But this commit is pretty safe since it has been upstream
since v5.10.
* The fix itself (aaf2e65cac7f) mainly changes the
cmd_work_handler and mlx5_cmd_
in a way that instead of pci_channel_offline
mlx5_cmd_is_down (introduced by b898ce7bccf1) is used.
* Actually b898ce7bccf1 started with changing from
pci_
but it looks like a few cases
(in the area of refcount increase/decrease) were missed,
which are now covered by aaf2e65cac7f.
* On top of that, refcounts are now always properly
incremented and decremented to achieve a symmetric state
for all flows.
* These changes may have an impact on all cases where the
mlx5 device is not responding, which can happen in case
of an offline channel, interface down, reset or recovery.
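The point about symmetric refcounts can be illustrated with a small user-space sketch. This is not the mlx5 code: struct cmd_entry, entry_init, entry_get, entry_put, run_command_buggy and run_command_fixed are hypothetical names made up for this example. It assumes the asymmetric flow is one where the increment is skipped while the device is unreachable but the completion path still decrements, and the counter only imitates the way the kernel's refcount_t flags a decrement from zero instead of wrapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a command entry; not the mlx5 structure. */
struct cmd_entry {
    int refcnt;        /* number of references currently held */
    bool underflowed;  /* set when a put happens with no reference left */
};

void entry_init(struct cmd_entry *ent)
{
    assert(ent != NULL);
    ent->refcnt = 1;           /* the creator holds the initial reference */
    ent->underflowed = false;
}

void entry_get(struct cmd_entry *ent)
{
    assert(ent != NULL);
    ent->refcnt++;
}

/* Refuse to drop below zero and record the event instead of wrapping. */
void entry_put(struct cmd_entry *ent)
{
    assert(ent != NULL);
    if (ent->refcnt == 0) {
        ent->underflowed = true;
        fprintf(stderr, "sketch: refcount underflow\n");
        return;
    }
    ent->refcnt--;
}

/* Asymmetric flow (assumed): get is skipped when the device is down,
 * but the forced completion still performs its put. */
bool run_command_buggy(bool device_down)
{
    struct cmd_entry ent;

    entry_init(&ent);
    if (!device_down)
        entry_get(&ent);   /* reference for the in-flight command */
    entry_put(&ent);       /* completion, real or forced, drops one */
    entry_put(&ent);       /* creator drops the initial reference */
    return ent.underflowed;
}

/* Symmetric flow: the get is taken on every path, so each put is paired. */
bool run_command_fixed(bool device_down)
{
    struct cmd_entry ent;

    (void)device_down;     /* device state no longer changes the counting */
    entry_init(&ent);
    entry_get(&ent);
    entry_put(&ent);
    entry_put(&ent);
    return ent.underflowed;
}
```

Calling run_command_buggy(true) reports an underflow in this sketch, while run_command_fixed reports none for either device state, which is the symmetric behaviour the fix aims for.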
[ Other Info ]
* Looking at the master-next git trees for jammy, kinetic
and lunar showed that both fixes are already included,
hence only focal is affected.
__________
---Problem Description---
Kernel panic with "refcount_t: underflow" in kernel log
Contact Information = <email address hidden>, <email address hidden>
---uname output---
5.4.0-128-generic
Machine Type = s390x
---System Hang---
Kernel panic and stack-trace as below
---Debugger---
A debugger is not configured
Stack trace output:
[Sat Apr 8 17:52:21 UTC 2023] Call Trace:
[Sat Apr 8 17:52:21 UTC 2023] ([<0000002a5939
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f8
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fe
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff80613
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f13
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e96
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a594f5
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f13
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e30
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e27
[Sat Apr 8 17:52:21 UTC 2023] ([<0000002a58e2
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e27
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee1
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee1
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3a
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3a
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a
[Sat Apr 8 17:52:21 UTC 2023] Last Breaking-
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5939a
[Sat Apr 8 17:52:21 UTC 2023] ---[ end trace 6ec6f9c6f666ca2d ]---
[Sat Apr 8 17:52:21 UTC 2023] specification exception: 0006 ilc:3 [#1] SMP
[Sat Apr 8 17:52:21 UTC 2023] Modules linked in: sysdigcloud_
[Sat Apr 8 17:52:21 UTC 2023] vfio_ccw chsc_sch vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_
[Sat Apr 8 17:52:21 UTC 2023] CPU: 12 PID: 83893 Comm: kworker/u400:91 Kdump: loaded Tainted: G W OE 5.4.0-128-generic #144~18.04.1-Ubuntu
[Sat Apr 8 17:52:21 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR)
[Sat Apr 8 17:52:21 UTC 2023] Workqueue: mlx5e mlx5e_update_
[Sat Apr 8 17:52:21 UTC 2023] Krnl PSW : 0404d00180000000 0000002a58ec51d8 (queue_
[Sat Apr 8 17:52:21 UTC 2023] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[Sat Apr 8 17:52:21 UTC 2023] Krnl GPRS: 1d721b7c57e8d7f5 0000000000000001 0000000000000200 0000006222a0e800
[Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000000000000000 0000000000000000 000003e016d23d08
[Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000006287800120 0000003b8dbbd740 0700003b8dbbd740
[Sat Apr 8 17:52:21 UTC 2023] 00000062690c6600 000003ff8069c808 000003e016d23ae0 000003e016d23aa8
[Sat Apr 8 17:52:21 UTC 2023] Krnl Code: 0000002a58ec51c6: f0a0a7190001 srp 1817(11,%r10),1,0
[Sat Apr 8 17:52:21 UTC 2023] Call Trace:
[Sat Apr 8 17:52:21 UTC 2023] ([<000003e016d2
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fa
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fb
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff80657
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff80644
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061f
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec5
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec5
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecd
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a
[Sat Apr 8 17:52:21 UTC 2023] Last Breaking-
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee
[Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops
Oops output:
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee
[Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops
------------
[Michael]
I had a look into the dump from wdc3-qz1-
crash> sys
The system was up and running since:
UPTIME: 282 days, 02:16:10
There are a lot of martian source messages again like:
[Sun Apr 16 11:09:28 UTC 2023] IPv4: martian source 11.44.203.141 from 11.21.133.2, on dev ipsec0
[Sun Apr 16 11:09:28 UTC 2023] ll header: 00000000: ff ff ff ff ff ff fe ff 0b 15 85 02 08 06
I hope that we get them suppressed soon.
Then at the following time a first issue can be observed: NFS timeout
[Sun Apr 16 11:09:39 UTC 2023] nfs: server ccistorwdc0751-
The reason could be
a) the server
b) the network
c) the local network adapter
Then about 1 hour and 5 minutes later the first mlx5 related issues are reported:
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.2 p0v0: mlx5e_ethtool_
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.3 p0v1: mlx5e_ethtool_
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.6 p0v4: mlx5e_ethtool_
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.2 p1v0: mlx5e_ethtool_
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.3 p1v1: mlx5e_ethtool_
...
Then about 15 minutes later the NFS code performs a panic_on_oops
...
[Sun Apr 16 12:32:34 UTC 2023] nfs: server ccistorwdc0751-
[Sun Apr 16 12:34:10 UTC 2023] Unable to handle kernel pointer dereference in virtual kernel address space
[Sun Apr 16 12:34:10 UTC 2023] Failing address: 0000809f00008000 TEID: 0000809f00008803
[Sun Apr 16 12:34:10 UTC 2023] Fault in home space mode while using kernel ASCE.
[Sun Apr 16 12:34:10 UTC 2023] AS:00000047431f4007 R3:0000000000000024
[Sun Apr 16 12:34:10 UTC 2023] Oops: 0038 ilc:3 [#1] SMP
[Sun Apr 16 12:34:10 UTC 2023] Modules linked in: sysdigcloud_
tatistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_mangle ip6table_nat ebt_redirect ebt_ip ebtable_broute sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact iptable_mangle xt_mark veth sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo
[Sun Apr 16 12:34:10 UTC 2023] s390_trng vfio_ccw vfio_mdev chsc_sch mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_
x_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe]
[Sun Apr 16 12:34:10 UTC 2023] CPU: 4 PID: 32942 Comm: kubelet Kdump: loaded Tainted: G W OE 5.4.0-110-generic #124~18.
[Sun Apr 16 12:34:10 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR)
[Sun Apr 16 12:34:10 UTC 2023] Krnl PSW : 0704f00180000000 000003ff8076304a (call_bind+
[Sun Apr 16 12:34:10 UTC 2023] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[Sun Apr 16 12:34:10 UTC 2023] Krnl GPRS: 00000000000001dc 0000005d16d22400 00000041b9826500 000003e008637ad8
[Sun Apr 16 12:34:10 UTC 2023] 000003ff807794d6 0000004742e35898 0000000000000000 00000041b9826537
[Sun Apr 16 12:34:10 UTC 2023] 000003ff807ae63c 000003ff80763010 0000809f0000809f 00000041b9826500
[Sun Apr 16 12:34:10 UTC 2023] 00000015a0c80000 000003ff807a1d80 000003e008637a80 000003e008637a48
[Sun Apr 16 12:34:10 UTC 2023] Krnl Code: 000003ff8076303a: a7840041 brc 8,000003ff807630bc
[Sun Apr 16 12:34:10 UTC 2023] Call Trace:
[Sun Apr 16 12:34:10 UTC 2023] ([<000000000000
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80360
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80361
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aae
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aae
[Sun Apr 16 12:34:10 UTC 2023] [<0000004742161
[Sun Apr 16 12:34:10 UTC 2023] [<0000004742161
[Sun Apr 16 12:34:10 UTC 2023] [<000000474277e
[Sun Apr 16 12:34:10 UTC 2023] Last Breaking-
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779
[Sun Apr 16 12:34:10 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops
The network interfaces p0 and p1 are missing:
crash> net | grep -P "p0 |p1 "
5b726fa000 macvtap0
It looks like the p0/p1 network interfaces were lost but no recovery was attempted. There are no related recovery messages from the mlx5 kernel module. The kernel finally dumps in the area of the NFS/RPC code.
That would be the related upstream commit:
aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW
----
[Niklas]
I agree that commit does sound like it could be the fix for exactly this issue. I checked the kernel tree at the tag Ubuntu-
it likely won't help with not losing the interface but it does sound like it could
solve the kernel crash/refcount warning.
=======
Summary:
It looks like this patch (aaf2e65cac7f) is missing in 20.04 and could be the reason for the crash.
We would like to backport this to 20.04, 20.04 HWE, 22.04 and 22.04 HWE.
aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW
https://<email address hidden>/
=======
tags: | added: architecture-s3903164 bugnameltc-202279 severity-high targetmilestone-inin--- |
Changed in ubuntu: | |
assignee: | nobody → Skipper Bug Screeners (skipper-screen-team) |
affects: | ubuntu → linux (Ubuntu) |
tags: | added: patch |
tags: |
added: targetmilestone-inin2004 removed: targetmilestone-inin--- |
description: | updated |
description: | updated |
Changed in linux (Ubuntu Focal): | |
status: | In Progress → Fix Committed |
Changed in ubuntu-z-systems: | |
status: | In Progress → Fix Committed |
Changed in ubuntu-z-systems: | |
status: | Fix Committed → Fix Released |
tags: | removed: verification-needed-focal-linux-bluefield |
tags: | added: verification-needed-focal-linux-bluefield |
The change requires the backport of one additional patch (both are provided above).
We created a test kernel with those changes for validation and you can find the Debian packages at https://people.canonical.com/~mhcerri/lp2019011/s390x_debs.tgz
Please let us know if the test kernel works as expected. Thank you!