[UBUNTU 20.04] [HPS] Kernel panic with "refcount_t: underflow" in mlx5 driver

Bug #2019011 reported by bugproxy
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ubuntu on IBM z Systems
Fix Released
High
Skipper Bug Screeners
linux (Ubuntu)
Fix Released
High
Skipper Bug Screeners
Focal
Fix Released
High
Canonical Kernel Team

Bug Description

SRU Justification:
==================

[ Impact ]

 * The mlx5 driver is causing a Kernel panic with
   "refcount_t: underflow".

 * This issue occurs during a recovery when the PCI device
   is isolated and thus doesn't respond.

[ Fix ]

 * This issue got solved upstream with
   aaf2e65cac7f aaf2e65cac7f2e1ae729c2fbc849091df9699f96
   "net/mlx5: Fix handling of entry refcount when command
   is not issued to FW" (upstream since 6.1-rc1)

 * But to get aaf2e65cac7f a backport of b898ce7bccf1
   b898ce7bccf13087719c021d829dab607c175246
   "net/mlx5: cmdif, Avoid skipping reclaim pages if FW is
   not accessible" is required on top (upstream since 5.10)

[ Test Plan ]

 * An Ubuntu Server for s390x 20.04 LPAR or z/VM installation
   is needed that has Mellanox cards (RoCE Express 2.1)
   assigned, configured and enabled and that runs a 5.4
   kernel with mlx5 driver.

 * Create some network traffic on one of the RoCE devices
   (interface ens???[d?]) for testing (e.g. with stress-ng).

 * Make sure the module/driver mlx5 is loaded and in use.

 * Trigger a recovery (via the Support Element)
   that will render the adapter (ports) unresponsive
   for a moment and should provoke a similar situation.

 * Alternatively the interface itself can be removed for
   a moment and re-added again (but this may break further
   things on top).

 * Due to the lack of RoCE Express 2.1 hardware,
   the verification is done by IBM.
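
The "module loaded and in use" step of the test plan can be checked from /proc/modules. The following is a minimal sketch, assuming the usual /proc/modules column order (name, size, use count, dependents, state, address); the helper name module_use_count and the sample text are illustrative and not taken from the affected system.

```python
def module_use_count(proc_modules_text, name="mlx5_core"):
    """Return the use count of `name` from /proc/modules text.

    Returns None when the module is not listed (i.e. not loaded).
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Columns: name, size, use count, dependents, state, address
        if len(fields) >= 3 and fields[0] == name:
            return int(fields[2])
    return None


# Illustrative sample only; sizes and addresses are made up.
sample = (
    "mlx5_ib 397312 0 - Live 0x0000000000000000\n"
    "mlx5_core 1179648 1 mlx5_ib, Live 0x0000000000000000\n"
)

print(module_use_count(sample))
```

On a real system the text would come from open("/proc/modules").read(); a use count greater than zero suggests the driver is in use.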

[ Where problems could occur ]

 * The modifications are limited to the Mellanox mlx5 driver
   only - no other network driver is affected.

 * The pre-required commit (b898ce7bccf1) can have a bad
   impact on (re-)claiming pages if FW is not accessible,
   which could cause page leaks if done wrong.
   But this commit is pretty safe since it has been upstream
   since v5.10.

 * The fix itself (aaf2e65cac7f) mainly changes the
   cmd_work_handler and mlx5_cmd_comp_handler functions
   so that they use mlx5_cmd_is_down (introduced by
   b898ce7bccf1) instead of pci_channel_offline.

 * Actually b898ce7bccf1 started changing from
   pci_channel_offline to mlx5_cmd_is_down,
   but it looks like a few cases
   (in the area of refcount increase/decrease) were missed,
   which are now covered by aaf2e65cac7f.

 * On top of that, the fix ensures that refcounts are now
   always properly incremented and decremented to achieve a
   symmetric state for all flows.

 * These changes may have an impact on all cases where the
   mlx5 device is not responding, which can happen in case
   of an offline channel, interface down, reset or recovery.
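
To illustrate why an asymmetric get/put sequence ends in "refcount_t: underflow", here is a toy model in Python. It is only a sketch of the general idea and not the mlx5 driver code: the class CmdEntry, the function run_command and its two flags are hypothetical names, and the exact driver path that skipped the increment is an assumption made for the example.

```python
class RefcountUnderflow(Exception):
    """Raised where the kernel would warn 'refcount_t: underflow'."""


class CmdEntry:
    """Toy stand-in for a command entry guarded by a reference count."""

    def __init__(self):
        self.refcount = 1  # reference held by the issuer

    def get(self):
        self.refcount += 1

    def put(self):
        if self.refcount == 0:
            raise RefcountUnderflow("refcount_t: underflow")
        self.refcount -= 1


def run_command(device_down, symmetric):
    """Run one command and return the final refcount.

    symmetric=False models a flow that skips the extra reference when
    the device is down but still drops it later (hypothetical).
    """
    ent = CmdEntry()
    if symmetric or not device_down:
        ent.get()  # reference for the completion path
    ent.put()      # completion path drops its reference
    ent.put()      # issuer drops its reference
    return ent.refcount


print(run_command(device_down=False, symmetric=False))
print(run_command(device_down=True, symmetric=True))
```

Calling run_command(device_down=True, symmetric=False) raises RefcountUnderflow in this model, because the second put has no matching get.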

[ Other Info ]

 * Looking at the master-next git trees for jammy, kinetic
   and lunar showed that both fixes are already included,
   hence only focal is affected.
__________

---Problem Description---

Kernel panic with "refcount_t: underflow" in kernel log

Contact Information = <email address hidden>, <email address hidden>

---uname output---
5.4.0-128-generic

Machine Type = s390x

---System Hang---
Kernel panic and stack-trace as below

---Debugger---
A debugger is not configured

Stack trace output:
[Sat Apr 8 17:52:21 UTC 2023] Call Trace:
[Sat Apr 8 17:52:21 UTC 2023] ([<0000002a5939a286>] refcount_warn_saturate+0xce/0x140)
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f861e>] cmd_ent_put+0xe6/0xf8 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9b6a>] mlx5_cmd_comp_handler+0x102/0x4f0 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9f8a>] cmd_comp_notifier+0x32/0x48 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fe4fc>] mlx5_eq_async_int+0x13c/0x200 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061318e>] mlx5_irq_int_handler+0x2e/0x48 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e960ce>] zpci_floating_irq_handler+0xe6/0x1b8
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a594f54a6>] do_airq_interrupt+0x96/0x130
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e30e42>] do_IRQ+0x7a/0xb0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a408>] io_int_handler+0x12c/0x294
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e2752e>] enabled_wait+0x46/0xd8
[Sat Apr 8 17:52:21 UTC 2023] ([<0000002a58e2752e>] enabled_wait+0x46/0xd8)
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e278aa>] arch_cpu_idle+0x2a/0x40
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee1536>] do_idle+0xee/0x1b0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee17a6>] cpu_startup_entry+0x36/0x40
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3ab38>] smp_init_secondary+0xc8/0xe8
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3a770>] smp_start_secondary+0x88/0x90
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10
[Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address:
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5939a286>] refcount_warn_saturate+0xce/0x140
[Sat Apr 8 17:52:21 UTC 2023] ---[ end trace 6ec6f9c6f666ca2d ]---
[Sat Apr 8 17:52:21 UTC 2023] specification exception: 0006 ilc:3 [#1] SMP
[Sat Apr 8 17:52:21 UTC 2023] Modules linked in: sysdigcloud_probe(OE) vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache ebtable_broute binfmt_misc nbd veth xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_mangle ip6table_mangle ip6table_nat xt_mark sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo bonding s390_trng
[Sat Apr 8 17:52:21 UTC 2023] vfio_ccw chsc_sch vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe]
[Sat Apr 8 17:52:21 UTC 2023] CPU: 12 PID: 83893 Comm: kworker/u400:91 Kdump: loaded Tainted: G W OE 5.4.0-128-generic #144~18.04.1-Ubuntu
[Sat Apr 8 17:52:21 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR)
[Sat Apr 8 17:52:21 UTC 2023] Workqueue: mlx5e mlx5e_update_stats_work [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] Krnl PSW : 0404d00180000000 0000002a58ec51d8 (queue_work_on+0x30/0x70)
[Sat Apr 8 17:52:21 UTC 2023] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[Sat Apr 8 17:52:21 UTC 2023] Krnl GPRS: 1d721b7c57e8d7f5 0000000000000001 0000000000000200 0000006222a0e800
[Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000000000000000 0000000000000000 000003e016d23d08
[Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000006287800120 0000003b8dbbd740 0700003b8dbbd740
[Sat Apr 8 17:52:21 UTC 2023] 00000062690c6600 000003ff8069c808 000003e016d23ae0 000003e016d23aa8
[Sat Apr 8 17:52:21 UTC 2023] Krnl Code: 0000002a58ec51c6: f0a0a7190001 srp 1817(11,%r10),1,0
                                          0000002a58ec51cc: e3b0f0a00004 lg %r11,160(%r15)
                                         #0000002a58ec51d2: eb11400000e6 laog %r1,%r1,0(%r4)
                                         >0000002a58ec51d8: 07e0 bcr 14,%r0
                                          0000002a58ec51da: a7110001 tmll %r1,1
                                          0000002a58ec51de: a7840016 brc 8,0000002a58ec520a
                                          0000002a58ec51e2: a7280000 lhi %r2,0
                                          0000002a58ec51e6: a7b20300 tmhh %r11,768
[Sat Apr 8 17:52:21 UTC 2023] Call Trace:
[Sat Apr 8 17:52:21 UTC 2023] ([<000003e016d23ae0>] 0x3e016d23ae0)
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fab0a>] cmd_exec+0x44a/0xab0 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fb2b0>] mlx5_cmd_exec+0x40/0x70 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff80657cb0>] mlx5_eswitch_get_vport_stats+0xb0/0x2a0 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff80644602>] mlx5e_rep_update_hw_counters+0x52/0xb8 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061f1ec>] mlx5e_update_stats_work+0x44/0x58 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec56f4>] process_one_work+0x274/0x4d0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec5998>] worker_thread+0x48/0x560
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecd014>] kthread+0x144/0x160
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a094>] ret_from_fork+0x28/0x30
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10
[Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address:
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060
[Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops

Oops output:
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060
[Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops
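
When triaging a trace like the one above, it can help to pull out only the frames that belong to the mlx5_core module. The sketch below assumes the frame layout seen in this log ("[<address>] function+0xoff/0xsize [module]"); the names FRAME_RE and parse_frames are made up for the example, and the sample lines are copied from the trace.

```python
import re

# Matches "[<addr>] func+0xoff/0xsize" with an optional "[module]" suffix.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (function, module) tuples; module is None for core kernel frames."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("func"), match.group("module")))
    return frames


sample = """\
[Sat Apr 8 17:52:21 UTC 2023] ([<0000002a5939a286>] refcount_warn_saturate+0xce/0x140)
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f861e>] cmd_ent_put+0xe6/0xf8 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9b6a>] mlx5_cmd_comp_handler+0x102/0x4f0 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0
"""

frames = parse_frames(sample)
mlx5_frames = [func for func, module in frames if module == "mlx5_core"]
print(mlx5_frames)
```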

------------

[Michael]

I had a look into the dump from wdc3-qz1-sr2-rk086-s05:

crash> sys

The system was up and running since:

UPTIME: 282 days, 02:16:10

There are a lot of martian source messages again like:

[Sun Apr 16 11:09:28 UTC 2023] IPv4: martian source 11.44.203.141 from 11.21.133.2, on dev ipsec0
[Sun Apr 16 11:09:28 UTC 2023] ll header: 00000000: ff ff ff ff ff ff fe ff 0b 15 85 02 08 06

I hope that we get them suppressed soon.

Then at the following time a first issue can be observed: NFS timeout

[Sun Apr 16 11:09:39 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out

The reason could be

a) the server
b) the network
c) the local network adapter

Then about 1:05 hour later the first mlx5 related issues are reported

[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.2 p0v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.3 p0v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.6 p0v4: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.2 p1v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.3 p1v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
...
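
The return code -5 in these messages corresponds to EIO. A small sketch for counting such failures per device and decoding the code follows; the regular expression is written against the message format quoted above, and the names PTYS_RE, count_ptys_failures and errno_name are illustrative.

```python
import errno
import re
from collections import Counter

PTYS_RE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-f:.]+) (?P<ifname>\S+): "
    r"mlx5e_ethtool_get_link_ksettings: query port ptys failed: (?P<rc>-\d+)"
)


def count_ptys_failures(log_text):
    """Count 'query port ptys failed' messages per (PCI address, interface)."""
    counts = Counter()
    for match in PTYS_RE.finditer(log_text):
        counts[(match.group("pci"), match.group("ifname"))] += 1
    return counts


def errno_name(rc):
    """Translate a negative kernel return code such as -5 into its errno name."""
    return errno.errorcode[-rc]


sample = """\
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
"""

counts = count_ptys_failures(sample)
print(counts)
print(errno_name(-5))
```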

Then about 15 minutes later the NFS code performs a panic_on_oops
...
[Sun Apr 16 12:32:34 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out
[Sun Apr 16 12:34:10 UTC 2023] Unable to handle kernel pointer dereference in virtual kernel address space
[Sun Apr 16 12:34:10 UTC 2023] Failing address: 0000809f00008000 TEID: 0000809f00008803
[Sun Apr 16 12:34:10 UTC 2023] Fault in home space mode while using kernel ASCE.
[Sun Apr 16 12:34:10 UTC 2023] AS:00000047431f4007 R3:0000000000000024
[Sun Apr 16 12:34:10 UTC 2023] Oops: 0038 ilc:3 [#1] SMP
[Sun Apr 16 12:34:10 UTC 2023] Modules linked in: sysdigcloud_probe(OE) binfmt_misc nbd vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_mangle ip6table_nat ebt_redirect ebt_ip ebtable_broute sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact iptable_mangle xt_mark veth sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo
[Sun Apr 16 12:34:10 UTC 2023] s390_trng vfio_ccw vfio_mdev chsc_sch mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe]
[Sun Apr 16 12:34:10 UTC 2023] CPU: 4 PID: 32942 Comm: kubelet Kdump: loaded Tainted: G W OE 5.4.0-110-generic #124~18.04.1+hf334332v20220521b1-Ubuntu
[Sun Apr 16 12:34:10 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR)
[Sun Apr 16 12:34:10 UTC 2023] Krnl PSW : 0704f00180000000 000003ff8076304a (call_bind+0x3a/0xf8 [sunrpc])
[Sun Apr 16 12:34:10 UTC 2023] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[Sun Apr 16 12:34:10 UTC 2023] Krnl GPRS: 00000000000001dc 0000005d16d22400 00000041b9826500 000003e008637ad8
[Sun Apr 16 12:34:10 UTC 2023] 000003ff807794d6 0000004742e35898 0000000000000000 00000041b9826537
[Sun Apr 16 12:34:10 UTC 2023] 000003ff807ae63c 000003ff80763010 0000809f0000809f 00000041b9826500
[Sun Apr 16 12:34:10 UTC 2023] 00000015a0c80000 000003ff807a1d80 000003e008637a80 000003e008637a48
[Sun Apr 16 12:34:10 UTC 2023] Krnl Code: 000003ff8076303a: a7840041 brc 8,000003ff807630bc
                                          000003ff8076303e: e31020c00004 lg %r1,192(%r2)
                                         #000003ff80763044: e3a010000004 lg %r10,0(%r1)
                                         >000003ff8076304a: e310a4070090 llgc %r1,1031(%r10)
                                          000003ff80763050: a7110010 tmll %r1,16
                                          000003ff80763054: a7740025 brc 7,000003ff8076309e
                                          000003ff80763058: c418ffffe7d8 lgrl %r1,000003ff80760008
                                          000003ff8076305e: 91021003 tm 3(%r1),2
[Sun Apr 16 12:34:10 UTC 2023] Call Trace:
[Sun Apr 16 12:34:10 UTC 2023] ([<0000000000000000>] 0x0)
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779454>] __rpc_execute+0x8c/0x488 [sunrpc]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779df2>] rpc_execute+0x8a/0x128 [sunrpc]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766d62>] rpc_run_task+0x132/0x180 [sunrpc]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766e00>] rpc_call_sync+0x50/0xa0 [sunrpc]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80360e40>] nfs3_rpc_wrapper.constprop.12+0x48/0xe0 [nfsv3]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80361c5e>] nfs3_proc_getattr+0x6e/0xc8 [nfsv3]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaeaa8>] __nfs_revalidate_inode+0x158/0x3b0 [nfs]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaef9c>] nfs_getattr+0x1bc/0x388 [nfs]
[Sun Apr 16 12:34:10 UTC 2023] [<0000004742161032>] vfs_statx+0xaa/0xf8
[Sun Apr 16 12:34:10 UTC 2023] [<0000004742161798>] __do_sys_newstat+0x38/0x60
[Sun Apr 16 12:34:10 UTC 2023] [<000000474277e802>] system_call+0x2a6/0x2c8
[Sun Apr 16 12:34:10 UTC 2023] Last Breaking-Event-Address:
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779452>] __rpc_execute+0x8a/0x488 [sunrpc]
[Sun Apr 16 12:34:10 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops
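
The intervals between the events quoted in this log can be computed directly from the timestamps. This is a minimal sketch using the three timestamps that appear above; the helper parse_ts and the variable names are illustrative, and the format string assumes an English/C locale for the day and month names.

```python
from datetime import datetime


def parse_ts(stamp):
    """Parse a timestamp as printed in the log, e.g. 'Sun Apr 16 11:09:39 UTC 2023'."""
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S UTC %Y")


nfs_timeout = parse_ts("Sun Apr 16 11:09:39 UTC 2023")
first_mlx5_error = parse_ts("Sun Apr 16 12:15:50 UTC 2023")
oops = parse_ts("Sun Apr 16 12:34:10 UTC 2023")

# Time from the first NFS timeout to the first mlx5 error
print(first_mlx5_error - nfs_timeout)
# Time from the first mlx5 error to the oops
print(oops - first_mlx5_error)
```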

The network interfaces p0 and p1 are missing:

crash> net | grep -P "p0 |p1 "
   5b726fa000 macvtap0

It looks like the p0/p1 network interfaces have been lost, but no recovery was attempted. There are no related recovery messages from the mlx5 kernel module. The kernel finally dumps in the area of the NFS/RPC code.

That would be the related upstream commit:

aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW

----
[Niklas]
I agree that commit does sound like it could be the fix for exactly this issue. I checked the kernel tree at the tag Ubuntu-5.4.0-128.144 and that does not appear to have this fix. If I read things correctly this is again an issue that may occur during a recovery when the PCI device is isolated and thus doesn't respond. So it likely won't help with not losing the interface but it does sound like it could solve the kernel crash/refcount warning.

====================================================================================================
Summary:

Looks like this patch (aaf2e65cac7f) is missing in 20.04 and could be the reason for the crash.
We would like to backport this to 20.04, 20.04 HWE, 22.04 and 22.04 HWE.

aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW
https://<email address hidden>/
====================================================================================================

bugproxy (bugproxy)
tags: added: architecture-s3903164 bugnameltc-202279 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → linux (Ubuntu)
Marcelo Cerri (mhcerri) wrote :
Changed in linux (Ubuntu Focal):
status: New → In Progress
Marcelo Cerri (mhcerri) wrote :
Marcelo Cerri (mhcerri) wrote :

The change requires the backport of one additional patch (both are provided above).

We created a test kernel with those changes for validation and you can find the debian packages at https://people.canonical.com/~mhcerri/lp2019011/s390x_debs.tgz

Please let us know if the test kernel works as expected. Thank you!

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2023-05-10 15:31 EDT-------
Hi,

Thank you very much for the quick support.
Do this kernel (5.4.0.149) and the attached package correspond to 20.04 or 20.04 HWE?

Pedro Principeza (pprincipeza) wrote :

Hi, Vineeth.

These packages are for the 20.04 LTS Kernel.

BR,
pprincipeza

tags: added: patch
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2023-05-11 03:40 EDT-------
Thank you.
Would it be possible to get the same backported for 22.04 HWE as well?

Regards.
Vineeth

Pedro Principeza (pprincipeza) wrote :

Hi, Vineeth.

The patch in hand is included in the HWE version of the Focal Kernel and in the LTS version of the Jammy Kernel. Both are 5.15, FWIW, and the fix has a different id there:

f0f894f0f636 net/mlx5: Fix handling of entry refcount when command is not issued to FW

The Focal LTS Kernel is the only one that needs the backport. Let us know how testing goes at your end.

BR,
pprincipeza
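
Since a backported fix carries a different commit ID in each stable tree, one way to check whether a tree contains it is to match on the commit subject instead of the ID. The following is a sketch under that assumption: it expects text produced by git log --format='%h %s', the function name find_by_subject is hypothetical, and the sample log is assembled from IDs and subjects quoted in this report rather than from a real tree.

```python
def find_by_subject(git_log_text, subject):
    """Return abbreviated IDs of commits whose subject matches exactly.

    Expects one commit per line in the form '<abbrev-id> <subject>',
    as printed by: git log --format='%h %s'
    """
    hits = []
    for line in git_log_text.splitlines():
        commit_id, _, line_subject = line.partition(" ")
        if line_subject.strip() == subject:
            hits.append(commit_id)
    return hits


FIX_SUBJECT = (
    "net/mlx5: Fix handling of entry refcount when command is not issued to FW"
)

# Illustrative log text built from commits named in this report.
sample_log = (
    "f0f894f0f636 net/mlx5: Fix handling of entry refcount when command is not issued to FW\n"
    "b898ce7bccf1 net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible\n"
)

print(find_by_subject(sample_log, FIX_SUBJECT))
```

An empty result for a given tree would suggest that the fix still needs to be backported there.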

bugproxy (bugproxy)
tags: added: targetmilestone-inin2004
removed: targetmilestone-inin---
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2023-06-27 06:45 EDT-------
The cloud team did some testing with the fixed Focal -gt version.
The problem did not appear anymore, therefore I think we can close this bugzilla / LP item.
Thanks to everybody for your work.

==> Changing the status to: CLOSED

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2023-06-27 07:04 EDT-------
Sorry for the premature closing of this bug.
Reopening this item as the fix needs to be released in Focal LTS by Canonical first, before we can close.

Frank Heimes (fheimes) wrote :

Just double checked the potential affected releases.
The fix(es) is(are) incl. in lunar, kinetic and jammy - so the only affected release is indeed focal.

Changed in linux (Ubuntu):
status: New → Fix Released
Changed in ubuntu-z-systems:
status: New → In Progress
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Frank Heimes (fheimes)
description: updated
Frank Heimes (fheimes) wrote :

Submission to the kernel team mailing list was done:
https://lists.ubuntu.com/archives/kernel-team/2023-June/thread.html#140723

Changed in linux (Ubuntu Focal):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in ubuntu-z-systems:
importance: Undecided → High
Frank Heimes (fheimes)
description: updated
Marcelo Cerri (mhcerri) wrote :

Hi, Boris.

Just to confirm, did you manage to validate the 5.4 generic test kernel? This fix is intended for the 5.4 generic kernel in focal and in bionic (via the generic HWE kernel).

Thank you!

Frank Heimes (fheimes) wrote :

Test build (on slightly newer focal kernel):
https://launchpad.net/~fheimes/+archive/ubuntu/lp2019011

Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.4.0-155.172 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux verification-needed-focal
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2023-08-10 04:01 EDT-------
The patched kernel has been running for quite a while on our systems, so far without showing the reported issue again.
With that we can declare the verification as done.

Thanks everyone for all your work!

Frank Heimes (fheimes) wrote :

Thanks for the update - adjusting the tags accordingly ...

tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-156.173

---------------
linux (5.4.0-156.173) focal; urgency=medium

  * focal/linux: 5.4.0-156.173 -proposed tracker (LP: #2026585)

  * CVE-2023-3390
    - netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE

  * Focal update: v5.4.241 upstream stable release (LP: #2023930)
    - scsi: ses: Handle enclosure with just a primary component gracefully
    - x86/PCI: Add quirk for AMD XHCI controller that loses MSI-X state in D3hot
    - cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach()
    - treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD()
    - smb3: fix problem with null cifs super block with previous patch
    - pinctrl: amd: Use irqchip template
    - pinctrl: amd: disable and mask interrupts on probe
    - pinctrl: amd: Disable and mask interrupts on resume
    - pwm: cros-ec: Explicitly set .polarity in .get_state()
    - pwm: sprd: Explicitly set .polarity in .get_state()
    - wifi: mac80211: fix invalid drv_sta_pre_rcu_remove calls for non-uploaded
      sta
    - icmp: guard against too small mtu
    - net: don't let netpoll invoke NAPI if in xmit context
    - sctp: check send stream number after wait_for_sndbuf
    - ipv6: Fix an uninit variable access bug in __ip6_make_skb()
    - gpio: davinci: Add irq chip flag to skip set wake
    - sunrpc: only free unix grouplist after RCU settles
    - NFSD: callback request does not use correct credential for AUTH_SYS
    - xhci: also avoid the XHCI_ZERO_64B_REGS quirk with a passthrough iommu
    - USB: serial: cp210x: add Silicon Labs IFS-USB-DATACABLE IDs
    - usb: typec: altmodes/displayport: Fix configure initial pin assignment
    - USB: serial: option: add Telit FE990 compositions
    - USB: serial: option: add Quectel RM500U-CN modem
    - iio: adc: ti-ads7950: Set `can_sleep` flag for GPIO chip
    - iio: dac: cio-dac: Fix max DAC write value check for 12-bit
    - tty: serial: sh-sci: Fix transmit end interrupt handler
    - tty: serial: sh-sci: Fix Rx on RZ/G2L SCI
    - tty: serial: fsl_lpuart: avoid checking for transfer complete when
      UARTCTRL_SBK is asserted in lpuart32_tx_empty
    - nilfs2: fix potential UAF of struct nilfs_sc_info in nilfs_segctor_thread()
    - nilfs2: fix sysfs interface lifetime
    - ALSA: hda/realtek: Add quirk for Clevo X370SNW
    - perf/core: Fix the same task check in perf_event_set_output
    - ftrace: Mark get_lock_parent_ip() __always_inline
    - can: j1939: j1939_tp_tx_dat_new(): fix out-of-bounds memory access
    - tracing: Free error logs of tracing instances
    - net_sched: prevent NULL dereference if default qdisc setup failed
    - drm/panfrost: Fix the panfrost_mmu_map_fault_addr() error path
    - ring-buffer: Fix race while reader and writer are on the same page
    - mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()
    - irqdomain: Look for existing mapping only once
    - irqdomain: Refactor __irq_domain_alloc_irqs()
    - irqdomain: Fix mapping-creation race
    - Revert "pinctrl: amd: Disable and mask interrupts on resume"
    - ALSA: emu10k1: fix capture interrupt handler unlinking
    - ALSA: hd...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-ibm/5.4.0-1055.60 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-ibm' to 'verification-done-focal-linux-ibm'. If the problem still exists, change the tag 'verification-needed-focal-linux-ibm' to 'verification-failed-focal-linux-ibm'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-ibm-v2 verification-needed-focal-linux-ibm
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.4.0-1114.120 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-azure' to 'verification-done-focal-linux-azure'. If the problem still exists, change the tag 'verification-needed-focal-linux-azure' to 'verification-failed-focal-linux-azure'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-azure-v2 verification-needed-focal-linux-azure
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-bluefield/5.4.0-1069.75 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-bluefield' to 'verification-done-focal-linux-bluefield'. If the problem still exists, change the tag 'verification-needed-focal-linux-bluefield' to 'verification-failed-focal-linux-bluefield'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-focal-linux-bluefield-v2 verification-needed-focal-linux-bluefield
bugproxy (bugproxy)
tags: removed: verification-needed-focal-linux-bluefield
tags: added: verification-needed-focal-linux-bluefield