[UBUNTU 20.04] [HPS] Kernel panic with "refcount_t: underflow" in mlx5 driver

Bug #2019011 reported by bugproxy on 2023-05-09

	Status	Importance	Assigned to
Ubuntu on IBM z Systems	Fix Released	High	Skipper Bug Screeners
linux (Ubuntu)	Fix Released	High	Skipper Bug Screeners
Focal	Fix Released	High	Canonical Kernel Team

Bug Description

SRU Justification:
==================

[ Impact ]

* The mlx5 driver is causing a Kernel panic with
"refcount_t: underflow".

* This issue occurs during a recovery when the PCI device
is isolated and thus doesn't respond.

[ Fix ]

* This issue got solved upstream with
   aaf2e65cac7f aaf2e65cac7f2e1ae729c2fbc849091df9699f96
   "net/mlx5: Fix handling of entry refcount when command
   is not issued to FW" (upstream since 6.1-rc1)

* But to get aaf2e65cac7f a backport of b898ce7bccf1
   b898ce7bccf13087719c021d829dab607c175246
   "net/mlx5: cmdif, Avoid skipping reclaim pages if FW is
   not accessible" is required on top (upstream since 5.10)

[ Test Plan ]

* An Ubuntu Server for s390x 20.04 LPAR or z/VM installation
   is needed that has Mellanox cards (RoCE Express 2.1)
   assigned, configured and enabled and that runs a 5.4
   kernel with mlx5 driver.

* Create some network traffic on (one of) the RoCE device(s)
(interface ens???[d?]) for testing (e.g. with stress-ng).

* Make sure the module/driver mlx5 is loaded and in use.

* Trigger a recovery (via the Support Element)
that will render the adapter (ports) unresponsive
for a moment and should provoke a similar situation.

* Alternatively the interface itself can be removed for
a moment and re-added again (but this may break further
things on top).

* Due to the lack of RoCE Express 2.1 hardware,
the verification is on IBM.
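
The "module loaded and in use" step can be checked from /proc/modules. The following is a minimal Python sketch, assuming the usual /proc/modules line layout (name, size, use count, users, state, address). The helper name module_use_count is illustrative and not part of any tool mentioned in this bug; the core driver module is listed as mlx5_core.

```python
from typing import Optional


def module_use_count(proc_modules_text: str, name: str = "mlx5_core") -> Optional[int]:
    """Return the use count of `name` from /proc/modules text, or None if not loaded."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # Assumed layout: <name> <size> <use count> <users> <state> <address>
        if len(fields) >= 3 and fields[0] == name:
            return int(fields[2])
    return None
```

On the test system this could be called as module_use_count(open("/proc/modules").read()). A result of None means the module is not loaded; a count above zero suggests that something (for example mlx5_ib) is using it.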

[ Where problems could occur ]

* The modifications are limited to the Mellanox mlx5 driver
only - no other network driver is affected.

* The pre-required commit (b898ce7bccf1) can have a bad
   impact on (re-)claiming pages if FW is not accessible,
   which could cause page leaks if done wrong.
   But this commit is pretty safe since it has been upstream
   since v5.10.

* The fix itself (aaf2e65cac7f) mainly changes the
   cmd_work_handler and mlx5_cmd_comp_handler functions
   so that mlx5_cmd_is_down (introduced by b898ce7bccf1)
   is used instead of pci_channel_offline.

* Actually b898ce7bccf1 started the change from
   pci_channel_offline to mlx5_cmd_is_down,
   but it looks like a few cases
   (in the area of refcount increase/decrease) were missed,
   which are now covered by aaf2e65cac7f.

* On top of that, it ensures that refcounts are now always
properly incremented and decremented to achieve a symmetric
state for all flows.

* These changes may have an impact on all cases where the
mlx5 device is not responding, which can happen in case
of an offline channel, interface down, reset or recovery.
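
The refcount symmetry argument can be illustrated with a toy model. The Python sketch below is not the driver code: CmdEntry and run_command are hypothetical names, and which paths take and drop references is a simplifying assumption. It only shows why skipping the increment while the device is down, but still dropping a reference during forced completion, ends in an underflow.

```python
class CmdEntry:
    """Toy stand-in for a command entry with a reference count."""

    def __init__(self) -> None:
        self.refcount = 1  # reference held by the submitter

    def get(self) -> None:
        self.refcount += 1

    def put(self) -> None:
        if self.refcount == 0:
            raise RuntimeError("refcount underflow")
        self.refcount -= 1


def run_command(device_down: bool, symmetric: bool) -> int:
    """Simulate one command and return the final refcount (0 means balanced)."""
    ent = CmdEntry()
    # Asymmetric variant: no reference is taken when the device is down.
    if symmetric or not device_down:
        ent.get()  # reference for the completion path
    ent.put()  # completion, normal or forced, drops one reference
    ent.put()  # submitter drops its own reference
    return ent.refcount
```

In this model run_command(device_down=True, symmetric=False) raises the underflow error, while the symmetric variant ends at zero whether or not the device responds.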

[ Other Info ]

* Looking at the master-next git trees for jammy, kinetic
and lunar showed that both fixes are already included,
hence only focal is affected.
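
Since backported commits usually carry a different commit ID than upstream, checking a tree by subject line is more robust than checking by ID. A minimal Python sketch, assuming the input is the text printed by `git log --oneline` for the tree in question; the helper name find_by_subject is illustrative.

```python
from typing import Dict, List, Sequence

SUBJECTS = (
    "net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible",
    "net/mlx5: Fix handling of entry refcount when command is not issued to FW",
)


def find_by_subject(oneline_log: str, subjects: Sequence[str] = SUBJECTS) -> Dict[str, List[str]]:
    """Map each subject to the abbreviated IDs of the commits that carry it."""
    found: Dict[str, List[str]] = {subject: [] for subject in subjects}
    for line in oneline_log.splitlines():
        # `git log --oneline` prints "<abbreviated id> <subject>" per commit.
        commit_id, _, subject = line.strip().partition(" ")
        if subject in found:
            found[subject].append(commit_id)
    return found
```

An empty list for a subject means the tree that produced the log does not contain that change.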
__________

---Problem Description---

Kernel panic with "refcount_t: underflow" in kernel log

Contact Information = <email address hidden>, <email address hidden>

---uname output---
5.4.0-128-generic

Machine Type = s390x

---System Hang---
Kernel panic and stack-trace as below

---Debugger---
A debugger is not configured

Stack trace output:
[Sat Apr 8 17:52:21 UTC 2023] Call Trace:
[Sat Apr 8 17:52:21 UTC 2023] ([<0000002a5939a286>] refcount_warn_saturate+0xce/0x140)
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f861e>] cmd_ent_put+0xe6/0xf8 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9b6a>] mlx5_cmd_comp_handler+0x102/0x4f0 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9f8a>] cmd_comp_notifier+0x32/0x48 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fe4fc>] mlx5_eq_async_int+0x13c/0x200 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061318e>] mlx5_irq_int_handler+0x2e/0x48 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e960ce>] zpci_floating_irq_handler+0xe6/0x1b8
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a594f54a6>] do_airq_interrupt+0x96/0x130
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e30e42>] do_IRQ+0x7a/0xb0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a408>] io_int_handler+0x12c/0x294
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e2752e>] enabled_wait+0x46/0xd8
[Sat Apr 8 17:52:21 UTC 2023] ([<0000002a58e2752e>] enabled_wait+0x46/0xd8)
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e278aa>] arch_cpu_idle+0x2a/0x40
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee1536>] do_idle+0xee/0x1b0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee17a6>] cpu_startup_entry+0x36/0x40
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3ab38>] smp_init_secondary+0xc8/0xe8
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3a770>] smp_start_secondary+0x88/0x90
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10
[Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address:
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5939a286>] refcount_warn_saturate+0xce/0x140
[Sat Apr 8 17:52:21 UTC 2023] ---[ end trace 6ec6f9c6f666ca2d ]---
[Sat Apr 8 17:52:21 UTC 2023] specification exception: 0006 ilc:3 [#1] SMP
[Sat Apr 8 17:52:21 UTC 2023] Modules linked in: sysdigcloud_probe(OE) vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache ebtable_broute binfmt_misc nbd veth xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_mangle ip6table_mangle ip6table_nat xt_mark sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo bonding s390_trng
[Sat Apr 8 17:52:21 UTC 2023] vfio_ccw chsc_sch vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe]
[Sat Apr 8 17:52:21 UTC 2023] CPU: 12 PID: 83893 Comm: kworker/u400:91 Kdump: loaded Tainted: G W OE 5.4.0-128-generic #144~18.04.1-Ubuntu
[Sat Apr 8 17:52:21 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR)
[Sat Apr 8 17:52:21 UTC 2023] Workqueue: mlx5e mlx5e_update_stats_work [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] Krnl PSW : 0404d00180000000 0000002a58ec51d8 (queue_work_on+0x30/0x70)
[Sat Apr 8 17:52:21 UTC 2023] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[Sat Apr 8 17:52:21 UTC 2023] Krnl GPRS: 1d721b7c57e8d7f5 0000000000000001 0000000000000200 0000006222a0e800
[Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000000000000000 0000000000000000 000003e016d23d08
[Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000006287800120 0000003b8dbbd740 0700003b8dbbd740
[Sat Apr 8 17:52:21 UTC 2023] 00000062690c6600 000003ff8069c808 000003e016d23ae0 000003e016d23aa8
[Sat Apr 8 17:52:21 UTC 2023] Krnl Code: 0000002a58ec51c6: f0a0a7190001 srp 1817(11,%r10),1,0
                                          0000002a58ec51cc: e3b0f0a00004 lg %r11,160(%r15)
                                         #0000002a58ec51d2: eb11400000e6 laog %r1,%r1,0(%r4)
                                         >0000002a58ec51d8: 07e0 bcr 14,%r0
                                          0000002a58ec51da: a7110001 tmll %r1,1
                                          0000002a58ec51de: a7840016 brc 8,0000002a58ec520a
                                          0000002a58ec51e2: a7280000 lhi %r2,0
                                          0000002a58ec51e6: a7b20300 tmhh %r11,768
[Sat Apr 8 17:52:21 UTC 2023] Call Trace:
[Sat Apr 8 17:52:21 UTC 2023] ([<000003e016d23ae0>] 0x3e016d23ae0)
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fab0a>] cmd_exec+0x44a/0xab0 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fb2b0>] mlx5_cmd_exec+0x40/0x70 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff80657cb0>] mlx5_eswitch_get_vport_stats+0xb0/0x2a0 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff80644602>] mlx5e_rep_update_hw_counters+0x52/0xb8 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061f1ec>] mlx5e_update_stats_work+0x44/0x58 [mlx5_core]
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec56f4>] process_one_work+0x274/0x4d0
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec5998>] worker_thread+0x48/0x560
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecd014>] kthread+0x144/0x160
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a094>] ret_from_fork+0x28/0x30
[Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10
[Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address:
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060
[Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops

Oops output:
[Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060
[Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops
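
To see at a glance which modules appear in a call trace like the ones above, the frames can be extracted with a regular expression. A minimal Python sketch, assuming the frame layout shown in this log ("[<address>] symbol+0xoff/0xsize [module]"); Frame and parse_frames are illustrative names.

```python
import re
from typing import List, NamedTuple, Optional


class Frame(NamedTuple):
    address: str
    symbol: str
    module: Optional[str]  # None when no [module] suffix is printed


FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def parse_frames(log_text: str) -> List[Frame]:
    """Return one Frame per call-trace line; lines without a symbol are skipped."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(Frame(match["addr"], match["sym"], match["mod"]))
    return frames
```

Filtering the result for module == "mlx5_core" leaves the driver frames, for example cmd_ent_put and mlx5_cmd_comp_handler in the first trace.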

------------

[Michael]

I had a look into the dump from wdc3-qz1-sr2-rk086-s05:

crash> sys

The system was up and running since:

UPTIME: 282 days, 02:16:10

There are a lot of martian source messages again like:

[Sun Apr 16 11:09:28 UTC 2023] IPv4: martian source 11.44.203.141 from 11.21.133.2, on dev ipsec0
[Sun Apr 16 11:09:28 UTC 2023] ll header: 00000000: ff ff ff ff ff ff fe ff 0b 15 85 02 08 06

I hope that we get them suppressed soon.

Then at the following time a first issue can be observed: NFS timeout

[Sun Apr 16 11:09:39 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out

The reason could be

a) the server
b) the network
c) the local network adapter

Then about 1 hour 5 minutes later the first mlx5-related issues are reported:

[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.2 p0v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.3 p0v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.6 p0v4: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.2 p1v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
[Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.3 p1v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
...
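
The trailing -5 in these messages reads as a negated errno value, assuming the driver follows the usual kernel convention for return codes; errno 5 is EIO (I/O error). A small Python sketch for decoding such codes; describe_kernel_errno is an illustrative helper name.

```python
import errno
import os


def describe_kernel_errno(ret: int) -> str:
    """Turn a negative kernel-style return code such as -5 into 'NAME (text)'."""
    code = -ret
    name = errno.errorcode.get(code, "unknown")
    # os.strerror gives the platform's own description of the code.
    return f"{name} ({os.strerror(code)})"
```

An EIO from a port query is consistent with an adapter that no longer answers commands.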

Then about 15 minutes later the NFS code performs a panic_on_oops
...
[Sun Apr 16 12:32:34 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out
[Sun Apr 16 12:34:10 UTC 2023] Unable to handle kernel pointer dereference in virtual kernel address space
[Sun Apr 16 12:34:10 UTC 2023] Failing address: 0000809f00008000 TEID: 0000809f00008803
[Sun Apr 16 12:34:10 UTC 2023] Fault in home space mode while using kernel ASCE.
[Sun Apr 16 12:34:10 UTC 2023] AS:00000047431f4007 R3:0000000000000024
[Sun Apr 16 12:34:10 UTC 2023] Oops: 0038 ilc:3 [#1] SMP
[Sun Apr 16 12:34:10 UTC 2023] Modules linked in: sysdigcloud_probe(OE) binfmt_misc nbd vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_mangle ip6table_nat ebt_redirect ebt_ip ebtable_broute sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact iptable_mangle xt_mark veth sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo
[Sun Apr 16 12:34:10 UTC 2023] s390_trng vfio_ccw vfio_mdev chsc_sch mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe]
[Sun Apr 16 12:34:10 UTC 2023] CPU: 4 PID: 32942 Comm: kubelet Kdump: loaded Tainted: G W OE 5.4.0-110-generic #124~18.04.1+hf334332v20220521b1-Ubuntu
[Sun Apr 16 12:34:10 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR)
[Sun Apr 16 12:34:10 UTC 2023] Krnl PSW : 0704f00180000000 000003ff8076304a (call_bind+0x3a/0xf8 [sunrpc])
[Sun Apr 16 12:34:10 UTC 2023] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[Sun Apr 16 12:34:10 UTC 2023] Krnl GPRS: 00000000000001dc 0000005d16d22400 00000041b9826500 000003e008637ad8
[Sun Apr 16 12:34:10 UTC 2023] 000003ff807794d6 0000004742e35898 0000000000000000 00000041b9826537
[Sun Apr 16 12:34:10 UTC 2023] 000003ff807ae63c 000003ff80763010 0000809f0000809f 00000041b9826500
[Sun Apr 16 12:34:10 UTC 2023] 00000015a0c80000 000003ff807a1d80 000003e008637a80 000003e008637a48
[Sun Apr 16 12:34:10 UTC 2023] Krnl Code: 000003ff8076303a: a7840041 brc 8,000003ff807630bc
                                          000003ff8076303e: e31020c00004 lg %r1,192(%r2)
                                         #000003ff80763044: e3a010000004 lg %r10,0(%r1)
                                         >000003ff8076304a: e310a4070090 llgc %r1,1031(%r10)
                                          000003ff80763050: a7110010 tmll %r1,16
                                          000003ff80763054: a7740025 brc 7,000003ff8076309e
                                          000003ff80763058: c418ffffe7d8 lgrl %r1,000003ff80760008
                                          000003ff8076305e: 91021003 tm 3(%r1),2
[Sun Apr 16 12:34:10 UTC 2023] Call Trace:
[Sun Apr 16 12:34:10 UTC 2023] ([<0000000000000000>] 0x0)
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779454>] __rpc_execute+0x8c/0x488 [sunrpc]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779df2>] rpc_execute+0x8a/0x128 [sunrpc]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766d62>] rpc_run_task+0x132/0x180 [sunrpc]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766e00>] rpc_call_sync+0x50/0xa0 [sunrpc]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80360e40>] nfs3_rpc_wrapper.constprop.12+0x48/0xe0 [nfsv3]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80361c5e>] nfs3_proc_getattr+0x6e/0xc8 [nfsv3]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaeaa8>] __nfs_revalidate_inode+0x158/0x3b0 [nfs]
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaef9c>] nfs_getattr+0x1bc/0x388 [nfs]
[Sun Apr 16 12:34:10 UTC 2023] [<0000004742161032>] vfs_statx+0xaa/0xf8
[Sun Apr 16 12:34:10 UTC 2023] [<0000004742161798>] __do_sys_newstat+0x38/0x60
[Sun Apr 16 12:34:10 UTC 2023] [<000000474277e802>] system_call+0x2a6/0x2c8
[Sun Apr 16 12:34:10 UTC 2023] Last Breaking-Event-Address:
[Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779452>] __rpc_execute+0x8a/0x488 [sunrpc]
[Sun Apr 16 12:34:10 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops
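
The intervals between the events in this log can be recomputed from the bracketed stamps. A minimal Python sketch, assuming every line starts with a stamp in the form shown above and that English day and month names are in effect for parsing; parse_stamp and elapsed are illustrative names.

```python
from datetime import datetime, timedelta

# Matches stamps such as "Sun Apr 16 11:09:39 UTC 2023" (English names assumed).
STAMP_FORMAT = "%a %b %d %H:%M:%S UTC %Y"


def parse_stamp(line: str) -> datetime:
    """Parse the leading '[...]' stamp of a log line."""
    stamp = line[line.index("[") + 1 : line.index("]")]
    return datetime.strptime(stamp, STAMP_FORMAT)


def elapsed(first_line: str, second_line: str) -> timedelta:
    """Time between the stamps of two log lines."""
    return parse_stamp(second_line) - parse_stamp(first_line)
```

For the lines quoted here, the gap from the first NFS timeout (11:09:39) to the first mlx5 error (12:15:50) comes out as 1:06:11, and from there to the oops (12:34:10) as 0:18:20.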

The network interfaces p0 and p1 are missing:

crash> net | grep -P "p0 |p1 "
5b726fa000 macvtap0

It looks like the p0/p1 network interfaces were lost but no recovery was attempted. There are no related recovery messages from the mlx5 kernel module. The kernel finally dumps in the area of the NFS/RPC code.

That would be the related upstream commit:

aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW

----
[Niklas]
I agree that commit does sound like it could be the fix for exactly this issue. I checked the kernel tree at the tag Ubuntu-5.4.0-128.144 and that does not appear to have this fix. If I read things correctly this is again an issue that may occur during a recovery when the PCI device is isolated and thus doesn't respond. So it likely won't help with not losing the interface but it does sound like it could solve the kernel crash/refcount warning.

====================================================================================================
Summary:

Looks like this patch (aaf2e65cac7f) is missing in 20.04 and could be the reason for the crash.
We would like to backport this to 20.04, 20.04 HWE, 22.04 and 22.04 HWE.

aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW
https://<email address hidden>/
====================================================================================================


bugproxy (bugproxy) on 2023-05-09

tags:	added: architecture-s3903164 bugnameltc-202279 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee:	nobody → Skipper Bug Screeners (skipper-screen-team)
affects:	ubuntu → linux (Ubuntu)

Marcelo Cerri (mhcerri) wrote on 2023-05-10:

0001-net-mlx5-cmdif-Avoid-skipping-reclaim-pages-if-FW-is.patch (4.5 KiB, text/plain)

Changed in linux (Ubuntu Focal):
status:	New → In Progress

Marcelo Cerri (mhcerri) wrote on 2023-05-10:

0002-net-mlx5-Fix-handling-of-entry-refcount-when-command.patch (2.5 KiB, text/plain)

Marcelo Cerri (mhcerri) wrote on 2023-05-10:

The change requires the backport of one additional patch (both are provided above).

We created a test kernel with those changes for validation and you can find the debian packages at https://people.canonical.com/~mhcerri/lp2019011/s390x_debs.tgz

Please let us know if the test kernel works as expected. Thank you!

bugproxy (bugproxy) wrote on 2023-05-10: Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2023-05-10 15:31 EDT-------
Hi,

Thank you very much for the quick support.
Do this kernel (5.4.0.149) and the attached package correspond to 20.04 or 20.04 HWE?

Pedro Principeza (pprincipeza) wrote on 2023-05-10:

Hi, Vineeth.

These packages are for the 20.04 LTS Kernel.

BR,
pprincipeza

Ubuntu Foundations Team Bug Bot (crichton) on 2023-05-10

tags:

added: patch

bugproxy (bugproxy) wrote on 2023-05-11:

------- Comment From <email address hidden> 2023-05-11 03:40 EDT-------
Thank you.
Would it be possible to get the same backported for 22.04 HWE as well?

Regards.
Vineeth

Pedro Principeza (pprincipeza) wrote on 2023-05-11:

Hi, Vineeth.

The patch in hand is included in the HWE version of the Focal Kernel and in the LTS version of the Jammy Kernel. Both are 5.15, FWIW, and the fix has a different id there:

f0f894f0f636 net/mlx5: Fix handling of entry refcount when command is not issued to FW

The Focal LTS Kernel is the only one that needs the backport. Let us know how testing goes at your end.

BR,
pprincipeza

bugproxy (bugproxy) on 2023-06-02

tags:

added: targetmilestone-inin2004
removed: targetmilestone-inin---

bugproxy (bugproxy) wrote on 2023-06-27:

------- Comment From <email address hidden> 2023-06-27 06:45 EDT-------
The cloud team did some testing with the fixed Focal -gt version.
The problem did not appear anymore, therefore I think we can close this bugzilla / LP item.
Thanks to everybody for your work.

==> Changing the status to: CLOSED

bugproxy (bugproxy) wrote on 2023-06-27:

------- Comment From <email address hidden> 2023-06-27 07:04 EDT-------
Sorry for the premature closing of this bug.
Reopening this item as the fix needs to be released in Focal LTS by Canonical first, before we can close.

Frank Heimes (fheimes) wrote on 2023-06-28:

#10

Just double checked the potential affected releases.
The fixes are included in lunar, kinetic and jammy, so the only affected release is indeed focal.

Changed in linux (Ubuntu):
status:	New → Fix Released
Changed in ubuntu-z-systems:
status:	New → In Progress
assignee:	nobody → Skipper Bug Screeners (skipper-screen-team)

Frank Heimes (fheimes) on 2023-06-28

description:

updated

Frank Heimes (fheimes) wrote on 2023-06-28:

#11

Submission to the kernel team mailing list was done:
https://lists.ubuntu.com/archives/kernel-team/2023-June/thread.html#140723

Changed in linux (Ubuntu Focal):
assignee:	nobody → Canonical Kernel Team (canonical-kernel-team)
importance:	Undecided → High
Changed in linux (Ubuntu):
importance:	Undecided → High
Changed in ubuntu-z-systems:
importance:	Undecided → High

Frank Heimes (fheimes) on 2023-06-28

description:

updated

Marcelo Cerri (mhcerri) wrote on 2023-06-28:

#12

Hi, Boris.

Just to confirm: did you manage to validate the 5.4 generic test kernel? This fix is intended for the 5.4 generic kernel in focal and in bionic (via the generic HWE kernel).

Thank you!

Frank Heimes (fheimes) wrote on 2023-06-28:

#13

Test build (on slightly newer focal kernel):
https://launchpad.net/~fheimes/+archive/ubuntu/lp2019011

Roxana Nicolescu (roxanan) on 2023-07-07

Changed in linux (Ubuntu Focal):
status:	In Progress → Fix Committed

Frank Heimes (fheimes) on 2023-07-07

Changed in ubuntu-z-systems:
status:	In Progress → Fix Committed

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2023-07-13:

#14

This bug is awaiting verification that the linux/5.4.0-155.172 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!
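
Before verifying, it helps to confirm that the running kernel is at least the build named above. A minimal Python sketch, assuming `uname -r` prints a release such as "5.4.0-128-generic" and that only the generic 5.4.0 series is of interest (derivative flavours use other ABI numbering). FIRST_FIXED is taken from the 5.4.0-155.172 version in this comment; the function names are illustrative.

```python
import re
from typing import Tuple

# First generic 5.4 build named in this bug as carrying the fix (in -proposed).
FIRST_FIXED = (5, 4, 0, 155)


def parse_release(release: str) -> Tuple[int, ...]:
    """Split a release such as '5.4.0-128-generic' into (5, 4, 0, 128)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)-generic$", release)
    if match is None:
        raise ValueError(f"not a generic kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(release: str) -> bool:
    """True if a generic 5.4.0 kernel is at or above the first fixed ABI."""
    version = parse_release(release)
    if version[:3] != FIRST_FIXED[:3]:
        raise ValueError("not a 5.4.0 kernel; check that series separately")
    return version >= FIRST_FIXED
```

With the release from the original report, has_fix("5.4.0-128-generic") returns False.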

tags:

added: kernel-spammed-focal-linux verification-needed-focal

bugproxy (bugproxy) wrote on 2023-08-10:

#15

------- Comment From <email address hidden> 2023-08-10 04:01 EDT-------
The patched kernel is running for quite a while in our systems, so far w/o showing the reported issue again.
With that we could declare the verification as done.

Thanks everyone for all your work!

Frank Heimes (fheimes) wrote on 2023-08-10:

#16

Thanks for the update - adjusting the tags accordingly ...

tags:

added: verification-done-focal
removed: verification-needed-focal

Launchpad Janitor (janitor) wrote on 2023-08-10:

#17

This bug was fixed in the package linux - 5.4.0-156.173

---------------
linux (5.4.0-156.173) focal; urgency=medium

* focal/linux: 5.4.0-156.173 -proposed tracker (LP: #2026585)

* CVE-2023-3390
    - netfilter: nf_tables: incorrect error path handling with NFT_MSG_NEWRULE

* Focal update: v5.4.241 upstream stable release (LP: #2023930)
    - scsi: ses: Handle enclosure with just a primary component gracefully
    - x86/PCI: Add quirk for AMD XHCI controller that loses MSI-X state in D3hot
    - cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach()
    - treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD()
    - smb3: fix problem with null cifs super block with previous patch
    - pinctrl: amd: Use irqchip template
    - pinctrl: amd: disable and mask interrupts on probe
    - pinctrl: amd: Disable and mask interrupts on resume
    - pwm: cros-ec: Explicitly set .polarity in .get_state()
    - pwm: sprd: Explicitly set .polarity in .get_state()
    - wifi: mac80211: fix invalid drv_sta_pre_rcu_remove calls for non-uploaded
      sta
    - icmp: guard against too small mtu
    - net: don't let netpoll invoke NAPI if in xmit context
    - sctp: check send stream number after wait_for_sndbuf
    - ipv6: Fix an uninit variable access bug in __ip6_make_skb()
    - gpio: davinci: Add irq chip flag to skip set wake
    - sunrpc: only free unix grouplist after RCU settles
    - NFSD: callback request does not use correct credential for AUTH_SYS
    - xhci: also avoid the XHCI_ZERO_64B_REGS quirk with a passthrough iommu
    - USB: serial: cp210x: add Silicon Labs IFS-USB-DATACABLE IDs
    - usb: typec: altmodes/displayport: Fix configure initial pin assignment
    - USB: serial: option: add Telit FE990 compositions
    - USB: serial: option: add Quectel RM500U-CN modem
    - iio: adc: ti-ads7950: Set `can_sleep` flag for GPIO chip
    - iio: dac: cio-dac: Fix max DAC write value check for 12-bit
    - tty: serial: sh-sci: Fix transmit end interrupt handler
    - tty: serial: sh-sci: Fix Rx on RZ/G2L SCI
    - tty: serial: fsl_lpuart: avoid checking for transfer complete when
      UARTCTRL_SBK is asserted in lpuart32_tx_empty
    - nilfs2: fix potential UAF of struct nilfs_sc_info in nilfs_segctor_thread()
    - nilfs2: fix sysfs interface lifetime
    - ALSA: hda/realtek: Add quirk for Clevo X370SNW
    - perf/core: Fix the same task check in perf_event_set_output
    - ftrace: Mark get_lock_parent_ip() __always_inline
    - can: j1939: j1939_tp_tx_dat_new(): fix out-of-bounds memory access
    - tracing: Free error logs of tracing instances
    - net_sched: prevent NULL dereference if default qdisc setup failed
    - drm/panfrost: Fix the panfrost_mmu_map_fault_addr() error path
    - ring-buffer: Fix race while reader and writer are on the same page
    - mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()
    - irqdomain: Look for existing mapping only once
    - irqdomain: Refactor __irq_domain_alloc_irqs()
    - irqdomain: Fix mapping-creation race
    - Revert "pinctrl: amd: Disable and mask interrupts on resume"
    - ALSA: emu10k1: fix capture interrupt handler unlinking
    - ALSA: hda/sigmatel: add pin overrides for Intel DP45SG motherboard
    - ALSA: i2c/cs8427: fix iec958 mixer control deactivation
    - ALSA: firewire-tascam: add missing unwind goto in
      snd_tscm_stream_start_duplex()
    - ALSA: hda/sigmatel: fix S/PDIF out on Intel D*45* motherboards
    - Bluetooth: L2CAP: Fix use-after-free in l2cap_disconnect_{req,rsp}
    - Bluetooth: Fix race condition in hidp_session_thread
    - btrfs: print checksum type and implementation at mount time
    - btrfs: fix fast csum implementation detection
    - mtdblock: tolerate corrected bit-flips
    - mtd: rawnand: meson: fix bitmask for length in command word
    - mtd: rawnand: stm32_fmc2: remove unsupported EDO mode
    - niu: Fix missing unwind goto in niu_alloc_channels()
    - qlcnic: check pci_reset_function result
    - sctp: fix a potential overflow in sctp_ifwdtsn_skip
    - RDMA/core: Fix GID entry ref leak when create_ah fails
    - udp6: fix potential access to stale information
    - net: macb: fix a memory corruption in extended buffer descriptor mode
    - power: supply: cros_usbpd: reclassify "default case!" as debug
    - i2c: imx-lpi2c: clean rx/tx buffers upon new message
    - efi: sysfb_efi: Add quirk for Lenovo Yoga Book X91F/L
    - drm: panel-orientation-quirks: Add quirk for Lenovo Yoga Book X90F
    - verify_pefile: relax wrapper length check
    - asymmetric_keys: log on fatal failures in PE/pkcs7
    - ubi: Fix failure attaching when vid_hdr offset equals to (sub)page size
    - mtd: ubi: wl: Fix a couple of kernel-doc issues
    - ubi: Fix deadlock caused by recursively holding work_sem
    - i2c: ocores: generate stop condition after timeout in polling mode
    - watchdog: sbsa_wdog: Make sure the timeout programming is within the limits
    - coresight-etm4: Fix for() loop drvdata->nr_addr_cmp range bug
    - xfs: show the proper user quota options
    - xfs: remove the kuid/kgid conversion wrappers
    - xfs: add a new xfs_sb_version_has_v3inode helper
    - xfs: only check the superblock version for dinode size calculation
    - xfs: simplify di_flags2 inheritance in xfs_ialloc
    - xfs: simplify a check in xfs_ioctl_setattr_check_cowextsize
    - xfs: remove the di_version field from struct icdinode
    - xfs: set inode size after creating symlink
    - xfs: report corruption only as a regular error
    - xfs: shut down the filesystem if we screw up quota reservation
    - xfs: consider shutdown in bmapbt cursor delete assert
    - xfs: don't reuse busy extents on extent trim
    - xfs: force log and push AIL to clear pinned inodes when aborting mount
    - Linux 5.4.241

* [UBUNTU 20.04] [HPS] Kernel panic with "refcount_t: underflow" in mlx5
    driver (LP: #2019011)
    - net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible
    - net/mlx5: Fix handling of entry refcount when command is not issued to FW

* Disable hv-kvp-daemon if /dev/vmbus/hv_kvp is not present (LP: #2024900)
    - [Packaging] disable hv-kvp-daemon if needed

* CVE-2023-35001
    - netfilter: nf_tables: prevent OOB access in nft_byteorder_eval

* CVE-2023-32629
    - ovl: adhere to the vfs_ vs. ovl_do_ conventions for xattrs

* CVE-2023-3141
    - memstick: r592: Fix UAF bug in r592_remove due to race condition

* CVE-2023-3111
    - btrfs: check return value of btrfs_commit_transaction in relocation
    - btrfs: unset reloc control if transaction commit fails in
      prepare_to_relocate()

* CVE-2023-3090
    - ipvlan:Fix out-of-bounds caused by unclear skb->cb

* CVE-2023-1611
    - btrfs: fix race between quota disable and quota assign ioctls

* CVE-2022-0168
    - cifs: move some variables off the stack in smb2_ioctl_query_info
    - cifs: prevent bad output lengths in smb2_ioctl_query_info()
    - cifs: fix NULL ptr dereference in smb2_ioctl_query_info()

* CVE-2022-27672
    - x86/speculation: Identify processors vulnerable to SMT RSB predictions
    - KVM: x86: Mitigate the cross-thread return address predictions bug
    - Documentation/hw-vuln: Add documentation for Cross-Thread Return Predictions

* Severe NFS performance degradation after LP #2003053 (LP: #2022098)
    - SAUCE: Make NFS file-access stale cache behaviour opt-in

* Encountering an issue with memcpy_fromio causing failed boot of SEV-enabled
    guest (LP: #2020319)
    - x86/sev: Unroll string mmio with CC_ATTR_GUEST_UNROLL_STRING_IO

* Focal update: v5.4.240 upstream stable release (LP: #2023601)
    - net: tls: fix possible race condition between do_tls_getsockopt_conf() and
      do_tls_setsockopt_conf()
    - power: supply: da9150: Fix use after free bug in da9150_charger_remove due
      to race condition
    - iavf: fix inverted Rx hash condition leading to disabled hash
    - iavf: fix non-tunneled IPv6 UDP packet type and hashing
    - intel/igbvf: free irq on the error path in igbvf_request_msix()
    - igbvf: Regard vf reset nack as success
    - i2c: imx-lpi2c: check only for enabled interrupt flags
    - scsi: scsi_dh_alua: Fix memleak for 'qdata' in alua_activate()
    - net: usb: smsc95xx: Limit packet length to skb->len
    - qed/qed_sriov: guard against NULL derefs from qed_iov_get_vf_info
    - net: qcom/emac: Fix use after free bug in emac_remove due to race condition
    - net/ps3_gelic_net: Fix RX sk_buff length
    - net/ps3_gelic_net: Use dma_mapping_error
    - keys: Do not cache key in task struct if key is requested from kernel thread
    - bpf: Adjust insufficient default bpf_jit_limit
    - net/mlx5: Read the TC mapping of all priorities on ETS query
    - atm: idt77252: fix kmemleak when rmmod idt77252
    - erspan: do not use skb_mac_header() in ndo_start_xmit()
    - net/sonic: use dma_mapping_error() for error check
    - nvme-tcp: fix nvme_tcp_term_pdu to match spec
    - hvc/xen: prevent concurrent accesses to the shared ring
    - net: mdio: thunder: Add missing fwnode_handle_put()
    - Bluetooth: btqcomsmd: Fix command timeout after setting BD address
    - platform/chrome: cros_ec_chardev: fix kernel data leak from ioctl
    - hwmon (it87): Fix voltage scaling for chips with 10.9mV ADCs
    - scsi: qla2xxx: Perform lockless command completion in abort path
    - uas: Add US_FL_NO_REPORT_OPCODES for JMicron JMS583Gen 2
    - thunderbolt: Use const qualifier for `ring_interrupt_index`
    - riscv: Bump COMMAND_LINE_SIZE value to 1024
    - ca8210: fix mac_len negative array access
    - m68k: Only force 030 bus error if PC not in exception table
    - selftests/bpf: check that modifier resolves after pointer
    - scsi: target: iscsi: Fix an error message in iscsi_check_key()
    - scsi: ufs: core: Add soft dependency on governor_simpleondemand
    - scsi: lpfc: Avoid usage of list iterator variable after loop
    - net: usb: cdc_mbim: avoid altsetting toggling for Telit FE990
    - net: usb: qmi_wwan: add Telit 0x1080 composition
    - sh: sanitize the flags on sigreturn
    - cifs: empty interface list when server doesn't support query interfaces
    - scsi: core: Add BLIST_SKIP_VPD_PAGES for SKhynix H28U74301AMR
    - usb: gadget: u_audio: don't let userspace block driver unbind
    - fsverity: Remove WQ_UNBOUND from fsverity read workqueue
    - igb: revert rtnl_lock() that causes deadlock
    - dm thin: fix deadlock when swapping to thin device
    - usb: cdns3: Fix issue with using incorrect PCI device function
    - usb: chipdea: core: fix return -EINVAL if request role is the same with
      current role
    - usb: chipidea: core: fix possible concurrent when switch role
    - wifi: mac80211: fix qos on mesh interfaces
    - nilfs2: fix kernel-infoleak in nilfs_ioctl_wrap_copy()
    - i2c: xgene-slimpro: Fix out-of-bounds bug in xgene_slimpro_i2c_xfer()
    - dm stats: check for and propagate alloc_percpu failure
    - dm crypt: add cond_resched() to dmcrypt_write()
    - sched/fair: sanitize vruntime of entity being placed
    - sched/fair: Sanitize vruntime of entity being migrated
    - tun: avoid double free in tun_free_netdev
    - ocfs2: fix data corruption after failed write
    - fsverity: don't drop pagecache at end of FS_IOC_ENABLE_VERITY
    - bus: imx-weim: fix branch condition evaluates to a garbage value
    - md: avoid signed overflow in slot_store()
    - ALSA: asihpi: check pao in control_message()
    - ALSA: hda/ca0132: fixup buffer overrun at tuning_ctl_set()
    - fbdev: tgafb: Fix potential divide by zero
    - sched_getaffinity: don't assume 'cpumask_size()' is fully initialized
    - fbdev: nvidia: Fix potential divide by zero
    - fbdev: intelfb: Fix potential divide by zero
    - fbdev: lxfb: Fix potential divide by zero
    - fbdev: au1200fb: Fix potential divide by zero
    - ca8210: Fix unsigned mac_len comparison with zero in ca8210_skb_tx()
    - dma-mapping: drop the dev argument to arch_sync_dma_for_*
    - mips: bmips: BCM6358: disable RAC flush for TP1
    - mtd: rawnand: meson: invalidate cache on polling ECC bit
    - scsi: megaraid_sas: Fix crash after a double completion
    - ptp_qoriq: fix memory leak in probe()
    - regulator: fix spelling mistake "Cant" -> "Can't"
    - regulator: Handle deferred clk
    - net/net_failover: fix txq exceeding warning
    - can: bcm: bcm_tx_setup(): fix KMSAN uninit-value in vfs_write
    - s390/vfio-ap: fix memory leak in vfio_ap device driver
    - i40e: fix registers dump after run ethtool adapter self test
    - bnxt_en: Fix typo in PCI id to device description string mapping
    - net: dsa: mv88e6xxx: Enable IGMP snooping on user ports only
    - net: mvneta: make tx buffer array agnostic
    - pinctrl: ocelot: Fix alt mode for ocelot
    - Input: alps - fix compatibility with -funsigned-char
    - Input: focaltech - use explicitly signed char type
    - cifs: prevent infinite recursion in CIFSGetDFSRefer()
    - cifs: fix DFS traversal oops without CONFIG_CIFS_DFS_UPCALL
    - Input: goodix - add Lenovo Yoga Book X90F to nine_bytes_report DMI table
    - xen/netback: don't do grant copy across page boundary
    - pinctrl: at91-pio4: fix domain name assignment
    - NFSv4: Fix hangs when recovering open state after a server reboot
    - ALSA: hda/conexant: Partial revert of a quirk for Lenovo
    - ALSA: usb-audio: Fix regression on detection of Roland VS-100
    - drm/etnaviv: fix reference leak when mmaping imported buffer
    - btrfs: scan device in non-exclusive mode
    - ext4: fix kernel BUG in 'ext4_write_inline_data_end()'
    - net_sched: add __rcu annotation to netdev->qdisc
    - net: sched: fix race condition in qdisc_graft()
    - firmware: arm_scmi: Fix device node validation for mailbox transport
    - gfs2: Always check inode size of inline inodes
    - Linux 5.4.240

* Focal update: v5.4.239 upstream stable release (LP: #2023600)
    - Linux 5.4.239

* CVE-2023-2124
    - xfs: verify buffer contents when we skip log replay

* CVE-2020-36691
    - netlink: limit recursion depth in policy validation

* CVE-2022-1184
    - ext4: check if directory block is within i_size
    - ext4: fix check for block being out of directory size

* CVE-2022-4269
    - net: sched: extract qstats update code into functions
    - net: sched: don't expose action qstats to skb_tc_reinsert()
    - net/sched: act_mirred: refactor the handle of xmit
    - net: sched: remove unused tcf_result extension
    - net/sched: act_mirred: better wording on protection against excessive stack
      growth
    - act_mirred: use the backlog for nested calls to mirred ingress

* Focal update: v5.4.238 upstream stable release (LP: #2023427)
    - ext4: fix cgroup writeback accounting with fs-layer encryption
    - xfrm: Allow transport-mode states with AF_UNSPEC selector
    - drm/panfrost: Don't sync rpm suspension after mmu flushing
    - cifs: Move the in_send statistic to __smb_send_rqst()
    - drm/meson: fix 1px pink line on GXM when scaling video overlay
    - clk: HI655X: select REGMAP instead of depending on it
    - docs: Correct missing "d_" prefix for dentry_operations member
      d_weak_revalidate
    - scsi: mpt3sas: Fix NULL pointer access in mpt3sas_transport_port_add()
    - ALSA: hda - add Intel DG1 PCI and HDMI ids
    - ALSA: hda - controller is in GPU on the DG1
    - ALSA: hda: Add Alderlake-S PCI ID and HDMI codec vid
    - ALSA: hda: Add Intel DG2 PCI ID and HDMI codec vid
    - ALSA: hda: Match only Intel devices with CONTROLLER_IN_GPU()
    - netfilter: nft_redir: correct value of inet type `.maxattrs`
    - scsi: core: Fix a comment in function scsi_host_dev_release()
    - scsi: core: Fix a procfs host directory removal regression
    - tcp: tcp_make_synack() can be called from process context
    - nfc: pn533: initialize struct pn533_out_arg properly
    - ipvlan: Make skb->skb_iif track skb->dev for l3s mode
    - i40e: Fix kernel crash during reboot when adapter is in recovery mode
    - qed/qed_dev: guard against a possible division by zero
    - net: tunnels: annotate lockless accesses to dev->needed_headroom
    - net: phy: smsc: bail out in lan87xx_read_status if genphy_read_status fails
    - nfc: st-nci: Fix use after free bug in ndlc_remove due to race condition
    - net: usb: smsc75xx: Limit packet length to skb->len
    - nvmet: avoid potential UAF in nvmet_req_complete()
    - block: sunvdc: add check for mdesc_grab() returning NULL
    - ipv4: Fix incorrect table ID in IOCTL path
    - net: usb: smsc75xx: Move packet length check to prevent kernel panic in
      skb_pull
    - net/iucv: Fix size of interrupt data
    - ethernet: sun: add check for the mdesc_grab()
    - hwmon: (adt7475) Display smoothing attributes in correct order
    - hwmon: (adt7475) Fix masking of hysteresis registers
    - hwmon: (xgene) Fix use after free bug in xgene_hwmon_remove due to race
      condition
    - hwmon: (ina3221) return prober error code
    - media: m5mols: fix off-by-one loop termination error
    - mmc: atmel-mci: fix race between stop command and start of next command
    - jffs2: correct logic when creating a hole in jffs2_write_begin
    - ext4: fail ext4_iget if special inode unallocated
    - ext4: fix task hung in ext4_xattr_delete_inode
    - drm/amdkfd: Fix an illegal memory access
    - sh: intc: Avoid spurious sizeof-pointer-div warning
    - ext4: fix possible double unlock when moving a directory
    - tty: serial: fsl_lpuart: skip waiting for transmission complete when
      UARTCTRL_SBK is asserted
    - interconnect: fix mem leak when freeing nodes
    - tracing: Check field value in hist_field_name()
    - tracing: Make tracepoint lockdep check actually test something
    - ftrace: Fix invalid address access in lookup_rec() when index is 0
    - fbdev: stifb: Provide valid pixelclock and add fb_check_var() checks
    - x86/mm: Fix use of uninitialized buffer in sme_enable()
    - drm/i915: Don't use stolen memory for ring buffers with LLC
    - serial: 8250_em: Fix UART port type
    - s390/ipl: add missing intersection check to ipl_report handling
    - PCI: Unify delay handling for reset and resume
    - HID: core: Provide new max_buffer_size attribute to over-ride the default
    - HID: uhid: Over-ride the default maximum data buffer value with our own
    - Linux 5.4.238

* Focal update: v5.4.237 upstream stable release (LP: #2023420)
    - fs: prevent out-of-bounds array speculation when closing a file descriptor
    - x86/CPU/AMD: Disable XSAVES on AMD family 0x17
    - drm/connector: print max_requested_bpc in state debugfs
    - ext4: fix RENAME_WHITEOUT handling for inline directories
    - ext4: fix another off-by-one fsmap error on 1k block filesystems
    - ext4: move where set the MAY_INLINE_DATA flag is set
    - ext4: fix WARNING in ext4_update_inline_data
    - ext4: zero i_disksize when initializing the bootloader inode
    - nfc: change order inside nfc_se_io error path
    - iommu/amd: Add PCI segment support for ivrs_[ioapic/hpet/acpihid] commands
    - iommu/amd: Fix ill-formed ivrs_ioapic, ivrs_hpet and ivrs_acpihid options
    - iommu/amd: Add a length limitation for the ivrs_acpihid command-line
      parameter
    - ipmi:ssif: make ssif_i2c_send() void
    - ipmi:ssif: resend_msg() cannot fail
    - ipmi:ssif: Remove rtc_us_timer
    - ipmi:ssif: Increase the message retry time
    - ipmi:ssif: Add a timer between request retries
    - irqdomain: Change the type of 'size' in __irq_domain_add() to be consistent
    - irqdomain: Fix domain registration race
    - iommu/vt-d: Fix PASID directory pointer coherency
    - SMB3: Backup intent flag missing from some more ops
    - cifs: Fix uninitialized memory read in smb3_qfs_tcon()
    - scsi: core: Remove the /proc/scsi/${proc_name} directory earlier
    - ext4: Fix possible corruption when moving a directory
    - drm/msm/a5xx: fix setting of the CP_PREEMPT_ENABLE_LOCAL register
    - nfc: fdp: add null check of devm_kmalloc_array in
      fdp_nci_i2c_read_device_properties
    - ila: do not generate empty messages in ila_xlat_nl_cmd_get_mapping()
    - selftests: nft_nat: ensuring the listening side is up before starting the
      client
    - net: usb: lan78xx: Remove lots of set but unused 'ret' variables
    - net: lan78xx: fix accessing the LAN7800's internal phy specific registers
      from the MAC driver
    - net: caif: Fix use-after-free in cfusbl_device_notify()
    - bnxt_en: Avoid order-5 memory allocation for TPA data
    - netfilter: tproxy: fix deadlock due to missing BH disable
    - btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR
    - scsi: megaraid_sas: Update max supported LD IDs to 240
    - net/smc: fix fallback failed while sendmsg with fastopen
    - riscv: Use READ_ONCE_NOCHECK in imprecise unwinding stack mode
    - ext4: Fix deadlock during directory rename
    - MIPS: Fix a compilation issue
    - alpha: fix R_ALPHA_LITERAL reloc for large modules
    - macintosh: windfarm: Use unsigned type for 1-bit bitfields
    - PCI: Add SolidRun vendor ID
    - media: ov5640: Fix analogue gain control
    - ipmi/watchdog: replace atomic_add() and atomic_sub()
    - ipmi:watchdog: Set panic count to proper value on a panic
    - drm/i915: Don't use BAR mappings for ring buffers with LLC
    - x86, vmlinux.lds: Add RUNTIME_DISCARD_EXIT to generic DISCARDS
    - arch: fix broken BuildID for arm64 and riscv
    - powerpc/vmlinux.lds: Define RUNTIME_DISCARD_EXIT
    - powerpc/vmlinux.lds: Don't discard .rela* for relocatable builds
    - s390: define RUNTIME_DISCARD_EXIT to fix link error with GNU ld < 2.36
    - sh: define RUNTIME_DISCARD_EXIT
    - UML: define RUNTIME_DISCARD_EXIT
    - s390/dasd: add missing discipline function
    - Linux 5.4.237

* Focal update: v5.4.236 upstream stable release (LP: #2020390)
    - staging: rtl8192e: Remove function ..dm_check_ac_dc_power calling a script
    - staging: rtl8192e: Remove call_usermodehelper starting RadioPower.sh
    - Linux 5.4.236

* Packaging resync (LP: #1786013)
    - [Packaging] resync update-dkms-versions helper

-- Roxana Nicolescu <roxana.nicolescu@canonical.com>  Mon, 10 Jul 2023 17:38:56 +0200

Changed in linux (Ubuntu Focal):
status:	Fix Committed → Fix Released

Frank Heimes (fheimes) on 2023-08-10

Changed in ubuntu-z-systems:
status:	Fix Committed → Fix Released

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2023-08-18:

#18

This bug is awaiting verification that the linux-ibm/5.4.0-1055.60 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-ibm' to 'verification-done-focal-linux-ibm'. If the problem still exists, change the tag 'verification-needed-focal-linux-ibm' to 'verification-failed-focal-linux-ibm'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags:

added: kernel-spammed-focal-linux-ibm-v2 verification-needed-focal-linux-ibm
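
The tag change requested in the comment above follows a fixed naming pattern, 'verification-<state>-<series>-<source package>'. As a minimal sketch only (the helper names `verification_tags` and `mark_verified` are hypothetical and are not part of Launchpad or any Ubuntu tooling), the transition could be modelled like this:

```python
# Hypothetical helpers illustrating the tag naming used in the bot comments.
# They only manipulate strings; they do not talk to Launchpad.

def verification_tags(series: str, source_package: str) -> dict:
    """Return the needed/done/failed tag names for one kernel source package."""
    base = f"{series}-{source_package}"
    return {
        state: f"verification-{state}-{base}"
        for state in ("needed", "done", "failed")
    }


def mark_verified(tags: set, series: str, source_package: str, fixed: bool) -> set:
    """Swap the 'needed' tag for 'done' or 'failed'; leave other tags alone."""
    names = verification_tags(series, source_package)
    if names["needed"] not in tags:
        # Nothing is awaiting verification for this package.
        return set(tags)
    outcome = names["done"] if fixed else names["failed"]
    return (set(tags) - {names["needed"]}) | {outcome}


current = {
    "kernel-spammed-focal-linux-ibm-v2",
    "verification-needed-focal-linux-ibm",
}
print(sorted(mark_verified(current, "focal", "linux-ibm", fixed=True)))
```

Each kernel source package (linux-ibm, linux-azure, linux-bluefield) carries its own tag, so verifying one leaves the tags of the others untouched.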

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2023-08-18:

#19

This bug is awaiting verification that the linux-azure/5.4.0-1114.120 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-azure' to 'verification-done-focal-linux-azure'. If the problem still exists, change the tag 'verification-needed-focal-linux-azure' to 'verification-failed-focal-linux-azure'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags:

added: kernel-spammed-focal-linux-azure-v2 verification-needed-focal-linux-azure

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2023-08-23:

#20

This bug is awaiting verification that the linux-bluefield/5.4.0-1069.75 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal-linux-bluefield' to 'verification-done-focal-linux-bluefield'. If the problem still exists, change the tag 'verification-needed-focal-linux-bluefield' to 'verification-failed-focal-linux-bluefield'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags:

added: kernel-spammed-focal-linux-bluefield-v2 verification-needed-focal-linux-bluefield

bugproxy (bugproxy) on 2023-10-17

tags:

removed: verification-needed-focal-linux-bluefield

CDE Administration (cdeadmin) on 2024-02-08

tags:

added: verification-needed-focal-linux-bluefield
