[Ubuntu 24.04] FW1060.00 (NH1060_026) sosreport is running to Kernel OOPS crash

Bug #2070358 reported by bugproxy
This bug affects 1 person
Affects                           Status         Importance  Assigned to            Milestone
The Ubuntu-power-systems project  Fix Committed  High        Patricia Domingues
linux (Ubuntu)                    Invalid        High        Canonical Kernel Team
  Noble                           Fix Committed  High        Unassigned
sosreport (Ubuntu)                Invalid        High        Unassigned
  Noble                           Invalid        Undecided   Unassigned

Bug Description

SRU Justification:

[Impact]
 * When the sosreport command is executed, a kernel oops occurs and the system crashes;
   depending on the configuration (this is the default), the system/LPAR reboots.

[Fix]
 * e0011bca603c101f2a3c007bdb77f7006fa78fb1 e0011bca603c "nfsd: initialise nfsd_info.mutex early"

[Test Case]
 * Have an Ubuntu Server 24.04 LTS installation on ppc64el.
 * The first option is to simply run sosreport on the system;
   the crash is seen when sosreport starts collecting data.
 * The second option (without sosreport) is:
   * CONFIG_NFSD=m (or y) must be set
   * mount nfsd if it is not already mounted, using the "$ mount -t nfsd nfsd /proc/fs/nfsd" command
   * read the pool statistics using the "$ cat /proc/fs/nfsd/pool_stats" command
 * The kernel oops will happen and the logs will show:
   ...
   BUG: Kernel NULL pointer dereference on read at 0x00000000
   Faulting instruction address: 0xc0000000016ff114
   Oops: Kernel access of bad area, sig: 11 [#1]
   ...
 * On a system with a kernel that includes the above patch,
   no oops will occur and the sosreport command will execute normally.

[Regression Potential]
* As with any code modification, there is a certain risk of regression,
  here specifically because the mutex handling in nfsd is modified.

* However, the changes are easy to trace.

* On top of that, the commit has already been reviewed and accepted upstream.

* The modifications were done by the NFSD maintainer and also tested by IBM.

[Other]
* The fix/commit was accepted upstream with kernel v6.10-rc7,
  hence Oracular (with a planned kernel of >=6.10) is not affected.

== Comment: #0 - Tasmiya Nalatwad <email address hidden> - 2024-05-28 04:35:50 ==
--- Description ---
When the sosreport command is executed, a kernel oops occurs and the LPAR reboots. As kdump was enabled, the dump was captured.

Note: The bug looks similar to Bug 206504, which is seen on z LPARs.

--- Lpar Details ---
1. PowerVM
2. FW: FW1060.00 (NH1060_026)
3. OS: Ubuntu 24.04
4. Kernel: 6.8.0-31-generic
5. Mem (free -mh): 47Gi
6. cpus: 40

--- Steps to reproduce ---
1. Run the sosreport command on the LPAR; the crash is seen when sosreport starts collecting data.

--- Traces ---
root@ubuntulp2host:~# sosreport
Please note the 'sosreport' command has been deprecated in favor of the new 'sos' command, E.G. 'sos report'.
Redirecting to 'sos report '

sosreport (version 4.5.6)

This command will collect system configuration and diagnostic
information from this Ubuntu system.

For more information on Canonical visit:

        Community Website : https://www.ubuntu.com/
        Commercial Support : https://www.canonical.com

The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.

No changes will be made to system configuration.

Press ENTER to continue, or CTRL-C to quit.

Optionally, please enter the case id that you are generating this report for []:

 Setting up archive ...
 Setting up plugins ...
[plugin:lxd] skipped command 'lxc image list': required kmods missing: ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, ip6_tables, ip6table_filter.
[plugin:lxd] skipped command 'lxc list': required kmods missing: ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, ip6_tables, ip6table_filter.
[plugin:lxd] skipped command 'lxc network list': required kmods missing: ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, ip6_tables, ip6table_filter.
[plugin:lxd] skipped command 'lxc profile list': required kmods missing: ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, ip6_tables, ip6table_filter.
[plugin:lxd] skipped command 'lxc storage list': required kmods missing: ip6table_nat, ip6table_raw, bpfilter, iptable_mangle, iptable_filter, iptable_raw, ebtable_filter, ip6table_mangle, ebtables, iptable_nat, ip6_tables, ip6table_filter.
[plugin:networking] skipped command 'ip -s macsec show': required kmods missing: macsec. Use '--allow-system-changes' to enable collection.
[plugin:networking] skipped command 'ss -peaonmi': required kmods missing: af_packet_diag, unix_diag, netlink_diag, udp_diag, inet_diag, tcp_diag, xsk_diag. Use '--allow-system-changes' to enable collection.
Not all environment variables set. Source the environment file for the user intended to connect to the OpenStack environment.
[plugin:ufw] skipped command 'ufw status numbered': required kmods missing: bpfilter, iptable_filter.
[plugin:ufw] skipped command 'ufw app list': required kmods missing: bpfilter, iptable_filter.
 Running plugins. Please wait ...

  Starting 21/75 firewall_tables [Running: cloud_init ebpf filesys firewall_tables] [ 1057.076626] Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
[ 1057.076645] BUG: Kernel NULL pointer dereference on read at 0x00000000
[ 1057.076650] Faulting instruction address: 0xc0000000016ff114
[ 1057.076655] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1057.076659] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[ 1057.076665] Modules linked in: rpcsec_gss_krb5 xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc rdma_ucm ib_uverbs qrtr rdma_cm iw_cm ib_cm ib_core cfg80211 binfmt_misc kvm_hv kvm vmx_crypto nfsd auth_rpcgss nfs_acl lockd grace nf_tables nvme_fabrics dm_multipath nvme_core nvme_auth sunrpc nfnetlink ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 nx_compress_pseries nx_compress ibmvscsi 842_decompress ibmveth pseries_rng poly1305_p10_crypto chacha_p10_crypto libchacha crct10dif_vpmsum crc32c_vpmsum aes_gcm_p10_crypto
[ 1057.076731] CPU: 25 PID: 6109 Comm: sosreport Kdump: loaded Not tainted 6.8.0-31-generic #31-Ubuntu
[ 1057.076737] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_026) hv:phyp pSeries
[ 1057.076743] NIP: c0000000016ff114 LR: c0000000016ff108 CTR: c0000000016ff0e0
[ 1057.076747] REGS: c000000067e63630 TRAP: 0300 Not tainted (6.8.0-31-generic)
[ 1057.076752] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24044400 XER: 2004008c
[ 1057.076761] CFAR: c0000000016fb6c8 DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0
[ 1057.076761] GPR00: 0000000000000000 c000000067e638d0 c000000002254800 0000000000000000
[ 1057.076761] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1057.076761] GPR08: 0000000000000000 0000000000000000 c000000057a07980 c008000005d39538
[ 1057.076761] GPR12: c0000000016ff0e0 c000000c1bc8ff00 0000000000000000 0000000000000000
[ 1057.076761] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 1057.076761] GPR20: c00000006751a628 0000000000000000 0000000000000000 0000000000000000
[ 1057.076761] GPR24: 0000000000000000 c00000006751a618 0000000000000000 c000000067e63a70
[ 1057.076761] GPR28: c000000067e63a98 0000000000000000 c00000006b4d9188 0000000000000000
[ 1057.076809] NIP [c0000000016ff114] mutex_lock+0x34/0x98
[ 1057.076816] LR [c0000000016ff108] mutex_lock+0x28/0x98
[ 1057.076821] Call Trace:
[ 1057.076823] [c000000067e638d0] [c0000000016ff108] mutex_lock+0x28/0x98 (unreliable)
[ 1057.076829] [c000000067e63900] [c008000005d2e480] svc_pool_stats_start+0x48/0xf8 [sunrpc]
[ 1057.076866] [c000000067e63970] [c0000000007196a0] seq_read_iter+0x16c/0x6a4
[ 1057.076871] [c000000067e63a40] [c000000000719d00] seq_read+0x128/0x1a8
[ 1057.076875] [c000000067e63ae0] [c0000000006c8254] vfs_read+0xe4/0x3e0
[ 1057.076881] [c000000067e63b90] [c0000000006c94a0] ksys_read+0x90/0x168
[ 1057.076886] [c000000067e63be0] [c000000000033248] system_call_exception+0xf8/0x290
[ 1057.076892] [c000000067e63e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
[ 1057.076899] --- interrupt: 3000 at 0x689080b5b504
[ 1057.076903] NIP: 0000689080b5b504 LR: 0000689080b5b504 CTR: 0000000000000000
[ 1057.076907] REGS: c000000067e63e80 TRAP: 3000 Not tainted (6.8.0-31-generic)
[ 1057.076911] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 42044402 XER: 00000000
[ 1057.076922] IRQMASK: 0
[ 1057.076922] GPR00: 0000000000000003 000068907600da50 0000689080c96d00 0000000000000008
[ 1057.076922] GPR04: 000068905c014660 0000000000010000 000068907ca613c8 00006890760168e0
[ 1057.076922] GPR08: 000068907600f228 0000000000000000 0000000000000000 0000000000000000
[ 1057.076922] GPR12: 0000000000000000 00006890760168e0 0000000000000001 0000000000000000
[ 1057.076922] GPR16: 000068907c89bb50 000068907ff10968 000068907ff10978 0000689080372f7a
[ 1057.076922] GPR20: 0000689080372f78 000068907ff10938 000068907ff108f0 0000000010493180
[ 1057.076922] GPR24: 000068905c014660 0000000000000008 0000000000000000 0000000000000000
[ 1057.076922] GPR28: 000068905c014660 0000000000010000 0000000000000008 000068907600da50
[ 1057.076965] NIP [0000689080b5b504] 0x689080b5b504
[ 1057.076969] LR [0000689080b5b504] 0x689080b5b504
[ 1057.076972] --- interrupt: 3000
[ 1057.076975] Code: 38425720 7c0802a6 60000000 7c0802a6 fbe1fff8 7c7f1b78 f8010010 f821ffd1 4bffc575 60000000 39200000 e94d0908 <7d00f8a8> 7c284800 40c20010 7d40f9ad
[ 1057.076990] ---[ end trace 0000000000000000 ]---
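
To pick the key facts out of a trace like the one above, a small filter can be handy. The following is a minimal sketch and not part of the original report: the function name oops_summary and the key=value output format are hypothetical, and the patterns assume the exact message wording shown in this log.

```shell
# oops_summary: read kernel log text on stdin and print the faulting
# address and the kernel symbol the instruction pointer (NIP) was in.
# The patterns match the message wording seen in the trace above.
oops_summary() {
    sed -n \
        -e 's/.*BUG: Kernel NULL pointer dereference on read at \(0x[0-9a-f]*\).*/fault_address=\1/p' \
        -e 's/.*NIP \[[0-9a-f]*] \([A-Za-z_][A-Za-z0-9_.]*\)+0x.*/faulting_symbol=\1/p'
}
```

It could be used as "dmesg | oops_summary" or against a saved log file. For the trace above it would report the NULL address and mutex_lock; the user-space NIP line is skipped because it carries no symbol name.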

== Comment: #1 - Tasmiya Nalatwad <email address hidden> - 2024-05-28 04:39:47 ==
Placed the dump file and the dmesg file on the junebug server.

ssh <email address hidden>
The dump file is located at: /home/dump/dumps/206751

== Comment: #5 - Sourabh Jain <email address hidden> - 2024-05-29 09:23:29 ==
Hello Team,

Here is my observation on this issue:

The kernel crash is due to sos trying to read data from the file below (in the nfsd filesystem mounted at /proc/fs/nfsd):
/proc/fs/nfsd/pool_stats

This issue is also reproducible with the current upstream kernel 6.10-rc1.

So there is nothing wrong with the sos tool; it is a kernel bug.

Here is the first bad kernel commit, which introduced this issue:

7b207ccd9833 svc: don't hold reference for poolstats, only mutex.

Here are the steps to reproduce this issue without sos tool:

Requirements:
 1. Kernel must have "7b207ccd9833 svc: don't hold reference for poolstats, only mutex." commit
 2. CONFIG_NFSD=m must be enabled
 3. mount nfsd if it is not already mounted, using the "$ mount -t nfsd nfsd /proc/fs/nfsd" command

Run the command below to reproduce the issue:
$ cat /proc/fs/nfsd/pool_stats

NOTE: the above command will crash the kernel.

Thanks,
Sourabh Jain
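
Because reading pool_stats on an affected kernel crashes the machine, it can help to check the mount precondition first without touching that file. The following is a minimal sketch and not part of the original report: the function name nfsd_mounted and its optional file argument are hypothetical, and the script only inspects the mount table.

```shell
#!/bin/sh
# Report whether the nfsd filesystem is mounted on /proc/fs/nfsd.
# This deliberately never reads pool_stats, because doing so on an
# unpatched kernel triggers the oops described above.

# nfsd_mounted [MOUNTS_FILE]: succeed if an nfsd filesystem is listed as
# mounted on /proc/fs/nfsd. The file argument (default /proc/mounts) is a
# hypothetical convenience so the check can be tried on sample data.
nfsd_mounted() {
    mounts_file="${1:-/proc/mounts}"
    # Mount table fields: device mountpoint fstype options dump pass
    awk '$2 == "/proc/fs/nfsd" && $3 == "nfsd" { found = 1 }
         END { exit !found }' "$mounts_file"
}

if nfsd_mounted; then
    echo "nfsd is mounted: reading /proc/fs/nfsd/pool_stats may crash an unpatched kernel"
else
    echo "nfsd is not mounted on /proc/fs/nfsd"
fi
```

The check says nothing about whether the running kernel contains the fix; it only tells you whether the file that triggers the oops is currently reachable.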

== Comment: #9 - Sourabh Jain <email address hidden> - 2024-06-17 08:57:19 ==
Hello Team,

The NFSD maintainer has provided the fix.
https://<email address hidden>/

Feel free to try the above fix.

Note: the fix is for Linux kernel and not for sosreport tool.

Thanks,
Sourabh Jain

== Comment: #10 - Sourabh Jain <email address hidden> - 2024-06-17 22:07:11 ==
Hello Team,

Fix is applied to nfsd-next kernel. Likely to hit mainline kernel in next rc.
https://<email address hidden>/

Thanks,
Sourabh Jain

== Comment: #14 - Tasmiya Nalatwad <email address hidden> - 2024-06-25 03:38:16 ==
Team, I have tested the fix on custom kernel "6.9.0-rc7nfsd-fix+" and the issue is not reproducible.

---- uname ----
Linux ubuntulp2host 6.9.0-rc7nfsd-fix+ #2 SMP Tue Jun 25 06:49:48 UTC 2024 ppc64le ppc64le ppc64le GNU/Linux

1. sosreport is generated as expected

------------------- logs ---------------------------
Please note the 'sosreport' command has been deprecated in favor of the new 'sos' command, E.G. 'sos report'.
Redirecting to 'sos report '

sosreport (version 4.5.6)

This command will collect system configuration and diagnostic
information from this Ubuntu system.

For more information on Canonical visit:

        Community Website : https://www.ubuntu.com/
        Commercial Support : https://www.canonical.com

The generated archive may contain data considered sensitive and its
content should be reviewed by the originating organization before being
passed to any third party.
No changes will be made to system configuration.
Press ENTER to continue, or CTRL-C to quit.
Optionally, please enter the case id that you are generating this report for []:
 Setting up archive ...
 Setting up plugins ...
[plugin:lxd] skipped command 'lxc image list': required kmods missing: ip6table_raw, iptable_filter, ebtables, bpfilter, iptable_nat, ebtable_filter, ip6table_nat, iptable_mangle, ip6table_mangle, ip6_tables, ip6table_filter, iptable_raw.
[plugin:lxd] skipped command 'lxc list': required kmods missing: ip6table_raw, iptable_filter, ebtables, bpfilter, iptable_nat, ebtable_filter, ip6table_nat, iptable_mangle, ip6table_mangle, ip6_tables, ip6table_filter, iptable_raw.
[plugin:lxd] skipped command 'lxc network list': required kmods missing: ip6table_raw, iptable_filter, ebtables, bpfilter, iptable_nat, ebtable_filter, ip6table_nat, iptable_mangle, ip6table_mangle, ip6_tables, ip6table_filter, iptable_raw.
[plugin:lxd] skipped command 'lxc profile list': required kmods missing: ip6table_raw, iptable_filter, ebtables, bpfilter, iptable_nat, ebtable_filter, ip6table_nat, iptable_mangle, ip6table_mangle, ip6_tables, ip6table_filter, iptable_raw.
[plugin:lxd] skipped command 'lxc storage list': required kmods missing: ip6table_raw, iptable_filter, ebtables, bpfilter, iptable_nat, ebtable_filter, ip6table_nat, iptable_mangle, ip6table_mangle, ip6_tables, ip6table_filter, iptable_raw.
[plugin:networking] skipped command 'ip -s macsec show': required kmods missing: macsec. Use '--allow-system-changes' to enable collection.
[plugin:networking] skipped command 'ss -peaonmi': required kmods missing: unix_diag, xsk_diag, af_packet_diag, tcp_diag, udp_diag, netlink_diag, inet_diag. Use '--allow-system-changes' to enable collection.
Not all environment variables set. Source the environment file for the user intended to connect to the OpenStack environment.
[plugin:ufw] skipped command 'ufw status numbered': required kmods missing: bpfilter, iptable_filter.
[plugin:ufw] skipped command 'ufw app list': required kmods missing: bpfilter, iptable_filter.
 Running plugins. Please wait ...

  Finishing plugins [Running: logs]
  Finished running plugins
Creating compressed archive...

Your sosreport has been generated and saved in:
 /tmp/sosreport-ubuntulp2host-2024-06-25-cussrcx.tar.xz

 Size 5.99MiB
 Owner root
 sha256 192c04e45142382038adb223d6dc4aa95edc8edf5d37a576cdd2912e71cdd98b

Please send this file to your support representative.

2. As mentioned by Sourabh in the comments above, the command below no longer causes a crash/oops.

cat /proc/fs/nfsd/pool_stats
# pool packets-arrived sockets-enqueued threads-woken threads-timedout
0 0 2 0 0
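
The pool_stats output above pairs a commented header line with one row of counters per pool. As a reading aid, the following minimal sketch labels each value with its column name. It is not part of the original report: the function name label_pool_stats is hypothetical, and it assumes the header layout shown above.

```shell
# label_pool_stats: read pool_stats text on stdin and print each data row
# as name=value pairs, taking the names from the '#' header line.
label_pool_stats() {
    awk '
        # Header line: remember the column names, skipping the leading "#".
        /^#/ { for (i = 2; i <= NF; i++) name[i - 1] = $i; next }
        # Data line: print every field next to its column name.
        NF > 0 {
            for (i = 1; i <= NF; i++)
                printf "%s%s=%s", (i > 1 ? " " : ""), name[i], $i
            print ""
        }
    '
}
```

On a kernel that includes the fix it could be fed directly, as in "label_pool_stats < /proc/fs/nfsd/pool_stats"; on an unpatched kernel that read is exactly what triggers the oops, so use saved output instead.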

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2024-06-25 05:18 EDT-------
please integrate this commit into ubuntu 24.04

Fix is applied to nfsd-next kernel. Likely to hit mainline kernel in next rc.
https://<email address hidden>/

tags: added: architecture-ppc64le bugnameltc-206751 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
affects: linux (Ubuntu) → sosreport (Ubuntu)
Changed in sosreport (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → nobody
Changed in ubuntu-power-systems:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
bugproxy (bugproxy)
tags: added: targetmilestone-inin2404
removed: targetmilestone-inin---
Frank Heimes (fheimes) wrote :

Hello and thanks for having reported this issue.

Once the patch "nfsd: fix oops when reading pool_stats before server is started" is accepted upstream (having it in 'linux-next' is sufficient), we can think about picking it.
Do you know if it will be marked upstream as a stable update, and with that be automatically applied to the mainline 6.8 tree as well?

Btw. we recently had a similar report on sosreport crashing on noble, and the reason was probably kernel 6.8.0-31, since it was resolved with having the latest kernel (6.8.0-38) installed.

Would you mind updating your system to the latest update level and trying again?
Can be done like this:
sudo apt --yes update
sudo apt --yes full-upgrade
sudo reboot
and calling 'sosreport' when the system is up again?
Thanks.

Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in sosreport (Ubuntu):
importance: Undecided → High
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2024-06-27 01:08 EDT-------
Thanks,

Also, see this reference.

https://<email address hidden>/T/#u

I will let the tester test with the latest kernel and update here.

bugproxy (bugproxy) wrote : Whole Console logs captured during performing the below steps

------- Comment on attachment From <email address hidden> 2024-07-01 06:29 EDT-------

Team,

I have updated the kernel using the commands below, as mentioned in comment-21, and executed sosreport. The issue was reproducible with the updated kernel as well.

--- Steps Followed ---
sudo apt --yes update
sudo apt --yes full-upgrade
sudo reboot
sosreport

--- uname -a ---
Linux ubuntulp2host 6.8.0-36-generic #36-Ubuntu SMP Mon Jun 10 11:02:49 UTC 2024 ppc64le ppc64le ppc64le GNU/Linux

--- cat /etc/os-release ---
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Frank Heimes (fheimes) wrote :

Hello Tasmiya, many thanks for retrying with the latest kernel -36.

So we'll then focus on including commit
8e948c365d9c nfsd: fix oops when reading pool_stats before server is started

I just noticed that it's now upstream with v6.10-rc5.

Changed in sosreport (Ubuntu):
status: New → Invalid
Changed in ubuntu-power-systems:
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Patricia Domingues (patriciasd)
description: updated
Patricia Domingues (patriciasd) wrote :

I found that commit 8e948c365d9c "nfsd: fix oops when reading pool_stats before server is started" is not in the Ubuntu 24.04 LTS Noble Numbat tree, so there is no need for commit ac03629b1612 "Revert "nfsd: fix oops when reading pool_stats before server is started"" and I am skipping it. With that, I will only be applying e0011bca603c "nfsd: initialise nfsd_info.mutex early".

Frank Heimes (fheimes)
Changed in linux (Ubuntu):
assignee: nobody → Patricia Domingues (patriciasd)
Patricia Domingues (patriciasd) wrote :

A test build of patched kernel for Noble is available at this PPA: https://launchpad.net/~patriciasd/+archive/ubuntu/noble.kernel-sru.lp2070358

Frank Heimes (fheimes) wrote :

The patch was submitted to the kernel team's mailing list (thx Patricia):
https://lists.ubuntu.com/archives/kernel-team/2024-July/thread.html#152124

Changed in ubuntu-power-systems:
status: New → In Progress
Changed in linux (Ubuntu):
status: New → In Progress
assignee: Patricia Domingues (patriciasd) → Canonical Kernel Team (canonical-kernel-team)
Stefan Bader (smb)
Changed in linux (Ubuntu Noble):
importance: Undecided → High
status: New → Fix Committed
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
description: updated
Changed in linux (Ubuntu):
status: In Progress → Invalid
Changed in sosreport (Ubuntu Noble):
status: New → Invalid