[UBUNTU 20.04] LPAR becomes unresponsive after the Kernel panic - rq_qos_wake_function

Bug #1929923 reported by bugproxy
This bug affects 1 person
Affects                   Status    Importance   Assigned to             Milestone
Ubuntu on IBM z Systems   Invalid   Critical     Skipper Bug Screeners
linux (Ubuntu)            Invalid   Undecided    Skipper Bug Screeners

Bug Description

---Problem Description---
kernel panic rq_qos_wake_function

---uname output---
Linux version 5.4.0-71-generic

Machine Type = s390x

---Debugger---
A debugger is not configured

Stack trace output:
 May 15 20:21:04 data1 kernel: Call Trace:
May 15 20:21:04 data1 kernel: ([<000000234091e670>] 0x234091e670)
May 15 20:21:04 data1 kernel: [<0000003e10047e3a>] rq_qos_wake_function+0x8a/0xa0
May 15 20:21:04 data1 kernel: [<0000003e0fbec482>] __wake_up_common+0xa2/0x1b0
May 15 20:21:04 data1 kernel: [<0000003e0fbec984>] __wake_up_common_lock+0x94/0xe0
May 15 20:21:04 data1 kernel: [<0000003e0fbec9fa>] __wake_up+0x2a/0x40
May 15 20:21:04 data1 kernel: [<0000003e1005ee70>] wbt_done+0x90/0xe0
May 15 20:21:04 data1 kernel: [<0000003e10047f42>] __rq_qos_done+0x42/0x60
May 15 20:21:04 data1 kernel: [<0000003e10033cb0>] blk_mq_free_request+0xe0/0x140
May 15 20:21:04 data1 kernel: [<0000003e101d46f0>] dm_softirq_done+0x140/0x230
May 15 20:21:04 data1 kernel: [<0000003e100326c0>] blk_done_softirq+0xc0/0xe0
May 15 20:21:04 data1 kernel: [<0000003e103fc084>] __do_softirq+0x104/0x360
May 15 20:21:04 data1 kernel: [<0000003e0fb9da1e>] irq_exit+0x9e/0xc0
May 15 20:21:04 data1 kernel: [<0000003e0fb28ae8>] do_IRQ+0x78/0xb0
May 15 20:21:04 data1 kernel: [<0000003e103fb588>] ext_int_handler+0x130/0x134
May 15 20:21:04 data1 kernel: [<0000003e101d4416>] dm_mq_queue_rq+0x36/0x1d0
May 15 20:21:04 data1 kernel: Last Breaking-Event-Address:
May 15 20:21:04 data1 kernel: [<0000003e0fbce75e>] wake_up_process+0xe/0x20
May 15 20:21:04 data1 kernel: Kernel panic - not syncing: Fatal exception in interrupt

Oops output:
 no

System Dump Info:
  The system was configured to capture a dump, however a dump was not produced.

-Attach sysctl -a output to the bug.

bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-192966 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → linux (Ubuntu)
bugproxy (bugproxy)
tags: added: severity-critical
removed: severity-high
Andrew Cloke (andrew-cloke) wrote :

Is it possible to describe the steps required to reproduce this issue? And the environment in which it occurred?
Thanks!

Changed in ubuntu-z-systems:
importance: Undecided → Critical
Pedro Principeza (pprincipeza) wrote :

Greetings!

Aside from Andrew's last request, I see that a very similar issue was discussed on LP# 1881109 [0] but, at that time, tests with a later kernel didn't reproduce the issue.

You're running 5.4.0-71 there. Have you been able to reproduce this using either 5.4.0-73 or 5.4.0-74 (the latter in Proposed, only)?

Thanks!

[0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1881109

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Frank Heimes (fheimes) wrote :

Well, as already mentioned in the two previous comments, more details are needed:

If a system shows kernel issues but is not on the latest kernel level, the first step is to update it to the latest kernel level and try to recreate the issue there.
So I agree with Pedro that this needs to be verified on the (currently) latest 5.4.0-73, especially since 5.4.0-73 includes hundreds of upstream stable patches, covering the range from v5.4.102 to v5.4.106.

Testing on 5.4.0-74 (currently in proposed) would be the next crucial step, since it includes even more upstream stable patches, ranging from v5.4.107 to v5.4.114; again hundreds of patches.

This is needed to be sure that we don't hunt a bug that may already have been fixed.

If the issue is reproducible on these kernel levels, we need more details on the environment:
- which IBM Z or LinuxONE system is in use?
- which storage backend is attached and used?
- is zFCP/SCSI used or DASDs?
- and what is the dump device and how did it get configured?
(again, detailed steps to reproduce, like Andrew asked for)

Changed in ubuntu-z-systems:
status: New → Incomplete
Frank Heimes (fheimes) wrote :

Updating status to 'Invalid' due to inactivity.

Changed in linux (Ubuntu):
status: New → Invalid
Changed in ubuntu-z-systems:
status: Incomplete → Invalid
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2021-08-26 13:23 EDT-------
Problem could not be reproduced by Canonical. The further detailed information requested by Canonical has not been provided to them since May.
In the meantime, the new point release Ubuntu Server LTS 20.04.3 has become available. This means that 20.04.2, which the bug was opened against, has become obsolete.
Therefore, closing / rejecting the bug.

Changing
Status:->REJECTED (UNREPRODUCIBLE)

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2021-09-16 04:22 EDT-------
Hi, we have hit the same problem:

vmcore/dvtc2b-2.gpfs.net_202106240332/dmesg.202106240332:
...
[645967.289658] Unable to handle kernel pointer dereference in virtual kernel address space
[645967.289665] Failing address: 001ffc004bf14000 TEID: 001ffc004bf14403
[645967.289668] Fault in home space mode while using kernel ASCE.
[645967.289671] AS:00000001d839c00b R2:000000038bbec00b R3:00000003010c0007 S:0000000302260000 P:0000000000000400
[645967.289715] Oops: 0011 ilc:2 [#1] SMP
[645967.289721] Modules linked in: mmfs26(OE) mmfslinux(OE) tracedev(OE) nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache 8021q garp mrp stp llc bonding binfmt_misc dm_service_time dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkey zcrypt s390_trng ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common chsc_sch eadm_sch vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio sch_fq_codel drm drm_panel_orientation_quirks i2c_core sunrpc ip_tables x_tables btrfs zstd_compress zlib_deflate raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 linear crc32_vx_s390 zfcp scsi_transport_fc qeth_l2 dasd_eckd_mod dasd_mod qeth qdio ccwgroup [last unloaded: tracedev]
[645967.289791] CPU: 4 PID: 1891047 Comm: kgnrdwr_dvtc2b Kdump: loaded Tainted: G OE 5.4.0-74-generic #83-Ubuntu
[645967.289795] Hardware name: IBM 3906 M05 710 (LPAR)
[645967.289798] Krnl PSW : 0404e00180000000 00000001d73e20ce (try_to_wake_up+0x4e/0x700)
[645967.289809] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[645967.289814] Krnl GPRS: 0000000370d32488 001ffc0000000000 001ffc0000000005 0000000000000003
[645967.289817] 0000000000000000 ffffffff00000005 041ffbff80bcb9e0 0000000000000003
[645967.289858] 0000000000000003 001ffc004bf141bc 0000000000000000 001ffc004bf13878
[645967.289860] 0000000095190000 00000001d7c1aa40 001ffbff80bcba10 001ffbff80bcb990
[645967.289872] Krnl Code: 00000001d73e20c2: 41902944 la %r9,2372(%r2)
00000001d73e20c6: 582003ac l %r2,940
#00000001d73e20ca: a7180000 lhi %r1,0
>00000001d73e20ce: ba129000 cs %r1,%r2,0(%r9)
00000001d73e20d2: a77401c9 brc 7,00000001d73e2464
00000001d73e20d6: e310b0080004 lg %r1,8(%r11)
00000001d73e20dc: b9800018 ngr %r1,%r8
00000001d73e20e0: a774001f brc 7,00000001d73e211e
[645967.289894] Call Trace:
[645967.289899] ([<0000000000000000>] 0x0)
[645967.289906] [<00000001d785c83a>] rq_qos_wake_function+0x8a/0xa0
[645967.289913] [<00000001d74004c2>] __wake_up_common+0xa2/0x1b0
[645967.289915] [<00000001d74009c4>] __wake_up_common_lock+0x94/0xe0
[645967.289918] [<00000001d7400a3a>] __wake_up+0x2a/0x40
[645967.289923] [<00000001d7873870>] wbt_done+0x90/0xe0
[645967.289925] [<00000001d785c942>] __rq_qos_done+0x42/0x60
[645967.289928] [<00000001d78486c0>] blk_mq_free_request+0xe0/0x140
[645967.289949] [<001fffff801bf18a>] dasd_request_done+0x2a/0x40 [dasd_mod]
[645967.28995...


Frank Heimes (fheimes) wrote :

Without further details there is no guarantee that this is caused by GPFS (in such cases a dump analysis is usually needed to find the root cause).
But scanning this log message for gpfs and mmfs tells me that GPFS is installed,
hence its user space daemon (mmfsd) is running and (more importantly) the kernel module (mmfslinux) is active.
Third-party kernel modules can cause issues, since they are usually developed against a certain kernel version and tend to start failing as the kernel evolves; hence the kernel is marked as "Tainted".
So this tells me that the kernel/module combination that's running is not entirely pristine Ubuntu.
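
As a rough illustration of the scan described above, the following Python sketch picks out the modules that carry taint flags in a "Modules linked in:" line. The helper name flagged_modules is hypothetical, the sample is a shortened copy of the module list from the dmesg excerpt, and only the two flag letters seen in this report (O and E) are decoded.

```python
import re

# Taint letters the kernel appends to a module name in the
# "Modules linked in:" line. Only the letters seen in this report are listed.
MODULE_TAINT_FLAGS = {
    "O": "out-of-tree module",
    "E": "unsigned module",
}


def flagged_modules(modules_line):
    """Return {module_name: [flag descriptions]} for modules with taint flags."""
    # Keep only the text after the "Modules linked in:" prefix.
    _, _, listing = modules_line.partition("Modules linked in:")
    # Drop the trailing "[last unloaded: ...]" note, if present.
    listing = listing.split("[last unloaded", 1)[0]
    result = {}
    for token in listing.split():
        match = re.fullmatch(r"([\w-]+)\(([A-Z]+)\)", token)
        if match:
            name, flags = match.groups()
            result[name] = [
                MODULE_TAINT_FLAGS.get(flag, "unknown flag " + flag)
                for flag in flags
            ]
    return result


# Shortened copy of the module list from the dmesg excerpt in this report.
sample = (
    "[645967.289721] Modules linked in: mmfs26(OE) mmfslinux(OE) "
    "tracedev(OE) nfsv3 dm_multipath zfcp dasd_eckd_mod dasd_mod "
    "[last unloaded: tracedev]"
)

for name, reasons in flagged_modules(sample).items():
    print(name, "->", ", ".join(reasons))
```

Modules without a parenthesised suffix (nfsv3, zfcp, dasd_mod and so on) are skipped, so the result lists only the modules that contribute to the "OE" taint.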

We usually ask to recreate this in a pristine Ubuntu environment (not having a tainted kernel).

In addition, I see the gpfs module listed in the call trace again (in combination with ksys_lseek), which is another indicator:
[645967.290048] ([<001fffff80a8d2ca>] gpfs_f_llseek+0x4a/0x280 [mmfslinux])
[645967.290053] [<00000001d75f5ed2>] ksys_lseek
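
A minimal Python sketch of the same check: it collects the call-trace frames that the kernel attributes to a loadable module (the trailing "[module]" tag). The names FRAME_RE and module_frames are hypothetical, and the sample lines are copied from the traces quoted in this report.

```python
import re

# Matches a call-trace frame such as
#   [<001fffff801bf18a>] dasd_request_done+0x2a/0x40 [dasd_mod]
# The trailing "[module]" tag is optional and only appears for symbols
# that live in a loadable module.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>[\w-]+)\])?"
)


def module_frames(trace_lines):
    """Return (symbol, module) pairs for frames that belong to a module."""
    frames = []
    for line in trace_lines:
        match = FRAME_RE.search(line)
        if match and match.group("module"):
            frames.append((match.group("symbol"), match.group("module")))
    return frames


# Lines copied from the call traces quoted in this report.
trace = [
    "[645967.289928] [<00000001d78486c0>] blk_mq_free_request+0xe0/0x140",
    "[645967.289949] [<001fffff801bf18a>] dasd_request_done+0x2a/0x40 [dasd_mod]",
    "[645967.290048] ([<001fffff80a8d2ca>] gpfs_f_llseek+0x4a/0x280 [mmfslinux])",
]

print(module_frames(trace))
```

Note that in-tree modules such as dasd_mod are reported too, so the result still has to be cross-checked against the taint flags in the module list to tell third-party modules apart.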

The error/crash message provides just an overview - a look at the full logs might be helpful.

Was the folder /var/crash checked for any content as well as the GPFS logs (/var/adm/ras)?

And as already stated, it needs to be ensured that the system is on the latest (supported) level.
If that's not the case, the situation needs to be recreated on the latest, and hence currently supported, level.
Running "sudo apt update && apt-cache policy linux-generic"
will show your current kernel ("Installed"), but also where you should be ("Candidate").
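
For anyone who wants to script that comparison, here is a small Python sketch that reads the "Installed" and "Candidate" lines from apt-cache policy output. The function names parse_policy and needs_update are hypothetical, and the version strings in the sample are made up for illustration; they are not taken from a real system.

```python
import re


def parse_policy(policy_text):
    """Extract the 'Installed' and 'Candidate' versions from policy output."""
    versions = {}
    for line in policy_text.splitlines():
        match = re.match(r"\s*(Installed|Candidate):\s*(\S+)", line)
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


def needs_update(policy_text):
    """True when the package is installed but differs from the candidate."""
    versions = parse_policy(policy_text)
    installed = versions.get("Installed")
    candidate = versions.get("Candidate")
    return installed not in (None, "(none)") and installed != candidate


# Illustrative output only: the version strings below are hypothetical.
sample = """\
linux-generic:
  Installed: 5.4.0.71.74
  Candidate: 5.4.0.73.76
  Version table:
"""

print(parse_policy(sample), needs_update(sample))
```

On a real system the text would come from running the command above and capturing its output; the sketch only covers the parsing step.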

(This, btw., would also apply to GPFS itself, i.e. being on the latest level:
http://files.gpfsug.org/presentations/2017/NERSC/GPFS-Troubleshooting-Apr-2017.pdf #22)

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2021-10-22 06:55 EDT-------
I have talked with our team and we have not seen the problem on the latest supported Ubuntu level yet. So, from our perspective, you can cancel the bug.

Frank Heimes (fheimes) wrote :

Okay, thanks Aleksandra for having a look at an env. with the latest level
and for providing feedback on this!
(so this bug is closed)

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2021-10-22 14:34 EDT-------
Problem doesn't occur on the latest supported Ubuntu level, hence closing the bug.
BZ status change:->REJECTED / UNREPRODUCIBLE

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2021-10-25 11:37 EDT-------
Closing the bug
IBM BZ status change: REJECTED (UNREPRODUCIBLE) -> CLOSED
