Kernel panic,Oops: 0004 ilc:3 [#1] SMP, iscsi_q_20 iscsi_xmitworker [libiscsi]

Bug #1804149 reported by bugproxy on 2018-11-20
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ubuntu on IBM z Systems
Medium
Unassigned
linux (Ubuntu)
Medium
Unassigned

Bug Description

== Comment: #0 - Michael Finnegan <email address hidden> - 2018-11-19 14:14:40 ==
---Problem Description---
Kernel panic,Oops: 0004 ilc:3 [#1] SMP, iscsi_q_20 iscsi_xmitworker [libiscsi]

---uname output---
Linux ilabg3 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:13:24 UTC 2018 s390x s390x s390x GNU/Linux

Machine Type = IBM 3906 M05 7G4 (z/VM 7.1.0)

---Debugger---
A debugger is not configured

Contact Information = Michael <email address hidden>

Stack trace output:
 dmesg.201811161956
[1363037.322472] Unable to handle kernel pointer dereference in virtual kernel address space
[1363037.322481] Failing address: 0000000000000000 TEID: 0000000000000483
[1363037.322483] Fault in home space mode while using kernel ASCE.
[1363037.322486] AS:0000000000ea0007 R3:00000000f37d0007 S:00000000f37ff000 P:000000000000013d
[1363037.322524] Oops: 0004 ilc:3 [#1] SMP
[1363037.322529] Modules linked in: iptable_filter binfmt_misc rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache qeth_l3 qeth_l2 s390_trng ghash_s390 prng aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 sha_common qeth vmur ccwgroup vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables dm_round_robin dm_service_time crc32_vx_s390 dasd_eckd_mod zfcp qdio scsi_transport_fc dasd_fba_mod dasd_mod scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath
[1363037.322567] CPU: 3 PID: 37970 Comm: kworker/u128:19 Not tainted 4.15.0-36-generic #39-Ubuntu
[1363037.322573] Hardware name: IBM 3906 M05 7G4 (z/VM 7.1.0)
[1363037.322581] Workqueue: iscsi_q_20 iscsi_xmitworker [libiscsi]
[1363037.322583] Krnl PSW : 00000000c8051cf6 00000000e49a28f4 (__iscsi_get_task+0x6/0x18 [libiscsi])
[1363037.322587] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[1363037.322589] Krnl GPRS: 0000000000000000 0000000002923ce0 0000000000000000 0000000000000400
[1363037.322591] 000003ff80277640 0000000000000008 00000000788efc00 000003ff802777cc
[1363037.322592] 000003ff8027769c 0000000000000000 0000000000000000 0000000078ce8310
[1363037.322594] 00000000f3689800 0000000078ce83a2 000003ff80272e8e 0000000002923c90
[1363037.322601] Krnl Code: 000003ff80272624: c0f4fffffcf6 brcl 15,3ff80272010
                            000003ff8027262a: c0f40000377b brcl 15,3ff80279520
                           #000003ff80272630: c00400000000 brcl 0,3ff80272630
                           >000003ff80272636: eb012078006a asi 120(%r2),1
                            000003ff8027263c: c0f400003772 brcl 15,3ff80279520
                            000003ff80272642: 0707 bcr 0,%r7
                            000003ff80272644: 0707 bcr 0,%r7
                            000003ff80272646: 0707 bcr 0,%r7
[1363037.322618] Call Trace:
[1363037.322621] ([<000003ff80272ec6>] iscsi_xmit_task+0x86/0x138 [libiscsi])
[1363037.322625] [<000003ff8027769c>] iscsi_data_xmit+0x44c/0x548 [libiscsi]
[1363037.322636] [<000003ff802777cc>] iscsi_xmitworker+0x34/0x58 [libiscsi]
[1363037.322642] [<00000000001918f2>] process_one_work+0x262/0x4d8
[1363037.322644] [<0000000000191bc0>] worker_thread+0x58/0x4e8
[1363037.322648] [<0000000000198d24>] kthread+0x14c/0x168
[1363037.322652] [<00000000008e3eb2>] kernel_thread_starter+0xa/0x10
[1363037.322654] [<00000000008e3ea8>] kernel_thread_starter+0x0/0x10
[1363037.322655] Last Breaking-Event-Address:
[1363037.322657] [<000003ff80272e88>] iscsi_xmit_task+0x48/0x138 [libiscsi]
[1363037.322658]
[1363037.322660] Kernel panic - not syncing: Fatal exception in interrupt

Oops output:
  Oops: 0004 ilc:3 [#1] SMP

*Additional Instructions for Michael <email address hidden>:
-Post a private note with access information to the machine that the bug is occuring on.
-Attach sysctl -a output output to the bug.

Default Comment by Bridge

tags: added: architecture-s39064 bugnameltc-173376 severity-high targetmilestone-inin---

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → kernel-package (Ubuntu)
bugproxy (bugproxy) on 2018-11-20
tags: added: targetmilestone-inin1804
removed: targetmilestone-inin---
Frank Heimes (fheimes) on 2018-11-20
affects: kernel-package (Ubuntu) → linux (Ubuntu)
Changed in ubuntu-z-systems:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel to see if the bug exists there as well? That kernel is available from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20-rc3

Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
tags: added: kernel-da-key
Download full text (3.7 KiB)

------- Comment From <email address hidden> 2018-11-20 15:44 EDT-------
(In reply to comment #6)
> Did this issue start happening after an update/upgrade?

Host run with latest maintenance updates associated with the installed kernel.

For this case, this issue has been seen twice on the host associated with this defect number

(a) 4.15.0-36-generic #39-Ubuntu - /var/dump/201811161956/dmesg.201811161956
(b) 4.15.0-23-generic #25-Ubuntu - /var/dump/201809121135/dmesg.201809121135

Most recently, provided with this defect number
(a) 4.15.0-36-generic #39-Ubuntu
/var/dump/201811161956
dmesg.201811161956
dump.201811161956

Prior occurrence was as follows
(b) 4.15.0-23-generic #25-Ubuntu
/var/dump/201809121135
dmesg.201809121135
dump.201809121135

Stack trace output:
dmesg.201809121135
[479123.425602] Unable to handle kernel pointer dereference in virtual kernel address space
[479123.425611] Failing address: 0000000000000000 TEID: 0000000000000483
[479123.425614] Fault in home space mode while using kernel ASCE.
[479123.425618] AS:0000000000e9c007 R3:00000000f37d0007 S:00000000f37ff000 P:000000000000013d
[479123.425672] Oops: 0004 ilc:3 [#1] SMP
[479123.425677] Modules linked in: iptable_filter binfmt_misc rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache dasd_eckd_mod qeth_l3 qeth_l2 s390_trng ghash_s390 prng aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 sha_common qeth vmur ccwgroup vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp sunrpc libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables dm_round_robin dm_service_time crc32_vx_s390 dasd_fba_mod dasd_mod zfcp qdio scsi_transport_fc scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath
[479123.425724] CPU: 2 PID: 19195 Comm: kworker/u128:0 Not tainted 4.15.0-23-generic #25-Ubuntu
[479123.425731] Hardware name: IBM 3906 M05 7G4 (z/VM 7.1.0)
[479123.425744] Workqueue: iscsi_q_9 iscsi_xmitworker [libiscsi]
[479123.425746] Krnl PSW : 00000000a880894f 000000000803fc8a (__iscsi_get_task+0x6/0x18 [libiscsi])
[479123.425752] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[479123.425755] Krnl GPRS: 0000000084102000 000000005965fce0 0000000000000000 0000000080000400
[479123.425756] 000003ff801d7640 0000000000000008 00000000cded0000 000003ff801d77cc
[479123.425758] 000003ff801d769c 0000000000000000 0000000000000000 00000000cd0e0310
[479123.425759] 00000000f3689800 00000000cd0e03a2 000003ff801d2e8e 000000005965fc90
[479123.425771] Krnl Code: 000003ff801d2624: c0f4fffffcf6 brcl 15,3ff801d2010
000003ff801d262a: c0f40000377b brcl 15,3ff801d9520
#000003ff801d2630: c00400000000 brcl 0,3ff801d2630
>000003ff801d2636: eb012078006a asi 120(%r2),1
000003ff801d263c: c0f400003772 brcl 15,3ff801d9520
000003ff801d2642: 0707 bcr 0,%r7
000003ff801d2644: 0707 bcr 0,%r7
000003ff801d2646: 0707 bcr 0,%r7
[479123.425804] Call Trace:
[479123.425808] ([<000003ff801d2ec6>] iscsi_xmit_task+0x86/0x138 [libiscsi])
[479123.425810] [<000003ff801d769c>] iscsi_data_xmit+0x44c/0x548 [libiscsi]
[479...

Read more...

Joseph Salisbury (jsalisbury) wrote :

Were you able to test the mainline kernel to see if this bug still exists there?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-11-28 13:52 EDT-------
(In reply to comment #9)
> Were you able to test the mainline kernel to see if this bug still exists
> there?

Our testing typically requires us to be on released kernels to support release schedules. Since this issue is very intermittent (seen on this host on 201809121135 and 201811161956) we can't stay on a mainline kernel for 2+ months and also continue testing on released kernels to support release schedules

------- Comment From <email address hidden> 2018-11-28 13:54 EDT-------
Similar Stack trace output also seen on another host running on older kernel

i.e. 4.15.0-15-generic #16-Ubuntu"

Reference Bug 169770
https://bugzilla.linux.ibm.com/show_bug.cgi?id=169770
Bug 169770 - LP1782085 : Kernel panic,iSCSI,Unable to handle kernel pointer dereference in virtual kernel address space

Joseph Salisbury (jsalisbury) wrote :

In bug 1782085 it looks like the bug stopped happening after upgrading to version 4.15.0-23. Do you know if that is still the case with that older system, or has the bug come back after upgrading to a newer kernel version?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-12-04 21:05 EDT-------
(In reply to comment #13)
> In bug 1782085 it looks like the bug stopped happening after upgrading to
> version 4.15.0-23. Do you know if that is still the case with that older
> system, or has the bug come back after upgrading to a newer kernel version?

System associated with older 1782085 has been up 47 days on kernel 4.15.0-36-generic with no Oops Kernel panic

i.e.

root@ilzlnx4:~# date;uptime;uname -a
Tue Dec 4 18:59:53 MST 2018
18:59:53 up 47 days, 6:41, 6 users, load average: 8.21, 7.63, 7.02
Linux ilzlnx4 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:13:24 UTC 2018 s390x s390x s390x GNU/Linux
root@ilzlnx4:~#

History of prior kernels on older 1782085 system

root@ilzlnx4:~# ls -lrt /boot | grep -i initrd
-rw-r--r-- 1 root root 12520274 May 11 2017 initrd.img-4.4.0-59-generic.bk
-rw-r--r-- 1 root root 12533717 Jan 8 2018 initrd.img-4.4.0-93-generic
-rw-r--r-- 1 root root 12533595 Jan 8 2018 initrd.img-4.4.0-89-generic
-rw-r--r-- 1 root root 12531125 Jan 8 2018 initrd.img-4.4.0-77-generic
-rw-r--r-- 1 root root 12525524 Jan 8 2018 initrd.img-4.4.0-59-generic
-rw-r--r-- 1 root root 12533634 Jan 8 2018 initrd.img-4.4.0-96-generic
-rw-r--r-- 1 root root 12537400 Jan 9 2018 initrd.img-4.4.0-104-generic
-rw-r--r-- 1 root root 13971222 Mar 14 2018 initrd.img-4.15.0-10-generic.bk
-rw-r--r-- 1 root root 13998457 Jul 17 11:39 initrd.img-4.15.0-10-generic
-rw-r--r-- 1 root root 8770182 Jul 17 11:39 initrd.img-4.4.0-112-generic
-rw-r--r-- 1 root root 14291350 Jul 17 11:39 initrd.img-4.15.0-15-generic
-rw-r--r-- 1 root root 14360370 Oct 17 01:11 initrd.img-4.15.0-23-generic
lrwxrwxrwx 1 root root 28 Oct 18 12:16 initrd.img.old -> initrd.img-4.15.0-23-generic
lrwxrwxrwx 1 root root 28 Oct 18 12:16 initrd.img -> initrd.img-4.15.0-36-generic
-rw-r--r-- 1 root root 14321575 Oct 18 12:16 initrd.img-4.15.0-36-generic
root@ilzlnx4:~#

Frank Heimes (fheimes) wrote :

Not sure if I understand the overall situation correctly, but:

In the bug description I read that the issue occurs with 4.15.0-36-generic,
and in comment 4 its also mentioned that it happens with 4.15.0-36-generic and 4.15.0-23-generic
and again in comment 6 also on 4.15.0-15-generic (I assume all on 16.04).

But in comment 8 (according to LP 1782085) the issue doesn't happen (at least for the last 47 days) on a kernel 4.15.0-36-generic (on that particular system) - but this is the same kernel level like the one mentioned in the bug description?!
Either there is something wrong with the mentioned kernel levels, or it doesn't depend on the kernel, but for example on the infrastructure?!

LP 1782085 describes a situation on an LPAR system, but LP 1804149 on a zVM 7.1 guest system.
Are there other differences between both systems?
Can it have something todo with the underlying hypervisor?

------- Comment on attachment From <email address hidden> 2018-12-11 18:14 EDT-------

Ubuntu guest hosted on z/VM 7.1.0 reported another Kernel panic Oops: 0004 ilc:3 [#1] SMP on Dec 8 13:57:50 ilabg3 kernel: [673809.233056] (MST)

dmesg.ilabg3.201812081359 attached
/var/log/syslog.3, renamed as syslog.ilabg3.3.txt attached
Remote offloaded syslog file attached for 20181208 Filename: ilabg3.remote.syslog.20181208.txt

root@ilabg3:/var/dump/201812081359# uname -a
Linux ilabg3 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:13:24 UTC 2018 s390x s390x s390x GNU/Linux
root@ilabg3:/var/dump/201812081359# ls -l
total 151024
-rw------- 1 root root 376789 Dec 8 13:59 dmesg.201812081359
-rw------- 1 root root 154267627 Dec 8 13:59 dump.201812081359
root@ilabg3:/var/dump/201812081359#

bugproxy (bugproxy) wrote : syslog.3

------- Comment on attachment From <email address hidden> 2018-12-11 18:16 EDT-------

/var/log/syslog.3, renamed as syslog.ilabg3.3.txt attached

------- Comment on attachment From <email address hidden> 2018-12-11 18:17 EDT-------

Remote offloaded syslog file attached for 20181208
Filename: ilabg3.remote.syslog.20181208.txt

Dec 8 13:57:50 ilabg3 kernel: [673809.233056] Unable to handle kernel pointer dereference in virtual kernel address space
Dec 8 13:57:50 ilabg3 kernel: [673809.233064] Failing address: 0000000000000000 TEID: 0000000000000483
Dec 8 13:57:50 ilabg3 kernel: [673809.233066] Fault in home space mode while using kernel ASCE.
Dec 8 13:57:50 ilabg3 kernel: [673809.233069] AS:0000000000ea0007 R3:00000000f37d0007 S:00000000f37ff000 P:000000000000013d
Dec 8 13:57:50 ilabg3 kernel: [673809.233103] Oops: 0004 ilc:3 [#1] SMP

Frank Heimes (fheimes) wrote :

Having a look at the new logs it shows that there are many different disks and disk systems in use.
I see DASDs, that get partly unhappy, like:
Dec 8 07:43:29 ilabg3 kernel: [651348.534047] dasd(eckd): I/O status report for device 0.0.248b:
Dec 8 07:43:29 ilabg3 kernel: [651348.534047] dasd(eckd): in req: 000000006004d11c CC:00 FC:01 AC:00 SC:01 DS:00 CS:00 RC:0
Dec 8 07:43:29 ilabg3 kernel: [651348.534047] dasd(eckd): device 0.0.248b: Failing CCW: (null)
Dec 8 07:43:29 ilabg3 kernel: [651348.534047] dasd(eckd): SORRY - NO VALID SENSE AVAILABLE
Dec 8 07:43:29 ilabg3 kernel: [651348.534072] dasd(eckd): Related CP in req: 000000006004d11c
Dec 8 07:43:29 ilabg3 kernel: [651348.534072] dasd(eckd): CCW 00000000584ecfe4: 2740000C 7FE86BC0 DAT: 18000000 00000e00 00000000
Dec 8 07:43:29 ilabg3 kernel: [651348.534072] dasd(eckd): CCW 00000000a235ae5d: 3E000200 7FEDD400 DAT: 00000000 00000000 00000000
(of course not related to the reported issue)
I see 'local' scsi disks:
scsi 3:0:12:0: Direct-Access IBM FlashSystem-9840 1600 PQ: 0 ANSI: 5
and I see XIV Storage as well.

I noticed a lot of 'Power-on or device reset's, as well as many 'LUN scan's and multipath path failures on a single day on that environment, like:
sd 9:0:0:1: tag#58 Add. Sense: Power on, reset, or bus device reset occurred
...
Dec 8 14:01:07 ilabg3 kernel: [ 2.127745] sd 3:0:0:0: scsi scan: REPORT LUN scan
...
[458582.527232] device-mapper: multipath: Failing path 66:112.
[458582.527234] print_req_error: I/O error, dev sdan, sector 15001576
That seems to be very odd and points to an issue with the SAN infrastructure and/or setup...

The error messages that belong to the Ops seem to be known and points to a potential race discussed here:
[RFC,1/1] libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task
https://patchwork.kernel.org/patch/10501773/
Well, the situation above may even favor a race condition like this ...

------- Comment From <email address hidden> 2019-01-10 21:11 EDT-------
In response to
>The error messages that belong to the Ops seem to be known and points to a potential race discussed here:
>[RFC,1/1] libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task
>https://patchwork.kernel.org/patch/10501773/

Is there a target date when this race condition will have a fix available to the ubuntu distro ?
as this issue has been seen on at lease two different hosts across three (3) different kernels.
4.15.0-15-generic #16-Ubuntu ilzlnx4
4.15.0-23-generic #25-Ubuntu ilzlnx4,ilabg3
4.15.0-36-generic #39-Ubuntu ilabg3

Most recently seen again on Dec 20 17:12 (MST)

root@ilabg3:/var/dump/201812201711# ls -lrt
total 140732
-rw------- 1 root root 143650157 Dec 20 17:12 dump.201812201711
-rw------- 1 root root 452204 Dec 20 17:12 dmesg.201812201711

Reference attached files
dmesg.201812201711.txt
/var/log/kern.log.3.txt

------- Comment on attachment From <email address hidden> 2019-01-10 21:09 EDT-------

dmesg.201812201711.txt file attached.

------- Comment on attachment From <email address hidden> 2019-01-10 21:10 EDT-------

kern.log.3.txt file attached

------- Comment From <email address hidden> 2019-03-05 10:50 EDT-------
@Canonical: Any update here. When will the test kernel via PPA be made available? Many thanks in advance

Juerg Haefliger (juergh) wrote :

Can you please give this test kernel a try:
https://kernel.ubuntu.com/~juergh/lp1804149/

------- Comment on attachment From <email address hidden> 2018-12-11 18:17 EDT-------

Remote offloaded syslog file attached for 20181208
Filename: ilabg3.remote.syslog.20181208.txt

Dec 8 13:57:50 ilabg3 kernel: [673809.233056] Unable to handle kernel pointer dereference in virtual kernel address space
Dec 8 13:57:50 ilabg3 kernel: [673809.233064] Failing address: 0000000000000000 TEID: 0000000000000483
Dec 8 13:57:50 ilabg3 kernel: [673809.233066] Fault in home space mode while using kernel ASCE.
Dec 8 13:57:50 ilabg3 kernel: [673809.233069] AS:0000000000ea0007 R3:00000000f37d0007 S:00000000f37ff000 P:000000000000013d
Dec 8 13:57:50 ilabg3 kernel: [673809.233103] Oops: 0004 ilc:3 [#1] SMP

Juerg Haefliger (juergh) wrote :

Can you please test the test kernel at:
https://kernel.ubuntu.com/~juergh/lp1804149/

------- Comment From <email address hidden> 2019-04-02 14:10 EDT-------
(In reply to comment #29)
> Can you please test the test kernel at:
> https://kernel.ubuntu.com/~juergh/lp1804149/

I am waiting for the configuration to complete current testing.
I hope to be able to test the test kernel within the next 2-3 weeks

Frank Heimes (fheimes) on 2019-04-02
Changed in ubuntu-z-systems:
status: Triaged → Incomplete
Juerg Haefliger (juergh) wrote :

IBM, have you had any luck with the test kernel?

Frank Heimes (fheimes) wrote :

After discussion with IBM this will now de-prioritized to Medium.

Frank Heimes (fheimes) on 2019-06-24
Changed in ubuntu-z-systems:
importance: High → Medium
Manoj Iyer (manjo) on 2019-06-24
Changed in linux (Ubuntu):
importance: High → Medium
Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc
Frank Heimes (fheimes) on 2019-11-06
Changed in linux (Ubuntu):
assignee: Skipper Bug Screeners (skipper-screen-team) → nobody
Changed in ubuntu-z-systems:
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody
Changed in linux (Ubuntu):
status: Triaged → Incomplete
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-01-13 16:40 EDT-------
This defect can be closed.

The systems are now running 4.15.0-65-generic and there have been no new reportings of 'Oops: 0004 ilc:3 [#1] SMP' Workqueue: iscsi_q_1 iscsi_xmitworker [libiscsi].

Last Oops panic were reported against kernel 4.15.0-47-generic

Frank Heimes (fheimes) wrote :

@finnegan thanks for the feedback - closing this ticket according to comment #24.

Changed in linux (Ubuntu):
status: Incomplete → Invalid
Changed in ubuntu-z-systems:
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.