Ubuntu
linux package

ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3: Host crashed & enters into xmon after moving to 4.15.0-15.16 kernel

Bug #1762844 reported by bugproxy on 2018-04-10

This bug affects 1 person

	Status	Importance	Assigned to
The Ubuntu-power-systems project	Fix Released	Critical	Canonical Kernel Team
linux (Ubuntu)	Fix Released	Critical	Canonical Kernel Team
Bionic	Fix Released	Critical	Canonical Kernel Team

Bug Description

Problem Description:
===================
Host crashed & enters into xmon after updating to 4.15.0-15.16 kernel kernel.

Steps to re-create:
==================

1. boslcp3 is up with BMC:118 & PNOR: 20180330 levels
2. Installed boslcp3 with latest kernel
4.15.0-13-generic
3. Enabled "-proposed" kernel in /etc/apt/sources.list file
4. Ran sudo apt-get update & apt-get upgrade

5. root@boslcp3:~# ls /boot
abi-4.15.0-13-generic retpoline-4.15.0-13-generic
abi-4.15.0-15-generic retpoline-4.15.0-15-generic
config-4.15.0-13-generic System.map-4.15.0-13-generic
config-4.15.0-15-generic System.map-4.15.0-15-generic
grub vmlinux
initrd.img vmlinux-4.15.0-13-generic
initrd.img-4.15.0-13-generic vmlinux-4.15.0-15-generic
initrd.img-4.15.0-15-generic vmlinux.old
initrd.img.old

6. Rebooted & booted with 4.15.0-15 kernel
7. Enabled xmon by editing file "vi /etc/default/grub" and ran update-grub
8. Rebooted host.
9. Booted with 4.15.0-15 & provided root/password credentials in login prompt

10. Host crashed & enters into XMON state with 'Unable to handle kernel paging request'

root@boslcp3:~# [ 66.295233] Unable to handle kernel paging request for data at address 0x8882f6ed90e9151a
[ 66.295297] Faulting instruction address: 0xc00000000038a110
cpu 0x50: Vector: 380 (Data Access Out of Range) at [c00000000692f650]
    pc: c00000000038a110: kmem_cache_alloc_node+0x2f0/0x350
    lr: c00000000038a0fc: kmem_cache_alloc_node+0x2dc/0x350
    sp: c00000000692f8d0
   msr: 9000000000009033
   dar: 8882f6ed90e9151a
  current = 0xc00000000698fd00
  paca = 0xc00000000fab7000 softe: 0 irq_happened: 0x01
    pid = 1762, comm = systemd-journal
Linux version 4.15.0-15-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 (Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 (Ubuntu 4.15.0-15.16-generic 4.15.15)
enter ? for help
[c00000000692f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[c00000000692f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
[c00000000692f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
[c00000000692fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
[c00000000692fae0] c000000000c5705c unix_dgram_sendmsg+0x15c/0x8f0
[c00000000692fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
[c00000000692fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
[c00000000692fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
[c00000000692fe30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c00 (System Call) at 000074826f6fa9c4
SP (7ffff5dc5510) is in userspace
50:mon>
50:mon>

10. Attached Host console logs

I rebooted the host just to see if it would hit the issue again and this time I didn't even get to the login prompt but it crashed in the same location:

50:mon> r
R00 = c000000000389fd4 R16 = c000200e0b20fdc0
R01 = c000200e0b20f8d0 R17 = 0000000000000048
R02 = c0000000016eb400 R18 = 000000000001fe80
R03 = 0000000000000001 R19 = 0000000000000000
R04 = 0048ca1cff37803d R20 = 0000000000000000
R05 = 0000000000000688 R21 = 0000000000000000
R06 = 0000000000000001 R22 = 0000000000000048
R07 = 0000000000000687 R23 = 4882d6e3c8b7ab55
R08 = 48ca1cff37802b68 R24 = c000200e5851df01
R09 = 0000000000000000 R25 = 8882f6ed90e67454
R10 = 0000000000000000 R26 = c000000000b2ec6c
R11 = c000000000d10f78 R27 = c000000ff901ee00
R12 = 0000000000002000 R28 = ffffffffffffffff
R13 = c00000000fab7000 R29 = 00000000015004c0
R14 = c000200e4c973fc8 R30 = c000200e5851df01
R15 = c000200e4c974238 R31 = c000000ff901ee00
pc = c00000000038a110 kmem_cache_alloc_node+0x2f0/0x350
cfar= c000000000016e1c arch_local_irq_restore+0x1c/0x90
lr = c00000000038a0fc kmem_cache_alloc_node+0x2dc/0x350
msr = 9000000000009033 cr = 28002844
ctr = c00000000061e1b0 xer = 0000000000000000 trap = 380
dar = 8882f6ed90e67454 dsisr = c000200e40bd8400
50:mon> t
[c000200e0b20f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[c000200e0b20f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
[c000200e0b20f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
[c000200e0b20fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
[c000200e0b20fae0] c000000000c56ae4 unix_stream_sendmsg+0x264/0x5c0
[c000200e0b20fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
[c000200e0b20fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
[c000200e0b20fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
[c000200e0b20fe30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 00007d16a993a940
SP (7ffffbee2270) is in userspace

Mirroring to Canonical to advise them that this might be possible regression. Didn't see any obvious changes in this area in the changelog published at https://launchpad.net/ubuntu/+source/linux/4.15.0-15.16 but it would be good to have Canonical help reviewing the deltas as we try to isolate this further.

Tags:

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-10: boslcp3 host console logs_4.15.0-15 kernel

boslcp3 host console logs_4.15.0-15 kernel Edit (119.6 KiB, text/plain)

Default Comment by Bridge

tags:	added: architecture-ppc64le bugnameltc-166588 severity-critical targetmilestone-inin1804
Changed in ubuntu:
assignee:	nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects:	ubuntu → linux (Ubuntu)

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-11: Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-04-10 21:32 EDT-------
According to test they have another bostonLC (boslcp4) and they did update to this kernel and system is booting up normally.
root@boslcp4:~# uname -a
Linux boslcp4 4.15.0-15-generic #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp4:~# date
Tue Apr 10 16:37:37 CDT 2018
root@boslcp4:~# uptime
16:37:38 up 40 min, 2 users, load average: 0.00, 0.03, 0.12

Additionally, I rebooted the system a third time to add the slub_debug=FZ kernel option and system booted to the login and I logged in successfully. I did it a fourth time and it succeeded again.

root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-15-generic #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# cat /proc/cmdline
root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro xmon=on splash quiet slub_debug=FZ

Strange.

Revision history for this message

Frank Heimes (fheimes) wrote on 2018-04-11:

Can you test again on a third system?
Can this be a hw problem on the first system?

Changed in ubuntu-power-systems:
status:	New → Triaged
importance:	Undecided → Critical
assignee:	nobody → Canonical Kernel Team (canonical-kernel-team)
tags:	added: triage-g
Changed in ubuntu-power-systems:
status:	Triaged → Incomplete

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-11:

------- Comment From <email address hidden> 2018-04-11 03:14 EDT-------
(In reply to comment #16)
> Can you test again on a third system?
> Can this be a hw problem on the first system?

No. This cannot he an hardware issue, since we are running fine on the same system from last 4 months with multiple kernel updates.

And the system is back up again automatically on 3rd & 4th reboot. So the underlying problem still reside

Frank Heimes (fheimes) on 2018-04-11

Changed in ubuntu-power-systems:
status:	Incomplete → Triaged

Joseph Salisbury (jsalisbury) on 2018-04-11

Changed in linux (Ubuntu):
importance:	Undecided → Critical
status:	New → Triaged
tags:	added: kernel-key

Revision history for this message

Joseph Salisbury (jsalisbury) wrote on 2018-04-11:

Can you see if the bug happens with and of these mainline kernels? We can perform a kernel bisect if we can narrow down to the last good kernel version and first bad one:

v4.14 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14/
v4.15-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc1/
v4.15-rc4: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc4/
v4.15 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/

You don't have to test every kernel, just up until the kernel that first has this bug.

Thanks in advance!

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-12:

------- Comment From <email address hidden> 2018-04-12 01:24 EDT-------
(In reply to comment #18)
> Can you see if the bug happens with and of these mainline kernels? We can
> perform a kernel bisect if we can narrow down to the last good kernel
> version and first bad one:
>
> v4.14 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14/
> v4.15-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc1/
> v4.15-rc4: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc4/
> v4.15 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/
>
> You don't have to test every kernel, just up until the kernel that first has
> this bug.
>
> Thanks in advance!

We need to make progress in testing other firmware and guest issues. We will come back to this later.

Meanwhile, the problem happened again today with the reboot and we tried to collect the vmcore using 'X', but it did not collect. Indira, pls add those details.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-12:

------- Comment From <email address hidden> 2018-04-12 11:10 EDT-------
Hi,

Today i have tried rebooting boslcp3 system and crash issue recreated.

For first attempt, after rebooting host it booted with latest kernel & i have attempted disable stop4, 5 commands then it immediately crashed & enters into xmon with similar stack trace as reported in the bug(recreation steps). Tried to take dump from xmon prompt using 'X' option , it did not worked & it came back to shell prompt.

For second attempt of reboot, host booted with latest kernel. Issued kdump-config status command & then host crashed with same stack trace as reported in recreation steps. Again tried to take dump from xmon prompt using 'X' which did not worked . it came back to shell prompt.

Attached host console logs for both attempts of reboots clearly.

Regards,
Indira

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-12: First attempt of host reboot_boslcp3

First attempt of host reboot_boslcp3 Edit (137.4 KiB, text/plain)

------- Comment on attachment From <email address hidden> 2018-04-12 11:12 EDT-------

Attached boslcp3 host console logs during 1st attempt of host reboot

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-12: Second attempt of host reboot_boslcp3

Second attempt of host reboot_boslcp3 Edit (10.4 KiB, text/plain)

------- Comment on attachment From <email address hidden> 2018-04-12 11:13 EDT-------

Attached boslcp3 host console logs during 2nd attempt of reboot

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-12: boslcp3 host crash console logs

#10

boslcp3 host crash console logs Edit (23.2 KiB, text/plain)

------- Comment on attachment From <email address hidden> 2018-04-12 14:25 EDT-------

Attached boslcp3 host crash console logs

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-13: boslcp3_crash_aprl13

#11

boslcp3_crash_aprl13 Edit (1.8 KiB, text/plain)

------- Comment (attachment only) From <email address hidden> 2018-04-13 03:24 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-13: Comment bridged from LTC Bugzilla

#12

------- Comment From <email address hidden> 2018-04-13 14:24 EDT-------
Dwip - excellent suggestion, I agree with your suggestion on next steps. If this is a double free we need to catch that earlier than where we are crashing.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-13:

#13

------- Comment From <email address hidden> 2018-04-13 14:51 EDT-------
I believe that the "1" in c000200e5848b701 is a flag. The address actually used will be c000200e5848b700. The flags PAGE_MAPPING_ANON and/or PAGE_MAPPING_MOVABLE are added to page addresses, and are stripped of before dereferencing. If that R30 value is something like "anon_mapping = (unsigned long)READ_ONCE(page->mapping)" then it will contain those flags. Not sure if that applies to your situation or not.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-13: Host console log

#14

Host console log Edit (563.1 KiB, text/plain)

------- Comment on attachment From <email address hidden> 2018-04-13 17:17 EDT-------

The full log of the Host during couple reboot, start test on guest then system drop into xmon.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-15: Comment bridged from LTC Bugzilla

#15

------- Comment From <email address hidden> 2018-04-15 16:36 EDT-------
Please collect the dmesg log and a crashdump.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-16:

#16

------- Comment From <email address hidden> 2018-04-16 01:24 EDT-------
(In reply to comment #54)
> Please collect the dmesg log and a crashdump.

Collected dl logs from xmon prompt & unable to take crashdump from xmon prompt ,we have bug#166660 opened.

Regards,
Indira

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-16: dmesg logs from xmon prompt_boslcp3

#17

dmesg logs from xmon prompt_boslcp3 Edit (145.2 KiB, text/plain)

------- Comment on attachment From <email address hidden> 2018-04-16 01:31 EDT-------

Attached dl logs from xmon pormpt _boslcp3

Manoj Iyer (manjo) on 2018-04-16

Changed in linux (Ubuntu Bionic):
assignee:	Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-16: Comment bridged from LTC Bugzilla

#18

------- Comment From <email address hidden> 2018-04-16 17:21 EDT-------
We're waiting for a reproduce and a kdump. Also more logs, including firmware logs/eSELs/etc.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-16: dmesg logs from xmon prompt_boslcp3

#19

------- Comment on attachment From <email address hidden> 2018-04-16 01:31 EDT-------

Attached dl logs from xmon pormpt _boslcp3

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-18: dmesg output 0418

#20

dmesg output 0418 Edit (231.5 KiB, application/octet-stream)

------- Comment (attachment only) From <email address hidden> 2018-04-18 12:01 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-18: Comment bridged from LTC Bugzilla

#21

------- Comment From <email address hidden> 2018-04-18 12:27 EDT-------
Nick made some interesting comments about lockups in LTC bug 166684, comment #24 about the hard lockup watchdog being added in Kernel 4.13. Also other comments about RCU stall warnings being too aggressive, but at least in this last log RCU doesn't complain until after the first few traces/lockups...

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-19:

#22

------- Comment From <email address hidden> 2018-04-19 03:51 EDT-------
Copied the dump to our kte server

kte111.isst.aus.stglabs.ibm.com 9.3.111.155 [kte/don2rry]

kte111:/LOGS/boslcp3/BZ166588/

h# ls -l /LOGS/boslcp3/BZ166588/
total 4
drwxr-xr-x 2 root root 4096 Apr 19 02:42 201804181042

Thanks.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-20:

#23

------- Comment From <email address hidden> 2018-04-20 15:11 EDT-------
Below is a test kernel with the four QLogic commits that were added to the 4.15.0-15.16 kernel reverted, plus the patch from 166877. Please run this and update the bug if the crash is still seen.

https://ibm.ent.box.com/s/n29uregixfwrywyle4ursgovmrbjcxtd

------- Comment From <email address hidden> 2018-04-20 15:11 EDT-------
These are the four qlogic driver commits that were added between 4.15.0-13 and 4.15.0-15.16:

79c67fb6fa21774c67bba59619eaa908c18de759 scsi: qla2xxx: Fix crashes in qla2x00_probe_one on probe failure
21af711d6011c857f11717d20b57516c334d5dd0 scsi: qla2xxx: Fix logo flag for qlt_free_session_done()
60b5e40ad28c93a2752fff0988660fa28fe7905d scsi: qla2xxx: Fix NULL pointer access for fcport structure
e4caf5c1b7d847400f2cb6525e7cf83167863241 scsi: qla2xxx: Fix smatch warning in qla25xx_delete_{rsp|req}_que

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-20:

#24

------- Comment From <email address hidden> 2018-04-20 15:39 EDT-------
Padma is reporting that the boslcp3 is available.

Dwip, I think Indira won't be available at this time of the day. Can you jump in and try to reproduce with the debug kernel in comment #94 above?

Thanks

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-20:

#25

------- Comment From <email address hidden> 2018-04-20 16:33 EDT-------
Klaus, I am not aware of the particular tests being run.

But I pinged Chanh so that he can start a new round of tests.

However ... I do see that boslcp3 now has reverted to the prior kernel:
Linux boslcp3 4.13.0-25-generic #29-Ubuntu SMP Mon Jan 8 21:15:55 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

I am not sure if there were other plans, but I let Chanh know about the
existence of the new patch which we would like to be tested. And he
kindly agreed to start the tests after installing the patch in #94.

p.s. the old kernel (according to Chanh) is due to one of the disks
getting corrupted.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-21: console log

#26

console log Edit (162.6 KiB, text/plain)

------- Comment (attachment only) From <email address hidden> 2018-04-20 20:08 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-21: Comment bridged from LTC Bugzilla

#27

Download full text (3.5 KiB)

------- Comment From <email address hidden> 2018-04-21 01:53 EDT-------
Looks like an Oops similar to the previous one in comment#39 starting a sequence of events

root@boslcp3:~# [ 2837.030181] Unable to handle kernel paging request for data at address 0x00000008
[ 2837.030253] Faulting instruction address: 0xc0000000001336fc
[ 2837.030295] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2837.030328] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 2837.030364] Modules linked in: vhost_net vhost macvtap macvlan tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables devlink ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 nfsv4 nfs fscache kvm_hv binfmt_misc kvm dm_service_time dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua joydev input_leds idt_89hpesx mac_hid vmx_crypto crct10dif_vpmsum at24 ofpart cmdlinepart uio_pdrv_genirq uio powernv_flash mtd ibmpowernv ipmi_powernv ipmi_devintf ipmi_msghandler opal_prd nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq ses enclosure hid_generic
[ 2837.030909] usbhid hid qla2xxx ast i2c_algo_bit ttm ixgbe drm_kms_helper mpt3sas nvme_fc syscopyarea sysfillrect nvme_fabrics sysimgblt fb_sys_fops nvme_core raid_class crc32c_vpmsum drm i40e scsi_transport_sas scsi_transport_fc mdio aacraid
[ 2837.031053] CPU: 145 PID: 1182 Comm: kworker/145:1 Not tainted 4.15.0-18-generic #19
[ 2837.031107] NIP: c0000000001336fc LR: c000000000133cf8 CTR: c000000000cfefa0
[ 2837.031156] REGS: c000200e44c77a10 TRAP: 0300 Not tainted (4.15.0-18-generic)
[ 2837.031204] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28000822 XER: 00000000
[ 2837.031257] CFAR: c000000000133cf4 DAR: 0000000000000008 DSISR: 40000000 SOFTE: 0
[ 2837.031257] GPR00: c000000000133cf8 c000200e44c77c90 c0000000016eae00 c000200e44bda5c0
[ 2837.031257] GPR04: c000000fdf6f7da0 c000200e618f7da0 c000200e618fa305 c000000fdf6f7cc8
[ 2837.031257] GPR08: c000200e6190c960 0000000000002440 0000000000000000 c00800000f04e0f8
[ 2837.031257] GPR12: 0000000000000000 c000000007a83b00 c00000000013c788 c000200e50ebf3c0
[ 2837.031257] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2837.031257] GPR20: c000200e618f7d80 0000000000000000 0000000000000000 fffffffffffffef7
[ 2837.031257] GPR24: 0000000000000402 0000000000000000 c000200e618f8100 c000000001713b00
[ 2837.031257] GPR28: c000200e618f7da0 0000000000000000 c000200e618f7d80 c000200e44bda5c0
[ 2837.031687] NIP [c0000000001336fc] process_one_work+0x3c/0x5a0
[ 2837.031727] LR [c000000000133cf8] worker_thread+0x98/0x630
[ 2837.031760] Call Trace:
[ 2837.031778] [c000200e44c77c90] [c000000000133974] process_one_work+0x2b4/0x5a0 (unreliable)
[ 2837.031828] [c000200e44c77d20] [c000000000133cf8] worker_thread+0x98/0x630
[ 2837.031885] [c000200e44c77dc0] [c00000000013c928] kthread+0x1a8/0x1b0
[ 2837.031928] [c000200e44c77e30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
[ 2837.031976] Instruction dump:
[ 2837.032001] 60000000 7d908026...

------- Comment From chnguyen@us.ibm.com 2018-04-21 17:02 EDT-------
Not sure what is going on. The SOL console print out all of these messages...
rcu_sched self-detected stall on CPU
[20705.652053] 	95-....: (1 GPs behind) idle=c72/2/0 softirq=179/180 fqs=2586003
[20705.652101] 	 (t=5172329 jiffies g=213 c=212 q=74736)
[20705.652140] Task dump for CPU 95:
[20705.652164] swapper/95      R  running task        0     0      1 0x00000804
[20705.652213] Call Trace:
[20705.652231] [c000200fff3d3460] [c000000000149ed8] sched_show_task.part.16+0xd8/0x110 (unreliable)
[20705.652288] [c000200fff3d34d0] [c0000000001a9e9c] rcu_dump_cpu_stacks+0xd4/0x138
[20705.652336] [c000200fff3d3520] [c0000000001a8f68] rcu_check_callbacks+0x8e8/0xb40
[20705.652385] [c000200fff3d3650] [c0000000001b7208] update_process_times+0x48/0x90
[20705.652433] [c000200fff3d3680] [c0000000001cef54] tick_sched_handle.isra.5+0x34/0xd0
[20705.652482] [c000200fff3d36b0] [c0000000001cf050] tick_sched_timer+0x60/0xe0
[20705.652530] [c000200fff3d36f0] [c0000000001b7db4] __hrtimer_run_queues+0x144/0x370
[20705.652578] [c000200fff3d3770] [c0000000001b8d0c] hrtimer_interrupt+0xfc/0x350
[20705.652627] [c000200fff3d3840] [c0000000000248f0] __timer_interrupt+0x90/0x260
[20705.652675] [c000200fff3d3890] [c000000000024d08] timer_interrupt+0x98/0xe0
[20705.652716] [c000200fff3d38c0] [c000000000009014] decrementer_common+0x114/0x120
[20705.652765] --- interrupt: 901 at _raw_spin_lock_irqsave+0x88/0x110
[20705.652765]     LR = _raw_spin_lock_irqsave+0x80/0x110
[20705.652836] [c000200fff3d3bb0] [c000200fff3d3bf0] 0xc000200fff3d3bf0 (unreliable)
[20705.652885] [c000200fff3d3bf0] [c000000000904e90] scsi_end_request+0x110/0x270
[20705.652933] [c000200fff3d3c50] [c000000000905414] scsi_io_completion+0x424/0x750
[20705.652981] [c000200fff3d3d10] [c0000000008f949c] scsi_finish_command+0x11c/0x1b0
[20705.653029] [c000200fff3d3d90] [c000000000904428] scsi_softirq_done+0x198/0x220
[20705.653078] [c000200fff3d3e20] [c00000000068fe98] blk_done_softirq+0xb8/0xe0
[20705.653126] [c000200fff3d3e60] [c000000000cffb08] __do_softirq+0x158/0x3e4
[20705.653167] [c000200fff3d3f40] [c000000000115968] irq_exit+0xe8/0x120
[20705.653207] [c000200fff3d3f60] [c000000000017788] __do_irq+0x88/0x1c0
[20705.653248] [c000200fff3d3f90] [c00000000002a1b0] call_do_irq+0x14/0x24
[20705.653289] [c000200e582fba90] [c00000000001795c] do_IRQ+0x9c/0x130
[20705.653330] [c000200e582fbae0] [c000000000009b04] h_virt_irq_common+0x114/0x120
[20705.653379] --- interrupt: ea1 at replay_interrupt_return+0x0/0x4
[20705.653379]     LR = arch_local_irq_restore+0x74/0x90
[20705.653459] [c000200e582fbdd0] [000000000000005f] 0x5f (unreliable)
[20705.653500] [c000200e582fbdf0] [c000000000ac16d0] cpuidle_enter_state+0xf0/0x450
[20705.653549] [c000200e582fbe50] [c00000000017311c] call_cpuidle+0x4c/0x90
[20705.653590] [c000200e582fbe70] [c000000000173530] do_idle+0x2b0/0x330
[20705.653631] [c000200e582fbec0] [c0000000001737ec] cpu_startup_entry+0x3c/0x50
[20705.653679] [c000200e582fbef0] [c00000000004a050] start_secondary+0x4f0/0x510
[20705.653727] [c000200e582fbf90] [c00000000000aa6c] start_secondary_prolog+0x10/0x14
[   ***] (2 of 2) A start job is running for? polling (5h 45min 33s / no limit)

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-21: sol console log

#39

sol console log Edit (30.7 KiB, text/plain)

------- Comment (attachment only) From <email address hidden> 2018-04-21 17:04 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-04-21: Comment bridged from LTC Bugzilla

#40

------- Comment From <email address hidden> 2018-04-21 18:55 EDT-------
(In reply to comment #80)
> (In reply to comment #79)
> > Machine still seems to be up... will check if I can observe anything
> > interesting ...
>
> System just crashes it now. The vmcore is at /var/crash/201804181042

Can we retry this test on the P8 system using Brian's kernel in comment #94?

Also, please post access information for this system.

Andrew Cloke (andrew-cloke) on 2018-04-23

Changed in ubuntu-power-systems:
status:	Triaged → Incomplete
Changed in linux (Ubuntu Bionic):
status:	Triaged → Incomplete

bugproxy (bugproxy) on 2018-05-02

tags:

removed: bugnameltc-166588 kernel-key severity-critical triage-g

bugproxy (bugproxy) on 2018-05-03

tags:

added: bugnameltc-166588 severity-critical

189 comments hidden

view all 269 comments

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: qla2xxx version 10.00.00.04-k

#230

------- Comment (attachment only) From <email address hidden> 2018-04-26 10:51 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: op-buid instructions for patched skiroot build

#231

------- Comment on attachment From <email address hidden> 2018-04-26 14:00 EDT-------

Attaching the instructions to build skiroot/skiboot/Petitboot/all with op-build
(in this case, a patched skiroot's kernel -- zImage.epapr), per Dwip's request.

Hopefully this might help others in the future.

(In reply to comment #203)
> The skiroot kernel build is available at:
>
> http://dorno.rch.stglabs.ibm.com/~mauricfo/kernel/skiroot/bz166588/zImage.
> epapr_4.15.14-openpower1.bz166588c132

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: dmesg log thus far, for May 1 run.

#232

------- Comment (attachment only) From <email address hidden> 2018-05-02 14:28 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: boslcp3 host console tee logs

#233

------- Comment on attachment From <email address hidden> 2018-05-02 23:22 EDT-------

Attached boslcp3 host console tee logs

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: dmesg log_boslcp3_latest for may1 run

#234

------- Comment on attachment From <email address hidden> 2018-05-02 23:49 EDT-------

Attached latest dmesg log for boslcp3 - may1st run

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: /var/log/syslog boslcp3 host

#235

------- Comment on attachment From <email address hidden> 2018-05-02 23:51 EDT-------

Attached /var/log/syslog file from boslcp3 host

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: /var/log/syslog1.file boslcp3

#236

------- Comment on attachment From <email address hidden> 2018-05-02 23:53 EDT-------

Attached /var/log/syslog.1 file from boslcp3 host

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: boslcp3_host_reboot_consolelogs

#237

------- Comment on attachment From <email address hidden> 2018-05-04 11:12 EDT-------

Attached host console logs for reboot issue after fresh installation

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

#238

Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch Edit (831.6 KiB, text/plain)

------- Comment on attachment From <email address hidden> 2018-05-05 13:16 EDT-------

Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: Comment bridged from LTC Bugzilla

#239

------- Comment From <email address hidden> 2018-05-05 13:23 EDT-------
The boslcp6 logs look characteristic of the qla2xxx issue (panic in process_one_work()). Don't have detailed qla2xxx logging so can't determine SAN disposition.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: /var/log/syslog1.file boslcp3

#240

------- Comment on attachment From <email address hidden> 2018-05-02 23:53 EDT-------

Attached /var/log/syslog.1 file from boslcp3 host

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: boslcp3_host_reboot_consolelogs

#241

------- Comment on attachment From <email address hidden> 2018-05-04 11:12 EDT-------

Attached host console logs for reboot issue after fresh installation

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

#242

------- Comment on attachment From <email address hidden> 2018-05-05 13:16 EDT-------

Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-05: fuller version of previous boslcp6 log

#243

fuller version of previous boslcp6 log Edit (1.6 MiB, text/plain)

------- Comment (attachment only) From <email address hidden> 2018-05-05 13:35 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-06: Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

#244

------- Comment on attachment From <email address hidden> 2018-05-05 13:16 EDT-------

Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-06: fuller version of previous boslcp6 log

#245

------- Comment (attachment only) From <email address hidden> 2018-05-05 13:35 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

#246

------- Comment on attachment From <email address hidden> 2018-05-05 13:16 EDT-------

Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: fuller version of previous boslcp6 log

#247

------- Comment (attachment only) From <email address hidden> 2018-05-05 13:35 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: Comment bridged from LTC Bugzilla

#248

------- Comment From <email address hidden> 2018-05-07 12:10 EDT-------
Of the "boslcp" systems, only 3 appear to have QLogic adapters. Of those, one has been running without the extended error logging and so collected no data, and one has been down (or non-functional) for about 36 hours. Of the data collected, though, there is no evidence of any SAN instability since Friday - before starting the patched kernels. This means that we have no new data on whether the patches fix the problem.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: dmesg log thus far, for May 1 run.

#249

------- Comment (attachment only) From <email address hidden> 2018-05-02 14:28 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: boslcp3 host console tee logs

#250

------- Comment on attachment From <email address hidden> 2018-05-02 23:22 EDT-------

Attached boslcp3 host console tee logs

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: dmesg log_boslcp3_latest for may1 run

#251

------- Comment on attachment From <email address hidden> 2018-05-02 23:49 EDT-------

Attached latest dmesg log for boslcp3 - may1st run

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: /var/log/syslog boslcp3 host

#252

------- Comment on attachment From <email address hidden> 2018-05-02 23:51 EDT-------

Attached /var/log/syslog file from boslcp3 host

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: /var/log/syslog1.file boslcp3

#253

------- Comment on attachment From <email address hidden> 2018-05-02 23:53 EDT-------

Attached /var/log/syslog.1 file from boslcp3 host

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: boslcp3_host_reboot_consolelogs

#254

------- Comment on attachment From <email address hidden> 2018-05-04 11:12 EDT-------

Attached host console logs for reboot issue after fresh installation

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

#255

------- Comment on attachment From <email address hidden> 2018-05-05 13:16 EDT-------

Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: fuller version of previous boslcp6 log

#256

------- Comment (attachment only) From <email address hidden> 2018-05-05 13:35 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

#257

------- Comment on attachment From <email address hidden> 2018-05-05 13:16 EDT-------

Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of success boot after install of bz166588 patch

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-07: fuller version of previous boslcp6 log

#258

------- Comment (attachment only) From <email address hidden> 2018-05-05 13:35 EDT-------

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-08: Comment bridged from LTC Bugzilla

#259

------- Comment From <email address hidden> 2018-05-08 12:09 EDT-------
It appears that there were some SAN incidents yesterday on boslcp3, approx. times were May 7 12:44:54 through 14:28:17. All were for one port, so not exactly the situation I think caused the panic. If we could correlate these SAN incidents with other activity on neighboring systems, that might help.

[207374.827928] = first incident
[213578.181860] = last incident
[287293.677076] Tue May 8 10:56:52 CDT 2018

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-09:

#260

------- Comment From <email address hidden> 2018-05-09 11:34 EDT-------
There was a period of SAN instability observed on boslcp1 this morning, at about May 9 05:01:28 to 05:51:56. This involved 2 ports simultaneously handling relogins. This was a Pegas kernel that should be susceptible to the panic, but no panic was seen. But since we don't know enough about the exact timing required to produce the panic, we can't say just what that means.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-10:

#261

------- Comment From <email address hidden> 2018-05-10 12:59 EDT-------
I have had some luck reproducing this, on ltc-boston113 (previously unable to reproduce there). I had altered the boot parameters to remove "quiet splash" and added "qla2xxx.logging=0x1e400000", and got the kworker panic during boot (did not even reach login prompt). I also hit this panic while booting the Pegas 1.1 installer, so it looks like Pegas is also affected. I am completing the Pegas install with qla2xxx blacklisted, and will characterize some more.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-10:

#262

------- Comment From <email address hidden> 2018-05-10 14:13 EDT-------
Being able to reproduce this on ltc-boston113 seems to have been a temporary condition. I can no longer reproduce there, Pegas or Ubuntu. Without some idea of what external conditions are causing this, it will be very difficult to pursue.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-11:

#263

------- Comment From <email address hidden> 2018-05-11 12:12 EDT-------
Some information coming in on the SAN where this reproduces. It appears that there is some undesirable configuration, where fast switches are backed by slower switches between host and disks. The current theory is that other activity on the fabric causes bottle-necks in the slow switches and results in the temporary loss of login. Working on a way to reproduce this on-demand.

But, if this is true, I think this probably is not likely to be hit by customers. Seems like customers would not be mixing slow switches with fast, especially in such a dysfunctional setup.

Still investigating, though, so nothing conclusive yet.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-11:

#264

Download full text (11.4 KiB)

------- Comment From <email address hidden> 2018-05-02 14:39 EDT-------
The SAN incident in the previous dmesg log shows only a single port (WWPN) glitching. The logs from panics showed two ports glitching at the same time. Also, this incident did not show the port logging back in for about 8 minutes, whereas the panics showed immediate/concurrent login. So, I'm not certain if we've proven the fix yet.

------- Comment From <email address hidden> 2018-05-02 16:32 EDT-------
I think next steps here are:

1) apply all the known firmware workarounds (GH 1158)
2) Bring up system with Doug's recommendations for log verbosity (comment 211 and 215). Also capture the console output to a separate file if possible.
3) re-start the test using this same kernel, but with no stress on the host: proceed to restart the 3 guests with stress, and have a 4th guest migrating between boslcp3 and 4.

------- Comment From <email address hidden> 2018-05-02 16:36 EDT-------
(In reply to comment #218)
> I think next steps here are:
>
> 1) apply all the known firmware workarounds (GH 1158)
> 2) Bring up system with Doug's recommendations for log verbosity (comment
> 211 and 215). Also capture the console output to a separate file if possible.
> 3) re-start the test using this same kernel, but with no stress on the host:
> proceed to restart the 3 guests with stress, and have a 4th guest migrating
> between boslcp3 and 4.

Klaus, let's hold off on making more changes right now. I'd like to let things run as-is a little longer.

------- Comment From <email address hidden> 2018-05-02 23:21 EDT-------
Attached host boslcp3 host console tee logs.
Default Comment by Bridge

------- Comment From <email address hidden> 2018-05-03 03:22 EDT-------
boslcp3 host console dumps messages related to qlogic driver.

Latest tee logs for boslcp3 host :

kte111.isst.aus.stglabs.ibm.com 9.3.111.155 [kte/don2rry]

kte111:/LOGS/boslcp3-host-may1.txt

[ipjoga@kte (AUS) ~]$ ls -l /LOGS/boslcp3-host-may1.txt
-rwxrwxr-x 1 ipjoga ipjoga 20811302 May 3 02:12 /LOGS/boslcp3-host-may1.txt

Regards,
Indira

------- Comment From <email address hidden> 2018-05-03 08:20 EDT-------
There were a large number of SAN incidents in the evening, although none involved two ports at the same time. Still, many involved relogin while the logout was still being processed - so there is some confidence that the patches may be working.

There was a large period of SAN instability between May 2 21:42:09 and 21:58:47. This involved only one port (21:00:00:24:ff:7e:f6:fe). It would be interesting if this could be traced back to some activity, either on this machine or on the SAN (e.g. was migration being tested on other machines at this point?).

We still have not seen the same situation that was associated with the panics (two or more ports experiencing instability at the same time), so it's not clear if we can conclude that the patches fix the original problem.If we could find some trigger for the instability, we might be able to orchestrate the situation originally seen.

------- Comment From <email address hidden> 2018-05-04 11:10 EDT-------
We could not able to install 'sar' package due to 166588 prior patch. And also 'xfs...

------- Comment From dougmill@us.ibm.com 2018-05-02 14:39 EDT-------
The SAN incident in the previous dmesg log shows only a single port (WWPN) glitching. The logs from panics showed two ports glitching at the same time. Also, this incident did not show the port logging back in for about 8 minutes, whereas the panics showed immediate/concurrent login. So, I'm not certain if we've proven the fix yet.

------- Comment From klausk@br.ibm.com 2018-05-02 16:32 EDT-------
I think next steps here are:

1) apply all the known firmware workarounds (GH 1158)
2) Bring up system with Doug's recommendations  for log verbosity (comment 211 and 215). Also capture the console output to a separate file if possible.
3) re-start the test using this same kernel, but with no stress on the host: proceed to restart the 3 guests with stress, and have a 4th guest migrating between boslcp3 and 4.

------- Comment From dougmill@us.ibm.com 2018-05-02 16:36 EDT-------
(In reply to comment #218)
> I think next steps here are:
>
> 1) apply all the known firmware workarounds (GH 1158)
> 2) Bring up system with Doug's recommendations  for log verbosity (comment
> 211 and 215). Also capture the console output to a separate file if possible.
> 3) re-start the test using this same kernel, but with no stress on the host:
> proceed to restart the 3 guests with stress, and have a 4th guest migrating
> between boslcp3 and 4.

Klaus, let's hold off on making more changes right now. I'd like to let things run as-is a little longer.

------- Comment From indira.priya@in.ibm.com 2018-05-02 23:21 EDT-------
Attached host boslcp3 host console tee logs.
Default Comment by Bridge

------- Comment From indira.priya@in.ibm.com 2018-05-03 03:22 EDT-------
boslcp3 host console dumps messages related to qlogic driver.

Latest tee logs for boslcp3 host :

kte111.isst.aus.stglabs.ibm.com 9.3.111.155 [kte/don2rry]

kte111:/LOGS/boslcp3-host-may1.txt

[ipjoga@kte (AUS) ~]$ ls -l /LOGS/boslcp3-host-may1.txt
-rwxrwxr-x 1 ipjoga ipjoga 20811302 May  3 02:12 /LOGS/boslcp3-host-may1.txt

Regards,
Indira

------- Comment From dougmill@us.ibm.com 2018-05-03 08:20 EDT-------
There were a large number of SAN incidents in the evening, although none involved two ports at the same time. Still, many involved relogin while the logout was still being processed - so there is some confidence that the patches may be working.

There was a large period of SAN instability between May  2 21:42:09 and 21:58:47. This involved only one port (21:00:00:24:ff:7e:f6:fe). It would be interesting if this could be traced back to some activity, either on this machine or on the SAN (e.g. was migration being tested on other machines at this point?).

------- Comment From indira.priya@in.ibm.com 2018-05-04 11:10 EDT-------
We could not able to install 'sar' package due to 166588 prior patch. And also 'xfs'  was being used on the system from the prior run.  To overcome both, we  planned fresh installation . Installed latest ubutnu1804 kernel(4.15.0-20) on LSI disk and booted up with disk. Login prompt appears & gave credentials. Immediately in less than a minute, system dump messages and started rebooting. Its not allowing time to run anything on the console prompt.

Tried  multiple attempts to boot with the latest kernel & once logged in system is rebooting with call traces as below.

Ubuntu 18.04 LTS boslcp3 hvc0

boslcp3 login: [   51.679446] sd 3:0:1:0: rejecting I/O to offline device
[   58.251326] Unable to handle kernel paging request for data at address 0xbf52a78fa0cf2419
[   58.251413] Faulting instruction address: 0xc00000000038ae70
[   58.251462] Oops: Kernel access of bad area, sig: 11 [#1]
[   58.251500] LE SMP NR_CPUS=2048 NUMA PowerNV
[   58.251543] Modules linked in: rpcsec_gss_krb5(E) nfsv4(E) nfs(E) fscache(E) binfmt_misc(E) dm_service_time(E) dm_multipath(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) joydev(E) input_leds(E) mac_hid(E) idt_89hpesx(E) at24(E) uio_pdrv_genirq(E) uio(E) vmx_crypto(E) ofpart(E) crct10dif_vpmsum(E) cmdlinepart(E) powernv_flash(E) mtd(E) opal_prd(E) ipmi_powernv(E) ibmpowernv(E) ipmi_devintf(E) ipmi_msghandler(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) sch_fq_codel(E) lockd(E) grace(E) sunrpc(E) ip_tables(E) x_tables(E) autofs4(E) ses(E) enclosure(E) hid_generic(E) usbhid(E) hid(E) qla2xxx(E) ast(E) i2c_algo_bit(E) ttm(E) mpt3sas(E) ixgbe(E) drm_kms_helper(E) nvme_fc(E) syscopyarea(E) sysfillrect(E) nvme_fabrics(E) sysimgblt(E) fb_sys_fops(E) nvme_core(E) raid_class(E) crc32c_vpmsum(E) drm(E) i40e(E)
[   58.252067]  scsi_transport_sas(E) aacraid(E) scsi_transport_fc(E) mdio(E)
[   58.252120] CPU: 80 PID: 1740 Comm: ureadahead Tainted: G            E    4.15.0-20-generic #21+bug166588
[   58.252186] NIP:  c00000000038ae70 LR: c00000000038ae5c CTR: c000000000621860
[   58.252245] REGS: c000000fd98b76c0 TRAP: 0380   Tainted: G            E     (4.15.0-20-generic)
[   58.252309] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002844  XER: 00000000
[   58.252373] CFAR: c000000000016e1c SOFTE: 1
[   58.252373] GPR00: c00000000038ad34 c000000fd98b7940 c0000000016eae00 0000000000000001
[   58.252373] GPR04: 007f2daa2bd342ac 00000000000005ea 0000000000000001 00000000000005e9
[   58.252373] GPR08: 7f2daa2bd34242b4 0000000000000000 0000000000000000 0000000000000000
[   58.252373] GPR12: 0000000000002000 c00000000fab7000 c000000fd9d9f848 c000000fd9d9fab8
[   58.252373] GPR16: c000000fd98b7c90 000000000000002a 000000000001fe80 0000000000000000
[   58.252373] GPR20: 0000000000000000 0000000000000000 000000000000002a 7f528781f8910018
[   58.252373] GPR24: c000200e585e2401 bf52a78fa0cf2419 c000000000b2142c c000000ff901ee00
[   58.252373] GPR28: ffffffffffffffff 00000000015004c0 c000200e585e2401 c000000ff901ee00
[   58.252879] NIP [c00000000038ae70] kmem_cache_alloc_node+0x2f0/0x350
[   58.252927] LR [c00000000038ae5c] kmem_cache_alloc_node+0x2dc/0x350
[   58.252974] Call Trace:
[   58.252996] [c000000fd98b7940] [c00000000038ad34] kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[   58.253066] [c000000fd98b79b0] [c000000000b2142c] __alloc_skb+0x6c/0x220
[   58.253116] [c000000fd98b7a10] [c000000000b2332c] alloc_skb_with_frags+0x7c/0x2e0
[   58.253174] [c000000fd98b7aa0] [c000000000b16f8c] sock_alloc_send_pskb+0x29c/0x2c0
[   58.253233] [c000000fd98b7b50] [c000000000c492c4] unix_stream_sendmsg+0x264/0x5c0
[   58.253292] [c000000fd98b7c30] [c000000000b11424] sock_sendmsg+0x64/0x90
[   58.253342] [c000000fd98b7c60] [c000000000b11508] sock_write_iter+0xb8/0x120
[   58.253401] [c000000fd98b7d00] [c0000000003d0434] new_sync_write+0x104/0x160
[   58.253459] [c000000fd98b7d90] [c0000000003d3b78] vfs_write+0xd8/0x220
[   58.253509] [c000000fd98b7de0] [c0000000003d3e98] SyS_write+0x68/0x110
[   58.253560] [c000000fd98b7e30] [c00000000000b184] system_call+0x58/0x6c
[   58.253607] Instruction dump:
[   58.253637] 7c97ba78 fb210038 38a50001 7f19ba78 fb290000 f8aa0000 4bc8bfb1 60000000
[   58.253698] 7fb8b840 419e0028 e93f0022 e91f0140 <7d59482a> 7d394a14 7d4a4278 7fa95040
[   58.253760] ---[ end trace 21f1ccbedad3db06 ]---
[   58.360858] device-mapper: multipath: Reinstating path 65:240.
[   58.362107] sd 3:0:1:0: Power-on or device reset occurred
[   58.369695] sd 2:0:1:0: Power-on or device reset occurred
[   58.371943] sd 3:0:1:0: alua: port group 00 state A non-preferred supports tolusna
[   58.376534] sd 3:0:0:0: Power-on or device reset occurred
[   58.381190] sd 2:0:0:0: Power-on or device reset occurred
[   58.391738] sd 3:0:0:0: alua: port group 01 state N non-preferred supports tolusna
[   59.265054]

Attached boslcp3 host console logs
Please let us know if this is a different issue to be tracked via separate bug.

Regards,
Indira

------- Comment From cdeadmin@us.ibm.com 2018-05-05 10:31 EDT-------
Yesterday, the decision was made at Padma's daily KVM meeting to only track System Firmware Mustfix issues using the LC GA1 Mustfix label since that is all that applies to the Supermicro team. The OS Kernel/KVM issues will be managed with a spreadsheet tracked by the KVM team and also in the internal slack channel. Removing the Mustfix label.

------- Comment From dougmill@us.ibm.com 2018-05-05 13:23 EDT-------
The boslcp6 logs look characteristic of the qla2xxx issue (panic in process_one_work()). Don't have detailed qla2xxx logging so can't determine SAN disposition.

------- Comment From dougmill@us.ibm.com 2018-05-07 12:10 EDT-------
Of the "boslcp" systems, only 3 appear to have QLogic adapters. Of those, one has been running without the extended error logging and so collected no data, and one has been down (or non-functional) for about 36 hours. Of the data collected, though, there is no evidence of any SAN instability since Friday - before starting the patched kernels. This means that we have no new data on whether the patches fix the problem.

------- Comment From dougmill@us.ibm.com 2018-05-08 12:09 EDT-------
It appears that there were some SAN incidents yesterday on boslcp3, approx. times were May  7 12:44:54 through 14:28:17. All were for one port, so not exactly the situation I think caused the panic. If we could correlate these SAN incidents with other activity on neighboring systems, that might help.

[207374.827928] = first incident
[213578.181860] = last incident
[287293.677076] Tue May  8 10:56:52 CDT 2018

------- Comment From dougmill@us.ibm.com 2018-05-09 11:34 EDT-------
There was a period of SAN instability observed on boslcp1 this morning, at about May  9 05:01:28 to 05:51:56. This involved 2 ports simultaneously handling relogins. This was a Pegas kernel that should be susceptible to the panic, but no panic was seen. But since we don't know enough about the exact timing required to produce the panic, we can't say just what that means.

------- Comment From dougmill@us.ibm.com 2018-05-10 12:59 EDT-------
I have had some luck reproducing this, on ltc-boston113 (previously unable to reproduce there). I had altered the boot parameters to remove "quiet splash" and added "qla2xxx.logging=0x1e400000", and got the kworker panic during boot (did not even reach login prompt). I also hit this panic while booting the Pegas 1.1 installer, so it looks like Pegas is also affected. I am completing the Pegas install with qla2xxx blacklisted, and will characterize some more.

------- Comment From dougmill@us.ibm.com 2018-05-10 14:13 EDT-------
Being able to reproduce this on ltc-boston113 seems to have been a temporary condition. I can no longer reproduce there, Pegas or Ubuntu. Without some idea of what external conditions are causing this, it will be very difficult to pursue.

------- Comment From dougmill@us.ibm.com 2018-05-11 12:12 EDT-------
Some information coming in on the SAN where this reproduces. It appears that there is some undesirable configuration, where fast switches are backed by slower switches between host and disks. The current theory is that other activity on the fabric causes bottle-necks in the slow switches and results in the temporary loss of login. Working on a way to reproduce this on-demand.

But, if this is true, I think this probably is not likely to be hit by customers. Seems like customers would not be mixing slow switches with fast, especially in such a dysfunctional setup.

Still investigating, though, so nothing conclusive yet.

bugproxy (bugproxy) on 2018-05-15

tags:

added: severity-high
removed: severity-critical

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-21:

#265

------- Comment From <email address hidden> 2018-05-21 13:20 EDT-------
*** Bug 168018 has been marked as a duplicate of this bug. ***

Andrew Cloke (andrew-cloke) on 2018-05-21

tags:

added: p9

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-24:

#266

------- Comment From <email address hidden> 2018-05-24 14:37 EDT-------
In bug #167562, Canonical reports that these fixes have been put in bionic-proposed (assumed to mean linux-image-4.15.0-23-generic). We need to test this ASAP in order to prevent the patches from being reverted. Can we get the latest -proposed Ubuntu Bionic installed and checked out on the systems where we saw this issue?

This is urgent. Starting by setting NEEDINFO for Chanh, although someone else may need to pick that up.

Revision history for this message

bugproxy (bugproxy) wrote on 2018-05-24:

#267

------- Comment From <email address hidden> 2018-05-24 18:16 EDT-------
(In reply to comment #259)
> In bug #167562, Canonical reports that these fixes have been put in
> bionic-proposed (assumed to mean linux-image-4.15.0-23-generic). We need to
> test this ASAP in order to prevent the patches from being reverted. Can we
> get the latest -proposed Ubuntu Bionic installed and checked out on the
> systems where we saw this issue?
>
> This is urgent. Starting by setting NEEDINFO for Chanh, although someone
> else may need to pick that up.

I installed on boslcp3 and it works. Don't see the crash like we use to see.
root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 17:59:00 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# lspci |grep QLogic
0030:01:00.0 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
0030:01:00.1 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
root@boslcp3:~#

Manoj Iyer (manjo) on 2018-06-11

tags:

added: triage-a

Frank Heimes (fheimes) on 2018-06-11

tags:

added: triage-g
removed: triage-a

Revision history for this message

Manoj Iyer (manjo) wrote on 2018-06-11:

#268

Looks like the bionic-proposed kernel works for IBM, and so marking this fix-committed.

Changed in linux (Ubuntu Bionic):
status:	Incomplete → Fix Committed
Changed in linux (Ubuntu):
status:	Incomplete → Fix Committed
Changed in ubuntu-power-systems:
status:	Incomplete → Fix Committed

Frank Heimes (fheimes) on 2018-07-02

Changed in linux (Ubuntu Bionic):
status:	Fix Committed → Fix Released
Changed in linux (Ubuntu):
status:	Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status:	Fix Committed → Fix Released

Revision history for this message

Andrew Cloke (andrew-cloke) wrote on 2018-07-02:

#269

The bionic-proposed kernel referred to in comment #268 has now been released. Marking as "Fix Released".

Brad Figg (brad-figg) on 2019-07-24

tags:

added: cscc

Displaying first 40 and last 40 comments. View all 269 comments or add a comment.

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.

Ubuntulinux package

ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3: Host crashed & enters into xmon after moving to 4.15.0-15.16 kernel

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
linux package