ISST-LTE:KVM:Ubuntu1804:BostonLC:boslcp3: Host crashed & enters into xmon after moving to 4.15.0-15.16 kernel

Bug #1762844 reported by bugproxy on 2018-04-10
Affects                             Importance   Assigned to
The Ubuntu-power-systems project    Critical     Canonical Kernel Team
linux (Ubuntu)                      Critical     Canonical Kernel Team
Bionic                              Critical     Canonical Kernel Team

Bug Description

Problem Description:
===================
Host crashed & enters into xmon after updating to the 4.15.0-15.16 kernel.

Steps to re-create:
==================

1. boslcp3 is up with BMC:118 & PNOR: 20180330 levels
2. Installed boslcp3 with latest kernel
    4.15.0-13-generic
3. Enabled "-proposed" kernel in /etc/apt/sources.list file
4. Ran sudo apt-get update & apt-get upgrade

5. root@boslcp3:~# ls /boot
abi-4.15.0-13-generic retpoline-4.15.0-13-generic
abi-4.15.0-15-generic retpoline-4.15.0-15-generic
config-4.15.0-13-generic System.map-4.15.0-13-generic
config-4.15.0-15-generic System.map-4.15.0-15-generic
grub vmlinux
initrd.img vmlinux-4.15.0-13-generic
initrd.img-4.15.0-13-generic vmlinux-4.15.0-15-generic
initrd.img-4.15.0-15-generic vmlinux.old
initrd.img.old

6. Rebooted & booted with 4.15.0-15 kernel
7. Enabled xmon by editing file "vi /etc/default/grub" and ran update-grub
8. Rebooted host.
9. Booted with 4.15.0-15 & provided root/password credentials in login prompt

10. Host crashed & enters into XMON state with 'Unable to handle kernel paging request'

root@boslcp3:~# [ 66.295233] Unable to handle kernel paging request for data at address 0x8882f6ed90e9151a
[ 66.295297] Faulting instruction address: 0xc00000000038a110
cpu 0x50: Vector: 380 (Data Access Out of Range) at [c00000000692f650]
    pc: c00000000038a110: kmem_cache_alloc_node+0x2f0/0x350
    lr: c00000000038a0fc: kmem_cache_alloc_node+0x2dc/0x350
    sp: c00000000692f8d0
   msr: 9000000000009033
   dar: 8882f6ed90e9151a
  current = 0xc00000000698fd00
  paca = 0xc00000000fab7000 softe: 0 irq_happened: 0x01
    pid = 1762, comm = systemd-journal
Linux version 4.15.0-15-generic (buildd@bos02-ppc64el-002) (gcc version 7.3.0 (Ubuntu 7.3.0-14ubuntu1)) #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 (Ubuntu 4.15.0-15.16-generic 4.15.15)
enter ? for help
[c00000000692f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[c00000000692f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
[c00000000692f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
[c00000000692fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
[c00000000692fae0] c000000000c5705c unix_dgram_sendmsg+0x15c/0x8f0
[c00000000692fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
[c00000000692fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
[c00000000692fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
[c00000000692fe30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c00 (System Call) at 000074826f6fa9c4
SP (7ffff5dc5510) is in userspace
50:mon>
50:mon>

11. Attached Host console logs

I rebooted the host just to see if it would hit the issue again and this time I didn't even get to the login prompt but it crashed in the same location:

50:mon> r
R00 = c000000000389fd4 R16 = c000200e0b20fdc0
R01 = c000200e0b20f8d0 R17 = 0000000000000048
R02 = c0000000016eb400 R18 = 000000000001fe80
R03 = 0000000000000001 R19 = 0000000000000000
R04 = 0048ca1cff37803d R20 = 0000000000000000
R05 = 0000000000000688 R21 = 0000000000000000
R06 = 0000000000000001 R22 = 0000000000000048
R07 = 0000000000000687 R23 = 4882d6e3c8b7ab55
R08 = 48ca1cff37802b68 R24 = c000200e5851df01
R09 = 0000000000000000 R25 = 8882f6ed90e67454
R10 = 0000000000000000 R26 = c000000000b2ec6c
R11 = c000000000d10f78 R27 = c000000ff901ee00
R12 = 0000000000002000 R28 = ffffffffffffffff
R13 = c00000000fab7000 R29 = 00000000015004c0
R14 = c000200e4c973fc8 R30 = c000200e5851df01
R15 = c000200e4c974238 R31 = c000000ff901ee00
pc = c00000000038a110 kmem_cache_alloc_node+0x2f0/0x350
cfar= c000000000016e1c arch_local_irq_restore+0x1c/0x90
lr = c00000000038a0fc kmem_cache_alloc_node+0x2dc/0x350
msr = 9000000000009033 cr = 28002844
ctr = c00000000061e1b0 xer = 0000000000000000 trap = 380
dar = 8882f6ed90e67454 dsisr = c000200e40bd8400
50:mon> t
[c000200e0b20f8d0] c000000000389fd4 kmem_cache_alloc_node+0x1b4/0x350 (unreliable)
[c000200e0b20f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
[c000200e0b20f9a0] c000000000b30b6c alloc_skb_with_frags+0x7c/0x2e0
[c000200e0b20fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
[c000200e0b20fae0] c000000000c56ae4 unix_stream_sendmsg+0x264/0x5c0
[c000200e0b20fbc0] c000000000b1ec64 sock_sendmsg+0x64/0x90
[c000200e0b20fbf0] c000000000b20abc ___sys_sendmsg+0x31c/0x390
[c000200e0b20fd90] c000000000b221ec __sys_sendmsg+0x5c/0xc0
[c000200e0b20fe30] c00000000000b184 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 00007d16a993a940
SP (7ffffbee2270) is in userspace
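
To compare the two backtraces above frame by frame, a small script can help. The following is a minimal Python sketch, not part of the original report: the regular expression and the name `parse_frames` are assumptions based only on the xmon line format visible in this bug, and the sample lines are copied from the two traces.

```python
import re

# Matches xmon frame lines such as:
# [c00000000692f940] c000000000b2ec6c __alloc_skb+0x6c/0x220
FRAME_RE = re.compile(
    r"\[(?P<sp>[0-9a-f]{16})\]\s+(?P<addr>[0-9a-f]{16})\s+"
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)

def parse_frames(text):
    """Return a list of (symbol, offset) tuples, one per frame line."""
    return [(m.group("sym"), int(m.group("off"), 16))
            for m in FRAME_RE.finditer(text)]

# Two frames from the first crash and the matching two from the second.
first = """[c00000000692fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
[c00000000692fae0] c000000000c5705c unix_dgram_sendmsg+0x15c/0x8f0"""
second = """[c000200e0b20fa30] c000000000b247cc sock_alloc_send_pskb+0x29c/0x2c0
[c000200e0b20fae0] c000000000c56ae4 unix_stream_sendmsg+0x264/0x5c0"""

# Frames that appear in both traces at the same symbol and offset.
common = [f for f in parse_frames(first) if f in parse_frames(second)]
print(common)
```

On these samples the shared frame is `sock_alloc_send_pskb`, while the socket entry point differs (`unix_dgram_sendmsg` versus `unix_stream_sendmsg`), which matches what the full traces show.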

Mirroring to Canonical to advise them that this might be a regression. Didn't see any obvious changes in this area in the changelog published at https://launchpad.net/ubuntu/+source/linux/4.15.0-15.16 but it would be good to have Canonical help review the deltas as we try to isolate this further.

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-166588 severity-critical targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)

------- Comment From <email address hidden> 2018-04-10 21:32 EDT-------
According to the test team, they have another BostonLC (boslcp4) that was updated to this kernel, and that system is booting up normally.
root@boslcp4:~# uname -a
Linux boslcp4 4.15.0-15-generic #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp4:~# date
Tue Apr 10 16:37:37 CDT 2018
root@boslcp4:~# uptime
16:37:38 up 40 min, 2 users, load average: 0.00, 0.03, 0.12

Additionally, I rebooted the system a third time to add the slub_debug=FZ kernel option and system booted to the login and I logged in successfully. I did it a fourth time and it succeeded again.

root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-15-generic #16-Ubuntu SMP Wed Apr 4 13:57:51 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# cat /proc/cmdline
root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro xmon=on splash quiet slub_debug=FZ

Strange.
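
When comparing boots with and without the debug options, it may help to check the command line programmatically. Below is a minimal Python sketch (the helper name `parse_cmdline` is hypothetical and the parsing is deliberately simple: it does not handle quoted values containing spaces). As far as I know, in `slub_debug=FZ` the `F` enables SLUB sanity checks and `Z` enables red zoning, which changes the slab layout and could plausibly mask or move a corruption.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to True."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so root=UUID=... keeps its value whole.
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params

# Command line copied from the boslcp3 output above.
cmdline = ("root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro xmon=on "
           "splash quiet slub_debug=FZ")
params = parse_cmdline(cmdline)
print(params.get("xmon"), params.get("slub_debug"))
```

On a live system the same function could be fed the contents of /proc/cmdline instead of the hard-coded string.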

Frank Heimes (frank-heimes) wrote :

Can you test again on a third system?
Can this be a hw problem on the first system?

Changed in ubuntu-power-systems:
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: triage-g
Changed in ubuntu-power-systems:
status: Triaged → Incomplete
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-11 03:14 EDT-------
(In reply to comment #16)
> Can you test again on a third system?
> Can this be a hw problem on the first system?

No. This cannot be a hardware issue, since we have been running fine on the same system for the last 4 months with multiple kernel updates.

And the system came back up on its own on the 3rd & 4th reboot. So the underlying problem still remains.

Changed in ubuntu-power-systems:
status: Incomplete → Triaged
Changed in linux (Ubuntu):
importance: Undecided → Critical
status: New → Triaged
tags: added: kernel-key
Joseph Salisbury (jsalisbury) wrote :

Can you see if the bug happens with any of these mainline kernels? We can perform a kernel bisect if we can narrow down to the last good kernel version and first bad one:

v4.14 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14/
v4.15-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc1/
v4.15-rc4: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc4/
v4.15 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/

You don't have to test every kernel, just up until the kernel that first has this bug.

Thanks in advance!
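
The testing order requested above (oldest kernel first, stop at the first one that shows the bug) can be sketched as follows. This is an illustrative Python sketch only: `find_boundary` is a hypothetical helper, and the `observed` table is made-up placeholder data, not results from this bug.

```python
def find_boundary(versions, is_bad):
    """Test kernels oldest-first; return (last_good, first_bad)."""
    last_good = None
    for version in versions:
        if is_bad(version):
            # Stop here: later kernels do not need to be tested.
            return last_good, version
        last_good = version
    return last_good, None

mainline = ["v4.14", "v4.15-rc1", "v4.15-rc4", "v4.15"]

# HYPOTHETICAL outcomes for illustration only. In practice is_bad would mean
# "boot this kernel several times and check whether the crash appears".
observed = {"v4.14": False, "v4.15-rc1": False, "v4.15-rc4": True, "v4.15": True}
print(find_boundary(mainline, observed.get))
```

Because the crash in this bug is intermittent, a single clean boot is weak evidence that a kernel is good; several boots per version would be needed before trusting the boundary.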

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-12 01:24 EDT-------
(In reply to comment #18)
> Can you see if the bug happens with any of these mainline kernels? We can
> perform a kernel bisect if we can narrow down to the last good kernel
> version and first bad one:
>
> v4.14 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14/
> v4.15-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc1/
> v4.15-rc4: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc4/
> v4.15 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/
>
> You don't have to test every kernel, just up until the kernel that first has
> this bug.
>
> Thanks in advance!

We need to make progress in testing other firmware and guest issues. We will come back to this later.

Meanwhile, the problem happened again today with the reboot and we tried to collect the vmcore using 'X', but it did not collect. Indira, pls add those details.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-12 11:10 EDT-------
Hi,

Today I tried rebooting the boslcp3 system and the crash issue was recreated.

On the first attempt, the host booted with the latest kernel after the reboot and I attempted the disable stop4, 5 commands; it then immediately crashed and entered xmon with a stack trace similar to the one reported in the bug (recreation steps). I tried to take a dump from the xmon prompt using the 'X' option, but it did not work and it came back to the shell prompt.

On the second reboot attempt, the host booted with the latest kernel. I issued the kdump-config status command and then the host crashed with the same stack trace as reported in the recreation steps. I again tried to take a dump from the xmon prompt using 'X', which did not work; it came back to the shell prompt.

Attached host console logs for both reboot attempts.

Regards,
Indira

------- Comment on attachment From <email address hidden> 2018-04-12 11:12 EDT-------

Attached boslcp3 host console logs during 1st attempt of host reboot

------- Comment on attachment From <email address hidden> 2018-04-12 11:13 EDT-------

Attached boslcp3 host console logs during 2nd attempt of reboot

------- Comment on attachment From <email address hidden> 2018-04-12 14:25 EDT-------

Attached boslcp3 host crash console logs

------- Comment (attachment only) From <email address hidden> 2018-04-13 03:24 EDT-------

------- Comment From <email address hidden> 2018-04-13 14:24 EDT-------
Dwip - excellent suggestion, I agree with your suggestion on next steps. If this is a double free we need to catch that earlier than where we are crashing.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-13 14:51 EDT-------
I believe that the "1" in c000200e5848b701 is a flag. The address actually used will be c000200e5848b700. The flags PAGE_MAPPING_ANON and/or PAGE_MAPPING_MOVABLE are added to page addresses, and are stripped off before dereferencing. If that R30 value is something like "anon_mapping = (unsigned long)READ_ONCE(page->mapping)" then it will contain those flags. Not sure if that applies to your situation or not.
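
The flag stripping described in this comment can be illustrated with a short Python sketch. The constant values are the ones I understand include/linux/page-flags.h to define for kernels of this era; the helper name `split_mapping` is hypothetical, and whether R30 really holds a page->mapping value is the commenter's hypothesis, not something this sketch confirms.

```python
# Low-bit flags stored in page->mapping (values as in include/linux/page-flags.h).
PAGE_MAPPING_ANON = 0x1
PAGE_MAPPING_MOVABLE = 0x2
PAGE_MAPPING_FLAGS = PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE

def split_mapping(value):
    """Return (pointer, flags) for a page->mapping style value."""
    return value & ~PAGE_MAPPING_FLAGS, value & PAGE_MAPPING_FLAGS

# Value quoted in the comment above.
ptr, flags = split_mapping(0xc000200e5848b701)
print(hex(ptr), bool(flags & PAGE_MAPPING_ANON), bool(flags & PAGE_MAPPING_MOVABLE))
```

For the quoted value this yields the pointer c000200e5848b700 with only the ANON bit set, which is consistent with the reading given in the comment.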

------- Comment on attachment From <email address hidden> 2018-04-13 17:17 EDT-------

The full log of the host across a couple of reboots, followed by starting the test on the guest, after which the system dropped into xmon.

------- Comment From <email address hidden> 2018-04-15 16:36 EDT-------
Please collect the dmesg log and a crashdump.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-16 01:24 EDT-------
(In reply to comment #54)
> Please collect the dmesg log and a crashdump.

Collected dl logs from the xmon prompt. We are unable to take a crashdump from the xmon prompt; bug#166660 has been opened for that.

Regards,
Indira

------- Comment on attachment From <email address hidden> 2018-04-16 01:31 EDT-------

Attached dl logs from xmon prompt _boslcp3

Manoj Iyer (manjo) on 2018-04-16
Changed in linux (Ubuntu Bionic):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)

------- Comment From <email address hidden> 2018-04-16 17:21 EDT-------
We're waiting for a reproduction and a kdump. Also more logs, including firmware logs/eSELs/etc.

------- Comment (attachment only) From <email address hidden> 2018-04-18 12:01 EDT-------

------- Comment From <email address hidden> 2018-04-18 12:27 EDT-------
Nick made some interesting comments about lockups in LTC bug 166684, comment #24 about the hard lockup watchdog being added in Kernel 4.13. Also other comments about RCU stall warnings being too aggressive, but at least in this last log RCU doesn't complain until after the first few traces/lockups...

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-19 03:51 EDT-------
Copied the dump to our kte server

kte111.isst.aus.stglabs.ibm.com 9.3.111.155 [kte/don2rry]

kte111:/LOGS/boslcp3/BZ166588/

h# ls -l /LOGS/boslcp3/BZ166588/
total 4
drwxr-xr-x 2 root root 4096 Apr 19 02:42 201804181042

Thanks.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-20 15:11 EDT-------
Below is a test kernel with the four QLogic commits that were added to the 4.15.0-15.16 kernel reverted, plus the patch from 166877. Please run this and update the bug if the crash is still seen.

https://ibm.ent.box.com/s/n29uregixfwrywyle4ursgovmrbjcxtd

------- Comment From <email address hidden> 2018-04-20 15:11 EDT-------
These are the four qlogic driver commits that were added between 4.15.0-13 and 4.15.0-15.16:

79c67fb6fa21774c67bba59619eaa908c18de759 scsi: qla2xxx: Fix crashes in qla2x00_probe_one on probe failure
21af711d6011c857f11717d20b57516c334d5dd0 scsi: qla2xxx: Fix logo flag for qlt_free_session_done()
60b5e40ad28c93a2752fff0988660fa28fe7905d scsi: qla2xxx: Fix NULL pointer access for fcport structure
e4caf5c1b7d847400f2cb6525e7cf83167863241 scsi: qla2xxx: Fix smatch warning in qla25xx_delete_{rsp|req}_que

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-20 15:39 EDT-------
Padma is reporting that the boslcp3 is available.

Dwip, I think Indira won't be available at this time of the day. Can you jump in and try to reproduce with the debug kernel in comment #94 above?

Thanks

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-20 16:33 EDT-------
Klaus, I am not aware of the particular tests being run.

But I pinged Chanh so that he can start a new round of tests.

However ... I do see that boslcp3 now has reverted to the prior kernel:
Linux boslcp3 4.13.0-25-generic #29-Ubuntu SMP Mon Jan 8 21:15:55 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

I am not sure if there were other plans, but I let Chanh know about the
existence of the new patch which we would like to be tested. And he
kindly agreed to start the tests after installing the patch in #94.

p.s. the old kernel (according to Chanh) is due to one of the disks
getting corrupted.

------- Comment (attachment only) From <email address hidden> 2018-04-20 20:08 EDT-------

------- Comment From <email address hidden> 2018-04-21 01:53 EDT-------
Looks like an Oops similar to the previous one in comment#39 starting a sequence of events

root@boslcp3:~# [ 2837.030181] Unable to handle kernel paging request for data at address 0x00000008
[ 2837.030253] Faulting instruction address: 0xc0000000001336fc
[ 2837.030295] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2837.030328] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 2837.030364] Modules linked in: vhost_net vhost macvtap macvlan tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables devlink ip6table_filter ip6_tables iptable_filter rpcsec_gss_krb5 nfsv4 nfs fscache kvm_hv binfmt_misc kvm dm_service_time dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua joydev input_leds idt_89hpesx mac_hid vmx_crypto crct10dif_vpmsum at24 ofpart cmdlinepart uio_pdrv_genirq uio powernv_flash mtd ibmpowernv ipmi_powernv ipmi_devintf ipmi_msghandler opal_prd nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq ses enclosure hid_generic
[ 2837.030909] usbhid hid qla2xxx ast i2c_algo_bit ttm ixgbe drm_kms_helper mpt3sas nvme_fc syscopyarea sysfillrect nvme_fabrics sysimgblt fb_sys_fops nvme_core raid_class crc32c_vpmsum drm i40e scsi_transport_sas scsi_transport_fc mdio aacraid
[ 2837.031053] CPU: 145 PID: 1182 Comm: kworker/145:1 Not tainted 4.15.0-18-generic #19
[ 2837.031107] NIP: c0000000001336fc LR: c000000000133cf8 CTR: c000000000cfefa0
[ 2837.031156] REGS: c000200e44c77a10 TRAP: 0300 Not tainted (4.15.0-18-generic)
[ 2837.031204] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28000822 XER: 00000000
[ 2837.031257] CFAR: c000000000133cf4 DAR: 0000000000000008 DSISR: 40000000 SOFTE: 0
[ 2837.031257] GPR00: c000000000133cf8 c000200e44c77c90 c0000000016eae00 c000200e44bda5c0
[ 2837.031257] GPR04: c000000fdf6f7da0 c000200e618f7da0 c000200e618fa305 c000000fdf6f7cc8
[ 2837.031257] GPR08: c000200e6190c960 0000000000002440 0000000000000000 c00800000f04e0f8
[ 2837.031257] GPR12: 0000000000000000 c000000007a83b00 c00000000013c788 c000200e50ebf3c0
[ 2837.031257] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2837.031257] GPR20: c000200e618f7d80 0000000000000000 0000000000000000 fffffffffffffef7
[ 2837.031257] GPR24: 0000000000000402 0000000000000000 c000200e618f8100 c000000001713b00
[ 2837.031257] GPR28: c000200e618f7da0 0000000000000000 c000200e618f7d80 c000200e44bda5c0
[ 2837.031687] NIP [c0000000001336fc] process_one_work+0x3c/0x5a0
[ 2837.031727] LR [c000000000133cf8] worker_thread+0x98/0x630
[ 2837.031760] Call Trace:
[ 2837.031778] [c000200e44c77c90] [c000000000133974] process_one_work+0x2b4/0x5a0 (unreliable)
[ 2837.031828] [c000200e44c77d20] [c000000000133cf8] worker_thread+0x98/0x630
[ 2837.031885] [c000200e44c77dc0] [c00000000013c928] kthread+0x1a8/0x1b0
[ 2837.031928] [c000200e44c77e30] [c00000000000b528] ret_from_kernel_thread+0x5c/0xb4
[ 2837.031976] Instruction dump:
[ 2837.032001] 60000000 7d908026...

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-21 02:01 EDT-------
Updated boslcp3 with latest PNOR:0420 & restarted tests on guests with kernel '4.15.0-18-generic'.

$ ./ipmis bmc-boslcp3 fru print 47
Product Name : OpenPOWER Firmware
Product Version : open-power-SUPERMICRO-P9DSU-V1.11-20180420-imp
Product Extra : op-build-4d27fab
Product Extra : skiboot-v5.11-70-g5307c0ec7899-pc34e21f
Product Extra : hostboot-742640c
Product Extra : linux-4.15.14-openpower1-p81c2d44
Product Extra : petitboot-v1.7.1-p8b80147
Product Extra : machine-xml-32ce616
Product Extra : occ-4f49f6

root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-18-generic #19 SMP Fri Apr 20 12:45:38 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# uname -r
4.15.0-18-generic

Guests kernel:
****************
root@boslcp3g3:~# uname -a
Linux boslcp3g3 4.15.0-15-generic #16+bug166877 SMP Wed Apr 18 14:47:30 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3g3:~# uname -r
4.15.0-15-generic

Regards,
Indira

------- Comment From <email address hidden> 2018-04-21 08:45 EDT-------
The latest logs show a panic in process_one_work() on CPU 145, some sort of NULL pointer fault, followed by 2 CPUs (22, 125) getting a "Bad interrupt in KVM entry/exit code, sig: 6" panic (possibly in response to the panic IPI). Those 2 CPUs timeout and the KDUMP kexec starts.

The KDUMP then gets the same process_one_work() panic, this time on CPU 1, followed by Hard LOCKUP detected on CPUs 0 and 1. rcu_sched then starts detecting the stalled CPU(s), only trying to dump CPU 1.

The problem seems to keep changing. Originally it was a panic on a very strange address in kmem_cache_alloc_node() from socket code. Later we see a NULL pointer issue in pool_mayday_timeout() from KVM. Now we are seeing a panic in process_one_work() from a kworker thread (unknown workqueue). If these different panics all have the same cause, it would seem to be something like memory corruption. Not being able to get a clean dump is going to be a problem.
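
To make the contrast between the fault addresses in this bug concrete, here is a rough Python triage sketch. It is a heuristic under stated assumptions, not kernel logic: the 64K page size is assumed, and the "top nibble 0xc" test is based only on the kernel addresses visible in these logs. The function name `classify` is hypothetical.

```python
PAGE_SIZE = 64 * 1024  # ASSUMPTION: 64K pages on this ppc64le kernel

def classify(addr):
    """Very rough triage of a 64-bit fault address (DAR)."""
    if addr < PAGE_SIZE:
        return "near-NULL (likely NULL pointer plus small offset)"
    if addr >> 60 == 0xc:
        return "kernel region (0xc...)"
    return "outside the 0xc... kernel region"

# DAR values reported in this bug: two from kmem_cache_alloc_node(),
# one from process_one_work().
for dar in (0x8882f6ed90e9151a, 0x8882f6ed90e67454, 0x0000000000000008):
    print(hex(dar), "->", classify(dar))
```

The two early faults dereference values that do not look like kernel pointers at all, while the later one is a small offset from NULL, which fits the picture of different symptoms rather than one repeating crash site.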

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-20 16:48 EDT-------
Boslcp3 is back with the new kernel from #94.

root@boslcp3:~# cat /proc/cmdline
root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro splash quiet crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M@128M

I will launch our test soon.
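
For reference, the range-based crashkernel= value on the command line above selects a reservation according to total RAM. The Python sketch below shows how I understand the documented syntax (`start-[end]:size` entries, start inclusive and end exclusive, optional `@offset`); the helper names are hypothetical, the 512G RAM figure is made up for illustration, and the real kernel may adjust the result on a given platform.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def to_bytes(text):
    """Convert '320M' or '2G' to bytes; plain digits are taken as bytes."""
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)

def crashkernel_reservation(value, ram_bytes):
    """Return (size, offset) in bytes for the given RAM size, or None."""
    ranges, _, offset = value.partition("@")
    for entry in ranges.split(","):
        span, size = entry.split(":")
        start, end = span.split("-")
        lo = to_bytes(start)
        hi = to_bytes(end) if end else float("inf")  # open-ended last range
        if lo <= ram_bytes < hi:
            return to_bytes(size), (to_bytes(offset) if offset else None)
    return None

value = "2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M@128M"
# 512G is a MADE-UP RAM size; the report does not state how much RAM boslcp3 has.
size, offset = crashkernel_reservation(value, 512 * UNITS["G"])
print(size // UNITS["M"], offset // UNITS["M"])
```

Any machine with 128G of RAM or more falls into the last range and would get a 4096M reservation at offset 128M under this reading.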

------- Comment From <email address hidden> 2018-04-20 20:05 EDT-------
(In reply to comment #98)
> Boslcp3 is back with the new kernel from #94.
>
> root@boslcp3:~# uname -a
> Linux boslcp3 4.15.0-18-generic #19 SMP Fri Apr 20 12:45:38 CDT 2018 ppc64le
> ppc64le ppc64le GNU/Linux
> root@boslcp3:~# cat /proc/cmdline
> root=UUID=bab108a0-d0a6-4609-87f1-6e33d0ad633c ro splash quiet
> crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:
> 4096M@128M
>
> I will launch our test soon.

It is not looking good on boslcp3. Within 3 hours of starting the test, the system is still pingable but I cannot ssh to it. Looking at the console, I see these messages all over:
************************************************************
[ 8785.370897] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 8785.370962] 1-...0: (4 GPs behind) idle=ca2/140000000000001/0 softirq=15273/15273 fqs=1075891
[ 8785.371035] (detected by 3, t=2179442 jiffies, g=2107, c=2106, q=386665)
[ 8785.371090] Task dump for CPU 1:
[ 8785.371123] kworker/1:3 R running task 0 4111 2 0x00000804
[ 8785.371195] Call Trace:
[ 8785.371221] [c0000000d5c4fa00] [c000000008133cf8] worker_thread+0x98/0x630 (unreliable)
[ 8848.390897] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 8848.390964] 1-...0: (4 GPs behind) idle=ca2/140000000000001/0 softirq=15273/15273 fqs=1083603
[ 8848.391037] (detected by 3, t=2195197 jiffies, g=2107, c=2106, q=389679)
[ 8848.391092] Task dump for CPU 1:
[ 8848.391125] kworker/1:3 R running task 0 4111 2 0x00000804
[ 8848.391197] Call Trace:
[ 8848.391223] [c0000000d5c4fa00] [c000000008133cf8] worker_thread+0x98/0x630 (unreliable)
[ 8857.031091] systemd[1]: systemd-journald.service: Start operation timed out. Terminating.
***********************************************************
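
The two stall reports above can be compared numerically. This is a minimal Python sketch, assuming only the log line format shown in this comment; the regular expression and `stall_reports` are hypothetical names, and the sample lines are the two "detected by" lines copied from the console output.

```python
import re

# Matches e.g.
# "[ 8785.371035] (detected by 3, t=2179442 jiffies, g=2107, c=2106, q=386665)"
STALL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\(detected by (?P<by>\d+), t=(?P<t>\d+) jiffies"
)

def stall_reports(log):
    """Return (timestamp_seconds, detecting_cpu, stalled_jiffies) per report."""
    return [(float(m["ts"]), int(m["by"]), int(m["t"]))
            for m in STALL_RE.finditer(log)]

log = """[ 8785.371035] (detected by 3, t=2179442 jiffies, g=2107, c=2106, q=386665)
[ 8848.391037] (detected by 3, t=2195197 jiffies, g=2107, c=2106, q=389679)"""

reports = stall_reports(log)
dt = reports[1][0] - reports[0][0]   # seconds between the two reports
dj = reports[1][2] - reports[0][2]   # growth of the stall counter in jiffies
print(round(dj / dt))                # jiffies per second implied by the log
```

The two reports imply a tick rate of about 250 jiffies per second. If that holds, t=2179442 corresponds to roughly 8,700 seconds, which would suggest CPU 1 stalled not long after boot rather than shortly before these messages appeared.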

root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-18-generic #19 SMP Fri Apr 20 12:45:38 CDT 2018 ppc64le ppc64le ppc64le GNU/Linux

------- Comment From <email address hidden> 2018-04-21 08:08 EDT-------
The two guests are impacted due to (In reply to comment #103)
> Updated boslcp3 with latest PNOR:0420 & restarted tests on guests with
> kernel '4.15.0-18-generic'.
>
> $ ./ipmis bmc-boslcp3 fru print 47
> Product Name : OpenPOWER Firmware
> Product Version : open-power-SUPERMICRO-P9DSU-V1.11-20180420-imp
> Product Extra : op-build-4d27fab
> Product Extra : skiboot-v5.11-70-g5307c0ec7899-pc34e21f
> Product Extra : hostboot-742640c
> Product Extra : linux-4.15.14-openpower1-p81c2d44
> Product Extra : petitboot-v1.7.1-p8b80147
> Product Extra : machine-xml-32ce616
> Product Extra : occ-4f49f6
>
> root@boslcp3:~# uname -a
> Linux boslcp3 4.15.0-18-generic #19 SMP Fri Apr 20 12:45:38 CDT 2018 ppc64le
> ppc64le ppc64le GNU/Linux
> root@bos...

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-21 09:21 EDT-------
Should we go back to the stock Ubuntu kernel in an attempt to identify if bug 167104 is a result of the custom kernel or the newest PNOR?

------- Comment From <email address hidden> 2018-04-21 13:17 EDT-------
(In reply to comment #106)
> Should we go back to the stock Ubuntu kernel in an attempt to identify if
> bug 167104 is a result of the custom kernel or the newest PNOR?

I am looking at this plethora of bugs on the host and guest; it seems like a constantly shifting problem. I don't think that this particular instance was because of the custom kernel (reverting some QLogic patches). In my opinion, the memory corruption theory seems to be gaining currency.

Are systems without the Qlogic adapters seeing any of the problems reported here?

How do we debug with constantly moving pieces? Can we get a stable base to start with? By that I mean we go back to a kernel and pnor that worked in the past. Then using the same kernel advance the pnor and validate how it works. We might need to limit the testing to "cater" to the various bugs.

Once we have isolated the pnor, then we repeat with the kernels. Not sure how long these activities will take, but we might need to consider running a parallel exercise.

------- Comment From <email address hidden> 2018-04-21 14:17 EDT-------
boslcp3 seems to have gone off the rails again ... the console is spitting out a
lot of messages like:

[ ***] (2 of 2) A start job is running for?urnal Service

and it is not pingable....

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-21 14:26 EDT-------
I have ltc-boston1 set up with Ubuntu kernel 4.15.0-15, but there is no SAN connected to the QLE2742. I see no problem there right now. I have reserved the system for this bug until Monday evening.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-04-21 17:02 EDT-------
Not sure what is going on. The SOL console prints out all of these messages...
rcu_sched self-detected stall on CPU
[20705.652053] 95-....: (1 GPs behind) idle=c72/2/0 softirq=179/180 fqs=2586003
[20705.652101] (t=5172329 jiffies g=213 c=212 q=74736)
[20705.652140] Task dump for CPU 95:
[20705.652164] swapper/95 R running task 0 0 1 0x00000804
[20705.652213] Call Trace:
[20705.652231] [c000200fff3d3460] [c000000000149ed8] sched_show_task.part.16+0xd8/0x110 (unreliable)
[20705.652288] [c000200fff3d34d0] [c0000000001a9e9c] rcu_dump_cpu_stacks+0xd4/0x138
[20705.652336] [c000200fff3d3520] [c0000000001a8f68] rcu_check_callbacks+0x8e8/0xb40
[20705.652385] [c000200fff3d3650] [c0000000001b7208] update_process_times+0x48/0x90
[20705.652433] [c000200fff3d3680] [c0000000001cef54] tick_sched_handle.isra.5+0x34/0xd0
[20705.652482] [c000200fff3d36b0] [c0000000001cf050] tick_sched_timer+0x60/0xe0
[20705.652530] [c000200fff3d36f0] [c0000000001b7db4] __hrtimer_run_queues+0x144/0x370
[20705.652578] [c000200fff3d3770] [c0000000001b8d0c] hrtimer_interrupt+0xfc/0x350
[20705.652627] [c000200fff3d3840] [c0000000000248f0] __timer_interrupt+0x90/0x260
[20705.652675] [c000200fff3d3890] [c000000000024d08] timer_interrupt+0x98/0xe0
[20705.652716] [c000200fff3d38c0] [c000000000009014] decrementer_common+0x114/0x120
[20705.652765] --- interrupt: 901 at _raw_spin_lock_irqsave+0x88/0x110
[20705.652765] LR = _raw_spin_lock_irqsave+0x80/0x110
[20705.652836] [c000200fff3d3bb0] [c000200fff3d3bf0] 0xc000200fff3d3bf0 (unreliable)
[20705.652885] [c000200fff3d3bf0] [c000000000904e90] scsi_end_request+0x110/0x270
[20705.652933] [c000200fff3d3c50] [c000000000905414] scsi_io_completion+0x424/0x750
[20705.652981] [c000200fff3d3d10] [c0000000008f949c] scsi_finish_command+0x11c/0x1b0
[20705.653029] [c000200fff3d3d90] [c000000000904428] scsi_softirq_done+0x198/0x220
[20705.653078] [c000200fff3d3e20] [c00000000068fe98] blk_done_softirq+0xb8/0xe0
[20705.653126] [c000200fff3d3e60] [c000000000cffb08] __do_softirq+0x158/0x3e4
[20705.653167] [c000200fff3d3f40] [c000000000115968] irq_exit+0xe8/0x120
[20705.653207] [c000200fff3d3f60] [c000000000017788] __do_irq+0x88/0x1c0
[20705.653248] [c000200fff3d3f90] [c00000000002a1b0] call_do_irq+0x14/0x24
[20705.653289] [c000200e582fba90] [c00000000001795c] do_IRQ+0x9c/0x130
[20705.653330] [c000200e582fbae0] [c000000000009b04] h_virt_irq_common+0x114/0x120
[20705.653379] --- interrupt: ea1 at replay_interrupt_return+0x0/0x4
[20705.653379] LR = arch_local_irq_restore+0x74/0x90
[20705.653459] [c000200e582fbdd0] [000000000000005f] 0x5f (unreliable)
[20705.653500] [c000200e582fbdf0] [c000000000ac16d0] cpuidle_enter_state+0xf0/0x450
[20705.653549] [c000200e582fbe50] [c00000000017311c] call_cpuidle+0x4c/0x90
[20705.653590] [c000200e582fbe70] [c000000000173530] do_idle+0x2b0/0x330
[20705.653631] [c000200e582fbec0] [c0000000001737ec] cpu_startup_entry+0x3c/0x50
[20705.653679] [c000200e582fbef0] [c00000000004a050] start_secondary+0x4f0/0x510
[20705.653727] [c000200e582fbf90] [c00000000000aa6c] start_secondary_prolog+0x10/0x14
[ ***] (2 of 2) A start job is running...

------- Comment (attachment only) From <email address hidden> 2018-04-21 17:04 EDT-------

------- Comment From <email address hidden> 2018-04-21 18:55 EDT-------
(In reply to comment #80)
> (In reply to comment #79)
> > Machine still seems to be up... will check if I can observe anything
> > interesting ...
>
> System just crashes it now. The vmcore is at /var/crash/201804181042

Can we retry this test on the P8 system using Brian's kernel in comment #94?

Also, please post access information for this system.

Changed in ubuntu-power-systems:
status: Triaged → Incomplete
Changed in linux (Ubuntu Bionic):
status: Triaged → Incomplete
bugproxy (bugproxy) on 2018-05-02
tags: removed: bugnameltc-166588 kernel-key severity-critical triage-g
bugproxy (bugproxy) on 2018-05-03
tags: added: bugnameltc-166588 severity-critical

------- Comment (attachment only) From <email address hidden> 2018-04-26 10:51 EDT-------

------- Comment on attachment From <email address hidden> 2018-04-26 14:00 EDT-------

Attaching the instructions to build skiroot/skiboot/Petitboot/all with op-build
(in this case, a patched skiroot's kernel -- zImage.epapr), per Dwip's request.

Hopefully this might help others in the future.

(In reply to comment #203)
> The skiroot kernel build is available at:
>
> http://dorno.rch.stglabs.ibm.com/~mauricfo/kernel/skiroot/bz166588/zImage.
> epapr_4.15.14-openpower1.bz166588c132

------- Comment (attachment only) From <email address hidden> 2018-05-02 14:28 EDT-------

------- Comment on attachment From <email address hidden> 2018-05-02 23:22 EDT-------

Attached boslcp3 host console tee logs

------- Comment on attachment From <email address hidden> 2018-05-02 23:49 EDT-------

Attached latest dmesg log for boslcp3 - May 1st run

------- Comment on attachment From <email address hidden> 2018-05-02 23:51 EDT-------

Attached /var/log/syslog file from boslcp3 host

------- Comment on attachment From <email address hidden> 2018-05-02 23:53 EDT-------

Attached /var/log/syslog.1 file from boslcp3 host

------- Comment on attachment From <email address hidden> 2018-05-04 11:12 EDT-------

Attached host console logs for reboot issue after fresh installation

------- Comment on attachment From <email address hidden> 2018-05-05 13:16 EDT-------

Logs from crashes after SAN bring-up of boslcp6 and subsequent logs of successful boot after install of bz166588 patch

------- Comment From <email address hidden> 2018-05-05 13:23 EDT-------
The boslcp6 logs look characteristic of the qla2xxx issue (panic in process_one_work()). Don't have detailed qla2xxx logging so can't determine SAN disposition.

------- Comment (attachment only) From <email address hidden> 2018-05-05 13:35 EDT-------

------- Comment From <email address hidden> 2018-05-07 12:10 EDT-------
Of the "boslcp" systems, only 3 appear to have QLogic adapters. Of those, one has been running without the extended error logging and so collected no data, and one has been down (or non-functional) for about 36 hours. Of the data collected, though, there is no evidence of any SAN instability since Friday - before starting the patched kernels. This means that we have no new data on whether the patches fix the problem.

------- Comment From <email address hidden> 2018-05-08 12:09 EDT-------
It appears that there were some SAN incidents yesterday on boslcp3, approx. times were May 7 12:44:54 through 14:28:17. All were for one port, so not exactly the situation I think caused the panic. If we could correlate these SAN incidents with other activity on neighboring systems, that might help.

[207374.827928] = first incident
[213578.181860] = last incident
[287293.677076] Tue May 8 10:56:52 CDT 2018
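
As a rough illustration of how the dmesg timestamps above map onto wall-clock time, the following Python sketch (not part of the original analysis) uses the last line as a reference pair. It assumes the kernel timestamps advance at the same rate as the wall clock, which is only approximately true, and the function name `dmesg_to_wall` is hypothetical.

```python
from datetime import datetime, timedelta

# Reference pair taken from the comment above: a dmesg timestamp and the
# wall-clock time (CDT) recorded at that same moment.
REF_UPTIME = 287293.677076
REF_WALL = datetime(2018, 5, 8, 10, 56, 52)


def dmesg_to_wall(uptime_seconds, ref_uptime=REF_UPTIME, ref_wall=REF_WALL):
    """Map a dmesg timestamp (seconds since boot) to wall-clock time.

    Assumes no clock drift between the kernel timestamp and the wall clock.
    """
    return ref_wall - timedelta(seconds=ref_uptime - uptime_seconds)


first = dmesg_to_wall(207374.827928)  # first incident
last = dmesg_to_wall(213578.181860)   # last incident
print(first.strftime("%b %d %H:%M:%S"))
print(last.strftime("%b %d %H:%M:%S"))
```

The two results should fall within about a second of the approximate times quoted above (May 7 12:44:54 and 14:28:17).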

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-09 11:34 EDT-------
There was a period of SAN instability observed on boslcp1 this morning, at about May 9 05:01:28 to 05:51:56. This involved 2 ports simultaneously handling relogins. This was a Pegas kernel that should be susceptible to the panic, but no panic was seen. But since we don't know enough about the exact timing required to produce the panic, we can't say just what that means.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-10 12:59 EDT-------
I have had some luck reproducing this, on ltc-boston113 (previously unable to reproduce there). I had altered the boot parameters to remove "quiet splash" and added "qla2xxx.logging=0x1e400000", and got the kworker panic during boot (did not even reach login prompt). I also hit this panic while booting the Pegas 1.1 installer, so it looks like Pegas is also affected. I am completing the Pegas install with qla2xxx blacklisted, and will characterize some more.
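
The boot-parameter change described above can be sketched as a small string transformation. This is an illustrative Python helper only (the name `adjust_cmdline` is hypothetical), assuming the parameters sit in a space-separated kernel command line such as the `GRUB_CMDLINE_LINUX_DEFAULT` value in /etc/default/grub. It does not edit any file.

```python
def adjust_cmdline(cmdline,
                   remove=("quiet", "splash"),
                   add=("qla2xxx.logging=0x1e400000",)):
    """Return a kernel command line with some tokens removed and others added.

    Tokens are compared as whole words; parameters already present are not
    duplicated.
    """
    kept = [tok for tok in cmdline.split() if tok not in remove]
    for param in add:
        if param not in kept:
            kept.append(param)
    return " ".join(kept)


# Example input is an assumption; the actual command line on ltc-boston113
# is not shown in this report.
print(adjust_cmdline("quiet splash"))
```

After changing the real file, `update-grub` and a reboot would still be needed for the new parameters to take effect.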

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-10 14:13 EDT-------
Being able to reproduce this on ltc-boston113 seems to have been a temporary condition. I can no longer reproduce there, Pegas or Ubuntu. Without some idea of what external conditions are causing this, it will be very difficult to pursue.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-11 12:12 EDT-------
Some information is coming in on the SAN where this reproduces. It appears that there is an undesirable configuration, where fast switches are backed by slower switches between host and disks. The current theory is that other activity on the fabric causes bottlenecks in the slow switches and results in the temporary loss of login. Working on a way to reproduce this on demand.

But, if this is true, I think this is unlikely to be hit by customers. It seems unlikely that customers would mix slow switches with fast, especially in such a dysfunctional setup.

Still investigating, though, so nothing conclusive yet.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-02 14:39 EDT-------
The SAN incident in the previous dmesg log shows only a single port (WWPN) glitching. The logs from panics showed two ports glitching at the same time. Also, this incident did not show the port logging back in for about 8 minutes, whereas the panics showed immediate/concurrent login. So, I'm not certain if we've proven the fix yet.

------- Comment From <email address hidden> 2018-05-02 16:32 EDT-------
I think next steps here are:

1) apply all the known firmware workarounds (GH 1158)
2) Bring up system with Doug's recommendations for log verbosity (comment 211 and 215). Also capture the console output to a separate file if possible.
3) re-start the test using this same kernel, but with no stress on the host: proceed to restart the 3 guests with stress, and have a 4th guest migrating between boslcp3 and 4.

------- Comment From <email address hidden> 2018-05-02 16:36 EDT-------
(In reply to comment #218)
> I think next steps here are:
>
> 1) apply all the known firmware workarounds (GH 1158)
> 2) Bring up system with Doug's recommendations for log verbosity (comment
> 211 and 215). Also capture the console output to a separate file if possible.
> 3) re-start the test using this same kernel, but with no stress on the host:
> proceed to restart the 3 guests with stress, and have a 4th guest migrating
> between boslcp3 and 4.

Klaus, let's hold off on making more changes right now. I'd like to let things run as-is a little longer.

------- Comment From <email address hidden> 2018-05-02 23:21 EDT-------
Attached boslcp3 host console tee logs.

------- Comment From <email address hidden> 2018-05-03 03:22 EDT-------
The boslcp3 host console is dumping messages related to the QLogic driver.

Latest tee logs for boslcp3 host :

kte111.isst.aus.stglabs.ibm.com 9.3.111.155 [kte/don2rry]

kte111:/LOGS/boslcp3-host-may1.txt

[ipjoga@kte (AUS) ~]$ ls -l /LOGS/boslcp3-host-may1.txt
-rwxrwxr-x 1 ipjoga ipjoga 20811302 May 3 02:12 /LOGS/boslcp3-host-may1.txt

Regards,
Indira

------- Comment From <email address hidden> 2018-05-03 08:20 EDT-------
There were a large number of SAN incidents in the evening, although none involved two ports at the same time. Still, many involved relogin while the logout was still being processed - so there is some confidence that the patches may be working.

There was a large period of SAN instability between May 2 21:42:09 and 21:58:47. This involved only one port (21:00:00:24:ff:7e:f6:fe). It would be interesting if this could be traced back to some activity, either on this machine or on the SAN (e.g. was migration being tested on other machines at this point?).

We still have not seen the same situation that was associated with the panics (two or more ports experiencing instability at the same time), so it's not clear if we can conclude that the patches fix the original problem. If we could find some trigger for the instability, we might be able to orchestrate the situation originally seen.
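
The condition being looked for (two or more ports unstable at the same time) can be expressed as an interval-overlap check. The Python sketch below is illustrative only: it assumes the per-port instability windows have already been extracted from the logs as `(port, start, end)` tuples in seconds. The function name `overlapping_ports`, the second WWPN, and the sample times are all hypothetical and are not taken from the attached logs.

```python
from itertools import combinations


def overlapping_ports(outages):
    """Return (port_a, port_b, start, end) for overlapping windows.

    `outages` is a list of (port, start, end) tuples with start <= end.
    Only pairs of *different* ports are reported.
    """
    hits = []
    for (p1, s1, e1), (p2, s2, e2) in combinations(outages, 2):
        if p1 != p2 and s1 <= e2 and s2 <= e1:
            hits.append((p1, p2, max(s1, s2), min(e1, e2)))
    return hits


# Hypothetical sample data: one port glitching twice, a second port once.
sample = [
    ("21:00:00:24:ff:7e:f6:fe", 100.0, 160.0),
    ("21:00:00:24:ff:7e:f6:fe", 400.0, 450.0),
    ("port-b (hypothetical)", 430.0, 470.0),
]
print(overlapping_ports(sample))
```

An empty result for a night's logs would correspond to the single-port incidents described above, which do not match the situation seen with the panics.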

------- Comment From <email address hidden> 2018-05-04 11:10 EDT-------
We were not able to install the 'sar' package due to the prior 166588 patch. And also 'xfs...

bugproxy (bugproxy) on 2018-05-15
tags: added: severity-high
removed: severity-critical
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-21 13:20 EDT-------
*** Bug 168018 has been marked as a duplicate of this bug. ***

tags: added: p9
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-24 14:37 EDT-------
In bug #167562, Canonical reports that these fixes have been put in bionic-proposed (assumed to mean linux-image-4.15.0-23-generic). We need to test this ASAP in order to prevent the patches from being reverted. Can we get the latest -proposed Ubuntu Bionic installed and checked out on the systems where we saw this issue?

This is urgent. Starting by setting NEEDINFO for Chanh, although someone else may need to pick that up.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-05-24 18:16 EDT-------
(In reply to comment #259)
> In bug #167562, Canonical reports that these fixes have been put in
> bionic-proposed (assumed to mean linux-image-4.15.0-23-generic). We need to
> test this ASAP in order to prevent the patches from being reverted. Can we
> get the latest -proposed Ubuntu Bionic installed and checked out on the
> systems where we saw this issue?
>
> This is urgent. Starting by setting NEEDINFO for Chanh, although someone
> else may need to pick that up.

I installed it on boslcp3 and it works. I don't see the crash we used to see.
root@boslcp3:~# uname -a
Linux boslcp3 4.15.0-23-generic #25-Ubuntu SMP Wed May 23 17:59:00 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# lspci |grep QLogic
0030:01:00.0 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
0030:01:00.1 Fibre Channel: QLogic Corp. ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter (rev 01)
root@boslcp3:~#
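
A check like the `uname -a` step above could be scripted when several hosts need verifying. The Python sketch below is illustrative only: it assumes, as the earlier comment does, that 4.15.0-23 is the first Bionic kernel carrying the fixes, and the names `FIXED_ABI`, `parse_release` and `has_fix` are hypothetical.

```python
import re

# Assumption from the comment above: the fixes first appear in 4.15.0-23.
FIXED_ABI = (4, 15, 0, 23)


def parse_release(release):
    """Parse an Ubuntu kernel release such as '4.15.0-23-generic'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(release):
    """True if the release is at or beyond the assumed fixed version."""
    return parse_release(release) >= FIXED_ABI


print(has_fix("4.15.0-23-generic"))
print(has_fix("4.15.0-15-generic"))
```

On a live system the release string can be obtained with `platform.release()` from the standard library, which reports the same value as `uname -r`.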

Manoj Iyer (manjo) on 2018-06-11
tags: added: triage-a
tags: added: triage-g
removed: triage-a
Manoj Iyer (manjo) wrote :

Looks like the bionic-proposed kernel works for IBM, and so marking this fix-committed.

Changed in linux (Ubuntu Bionic):
status: Incomplete → Fix Committed
Changed in linux (Ubuntu):
status: Incomplete → Fix Committed
Changed in ubuntu-power-systems:
status: Incomplete → Fix Committed
Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
Andrew Cloke (andrew-cloke) wrote :

The bionic-proposed kernel referred to in comment #268 has now been released. Marking as "Fix Released".
