FlashGT Integration and Setup: fsbmc30: After 17th reboot of soft bootme, HTX & Linux errors seen with 256 virtual LUNs
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) | Fix Released | High | Canonical Kernel Team | | |
Xenial | Fix Released | High | Seth Forshee | | |
Yakkety | Fix Released | High | Seth Forshee | | |
Bug Description
== Comment: #1 - Application Cdeadmin <email address hidden> - 2016-06-02 15:28:27 ==
==== State: Open by: anitrap on 01 June 2016 17:36:39 ====
Contact: Anitra Powell (<email address hidden> )
Backup: Dion Bell (<email address hidden>)
Primary BMC (1603G):
=======
# cat /proc/ractrends
FW_VERSION=
FW_DATE=Mar 10 2016
FW_BUILDTIME=
FW_DESC=8335 SRC BUILD RR9 03102016
FW_PRODUCTID=1
FW_RELEASEID=RR9
FW_CODEBASEVERS
#
PNOR (1603G):
=======
# ipmitool -H 127.0.0.1 -I lanplus -U ADMIN -P admin fru list 47
Product Name : OpenPOWER Firmware
Product Version : IBM-firestone-
Product Extra : hostboot-
Product Extra : occ-0362706-16fdfa7
Product Extra : skiboot-5.1.13
Product Extra : hostboot-
Product Extra : firestone-
Product Extra : capp-ucode-105cb8f
Partition Info:
=================
ver 1.5.4.3 - OS, HTX, Firmware and Machine details
Machine Serial No: 210995A
Machine Type/Model: 8335-GCA
root@fsbmc30p1:~# uname -a
Linux fsbmc30p1 4.4.8c0ffee0+ #2 SMP Tue May 24 10:50:26 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux
FlashGT NVMe setup:
===================
1 FlashGT card in slot 1 running in superpipe mode with 128 LUNs per port (total of 256 LUNs).
lsscsi
[0:0:0:0] disk ATA ST1000NX0313 BE33 /dev/sda
[1:0:0:0] disk ATA ST1000NX0313 BE33 /dev/sdb
[4:0:0:0] disk NVMe SAMSUNG MZ1LV960 3011 /dev/sdc
[4:1:0:0] disk NVMe SAMSUNG MZ1LV960 3011 /dev/sdd
[5:0:0:0] cd/dvd AMI Virtual CDROM0 1.00 /dev/sr0
[5:0:0:1] cd/dvd AMI Virtual CDROM1 1.00 /dev/sr1
[5:0:0:2] cd/dvd AMI Virtual CDROM2 1.00 /dev/sr2
[5:0:0:3] cd/dvd AMI Virtual CDROM3 1.00 /dev/sr3
[6:0:0:0] disk AMI Virtual Floppy0 1.00 /dev/sde
[6:0:0:1] disk AMI Virtual Floppy1 1.00 /dev/sdf
[6:0:0:2] disk AMI Virtual Floppy2 1.00 /dev/sdg
[6:0:0:3] disk AMI Virtual Floppy3 1.00 /dev/sdh
[7:0:0:0] disk AMI Virtual HDisk0 1.00 /dev/sdi
[7:0:0:1] disk AMI Virtual HDisk1 1.00 /dev/sdj
[7:0:0:2] disk AMI Virtual HDisk2 1.00 /dev/sdk
[7:0:0:3] disk AMI Virtual HDisk3 1.00 /dev/sdl
[7:0:0:4] disk AMI Virtual HDisk4 1.00 /dev/sdm
lspci | grep -i acc
0004:01:00.0 Processing accelerators: IBM Device 0601 (rev 01)
ls -l /sys/class/cxl
total 0
lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0 -> ../../devices/
lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0m -> ../../devices/
lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0s -> ../../devices/
lrwxrwxrwx 1 root root 0 May 31 13:27 card0 -> ../../devices/
lscfg | grep afu
+ afu0.0 Slot1/card0/afu0.0
+ afu0.0m Slot1/card0/
+ afu0.0s Slot1/card0/
/opt/ibm/
CXL Flash Device Status
Found 0601 0004:01:00.0 Slot1
Device: SCSI Block Mode LUN WWID
sg2: 4:0:0:0, sdc, superpipe, 600253800253824
sg3: 4:1:0:0, sdd, superpipe, 600253800253824
dpkg -l | grep capi
4el no description given 3.0-1970-3042652 ppc6
4el no description given 3.0-1970-3042652 ppc6
root@fsbmc30p1:
ii afuimage 3.0-1970-3042652 all no description given
cat /opt/ibm/
1970-3042652
/opt/ibm/
AFU Version = 160525N1
NVMe0 Version = BTV73011
NVMe0 NEXT = BTV73011
NVMe0 STATUS = 0x702
NVMe1 Version = BTV73011
NVMe1 NEXT = BTV73011
NVMe1 STATUS = 0x702
cat /tmp/test_lun_mode
128
Problem:
===========
While running soft bootme (shutdown -r from the OS every hour), I noticed HTX errors after the 9th and 17th reboots of the partition. At this point they seem like different issues, so I am opening two separate defects. I have already opened defect SW354759 for the first set of HTX errors and assigned it to htx_screen.
This defect is for the issue that happened after the 17th reboot (Jun 1 @ 6am). On the 18th reboot (Jun 1 @ 7am), the shutdown -r command failed and I had to manually power down the system.
I guess I will open this against surelock_screen first, since it seems similar to the defect Dion opened while running 128 virtual LUNs per port (defect http://
Test Info:
============
- running Soft bootme (shutdown -r every hour)
- mdt.bu + hxecom (GPUs were running). I copied a modified mdt.bu to another mdt file so I would not see any errors in htx after reboot.
Sample of HTX errors (for this defect)
=======
/dev/sg2.53 Jun 1 06:26:53 2016 err=00000010 sev=4 hxesurelock
READCMP5 numopers= 20000 loop= 4956 blk=0x4eee
len= 4096 offset=0 Seed Values= 37882, 44181, 50758
Data Pattern Seed Values = 37882, 44182, 50758 LBA Fencepost = 0xb94a
cblk_read error - Device or resource busy
/dev/sg2.18 Jun 1 06:26:53 2016 err=00000010 sev=4 hxesurelock
READCMP9 numopers= 20000 loop= 1501 blk=0x93f1
len= 4096 offset=0 Seed Values= 37847, 44740, 50780
Data Pattern Seed Values = 37847, 44741, 50780 LBA Fencepost = 0xb275
cblk_read error - Device or resource busy
/dev/sg2.98 Jun 1 06:26:53 2016 err=00000010 sev=4 hxesurelock
READCMP5 numopers= 20000 loop= 10365 blk=0x86d5
len= 4096 offset=0 Seed Values= 37927, 41320, 50710
Data Pattern Seed Values = 37927, 41321, 50710 LBA Fencepost = 0xbc7c
cblk_read error - Device or resource busy
/dev/sg2.116 Jun 1 06:30:45 2016 err=00000005 sev=4 hxesurelock
RDCMP10 numopers= 20000 loop= 6383 blk=0xc33d
len= 4096 offset=0 Seed Values= 37945, 49039, 50726
Data Pattern Seed Values = 37945, 49040, 50726 LBA Fencepost = 0xd0b0
cblk_read error - Input/output error
/dev/fpu17 Jun 1 06:30:51 2016 err=0000000b sev=1 hxefpu64
pthread_create call failed with rc: 11, errno: 11, Resource temporarily unavailable
/dev/fpu17 Jun 1 06:30:51 2016 err=0000000b sev=1 hxefpu64
Hardware Exerciser stopped on an error
/dev/sctu43 Jun 1 06:30:51 2016 err=0000000b sev=1 hxesctu
pthread_create call failed with rc: 11, errno: 11, Resource temporarily unavailable
/dev/sctu43 Jun 1 06:30:51 2016 err=0000000b sev=1 hxesctu
Hardware Exerciser stopped on an error
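The err= field in the entries above appears to be the errno in hexadecimal: 0x10 pairs with "Device or resource busy", 0x05 with "Input/output error", and 0x0b with the pthread_create failure reporting errno 11. A minimal Python sketch for tallying such entries, assuming that reading is right and that every entry header follows the layout shown in the sample (the names decode_header and LINUX_ERRNO are made up for illustration):

```python
import re
from collections import Counter

# Linux errno values seen in the sample, hard-coded so the sketch does not
# depend on the platform it runs on. Assumption: err= is a hex errno.
LINUX_ERRNO = {0x05: "EIO", 0x0B: "EAGAIN", 0x10: "EBUSY"}

# Header line of an HTX error entry, e.g.
# "/dev/sg2.53 Jun 1 06:26:53 2016 err=00000010 sev=4 hxesurelock"
HEADER = re.compile(
    r"^(?P<dev>/dev/\S+)\s+(?P<when>\w{3}\s+\d+\s+[\d:]+\s+\d{4})\s+"
    r"err=(?P<err>[0-9a-fA-F]{8})\s+sev=(?P<sev>\d+)\s+(?P<exer>\S+)$"
)

def decode_header(line):
    """Return the fields of an HTX entry header, or None for other lines."""
    m = HEADER.match(line.strip())
    if m is None:
        return None
    code = int(m.group("err"), 16)
    return {
        "device": m.group("dev"),
        "when": m.group("when"),
        "errno": code,
        "errno_name": LINUX_ERRNO.get(code, "unknown"),
        "severity": int(m.group("sev")),
        "exerciser": m.group("exer"),
    }

def tally(lines):
    """Count entries per (exerciser, errno name)."""
    counts = Counter()
    for line in lines:
        entry = decode_header(line)
        if entry is not None:
            counts[(entry["exerciser"], entry["errno_name"])] += 1
    return counts

sample = [
    "/dev/sg2.53 Jun 1 06:26:53 2016 err=00000010 sev=4 hxesurelock",
    "cblk_read error - Device or resource busy",
    "/dev/sg2.116 Jun 1 06:30:45 2016 err=00000005 sev=4 hxesurelock",
    "/dev/fpu17 Jun 1 06:30:51 2016 err=0000000b sev=1 hxefpu64",
]
for key, count in sorted(tally(sample).items()):
    print(key, count)
```

Grouping this way separates the hxesurelock I/O failures (EBUSY, EIO) from the later EAGAIN failures in hxefpu64 and hxesctu, which come from pthread_create rather than from the flash device.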
Logs:
======
/gsa/ausgsa/
/gsa/ausgsa/
/gsa/ausgsa/
/gsa/ausgsa/
/gsa/ausgsa/
sample of syslog during first htx error:
=======
Jun 1 06:19:20 fsbmc30p1 systemd[1]: Started Cleanup of Temporary Directories.
Jun 1 06:25:01 fsbmc30p1 rsyslogd-2007: action 'action 10' suspended, next retry is Wed Jun 1 06:25:31 2016 [v8.16.0 try http://
Jun 1 06:25:01 fsbmc30p1 CRON[99327]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Jun 1 06:26:53 fsbmc30p1 CXLBLK[37882]: cflash_
Jun 1 06:26:53 fsbmc30p1 rsyslogd-2007: action 'action 10' suspended, next retry is Wed Jun 1 06:27:23 2016 [v8.16.0 try http://
Jun 1 06:26:53 fsbmc30p1 CXLBLK[37847]: cflash_
Jun 1 06:26:53 fsbmc30p1 CXLBLK[37927]: cflash_
Jun 1 06:26:59 fsbmc30p1 CXLBLK[37961]: cflash_
Jun 1 06:26:59 fsbmc30p1 CXLBLK[37954]: cflash_
Jun 1 06:26:59 fsbmc30p1 CXLBLK[37887]: cflash_
Jun 1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] hrtimer: interrupt took 200250 ns
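The kernel line above carries both a wall-clock time and a bracketed seconds-since-boot value, so the boot time of the failing cycle can be estimated by subtracting one from the other. A small Python sketch, assuming the year 2016 (taken from the HTX entries, since syslog lines omit it) and accepting the one-second resolution of the syslog timestamp; estimate_boot_time is an illustrative name:

```python
import re
from datetime import datetime, timedelta

# Matches e.g.
# "Jun 1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] hrtimer: ..."
KERNEL_LINE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*(\d+\.\d+)\]"
)

def estimate_boot_time(line, year=2016):
    """Wall-clock time minus the kernel's seconds-since-boot stamp."""
    m = KERNEL_LINE.match(line)
    if m is None:
        return None
    month, day, clock, uptime = m.groups()
    wall = datetime.strptime(f"{month} {day} {clock} {year}",
                             "%b %d %H:%M:%S %Y")
    return wall - timedelta(seconds=float(uptime))

line = ("Jun 1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] "
        "hrtimer: interrupt took 200250 ns")
boot = estimate_boot_time(line)
print(boot.strftime("%b %d %H:%M:%S"))
```

For this line the estimate lands at roughly 06:04, which is consistent with the hourly reboot at 6am and puts the first hxesurelock errors about 23 minutes into that boot.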
sample from kern.log during fail:
=======
Jun 1 06:08:11 fsbmc30p1 kernel: [ 250.251041] nvidia-uvm: Loaded the UVM driver in lite mode, major device number 241
Jun 1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] hrtimer: interrupt took 200250 ns
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.764382] hxesurelock[40392]: unhandled signal 11 at 0000000000000024 nip 00003fff84602978 lr 00003fff84602974 code 30001
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.868242] Unable to handle kernel paging request for data at address 0x0000000c
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.868599] Faulting instruction address: 0xc00000000035e2b0
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.868865] Oops: Kernel access of bad area, sig: 11 [#1]
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.868928] SMP NR_CPUS=2048 NUMA PowerNV
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.868992] Modules linked in: nvidia_uvm(POE) iptable_filter ip_tables x_tables nvidia(POE) ipmi_devintf joydev input_leds mac_hid opal_prd ofpart cmdlinepart powernv_flash mtd at24 ipmi_powernv ipmi_msghandler uio_pdrv_genirq uio ibmpowernv powernv_rng binfmt_misc nfsd ib_iser auth_rpcgss rdma_cm iw_cm ib_cm nfs_acl ib_sa ib_mad lockd ib_core grace ib_addr sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870299] CPU: 80 PID: 40392 Comm: hxesurelock Tainted: P OE 4.4.8c0ffee0+ #2
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870379] task: c000007935fe23a0 ti: c000007910810000 task.ti: c000007910810000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870476] NIP: c00000000035e2b0 LR: c00000000035e280 CTR: 0000000000000000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870552] REGS: c0000079108135e0 TRAP: 0300 Tainted: P OE (4.4.8c0ffee0+)
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870642] MSR: 9000000100009033 <SF,HV,
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] CFAR: c000000000008468 DAR: 000000000000000c DSISR: 40000000 SOFTE: 1
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR00: c00000000035e280 c000007910813860 c000000001594600 0000000000000000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR04: c000007823192400 000000000002574f 0000000000000001 0000000000000000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR08: c0000079241b8a00 0000000000000000 00000000000044fb 65776f702f62696c
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR12: 2d656c3436637072 c00000000fb6f800 00000000464c457f 0000000000010c78
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR16: 0000000000000000 0000000000000039 d000000034fa04c5 0000000000010000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR20: 00000000000000cd 0000000000000550 0000000000010000 00000000039e0000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR24: 00003fffffffffff c000007910813af8 c000007823192600 c00000793f57b980
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR28: c00000793f573e80 00003fffffffffff 000000000000001f c000007926f29790
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872149] NIP [c00000000035e2b0] elf_core_
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872277] LR [c00000000035e280] elf_core_
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872351] Call Trace:
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872407] [c000007910813860] [c00000000035e280] elf_core_
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872527] [c000007910813a60] [c00000000036898c] do_coredump+
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872625] [c000007910813c20] [c0000000000ce7a0] get_signal+
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872705] [c000007910813d10] [c000000000017344] do_signal+
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872785] [c000007910813e00] [c00000000001776c] do_notify_
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872877] [c000007910813e30] [c000000000009838] ret_from_
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872963] Instruction dump:
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.873004] 60000000 2fa30000 409effa8 e95f0050 39200000 794737e3 4082ffa4 e91f00a0
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.873148] 2fa80000 419e002c e92800f8 e9290000 <8129000c> 79279fe3 41820018 7948efe3
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.884655] ---[ end trace f8abb6e0d0322daa ]---
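The bracketed word in the instruction dump, <8129000c>, is the instruction that faulted. Splitting it into PowerPC D-form fields gives primary opcode 32 (lwz), RT = 9, RA = 9 and displacement 12, that is lwz r9,12(r9). GPR09 is zero in the register dump, so the effective address is 0 + 12 = 0xc, which matches the DAR value and points to a read through a NULL pointer at offset 12. A short Python sketch of that decoding (decode_dform is an illustrative helper, not taken from any tool):

```python
def decode_dform(word):
    """Split a 32-bit PowerPC D-form instruction word into its fields."""
    opcode = (word >> 26) & 0x3F   # primary opcode, top 6 bits
    rt = (word >> 21) & 0x1F       # target register
    ra = (word >> 16) & 0x1F       # base register
    d = word & 0xFFFF              # 16-bit displacement
    if d & 0x8000:                 # sign-extend the displacement
        d -= 0x10000
    return opcode, rt, ra, d

LWZ = 32  # primary opcode of lwz (load word and zero)

opcode, rt, ra, disp = decode_dform(0x8129000C)

# GPR09 as printed in the oops register dump above.
gpr = {9: 0x0000000000000000}

# For RA != 0 the effective address is GPR[RA] + displacement.
fault_addr = gpr[ra] + disp

print(f"opcode={opcode} rt=r{rt} ra=r{ra} disp={disp}")
print(f"effective address = {fault_addr:#x}")
```

The word just before it in the dump, e9290000, has primary opcode 58 with RT = RA = 9 and a zero displacement, which would be ld r9,0(r9). Under that reading, the pointer loaded by the ld was NULL and the following lwz dereferenced it.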
gsave info:
==============
GSA Location: /gsa/ausgsa/
<===== This is from RTC side description =====>
See the Discussion field for the initial comments from CQ.
</===== This is from RTC side description =====>
==== State: Open by: mpvageli on 02 June 2016 14:20:06 ====
Oops: Kernel access of bad area, sig: 11 [#1]
== Comment: #9 - VIPIN K. PARASHAR <email address hidden> - 2016-06-07 12:04:49 ==
root@fsbmc30p1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial
root@fsbmc30p1:~# cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_
DISTRIB_
DISTRIB_
NAME="Ubuntu"
VERSION="16.04 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04 LTS"
VERSION_ID="16.04"
HOME_URL="http://
SUPPORT_URL="http://
BUG_REPORT_URL="http://
UBUNTU_
root@fsbmc30p1:~# uname -a
Linux fsbmc30p1 4.4.8c0ffee0+ #2 SMP Tue May 24 10:50:26 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux
root@fsbmc30p1:~#
== Comment: #24 - VIPIN K. PARASHAR <email address hidden> - 2016-07-07 07:14:05 ==
From kernel logs
===========
[ 7087.918089] device enP3p5s0f2 left promiscuous mode
[ 8801.190528] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8806.190383] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8816.507485] hxesurelock[14180]: unhandled signal 11 at 0000000000000024 nip 00003fff852c2ee8 lr 00003fff852c2938 code 30001
[ 8816.511368] hxesurelock[13501]: unhandled signal 11 at 0000000000000024 nip 00003fff890b2ee8 lr 00003fff890b2938 code 30001
[ 8816.526807] Unable to handle kernel paging request for data at address 0x0000000c
[ 8816.526928] Faulting instruction address: 0xc00000000035e2b0
[ 8816.530233] Unable to handle kernel paging request for data at address 0x0000000c
[ 8816.530596] Faulting instruction address: 0xc00000000035e2b0
3f:mon> t
[c000000686a13a60] c00000000036898c do_coredump+
[c000000686a13c20] c0000000000ce7a0 get_signal+
[c000000686a13d10] c000000000017344 do_signal+
[c000000686a13e00] c00000000001776c do_notify_
[c000000686a13e30] c000000000009838 ret_from_
--- Exception: 300 (Data Access) at 00003fff890b2ee8
SP (3fff83c2c490) is in userspace
3f:mon> r
R00 = c00000000035e280 R16 = 0000000000000000
R01 = c000000686a13860 R17 = 0000000000000042
R02 = c000000001594600 R18 = d000000021b104fa
R03 = 0000000000000000 R19 = 0000000000010000
R04 = c000002fb7463400 R20 = 00000000000000cd
R05 = 00000000000001bf R21 = 0000000000000628
R06 = 0000000000000001 R22 = 0000000000010000
R07 = 0000000000000000 R23 = 0000000000250000
R08 = c00000281af21500 R24 = 00003fffffffffff
R09 = 0000000000000000 R25 = c000000686a13af8
R10 = 00000000000044fb R26 = c000002fb7463800
R11 = 6c2d656c34366370 R27 = c000002ff0e05cc0
R12 = 756e672d78756e69 R28 = c000002ff0e05c40
R13 = c00000000fb65680 R29 = 00003fffffffffff
R14 = 00000000464c457f R30 = 0000000000000016
R15 = 0000000000010e70 R31 = c000002fb94bd3b8
pc = c00000000035e2b0 elf_core_
cfar= c000000000008468 slb_miss_
lr = c00000000035e280 elf_core_
msr = 9000000100009033 cr = 28053828
ctr = 0000000000000000 xer = 0000000000000000 trap = 300
dar = 000000000000000c dsisr = 40000000
3f:mon>
The hxesurelock process segfaulted, and the kernel then crashed while
dumping its core.
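The crashes in this comment show the same data address (0x0000000c) and the same faulting instruction address (0xc00000000035e2b0) as the oops in the original report. One way to confirm that across a longer log is to pair each "Unable to handle kernel paging request" line with the "Faulting instruction address" line that follows it and count the distinct pairs. A minimal Python sketch, assuming the two lines always appear in that order (oops_signatures is an illustrative name):

```python
import re
from collections import Counter

DATA_ADDR = re.compile(
    r"Unable to handle kernel paging request for data at address "
    r"(0x[0-9a-fA-F]+)"
)
FAULT_NIP = re.compile(r"Faulting instruction address: (0x[0-9a-fA-F]+)")

def oops_signatures(lines):
    """Count (data address, faulting instruction address) pairs."""
    signatures = Counter()
    pending = None  # data address waiting for its instruction address
    for line in lines:
        m = DATA_ADDR.search(line)
        if m:
            pending = m.group(1)
            continue
        m = FAULT_NIP.search(line)
        if m and pending is not None:
            signatures[(pending, m.group(1))] += 1
            pending = None
    return signatures

log = [
    "[ 1454.868242] Unable to handle kernel paging request for data at address 0x0000000c",
    "[ 1454.868599] Faulting instruction address: 0xc00000000035e2b0",
    "[ 8816.526807] Unable to handle kernel paging request for data at address 0x0000000c",
    "[ 8816.526928] Faulting instruction address: 0xc00000000035e2b0",
    "[ 8816.530233] Unable to handle kernel paging request for data at address 0x0000000c",
    "[ 8816.530596] Faulting instruction address: 0xc00000000035e2b0",
]
for (data, nip), count in oops_signatures(log).items():
    print(f"data={data} nip={nip} count={count}")
```

A single signature across all occurrences suggests one underlying defect in the coredump path rather than several unrelated crashes.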
== Comment: #87 - Frederic Barrat <email address hidden> - 2017-02-21 11:50:40 ==
Fix is in kernel v4.10:
bdecf76e319a297
"cxl: Fix coredump generation when cxl_get_fd() is used"
We'd like to have it backported to 16.10 and 16.04 LTS.
Changed in linux (Ubuntu): | |
assignee: | Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team) |
importance: | Undecided → High |
status: | New → Triaged |
Changed in linux (Ubuntu): | |
status: | Triaged → Fix Released |
Changed in linux (Ubuntu Xenial): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Yakkety): | |
status: | In Progress → Fix Committed |
tags: | added: verification-done-xenial, removed: verification-needed-xenial |
tags: | added: verification-done-yakkety, removed: verification-needed-yakkety |