FlashGT Integration and Setup: fsbmc30: After 17th reboot of soft bootme, HTX & Linux errors seen with 256 virtual LUNs

Bug #1667239 reported by bugproxy
This bug affects 1 person
Affects          Status         Importance   Assigned to             Milestone
linux (Ubuntu)   Fix Released   High         Canonical Kernel Team
Xenial           Fix Released   High         Seth Forshee
Yakkety          Fix Released   High         Seth Forshee

Bug Description

== Comment: #1 - Application Cdeadmin <email address hidden> - 2016-06-02 15:28:27 ==
==== State: Open by: anitrap on 01 June 2016 17:36:39 ====

Contact: Anitra Powell (<email address hidden> )
Backup: Dion Bell (<email address hidden>)

Primary BMC (1603G):
=====================================================
# cat /proc/ractrends/Helper/FwInfo
FW_VERSION=2.13.91819
FW_DATE=Mar 10 2016
FW_BUILDTIME=10:59:31 CDT
FW_DESC=8335 SRC BUILD RR9 03102016
FW_PRODUCTID=1
FW_RELEASEID=RR9
FW_CODEBASEVERSION=2.X
#

PNOR (1603G):
========================
# ipmitool -H 127.0.0.1 -I lanplus -U ADMIN -P admin fru list 47
Product Name : OpenPOWER Firmware
Product Version : IBM-firestone-ibm-OP8_v1.7_1.62
Product Extra : hostboot-bc98d0b-1a29dff
Product Extra : occ-0362706-16fdfa7
Product Extra : skiboot-5.1.13
Product Extra : hostboot-binaries-43d5a59
Product Extra : firestone-xml-e7b4fa2-c302f0e
Product Extra : capp-ucode-105cb8f

Partition Info:
=================
       ver 1.5.4.3 - OS, HTX, Firmware and Machine details

                           OS: GNU/Linux
                   OS Version: Ubuntu 16.04 LTS \n \l
               Kernel Version: 4.4.8c0ffee0+
                  HTX Version: htxubuntu-396
                    Host Name: fsbmc30p1
            Machine Serial No: 210995A
           Machine Type/Model: 8335-GCA

root@fsbmc30p1:~# uname -a
Linux fsbmc30p1 4.4.8c0ffee0+ #2 SMP Tue May 24 10:50:26 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux

FlashGT NVMe setup:
===================
1 FlashGT card in slot 1 running in superpipe mode with 128 LUNs per port (total of 256 LUNs).

lsscsi
[0:0:0:0] disk ATA ST1000NX0313 BE33 /dev/sda
[1:0:0:0] disk ATA ST1000NX0313 BE33 /dev/sdb
[4:0:0:0] disk NVMe SAMSUNG MZ1LV960 3011 /dev/sdc
[4:1:0:0] disk NVMe SAMSUNG MZ1LV960 3011 /dev/sdd
[5:0:0:0] cd/dvd AMI Virtual CDROM0 1.00 /dev/sr0
[5:0:0:1] cd/dvd AMI Virtual CDROM1 1.00 /dev/sr1
[5:0:0:2] cd/dvd AMI Virtual CDROM2 1.00 /dev/sr2
[5:0:0:3] cd/dvd AMI Virtual CDROM3 1.00 /dev/sr3
[6:0:0:0] disk AMI Virtual Floppy0 1.00 /dev/sde
[6:0:0:1] disk AMI Virtual Floppy1 1.00 /dev/sdf
[6:0:0:2] disk AMI Virtual Floppy2 1.00 /dev/sdg
[6:0:0:3] disk AMI Virtual Floppy3 1.00 /dev/sdh
[7:0:0:0] disk AMI Virtual HDisk0 1.00 /dev/sdi
[7:0:0:1] disk AMI Virtual HDisk1 1.00 /dev/sdj
[7:0:0:2] disk AMI Virtual HDisk2 1.00 /dev/sdk
[7:0:0:3] disk AMI Virtual HDisk3 1.00 /dev/sdl
[7:0:0:4] disk AMI Virtual HDisk4 1.00 /dev/sdm

lspci | grep -i acc
0004:01:00.0 Processing accelerators: IBM Device 0601 (rev 01)

ls -l /sys/class/cxl
total 0
lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0 -> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0/afu0.0
lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0m -> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0/afu0.0/afu0.0m
lrwxrwxrwx 1 root root 0 May 31 13:27 afu0.0s -> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0/afu0.0/afu0.0s
lrwxrwxrwx 1 root root 0 May 31 13:27 card0 -> ../../devices/pci0004:00/0004:00:00.0/0004:01:00.0/cxl/card0

lscfg | grep afu
+ afu0.0 Slot1/card0/afu0.0
+ afu0.0m Slot1/card0/afu0.0/afu0.0m
+ afu0.0s Slot1/card0/afu0.0/afu0.0s

/opt/ibm/capikv/bin/cxlfstatus
CXL Flash Device Status

Found 0601 0004:01:00.0 Slot1
    Device: SCSI Block Mode LUN WWID
       sg2: 4:0:0:0, sdc, superpipe, 60025380025382463300046000000000
       sg3: 4:1:0:0, sdd, superpipe, 60025380025382463300052000000000

dpkg -l | grep capi
4el no description given 3.0-1970-3042652 ppc6
4el no description given 3.0-1970-3042652 ppc6

root@fsbmc30p1:/tmp# dpkg -l | grep afu
ii afuimage 3.0-1970-3042652 all no description given

cat /opt/ibm/capikv/version.txt
1970-3042652

/opt/ibm/capikv/afu/cxl_afu_dump /dev/cxl/afu0.0m -v
AFU Version = 160525N1

 NVMe0 Version = BTV73011
 NVMe0 NEXT = BTV73011
 NVMe0 STATUS = 0x702

 NVMe1 Version = BTV73011
 NVMe1 NEXT = BTV73011
 NVMe1 STATUS = 0x702

cat /tmp/test_lun_mode
128

Problem:
===========
While running soft bootme (shutdown -r from the OS every hour), I noticed HTX errors after the 9th and 17th reboots of the partition. At this point they seem like different issues, so I am opening two separate defects. I have already opened defect SW354759 for the first set of HTX errors and assigned it to htx_screen.

This defect is for the issue that happened after the 17th reboot (Jun 1 @ 6am). On the 18th reboot (Jun 1 @ 7am), the shutdown -r command failed and I had to manually power down the system.

I will open this to surelock_screen first, since it seems similar to the defect Dion opened while running 128 virtual LUNs per port (defect http://w3.rchland.ibm.com/projects/bestquest/?defect=SW353881). For this failure, other exercisers eventually failed as well.

Test Info:
============
- running Soft bootme (shutdown -r every hour)
- mdt.bu + hxecom (GPUs were running). I copied a modified mdt.bu to another mdt file so I would not see any errors in htx after reboot.

Sample of HTX errors (for this defect)
==============================
/dev/sg2.53 Jun 1 06:26:53 2016 err=00000010 sev=4 hxesurelock
READCMP5 numopers= 20000 loop= 4956 blk=0x4eee
len= 4096 offset=0 Seed Values= 37882, 44181, 50758
Data Pattern Seed Values = 37882, 44182, 50758 LBA Fencepost = 0xb94a
cblk_read error - Device or resource busy

/dev/sg2.18 Jun 1 06:26:53 2016 err=00000010 sev=4 hxesurelock
READCMP9 numopers= 20000 loop= 1501 blk=0x93f1
len= 4096 offset=0 Seed Values= 37847, 44740, 50780
Data Pattern Seed Values = 37847, 44741, 50780 LBA Fencepost = 0xb275
cblk_read error - Device or resource busy

/dev/sg2.98 Jun 1 06:26:53 2016 err=00000010 sev=4 hxesurelock
READCMP5 numopers= 20000 loop= 10365 blk=0x86d5
len= 4096 offset=0 Seed Values= 37927, 41320, 50710
Data Pattern Seed Values = 37927, 41321, 50710 LBA Fencepost = 0xbc7c
cblk_read error - Device or resource busy

/dev/sg2.116 Jun 1 06:30:45 2016 err=00000005 sev=4 hxesurelock
RDCMP10 numopers= 20000 loop= 6383 blk=0xc33d
len= 4096 offset=0 Seed Values= 37945, 49039, 50726
Data Pattern Seed Values = 37945, 49040, 50726 LBA Fencepost = 0xd0b0
cblk_read error - Input/output error

/dev/fpu17 Jun 1 06:30:51 2016 err=0000000b sev=1 hxefpu64
pthread_create call failed with rc: 11, errno: 11, Resource temporarily unavailable

/dev/fpu17 Jun 1 06:30:51 2016 err=0000000b sev=1 hxefpu64
Hardware Exerciser stopped on an error

/dev/sctu43 Jun 1 06:30:51 2016 err=0000000b sev=1 hxesctu
pthread_create call failed with rc: 11, errno: 11, Resource temporarily unavailable

/dev/sctu43 Jun 1 06:30:51 2016 err=0000000b sev=1 hxesctu
Hardware Exerciser stopped on an error
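
For triage, the entries above can be tallied by exerciser and error code. The following is a minimal sketch, not an HTX tool: the header-line layout is inferred from the sample in this report, and the names HEADER, tally_htx_errors, and SAMPLE are hypothetical.

```python
import re
from collections import Counter

# Header layout inferred from the sample above (an assumption, not an HTX spec):
#   <device> <Mon> <D> <HH:MM:SS> <YYYY> err=<hex> sev=<n> <exerciser>
HEADER = re.compile(
    r"^(?P<dev>/dev/\S+)\s+\w{3}\s+\d+\s+[\d:]+\s+\d{4}\s+"
    r"err=(?P<err>[0-9a-fA-F]+)\s+sev=(?P<sev>\d+)\s+(?P<exer>\S+)\s*$"
)


def tally_htx_errors(text):
    """Count header lines per (exerciser, numeric error code)."""
    counts = Counter()
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            key = (match.group("exer"), int(match.group("err"), 16))
            counts[key] += 1
    return counts


# Three entries copied from the sample in this report.
SAMPLE = """\
/dev/sg2.53 Jun 1 06:26:53 2016 err=00000010 sev=4 hxesurelock
cblk_read error - Device or resource busy
/dev/sg2.116 Jun 1 06:30:45 2016 err=00000005 sev=4 hxesurelock
cblk_read error - Input/output error
/dev/fpu17 Jun 1 06:30:51 2016 err=0000000b sev=1 hxefpu64
pthread_create call failed with rc: 11, errno: 11, Resource temporarily unavailable
"""

if __name__ == "__main__":
    for (exerciser, err), count in sorted(tally_htx_errors(SAMPLE).items()):
        print(f"{exerciser} err=0x{err:x} count={count}")
```

In this sample the err values line up with the Linux errno numbers named in the message text: 0x10 is 16 (EBUSY, "Device or resource busy"), 0x5 is EIO ("Input/output error"), and 0xb is 11 (EAGAIN, "Resource temporarily unavailable").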

Logs:
======
/gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1

/gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/htxerr
/gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/syslog
/gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/kern.log
/gsa/ausgsa/home/a/n/anitrap/web/public/fsbmc30/softbootme_fail_1/bootme.log

sample of syslog during first htx error:
================================================
Jun 1 06:19:20 fsbmc30p1 systemd[1]: Started Cleanup of Temporary Directories.
Jun 1 06:25:01 fsbmc30p1 rsyslogd-2007: action 'action 10' suspended, next retry is Wed Jun 1 06:25:31 2016 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Jun 1 06:25:01 fsbmc30p1 CRON[99327]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))
Jun 1 06:26:53 fsbmc30p1 CXLBLK[37882]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
Jun 1 06:26:53 fsbmc30p1 rsyslogd-2007: action 'action 10' suspended, next retry is Wed Jun 1 06:27:23 2016 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Jun 1 06:26:53 fsbmc30p1 CXLBLK[37847]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
Jun 1 06:26:53 fsbmc30p1 CXLBLK[37927]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0

Jun 1 06:26:59 fsbmc30p1 CXLBLK[37961]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg3, chunk index = 0
Jun 1 06:26:59 fsbmc30p1 CXLBLK[37954]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
Jun 1 06:26:59 fsbmc30p1 CXLBLK[37887]: cflash_block_kern_mc.c,cblk_notify_mc_err,5504,LOG_EVENT reason 7 error_num = 0x607,for chunk->dev_name = /dev/sg2, chunk index = 0
Jun 1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] hrtimer: interrupt took 200250 ns

sample from kern.log during fail:
=================================
Jun 1 06:08:11 fsbmc30p1 kernel: [ 250.251041] nvidia-uvm: Loaded the UVM driver in lite mode, major device number 241
Jun 1 06:26:59 fsbmc30p1 kernel: [ 1378.248405] hrtimer: interrupt took 200250 ns
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.764382] hxesurelock[40392]: unhandled signal 11 at 0000000000000024 nip 00003fff84602978 lr 00003fff84602974 code 30001
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.868242] Unable to handle kernel paging request for data at address 0x0000000c
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.868599] Faulting instruction address: 0xc00000000035e2b0
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.868865] Oops: Kernel access of bad area, sig: 11 [#1]
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.868928] SMP NR_CPUS=2048 NUMA PowerNV
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.868992] Modules linked in: nvidia_uvm(POE) iptable_filter ip_tables x_tables nvidia(POE) ipmi_devintf joydev input_leds mac_hid opal_prd ofpart cmdlinepart powernv_flash mtd at24 ipmi_powernv ipmi_msghandler uio_pdrv_genirq uio ibmpowernv powernv_rng binfmt_misc nfsd ib_iser auth_rpcgss rdma_cm iw_cm ib_cm nfs_acl ib_sa ib_mad lockd ib_core grace ib_addr sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear mlx4_en hid_generic usbhid hid uas usb_storage cxlflash ast bnx2x i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm cxl vxlan mlx4_core ahci ip6_udp_tunnel udp_tunnel libahci mdio libcrc32c
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870299] CPU: 80 PID: 40392 Comm: hxesurelock Tainted: P OE 4.4.8c0ffee0+ #2
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870379] task: c000007935fe23a0 ti: c000007910810000 task.ti: c000007910810000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870476] NIP: c00000000035e2b0 LR: c00000000035e280 CTR: 0000000000000000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870552] REGS: c0000079108135e0 TRAP: 0300 Tainted: P OE (4.4.8c0ffee0+)
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870642] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28053988 XER: 00000000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] CFAR: c000000000008468 DAR: 000000000000000c DSISR: 40000000 SOFTE: 1
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR00: c00000000035e280 c000007910813860 c000000001594600 0000000000000000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR04: c000007823192400 000000000002574f 0000000000000001 0000000000000000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR08: c0000079241b8a00 0000000000000000 00000000000044fb 65776f702f62696c
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR12: 2d656c3436637072 c00000000fb6f800 00000000464c457f 0000000000010c78
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR16: 0000000000000000 0000000000000039 d000000034fa04c5 0000000000010000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR20: 00000000000000cd 0000000000000550 0000000000010000 00000000039e0000
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR24: 00003fffffffffff c000007910813af8 c000007823192600 c00000793f57b980
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.870852] GPR28: c00000793f573e80 00003fffffffffff 000000000000001f c000007926f29790
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872149] NIP [c00000000035e2b0] elf_core_dump+0xd60/0x1300
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872277] LR [c00000000035e280] elf_core_dump+0xd30/0x1300
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872351] Call Trace:
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872407] [c000007910813860] [c00000000035e280] elf_core_dump+0xd30/0x1300 (unreliable)
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872527] [c000007910813a60] [c00000000036898c] do_coredump+0xcec/0x11e0
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872625] [c000007910813c20] [c0000000000ce7a0] get_signal+0x540/0x7b0
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872705] [c000007910813d10] [c000000000017344] do_signal+0x54/0x2b0
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872785] [c000007910813e00] [c00000000001776c] do_notify_resume+0xbc/0xd0
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872877] [c000007910813e30] [c000000000009838] ret_from_except_lite+0x64/0x68
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.872963] Instruction dump:
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.873004] 60000000 2fa30000 409effa8 e95f0050 39200000 794737e3 4082ffa4 e91f00a0
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.873148] 2fa80000 419e002c e92800f8 e9290000 <8129000c> 79279fe3 41820018 7948efe3
Jun 1 06:28:16 fsbmc30p1 kernel: [ 1454.884655] ---[ end trace f8abb6e0d0322daa ]---
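
The instruction dump marks the faulting word in angle brackets (<8129000c>). As a rough sketch, the faulting word and the two words before it can be decoded by hand from the PowerPC D-form and DS-form load encodings. The helper names below are hypothetical and only the two opcodes needed here (lwz and ld) are handled.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed displacement."""
    return value - 0x10000 if value & 0x8000 else value


def decode_load(word):
    """Decode a 32-bit PowerPC instruction word if it is lwz or ld.

    lwz is primary opcode 32 (D-form); ld is primary opcode 58 (DS-form)
    with the low two bits clear. Anything else returns None.
    """
    opcode = word >> 26
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    if opcode == 32:
        return f"lwz r{rt},{sign_extend16(word & 0xFFFF)}(r{ra})"
    if opcode == 58 and (word & 0x3) == 0:
        return f"ld r{rt},{sign_extend16(word & 0xFFFC)}(r{ra})"
    return None


if __name__ == "__main__":
    # Last three words of the instruction dump, ending at the faulting one.
    for word in (0xE92800F8, 0xE9290000, 0x8129000C):
        print(f"{word:08x}  {decode_load(word)}")
    # GPR09 is 0 in the register dump, so the lwz reads from 0 + 12.
    gpr9 = 0x0
    print(f"effective address = {gpr9 + 12:#x}")
```

The faulting word decodes to lwz r9,12(r9). With GPR09 equal to 0 in the register dump, the effective address is 0xc, which matches DAR 000000000000000c: a pointer loaded by the preceding ld was NULL and was then dereferenced at offset 12. Which kernel structure field that pointer belongs to is not determined by this sketch.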

gsave info:
==============
GSA Location: /gsa/ausgsa/projects/s/sift/hst/trial_data/Surelock/Ubuntu/flashgt/fsbmc30p1_ubuntu1604_FlashGT_bootme_test5/FAIL201606011024

==== State: Open by: mpvageli on 02 June 2016 14:20:06 ====

 Oops: Kernel access of bad area, sig: 11 [#1]

# ipmitool -H 127.0.0.1 -I lanplus -U ADMIN -P admin fru list 47
Product Name : OpenPOWER Firmware
Product Version : IBM-firestone-ibm-OP8_v1.7_1.62
Product Extra : hostboot-bc98d0b-1a29dff
Product Extra : occ-0362706-16fdfa7
Product Extra : skiboot-5.1.13
Product Extra : hostboot-binaries-43d5a59
Product Extra : firestone-xml-e7b4fa2-c302f0e
Product Extra : capp-ucode-105cb8f

== Comment: #9 - VIPIN K. PARASHAR <email address hidden> - 2016-06-07 12:04:49 ==
root@fsbmc30p1:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04 LTS
Release: 16.04
Codename: xenial
root@fsbmc30p1:~# cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04 LTS"
NAME="Ubuntu"
VERSION="16.04 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
UBUNTU_CODENAME=xenial
root@fsbmc30p1:~# uname -a
Linux fsbmc30p1 4.4.8c0ffee0+ #2 SMP Tue May 24 10:50:26 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux
root@fsbmc30p1:~#

== Comment: #24 - VIPIN K. PARASHAR <email address hidden> - 2016-07-07 07:14:05 ==
From kernel logs
===========

[ 7087.918089] device enP3p5s0f2 left promiscuous mode
[ 8801.190528] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8806.190383] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8816.507485] hxesurelock[14180]: unhandled signal 11 at 0000000000000024 nip 00003fff852c2ee8 lr 00003fff852c2938 code 30001
[ 8816.511368] hxesurelock[13501]: unhandled signal 11 at 0000000000000024 nip 00003fff890b2ee8 lr 00003fff890b2938 code 30001
[ 8816.526807] Unable to handle kernel paging request for data at address 0x0000000c
[ 8816.526928] Faulting instruction address: 0xc00000000035e2b0
[ 8816.530233] Unable to handle kernel paging request for data at address 0x0000000c
[ 8816.530596] Faulting instruction address: 0xc00000000035e2b0
3f:mon> t
[c000000686a13a60] c00000000036898c do_coredump+0xcec/0x11e0
[c000000686a13c20] c0000000000ce7a0 get_signal+0x540/0x7b0
[c000000686a13d10] c000000000017344 do_signal+0x54/0x2b0
[c000000686a13e00] c00000000001776c do_notify_resume+0xbc/0xd0
[c000000686a13e30] c000000000009838 ret_from_except_lite+0x64/0x68
--- Exception: 300 (Data Access) at 00003fff890b2ee8
SP (3fff83c2c490) is in userspace
3f:mon> r
R00 = c00000000035e280 R16 = 0000000000000000
R01 = c000000686a13860 R17 = 0000000000000042
R02 = c000000001594600 R18 = d000000021b104fa
R03 = 0000000000000000 R19 = 0000000000010000
R04 = c000002fb7463400 R20 = 00000000000000cd
R05 = 00000000000001bf R21 = 0000000000000628
R06 = 0000000000000001 R22 = 0000000000010000
R07 = 0000000000000000 R23 = 0000000000250000
R08 = c00000281af21500 R24 = 00003fffffffffff
R09 = 0000000000000000 R25 = c000000686a13af8
R10 = 00000000000044fb R26 = c000002fb7463800
R11 = 6c2d656c34366370 R27 = c000002ff0e05cc0
R12 = 756e672d78756e69 R28 = c000002ff0e05c40
R13 = c00000000fb65680 R29 = 00003fffffffffff
R14 = 00000000464c457f R30 = 0000000000000016
R15 = 0000000000010e70 R31 = c000002fb94bd3b8
pc = c00000000035e2b0 elf_core_dump+0xd60/0x1300
cfar= c000000000008468 slb_miss_realmode+0x50/0x78
lr = c00000000035e280 elf_core_dump+0xd30/0x1300
msr = 9000000100009033 cr = 28053828
ctr = 0000000000000000 xer = 0000000000000000 trap = 300
dar = 000000000000000c dsisr = 40000000
3f:mon>

The hxesurelock process segfaulted, and the kernel then crashed while dumping core.

== Comment: #87 - Frederic Barrat <email address hidden> - 2017-02-21 11:50:40 ==
Fix is in kernel v4.10:
bdecf76e319a29735d828575f4a9269f0e17c547
"cxl: Fix coredump generation when cxl_get_fd() is used"

We'd like to have it backported to 16.10 and 16.04 LTS.

Revision history for this message
bugproxy (bugproxy) wrote : syslog

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-142129 severity-critical targetmilestone-inin16042
Revision history for this message
bugproxy (bugproxy) wrote : htxerr

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : kern.log

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : dmesg, backtrace - capiredfsp

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : dmesg, backtrace - surelock02p03

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-02-23 04:02 EDT-------
==== State: Verify by: cde00 on 23 February 2017 02:30:16 ====

==== State: Verify by: cde00 on 23 February 2017 02:40:32 ====

Revision history for this message
Michael Hohnbaum (hohnbaum) wrote : Re: [Bug 1667239] [NEW] FlashGT Integration and Setup: fsbmc30: After 17th reboot of soft bootme, HTX & Linux errors seen with 256 virtual LUNs

Request for a fix that showed up in the 4.10 kernel to be backported to
16.04 and 16.10. Please have the Kernel team review/respond. Thanks.

                           Michael

Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Seth Forshee (sforshee)
Changed in linux (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
Seth Forshee (sforshee) wrote :
Changed in linux (Ubuntu Xenial):
assignee: nobody → Seth Forshee (sforshee)
importance: Undecided → High
status: New → In Progress
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Seth Forshee (sforshee)
importance: Undecided → High
status: New → In Progress
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: In Progress → Fix Committed
Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-03-24 05:19 EDT-------
Hello Canonical,

I don't see the source code for the fix in the -proposed kernels (Ubuntu-4.4.0-70.91 for xenial and Ubuntu-4.8.0-44.47 for yakkety).
Did the fix get dropped? Please advise.
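
One way to check a kernel tree for the fix is to search the history by commit subject rather than by upstream SHA, since a backported commit normally gets a new SHA. A minimal sketch follows, assuming a local git clone of the kernel source; the function name find_fix_commits and the repository path are hypothetical, and the tag names are the ones quoted in this comment.

```python
import subprocess

# Subject line of the upstream fix, as quoted earlier in this report.
FIX_SUBJECT = "cxl: Fix coredump generation when cxl_get_fd() is used"


def find_fix_commits(repo, ref, subject=FIX_SUBJECT, run=subprocess.run):
    """Return one-line log entries reachable from `ref` whose commit
    message contains `subject` as a literal string.

    `run` is injectable so the function can be exercised without a repo.
    """
    cmd = [
        "git", "-C", repo, "log", "--oneline",
        "--fixed-strings", f"--grep={subject}", ref,
    ]
    result = run(cmd, capture_output=True, text=True, check=True)
    return [line for line in result.stdout.splitlines() if line.strip()]


# Example call (path is a placeholder, not from this report):
#   find_fix_commits("/path/to/ubuntu-xenial", "Ubuntu-4.4.0-70.91")
# An empty list would mean no commit with that subject is reachable from the tag.
```

Matching on the subject can miss a backport whose title was edited, so an empty result is a hint rather than proof that the fix is absent.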

Revision history for this message
Seth Forshee (sforshee) wrote :

We had to replace the previous -proposed kernels (which did have the fix) to fix a regression in a previous update, and the kernels currently in -proposed only have the reverts to fix this regression. The next -proposed kernels should have the fix. I'll remove the verification-needed-* tags for now.

Sorry for the confusion!

tags: removed: verification-needed-xenial verification-needed-yakkety
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
tags: added: verification-needed-yakkety
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

bugproxy (bugproxy)
tags: added: verification-done-xenial
removed: verification-needed-xenial
bugproxy (bugproxy)
tags: added: verification-done-yakkety
removed: verification-needed-yakkety
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-75.96

---------------
linux (4.4.0-75.96) xenial; urgency=low

  * linux: 4.4.0-75.96 -proposed tracker (LP: #1684441)

  * [Hyper-V] hv: util: move waiting for release to hv_utils_transport itself
    (LP: #1682561)
    - Drivers: hv: util: move waiting for release to hv_utils_transport itself

linux (4.4.0-74.95) xenial; urgency=low

  * linux: 4.4.0-74.95 -proposed tracker (LP: #1682041)

  * [Hyper-V] hv: vmbus: Raise retry/wait limits in vmbus_post_msg()
    (LP: #1681893)
    - Drivers: hv: vmbus: Raise retry/wait limits in vmbus_post_msg()

linux (4.4.0-73.94) xenial; urgency=low

  * linux: 4.4.0-73.94 -proposed tracker (LP: #1680416)

  * CVE-2017-6353
    - sctp: deny peeloff operation on asocs with threads sleeping on it

  * vfat: missing iso8859-1 charset (LP: #1677230)
    - [Config] NLS_ISO8859_1=y

  * Regression: KVM modules should be on main kernel package (LP: #1678099)
    - [Config] powerpc: Add kvm-hv and kvm-pr to the generic inclusion list

  * linux-lts-xenial 4.4.0-63.84~14.04.2 ADT test failure with linux-lts-xenial
    4.4.0-63.84~14.04.2 (LP: #1664912)
    - SAUCE: apparmor: fix link auditing failure due to, uninitialized var

  * regession tests failing after stackprofile test is run (LP: #1661030)
    - SAUCE: fix regression with domain change in complain mode

  * Permission denied and inconsistent behavior in complain mode with 'ip netns
    list' command (LP: #1648903)
    - SAUCE: fix regression with domain change in complain mode

  * unexpected errno=13 and disconnected path when trying to open /proc/1/ns/mnt
    from a unshared mount namespace (LP: #1656121)
    - SAUCE: apparmor: null profiles should inherit parent control flags

  * apparmor refcount leak of profile namespace when removing profiles
    (LP: #1660849)
    - SAUCE: apparmor: fix ns ref count link when removing profiles from policy

  * tor in lxd: apparmor="DENIED" operation="change_onexec"
    namespace="root//CONTAINERNAME_<var-lib-lxd>" profile="unconfined"
    name="system_tor" (LP: #1648143)
    - SAUCE: apparmor: Fix no_new_privs blocking change_onexec when using stacked
      namespaces

  * apparmor oops in bind_mnt when dev_path lookup fails (LP: #1660840)
    - SAUCE: apparmor: fix oops in bind_mnt when dev_path lookup fails

  * apparmor auditing denied access of special apparmor .null file
    (LP: #1660836)
    - SAUCE: apparmor: Don't audit denied access of special apparmor .null file

  * apparmor label leak when new label is unused (LP: #1660834)
    - SAUCE: apparmor: fix label leak when new label is unused

  * apparmor reference count bug in label_merge_insert() (LP: #1660833)
    - SAUCE: apparmor: fix reference count bug in label_merge_insert()

  * apparmor's raw_data file in securityfs is sometimes truncated (LP: #1638996)
    - SAUCE: apparmor: fix replacement race in reading rawdata

  * unix domain socket cross permission check failing with nested namespaces
    (LP: #1660832)
    - SAUCE: apparmor: fix cross ns perm of unix domain sockets

  * Xenial update to v4.4.59 stable release (LP: #1678960)
    - xfrm: policy: init locks early
    - virtio_balloon: init ...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
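
To check whether an installed kernel package is at or beyond the versions this bug page reports as fixed (linux 4.4.0-75.96 for xenial, 4.8.0-49.52 for yakkety), a rough comparison can be sketched as below. This is an illustration, not a Debian version comparator: it assumes plain package versions of the form A.B.C-ABI.UPLOAD (not uname -r strings, and no suffixes such as ~14.04.2). The names FIXED_IN, parse_version, and has_fix are hypothetical.

```python
import re

# Versions named in the "This bug was fixed in the package linux" messages
# on this bug page. Earlier uploads in the same changelog may already carry
# the fix; this table is deliberately conservative.
FIXED_IN = {
    "xenial": (4, 4, 0, 75, 96),
    "yakkety": (4, 8, 0, 49, 52),
}


def parse_version(version):
    """'4.4.0-75.96' -> (4, 4, 0, 75, 96). Numeric fields only."""
    return tuple(int(field) for field in re.findall(r"\d+", version))


def has_fix(series, package_version):
    """True if package_version is at or after the reported fixed version."""
    return parse_version(package_version) >= FIXED_IN[series]


if __name__ == "__main__":
    for series, version in [
        ("xenial", "4.4.0-70.91"),
        ("xenial", "4.4.0-75.96"),
        ("yakkety", "4.8.0-44.47"),
        ("yakkety", "4.8.0-49.52"),
    ]:
        print(series, version, has_fix(series, version))
```

The two older versions in the example are the -proposed kernels that were reported earlier in this bug as not containing the fix.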
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-49.52

---------------
linux (4.8.0-49.52) yakkety; urgency=low

  * linux: 4.8.0-49.52 -proposed tracker (LP: #1684427)

  * [Hyper-V] hv: util: move waiting for release to hv_utils_transport itself
    (LP: #1682561)
    - Drivers: hv: util: move waiting for release to hv_utils_transport itself

linux (4.8.0-48.51) yakkety; urgency=low

  * linux: 4.8.0-48.51 -proposed tracker (LP: #1682034)

  * [Hyper-V] hv: vmbus: Raise retry/wait limits in vmbus_post_msg()
    (LP: #1681893)
    - Drivers: hv: vmbus: Raise retry/wait limits in vmbus_post_msg()

linux (4.8.0-47.50) yakkety; urgency=low

  * linux: 4.8.0-47.50 -proposed tracker (LP: #1679678)

  * CVE-2017-6353
    - sctp: deny peeloff operation on asocs with threads sleeping on it

  * CVE-2017-5986
    - sctp: avoid BUG_ON on sctp_wait_for_sndbuf

  * vfat: missing iso8859-1 charset (LP: #1677230)
    - [Config] NLS_ISO8859_1=y

  * [Hyper-V] pci-hyperv: Use device serial number as PCI domain (LP: #1667527)
    - net/mlx4_core: Use cq quota in SRIOV when creating completion EQs

  * Regression: KVM modules should be on main kernel package (LP: #1678099)
    - [Config] powerpc: Add kvm-hv and kvm-pr to the generic inclusion list

  * linux-lts-xenial 4.4.0-63.84~14.04.2 ADT test failure with linux-lts-xenial
    4.4.0-63.84~14.04.2 (LP: #1664912)
    - SAUCE: apparmor: fix link auditing failure due to, uninitialized var

  * regession tests failing after stackprofile test is run (LP: #1661030)
    - SAUCE: fix regression with domain change in complain mode

  * Permission denied and inconsistent behavior in complain mode with 'ip netns
    list' command (LP: #1648903)
    - SAUCE: fix regression with domain change in complain mode

  * unexpected errno=13 and disconnected path when trying to open /proc/1/ns/mnt
    from a unshared mount namespace (LP: #1656121)
    - SAUCE: apparmor: null profiles should inherit parent control flags

  * apparmor refcount leak of profile namespace when removing profiles
    (LP: #1660849)
    - SAUCE: apparmor: fix ns ref count link when removing profiles from policy

  * tor in lxd: apparmor="DENIED" operation="change_onexec"
    namespace="root//CONTAINERNAME_<var-lib-lxd>" profile="unconfined"
    name="system_tor" (LP: #1648143)
    - SAUCE: apparmor: Fix no_new_privs blocking change_onexec when using stacked
      namespaces

  * apparmor oops in bind_mnt when dev_path lookup fails (LP: #1660840)
    - SAUCE: apparmor: fix oops in bind_mnt when dev_path lookup fails

  * apparmor auditing denied access of special apparmor .null file
    (LP: #1660836)
    - SAUCE: apparmor: Don't audit denied access of special apparmor .null file

  * apparmor label leak when new label is unused (LP: #1660834)
    - SAUCE: apparmor: fix label leak when new label is unused

  * apparmor reference count bug in label_merge_insert() (LP: #1660833)
    - SAUCE: apparmor: fix reference count bug in label_merge_insert()

  * apparmor's raw_data file in securityfs is sometimes truncated (LP: #1638996)
    - SAUCE: apparmor: fix replacement race in reading rawdata

  * unix domain socket cross permission check failing with n...

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-07-04 08:35 EDT-------
Anitra,

Were you able to verify this one on Ubuntu 16.04.2?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-07-04 09:03 EDT-------
This CMVC defect is being cancelled by the CDE Bridge because the corresponding CQ Defect [SW354783] was transferred out of the bridge domain.
Here are the additional details:
New Subsystem = ppc_triage
New Release = unspecified
New Component = ubuntu_linux
New OwnerInfo = Chavez, Luciano (<email address hidden>)
To continue tracking this issue, please follow CQ defect [SW354783].

Opened defect SW355478 on the new failure to see if it is the same issue. I made it sev 1 since the system is in XMON right now, which is preventing further testing.

Like I mentioned earlier, the fail could be related to this defect.

For this defect...

The "Oops: Kernel access of bad area, sig: 11 [#1]" in the logs happens during the HTX run.
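
A minimal sketch for locating that signature in a saved kernel log. Only the Oops string itself comes from the logs quoted above; the function name `find_oops`, the regular expression, and the sample lines are hypothetical and not part of HTX or the kernel tooling.

```python
import re

# Matches kernel Oops headers of the form quoted in this bug, e.g.
# "Oops: Kernel access of bad area, sig: 11 [#1]"
OOPS_RE = re.compile(
    r"Oops: (?P<reason>.+?), sig: (?P<sig>\d+) \[#(?P<count>\d+)\]"
)


def find_oops(lines):
    """Return (line_number, reason, signal, oops_count) for each Oops header."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = OOPS_RE.search(line)
        if match:
            hits.append(
                (
                    number,
                    match.group("reason"),
                    int(match.group("sig")),
                    int(match.group("count")),
                )
            )
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, pass the lines of kern.log or dmesg output.
    sample = [
        "[ 100.000000] some unrelated message",
        "[ 101.000000] Oops: Kernel access of bad area, sig: 11 [#1]",
    ]
    for hit in find_oops(sample):
        print(hit)
```

The oops counter in `[#N]` is kept so that a first Oops can be told apart from follow-on ones in the same boot.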

On the reboot (which happened ~30 minutes after the first error), I saw the partition hang/crash. I had to use ipmitool to power down the system.
The current xmon crash in SW355478 / 142348 is different from the one being tracked in this bug. Will wait for a recreate of the original issue.

The FlashGT HST team still needs to recreate this issue.

SW357236 "HTX fail during superpipe 128 per LUN testing...during Guardband Testing" is now marked as a duplicate of this SW354783.
Per comment from JVP (SW357236 submitter), he is attempting a recreate again with the latest Firmware for his Tuleta-L.
We will monitor that attempt at recreate, and reopen this SW354783 if a new recreate is achieved.

This original recreate attempt on Firestone, fsbmc30, may be delayed, as it is currently tied up with debugging a link training issue.

<Automated Update> The severity of defect SW354783 was increased from 2 to 1 because defect SW358210 was rejected as the duplicate of defect SW354783 and the severity of defect SW358210 was higher than 2

The defect submitter, Dion, is out on vacation until 7/11. So that we can make progress on this most recent recreate (SW358210, dup'd to this SW354783),
I request that the defect owner, Luciano/ScreenTeam, reopen SW354783 and continue live debug on the held system from SW358210:

#=#=# 2016-07-05 17:12:28 (CDT) #=#=#
Action = [reopen]

I'm not quite sure how to handle this defect (I'll ping Mark Smith).

Dion's defect
SW358210 : FlashGT STC GA3: capiredp01: TMF timed out and Unable to handle kernel paging request before system drops into xmon debugger, was running HTX for superpipe with 1600 virtual luns across 4 FlashGT NVME cards

was just dup'd to this one.

That system is currently in the XMON debugger and can be debugged to 1) verify it is the same issue and 2) maybe try to find the root cause (his defect can be re-opened if it is not the same issue).
#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#
Not able to look at SW358210.
Looking into the capiredp01 machine.
Machine details:

FSP: capiredfsp.aus.stglabs.ibm.com (dev/FipSdev)
Partition: capiredp01.aus.stglabs.ibm.com
IPMI console: ipmitool -I lanplus -H capiredfsp.aus.stglabs.ibm.com -P abc123 sol activate

The failure on "capiredfsp" seems to be the same as the one reported in this bug.
The hxesurelock process segfaulted and the kernel crashed
while generating the core dump.

cde00 (<email address hidden>) added native attachment /tmp/AIXOS05866176/dmesg_backtrace_capiredfsp on 2016-07...

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-08-03 16:32 EDT-------
Anitra,

Any progress on this one? Were we able to verify if it is fixed or the same as the new bug that has been opened?

bugproxy (bugproxy) wrote : kern.log

Default Comment by Bridge

bugproxy (bugproxy) wrote : dmesg, backtrace - capiredfsp

Default Comment by Bridge

bugproxy (bugproxy) wrote : dmesg, backtrace - surelock02p03

Default Comment by Bridge

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-11-29 14:29 EDT-------
==== State: Verify by: anitrap on 29 November 2017 13:18:44 ====

I have accepted my verify records based on results from other testers. An official fix was not available when I had this hardware. There are still other verify records open (we used a workaround to avoid seeing the issue... seems like this bug is for Ubuntu only?...)
