Comment 28 for bug 1667239

------- Comment From <email address hidden> 2017-07-04 09:03 EDT-------
This CMVC defect is being cancelled by the CDE Bridge because the corresponding CQ Defect [SW354783] was transferred out of the bridge domain.
Here are the additional details:
New Subsystem = ppc_triage
New Release = unspecified
New Component = ubuntu_linux
New OwnerInfo = Chavez, Luciano (<email address hidden>)
To continue tracking this issue, please follow CQ defect [SW354783].

Opened defect SW355478 for the new failure to check whether it is the same issue. I set it to severity 1 because the system is in xmon right now and that is preventing further testing.

As I mentioned earlier, the failure could be related to this defect.

For this defect...

The "Oops: Kernel access of bad area, sig: 11 [#1]" in the logs happens during the HTX run.

On the reboot (which happened ~30 minutes after the first error), I saw the partition hang/crash. I had to use ipmitool to power down the system.
The current xmon crash in SW355478 / 142348 is different from the
one being tracked in this bug. Will wait for a recreate of the original issue.

The FlashGT HST team still needs to recreate this issue.

SW357236 "HTX fail during superpipe 128 per LUN testing...during Guardband Testing" is now marked as a duplicate of this SW354783.
Per a comment from JVP (the SW357236 submitter), he is attempting another recreate with the latest firmware for his Tuleta-L.
We will monitor that recreate attempt and reopen this SW354783 if a new recreate is achieved.

The original recreate attempt on Firestone, fsbmc30, may be delayed, as that system is currently tied up with debugging a link training issue.

<Automated Update> The severity of defect SW354783 was increased from 2 to 1 because defect SW358210 was rejected as the duplicate of defect SW354783 and the severity of defect SW358210 was higher than 2

The defect submitter, Dion, is out on vacation until 7/11. So that we can make progress on this most recent recreate (SW358210, dup'd to this SW354783),
I request that the defect owner, Luciano/ScreenTeam, please reopen this SW354783 and continue live debug on the held system from SW358210:

#=#=# 2016-07-05 17:12:28 (CDT) #=#=#
Action = [reopen]

I'm not quite sure how to handle this defect (I'll ping Mark Smith).

Dion's defect
SW358210 : FlashGT STC GA3: capiredp01: TMF timed out and Unable to handle kernel paging request before system drops into xmon debugger, was running HTX for superpipe with 1600 virtual luns across 4 FlashGT NVME cards

was just dup'd to this one.

That system is currently in the xmon debugger and can be debugged to 1) verify it is the same issue and 2) maybe try to find the root cause (his defect can be re-opened if it is not the same issue).
#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#
Not able to look at SW358210.
Looking into the capiredp01 machine.
Machine details:

FSP: capiredfsp.aus.stglabs.ibm.com (dev/FipSdev)
Partition: capiredp01.aus.stglabs.ibm.com
IPMI console: ipmitool -I lanplus -H capiredfsp.aus.stglabs.ibm.com -P abc123 sol activate

The failure on "capiredfsp" seems to be the same as the one reported in this bug.
The hxesurelock process segfaulted and the kernel crashed
while generating the core dump.

cde00 (<email address hidden>) added native attachment /tmp/AIXOS05866176/dmesg_backtrace_capiredfsp on 2016-07-07 06:19:39

Hi Dominic,
Can you please have someone from the kernel team look at this?
The HTX (hxesurelock) process segfaulted and the kernel crashed while
generating the core. Kernel logs are attached to the bug. The machine is sitting in
xmon and available for debug.
(In reply to comment #25)
> Hi Dominic,
> Can you please have some one from kernel team look at this ?
> HTX (hxesurelock) process has segfaulted and kernel has crashed while
> generating core. Attached kernel logs with bug . Machine is sitting in
> xmon and available for debug.

Vipin,

I cannot ssh to capiredfsp.aus.stglabs.ibm.com (dev/FipSdev). Is the machine still in xmon?

(In reply to comment #26)
> Vipin,
> I cannot ssh to capiredfsp.aus.stglabs.ibm.com (dev/FipSdev). Is the machine
> still in xmon?

Yes, it's still sitting in xmon. You can open a console via IPMI.
Please see comment 22 for machine access details.

Just wanted to point out the send_tmf timeout (at the end of the kernel log) before the crash, even though I am not sure it is the cause. The system is in xmon. Please advise if additional debug data needs to be collected. Thanks.

Snippet at the end of the kernel log:

[ 8801.190528] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8806.190383] cxlflash 0007:00:00.0: send_tmf: TMF timed out!
[ 8816.507485] hxesurelock[14180]: unhandled signal 11 at 0000000000000024 nip 00003fff852c2ee8 lr 00003fff852c2938 code 30001
[ 8816.511368] hxesurelock[13501]: unhandled signal 11 at 0000000000000024 nip 00003fff890b2ee8 lr 00003fff890b2938 code 30001
[ 8816.526807] Unable to handle kernel paging request for data at address 0x0000000c
[ 8816.526928] Faulting instruction address: 0xc00000000035e2b0
[ 8816.530233] Unable to handle kernel paging request for data at address 0x0000000c
[ 8816.530596] Faulting instruction address: 0xc00000000035e2b0
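
To put a number on how close the TMF timeouts are to the crash, the dmesg timestamps above can be subtracted directly. The following is only a minimal userspace sketch; the helper names parse_dmesg_ts() and dmesg_gap() are made up for illustration and are not part of any tool used in this bug.

```c
#include <stdio.h>

/*
 * Parse the leading "[ seconds.micros]" timestamp of a dmesg line.
 * Returns 0 and stores the value in *ts on success, -1 if the line
 * does not start with a timestamp. (Hypothetical helper.)
 */
int parse_dmesg_ts(const char *line, double *ts)
{
    return sscanf(line, "[ %lf]", ts) == 1 ? 0 : -1;
}

/*
 * Compute the seconds elapsed between two dmesg lines
 * (later minus earlier). Returns 0 on success, -1 if either
 * line has no parsable timestamp. (Hypothetical helper.)
 */
int dmesg_gap(const char *earlier, const char *later, double *gap)
{
    double t0, t1;

    if (parse_dmesg_ts(earlier, &t0) != 0 ||
        parse_dmesg_ts(later, &t1) != 0)
        return -1;

    *gap = t1 - t0;
    return 0;
}
```

Fed the last "TMF timed out!" line and the first hxesurelock "unhandled signal 11" line, this gives a gap of roughly 10.3 seconds. The two TMF timeout lines are themselves about 5 seconds apart, which is consistent with the 5000 ms wait in send_tmf(). That shows ordering only; it does not show that the timeout caused the crash.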

Snippet of the send_tmf() code:
        cmd_checkin(cmd);
        spin_lock_irqsave(&cfg->tmf_slock, lock_flags);
        cfg->tmf_active = false;
        spin_unlock_irqrestore(&cfg->tmf_slock, lock_flags);
        goto out;
    }

    spin_lock_irqsave(&cfg->tmf_slock, lock_flags);
    to = msecs_to_jiffies(5000);
    to = wait_event_interruptible_lock_irq_timeout(cfg->tmf_waitq,
                                                   !cfg->tmf_active,
                                                   cfg->tmf_slock,
                                                   to);
    if (!to) {
        cfg->tmf_active = false;
        dev_err(dev, "%s: TMF timed out!\n", __func__);
        rc = -1;
    }
    spin_unlock_irqrestore(&cfg->tmf_slock, lock_flags);
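
For anyone reading the snippet above: as far as I know from the kernel's wait.h documentation, wait_event_interruptible_lock_irq_timeout() has three possible outcomes, namely 0 if the timeout elapsed, -ERESTARTSYS if the wait was interrupted by a signal, and the remaining jiffies (positive) if the condition became true. The `if (!to)` check covers only the first one. Below is a small userspace sketch of that classification; classify_wait() and the enum are hypothetical names for illustration, and ERESTARTSYS is defined locally because it is a kernel-internal value.

```c
#include <stdio.h>

/*
 * Kernel-internal errno from include/linux/errno.h. It is not exposed
 * to userspace, so it is defined locally for this sketch.
 */
#define ERESTARTSYS 512

/* Hypothetical names, used only to illustrate the three outcomes. */
enum wait_outcome {
    WAIT_TIMED_OUT,    /* return value 0: timeout elapsed            */
    WAIT_INTERRUPTED,  /* negative value: typically -ERESTARTSYS     */
    WAIT_COMPLETED     /* positive value: jiffies left, condition ok */
};

/* Classify a signed return value of the wait macro. */
enum wait_outcome classify_wait(long ret)
{
    if (ret == 0)
        return WAIT_TIMED_OUT;
    if (ret < 0)
        return WAIT_INTERRUPTED;
    return WAIT_COMPLETED;
}

/* Human-readable label for an outcome, e.g. for logging. */
const char *outcome_name(enum wait_outcome o)
{
    switch (o) {
    case WAIT_TIMED_OUT:
        return "timed out";
    case WAIT_INTERRUPTED:
        return "interrupted by signal";
    default:
        return "completed";
    }
}
```

The log here shows the timed-out path ("TMF timed out!"), so that is the case being hit. Whether the interrupted case plays any role in this failure is not established by anything in this bug.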

Boqun,