NVMe device driver failing w/nvme-user and htx tools

Bug #1442186 reported by bugproxy
Affects: linux (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Leann Ogasawara

Bug Description

---Problem Description---
The NVMe device driver running with a Samsung SM1715 card fails (hits an EEH error) with the nvme-user and htx tools.

Machine Type = Tuleta P8

---Steps to Reproduce---
1. Install the nvme-user tools package and run the nvme_rw tool
or
2. Install the htx I/O exerciser tool and run htx

---uname output---
Linux ubuntu1504 3.18.0-12-generic #13-Ubuntu SMP Thu Jan 29 13:44:26 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux

EEH error in kernel log after running HTX:

[503195.365335] EEH: Frozen PE#2 on PHB#4 detected
[503195.365346] EEH: PE location: U78CB.001.D123456-P1-C5 , PHB location: U78CB.001.D123456-P1-C33
[503195.366226] EEH: This PCI device has failed 1 times in the last hour
[503195.366227] EEH: Notify device drivers to shutdown
[503195.366233] EEH: Collect temporary log
[503195.366315] EEH: of node=/pciex@3fffe42000000/pci@0/mass-storage@0
[503195.366321] EEH: PCI device/vendor: a820144d
[503195.366327] EEH: PCI cmd/status register: 00100546
[503195.366329] EEH: PCI-E capabilities and status follow:
[503195.366356] EEH: PCI-E 00: 00024010 10008fe1 0000202e 00416c43
[503195.366372] EEH: PCI-E 10: 10430040 00000000 00000000 00000000
[503195.366374] EEH: PCI-E 20: 00000000
[503195.366375] EEH: PCI-E AER capability register set follows:
[503195.366396] EEH: PCI-E AER 00: 18020001 00000000 00400000 00440010
[503195.366413] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000
[503195.366429] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[503195.366435] EEH: PCI-E AER 30: 00000000 00000000
[503195.366438] PHB3 PHB#4 Diag-data (Version: 1)
[503195.366440] brdgCtl: 00000002
[503195.366442] UtlSts: 00200000 00000000 00000000
[503195.366444] RootSts: 00000020 00400000 f0430048 00100147 00000000
[503195.366446] PhbSts: 0000001c00000000 0000001c00000000
[503195.366448] Lem: 0000000004000000 42498e327f502eae 0000000000000000
[503195.366450] InAErr: 0000000000004000 0000000000004000 0000000000000000 0000000000000000
[503195.366452] PE[ 2] A/B: 8000302500000000 8000000000000000
[503195.366454] EEH: Reset with hotplug activity
[503195.367741] nvme 0004:01:00.0: Cancelling I/O 0 QID 57
[503199.411764] EEH: Sleep 5s ahead of complete hotplug
[503204.415815] pci 0004:01:00.0: [144d:a820] type 00 class 0x010802
[503204.415874] pci 0004:01:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[503204.423731] pci 0004:01:00.0: BAR 0: assigned [mem 0x3ff000000000-0x3ff000003fff 64bit]
[503204.423970] nvme 0004:01:00.0: enabling device (0140 -> 0142)
[503204.423998] nvme 0004:01:00.0: Using 64-bit DMA iommu bypass
[503207.914466] nvme1n1: unknown partition table
[503207.914643] EEH: Notify device driver to resume

The PHB3 diag-data indicates a DMA write to PCI bus address 0x0, which has no corresponding valid TCE entries. As far as I can tell, there are two possibilities, and (A) is likely the case.

(A) Since an MSI/MSI-X interrupt is essentially a DMA write transaction, the PCI bus address 0x0 would be an MSI/MSI-X message address. The address 0x0 is obviously invalid; it indicates a spurious MSI/MSI-X interrupt whose message address was never populated and written to the hardware.

(B) We really have a DMA write transaction to 0x0 and the TCE entries were never populated and built. However, from the kernel log we're using the direct DMA window, so a valid DMA address shouldn't be less than (0x1ul << 59).
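A hedged sketch of the sanity check implied by (B), assuming the direct (bypass) DMA window base of 1ul << 59 stated above; the helper name and constant are illustrative, not kernel code:

/* Classify a faulting PCI bus address from PHB3 diag-data, assuming
 * the direct (bypass) DMA window base of 1ul << 59 described above. */
#include <stdbool.h>
#include <stdint.h>

#define DMA_BYPASS_BASE (1ULL << 59)

static bool is_plausible_bypass_dma(uint64_t bus_addr)
{
        /* Under the bypass window every valid DMA address carries the
         * window base, so a write to 0x0 cannot be ordinary DMA and is
         * more likely an unpopulated MSI/MSI-X message address. */
        return bus_addr >= DMA_BYPASS_BASE;
}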

Recap of the PHB3 diag-data
================

[503195.366438] PHB3 PHB#4 Diag-data (Version: 1)
[503195.366440] brdgCtl: 00000002
[503195.366442] UtlSts: 00200000 00000000 00000000
[503195.366444] RootSts: 00000020 00400000 f0430048 00100147 00000000
[503195.366446] PhbSts: 0000001c00000000 0000001c00000000
[503195.366448] Lem: 0000000004000000 42498e327f502eae 0000000000000000
[503195.366450] InAErr: 0000000000004000 0000000000004000 0000000000000000 0000000000000000
[503195.366452] PE[ 2] A/B: 8000302500000000 8000000000000000

William helped run the following command, which triggers an EEH error with the same root cause as we saw before. After that, re-running the command doesn't trigger EEH again:

# sudo dd if=/dev/nvme1n1 of=/dev/null bs=1k count=512
# dmesg | grep EEH.*has\ failed
[ 2377.124104] EEH: This PCI device has failed 1 times in the last hour

I'm not sure whether there is a possibility that the contents of the MSI-X table got overwritten, because the device has only one memory BAR. The top 4KB region of the BAR is reserved for the MSI-X table and the rest is left for MMIO registers. So we probably need to dump the MSI-X table for a check when detecting the EEH error:

# sudo lspci -vv -s 0004:01:00.0
0004:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd \
NVMe SSD Controller 171X (rev 03) (prog-if 02 [NVM Express])
 Subsystem: Samsung Electronics Co Ltd Device a809
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Interrupt: pin A routed to IRQ 512
 Region 0: Memory at 3ff000000000 (64-bit, non-prefetchable) [size=16K]
        :
        :
 Capabilities: [e0] MSI-X: Enable+ Count=129 Masked-
  Vector table: BAR=0 offset=00002000
  PBA: BAR=0 offset=00003000
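For reference, a minimal, untested sketch of one way to dump that MSI-X table from user space by mmap()ing BAR0 through sysfs (run as root). The device path and the 0x2000 table offset come from the lspci output above; this is just an illustration, not necessarily the tool used for the dump later in this thread:

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const char *res = "/sys/bus/pci/devices/0004:01:00.0/resource0";
        int fd = open(res, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* BAR0 is 16KB; the MSI-X table lives at offset 0x2000. */
        volatile uint32_t *bar = mmap(NULL, 0x4000, PROT_READ,
                                      MAP_SHARED, fd, 0);
        if (bar == MAP_FAILED) { perror("mmap"); return 1; }

        for (int i = 0; i < 129; i++) {
                volatile uint32_t *e = bar + 0x2000 / 4 + i * 4;
                printf("MSI-X %3d: addr=%08" PRIx32 "%08" PRIx32
                       " data=%08" PRIx32 " ctrl=%08" PRIx32 "\n",
                       i, e[1], e[0], e[2], e[3]);
        }
        close(fd);
        return 0;
}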

> I'm not sure whether there is a possibility that the contents of the MSI-X
> table got overwritten, because the device has only one memory BAR. The top
> 4KB region of the BAR is reserved for the MSI-X table and the rest is left
> for MMIO registers. So we probably need to dump the MSI-X table for a check
> when detecting the EEH error:
>
> Region 0: Memory at 3ff000000000 (64-bit, non-prefetchable) [size=16K]
> :
> :
> Capabilities: [e0] MSI-X: Enable+ Count=129 Masked-
> Vector table: BAR=0 offset=00002000
> PBA: BAR=0 offset=00003000

Actually, this adapter has a slightly different BAR mapping.

BAR0 + 0x0000 - 0x1FFF: 8KB reserved for controller/device management
BAR0 + 0x2000 - 0x2FFF: 4KB meant for the MSI-X table; this device supports 129 vectors, so 0x2000 - 0x280F (2064 bytes) is used
BAR0 + 0x3000 - 0x3FFF: 4KB meant for the Pending Bit Array (PBA)

I see 128 MSI-X vectors are used/enabled, since this system has 128 CPU threads and the NVMe driver allocates one MSI-X vector per online CPU/completion queue.
The admin queue and the first I/O queue share the same interrupt.

http://www.nvmexpress.org/wp-content/uploads/NVM-Express-1_2-Gold-20141209.pdf
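To help read the raw dump below: per the PCI specification, each MSI-X table entry is 16 bytes, and the dump shows the 64-bit message address followed by a second quadword holding the message data. A C view of the entry layout (the struct name is mine; the field layout is from the spec):

/* One MSI-X table entry as laid out by the PCI spec. */
#include <stdint.h>

struct msix_entry_raw {
        uint32_t msg_addr_lo;   /* Message Address [31:0]  */
        uint32_t msg_addr_hi;   /* Message Address [63:32] */
        uint32_t msg_data;      /* Message Data            */
        uint32_t vector_ctrl;   /* bit 0 = per-vector mask */
};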

I see MSI-X 0 uses message address 0x1000000000000000 with a message payload data value of 0x0000:

0x3FF000002000 0x1000000000000000 0x0000000000000000 -- MSI-X 0
0x3FF000002010 0x1000000000000000 0x0000000000000001
....
0x3FF0000021F0 0x1000000000000000 0x000000000000001F
0x3FF000002200 0x1000000000000200 0x0000000000000000
0x3FF000002210 0x1000000000000200 0x0000000000000001
...
0x3FF0000023F0 0x1000000000000200 0x000000000000001F
0x3FF000002400 0x1000000000000400 0x0000000000000000
0x3FF000002410 0x1000000000000400 0x0000000000000001
...
0x3FF000002600 0x1000000000000600 0x0000000000000000
0x3FF000002610 0x1000000000000600 0x0000000000000001
0x3FF000002620 0x1000000000000600 0x0000000000000002
...
0x3FF0000027F0 0x1000000000000600 0x000000000000001F ---> MSI-X 127
0x3FF000002800 0x0000000000000000 0x0000000100000000 ---> MSI-X 128 (129th) and not enabled by the driver

Do you see any implementation issue with supporting 129 MSI-X vectors and having the MSI-X table at BAR0 + offset 0x2000?

Also, the MSI-X table seems to be intact after EEH recovery.

The MSI-X table looks good except for MSI-X 128 (the 129th entry), which shouldn't have 0x0 as its message address. Once that address is sent to the root port, it triggers the EEH error caused by the failing DMA address 0x0.
Could you please clarify when you dumped the MSI-X table? If you dumped it before any data transfer to the device, I would believe MSI-X interrupt #128 was simply never populated properly.

I installed a kernel built from 4.0-rc3, with a 4KB system memory page size enabled, on William's machine. No problems were found after 5 hours of HTX testing. As discussed with Murali: IBM Power uses a 64KB system page size by default, which breaks the NVMe driver's calculation of the DMA addresses written to the device. More details can be found in drivers/block/nvme-core.c::nvme_setup_prps(). I believe Murali is working on and testing a fix, and hopefully it will land soon.
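For context, a minimal sketch of the sizing issue, not the actual patch: PRP entries must be counted in units of the controller's memory page size (typically 4KB, set via CC.MPS), not the host PAGE_SIZE, and on ppc64le the two differ. The function and parameter names below are illustrative:

/* Number of PRP entries needed for a transfer, counted in device
 * pages (e.g. 4KB) rather than host pages (64KB on IBM Power).
 * Sizing this by the host PAGE_SIZE miscounts the entries and
 * corrupts the PRP list that nvme_setup_prps() builds. */
#include <stddef.h>

static size_t nvme_prp_entries(size_t xfer_len, size_t offset,
                               size_t dev_page_size)
{
        /* Bytes that fit in the first (possibly partial) device page. */
        size_t first = dev_page_size - (offset & (dev_page_size - 1));

        if (xfer_len <= first)
                return 1;
        xfer_len -= first;
        return 1 + (xfer_len + dev_page_size - 1) / dev_page_size;
}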

The attached patch works for 3.19.x kernels or later but won't apply on 3.18.x-based kernels. Please update your system to the latest Ubuntu 15.04, which uses 3.19.x, and then apply the patch on that kernel. nvme_setup_prps() in 3.18.x is different and would need to change quite a bit; let's see whether a backport is worth it.

This patch works as expected. HTX has been running for over an hour without any miscompare; it typically fails right away without this fix.

Revision history for this message
bugproxy (bugproxy) wrote : proposed fix #2

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-121568 severity-high targetmilestone-inin1504
Luciano Chavez (lnx1138)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: nobody → Taco Screen team (taco-screen-team)
status: New → Confirmed
Revision history for this message
Breno Leitão (breno-leitao) wrote :

Brian,

Just to confirm: the patch has not been accepted yet, and we will wait until it is accepted upstream before including it in Ubuntu.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2015-04-10 12:38 EDT-------
The patch has been accepted upstream in the subsystem maintainer's tree, and will be pushed to Linus in the next merge window.

http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-next&id=f137e0f15101c0a0ff8a5d956042e50247f1fef9

Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Leann Ogasawara (leannogasawara)
importance: Undecided → High
status: Confirmed → In Progress
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-14.14

---------------
linux (3.19.0-14.14) vivid; urgency=low

  [ Upstream Kernel Changes ]

  * powerpc: fix memory corruption by pnv_alloc_idle_core_states
    - LP: #1442678
  * nvme: Fix PRP list calculation for non-4k system page size
    - LP: #1442186
  * powerpc/perf: Cap 64bit userspace backtraces to PERF_MAX_STACK_DEPTH
    - LP: #1442180
 -- Leann Ogasawara <email address hidden> Fri, 10 Apr 2015 11:40:37 -0700

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2015-11-10 16:44 EDT-------
Talked to Bill. He has verified the fix.
