APST quirk needed for Intel NVMe

Bug #1686592 reported by Kai-Heng Feng on 2017-04-27
This bug affects 5 people
Affects          Status  Importance  Assigned to  Milestone
linux (Ubuntu)           Medium      Unassigned
Yakkety                  Undecided   Unassigned
Zesty                    Undecided   Unassigned

Bug Description

[Impact]
Intel NVMe drives stop working after some disk I/O.

[Test Case]
Use the system for a while. Any disk I/O may cause the Intel NVMe drive to stop operating.

[Regression Potential]
None. It only applies to limited Intel NVMe devices.

Two users reported the issue on Intel NVMe in [1] (comments #34 and #35).

Filing a new bug so that the original bug report stays focused on the Dell & Samsung combination.

[1] https://bugs.launchpad.net/bugs/1678184

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1686592

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
sridhar basam (sri-7) wrote :

Output from nvme id-ctrl on a Dell 9360 with Intel 600p nvme drive.

NVME Identify Controller:
vid : 0x8086
ssvid : 0x8086
sn : BTPY63850F281P0H
mn : INTEL SSDPEKKW010T7
fr : PSF104C
rab : 6
ieee : 5cd2e4
cmic : 0
mdts : 5
cntlid : 1
ver : 10200
rtd3r : 249f0
rtd3e : 13880
oaes : 0
oacs : 0x6
acl : 4
aerl : 7
frmw : 0x12
lpa : 0x3
elpe : 63
npss : 4
avscc : 0
apsta : 0x1
wctemp : 343
cctemp : 353
mtfa : 20
hmpre : 0
hmmin : 0
tnvmcap : 0
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1e
fuses : 0
fna : 0x4
vwc : 0x1
awun : 0
awupf : 0
nvscc : 0
acwu : 0
sgls : 0
ps 0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

sridhar basam (sri-7) wrote :

Booting with the kernel option nvme_core.default_ps_max_latency_us=11000 makes a difference for me. I no longer see filesystem crashes with this option on the 4.10.x and 4.8.0-49 kernels. Prior to this change, the filesystem crashed soon after the boot process finished.
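To confirm that the option actually reached the running kernel, one can look for it on the kernel command line. The following is a minimal Python sketch; the function names are hypothetical and only the parameter name comes from the comment above.

```python
from typing import Optional

PARAM = "nvme_core.default_ps_max_latency_us"


def ps_max_latency_from_cmdline(cmdline: str) -> Optional[int]:
    """Return the value of the APST latency option, or None if it is absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == PARAM and sep:
            return int(value)
    return None


def read_running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path, encoding="ascii", errors="replace") as handle:
        return handle.read().strip()


# Example with a made-up command line; only the last token matters here.
example = "BOOT_IMAGE=/vmlinuz ro quiet nvme_core.default_ps_max_latency_us=11000"
print(ps_max_latency_from_cmdline(example))
```

On a live system, ps_max_latency_from_cmdline(read_running_cmdline()) should return 11000 if the option was applied. The value should also be visible under /sys/module/nvme_core/parameters/ on kernels that expose the module parameter there.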

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Incomplete → Triaged
tags: added: kernel-da-key
Javi (javigs) wrote :

Booting with 4.8.0-49 or 4.10.0-20 crashes my system: Dell 9550 with PM951 NVMe SAMSUNG 512GB

Now I'm working with 4.8.0-46 without crashes

sridhar basam (sri-7) wrote :

@javigs Can you post the output from nvme id-ctrl?

Josep Torra (adn770) wrote :

This http://paste.ubuntu.com/24467585/ is the output on the Skull Canyon systems we have here.

sridhar basam (sri-7) wrote :

@adn770 can you try to boot the newer kernels with the boot option nvme_core.default_ps_max_latency_us=11000

That should disable the last state.
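Why 11000 would drop only the last state can be checked against the power state table of the Intel 600p posted earlier in this thread. The Python sketch below assumes that a non-operational state is usable only when its entry latency plus exit latency fits within the limit. That rule is an assumption chosen because it matches the statement above; the exact rule depends on the kernel version, and the names here are hypothetical.

```python
from typing import List, NamedTuple


class PowerState(NamedTuple):
    index: int
    operational: bool
    enlat_us: int  # entry latency from nvme id-ctrl
    exlat_us: int  # exit latency from nvme id-ctrl


# Values copied from the Intel 600p (INTEL SSDPEKKW010T7) output in this thread.
INTEL_600P = [
    PowerState(0, True, 5, 5),
    PowerState(1, True, 30, 30),
    PowerState(2, True, 30, 30),
    PowerState(3, False, 10000, 300),
    PowerState(4, False, 2000, 10000),
]


def usable_idle_states(states: List[PowerState], max_latency_us: int) -> List[int]:
    """Non-operational states whose entry + exit latency fits the limit (assumed rule)."""
    return [
        s.index
        for s in states
        if not s.operational and s.enlat_us + s.exlat_us <= max_latency_us
    ]


print(usable_idle_states(INTEL_600P, 11000))  # ps 3 totals 10300 us, ps 4 totals 12000 us
```

Under this assumed rule, a limit of 11000 keeps ps 3 (10300 us in total) and excludes ps 4 (12000 us in total). If a kernel compares only the exit latency, ps 4 (10000 us) would still pass, so the result is worth confirming on the kernel in use.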

Javi (javigs) wrote :

With 4.10.0-20 kernel:

NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : S29PNXAH302364
mn : PM951 NVMe SAMSUNG 512GB
fr : BXV77D0Q
rab : 2
ieee : 002538
cmic : 0
mdts : 5
cntlid : 1
ver : 0
rtd3r : 0
rtd3e : 0
oaes : 0
oacs : 0x17
acl : 7
aerl : 3
frmw : 0x6
lpa : 0
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 0
cctemp : 0
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 0
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
subnqn :
ps 0 : mp:6.00W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:4.20W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.10W operational enlat:100 exlat:100 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0700W non-operational enlat:500 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2000 exlat:22000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

Thanks

Javi (javigs) wrote :

Sorry, to be more precise than in my first comment: the crash does not happen while booting. The system actually crashed in less than an hour of use with the 4.8.0-49 or 4.10.0-20 kernels.

Kai-Heng Feng (kaihengfeng) wrote :

Anyone who's using Intel NVMe, please try the kernel in http://people.canonical.com/~khfeng/lp1686592/

sridhar basam (sri-7) wrote :

I am unable to boot the above kernel reliably on my laptop. Both times I booted it, my system hung and I had to do a hard reset (force power down).

I am not able to tell if the new kernel fixes my issue. When the system hangs, keyboard input doesn't work. The first time, this happened immediately after the kernel booted. The second time, the system got to the Unity login screen and I was able to log in, but it hung soon after I launched a terminal to test.

Kai-Heng Feng (kaihengfeng) wrote :

Can you try again with http://people.canonical.com/~khfeng/lp1686592-2/?

If the issue persists, please attach dmesg here.

Andy Lutomirski (luto-mit) wrote :

@kaihengfeng, I'm not sure what your test kernel is, but could you build one with the two patches here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=nvme/power&id=1115e17962c597d8e6dd140d903a51a58d0ec2c0

Those delete the earlier quirk and replace the APST table computation with something closer to what Intel RSTe does.

Thanks!

Kai-Heng Feng (kaihengfeng) wrote :

Hi guys, can you try the kernel in [1] and remove the nvme kernel parameter? I lightly tested it on the Precision 5510 I have at hand and haven't noticed any errors so far.

It should also work on Intel NVMe.

[1] http://people.canonical.com/~khfeng/apst-rste-z/

Hi,

Within our company we have multiple Intel NUCs with Intel NVMe drives that regularly crash. Programs throw errors and hang, and the filesystem gets corrupted. On (re)boot the system drops into initramfs, where a "fsck -y /dev/nvme0n1p2" fixes the issues (for now).

As most issues happen after a period of inactivity I suspect APST. I have deployed the custom kernel on two devices and as the corruption occurs every day I hope to update this thread soon.

If you suspect that this is a different issue, please comment and I will open a new bug.

The kernel seems to resolve the issue, both machines haven't crashed since I installed it.

Do you have any idea when this patch will be included in the main repo?

Kai-Heng Feng (kaihengfeng) wrote :

As soon as the patch author (Andy Lutomirski) gets the change into the nvme/block/linus tree, I'll cherry-pick the patches into Ubuntu's kernel.

Andy Lutomirski (luto-mit) wrote :

For those of you with problematic Intel devices, could you post the relevant line from 'lspci -nn'? It looks like there are a few known firmware issues with some Intel SSDs and a fix is in the works. Meanwhile, I want to quirk the correct set of devices.

(I'm hoping they're all 8086:f1a5)
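For anyone unsure which line is relevant, the NVMe controller is the entry with PCI class [0108]. The following Python sketch pulls the vendor:device pair out of 'lspci -nn' text. The function name and regular expression are illustrative assumptions, and the sample lines are in the format lspci prints in this thread.

```python
import re
from typing import List, Tuple

NVME_CLASS = "0108"  # "Non-Volatile memory controller" class code shown by lspci -nn

# slot, first [class] followed by ':', and the last [vendor:device] pair on the line
LINE_RE = re.compile(
    r"^(?P<slot>\S+) .*?\[(?P<cls>[0-9a-f]{4})\]: "
    r".*\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def nvme_ids(lspci_text: str) -> List[Tuple[str, str, str]]:
    """Return (slot, vendor, device) for every NVMe controller in lspci -nn output."""
    found = []
    for line in lspci_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("cls") == NVME_CLASS:
            found.append((match.group("slot"), match.group("vendor"), match.group("device")))
    return found


sample = """\
3a:00.0 Network controller [0280]: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter [168c:003e] (rev 32)
3c:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:f1a5] (rev 03)
"""
print(nvme_ids(sample))
```

A device matches the ID mentioned above when the vendor is 8086 and the device is f1a5.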

Kai-Heng Feng (kaihengfeng) wrote :

@Onno
Do you still encounter this issue on the kernel in comment #14?
Another user reported that the issue may still occur, albeit at a much lower rate.

sridhar basam (sri-7) wrote :

00:00.0 Host bridge [0600]: Intel Corporation Device [8086:5904] (rev 02)
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:5916] (rev 02)
00:04.0 Signal processing controller [1180]: Intel Corporation Skylake Processor Thermal Subsystem [8086:1903] (rev 02)
00:14.0 USB controller [0c03]: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller [8086:9d2f] (rev 21)
00:14.2 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Thermal subsystem [8086:9d31] (rev 21)
00:15.0 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Serial IO I2C Controller [8086:9d60] (rev 21)
00:15.1 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Serial IO I2C Controller [8086:9d61] (rev 21)
00:16.0 Communication controller [0780]: Intel Corporation Sunrise Point-LP CSME HECI [8086:9d3a] (rev 21)
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1)
00:1c.4 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port [8086:9d14] (rev f1)
00:1c.5 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port [8086:9d15] (rev f1)
00:1d.0 PCI bridge [0604]: Intel Corporation Device [8086:9d18] (rev f1)
00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:9d58] (rev 21)
00:1f.2 Memory controller [0580]: Intel Corporation Sunrise Point-LP PMC [8086:9d21] (rev 21)
00:1f.3 Audio device [0403]: Intel Corporation Device [8086:9d71] (rev 21)
00:1f.4 SMBus [0c05]: Intel Corporation Sunrise Point-LP SMBus [8086:9d23] (rev 21)
3a:00.0 Network controller [0280]: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter [168c:003e] (rev 32)
3b:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader [10ec:525a] (rev 01)
3c:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:f1a5] (rev 03)

Output from lspci -nn

Kai-Heng Feng (kaihengfeng) wrote :

For Intel NVMe, please try this zesty kernel:
http://people.canonical.com/~khfeng/lp1686592-170525/

It added two more patches from Andy:
[1/2] nvme: Wait at least 6000ms before entering the deepest idle state
[2/2] nvme: Quirk APST on Intel 600P/P3100 devices

@kaihengfeng I haven't seen an issue on either device with the kernel from #14. I just installed a third device with the kernel from #21.

If you need any logging or device specs please let me know.

Kai-Heng Feng (kaihengfeng) wrote :

Hi, an Intel developer [1] said a firmware update [2] can solve the issue.
After updating the firmware, can you use the kernel from Ubuntu's repo instead of the one I compiled?

[1] http://lists.infradead.org/pipermail/linux-nvme/2017-May/010560.html
[2] https://downloadcenter.intel.com/download/26491?v=t

We had firmware 2.2.0 installed on the two machines; that did not help.

Changelog seems to contain fixes that might help:

> This firmware versions contain fixes for the following issues:
> • Drive hangs intermittently after Format NVM command.
> • Format NVM command occasionally failing with PCIe ASPM enabled.
> • Data miscompare caused by intermittent data corruption during heavy write workload with small file transfer size.
> • Incorrect drive behavior for command with Forced Unit Access setting.

Will try to convince my colleague to downgrade his workstation for testing.

Forgot to mention that the changelog is for version 2.2.1.

Kai-Heng Feng (kaihengfeng) wrote :

Hmm, does 2.2.1 help?

description: updated

Two machines on stock kernel with firmware 2.2.1, no issues so far.

Two different machines are running the kernel from #21 with 2.2.0, no problems as well.

Should be able to tell you next week if we have encountered issues.

Seth Forshee (sforshee) on 2017-06-05
Changed in linux (Ubuntu):
status: Triaged → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: New → Fix Committed
Changed in linux (Ubuntu Zesty):
status: New → Fix Committed

One of the machines running firmware 2.2.1 with the stock kernel crashed overnight.

Machines with the kernel from #21 haven't had any issues since installation (three weeks or so)

Kai-Heng Feng (kaihengfeng) wrote :

Thanks for the info, Onno.

So the firmware didn't fix the issue. The kernel fix in #21 will be in a newer kernel.

Hope to see the new kernel soon. Thanks for the work Kai-Heng!

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-zesty' to 'verification-done-zesty'. If the problem still exists, change the tag 'verification-needed-zesty' to 'verification-failed-zesty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-zesty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-58.63

---------------
linux (4.8.0-58.63) yakkety; urgency=low

  * linux: 4.8.0-58.63 -proposed tracker (LP: #1700533)

  * CVE-2017-1000364
    - Revert "UBUNTU: SAUCE: mm: Only expand stack if guard area is hit"
    - Revert "mm: do not collapse stack gap into THP"
    - Revert "mm: enlarge stack guard gap"
    - mm: vma_adjust: remove superfluous confusing update in remove_next == 1 case
    - mm: larger stack guard gap, between vmas
    - mm: fix new crash in unmapped_area_topdown()
    - Allow stack to grow up to address space limit

linux (4.8.0-57.62) yakkety; urgency=low

  * linux: 4.8.0-57.62 -proposed tracker (LP: #1699035)

  * CVE-2017-1000364
    - SAUCE: mm: Only expand stack if guard area is hit

  * CVE-2017-7374
    - fscrypt: remove broken support for detecting keyring key revocation

  * CVE-2017-1000363
    - char: lp: fix possible integer overflow in lp_setup()

  * CVE-2017-9242
    - ipv6: fix out of bound writes in __ip6_append_data()

  * CVE-2017-9075
    - sctp: do not inherit ipv6_{mc|ac|fl}_list from parent

  * CVE-2017-9074
    - ipv6: Prevent overrun when parsing v6 header options

  * CVE-2017-9076
    - ipv6/dccp: do not inherit ipv6_mc_list from parent

  * CVE-2017-9077
    - ipv6/dccp: do not inherit ipv6_mc_list from parent

  * CVE-2017-8890
    - dccp/tcp: do not inherit mc_list from parent

  * extend-diff-ignore should use exact matches (LP: #1693504)
    - [Packaging] exact extend-diff-ignore matches

  * APST quirk needed for Intel NVMe (LP: #1686592)
    - nvme: Quirk APST on Intel 600P/P3100 devices

  * regression: the 4.8 hwe kernel does not create the
    /sys/block/*/device/enclosure_device:* symlinks (LP: #1691899)
    - scsi: ses: Fix SAS device detection in enclosure

  * datapath: Add missing case OVS_TUNNEL_KEY_ATTR_PAD (LP: #1676679)
    - openvswitch: Add missing case OVS_TUNNEL_KEY_ATTR_PAD

  * connection flood to port 445 on mounting cifs volume under kernel
    (LP: #1686099)
    - cifs: Do not send echoes before Negotiate is complete

  * Support IPMI system interface on Cavium ThunderX (LP: #1688132)
    - i2c: octeon: Rename driver to prepare for split
    - i2c: octeon: Split the driver into two parts
    - [Config] CONFIG_I2C_THUNDERX=m
    - i2c: thunderx: Add i2c driver for ThunderX SOC
    - i2c: thunderx: Add SMBUS alert support
    - i2c: octeon,thunderx: Move register offsets to struct
    - i2c: octeon: Sort include files alphabetically
    - i2c: octeon: Use booleon values for booleon variables
    - i2c: octeon: thunderx: Add MAINTAINERS entry
    - i2c: octeon: Fix set SCL recovery function
    - i2c: octeon: Avoid sending STOP during recovery
    - i2c: octeon: Fix high-level controller status check
    - i2c: octeon: thunderx: TWSI software reset in recovery
    - i2c: octeon: thunderx: Remove double-check after interrupt
    - i2c: octeon: thunderx: Limit register access retries
    - i2c: thunderx: Enable HWMON class probing

  * CVE-2017-5577
    - drm/vc4: Return -EINVAL on the overflow checks failing.

  * Merlin SGMII fail on Ubuntu Xenial HWE kernel (LP: #1686305)
    - net: phy: marvell: fix Marvell 88E1512 u...

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.10.0-26.30

---------------
linux (4.10.0-26.30) zesty; urgency=low

  * linux: 4.10.0-26.30 -proposed tracker (LP: #1700528)

  * CVE-2017-1000364
    - Revert "UBUNTU: SAUCE: mm: Only expand stack if guard area is hit"
    - Revert "mm: do not collapse stack gap into THP"
    - Revert "mm: enlarge stack guard gap"
    - mm: larger stack guard gap, between vmas
    - mm: fix new crash in unmapped_area_topdown()
    - Allow stack to grow up to address space limit

linux (4.10.0-25.29) zesty; urgency=low

  * linux: 4.10.0-25.29 -proposed tracker (LP: #1699028)

  * CVE-2017-1000364
    - SAUCE: mm: Only expand stack if guard area is hit

  * CVE-2017-9074
    - ipv6: Prevent overrun when parsing v6 header options
    - ipv6: Check ip6_find_1stfragopt() return value properly.

  * [Zesty] QDF2400 ARM64 server - NMI watchdog: BUG: soft lockup - CPU#8 stuck
    for 22s! (LP: #1680549)
    - iommu/dma: Stop getting dma_32bit_pfn wrong
    - iommu/dma: Implement PCI allocation optimisation
    - iommu/dma: Convert to address-based allocation
    - iommu/dma: Clean up MSI IOVA allocation
    - iommu/dma: Plumb in the per-CPU IOVA caches
    - iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range

  * Zesty update to 4.10.17 stable release (LP: #1692898)
    - xen: adjust early dom0 p2m handling to xen hypervisor behavior
    - target: Fix compare_and_write_callback handling for non GOOD status
    - target/fileio: Fix zero-length READ and WRITE handling
    - iscsi-target: Set session_fall_back_to_erl0 when forcing reinstatement
    - usb: xhci: bInterval quirk for TI TUSB73x0
    - usb: host: xhci: print correct command ring address
    - USB: serial: ftdi_sio: add device ID for Microsemi/Arrow SF2PLUS Dev Kit
    - USB: Proper handling of Race Condition when two USB class drivers try to
      call init_usb_class simultaneously
    - USB: Revert "cdc-wdm: fix "out-of-sync" due to missing notifications"
    - staging: vt6656: use off stack for in buffer USB transfers.
    - staging: vt6656: use off stack for out buffer USB transfers.
    - staging: gdm724x: gdm_mux: fix use-after-free on module unload
    - staging: wilc1000: Fix problem with wrong vif index
    - staging: comedi: jr3_pci: fix possible null pointer dereference
    - staging: comedi: jr3_pci: cope with jiffies wraparound
    - usb: misc: add missing continue in switch
    - usb: gadget: legacy gadgets are optional
    - usb: Make sure usb/phy/of gets built-in
    - usb: hub: Fix error loop seen after hub communication errors
    - usb: hub: Do not attempt to autosuspend disconnected devices
    - x86/boot: Fix BSS corruption/overwrite bug in early x86 kernel startup
    - selftests/x86/ldt_gdt_32: Work around a glibc sigaction() bug
    - x86, pmem: Fix cache flushing for iovec write < 8 bytes
    - um: Fix PTRACE_POKEUSER on x86_64
    - perf/x86: Fix Broadwell-EP DRAM RAPL events
    - KVM: x86: fix user triggerable warning in kvm_apic_accept_events()
    - KVM: arm/arm64: fix races in kvm_psci_vcpu_on
    - arm64: KVM: Fix decoding of Rt/Rt2 when trapping AArch32 CP accesses
    - block: fix blk_integrity_register to use templ...

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released