LiteOn NVMe issue

Bug #1694596 reported by Kai-Heng Feng
This bug affects 2 people
Affects         Status      Importance   Assigned to     Milestone
linux (Ubuntu)  Won't Fix   Undecided    Kai-Heng Feng

Bug Description

Originally reported at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184/comments/79

The user boots with "nvme_core.default_ps_max_latency_us=6000", but the issue still occurs.

$ sudo nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0x14a4
ssvid : 0x1b4b
sn : TW0XVRV7LOH006AJ09F1
mn : CX2-8B256-Q11 NVMe LITEON 256GB
fr : 48811QD
rab : 0
ieee : 002303
cmic : 0
mdts : 5
cntlid : 1
ver : 10200
rtd3r : f4240
rtd3e : f4240
oaes : 0
oacs : 0x1f
acl : 3
aerl : 3
frmw : 0x14
lpa : 0x2
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 358
cctemp : 368
mtfa : 50
hmpre : 0
hmmin : 0
tnvmcap : 256060514304
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0
vwc : 0x1
awun : 255
awupf : 7
nvscc : 1
acwu : 0
sgls : 0
ps 0 : mp:8.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:4.00W operational enlat:5 exlat:5 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:2.10W operational enlat:5 exlat:5 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

Kai-Heng Feng (kaihengfeng) wrote :

@John
If 6000 doesn't work but 0 works, it means the operational states PS1 and PS2 have an issue.
Can you try the Linux kernel at http://people.canonical.com/~khfeng/apst-rste-z/?

@Andy
Do we need to quirk off LiteOn completely, just like Toshiba?

@Mario
Does this also happen on Windows?

Kai-Heng Feng (kaihengfeng) wrote :

Wait, APST shouldn't be enabled when the value is 6000. It should have the same effect as nvme_core.default_ps_max_latency_us=0.

@John, can you check again?
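
A rough sketch of why 6000 should behave like 0 on this drive. It assumes that APST only idles into non-operational states and that a state qualifies when its entry plus exit latency fits within the limit. That rule is an assumption chosen to match the comment above, not a quote of the kernel code, and `apst_candidates` is a hypothetical helper. The latencies are copied from the `nvme id-ctrl` output in the bug description.

```python
# Power states from the `nvme id-ctrl` output in the bug description.
# Each tuple: (state, is_operational, entry_latency_us, exit_latency_us)
POWER_STATES = [
    (0, True, 0, 0),
    (1, True, 5, 5),
    (2, True, 5, 5),
    (3, False, 5000, 5000),
    (4, False, 50000, 100000),
]


def apst_candidates(max_latency_us, states=POWER_STATES):
    """Return the states APST could idle into under the ASSUMED rule:
    non-operational, and enlat + exlat no larger than the limit."""
    return [
        ps
        for ps, operational, enlat, exlat in states
        if not operational and enlat + exlat <= max_latency_us
    ]


# Compare a few limits, including the two discussed in this thread.
for limit in (0, 6000, 10000, 150000):
    print(limit, apst_candidates(limit))
```

Under that assumption, PS3 needs a limit of at least 10000 us and PS4 at least 150000 us, so both 0 and 6000 leave no state for APST to use.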

John Neffenger (jgneff) wrote :

My NVMe drive failed again yesterday. I no longer think the errors are due to the APST support. I'm starting to think it's either something in the Linux kernel 4.8.0-53 upgrade or simply a problem with the drive hardware or firmware.

The timeline of the errors is shown below. Note that I never set the "default_ps_max_latency_us" parameter to 6000. I only set it to a value of zero.

Parameter "nvme_core.default_ps_max_latency_us" is not defined:

2017-04-04 Upgraded to 4.8.0-46.49
2017-04-24 Upgraded to 4.8.0-49.52 ← Adds NVMe APST support (LP: #1664602)
2017-05-01 Upgraded to 4.8.0-51.54
2017-05-16 Upgraded to 4.8.0-52.55
2017-05-25 Upgraded to 4.8.0-53.56 ← Errors on NVMe drive

Parameter "nvme_core.default_ps_max_latency_us" is set to zero:

2017-05-29 Linux 4.8.0-53.56 (APST disabled) ← Running OK
2017-05-30 Linux 4.8.0-53.56 (APST disabled) ← Running OK
2017-05-31 Linux 4.8.0-53.56 (APST disabled) ← Errors on NVMe drive

At this point, I switched from the Hardware Enablement (HWE) stack back to the original General Availability (GA) stack for the Linux kernel and X server on Ubuntu 16.04.

2017-05-31 Linux 4.4.0-78.90 (GA) ← Running OK so far
2017-06-01 Linux 4.4.0-78.90 (GA) ← Running OK so far

I hope this change will help determine whether the errors are due to the drive itself or something in the Linux 4.8 upgrade. I guess we should close this bug report, or at least rename it.

Thank you,
John

summary:
- LiteOn NVMe APST issue
+ LiteOn NVMe issue

John Neffenger (jgneff) wrote :

I'm adding the actual errors I saw, just for the record.

What happens is that the root file system becomes read-only due to errors on the drive. My root file system is a logical volume on an encrypted partition (LVM with LUKS encryption), mounted and encrypted with the following options:

$ mount | grep ubuntu
/dev/mapper/ubuntu--vg-root on / type ext4 (rw,relatime,errors=remount-ro,data=ordered)

$ sudo cryptsetup status nvme0n1p3_crypt
/dev/mapper/nvme0n1p3_crypt is active and is in use.
  type: LUKS1
  cipher: aes-xts-plain64
  keysize: 512 bits
  device: /dev/nvme0n1p3
  offset: 4096 sectors
  size: 498063360 sectors
  mode: read/write
  flags: discards

When the file system was re-mounted read-only due to errors, I booted Ubuntu 16.04.2 from a USB flash drive and ran "e2fsck" to repair the file system:

ubuntu@ubuntu:~$ sudo e2fsck -pfv /dev/mapper/ubuntu--vg-root
/dev/mapper/ubuntu--vg-root: recovering journal
/dev/mapper/ubuntu--vg-root: Inodes that were part of a corrupted orphan linked list found.

/dev/mapper/ubuntu--vg-root: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
        (i.e., without -a or -p options)

So then I ran the command again without the "-p" option, but with the "-y" option to answer "yes" to all questions, and the command fixed 329 errors:

ubuntu@ubuntu:~$ sudo e2fsck -fvy /dev/mapper/ubuntu--vg-root
e2fsck 1.42.13 (17-May-2015)
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix? yes

[ ... 327 errors not shown ...]

Free inodes count wrong (13486735, counted=13478898).
Fix? yes

/dev/mapper/ubuntu--vg-root: ***** FILE SYSTEM WAS MODIFIED *****

     1045518 inodes used (7.20%, out of 14524416)
        2413 non-contiguous files (0.2%)
         851 non-contiguous directories (0.1%)
             # of inodes with ind/dind/tind blocks: 27/27/27
             Extent depth histogram: 975523/235
    30224290 blocks used (52.03%, out of 58093568)
           0 bad blocks
           6 large files

      838712 regular files
      113346 directories
          67 character device files
          29 block device files
           3 fifos
          28 links
       93308 symbolic links (69609 fast symbolic links)
          44 sockets
------------
     1045537 files
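
As a small consistency check on the numbers in this e2fsck summary, the arithmetic can be redone by hand. All values are copied from the output above. Reading the first number in "Free inodes count wrong" as the value that was recorded on disk is an assumption about the message format.

```python
# Values copied from the e2fsck summary above.
total_inodes = 14524416    # "out of 14524416"
used_inodes = 1045518      # "1045518 inodes used"
recorded_free = 13486735   # first number in "Free inodes count wrong" (assumed: on-disk value)
counted_free = 13478898    # "counted=13478898"

# The counted value is exactly total minus used.
assert total_inodes - used_inodes == counted_free
print("free inode count was off by", recorded_free - counted_free)
print(f"inode usage: {used_inodes / total_inodes:.2%}")

# The per-type counts add up to the reported total number of files.
by_type = {
    "regular files": 838712,
    "directories": 113346,
    "character device files": 67,
    "block device files": 29,
    "fifos": 3,
    "links": 28,
    "symbolic links": 93308,
    "sockets": 44,
}
assert sum(by_type.values()) == 1045537
```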

This is the second day I'm running with the Ubuntu 16.04 GA Linux kernel version 4.4.0-78 (instead of the HWE Linux kernel), and it still seems to be okay, but the problem can take a few days to appear.

Kai-Heng Feng (kaihengfeng) wrote :

Have you seen this issue since your last comment?

John Neffenger (jgneff) wrote :

No, the errors stopped when I reverted to the Ubuntu 16.04 GA kernel and X server on May 31.

I have been running error-free for over two weeks with the "linux-generic" package version 4.4.0.79.85. It appears there may be something in Linux kernel version 4.8.0-53.56 that caused the problems, but I'm reluctant to risk more data loss to double check.

Note that I was running with the HWE kernel version 4.8 for a couple of months, but I didn't see the errors until I upgraded to 4.8.0-53, and then I saw the file system errors all the time.

Kai-Heng Feng (kaihengfeng) wrote :

Hmm, eventually I still want to backport NVMe APST support to the Xenial (4.4) kernel - the reduction in energy consumption is substantial.

My own LiteOn NVMe drive, the same model, has worked flawlessly for months with APST enabled, but with a different firmware version:

NVME Identify Controller:
vid : 0x14a4
ssvid : 0x1b4b
sn : TW09F8D15508563B006Z
mn : CX2-8B256-Q11 NVMe LITEON 256GB
fr : 488110B

John Neffenger (jgneff) wrote :

My system is under warranty, so I'll find out in the next week what Dell says about the problem. I don't see any way for me to upgrade the firmware on the device:

http://www.liteonssd.com/en/pcie-ssd/item/client-pcie-ssd/CLIENT-CX2-SERIES.html

I'm on firmware revision 48811QD, while you're on revision 488110B, but otherwise it's the same drive model (CX2-8B256-Q11 NVMe LITEON 256GB).

I had errors on the file system even after I disabled the APST support by adding "nvme_core.default_ps_max_latency_us=0" to the kernel parameters. Is there any chance that running for a month with the APST support could have made the errors continue even after disabling it?

I will post an update with Dell's response before the end of the month.
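
Since part of this thread turned on which value of the parameter was actually in effect, here is a minimal sketch for checking a kernel command line. `ps_max_latency_from_cmdline` is a hypothetical helper, and the sample command line is made up for illustration rather than taken from the reporter's machine.

```python
PARAM = "nvme_core.default_ps_max_latency_us"


def ps_max_latency_from_cmdline(cmdline):
    """Return the parameter value as an int, or None if it is not set.

    If the parameter appears more than once, the last occurrence wins.
    """
    value = None
    for token in cmdline.split():
        key, sep, raw = token.partition("=")
        if key == PARAM and sep:
            value = int(raw)
    return value


# Hypothetical command line, for illustration only.
sample = (
    "BOOT_IMAGE=/vmlinuz root=/dev/mapper/ubuntu--vg-root ro quiet splash "
    "nvme_core.default_ps_max_latency_us=0"
)
print(ps_max_latency_from_cmdline(sample))
print(ps_max_latency_from_cmdline("ro quiet splash"))
```

On a running system the string to pass in would be the contents of /proc/cmdline.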

Kai-Heng Feng (kaihengfeng) wrote :

Can you attach the output of `sudo nvme get-feature -f 0x0c -H /dev/nvme0`?

Kai-Heng Feng (kaihengfeng) wrote :

Also, can you attach the filesystem errors? Filesystems (EXT4, XFS, etc.) are pretty robust nowadays.

John Neffenger (jgneff) wrote :

I opened a Service Request yesterday with Dell regarding my Lite-On CX2 NVMe Series drive. The workstation group escalated the support ticket to their "High Complexity or Ubuntu" group and told me they would get back to me by tomorrow (in 24-48 hours).

The last time this happened, there were 329 different errors on the root file system, all seemingly random. I included a couple of them in Comment #4 above. I think I've deleted the file that captured all of the errors. Even after recovering from the file system errors, it does seem that I lost some files, such as the icons in Thunderbird context menus and other random things.

$ sudo nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0x0c (Autonomous Power State Transition), Current value: 00000000
 Autonomous Power State Transition Enable (APSTE): Disabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 5000 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 5000 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 2]
 .................
 Idle Time Prior to Transition (ITPT): 5000 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 10000 ms
 Idle Transition Power State (ITPS): 4
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 5]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 6]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 7]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 8]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 9]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[10]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[11]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[12]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[13]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[14]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[15]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[16]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 En...
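
The listing above is long and mostly zeros. A minimal sketch for condensing it follows, assuming the human-readable layout shown in this comment. `parse_apst` is a hypothetical helper, and `SAMPLE` is an abbreviated copy of the output above (entries 0, 3 and 4 only).

```python
import re

# Abbreviated copy of the get-feature output above.
SAMPLE = """\
get-feature:0x0c (Autonomous Power State Transition), Current value: 00000000
 Autonomous Power State Transition Enable (APSTE): Disabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 5000 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 10000 ms
 Idle Transition Power State (ITPS): 4
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
"""

# One match per entry: index, idle time in ms, target power state.
ENTRY_RE = re.compile(
    r"Entry\[\s*(\d+)\].*?\(ITPT\):\s*(\d+)\s*ms.*?\(ITPS\):\s*(\d+)",
    re.DOTALL,
)


def parse_apst(text):
    """Return (enabled, {entry_index: (idle_ms, target_state)})."""
    state = re.search(r"\(APSTE\):\s*(\w+)", text).group(1)
    enabled = state != "Disabled"
    table = {
        int(idx): (int(itpt_ms), int(itps))
        for idx, itpt_ms, itps in ENTRY_RE.findall(text)
        if int(itpt_ms) or int(itps)  # skip all-zero (unused) entries
    }
    return enabled, table


enabled, table = parse_apst(SAMPLE)
print("APST enabled:", enabled)
for idx, (itpt_ms, itps) in sorted(table.items()):
    print(f"entry {idx}: after {itpt_ms} ms idle -> PS{itps}")
```

For the output in this comment, the point is that the transition table is still populated while APSTE itself reports Disabled.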


Kai-Heng Feng (kaihengfeng) wrote :

I am no expert in this area, but you probably also need to check the integrity of the dm-crypt and LVM layers.

John Neffenger (jgneff) wrote :

Dell asked me to apply the following firmware upgrade, which fixed some problems on Windows:

LITEONIT Solid State Drive Firmware
http://www.dell.com/support/home/ca/en/cabsdt1/Drivers/DriversDetails?driverId=CWX68

The update program, though, works only on Windows. I'm waiting to find out from Dell whether there's a way to apply the update without installing Windows over my current Ubuntu installation. For example, I thought I might use a WinPE bootable flash drive to apply the update:

WinPE: Create USB Bootable drive
https://docs.microsoft.com/en-us/windows-hardware/manufacture/desktop/winpe-create-usb-bootable-drive

John Neffenger (jgneff) wrote :

After wiping out my Ubuntu installation with Windows 10 Pro and running the firmware upgrade, I discovered that it was an upgrade to the version I already have (48811QD). I also tried a separate DOS firmware upgrade program that Dell sent to me by e-mail, with the same result:

Model: CX2-8B256-Q11 NVMe LITEON 256GB
Current FW: 48811QD
No Firmware Upgrade Needed(N01)

I reinstalled Ubuntu but this time with the Hardware Enablement (HWE) kernel version 4.8.0.58.29 installed by Ubuntu 16.04.2, just to see whether the error happens again. My home directory is safe on a secondary drive this time.

At this point, if the Lite-On drive fails as it did before, I guess I'll just buy the Samsung 960 EVO SSD (MZ-V6E250BW) as a replacement and see whether Dell will let me return the Lite-On Technology CX2 NVMe Series SSD (CX2-8B256) for a refund.

John Neffenger (jgneff) wrote :

I swapped my Lite-On drive for a replacement Samsung drive that Dell sent to me. As a result, I'm unable to help any further on this bug report. I returned the Lite-On drive to Dell for further testing.

Changed in linux (Ubuntu):
status: Confirmed → Won't Fix