Samsung SSD 960 EVO 500GB refused to change power state

Bug #1705748 reported by Steve Roberts on 2017-07-21
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Kai-Heng Feng
Artful
Medium
Kai-Heng Feng

Bug Description

=== SRU Justification ===
[Impact]
A user reported his NVMe went out to lunch after S3.

[Fix]
Disable APST for this particular NVMe + Motherboard setup.

[Test]
User confirmed this quirk works for his system.

[Regression Potential]
Very low. The quirk applies only to this specific device setup, and the user can
override it with "nvme_core.force_apst=1".

=== Original Bug Report ===
Originally thought my issue was same as this:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184
but requested to report as separate bug

System becomes unusable at seemingly random times, but especially after resume from suspend, because the disk 'disappears' and becomes inaccessible, with hundreds of I/O errors logged.

After viewing the above bug report yesterday, as a quick temporary fix I added a kernel parameter (and updated grub, etc.) with:
GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0"

System appears to have been stable for the last day, but is presumably using more power than it should.
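
For readers following along, here is a minimal Python sketch for checking which `nvme_core` options are present on a kernel command line. It assumes the options appear as `nvme_core.<name>=<value>` tokens, as in the `ProcKernelCmdLine` field of this report; the helper name `nvme_core_params` is hypothetical and only for illustration.

```python
def nvme_core_params(cmdline: str) -> dict:
    """Return the nvme_core.* options found on a kernel command line.

    Hypothetical helper: it only splits the string on whitespace and
    picks out tokens of the form nvme_core.<name>=<value>.
    """
    prefix = "nvme_core."
    params = {}
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            name, value = token[len(prefix):].split("=", 1)
            params[name] = value
    return params


# Command line taken from the ProcKernelCmdLine field of this report.
sample = (
    "BOOT_IMAGE=/vmlinuz-4.10.0-26-generic "
    "root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro "
    "nvme_core.default_ps_max_latency_us=0"
)
print(nvme_core_params(sample))
```

On a running system the same function could be fed the contents of `/proc/cmdline` to confirm that the workaround was actually picked up after the grub update.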

System, drive details below:

M2 nvme drive: Samsung SSD 960 EVO 500GB

Ubuntu 4.10.0-26.30~16.04.1-generic 4.10.17

M/B Asus Prime B350m-A
Ryzen 1600 cpu

Jul 20 16:32:59 phs08 kernel: [ 190.893571] INVALID_DEVICE_REQUEST device=00:00.0 address=0xfffffffdf8000000 flags=0x0a00]
Jul 20 16:33:05 phs08 kernel: [ 197.010928] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jul 20 16:33:05 phs08 kernel: [ 197.046980] pci_raw_set_power_state: 4 callbacks suppressed
Jul 20 16:33:05 phs08 kernel: [ 197.046985] nvme 0000:01:00.0: Refused to change power state, currently in D3
Jul 20 16:33:05 phs08 kernel: [ 197.047163] nvme nvme0: Removing after probe failure status: -19
Jul 20 16:33:05 phs08 kernel: [ 197.047182] nvme0n1: detected capacity change from 500107862016 to 0
Jul 20 16:33:05 phs08 kernel: [ 197.047793] blk_update_request: I/O error, dev nvme0n1, sector 0

nvme list

/dev/nvme0n1 S3EUNX0J305518L Samsung SSD 960 EVO 500GB 1.2 1 125.20 GB / 500.11 GB 512 B + 0 B 2B7QCXE7

sudo nvme id-ctrl /dev/nvme0

NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : S3EUNX0J305518L
mn : Samsung SSD 960 EVO 500GB
fr : 2B7QCXE7
rab : 2
ieee : 002538
cmic : 0
mdts : 9
cntlid : 2
ver : 10200
rtd3r : 7a120
rtd3e : 4c4b40
oaes : 0
oacs : 0x7
acl : 7
aerl : 3
frmw : 0x16
lpa : 0x3
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 350
cctemp : 352
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 500107862016
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0x5
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
ps 0 : mp:6.04W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:5.09W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:4.08W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
---
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: drgrumpy 2192 F.... pulseaudio
 /dev/snd/pcmC1D0p: drgrumpy 2192 F...m pulseaudio
 /dev/snd/controlC1: drgrumpy 2192 F.... pulseaudio
DistroRelease: Linux 18.2
EcryptfsInUse: Yes
HibernationDevice: RESUME=UUID=a6896cdd-6bac-4e7f-9e13-55460859c3ec
InstallationDate: Installed on 2017-07-06 (15 days ago)
InstallationMedia: Linux Mint 18.2 "Sonya" - Release amd64 20170628
IwConfig:
 lo no wireless extensions.

 enp37s0 no wireless extensions.
MachineType: System manufacturer System Product Name
Package: linux (not installed)
ProcEnviron:
 LANGUAGE=en_GB:en
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcFB: 0 nouveaufb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.10.0-26-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro nvme_core.default_ps_max_latency_us=0
ProcVersionSignature: Ubuntu 4.10.0-26.30~16.04.1-generic 4.10.17
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: Home directory not accessible: Permission denied
 No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.10.0-26-generic N/A
 linux-backports-modules-4.10.0-26-generic N/A
 linux-firmware 1.157.11
RfKill:

Tags: sonya
Uname: Linux 4.10.0-26-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 06/20/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0805
dmi.board.asset.tag: Default string
dmi.board.name: PRIME B350M-A
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev X.0x
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0805:bd06/20/2017:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnPRIMEB350M-A:rvrRevX.0x:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Steve Roberts (drgrumpy) wrote :
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.13 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc1/

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1705748

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected sonya
description: updated

Steve Roberts (drgrumpy) wrote :

This is a brand new build, with Mint 18.2 installed initially with kernel 4.8.53, but upgraded straight away to the latest kernel due to using a Ryzen cpu.

Will try to test with the 4.13 in the next 24 h

Steve Roberts (drgrumpy) wrote :

Confirm that removing:
nvme_core.default_ps_max_latency_us=0
from boot params results in reproducible problem on resume from suspend.

Downloaded and installed 4.13 headers and generic kernel:
Linux version 4.13.0-041300rc1-generic

Unable to test issue with above kernel. System boots ok, but now fails to suspend, so cannot check result on resume. Starts to suspend, switches off monitors, then hangs and becomes totally unresponsive, but still powered up, fan running, etc, will not resume, requiring power down or reset button to be pressed, no clues in syslog:

Jul 22 00:40:05 phs08 systemd[1]: Reached target Sleep.
Jul 22 00:40:05 phs08 systemd[1]: Starting Suspend...
Jul 22 00:40:05 phs08 systemd-sleep[4059]: Failed to connect to non-global ctrl_ifname: (nil) error: No such file or directory
Jul 22 00:40:05 phs08 systemd-sleep[4060]: /lib/systemd/system-sleep/wpasupplicant failed with error code 255.
Jul 22 00:40:05 phs08 systemd-sleep[4059]: Suspending system...

Screens off, keyboard, mouse unresponsive, will not resume from suspend

Reset or Power off/on -> re-boots ok

System is also much noisier with fan seemingly running at higher speed, but cannot check as it87 will not load.

Also notice that asus_wmi is loaded, but this is not a laptop.

Steve Roberts (drgrumpy) on 2017-07-22
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Please remove "nvme_core.default_ps_max_latency_us=0" then try this linux kernel:
http://people.canonical.com/~khfeng/lp1705748/

Steve Roberts (drgrumpy) wrote :

Many thanks, suggested kernel now under test...

Steve Roberts (drgrumpy) wrote :

Unfortunately the problem is still there with this 4.11 kernel...
After initial install of the kernel and a suspend/resume it all seemed ok, but then after leaving suspended overnight, the disk goes missing again, see extracts below.

Jul 25 01:30:54 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-12-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro
...
Jul 25 01:42:58 phs08 systemd-sleep[5315]: Suspending system...
Jul 25 01:43:33 phs08 kernel: [ 735.437613] PM: Syncing filesystems ... done.
...
Jul 25 01:52:37 phs08 systemd-sleep[5773]: Suspending system...
Jul 25 09:08:20 phs08 kernel: [ 1294.739373] PM: Syncing filesystems ... done.
...
Jul 25 09:10:24 phs08 systemd[1]: Removed slice User Slice of lightdm.
Jul 25 09:11:58 phs08 kernel: [ 1526.953121] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jul 25 09:12:03 phs08 kernel: [ 1531.977506] pci_raw_set_power_state: 4 callbacks suppressed
Jul 25 09:12:03 phs08 kernel: [ 1531.977511] nvme 0000:01:00.0: Refused to change power state, currently in D3
Jul 25 09:12:03 phs08 kernel: [ 1531.977676] nvme nvme0: Removing after probe failure status: -19
Jul 25 09:12:03 phs08 kernel: [ 1531.997538] nvme0n1: detected capacity change from 500107862016 to 0
Jul 25 09:12:03 phs08 kernel: [ 1531.997846] blk_update_request: I/O error, dev nvme0n1, sector 134367992
Jul 25 09:12:03 phs08 kernel: [ 1531.997857] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
... many i/o errors follow, reset needed to reboot

Steve Roberts (drgrumpy) wrote :

One curious thing I note, probably irrelevant, is that after resume, there is lots of stuff recorded in the logs that seems to be about suspending...... e.g. Freezing remaining tasks after the computer has resumed...

Jul 25 01:42:58 phs08 systemd[1]: Reached target Sleep.
Jul 25 01:42:58 phs08 systemd[1]: Starting Suspend...
Jul 25 01:42:58 phs08 systemd-sleep[5315]: Failed to connect to non-global ctrl_ifname: (nil) error: No such file or directory
Jul 25 01:42:58 phs08 systemd-sleep[5316]: /lib/systemd/system-sleep/wpasupplicant failed with error code 255.
Jul 25 01:42:58 phs08 systemd-sleep[5315]: Suspending system...

System is now sleeping, system wakes....

Jul 25 01:43:33 phs08 kernel: [ 735.437613] PM: Syncing filesystems ... done.
Jul 25 01:43:33 phs08 kernel: [ 735.469300] PM: Preparing system for sleep (mem)
Jul 25 01:43:33 phs08 kernel: [ 735.469480] Freezing user space processes ... (elapsed 0.002 seconds) done.
Jul 25 01:43:33 phs08 kernel: [ 735.471929] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Jul 25 01:43:33 phs08 kernel: [ 735.473849] PM: Suspending system (mem)
Jul 25 01:43:33 phs08 kernel: [ 735.473869] Suspending console(s) (use no_console_suspend to debug)
Jul 25 01:43:33 phs08 kernel: [ 735.474317] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
Jul 25 01:43:33 phs08 kernel: [ 735.474394] sd 1:0:0:0: [sdb] Stopping disk
Jul 25 01:43:33 phs08 kernel: [ 735.474620] sd 0:0:0:0: [sda] Synchronizing SCSI cache
Jul 25 01:43:33 phs08 kernel: [ 735.482777] sd 0:0:0:0: [sda] Stopping disk
Jul 25 01:43:33 phs08 kernel: [ 735.483625] i8042 kbd 00:04: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 735.483765] nouveau 0000:27:00.0: DRM: suspending console...
Jul 25 01:43:33 phs08 kernel: [ 735.483771] nouveau 0000:27:00.0: DRM: suspending display...
Jul 25 01:43:33 phs08 kernel: [ 735.505047] nouveau 0000:27:00.0: DRM: evicting buffers...
Jul 25 01:43:33 phs08 kernel: [ 735.580082] nouveau 0000:27:00.0: DRM: waiting for kernel channels to go idle...
Jul 25 01:43:33 phs08 kernel: [ 735.580105] nouveau 0000:27:00.0: DRM: suspending fence...
Jul 25 01:43:33 phs08 kernel: [ 735.581462] nouveau 0000:27:00.0: DRM: suspending object tree...
Jul 25 01:43:33 phs08 kernel: [ 737.956791] PM: suspend of devices complete after 2482.515 msecs
Jul 25 01:43:33 phs08 kernel: [ 737.957846] PM: late suspend of devices complete after 1.050 msecs
Jul 25 01:43:33 phs08 kernel: [ 737.958651] xhci_hcd 0000:28:00.3: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 737.959028] xhci_hcd 0000:22:00.0: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 737.959043] r8169 0000:25:00.0: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 737.959044] xhci_hcd 0000:03:00.0: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 743.441059] PM: noirq suspend of devices complete after 5482.825 msecs
Jul 25 01:43:33 phs08 kernel: [ 743.441087] ACPI: Preparing to enter system sleep state S3
Jul 25 01:43:33 phs08 kernel: [ 743.761477] PM: Saving platform NVS memory
Jul 25 01:43:33 phs08 kernel: [ 743.761495] Disabling non-boot CPUs ...

Kai-Heng Feng (kaihengfeng) wrote :

Try the kernel here, it adds a delay before controller reset.

http://people.canonical.com/~khfeng/lp1705748-2/

---

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f15f4017fb95..1dc42006a8fd 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2308,6 +2308,8 @@ static const struct pci_device_id nvme_id_table[] = {
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
        { PCI_DEVICE(0x1c5f, 0x0540), /* Memblaze Pblaze4 adapter */
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
+       { PCI_DEVICE(0x144d, 0xa804), /* Samsung SM961/PM961 */
+               .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
        { PCI_DEVICE(0x144d, 0xa821), /* Samsung PM1725 */
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
        { PCI_DEVICE(0x144d, 0xa822), /* Samsung PM1725a */

Steve Roberts (drgrumpy) wrote :

Ok, thanks, will try and report back...

Steve Roberts (drgrumpy) wrote :

Unfortunately there seems no improvement, system boots fine, and operates fine, but nvme disappears again after suspend. However I didn't see any spontaneous disappearance of device (so could be some progress ? but will need to test for longer to be certain)

log extracts:

Jul 28 09:36:18 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.12.0-8-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

...booted and runs fine, then suspend...
Jul 28 11:28:57 phs08 systemd-sleep[7216]: Suspending system...
... and resume
Jul 28 12:24:36 phs08 kernel: [ 6773.441351] PM: resume of devices complete after 1369.759 msecs
Jul 28 12:24:36 phs08 kernel: [ 6773.441664] PM: Finishing wakeup.
Jul 28 12:24:36 phs08 systemd-sleep[7216]: System resumed.

Jul 28 12:27:40 phs08 kernel: [ 6957.315181] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Jul 28 12:27:40 phs08 kernel: [ 6957.339506] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Jul 28 12:27:40 phs08 kernel: [ 6957.339612] nvme nvme0: Removing after probe failure status: -19
Jul 28 12:27:40 phs08 kernel: [ 6957.375614] nvme0n1: detected capacity change from 500107862016 to 0

But one thing I note, is that the kernel records re-starting conventional spinning disks:

Jul 28 12:24:36 phs08 kernel: [ 6772.072165] sd 0:0:0:0: [sda] Starting disk
Jul 28 12:24:36 phs08 kernel: [ 6772.072166] sd 1:0:0:0: [sdb] Starting disk

but there seems to be no equivalent for the nvme drive (unless "nvme 0000:01:00.0: enabling device (0000 -> 0002)" is the equivalent).

Steve Roberts (drgrumpy) wrote :

I have also just noticed that the nvme drive seems to disappear 2 or 3 minutes after the system has re-awakened, so whilst it consistently seems to happen *after* resume from S3, it is not *on* resume.
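
The delay can be measured from the kernel timestamps in the log extracts above. The following is a rough Python sketch, assuming syslog lines in the format shown in this report (`kernel: [ seconds.micros] message`); the function names are made up for illustration, and the marker strings are simply the ones that appear in these logs.

```python
import re

# Matches the kernel timestamp in lines such as
# "Jul 28 12:24:36 phs08 kernel: [ 6773.441664] PM: Finishing wakeup."
KERNEL_TS = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def kernel_seconds(line):
    """Return the kernel timestamp of a syslog line, or None if absent."""
    match = KERNEL_TS.search(line)
    return float(match.group(1)) if match else None


def delay_after_resume(lines):
    """Seconds between the last wakeup and the first NVMe failure after it.

    Returns None if either event is missing from the given lines.
    """
    resumed = None
    for line in lines:
        if "PM: Finishing wakeup" in line:
            resumed = kernel_seconds(line)
        elif "controller is down" in line and resumed is not None:
            failed = kernel_seconds(line)
            if failed is not None:
                return failed - resumed
    return None


# Two lines copied from the Jul 28 extract above.
log = [
    "Jul 28 12:24:36 phs08 kernel: [ 6773.441664] PM: Finishing wakeup.",
    "Jul 28 12:27:40 phs08 kernel: [ 6957.315181] nvme nvme0: controller is "
    "down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10",
]
print(f"{delay_after_resume(log):.1f} s after resume")
```

For the Jul 28 extract this gives roughly 184 seconds, which matches the "2 or 3 minutes" impression from the wall-clock times.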

Kai-Heng Feng (kaihengfeng) wrote :

Probably need to change the device to full functional state before S3, like how basic power management works.

Changed in linux (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → Kai-Heng Feng (kaihengfeng)
Kai-Heng Feng (kaihengfeng) wrote :

Roberts, does the new kernel I built still have the issue?

Steve Roberts (drgrumpy) wrote :

Sorry, I have downloaded but haven't got to testing it yet.... have had to prioritise the day job... will aim to try this evening and report back tomorrow...

Steve Roberts (drgrumpy) wrote :

So last night I installed the kernel and it seems the issue persists:

Aug 3 02:30:59 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.12.0-9-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

suspend:

Aug 3 02:52:35 phs08 systemd[1]: Reached target Sleep.
Aug 3 02:52:35 phs08 systemd[1]: Starting Suspend...
Aug 3 02:52:35 phs08 systemd-sleep[5407]: Failed to connect to non-global ctrl_ifname: (nil) error: No such file or directory
Aug 3 02:52:35 phs08 systemd-sleep[5408]: /lib/systemd/system-sleep/wpasupplicant failed with error code 255.
Aug 3 02:52:35 phs08 systemd-sleep[5407]: Suspending system...

resume next morning:

Aug 3 09:29:21 phs08 kernel: [ 1300.773097] PM: Syncing filesystems ... done.
Aug 3 09:29:21 phs08 kern

Aug 3 09:29:21 phs08 kernel: [ 1305.552367] PM: resume of devices complete after 1226.588 msecs
Aug 3 09:29:21 phs08 kernel: [ 1305.552777] PM: Finishing wakeup.
Aug 3 09:29:21 phs08 kernel: [ 1305.552778] OOM killer enabled.
Aug 3 09:29:21 phs08 kernel: [ 1305.552778] Restarting tasks ... done.
Aug 3 09:29:24 phs08 kernel: [ 1307.942704] r8169 0000:25:00.0 enp37s0: link up
Aug 3 09:29:27 phs08 kernel: [ 1311.521223] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 3 09:29:27 phs08 kernel: [ 1311.533275] ata1.00: configured for UDMA/133
Aug 3 09:29:27 phs08 kernel: [ 1311.581201] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 3 09:29:28 phs08 kernel: [ 1311.593308] ata2.00: configured for UDMA/133

Aug 3 09:32:45 phs08 kernel: [ 1508.852213] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Aug 3 09:32:45 phs08 kernel: [ 1508.896322] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Aug 3 09:32:45 phs08 kernel: [ 1508.896429] nvme nvme0: Removing after probe failure status: -19
Aug 3 09:32:45 phs08 kernel: [ 1508.920265] nvme0n1: detected capacity change from 500107862016 to 0
Aug 3 09:32:45 phs08 kernel: [ 1508.920536] blk_update_request: I/O error, dev nvme0n1, sector 360686080

"Probably need to change the device to full functional state before S3, like how basic power management works"
I can't see anything in the logs that relates to this or suggests it is happening, what should I look for ?

I notice this in case of relevance
Aug 3 09:29:21 phs08 kernel: [ 1303.986393] CPU: 0 PID: 5407 Comm: systemd-sleep Tainted: G OE 4.12.0-9-generic #10

Kai-Heng Feng (kaihengfeng) wrote :

Can you attach the full dmesg? I'd like to check some additional information.

Steve Roberts (drgrumpy) wrote :

I don't have a dmesg log file (due to systemd ?) but I attach syslog from start of boot with the kernel to when the reset button pressed...

Thanks for your attention to this Kai-Heng, really appreciated, Steve.

Kai-Heng Feng (kaihengfeng) wrote :

Please try this one:
http://people.canonical.com/~khfeng/lp1705748-3/

This kernel disables APST before controller shutdown. But it looks like that's not enough (comment #31), so this kernel also explicitly sets power state 0 before controller shutdown.

Steve Roberts (drgrumpy) wrote :

Ok, will test and report back

Steve Roberts (drgrumpy) wrote :

Unfortunately there seems another issue with this kernel:
The screen/video is blank on resume, I think this is the offending part:

Aug 8 14:08:50 phs08 kernel: [ 165.478069] nouveau 0000:27:00.0: DRM: resuming fence...
Aug 8 14:08:50 phs08 kernel: [ 165.478082] nouveau 0000:27:00.0: DRM: resuming display...
Aug 8 14:08:50 phs08 kernel: [ 165.501235] nouveau 0000:27:00.0: DRM: resuming console...
Aug 8 14:08:50 phs08 kernel: [ 165.524428] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Aug 8 14:08:50 phs08 kernel: [ 165.541138] ata6.00: configured for UDMA/133
Aug 8 14:08:50 phs08 kernel: [ 165.636931] usb 1-3: reset high-speed USB device number 2 using xhci_hcd
Aug 8 14:08:50 phs08 kernel: [ 166.032975] usb 1-8: reset high-speed USB device number 3 using xhci_hcd
Aug 8 14:08:50 phs08 kernel: [ 168.723550] r8169 0000:25:00.0 enp37s0: link up
Aug 8 14:08:50 phs08 kernel: [ 169.227911] PM: resume of devices complete after 4181.953 msecs
Aug 8 14:08:50 phs08 kernel: [ 169.228289] PM: Finishing wakeup.
Aug 8 14:08:50 phs08 kernel: [ 169.228290] OOM killer enabled.
Aug 8 14:08:50 phs08 kernel: [ 169.228290] Restarting tasks ... done.
Aug 8 14:08:50 phs08 kernel: [ 169.236474] usb 1-3.1: USB disconnect, device number 4
Aug 8 14:08:50 phs08 kernel: [ 169.655196] nouveau 0000:27:00.0: DVI-D-1: EDID is invalid:
Aug 8 14:08:50 phs08 kernel: [ 169.655198] [00] BAD 00 ff ff ff ff ff ff 00 4c 2d 40 02 39 31 42 4d
Aug 8 14:08:50 phs08 kernel: [ 169.655198] [00] BAD 09 12 01 03 00 26 1e 78 2a ee 95 a3 54 4c 99 26
Aug 8 14:08:50 phs08 kernel: [ 169.655199] [00] BAD 0f 50 54 bf ef 80 81 80 81 40 71 4f 01 01 01 01
Aug 8 14:08:50 phs08 kernel: [ 169.655199] [00] BAD 01 01 01 01 01 01 30 2a 00 98 51 00 2a 40 30 70
Aug 8 14:08:50 phs08 kernel: [ 169.655199] [00] BAD 13 00 78 2d 11 00 00 1e 00 00 00 fd 00 38 4b 1e
Aug 8 14:08:50 phs08 kernel: [ 169.655200] [00] BAD 51 0e 00 0a 20 20 20 20 20 20 00 00 00 fc 00 53
Aug 8 14:08:50 phs08 kernel: [ 169.655200] [00] BAD 79 6e 63 4d 61 73 74 65 72 0a 20 20 00 00 00 ff
Aug 8 14:08:50 phs08 kernel: [ 169.655200] [00] BAD 00 48 56 4e 51 32 30 30 37 37 35 0a 20 20 00 40
Aug 8 14:08:50 phs08 kernel: [ 169.655202] nouveau 0000:27:00.0: DRM: DDC responded, but no EDID for DVI-D-1
Aug 8 14:08:51 phs08 kernel: [ 169.798065] nouveau 0000:27:00.0: HDMI-A-1: EDID is invalid:
Aug 8 14:08:51 phs08 kernel: [ 169.798066] [00] BAD 00 ff ff ff ff ff ff 00 4c 2d 40 02 39 31 42 4d
Aug 8 14:08:51 phs08 kernel: [ 169.798067] [00] BAD 09 12 01 03 00 26 1e 78 2a ee 95 a3 54 4c 99 26
Aug 8 14:08:51 phs08 kernel: [ 169.798067] [00] BAD 0f 50 54 bf ef 80 81 80 81 40 71 4f 01 01 01 01
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 01 01 01 01 01 01 30 2a 00 98 51 00 2a 40 30 70
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 13 00 78 2d 11 00 00 1e 00 00 00 fd 00 38 4b 1e
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 51 0e 00 0a 20 20 20 20 20 20 00 00 00 fc 00 53
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 79 6e 63 4d 61 73 74 65 72 0a 20 20 00 00 00 ff
Aug 8 14:08:51 phs08 kernel: [ 169.798069] [00] BAD 00 48 56 4e 51 32 30 ...

Kai-Heng Feng (kaihengfeng) wrote :

Yea, I think it's normal, because the kernel version is the same. The old one with the same version will be removed before installing the new one.

I have no idea about the nouveau part though. I'll build a new one based on 4.11 to focus on the NVMe bug.

Steve Roberts (drgrumpy) wrote :

Sorry for slow feedback....

Unfortunately the issue is still there:

Aug 15 12:52:58 phs08 kernel: [ 0.000000] Linux version 4.11.0-14-generic (root@ryzen) (gcc version 6.3.0 20170618 (Ubuntu 6.3.0-19ubuntu1) ) #20 SMP Wed Aug 9 20:56:51 CST 2017 (Ubuntu 4.11.0-14.20-generic 4.11.12)
Aug 15 12:52:58 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-14-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Tue Aug 15 13:31:03 BST 2017: performing suspend
Tue Aug 15 14:11:07 BST 2017: Awake.

Aug 15 14:11:07 phs08 kernel: [ 2297.422103] nvme nvme0: disabling APST...
Aug 15 14:11:07 phs08 kernel: [ 2297.423143] nvme nvme0: setting power state to 0...
...
Aug 15 14:11:07 phs08 kernel: [ 2302.067538] PM: resume of devices complete after 1158.582 msecs
Aug 15 14:11:07 phs08 kernel: [ 2302.067889] PM: Finishing wakeup.
Aug 15 14:11:07 phs08 kernel: [ 2302.067890] Restarting tasks ... done.
Aug 15 14:11:09 phs08 kernel: [ 2304.530720] r8169 0000:25:00.0 enp37s0: link up
Aug 15 14:11:13 phs08 kernel: [ 2307.860622] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 15 14:11:13 phs08 kernel: [ 2307.864647] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 15 14:11:13 phs08 kernel: [ 2307.872508] ata1.00: configured for UDMA/133
Aug 15 14:11:13 phs08 kernel: [ 2307.876591] ata2.00: configured for UDMA/133
Aug 15 14:14:02 phs08 kernel: [ 2477.191603] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Aug 15 14:14:02 phs08 kernel: [ 2477.227261] pci_raw_set_power_state: 4 callbacks suppressed
Aug 15 14:14:02 phs08 kernel: [ 2477.227266] nvme 0000:01:00.0: Refused to change power state, currently in D3
Aug 15 14:14:02 phs08 kernel: [ 2477.227394] nvme nvme0: Removing after probe failure status: -19
Aug 15 14:14:02 phs08 kernel: [ 2477.255682] nvme0n1: detected capacity change from 500107862016 to 0

It still seems odd to me that this apparently occurs 2-3 mins after the system has woken up...

One thing that confuses me is that the suspend operations like:

"nvme nvme0: disabling APST..."

are recorded as happening after the system has woken up! Is this a matter of them not being logged until the system has woken up, or of them actually not executing until the system has woken up? Either way it is confusing to see suspend operations occurring after the system has resumed.

Kai-Heng Feng (kaihengfeng) wrote :

I guess the NVMe probably needs to sleep for the exlat time (6000 here) to let it transition back to power state 0.

Please try http://people.canonical.com/~khfeng/lp1705748-sleep/

Kai-Heng Feng (kaihengfeng) wrote :

If this kernel still doesn't work, try the latest mainline kernel with "nvme_core.default_ps_max_latency_us=1500", to disable the deepest power state.
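
The effect of the 1500 value can be illustrated against the power state table from the `nvme id-ctrl` output in the bug description. This is only a sketch under a stated assumption: that the kernel allows APST to use a non-operational state when that state's exit latency (`exlat`) does not exceed `default_ps_max_latency_us`. Some kernel versions may instead compare entry plus exit latency, in which case ps 3 (210 + 1500 us) would also be excluded. The `PowerState` class and `apst_candidates` function are hypothetical names, not kernel APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    index: int
    operational: bool
    enlat_us: int  # entry latency, microseconds
    exlat_us: int  # exit latency, microseconds


# Values copied from the "ps 0" .. "ps 4" lines of the id-ctrl output.
STATES = [
    PowerState(0, True, 0, 0),
    PowerState(1, True, 0, 0),
    PowerState(2, True, 0, 0),
    PowerState(3, False, 210, 1500),
    PowerState(4, False, 2200, 6000),
]


def apst_candidates(states, max_latency_us):
    """Indices of non-operational states that fit within the latency limit.

    Assumption: only the exit latency is compared against the limit.
    """
    return [
        s.index
        for s in states
        if not s.operational and s.exlat_us <= max_latency_us
    ]


for limit in (0, 1500, 100000):
    print(limit, apst_candidates(STATES, limit))
```

Under that assumption a limit of 0 leaves no idle state available (APST effectively off), 1500 keeps ps 3 but drops ps 4, and a large limit such as 100000 allows both.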

Steve Roberts (drgrumpy) wrote :

Unfortunately still no joy

Aug 21 18:02:54 phs08 kernel: [ 0.000000] Linux version 4.13.0-6-generic (root@linux) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)) #7~lp1705748 SMP Fri Aug 18 17:09:21 CST 2017 (Ubuntu 4.13.0-6.7~lp1705748-generic 4.13.0-rc5)
Aug 21 18:02:54 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.13.0-6-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro
...
Mon Aug 21 19:55:03 BST 2017: performing suspend
Tue Aug 22 02:17:23 BST 2017: Awake.

Aug 22 02:21:36 phs08 kernel: [ 6999.857721] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Aug 22 02:21:36 phs08 kernel: [ 6999.901881] print_req_error: I/O error, dev nvme0n1, sector 137284824
Aug 22 02:21:36 phs08 kernel: [ 6999.901892] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Aug 22 02:21:36 phs08 kernel: [ 6999.901933] print_req_error: I/O error, dev nvme0n1, sector 137286408
Aug 22 02:21:36 phs08 kernel: [ 6999.901937] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Aug 22 02:21:36 phs08 kernel: [ 6999.901970] print_req_error: I/O error, dev nvme0n1, sector 137296616
Aug 22 02:21:36 phs08 kernel: [ 6999.901974] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Aug 22 02:21:36 phs08 kernel: [ 6999.902007] print_req_error: I/O error, dev nvme0n1, sector 137284312
Aug 22 02:21:36 phs08 kernel: [ 6999.902010] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Aug 22 02:21:36 phs08 kernel: [ 6999.945333] pci_raw_set_power_state: 5 callbacks suppressed
Aug 22 02:21:36 phs08 kernel: [ 6999.945338] nvme 0000:01:00.0: Refused to change power state, currently in D3
Aug 22 02:21:36 phs08 kernel: [ 6999.945503] nvme nvme0: Removing after probe failure status: -19

Steve Roberts (drgrumpy) wrote :

Appreciate all your efforts Kai-Heng

Where do I get the latest mainline kernel ?

Thanks, Steve

Kai-Heng Feng (kaihengfeng) wrote :

The one I built should be recent enough.

You can still get one from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc6/

Steve Roberts (drgrumpy) wrote :

Using http://people.canonical.com/~khfeng/lp1705748-sleep/

with "nvme_core.default_ps_max_latency_us=1500", to disable deepest power state

Been through several suspend/resume cycles so far and seems to be ok.

Kai-Heng Feng (kaihengfeng) wrote :

I built a kernel with the quirk. Please try it without the nvme_core kernel parameter.

http://people.canonical.com/~khfeng/pm961/

Steve Roberts (drgrumpy) wrote :

I haven't yet tried with that latest kernel... but this morning, after 5 days of successful suspend/resume:

Aug 30 09:35:19 phs08 kernel: [71904.271956] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Aug 30 09:35:19 phs08 kernel: [71904.320269] print_req_error: I/O error, dev nvme0n1, sector 136098096
Aug 30 09:35:19 phs08 kernel: [71904.320280] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Aug 30 09:35:19 phs08 kernel: [71904.339992] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Aug 30 09:35:19 phs08 kernel: [71904.340163] nvme nvme0: Removing after probe failure status: -19
Aug 30 09:35:19 phs08 kernel: [71904.371974] nvme0n1: detected capacity change from 500107862016 to 0

Kai-Heng Feng (kaihengfeng) wrote :

But no more "nvme 0000:01:00.0: Refused to change power state, currently in D3"?

Steve Roberts (drgrumpy) wrote :

Yes, correct I just searched the kernel log and no more
"...Refused to change..."

since I booted with:

Command line: BOOT_IMAGE=/vmlinuz-4.13.0-6-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro nvme_core.default_ps_max_latency_us=1500

Kai-Heng Feng (kaihengfeng) wrote :

Originally can't read PCI status:
Aug 15 14:14:02 phs08 kernel: [ 2477.191603] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff

Now only controller is dead:
Aug 30 09:35:19 phs08 kernel: [71904.271956] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10

So I guess that, in addition to NVME_QUIRK_NO_DEEPEST_PS, the NVME_QUIRK_DELAY_BEFORE_CHK_RDY quirk is also needed.
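
The distinction between the two log lines can be sketched as a small Python classifier. It assumes the `CSTS=..., PCI_STATUS=...` format shown in these logs and relies on the general rule that a PCI read returning all ones means the device did not respond. The category labels and the function name `classify` are illustrative only and do not come from the kernel.

```python
import re

FIELDS = re.compile(r"CSTS=(0x[0-9a-fA-F]+), PCI_STATUS=(0x[0-9a-fA-F]+)")


def classify(line):
    """Classify a 'controller is down' log line, or return None if no match."""
    match = FIELDS.search(line)
    if match is None:
        return None
    csts = int(match.group(1), 16)
    pci_status = int(match.group(2), 16)
    if pci_status == 0xFFFF:
        # Even PCI config space reads back all ones.
        return "config space unreadable"
    if csts == 0xFFFFFFFF:
        # Config space answers, but the NVMe status register does not.
        return "config space readable, controller registers unreadable"
    return "controller responding"


# The two lines quoted in the comment above.
before = ("Aug 15 14:14:02 phs08 kernel: [ 2477.191603] nvme 0000:01:00.0: "
          "controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff")
after = ("Aug 30 09:35:19 phs08 kernel: [71904.271956] nvme nvme0: "
         "controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10")
print(classify(before))
print(classify(after))
```

A `PCI_STATUS` of `0x10` has only the Capabilities List bit set, so in the second case the device is still visible on the bus even though the controller itself is not answering.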

Steve Roberts (drgrumpy) wrote :

Tried that latest version, same result:

Sep 1 16:56:16 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.13.0-8-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

(Note without nvme_core.default_ps_max_latency_us=1500. hope this was correct)

...
Sep 1 17:01:29 phs08 kernel: [ 318.046495] PM: Suspending system (mem)
...
Resume system:

...
Sep 1 18:07:42 phs08 kernel: [ 688.834876] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Sep 1 18:07:42 phs08 kernel: [ 688.875137] print_req_error: I/O error, dev nvme0n1, sector 134468320
Sep 1 18:07:42 phs08 kernel: [ 688.875148] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Sep 1 18:07:42 phs08 kernel: [ 688.902878] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Sep 1 18:07:42 phs08 kernel: [ 688.903048] nvme nvme0: Removing after probe failure status: -19
Sep 1 18:07:42 phs08 kernel: [ 688.930880] nvme0n1: detected capacity change from 500107862016 to 0

Steve Roberts (drgrumpy) wrote :

Also same result with nvme_core.default_ps_max_latency_us=1500

Kai-Heng Feng (kaihengfeng) wrote :

Just to let you know, I am still working on this issue.
I am currently digging through the spec; hopefully I can find some angle for new workarounds to try.

Completely disabling APST should be the last resort.

Kai-Heng Feng (kaihengfeng) wrote :

This kernel [1] resets the NVMe controller before shutdown.

[1] http://people.canonical.com/~khfeng/pm961-reset/

Steve Roberts (drgrumpy) wrote :

Tried that latest one... still fails after suspend but should I be using nvme_core.default_ps_max_latency_us=1500 ?

But as I have said before it does bother me that the suspend and resume operations seem to overlap and all are logged on resume.

Booted in the morning:

Sep 7 10:02:44 phs08 kernel: [ 0.000000] Linux version 4.12.0-14-generic (root@Linux) (gcc version 7.1.0 (Ubuntu 7.1.0-13ubuntu1) ) #15~pm961+reset SMP Wed Sep 6 13:43:27 CST 2017 (Ubuntu 4.12.0-14.15~pm961+reset-generic 4.12.10)
Sep 7 10:02:44 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.12.0-14-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

system running all day...

from pm-suspend.log:
Thu Sep 7 19:34:17 BST 2017: performing suspend
Thu Sep 7 20:26:41 BST 2017: Awake.

from kernel.log
(nothing about any suspend operations prior to wakeup time, i.e. all kernel operations relating to suspend are logged on wakeup !)

Sep 7 20:26:41 phs08 kernel: [34307.434357] PM: Syncing filesystems ... done.
Sep 7 20:26:41 phs08 kernel: [34307.498356] PM: Preparing system for sleep (mem)
...
Sep 7 20:26:41 phs08 kernel: [34311.224400] PM: resume of devices complete after 698.454 msecs
Sep 7 20:26:41 phs08 kernel: [34311.224785] PM: Finishing wakeup
...
At 20:28 I successfully saved a spreadsheet file to the nvme disk, that was open before the system was suspended (the file is timestamped with 20:28)

then....
Sep 7 20:29:56 phs08 kernel: [34506.017034] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Sep 7 20:29:56 phs08 kernel: [34506.052638] pci_raw_set_power_state: 5 callbacks suppressed
Sep 7 20:29:56 phs08 kernel: [34506.052642] nvme 0000:01:00.0: Refused to change power state, currently in D3
Sep 7 20:29:56 phs08 kernel: [34506.052779] nvme nvme0: Removing after probe failure status: -19
etc.

So the nvme disk apparently does not become inaccessible until 3 min after the system has woken up !
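
The delay can be read off the bracketed kernel timestamps, which are finer-grained than the syslog wall-clock time. A small Python sketch, using only the two lines quoted above (the helper name `kernel_seconds` is made up for illustration):

```python
import re

# Bracketed kernel timestamp, e.g. "kernel: [34311.224785]".
STAMP = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")

def kernel_seconds(line):
    """Extract the kernel timestamp (seconds) from one syslog line."""
    m = STAMP.search(line)
    if m is None:
        raise ValueError("no kernel timestamp in line")
    return float(m.group(1))

wakeup = "Sep 7 20:26:41 phs08 kernel: [34311.224785] PM: Finishing wakeup"
failure = ("Sep 7 20:29:56 phs08 kernel: [34506.017034] nvme nvme0: "
           "controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff")

delay = kernel_seconds(failure) - kernel_seconds(wakeup)
print(f"{delay:.1f} s (~{delay / 60:.1f} min)")
```

For these two lines the difference is just under 195 seconds, i.e. a little over 3 minutes.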

Kai-Heng Feng (kaihengfeng) wrote :

So is "nvme disk apparently does not become inaccessible until 3 min after the system has woken up" a new behavior?

Steve Roberts (drgrumpy) wrote :

No, this seems to have been happening all along, if you look at the timings in earlier posts...see post #25

...but what throws me and makes it more obscure is that the suspend operations are logged on resume, and until yesterday I had never managed to check that the drive was actually alive. I had assumed it wasn't, and that the apparent delay was just until it was first accessed (var storing the logs is a mountpoint on one of the spinning SATA disks).

Kai-Heng Feng (kaihengfeng) wrote :

That's because disks need to be stopped before the chipset/CPU goes into suspend. So the internal messages are flushed *after* disk wakeup; the disk is not available at the end of the suspend process.

Kai-Heng Feng (kaihengfeng) wrote :
Changed in linux (Ubuntu):
status: In Progress → Triaged

Steve Roberts (drgrumpy) wrote :

OK, thanks for the explanation; it makes sense.

Kai-Heng Feng (kaihengfeng) wrote :

Hmm, there's no response from Samsung.

Do you think we should completely turn off APST for Asus Prime B350m-A + Samsung 960 EVO?

- (bplaa.yai) wrote :

FWIW, I'm also affected by this issue, but on a completely different system (Razer Blade Stealth, Samsung PM951, Fedora 26 - kernel 4.13.4), so I don't think specifically targeting Asus Prime B350m-A + Samsung 960 EVO would do it.

Like the OP, I was having frequent random SSD disconnects (the issue appeared in late August), and thought I was affected by bug 1678184.

Using nvme_core.default_ps_max_latency_us=6000 allows the system to be stable again as long as it's awake, but still I sometimes have the issue when the system is awoken from sleep mode.

Please let me know if providing additional information would help.

Kai-Heng Feng (kaihengfeng) wrote :

bplaa.yai,

Can you file a new bug?
SM/PM951 is an OEM version, which means it's already inside the laptop when you unbox it.
That means we can make a specific workaround for the laptop/NVMe combination.

Steve Roberts (drgrumpy) wrote :

Re: turn off APST completely

Tricky...

So I have been running with:
BOOT_IMAGE=/vmlinuz-4.13.0-8-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro nvme_core.default_ps_max_latency_us=0

for the last month or so with no issues....

I suppose APST is about reducing power consumption, but I am not clear how much it saves. In my (desktop) case the biggest reduction in power consumption likely comes from being able to suspend the system as a whole, so being able to suspend/resume likely outweighs the benefit of APST. On the other hand, for someone with an always-on system, or someone who prefers to hibernate or shut down, having APST enabled could be beneficial...

Kai-Heng Feng (kaihengfeng) wrote :

Steve,

I'll write a patch so that the system does not enable APST.

FWIW, my desktop uses Asus Prime B350m-A + Ryzen 7 1700, but with Intel P600. Currently I don't have any issue when APST is enabled.

Kai-Heng Feng (kaihengfeng) wrote :

Steve, please try [1] without any NVMe parameters.

[1] http://people.canonical.com/~khfeng/lp1705748-evo+ryzen/

Steve Roberts (drgrumpy) wrote :

Kai-Heng,

In response to #66, I guess that definitely points to a firmware issue with the Samsung drive.

#67 I will try and let you know...

Steve Roberts (drgrumpy) wrote :

Kai-Heng,

In reply to #66: I guess this is a definite indication that it is a firmware issue.

#67: there are only two files to download; on all previous occasions there have been four. Is that correct?

Steve

Kai-Heng Feng (kaihengfeng) wrote :

This time I built the kernel on top of the mainline kernel, so there are just two files.

Steve Roberts (drgrumpy) wrote :

Ah ok, so I will need to install 4.14 first ?

Kai-Heng Feng (kaihengfeng) wrote :

No, these two files are sufficient.

Steve Roberts (drgrumpy) wrote :

I installed the files,
removed the nvme_core.default_ps_max_latency_us=0,
did an update-grub and rebooted...
failed to boot into graphical interface and then hung... reset needed

Command line: BOOT_IMAGE=/vmlinuz-4.14.0-rc4-evo+ryzen root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Suspect this is the issue:

Oct 16 10:40:17 phs08 systemd[1]: Started Light Display Manager.
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Main process exited, code=exited, status=1/FAILURE
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Unit entered failed state.
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Failed with result 'exit-code'.
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Service hold-off time over, scheduling restart.

Oct 16 10:40:18 phs08 gpu-manager[2291]: /etc/modprobe.d is not a file
Oct 16 10:40:18 phs08 gpu-manager[2291]: message repeated 4 times: [ /etc/modprobe.d is not a file]
Oct 16 10:40:18 phs08 gpu-manager[2291]: Error: can't open /lib/modules/4.14.0-rc4-evo+ryzen/updates/dkms

Oct 16 10:40:21 phs08 systemd[1]: gpu-manager.service: Start request repeated too quickly.
Oct 16 10:40:21 phs08 systemd[1]: Failed to start Detect the available GPUs and deal with any system changes.
Oct 16 10:40:21 phs08 systemd[1]: lightdm.service: Start request repeated too quickly.
Oct 16 10:40:21 phs08 systemd[1]: Failed to start Light Display Manager.

Kai-Heng Feng (kaihengfeng) wrote :

The fault is mine: I didn't take out-of-tree proprietary modules into account. Please try the Artful kernel here:

http://people.canonical.com/~khfeng/lp1705748-artful/

Steve Roberts (drgrumpy) wrote :

Sorry for slow response:

Command line: BOOT_IMAGE=/vmlinuz-4.13.0-17-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Resulted in spontaneous disappearance (no suspend/resume needed):

Oct 26 17:20:44 phs08 kernel: [ 3011.057465] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Oct 26 17:20:44 phs08 kernel: [ 3011.089763] print_req_error: I/O error, dev nvme0n1, sector 131608072
Oct 26 17:20:44 phs08 kernel: [ 3011.089770] print_req_error: I/O error, dev nvme0n1, sector 202463672

Will try again to see if reproducible...

Steve Roberts (drgrumpy) wrote :

Oct 26 17:23:41 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.13.0-17-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Suspend...
Oct 26 19:00:46 phs08 systemd-sleep[5347]: Suspending system...

Resume....

Oct 26 19:57:11 phs08 kernel: [ 5841.689280] PM: resume of devices complete after 644.551 msecs
Oct 26 19:57:11 phs08 kernel: [ 5841.689579] PM: Finishing wakeup.
Oct 26 19:57:11 phs08 kernel: [ 5841.689580] OOM killer enabled.
...

Oct 26 20:00:29 phs08 kernel: [ 6039.045439] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Oct 26 20:00:29 phs08 kernel: [ 6039.089704] print_req_error: I/O error, dev nvme0n1, sector 12208016

Again it seems to be ~3mins after wakeup that the disk becomes inaccessible...

I noticed this, probably unrelated:

Oct 26 21:42:45 phs08 kernel: [ 0.100002] ACPI Error: Needed [Integer/String/Buffer], found [Region] ffff991f1e97f5a0 (20170531/exresop-424)
Oct 26 21:42:45 phs08 kernel: [ 0.100002] ACPI Exception: AE_AML_OPERAND_TYPE, Could not execute arguments for [IOB2] (Region) (20170531/nsinit-412)
Oc

Kai-Heng Feng (kaihengfeng) wrote :

Can you attach the output of `nvme get-feature -f 0x0c -H /dev/nvme0`?

Steve Roberts (drgrumpy) wrote :

Output from get-feature:

get-feature:0x0c (Autonomous Power State Transition), Current value: 0x000001
 Autonomous Power State Transition Enable (APSTE): Enabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 86 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 86 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 2]
 .................
 Idle Time Prior to Transition (ITPT): 86 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 410 ms
 Idle Transition Power State (ITPS): 4
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 5]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 6]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 7]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 8]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 9]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[10]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[11]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[12]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[13]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[14]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[15]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[16]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[17]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[18]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[19]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[20]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[21]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 ...
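
To make the long listing easier to read, here is a minimal Python sketch that keeps only the non-zero entries. It assumes the text layout shown above, and it relies on the NVMe convention that entry N describes where the drive goes when idle in power state N; `parse_apst` is a hypothetical helper, not an nvme-cli feature.

```python
import re

ENTRY = re.compile(r"Entry\[\s*(\d+)\]")
ITPT = re.compile(r"\(ITPT\):\s*(\d+)\s*ms")
ITPS = re.compile(r"\(ITPS\):\s*(\d+)")

def parse_apst(text):
    """Return {entry index: (idle ms, target power state)} for non-zero entries."""
    table = {}
    index = idle_ms = None
    for line in text.splitlines():
        if (m := ENTRY.search(line)):
            index, idle_ms = int(m.group(1)), None
        elif (m := ITPT.search(line)):
            idle_ms = int(m.group(1))
        elif (m := ITPS.search(line)) and index is not None and idle_ms is not None:
            target = int(m.group(1))
            if idle_ms or target:  # skip the all-zero (unused) entries
                table[index] = (idle_ms, target)
    return table

# Three entries copied from the output above.
sample = """\
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 86 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 410 ms
 Idle Transition Power State (ITPS): 4
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
"""
for index, (idle_ms, target) in parse_apst(sample).items():
    print(f"from PS{index}: after {idle_ms} ms idle -> PS{target}")
```

Read this way, the output above says the drive drops to power state 3 after 86 ms idle in states 0 to 2, and to state 4 after 410 ms idle in state 3, so APST is still fully enabled on this kernel.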


Kai-Heng Feng (kaihengfeng) wrote :

My apologies, I typed the wrong PCI ID for the Samsung device.

Please try this one instead,
http://people.canonical.com/~khfeng/lp1705748-again/

Steve Roberts (drgrumpy) wrote :

Installed that latest version; I can confirm that APST is disabled without the max_latency kernel parameter.

I have also found a new BIOS for the m/b: 0902

So I will also try installing that and reverting to one of the kernels with APST turned on, to see if there is any difference...

Or is it possible to turn APST back on with a kernel parameter?

Kai-Heng Feng (kaihengfeng) wrote :

Yes. Use "nvme_core.force_apst=1"
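
As a quick way to see which of the nvme_core options mentioned in this thread a system was booted with, here is a hedged Python sketch. It only inspects a command line string; the function names and the summary wording are made up for illustration, and the ordering reflects my understanding that a max latency of 0 disables APST regardless of force_apst.

```python
def nvme_core_params(cmdline):
    """Collect nvme_core.* options from a kernel command line string."""
    params = {}
    prefix = "nvme_core."
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            key, value = token[len(prefix):].split("=", 1)
            params[key] = value
    return params

def describe(params):
    """Summarise the APST-related settings discussed in this thread."""
    latency = params.get("default_ps_max_latency_us")
    if latency == "0":
        return "APST disabled (max latency 0)"
    if params.get("force_apst") == "1":
        return "APST forced on, overriding the quirk"
    if latency is not None:
        return "APST limited to states within %s us" % latency
    return "kernel defaults (quirks apply)"

# Command line shortened from one quoted earlier in this bug.
cmdline = "BOOT_IMAGE=/vmlinuz-4.13.0-8-generic ro nvme_core.default_ps_max_latency_us=0"
print(describe(nvme_core_params(cmdline)))
```

On a running system the string would come from /proc/cmdline.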

Kai-Heng Feng (kaihengfeng) wrote :

Steve, I'll send the patch if you have no concerns.

Thanks for all the testing.

Steve Roberts (drgrumpy) wrote :

Sorry for slow response.
Yes all seems to be fine, I have had 9 days of uptime with numerous suspend and resumes without issue.

Steve Roberts (drgrumpy) wrote :

Will this make it into the mainstream kernel updates, and how do I track when it does?
I just installed 4.13.0-19 and the issue is still there.

Kai-Heng Feng (kaihengfeng) wrote :

> On 5 Jan 2018, at 7:28 PM, Steve Roberts <email address hidden> wrote:
>
> Will this make it into the mainstream kernel updates or how do I track when it is ?
> I just installed 4.13.0-19 and the issue is still there.

I’ll backport it to the Artful kernel. Thanks for the notice.


Seth Forshee (sforshee) on 2018-01-10
Changed in linux (Ubuntu Artful):
assignee: nobody → Kai-Heng Feng (kaihengfeng)
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu):
status: Triaged → Fix Committed
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed

Kai-Heng Feng (kaihengfeng) wrote :

Steve,

I guess LP: #1746340 is the same as this one.
Can you try this kernel with "nvme_core.force_apst=1"? I want to know if a PCI reset for NVMe after resume can solve the issue.

people.canonical.com/~khfeng/lp1746340-2/

Steve Roberts (drgrumpy) wrote :

Yes, but I am a bit confused by that other thread. Can you direct me to the download of the specific kernel you want me to try?

Kai-Heng Feng (kaihengfeng) wrote :
Steve Roberts (drgrumpy) wrote :

Unable to install the file:
linux-headers-4.13.0-34-generic_4.13.0-34.37~lp1746340_amd64.deb

due to missing dependency libssl1.1

Kai-Heng Feng (kaihengfeng) wrote :

Can you manually install libssl1.1? Or skip installing the header files, if you don't use any DKMS modules.

Kai-Heng Feng (kaihengfeng) wrote :

Please try [1].

The previous one was built on Bionic, so it had a higher libssl requirement. I built the new one on Xenial, so it should have the correct version number.

[1] people.canonical.com/~khfeng/lp1746340-pcireset/
