Samsung SSD 960 EVO 500GB refused to change power state

Bug #1705748 reported by Steve Roberts on 2017-07-21
This bug affects 1 person
Affects          Importance   Assigned to
linux (Ubuntu)   Undecided    Kai-Heng Feng
Xenial           Undecided    Unassigned
Artful           Medium       Kai-Heng Feng

Bug Description

=== SRU Justification ===
[Impact]
A user reported his NVMe went out to lunch after S3.

[Fix]
Disable APST for this particular NVMe + Motherboard setup.

[Test]
User confirmed this quirk works for his system.

[Regression Potential]
Very low. This applies only to a specific device and motherboard setup, and the user can
override the quirk with "nvme_core.force_apst=1".
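
As a rough illustration of the fix described above, the quirk can be modelled as a match on the drive's PCI ID plus the motherboard's DMI strings. This is only a sketch in Python, not the kernel code: the function name `apst_allowed` is hypothetical, the vendor ID and board strings are taken from the id-ctrl and dmi output in this report, and the device ID 0xa804 is an assumption borrowed from the test patch posted in this thread.

```python
# Illustrative model only; the real fix lives in the kernel's C driver.
SAMSUNG_VID = 0x144D                    # "vid" from the id-ctrl output in this report
DEVICE_ID = 0xA804                      # assumption: ID used in the test patch in this thread
BOARD_VENDOR = "ASUSTeK COMPUTER INC."  # dmi.board.vendor
BOARD_NAME = "PRIME B350M-A"            # dmi.board.name


def apst_allowed(vendor, device, board_vendor, board_name, force_apst=False):
    """Return False when this drive + board pair should run with APST disabled."""
    if force_apst:
        # Models nvme_core.force_apst=1: the user overrides the quirk.
        return True
    quirky = (
        vendor == SAMSUNG_VID
        and device == DEVICE_ID
        and board_vendor == BOARD_VENDOR
        and board_name == BOARD_NAME
    )
    return not quirky
```

Because the match needs both the drive and the board, the same drive on a different motherboard would keep APST enabled in this model.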

=== Original Bug Report ===
I originally thought my issue was the same as this one:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184
but I was asked to report it as a separate bug.

The system becomes unusable at seemingly random times, but especially after resume from suspend, because the disk 'disappears' and becomes inaccessible, with hundreds of I/O errors logged.

After viewing the above bug report yesterday, as a quick temporary fix I added a kernel parameter and updated grub with:
GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0"

System appears to have been stable for the last day, but is presumably using more power than it should.
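
To check that the workaround actually reached the running kernel, the boot command line can be inspected. The snippet below is a minimal sketch; the helper name `apst_latency_override` is hypothetical, and it only parses a command-line string such as the contents of /proc/cmdline.

```python
def apst_latency_override(cmdline):
    """Return nvme_core.default_ps_max_latency_us as an int, or None if unset."""
    key = "nvme_core.default_ps_max_latency_us="
    for token in cmdline.split():
        if token.startswith(key):
            return int(token[len(key):])
    return None


# Example using a command line of the same shape as the one in this report.
example = "BOOT_IMAGE=/vmlinuz-4.10.0-26-generic ro nvme_core.default_ps_max_latency_us=0"
print(apst_latency_override(example))  # prints 0
```

On a live system the string would come from reading /proc/cmdline; a result of 0 means the workaround is active.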

System, drive details below:

M2 nvme drive: Samsung SSD 960 EVO 500GB

Ubuntu 4.10.0-26.30~16.04.1-generic 4.10.17

M/B Asus Prime B350m-A
Ryzen 1600 cpu

Jul 20 16:32:59 phs08 kernel: [ 190.893571] INVALID_DEVICE_REQUEST device=00:00.0 address=0xfffffffdf8000000 flags=0x0a00]
Jul 20 16:33:05 phs08 kernel: [ 197.010928] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jul 20 16:33:05 phs08 kernel: [ 197.046980] pci_raw_set_power_state: 4 callbacks suppressed
Jul 20 16:33:05 phs08 kernel: [ 197.046985] nvme 0000:01:00.0: Refused to change power state, currently in D3
Jul 20 16:33:05 phs08 kernel: [ 197.047163] nvme nvme0: Removing after probe failure status: -19
Jul 20 16:33:05 phs08 kernel: [ 197.047182] nvme0n1: detected capacity change from 500107862016 to 0
Jul 20 16:33:05 phs08 kernel: [ 197.047793] blk_update_request: I/O error, dev nvme0n1, sector 0
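
For anyone wanting to check their own syslog for the same failure, a small filter along these lines may help. It is a sketch only: the patterns are copied from the messages above, and the function name `nvme_failure_lines` is hypothetical.

```python
import re

# Message fragments taken from the log excerpt in this report.
SIGNATURES = [
    re.compile(r"controller is down; will reset"),
    re.compile(r"Refused to change power state"),
    re.compile(r"Removing after probe failure"),
    re.compile(r"detected capacity change from \d+ to 0"),
]


def nvme_failure_lines(lines):
    """Return the log lines that match any known failure signature."""
    return [line for line in lines if any(p.search(line) for p in SIGNATURES)]


sample = [
    "kernel: [ 197.010928] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff",
    "kernel: [ 197.046980] pci_raw_set_power_state: 4 callbacks suppressed",
    "kernel: [ 197.047182] nvme0n1: detected capacity change from 500107862016 to 0",
]
for line in nvme_failure_lines(sample):
    print(line)
```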

nvme list

/dev/nvme0n1 S3EUNX0J305518L Samsung SSD 960 EVO 500GB 1.2 1 125.20 GB / 500.11 GB 512 B + 0 B 2B7QCXE7

sudo nvme id-ctrl /dev/nvme0

NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : S3EUNX0J305518L
mn : Samsung SSD 960 EVO 500GB
fr : 2B7QCXE7
rab : 2
ieee : 002538
cmic : 0
mdts : 9
cntlid : 2
ver : 10200
rtd3r : 7a120
rtd3e : 4c4b40
oaes : 0
oacs : 0x7
acl : 7
aerl : 3
frmw : 0x16
lpa : 0x3
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 350
cctemp : 352
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 500107862016
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0x5
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
ps 0 : mp:6.04W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:5.09W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:4.08W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
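
The power state table above is what matters for APST: the non-operational states (ps 3 and ps 4) are the ones the drive may enter on its own when idle. The sketch below parses those lines and lists the states whose entry plus exit latency fits under a given limit. It is a simplified model with hypothetical function names, not the driver's actual selection logic.

```python
import re

PS_RE = re.compile(
    r"ps\s+(\d+)\s*:\s*mp:([\d.]+)W\s+(operational|non-operational)"
    r"\s+enlat:(\d+)\s+exlat:(\d+)"
)


def parse_power_states(text):
    """Parse 'ps N : ...' lines from nvme id-ctrl output into dicts."""
    states = []
    for m in PS_RE.finditer(text):
        states.append({
            "ps": int(m.group(1)),
            "max_power_w": float(m.group(2)),
            "operational": m.group(3) == "operational",
            "enlat_us": int(m.group(4)),  # entry latency, microseconds
            "exlat_us": int(m.group(5)),  # exit latency, microseconds
        })
    return states


def apst_candidates(states, max_latency_us):
    """Non-operational states whose total latency does not exceed the limit."""
    return [
        s["ps"] for s in states
        if not s["operational"] and s["enlat_us"] + s["exlat_us"] <= max_latency_us
    ]


SAMPLE = """\
ps 2 : mp:4.08W operational enlat:0 exlat:0 rrt:2 rrl:2
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
"""
states = parse_power_states(SAMPLE)
print(apst_candidates(states, 0))      # prints []
print(apst_candidates(states, 10000))  # prints [3, 4]
```

With a limit of 0, as set by the workaround, no state qualifies in this model, which is consistent with the drive staying in an operational state.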
---
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: drgrumpy 2192 F.... pulseaudio
 /dev/snd/pcmC1D0p: drgrumpy 2192 F...m pulseaudio
 /dev/snd/controlC1: drgrumpy 2192 F.... pulseaudio
DistroRelease: Linux 18.2
EcryptfsInUse: Yes
HibernationDevice: RESUME=UUID=a6896cdd-6bac-4e7f-9e13-55460859c3ec
InstallationDate: Installed on 2017-07-06 (15 days ago)
InstallationMedia: Linux Mint 18.2 "Sonya" - Release amd64 20170628
IwConfig:
 lo no wireless extensions.

 enp37s0 no wireless extensions.
MachineType: System manufacturer System Product Name
Package: linux (not installed)
ProcEnviron:
 LANGUAGE=en_GB:en
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcFB: 0 nouveaufb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.10.0-26-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro nvme_core.default_ps_max_latency_us=0
ProcVersionSignature: Ubuntu 4.10.0-26.30~16.04.1-generic 4.10.17
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: Home directory not accessible: Permission denied
 No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.10.0-26-generic N/A
 linux-backports-modules-4.10.0-26-generic N/A
 linux-firmware 1.157.11
RfKill:

Tags: sonya
Uname: Linux 4.10.0-26-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 06/20/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0805
dmi.board.asset.tag: Default string
dmi.board.name: PRIME B350M-A
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev X.0x
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0805:bd06/20/2017:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnPRIMEB350M-A:rvrRevX.0x:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Steve Roberts (drgrumpy) wrote :
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.13 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc1/

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1705748

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected sonya
description: updated


Steve Roberts (drgrumpy) wrote :

This is a brand new build, with Mint 18.2 installed initially with kernel 4.8.53, but upgraded straight away to the latest kernel due to using a Ryzen cpu.

Will try to test with the 4.13 in the next 24 h

Steve Roberts (drgrumpy) wrote :

Confirm that removing:
nvme_core.default_ps_max_latency_us=0
from boot params results in reproducible problem on resume from suspend.

Downloaded and installed 4.13 headers and generic kernel:
Linux version 4.13.0-041300rc1-generic

Unable to test issue with above kernel. System boots ok, but now fails to suspend, so cannot check result on resume. Starts to suspend, switches off monitors, then hangs and becomes totally unresponsive, but still powered up, fan running, etc, will not resume, requiring power down or reset button to be pressed, no clues in syslog:

Jul 22 00:40:05 phs08 systemd[1]: Reached target Sleep.
Jul 22 00:40:05 phs08 systemd[1]: Starting Suspend...
Jul 22 00:40:05 phs08 systemd-sleep[4059]: Failed to connect to non-global ctrl_ifname: (nil) error: No such file or directory
Jul 22 00:40:05 phs08 systemd-sleep[4060]: /lib/systemd/system-sleep/wpasupplicant failed with error code 255.
Jul 22 00:40:05 phs08 systemd-sleep[4059]: Suspending system...

Screens off, keyboard, mouse unresponsive, will not resume from suspend

Reset or Power off/on -> re-boots ok

System is also much noisier with fan seemingly running at higher speed, but cannot check as it87 will not load.

Also notice that asus_wmi is loaded, but this is not a laptop.

Steve Roberts (drgrumpy) on 2017-07-22
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Please remove "nvme_core.default_ps_max_latency_us=0" then try this linux kernel:
http://people.canonical.com/~khfeng/lp1705748/

Steve Roberts (drgrumpy) wrote :

Many thanks, suggested kernel now under test...

Steve Roberts (drgrumpy) wrote :

Unfortunately the problem is still there with this 4.11 kernel...
After initial install of the kernel and a suspend/resume it all seemed ok, but then after leaving suspended overnight, the disk goes missing again, see extracts below.

Jul 25 01:30:54 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-12-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro
...
Jul 25 01:42:58 phs08 systemd-sleep[5315]: Suspending system...
Jul 25 01:43:33 phs08 kernel: [ 735.437613] PM: Syncing filesystems ... done.
...
Jul 25 01:52:37 phs08 systemd-sleep[5773]: Suspending system...
Jul 25 09:08:20 phs08 kernel: [ 1294.739373] PM: Syncing filesystems ... done.
...
Jul 25 09:10:24 phs08 systemd[1]: Removed slice User Slice of lightdm.
Jul 25 09:11:58 phs08 kernel: [ 1526.953121] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jul 25 09:12:03 phs08 kernel: [ 1531.977506] pci_raw_set_power_state: 4 callbacks suppressed
Jul 25 09:12:03 phs08 kernel: [ 1531.977511] nvme 0000:01:00.0: Refused to change power state, currently in D3
Jul 25 09:12:03 phs08 kernel: [ 1531.977676] nvme nvme0: Removing after probe failure status: -19
Jul 25 09:12:03 phs08 kernel: [ 1531.997538] nvme0n1: detected capacity change from 500107862016 to 0
Jul 25 09:12:03 phs08 kernel: [ 1531.997846] blk_update_request: I/O error, dev nvme0n1, sector 134367992
Jul 25 09:12:03 phs08 kernel: [ 1531.997857] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
... many i/o errors follow, reset needed to reboot

Steve Roberts (drgrumpy) wrote :

One curious thing I note, probably irrelevant, is that after resume there is lots of stuff recorded in the logs that seems to be about suspending, e.g. "Freezing remaining freezable tasks" after the computer has resumed...

Jul 25 01:42:58 phs08 systemd[1]: Reached target Sleep.
Jul 25 01:42:58 phs08 systemd[1]: Starting Suspend...
Jul 25 01:42:58 phs08 systemd-sleep[5315]: Failed to connect to non-global ctrl_ifname: (nil) error: No such file or directory
Jul 25 01:42:58 phs08 systemd-sleep[5316]: /lib/systemd/system-sleep/wpasupplicant failed with error code 255.
Jul 25 01:42:58 phs08 systemd-sleep[5315]: Suspending system...

System is now sleeping, system wakes....

Jul 25 01:43:33 phs08 kernel: [ 735.437613] PM: Syncing filesystems ... done.
Jul 25 01:43:33 phs08 kernel: [ 735.469300] PM: Preparing system for sleep (mem)
Jul 25 01:43:33 phs08 kernel: [ 735.469480] Freezing user space processes ... (elapsed 0.002 seconds) done.
Jul 25 01:43:33 phs08 kernel: [ 735.471929] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Jul 25 01:43:33 phs08 kernel: [ 735.473849] PM: Suspending system (mem)
Jul 25 01:43:33 phs08 kernel: [ 735.473869] Suspending console(s) (use no_console_suspend to debug)
Jul 25 01:43:33 phs08 kernel: [ 735.474317] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
Jul 25 01:43:33 phs08 kernel: [ 735.474394] sd 1:0:0:0: [sdb] Stopping disk
Jul 25 01:43:33 phs08 kernel: [ 735.474620] sd 0:0:0:0: [sda] Synchronizing SCSI cache
Jul 25 01:43:33 phs08 kernel: [ 735.482777] sd 0:0:0:0: [sda] Stopping disk
Jul 25 01:43:33 phs08 kernel: [ 735.483625] i8042 kbd 00:04: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 735.483765] nouveau 0000:27:00.0: DRM: suspending console...
Jul 25 01:43:33 phs08 kernel: [ 735.483771] nouveau 0000:27:00.0: DRM: suspending display...
Jul 25 01:43:33 phs08 kernel: [ 735.505047] nouveau 0000:27:00.0: DRM: evicting buffers...
Jul 25 01:43:33 phs08 kernel: [ 735.580082] nouveau 0000:27:00.0: DRM: waiting for kernel channels to go idle...
Jul 25 01:43:33 phs08 kernel: [ 735.580105] nouveau 0000:27:00.0: DRM: suspending fence...
Jul 25 01:43:33 phs08 kernel: [ 735.581462] nouveau 0000:27:00.0: DRM: suspending object tree...
Jul 25 01:43:33 phs08 kernel: [ 737.956791] PM: suspend of devices complete after 2482.515 msecs
Jul 25 01:43:33 phs08 kernel: [ 737.957846] PM: late suspend of devices complete after 1.050 msecs
Jul 25 01:43:33 phs08 kernel: [ 737.958651] xhci_hcd 0000:28:00.3: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 737.959028] xhci_hcd 0000:22:00.0: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 737.959043] r8169 0000:25:00.0: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 737.959044] xhci_hcd 0000:03:00.0: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 743.441059] PM: noirq suspend of devices complete after 5482.825 msecs
Jul 25 01:43:33 phs08 kernel: [ 743.441087] ACPI: Preparing to enter system sleep state S3
Jul 25 01:43:33 phs08 kernel: [ 743.761477] PM: Saving platform NVS memory
Jul 25 01:43:33 phs08 kernel: [ 743.761495] Disabling non-boot CPUs ...

Kai-Heng Feng (kaihengfeng) wrote :

Try the kernel here, it adds a delay before controller reset.

http://people.canonical.com/~khfeng/lp1705748-2/

---

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f15f4017fb95..1dc42006a8fd 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2308,6 +2308,8 @@ static const struct pci_device_id nvme_id_table[] = {
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
        { PCI_DEVICE(0x1c5f, 0x0540), /* Memblaze Pblaze4 adapter */
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
+       { PCI_DEVICE(0x144d, 0xa804), /* Samsung SM961/PM961 */
+               .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
        { PCI_DEVICE(0x144d, 0xa821), /* Samsung PM1725 */
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
        { PCI_DEVICE(0x144d, 0xa822), /* Samsung PM1725a */

Steve Roberts (drgrumpy) wrote :

Ok, thanks, will try and report back...

Steve Roberts (drgrumpy) wrote :

Unfortunately there seems no improvement, system boots fine, and operates fine, but nvme disappears again after suspend. However I didn't see any spontaneous disappearance of device (so could be some progress ? but will need to test for longer to be certain)

log extracts:

Jul 28 09:36:18 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.12.0-8-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

...booted and runs fine, then suspend...
Jul 28 11:28:57 phs08 systemd-sleep[7216]: Suspending system...
... and resume
Jul 28 12:24:36 phs08 kernel: [ 6773.441351] PM: resume of devices complete after 1369.759 msecs
Jul 28 12:24:36 phs08 kernel: [ 6773.441664] PM: Finishing wakeup.
Jul 28 12:24:36 phs08 systemd-sleep[7216]: System resumed.

Jul 28 12:27:40 phs08 kernel: [ 6957.315181] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Jul 28 12:27:40 phs08 kernel: [ 6957.339506] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Jul 28 12:27:40 phs08 kernel: [ 6957.339612] nvme nvme0: Removing after probe failure status: -19
Jul 28 12:27:40 phs08 kernel: [ 6957.375614] nvme0n1: detected capacity change from 500107862016 to 0

But one thing I note, is that the kernel records re-starting conventional spinning disks:

Jul 28 12:24:36 phs08 kernel: [ 6772.072165] sd 0:0:0:0: [sda] Starting disk
Jul 28 12:24:36 phs08 kernel: [ 6772.072166] sd 1:0:0:0: [sdb] Starting disk

but there seems to be no equivalent for the nvme drive (unless "nvme 0000:01:00.0: enabling device (0000 -> 0002)" is the equivalent).

Steve Roberts (drgrumpy) wrote :

I have also just noticed that the nvme drive seems to disappear 2 or 3 minutes after the system has re-awakened, so whilst it consistently seems to happen *after* resume from S3, it is not *on* resume.
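
The delay can be measured from the kernel timestamps in the log extracts above rather than estimated by eye. This is a small sketch with hypothetical helper names; the two lines are copied from the Jul 28 extract.

```python
import re

TS_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def kernel_ts(line):
    """Extract the kernel timestamp (seconds since boot) from a syslog line."""
    m = TS_RE.search(line)
    if m is None:
        raise ValueError("no kernel timestamp in line")
    return float(m.group(1))


def seconds_between(first, second):
    """Seconds elapsed between two kernel log lines."""
    return kernel_ts(second) - kernel_ts(first)


wake = "Jul 28 12:24:36 phs08 kernel: [ 6773.441664] PM: Finishing wakeup."
down = ("Jul 28 12:27:40 phs08 kernel: [ 6957.315181] nvme nvme0: controller is down; "
        "will reset: CSTS=0xffffffff, PCI_STATUS=0x10")
print(round(seconds_between(wake, down), 1))  # prints 183.9
```

So in that run the controller was reported down roughly three minutes after wakeup finished.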

Kai-Heng Feng (kaihengfeng) wrote :

Probably need to change the device to full functional state before S3, like how basic power management works.

Changed in linux (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → Kai-Heng Feng (kaihengfeng)
Kai-Heng Feng (kaihengfeng) wrote :

Roberts, does the new kernel I built still have the issue?

Steve Roberts (drgrumpy) wrote :

Sorry, I have downloaded but haven't got to testing it yet.... have had to prioritise the day job... will aim to try this evening and report back tomorrow...

Steve Roberts (drgrumpy) wrote :

So last night I installed the kernel and it seems the issue persists:

Aug 3 02:30:59 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.12.0-9-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

suspend:

Aug 3 02:52:35 phs08 systemd[1]: Reached target Sleep.
Aug 3 02:52:35 phs08 systemd[1]: Starting Suspend...
Aug 3 02:52:35 phs08 systemd-sleep[5407]: Failed to connect to non-global ctrl_ifname: (nil) error: No such file or directory
Aug 3 02:52:35 phs08 systemd-sleep[5408]: /lib/systemd/system-sleep/wpasupplicant failed with error code 255.
Aug 3 02:52:35 phs08 systemd-sleep[5407]: Suspending system...

resume next morning:

Aug 3 09:29:21 phs08 kernel: [ 1300.773097] PM: Syncing filesystems ... done.
Aug 3 09:29:21 phs08 kern

Aug 3 09:29:21 phs08 kernel: [ 1305.552367] PM: resume of devices complete after 1226.588 msecs
Aug 3 09:29:21 phs08 kernel: [ 1305.552777] PM: Finishing wakeup.
Aug 3 09:29:21 phs08 kernel: [ 1305.552778] OOM killer enabled.
Aug 3 09:29:21 phs08 kernel: [ 1305.552778] Restarting tasks ... done.
Aug 3 09:29:24 phs08 kernel: [ 1307.942704] r8169 0000:25:00.0 enp37s0: link up
Aug 3 09:29:27 phs08 kernel: [ 1311.521223] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 3 09:29:27 phs08 kernel: [ 1311.533275] ata1.00: configured for UDMA/133
Aug 3 09:29:27 phs08 kernel: [ 1311.581201] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 3 09:29:28 phs08 kernel: [ 1311.593308] ata2.00: configured for UDMA/133

Aug 3 09:32:45 phs08 kernel: [ 1508.852213] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Aug 3 09:32:45 phs08 kernel: [ 1508.896322] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Aug 3 09:32:45 phs08 kernel: [ 1508.896429] nvme nvme0: Removing after probe failure status: -19
Aug 3 09:32:45 phs08 kernel: [ 1508.920265] nvme0n1: detected capacity change from 500107862016 to 0
Aug 3 09:32:45 phs08 kernel: [ 1508.920536] blk_update_request: I/O error, dev nvme0n1, sector 360686080

"Probably need to change the device to full functional state before S3, like how basic power management works"
I can't see anything in the logs that relates to this or suggests it is happening, what should I look for ?

I notice this in case of relevance
Aug 3 09:29:21 phs08 kernel: [ 1303.986393] CPU: 0 PID: 5407 Comm: systemd-sleep Tainted: G OE 4.12.0-9-generic #10

Kai-Heng Feng (kaihengfeng) wrote :

Can you attach the full dmesg? I'd like to check some additional information.

Steve Roberts (drgrumpy) wrote :

I don't have a dmesg log file (due to systemd ?) but I attach syslog from start of boot with the kernel to when the reset button pressed...

Thanks for your attention to this Kai-Heng, really appreciated, Steve.

Kai-Heng Feng (kaihengfeng) wrote :

Please try this one:
http://people.canonical.com/~khfeng/lp1705748-3/

This kernel disables APST before controller shutdown. That alone does not look sufficient (comment #31), so this kernel also explicitly sets power state 0 before controller shutdown.

Steve Roberts (drgrumpy) wrote :

Ok, will test and report back

Steve Roberts (drgrumpy) wrote :

Unfortunately there seems another issue with this kernel:
The screen/video is blank on resume, I think this is the offending part:

Aug 8 14:08:50 phs08 kernel: [ 165.478069] nouveau 0000:27:00.0: DRM: resuming fence...
Aug 8 14:08:50 phs08 kernel: [ 165.478082] nouveau 0000:27:00.0: DRM: resuming display...
Aug 8 14:08:50 phs08 kernel: [ 165.501235] nouveau 0000:27:00.0: DRM: resuming console...
Aug 8 14:08:50 phs08 kernel: [ 165.524428] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Aug 8 14:08:50 phs08 kernel: [ 165.541138] ata6.00: configured for UDMA/133
Aug 8 14:08:50 phs08 kernel: [ 165.636931] usb 1-3: reset high-speed USB device number 2 using xhci_hcd
Aug 8 14:08:50 phs08 kernel: [ 166.032975] usb 1-8: reset high-speed USB device number 3 using xhci_hcd
Aug 8 14:08:50 phs08 kernel: [ 168.723550] r8169 0000:25:00.0 enp37s0: link up
Aug 8 14:08:50 phs08 kernel: [ 169.227911] PM: resume of devices complete after 4181.953 msecs
Aug 8 14:08:50 phs08 kernel: [ 169.228289] PM: Finishing wakeup.
Aug 8 14:08:50 phs08 kernel: [ 169.228290] OOM killer enabled.
Aug 8 14:08:50 phs08 kernel: [ 169.228290] Restarting tasks ... done.
Aug 8 14:08:50 phs08 kernel: [ 169.236474] usb 1-3.1: USB disconnect, device number 4
Aug 8 14:08:50 phs08 kernel: [ 169.655196] nouveau 0000:27:00.0: DVI-D-1: EDID is invalid:
Aug 8 14:08:50 phs08 kernel: [ 169.655198] [00] BAD 00 ff ff ff ff ff ff 00 4c 2d 40 02 39 31 42 4d
Aug 8 14:08:50 phs08 kernel: [ 169.655198] [00] BAD 09 12 01 03 00 26 1e 78 2a ee 95 a3 54 4c 99 26
Aug 8 14:08:50 phs08 kernel: [ 169.655199] [00] BAD 0f 50 54 bf ef 80 81 80 81 40 71 4f 01 01 01 01
Aug 8 14:08:50 phs08 kernel: [ 169.655199] [00] BAD 01 01 01 01 01 01 30 2a 00 98 51 00 2a 40 30 70
Aug 8 14:08:50 phs08 kernel: [ 169.655199] [00] BAD 13 00 78 2d 11 00 00 1e 00 00 00 fd 00 38 4b 1e
Aug 8 14:08:50 phs08 kernel: [ 169.655200] [00] BAD 51 0e 00 0a 20 20 20 20 20 20 00 00 00 fc 00 53
Aug 8 14:08:50 phs08 kernel: [ 169.655200] [00] BAD 79 6e 63 4d 61 73 74 65 72 0a 20 20 00 00 00 ff
Aug 8 14:08:50 phs08 kernel: [ 169.655200] [00] BAD 00 48 56 4e 51 32 30 30 37 37 35 0a 20 20 00 40
Aug 8 14:08:50 phs08 kernel: [ 169.655202] nouveau 0000:27:00.0: DRM: DDC responded, but no EDID for DVI-D-1
Aug 8 14:08:51 phs08 kernel: [ 169.798065] nouveau 0000:27:00.0: HDMI-A-1: EDID is invalid:
Aug 8 14:08:51 phs08 kernel: [ 169.798066] [00] BAD 00 ff ff ff ff ff ff 00 4c 2d 40 02 39 31 42 4d
Aug 8 14:08:51 phs08 kernel: [ 169.798067] [00] BAD 09 12 01 03 00 26 1e 78 2a ee 95 a3 54 4c 99 26
Aug 8 14:08:51 phs08 kernel: [ 169.798067] [00] BAD 0f 50 54 bf ef 80 81 80 81 40 71 4f 01 01 01 01
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 01 01 01 01 01 01 30 2a 00 98 51 00 2a 40 30 70
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 13 00 78 2d 11 00 00 1e 00 00 00 fd 00 38 4b 1e
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 51 0e 00 0a 20 20 20 20 20 20 00 00 00 fc 00 53
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 79 6e 63 4d 61 73 74 65 72 0a 20 20 00 00 00 ff
Aug 8 14:08:51 phs08 kernel: [ 169.798069] [00] BAD 00 48 56 4e 51 32 30 ...


Kai-Heng Feng (kaihengfeng) wrote :

Yea, I think it's normal, because the kernel version is the same. The old one with the same version will be removed before installing the new one.

I have no idea about the nouveau part though. I'll build a new one based on 4.11 to focus on the NVMe bug.

Steve Roberts (drgrumpy) wrote :

Sorry for slow feedback....

Unfortunately the issue is still there:

Aug 15 12:52:58 phs08 kernel: [ 0.000000] Linux version 4.11.0-14-generic (root@ryzen) (gcc version 6.3.0 20170618 (Ubuntu 6.3.0-19ubuntu1) ) #20 SMP Wed Aug 9 20:56:51 CST 2017 (Ubuntu 4.11.0-14.20-generic 4.11.12)
Aug 15 12:52:58 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-14-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Tue Aug 15 13:31:03 BST 2017: performing suspend
Tue Aug 15 14:11:07 BST 2017: Awake.

Aug 15 14:11:07 phs08 kernel: [ 2297.422103] nvme nvme0: disabling APST...
Aug 15 14:11:07 phs08 kernel: [ 2297.423143] nvme nvme0: setting power state to 0...
...
Aug 15 14:11:07 phs08 kernel: [ 2302.067538] PM: resume of devices complete after 1158.582 msecs
Aug 15 14:11:07 phs08 kernel: [ 2302.067889] PM: Finishing wakeup.
Aug 15 14:11:07 phs08 kernel: [ 2302.067890] Restarting tasks ... done.
Aug 15 14:11:09 phs08 kernel: [ 2304.530720] r8169 0000:25:00.0 enp37s0: link up
Aug 15 14:11:13 phs08 kernel: [ 2307.860622] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 15 14:11:13 phs08 kernel: [ 2307.864647] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 15 14:11:13 phs08 kernel: [ 2307.872508] ata1.00: configured for UDMA/133
Aug 15 14:11:13 phs08 kernel: [ 2307.876591] ata2.00: configured for UDMA/133
Aug 15 14:14:02 phs08 kernel: [ 2477.191603] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Aug 15 14:14:02 phs08 kernel: [ 2477.227261] pci_raw_set_power_state: 4 callbacks suppressed
Aug 15 14:14:02 phs08 kernel: [ 2477.227266] nvme 0000:01:00.0: Refused to change power state, currently in D3
Aug 15 14:14:02 phs08 kernel: [ 2477.227394] nvme nvme0: Removing after probe failure status: -19
Aug 15 14:14:02 phs08 kernel: [ 2477.255682] nvme0n1: detected capacity change from 500107862016 to 0

It still seems odd to me that this apparently occurs 2-3 mins after the system has woken up...

One thing that confuses me is that the suspend operations like:

"nvme nvme0: disabling APST..."

are recorded as happening after the system has woken up! Is this a matter of them not being logged until the system has woken up, or of them actually not executing until the system has woken up? Either
way it is confusing to see suspend operations occurring after the system has resumed.

Kai-Heng Feng (kaihengfeng) wrote :

I guess the NVMe probably needs to sleep for the exlat time (6000 here) to let it transition back to power state 0.

Please try http://people.canonical.com/~khfeng/lp1705748-sleep/

Changed in linux (Ubuntu):
status: In Progress → Triaged
Steve Roberts (drgrumpy) wrote :

Kai-Heng,

In reply to #66: I guess this is a definite indication that it is a firmware issue.

#67: there are only two files to download, on all previous occasions there have been four, is that correct ?

Steve

Kai-Heng Feng (kaihengfeng) wrote :

This time I built the kernel on top of mainline kernel, so there are just two files.

Steve Roberts (drgrumpy) wrote :

Ah ok, so I will need to install 4.14 first ?

Kai-Heng Feng (kaihengfeng) wrote :

No, these two files are sufficient.

Steve Roberts (drgrumpy) wrote :

I installed the files,
removed the nvme_core.default_ps_max_latency_us=0,
did an update-grub and rebooted...
failed to boot into graphical interface and then hung... reset needed

Command line: BOOT_IMAGE=/vmlinuz-4.14.0-rc4-evo+ryzen root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Suspect this is the issue:

Oct 16 10:40:17 phs08 systemd[1]: Started Light Display Manager.
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Main process exited, code=exited, status=1/FAILURE
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Unit entered failed state.
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Failed with result 'exit-code'.
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Service hold-off time over, scheduling restart.

Oct 16 10:40:18 phs08 gpu-manager[2291]: /etc/modprobe.d is not a file
Oct 16 10:40:18 phs08 gpu-manager[2291]: message repeated 4 times: [ /etc/modprobe.d is not a file]
Oct 16 10:40:18 phs08 gpu-manager[2291]: Error: can't open /lib/modules/4.14.0-rc4-evo+ryzen/updates/dkms

Oct 16 10:40:21 phs08 systemd[1]: gpu-manager.service: Start request repeated too quickly.
Oct 16 10:40:21 phs08 systemd[1]: Failed to start Detect the available GPUs and deal with any system changes.
Oct 16 10:40:21 phs08 systemd[1]: lightdm.service: Start request repeated too quickly.
Oct 16 10:40:21 phs08 systemd[1]: Failed to start Light Display Manager.

Kai-Heng Feng (kaihengfeng) wrote :

The fault is mine. Didn't take out-of-tree proprietary modules into account. Please try artful kernel here:

http://people.canonical.com/~khfeng/lp1705748-artful/

Steve Roberts (drgrumpy) wrote :

Sorry for slow response:

Command line: BOOT_IMAGE=/vmlinuz-4.13.0-17-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Resulted in spontaneous disappearance (no suspend/resume needed):

Oct 26 17:20:44 phs08 kernel: [ 3011.057465] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Oct 26 17:20:44 phs08 kernel: [ 3011.089763] print_req_error: I/O error, dev nvme0n1, sector 131608072
Oct 26 17:20:44 phs08 kernel: [ 3011.089770] print_req_error: I/O error, dev nvme0n1, sector 202463672

Will try again to see if reproducible...

Steve Roberts (drgrumpy) wrote :

Oct 26 17:23:41 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.13.0-17-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Suspend...
Oct 26 19:00:46 phs08 systemd-sleep[5347]: Suspending system...

Resume....

Oct 26 19:57:11 phs08 kernel: [ 5841.689280] PM: resume of devices complete after 644.551 msecs
Oct 26 19:57:11 phs08 kernel: [ 5841.689579] PM: Finishing wakeup.
Oct 26 19:57:11 phs08 kernel: [ 5841.689580] OOM killer enabled.
...

Oct 26 20:00:29 phs08 kernel: [ 6039.045439] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Oct 26 20:00:29 phs08 kernel: [ 6039.089704] print_req_error: I/O error, dev nvme0n1, sector 12208016

Again it seems to be ~3mins after wakeup that the disk becomes inaccessible...

I noticed this, probably unrelated:

Oct 26 21:42:45 phs08 kernel: [ 0.100002] ACPI Error: Needed [Integer/String/Buffer], found [Region] ffff991f1e97f5a0 (20170531/exresop-424)
Oct 26 21:42:45 phs08 kernel: [ 0.100002] ACPI Exception: AE_AML_OPERAND_TYPE, Could not execute arguments for [IOB2] (Region) (20170531/nsinit-412)
Oc

Kai-Heng Feng (kaihengfeng) wrote :

Can you attach the output of `nvme get-feature -f 0x0c -H /dev/nvme0`?

Steve Roberts (drgrumpy) wrote :

Output from get-feature:

get-feature:0x0c (Autonomous Power State Transition), Current value: 0x000001
 Autonomous Power State Transition Enable (APSTE): Enabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 86 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 86 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 2]
 .................
 Idle Time Prior to Transition (ITPT): 86 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 410 ms
 Idle Transition Power State (ITPS): 4
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 5]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 6]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 7]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 8]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 9]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[10]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[11]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[12]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[13]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[14]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[15]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[16]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[17]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[18]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[19]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[20]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[21]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 ...
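For reading a dump like the one above, a small parser can pull out whether APST is enabled and which table entries carry a non-zero idle time. This is only a sketch: it assumes the human-readable layout printed by `nvme get-feature -f 0x0c -H` as shown in this comment, and the function name `summarize_apst` is made up for illustration.

```python
import re

def summarize_apst(text):
    """Parse human-readable APST output.

    Returns (enabled, entries), where entries lists
    (index, idle_ms, target_power_state) for non-zero idle times only.
    """
    enabled = re.search(r"\(APSTE\):\s*Enabled", text) is not None
    pattern = re.compile(
        r"Entry\[\s*(\d+)\].*?\(ITPT\):\s*(\d+)\s*ms.*?\(ITPS\):\s*(\d+)",
        re.DOTALL,
    )
    entries = []
    for idx, itpt, itps in pattern.findall(text):
        if int(itpt) > 0:  # zero idle time means the entry is unused
            entries.append((int(idx), int(itpt), int(itps)))
    return enabled, entries

# Excerpt of the output posted above (entries 0, 3 and 4 only).
SAMPLE = """\
get-feature:0x0c (Autonomous Power State Transition), Current value: 0x000001
 Autonomous Power State Transition Enable (APSTE): Enabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 86 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 410 ms
 Idle Transition Power State (ITPS): 4
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
"""

enabled, entries = summarize_apst(SAMPLE)
print(enabled, entries)
```

On the excerpt, this reports APST as enabled, with entry 0 moving to power state 3 after 86 ms idle and entry 3 moving to power state 4 after 410 ms.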


Kai-Heng Feng (kaihengfeng) wrote :

My apologies, I typed the wrong PCI ID for the Samsung device.

Please try this one instead,
http://people.canonical.com/~khfeng/lp1705748-again/

Steve Roberts (drgrumpy) wrote :

Installed the latest version and can confirm that APST is disabled without the default_ps_max_latency_us kernel parameter.

I have also found a new BIOS for the m/b: 0902

So I will also try installing that and reverting to one of the kernels with APST turned on, to see if it makes any difference...

Or is it possible to turn APST back on with a kernel parameter?

Kai-Heng Feng (kaihengfeng) wrote :

Yes. Use "nvme_core.force_apst=1"
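To confirm that an override such as `nvme_core.force_apst=1` actually reached the running kernel, one option is to look for `nvme_core.*` options on the kernel command line. The following is a minimal sketch, not part of any existing tool. The helper names `nvme_core_params` and `read_cmdline` are hypothetical, and it only inspects the command line, not what the driver ended up doing.

```python
def nvme_core_params(cmdline):
    """Extract nvme_core.* options from a kernel command line string."""
    prefix = "nvme_core."
    params = {}
    for token in cmdline.split():
        if token.startswith(prefix):
            key, _, value = token[len(prefix):].partition("=")
            params[key] = value
    return params

def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()

# Command line in the style used elsewhere in this report.
example = ("BOOT_IMAGE=/vmlinuz-4.13.0-34-generic "
           "root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro "
           "nvme_core.force_apst=1")
print(nvme_core_params(example))
```

On a live system, `nvme_core_params(read_cmdline())` gives the same view for the booted kernel. An empty result means neither `force_apst` nor `default_ps_max_latency_us` was passed at boot.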

Kai-Heng Feng (kaihengfeng) wrote :

Steve, I'll send the patch if you have no concerns.

Thanks for all the testing.

Steve Roberts (drgrumpy) wrote :

Sorry for the slow response.
Yes, all seems to be fine; I have had 9 days of uptime with numerous suspends and resumes without issue.

Steve Roberts (drgrumpy) wrote :

Will this make it into the mainstream kernel updates, and how do I track when it does?
I just installed 4.13.0-19 and the issue is still there.


> On 5 Jan 2018, at 7:28 PM, Steve Roberts <email address hidden> wrote:
>
> Will this make it into the mainstream kernel updates or how do I track when it is ?
> I just installed 4.13.0-19 and the issue is still there.

I’ll backport it to Artful kernel. Thanks for the notice.


Seth Forshee (sforshee) on 2018-01-10
Changed in linux (Ubuntu Artful):
assignee: nobody → Kai-Heng Feng (kaihengfeng)
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu):
status: Triaged → Fix Committed
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Kai-Heng Feng (kaihengfeng) wrote :

Steve,

I guess LP: #1746340 is the same as this one.
Can you try this kernel with "nvme_core.force_apst=1"? I want to know if a PCI reset for NVMe after resume can solve the issue.

people.canonical.com/~khfeng/lp1746340-2/

Steve Roberts (drgrumpy) wrote :

Yes, but I am a bit confused by that other thread. Can you direct me to the download of the specific kernel you want me to try?

Kai-Heng Feng (kaihengfeng) wrote :
Steve Roberts (drgrumpy) wrote :

Unable to install the file:
linux-headers-4.13.0-34-generic_4.13.0-34.37~lp1746340_amd64.deb

due to missing dependency libssl1.1

Kai-Heng Feng (kaihengfeng) wrote :

Can you manually install libssl1.1? Or skip installing the header files, if you don't use any DKMS modules.

Kai-Heng Feng (kaihengfeng) wrote :

Please try [1].

The previous one was built on Bionic, so it required a newer libssl. I built the new one on Xenial, so it should depend on the correct version.

[1] people.canonical.com/~khfeng/lp1746340-pcireset/

Steve Roberts (drgrumpy) wrote :

Sorry for being so slow; I am not sure I have run the correct tests:

Command line: BOOT_IMAGE=/vmlinuz-4.13.0-34-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Suspend and resume:
Feb 24 11:11:14 phs08 kernel: [ 7608.732297] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Feb 24 11:11:14 phs08 kernel: [ 7608.784945] print_req_error: I/O error, dev nvme0n1, sector 136096888
Feb 24 11:11:14 phs08 kernel: [ 7608.784957] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Feb 24 11:11:14 phs08 kernel: [ 7608.944330] nvme 0000:01:00.0: RESET SUCCEEDED
Feb 24 11:11:14 phs08 kernel: [ 7608.944337] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Feb 24 11:11:14 phs08 kernel: [ 7608.944496] nvme nvme0: Removing after probe failure status: -19
Feb 24 11:11:14 phs08 kernel: [ 7608.976318] nvme0n1: detected capacity change from 500107862016 to 0
Feb 24 11:11:14 phs08 kernel: [ 7608.976521] print_req_error: I/O error, dev nvme0n1, sector 376046984

Note that the force_apst parameter was not used above; I tried it below with a more recent kernel...

Command line: BOOT_IMAGE=/vmlinuz-4.15.0-9-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro nvme_core.force_apst=1

Didn't get as far as suspending; the system hangs and requires a reset:
Feb 24 11:24:10 phs08 kernel: [ 567.842115] INVALID_DEVICE_REQUEST device=00:00.0 address=0xfffffffdf8000000 flags=0x0a00]

Will try again with force_apst=1 shortly.

Kai-Heng Feng (kaihengfeng) wrote :

Hmm, I'll build a new one based on Bionic's kernel.

Steve Roberts (drgrumpy) wrote :

In case it helps, a further test with 4.13.0-34:

Feb 24 11:48:40 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.13.0-34-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro nvme_core.force_apst=1

... boots fine, appears ok then suspend overnight...

Feb 25 11:54:43 phs08 kernel: [ 775.372073] sd 0:0:0:0: [sda] Starting disk
Feb 25 11:54:43 phs08 kernel: [ 775.372077] sd 1:0:0:0: [sdb] Starting disk
Feb 25 11:54:43 phs08 kernel: [ 775.372156] serial 00:05: activated
Feb 25 11:54:43 phs08 kernel: [ 775.555973] r8169 0000:25:00.0 enp37s0: link down
Feb 25 11:54:43 phs08 kernel: [ 775.658010] nvme 0000:01:00.0: RESET SUCCEEDED

seems okay for 5 mins... then...

Feb 25 11:59:39 phs08 kernel: [ 1072.862588] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Feb 25 11:59:39 phs08 kernel: [ 1072.906660] print_req_error: I/O error, dev nvme0n1, sector 30511480
Feb 25 11:59:40 phs08 kernel: [ 1073.054471] nvme 0000:01:00.0: RESET SUCCEEDED
Feb 25 11:59:40 phs08 kernel: [ 1073.054478] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Feb 25 11:59:40 phs08 kernel: [ 1073.054650] nvme nvme0: Removing after probe failure status: -19
Feb 25 11:59:40 phs08 kernel: [ 1073.078418] nvme0n1: detected capacity change from 500107862016 to 0

... the usual I/O errors, and a hard reset was needed.
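The "seems okay for 5 mins" gap can be read straight off the kernel timestamps. The sketch below assumes the syslog format shown in this comment and keys on the two message strings that appear in these logs ("RESET SUCCEEDED" and "controller is down"). Other kernels may word them differently, and the function name is illustrative only.

```python
import re

# Kernel timestamp such as "[ 775.658010]" or "[ 1072.862588]".
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")

def seconds_from_reset_to_failure(lines):
    """Seconds between the first 'RESET SUCCEEDED' line and the first
    later 'controller is down' line, or None if either is missing."""
    reset_at = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue
        stamp = float(match.group(1))
        if reset_at is None and "RESET SUCCEEDED" in line:
            reset_at = stamp
        elif reset_at is not None and "controller is down" in line:
            return stamp - reset_at
    return None

# Two lines copied from the log above.
LOG = [
    "Feb 25 11:54:43 phs08 kernel: [ 775.658010] nvme 0000:01:00.0: RESET SUCCEEDED",
    "Feb 25 11:59:39 phs08 kernel: [ 1072.862588] nvme nvme0: controller is down; "
    "will reset: CSTS=0xffffffff, PCI_STATUS=0x10",
]

delay = seconds_from_reset_to_failure(LOG)
print(f"{delay:.1f} s after resume")
```

For the two lines above, the difference is about 297 seconds, which matches the roughly five minutes observed.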

Kai-Heng Feng (kaihengfeng) wrote :

The code is "Fix Committed" in the Artful kernel tree, but it's not yet "Fix Released" as a binary package. So the 4.13.0-34 package doesn't contain the fix yet.

Changed in linux (Ubuntu Xenial):
status: New → Fix Committed
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
tags: added: verification-needed-artful
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Steve Roberts (drgrumpy) wrote :

I am not sure if I have done the correct thing, but following the instructions above I have managed to install xenial-proposed kernel 4.13.0-38 on my Mint 18.2 XFCE system:

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.13.0-38-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

I can confirm that APST is disabled by default, and no apparent issues after several cycles of suspend and resume.

I can't see any option to add or change tags.

tags: added: verification-done-artful verification-done-xenial
removed: verification-needed-artful verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-38.43

---------------
linux (4.13.0-38.43) artful; urgency=medium

  * linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)

  * Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
    - i40e: Fix memory leak related filter programming status
    - i40e: Add programming descriptors to cleaned_count

  * [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
    - platform/x86: ideapad-laptop: Increase timeout to wait for EC answer

  * fails to dump with latest kpti fixes (LP: #1750021)
    - kdump: write correct address of mem_section into vmcoreinfo

  * headset mic can't be detected on two Dell machines (LP: #1748807)
    - ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
    - ALSA: hda - Fix headset mic detection problem for two Dell machines
    - ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines

  * CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
    - CIFS: make IPC a regular tcon
    - CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
    - CIFS: dump IPC tcon in debug proc file

  * i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
    - i2c: octeon: Prevent error message on bus error

  * hisi_sas: Add disk LED support (LP: #1752695)
    - scsi: hisi_sas: directly attached disk LED feature for v2 hw

  * EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
    entries with KNL SNC2/SNC4 mode) (LP: #1743856)
    - EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode

  * [regression] Colour banding and artefacts appear system-wide on an Asus
    Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
    - drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA

  * DVB Card with SAA7146 chipset not working (LP: #1742316)
    - vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems

  * [Asus UX360UA] battery status in unity-panel is not changing when battery is
    being charged (LP: #1661876) // AC adapter status not detected on Asus
    ZenBook UX410UAK (LP: #1745032)
    - ACPI / battery: Add quirk for Asus UX360UA and UX410UAK

  * ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
    - ACPI / battery: Add quirk for Asus GL502VSK and UX305LA

  * support thunderx2 vendor pmu events (LP: #1747523)
    - perf pmu: Extract function to get JSON alias map
    - perf pmu: Pass pmu as a parameter to get_cpuid_str()
    - perf tools arm64: Add support for get_cpuid_str function.
    - perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
    - perf vendor events arm64: Add ThunderX2 implementation defined pmu core
      events
    - perf pmu: Add check for valid cpuid in perf_pmu__find_map()

  * lpfc.ko module doesn't work (LP: #1746970)
    - scsi: lpfc: Fix loop mode target discovery

  * Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
    - powerpc/mm/book3s64: Make KERN_IO_START a variable
    - powerpc/mm/slb: Move comment next to the code it's referring to
    - powerpc/mm/hash64: Make vmalloc 56T on hash

  * ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
    - net...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-119.143

---------------
linux (4.4.0-119.143) xenial; urgency=medium

  * linux: 4.4.0-119.143 -proposed tracker (LP: #1760327)

  * Dell XPS 13 9360 bluetooth scan can not detect any device (LP: #1759821)
    - Revert "Bluetooth: btusb: fix QCA Rome suspend/resume"

linux (4.4.0-118.142) xenial; urgency=medium

  * linux: 4.4.0-118.142 -proposed tracker (LP: #1759607)

  * Kernel panic with AWS 4.4.0-1053 / 4.4.0-1015 (Trusty) (LP: #1758869)
    - x86/microcode/AMD: Do not load when running on a hypervisor

  * CVE-2018-8043
    - net: phy: mdio-bcm-unimac: fix potential NULL dereference in
      unimac_mdio_probe()

linux (4.4.0-117.141) xenial; urgency=medium

  * linux: 4.4.0-117.141 -proposed tracker (LP: #1755208)

  * Xenial update to 4.4.114 stable release (LP: #1754592)
    - x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
    - usbip: prevent vhci_hcd driver from leaking a socket pointer address
    - usbip: Fix implicit fallthrough warning
    - usbip: Fix potential format overflow in userspace tools
    - x86/microcode/intel: Fix BDW late-loading revision check
    - x86/retpoline: Fill RSB on context switch for affected CPUs
    - sched/deadline: Use the revised wakeup rule for suspending constrained dl
      tasks
    - can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
    - can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
    - PM / sleep: declare __tracedata symbols as char[] rather than char
    - time: Avoid undefined behaviour in ktime_add_safe()
    - timers: Plug locking race vs. timer migration
    - Prevent timer value 0 for MWAITX
    - drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled
    - drivers: base: cacheinfo: fix boot error message when acpi is enabled
    - PCI: layerscape: Add "fsl,ls2085a-pcie" compatible ID
    - PCI: layerscape: Fix MSG TLP drop setting
    - mmc: sdhci-of-esdhc: add/remove some quirks according to vendor version
    - fs/select: add vmalloc fallback for select(2)
    - hwpoison, memcg: forcibly uncharge LRU pages
    - cma: fix calculation of aligned offset
    - mm, page_alloc: fix potential false positive in __zone_watermark_ok
    - ipc: msg, make msgrcv work with LONG_MIN
    - x86/ioapic: Fix incorrect pointers in ioapic_setup_resources()
    - ACPI / processor: Avoid reserving IO regions too early
    - ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
    - ACPICA: Namespace: fix operand cache leak
    - netfilter: x_tables: speed up jump target validation
    - netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed
      in 64bit kernel
    - netfilter: nf_dup_ipv6: set again FLOWI_FLAG_KNOWN_NH at flowi6_flags
    - netfilter: nf_ct_expect: remove the redundant slash when policy name is
      empty
    - netfilter: nfnetlink_queue: reject verdict request from different portid
    - netfilter: restart search if moved to other chain
    - netfilter: nf_conntrack_sip: extend request line validation
    - netfilter: use fwmark_reflect in nf_send_reset
    - ext2: Don't clear SGID when inheriting ACLs
    - reiserfs: fix race in prealloc discard
    - re...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Kai-Heng Feng (kaihengfeng) wrote :

Robert,

Can you attach `lspci -vvnn` here? Thanks!

Steve Roberts (drgrumpy) wrote :

Hi Kai-Heng, Guessing that request is directed at me, so here it is:

$ lspci -vvnn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1450]
 Subsystem: ASUSTeK Computer Inc. Device [1043:8747]
 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Device [1022:1451]
 Subsystem: ASUSTeK Computer Inc. Device [1043:8747]
 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Interrupt: pin ? routed to IRQ 27
 Capabilities: <access denied>

00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453] (prog-if 00 [Normal decode])
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin ? routed to IRQ 284
 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
 Memory behind bridge: f7900000-f79fffff
 Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
 BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
 Capabilities: <access denied>
 Kernel driver in use: pcieport
 Kernel modules: shpchp

00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453] (prog-if 00 [Normal decode])
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin ? routed to IRQ 285
 Bus: primary=00, secondary=02, subordinate=08, sec-latency=0
 I/O behind bridge: 0000f000-0000ffff
 Memory behind bridge: f7500000-f77fffff
 Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
 BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
 Capabilities: <access denied>
 Kernel driver in use: pcieport
 Kernel modules: shpchp

00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB...

Steve Roberts (drgrumpy) wrote :

I realise I probably should have run it with sudo, so the sudo version is attached.

Kai-Heng Feng (kaihengfeng) wrote :

Hi Steve,

Can you boot with an older kernel with the original issue, suspend the system, then attach `sudo lspci -vvnn`?

I think ASPM gets set up after the system resumes from S3.
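One way to compare ASPM state before and after a suspend is to extract the `LnkCtl` ASPM field per device from two `sudo lspci -vvnn` dumps and diff the results. A rough sketch follows. It assumes the usual `LnkCtl: ASPM ...;` line format, the helper name `aspm_by_device` is hypothetical, and the sample text is made up for illustration rather than taken from the attachments on this bug.

```python
import re

LNKCTL = re.compile(r"LnkCtl:\s*ASPM\s+([^;]+);")

def aspm_by_device(lspci_text):
    """Map each PCI address to the ASPM setting on its LnkCtl line."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        # Device headers start in column 0; detail lines are indented.
        if line and not line[0].isspace():
            current = line.split()[0]
        match = LNKCTL.search(line)
        if match and current:
            result[current] = match.group(1).strip()
    return result

# Illustrative input only: the LnkCtl values below are invented.
SAMPLE = (
    "00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453]\n"
    "\t\tLnkCtl:\tASPM Disabled; RCB 64 bytes Disabled- CommClk+\n"
    "\n"
    "01:00.0 Non-Volatile memory controller [0108]: (hypothetical NVMe device)\n"
    "\t\tLnkCtl:\tASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+\n"
)

print(aspm_by_device(SAMPLE))
```

Running it on a dump taken before suspend and another taken after resume, then comparing the two dictionaries, would show whether any link changed its ASPM setting across S3.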

Steve Roberts (drgrumpy) wrote :

I have removed many of the older testing kernels, but I think this one does not have the fix, based on post #97.

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.13.0-34-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Suspend

Resume

$ sudo lspci -vvnn > lspci.txt

output attached
