Samsung SSD 960 EVO 500GB refused to change power state

Bug #1705748 reported by Steve Roberts
Affects          Status         Importance   Assigned to     Milestone
linux (Ubuntu)   Fix Released   Undecided    Kai-Heng Feng
Xenial           Fix Released   Undecided    Unassigned
Artful           Fix Released   Medium       Kai-Heng Feng

Bug Description

=== SRU Justification ===
[Impact]
A user reported that his NVMe drive stopped responding after resuming from S3.

[Fix]
Disable APST for this particular NVMe + Motherboard setup.

[Test]
User confirmed this quirk works for his system.

[Regression Potential]
Very low. The quirk applies only to this specific drive and motherboard
setup, and the user can override it with "nvme_core.force_apst=1".
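
To make the scope of the fix concrete, here is a minimal Python sketch of the decision described in the justification: APST is switched off only when both the drive and the motherboard match, and a force option turns it back on. The function name `apst_enabled`, the table layout, and the use of PCI ID 0x144d:0xa804 for this drive (taken from the test patch further down this report) are assumptions for illustration. This is a model of the idea, not the kernel's code.

```python
# Illustrative model only; names and structure are assumptions, not kernel code.
QUIRKED_SETUPS = [
    {
        "pci_id": (0x144D, 0xA804),  # assumed vendor/device ID for this drive
        "board_vendor": "ASUSTeK COMPUTER INC.",
        "board_name": "PRIME B350M-A",
    },
]


def apst_enabled(pci_id, board_vendor, board_name, force_apst=False):
    """Return True if APST would stay enabled for this drive/board pair."""
    if force_apst:  # models nvme_core.force_apst=1 overriding the quirk
        return True
    for setup in QUIRKED_SETUPS:
        if (setup["pci_id"] == pci_id
                and setup["board_vendor"] == board_vendor
                and setup["board_name"] == board_name):
            return False
    return True
```

Because the match needs both halves, the same drive in a different board keeps APST, which is why the regression potential is described as very low.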

=== Original Bug Report ===
I originally thought my issue was the same as this one:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184
but I was asked to report it as a separate bug.

The system becomes unusable at seemingly random times, but especially after resume from suspend, because the disk 'disappears' and becomes inaccessible, with hundreds of I/O errors logged.

After viewing the above bug report yesterday, as a quick temporary fix I added a kernel parameter and updated grub with:
GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0"

System appears to have been stable for the last day, but is presumably using more power than it should.
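
One way to confirm that the workaround actually reached the running kernel is to look for it on the kernel command line. The following is a minimal Python sketch; the helper name `module_params` is hypothetical, and the example string is the command line recorded in the apport data of this report.

```python
def module_params(cmdline, module):
    """Collect 'module.key=value' options from a kernel command line string."""
    params = {}
    prefix = module + "."
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            key, value = token[len(prefix):].split("=", 1)
            params[key] = value
    return params


# Command line taken from the ProcKernelCmdLine field of this report.
example = ("BOOT_IMAGE=/vmlinuz-4.10.0-26-generic "
           "root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro "
           "nvme_core.default_ps_max_latency_us=0")
print(module_params(example, "nvme_core"))
```

On a live system the same function could be fed the contents of /proc/cmdline instead of the example string.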

System, drive details below:

M2 nvme drive: Samsung SSD 960 EVO 500GB

Ubuntu 4.10.0-26.30~16.04.1-generic 4.10.17

M/B Asus Prime B350m-A
Ryzen 1600 cpu

Jul 20 16:32:59 phs08 kernel: [ 190.893571] INVALID_DEVICE_REQUEST device=00:00.0 address=0xfffffffdf8000000 flags=0x0a00]
Jul 20 16:33:05 phs08 kernel: [ 197.010928] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jul 20 16:33:05 phs08 kernel: [ 197.046980] pci_raw_set_power_state: 4 callbacks suppressed
Jul 20 16:33:05 phs08 kernel: [ 197.046985] nvme 0000:01:00.0: Refused to change power state, currently in D3
Jul 20 16:33:05 phs08 kernel: [ 197.047163] nvme nvme0: Removing after probe failure status: -19
Jul 20 16:33:05 phs08 kernel: [ 197.047182] nvme0n1: detected capacity change from 500107862016 to 0
Jul 20 16:33:05 phs08 kernel: [ 197.047793] blk_update_request: I/O error, dev nvme0n1, sector 0

nvme list

/dev/nvme0n1 S3EUNX0J305518L Samsung SSD 960 EVO 500GB 1.2 1 125.20 GB / 500.11 GB 512 B + 0 B 2B7QCXE7

sudo nvme id-ctrl /dev/nvme0

NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : S3EUNX0J305518L
mn : Samsung SSD 960 EVO 500GB
fr : 2B7QCXE7
rab : 2
ieee : 002538
cmic : 0
mdts : 9
cntlid : 2
ver : 10200
rtd3r : 7a120
rtd3e : 4c4b40
oaes : 0
oacs : 0x7
acl : 7
aerl : 3
frmw : 0x16
lpa : 0x3
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 350
cctemp : 352
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 500107862016
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0x5
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
ps 0 : mp:6.04W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:5.09W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:4.08W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
---
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: drgrumpy 2192 F.... pulseaudio
 /dev/snd/pcmC1D0p: drgrumpy 2192 F...m pulseaudio
 /dev/snd/controlC1: drgrumpy 2192 F.... pulseaudio
DistroRelease: Linux 18.2
EcryptfsInUse: Yes
HibernationDevice: RESUME=UUID=a6896cdd-6bac-4e7f-9e13-55460859c3ec
InstallationDate: Installed on 2017-07-06 (15 days ago)
InstallationMedia: Linux Mint 18.2 "Sonya" - Release amd64 20170628
IwConfig:
 lo no wireless extensions.

 enp37s0 no wireless extensions.
MachineType: System manufacturer System Product Name
Package: linux (not installed)
ProcEnviron:
 LANGUAGE=en_GB:en
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcFB: 0 nouveaufb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.10.0-26-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro nvme_core.default_ps_max_latency_us=0
ProcVersionSignature: Ubuntu 4.10.0-26.30~16.04.1-generic 4.10.17
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: Home directory not accessible: Permission denied
 No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.10.0-26-generic N/A
 linux-backports-modules-4.10.0-26-generic N/A
 linux-firmware 1.157.11
RfKill:

Tags: sonya
Uname: Linux 4.10.0-26-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 06/20/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0805
dmi.board.asset.tag: Default string
dmi.board.name: PRIME B350M-A
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev X.0x
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0805:bd06/20/2017:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnPRIMEB350M-A:rvrRevX.0x:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Steve Roberts (drgrumpy) wrote :
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.13 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc1/

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1705748

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

Steve Roberts (drgrumpy) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected sonya
description: updated

Steve Roberts (drgrumpy) wrote : CRDA.txt

apport information

Steve Roberts (drgrumpy) wrote : CurrentDmesg.txt

apport information

Steve Roberts (drgrumpy) wrote : JournalErrors.txt

apport information

Steve Roberts (drgrumpy) wrote : Lspci.txt

apport information

Steve Roberts (drgrumpy) wrote : Lsusb.txt

apport information

Steve Roberts (drgrumpy) wrote : ProcCpuinfo.txt

apport information

Steve Roberts (drgrumpy) wrote : ProcCpuinfoMinimal.txt

apport information

Steve Roberts (drgrumpy) wrote : ProcInterrupts.txt

apport information

Steve Roberts (drgrumpy) wrote : ProcModules.txt

apport information

Steve Roberts (drgrumpy) wrote : UdevDb.txt

apport information

Steve Roberts (drgrumpy) wrote : WifiSyslog.txt

apport information

Steve Roberts (drgrumpy) wrote :

This is a brand new build, with Mint 18.2 installed initially with kernel 4.8.53, but upgraded straight away to the latest kernel because it uses a Ryzen cpu.

Will try to test with the 4.13 in the next 24 h

Steve Roberts (drgrumpy) wrote :

Confirm that removing:
nvme_core.default_ps_max_latency_us=0
from boot params results in reproducible problem on resume from suspend.

Downloaded and installed 4.13 headers and generic kernel:
Linux version 4.13.0-041300rc1-generic

Unable to test the issue with the above kernel. The system boots ok, but now fails to suspend, so I cannot check the result on resume. It starts to suspend and switches off the monitors, then hangs and becomes totally unresponsive while still powered up (fan running, etc). It will not resume and needs a power down or the reset button to be pressed. No clues in syslog:

Jul 22 00:40:05 phs08 systemd[1]: Reached target Sleep.
Jul 22 00:40:05 phs08 systemd[1]: Starting Suspend...
Jul 22 00:40:05 phs08 systemd-sleep[4059]: Failed to connect to non-global ctrl_ifname: (nil) error: No such file or directory
Jul 22 00:40:05 phs08 systemd-sleep[4060]: /lib/systemd/system-sleep/wpasupplicant failed with error code 255.
Jul 22 00:40:05 phs08 systemd-sleep[4059]: Suspending system...

Screens off, keyboard, mouse unresponsive, will not resume from suspend

Reset or Power off/on -> re-boots ok

System is also much noisier with fan seemingly running at higher speed, but cannot check as it87 will not load.

Also notice that asus_wmi is loaded, but this is not a laptop.

Steve Roberts (drgrumpy)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Please remove "nvme_core.default_ps_max_latency_us=0" then try this linux kernel:
http://people.canonical.com/~khfeng/lp1705748/

Steve Roberts (drgrumpy) wrote :

Many thanks, suggested kernel now under test...

Steve Roberts (drgrumpy) wrote :

Unfortunately the problem is still there with this 4.11 kernel...
After initial install of the kernel and a suspend/resume it all seemed ok, but then after leaving suspended overnight, the disk goes missing again, see extracts below.

Jul 25 01:30:54 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-12-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro
...
Jul 25 01:42:58 phs08 systemd-sleep[5315]: Suspending system...
Jul 25 01:43:33 phs08 kernel: [ 735.437613] PM: Syncing filesystems ... done.
...
Jul 25 01:52:37 phs08 systemd-sleep[5773]: Suspending system...
Jul 25 09:08:20 phs08 kernel: [ 1294.739373] PM: Syncing filesystems ... done.
...
Jul 25 09:10:24 phs08 systemd[1]: Removed slice User Slice of lightdm.
Jul 25 09:11:58 phs08 kernel: [ 1526.953121] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jul 25 09:12:03 phs08 kernel: [ 1531.977506] pci_raw_set_power_state: 4 callbacks suppressed
Jul 25 09:12:03 phs08 kernel: [ 1531.977511] nvme 0000:01:00.0: Refused to change power state, currently in D3
Jul 25 09:12:03 phs08 kernel: [ 1531.977676] nvme nvme0: Removing after probe failure status: -19
Jul 25 09:12:03 phs08 kernel: [ 1531.997538] nvme0n1: detected capacity change from 500107862016 to 0
Jul 25 09:12:03 phs08 kernel: [ 1531.997846] blk_update_request: I/O error, dev nvme0n1, sector 134367992
Jul 25 09:12:03 phs08 kernel: [ 1531.997857] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
... many i/o errors follow, reset needed to reboot

Steve Roberts (drgrumpy) wrote :

One curious thing I note, probably irrelevant, is that after resume there is a lot recorded in the logs that seems to be about suspending, e.g. "Freezing remaining freezable tasks" logged after the computer has resumed...

Jul 25 01:42:58 phs08 systemd[1]: Reached target Sleep.
Jul 25 01:42:58 phs08 systemd[1]: Starting Suspend...
Jul 25 01:42:58 phs08 systemd-sleep[5315]: Failed to connect to non-global ctrl_ifname: (nil) error: No such file or directory
Jul 25 01:42:58 phs08 systemd-sleep[5316]: /lib/systemd/system-sleep/wpasupplicant failed with error code 255.
Jul 25 01:42:58 phs08 systemd-sleep[5315]: Suspending system...

System is now sleeping, system wakes....

Jul 25 01:43:33 phs08 kernel: [ 735.437613] PM: Syncing filesystems ... done.
Jul 25 01:43:33 phs08 kernel: [ 735.469300] PM: Preparing system for sleep (mem)
Jul 25 01:43:33 phs08 kernel: [ 735.469480] Freezing user space processes ... (elapsed 0.002 seconds) done.
Jul 25 01:43:33 phs08 kernel: [ 735.471929] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Jul 25 01:43:33 phs08 kernel: [ 735.473849] PM: Suspending system (mem)
Jul 25 01:43:33 phs08 kernel: [ 735.473869] Suspending console(s) (use no_console_suspend to debug)
Jul 25 01:43:33 phs08 kernel: [ 735.474317] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
Jul 25 01:43:33 phs08 kernel: [ 735.474394] sd 1:0:0:0: [sdb] Stopping disk
Jul 25 01:43:33 phs08 kernel: [ 735.474620] sd 0:0:0:0: [sda] Synchronizing SCSI cache
Jul 25 01:43:33 phs08 kernel: [ 735.482777] sd 0:0:0:0: [sda] Stopping disk
Jul 25 01:43:33 phs08 kernel: [ 735.483625] i8042 kbd 00:04: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 735.483765] nouveau 0000:27:00.0: DRM: suspending console...
Jul 25 01:43:33 phs08 kernel: [ 735.483771] nouveau 0000:27:00.0: DRM: suspending display...
Jul 25 01:43:33 phs08 kernel: [ 735.505047] nouveau 0000:27:00.0: DRM: evicting buffers...
Jul 25 01:43:33 phs08 kernel: [ 735.580082] nouveau 0000:27:00.0: DRM: waiting for kernel channels to go idle...
Jul 25 01:43:33 phs08 kernel: [ 735.580105] nouveau 0000:27:00.0: DRM: suspending fence...
Jul 25 01:43:33 phs08 kernel: [ 735.581462] nouveau 0000:27:00.0: DRM: suspending object tree...
Jul 25 01:43:33 phs08 kernel: [ 737.956791] PM: suspend of devices complete after 2482.515 msecs
Jul 25 01:43:33 phs08 kernel: [ 737.957846] PM: late suspend of devices complete after 1.050 msecs
Jul 25 01:43:33 phs08 kernel: [ 737.958651] xhci_hcd 0000:28:00.3: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 737.959028] xhci_hcd 0000:22:00.0: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 737.959043] r8169 0000:25:00.0: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 737.959044] xhci_hcd 0000:03:00.0: System wakeup enabled by ACPI
Jul 25 01:43:33 phs08 kernel: [ 743.441059] PM: noirq suspend of devices complete after 5482.825 msecs
Jul 25 01:43:33 phs08 kernel: [ 743.441087] ACPI: Preparing to enter system sleep state S3
Jul 25 01:43:33 phs08 kernel: [ 743.761477] PM: Saving platform NVS memory
Jul 25 01:43:33 phs08 kernel: [ 743.761495] Disabling non-boot CPUs ...

Kai-Heng Feng (kaihengfeng) wrote :

Try the kernel here, it adds a delay before controller reset.

http://people.canonical.com/~khfeng/lp1705748-2/

---

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f15f4017fb95..1dc42006a8fd 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2308,6 +2308,8 @@ static const struct pci_device_id nvme_id_table[] = {
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
        { PCI_DEVICE(0x1c5f, 0x0540), /* Memblaze Pblaze4 adapter */
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
+       { PCI_DEVICE(0x144d, 0xa804), /* Samsung SM961/PM961 */
+               .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
        { PCI_DEVICE(0x144d, 0xa821), /* Samsung PM1725 */
                .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
        { PCI_DEVICE(0x144d, 0xa822), /* Samsung PM1725a */

Steve Roberts (drgrumpy) wrote :

Ok, thanks, will try and report back...

Steve Roberts (drgrumpy) wrote :

Unfortunately there seems to be no improvement: the system boots and operates fine, but the nvme drive disappears again after suspend. However, I didn't see any spontaneous disappearance of the device (so there could be some progress, but I will need to test for longer to be certain).

log extracts:

Jul 28 09:36:18 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.12.0-8-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

...booted and runs fine, then suspend...
Jul 28 11:28:57 phs08 systemd-sleep[7216]: Suspending system...
... and resume
Jul 28 12:24:36 phs08 kernel: [ 6773.441351] PM: resume of devices complete after 1369.759 msecs
Jul 28 12:24:36 phs08 kernel: [ 6773.441664] PM: Finishing wakeup.
Jul 28 12:24:36 phs08 systemd-sleep[7216]: System resumed.

Jul 28 12:27:40 phs08 kernel: [ 6957.315181] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Jul 28 12:27:40 phs08 kernel: [ 6957.339506] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Jul 28 12:27:40 phs08 kernel: [ 6957.339612] nvme nvme0: Removing after probe failure status: -19
Jul 28 12:27:40 phs08 kernel: [ 6957.375614] nvme0n1: detected capacity change from 500107862016 to 0

But one thing I note, is that the kernel records re-starting conventional spinning disks:

Jul 28 12:24:36 phs08 kernel: [ 6772.072165] sd 0:0:0:0: [sda] Starting disk
Jul 28 12:24:36 phs08 kernel: [ 6772.072166] sd 1:0:0:0: [sdb] Starting disk

but there seems to be no equivalent for the nvme drive (unless "nvme 0000:01:00.0: enabling device (0000 -> 0002)" is the equivalent).

Steve Roberts (drgrumpy) wrote :

I have also just noticed that the nvme drive seems to disappear 2 or 3 minutes after the system has re-awakened, so whilst it consistently seems to happen *after* resume from S3, it is not *on* resume.
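
The gap can be measured from the bracketed kernel timestamps in the log extracts of the previous comment. A minimal Python sketch, assuming the bracketed value is seconds since boot; the helper name `kernel_seconds` is hypothetical and the two lines are copied from the Jul 28 extract.

```python
import re

# Matches the kernel's bracketed uptime stamp, e.g. "[ 6773.441664]".
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def kernel_seconds(line):
    """Return the kernel timestamp (seconds since boot) from a log line."""
    match = STAMP.search(line)
    if match is None:
        raise ValueError("no kernel timestamp in line")
    return float(match.group(1))


resumed = "Jul 28 12:24:36 phs08 kernel: [ 6773.441664] PM: Finishing wakeup."
failed = ("Jul 28 12:27:40 phs08 kernel: [ 6957.315181] nvme nvme0: "
          "controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10")

delay = kernel_seconds(failed) - kernel_seconds(resumed)
print(f"{delay:.1f} s between wakeup and controller failure")
```

For that extract the difference is roughly 184 seconds, which agrees with the wall-clock stamps (12:24:36 to 12:27:40).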

Kai-Heng Feng (kaihengfeng) wrote :

Probably need to change the device to full functional state before S3, like how basic power management works.

Changed in linux (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → Kai-Heng Feng (kaihengfeng)
Kai-Heng Feng (kaihengfeng) wrote :
Kai-Heng Feng (kaihengfeng) wrote :

Roberts, does the new kernel I built still have the issue?

Steve Roberts (drgrumpy) wrote :

Sorry, I have downloaded but haven't got to testing it yet.... have had to prioritise the day job... will aim to try this evening and report back tomorrow...

Steve Roberts (drgrumpy) wrote :

So last night I installed the kernel and it seems the issue persists:

Aug 3 02:30:59 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.12.0-9-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

suspend:

Aug 3 02:52:35 phs08 systemd[1]: Reached target Sleep.
Aug 3 02:52:35 phs08 systemd[1]: Starting Suspend...
Aug 3 02:52:35 phs08 systemd-sleep[5407]: Failed to connect to non-global ctrl_ifname: (nil) error: No such file or directory
Aug 3 02:52:35 phs08 systemd-sleep[5408]: /lib/systemd/system-sleep/wpasupplicant failed with error code 255.
Aug 3 02:52:35 phs08 systemd-sleep[5407]: Suspending system...

resume next morning:

Aug 3 09:29:21 phs08 kernel: [ 1300.773097] PM: Syncing filesystems ... done.
Aug 3 09:29:21 phs08 kern

Aug 3 09:29:21 phs08 kernel: [ 1305.552367] PM: resume of devices complete after 1226.588 msecs
Aug 3 09:29:21 phs08 kernel: [ 1305.552777] PM: Finishing wakeup.
Aug 3 09:29:21 phs08 kernel: [ 1305.552778] OOM killer enabled.
Aug 3 09:29:21 phs08 kernel: [ 1305.552778] Restarting tasks ... done.
Aug 3 09:29:24 phs08 kernel: [ 1307.942704] r8169 0000:25:00.0 enp37s0: link up
Aug 3 09:29:27 phs08 kernel: [ 1311.521223] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 3 09:29:27 phs08 kernel: [ 1311.533275] ata1.00: configured for UDMA/133
Aug 3 09:29:27 phs08 kernel: [ 1311.581201] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 3 09:29:28 phs08 kernel: [ 1311.593308] ata2.00: configured for UDMA/133

Aug 3 09:32:45 phs08 kernel: [ 1508.852213] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Aug 3 09:32:45 phs08 kernel: [ 1508.896322] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Aug 3 09:32:45 phs08 kernel: [ 1508.896429] nvme nvme0: Removing after probe failure status: -19
Aug 3 09:32:45 phs08 kernel: [ 1508.920265] nvme0n1: detected capacity change from 500107862016 to 0
Aug 3 09:32:45 phs08 kernel: [ 1508.920536] blk_update_request: I/O error, dev nvme0n1, sector 360686080

"Probably need to change the device to full functional state before S3, like how basic power management works"
I can't see anything in the logs that relates to this or suggests it is happening, what should I look for ?

I notice this in case of relevance
Aug 3 09:29:21 phs08 kernel: [ 1303.986393] CPU: 0 PID: 5407 Comm: systemd-sleep Tainted: G OE 4.12.0-9-generic #10

Kai-Heng Feng (kaihengfeng) wrote :

Can you attach the full dmesg? I'd like to check some additional information.

Steve Roberts (drgrumpy) wrote :

I don't have a dmesg log file (due to systemd ?) but I attach syslog from start of boot with the kernel to when the reset button pressed...

Thanks for your attention to this Kai-Heng, really appreciated, Steve.

Kai-Heng Feng (kaihengfeng) wrote :

Please try this one:
http://people.canonical.com/~khfeng/lp1705748-3/

This kernel disables APST before controller shutdown. That alone does not look like enough (comment #31), so this kernel also explicitly sets power state 0 before controller shutdown.

Steve Roberts (drgrumpy) wrote :

Ok, will test and report back

Steve Roberts (drgrumpy) wrote :

Unfortunately there seems another issue with this kernel:
The screen/video is blank on resume, I think this is the offending part:

Aug 8 14:08:50 phs08 kernel: [ 165.478069] nouveau 0000:27:00.0: DRM: resuming fence...
Aug 8 14:08:50 phs08 kernel: [ 165.478082] nouveau 0000:27:00.0: DRM: resuming display...
Aug 8 14:08:50 phs08 kernel: [ 165.501235] nouveau 0000:27:00.0: DRM: resuming console...
Aug 8 14:08:50 phs08 kernel: [ 165.524428] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Aug 8 14:08:50 phs08 kernel: [ 165.541138] ata6.00: configured for UDMA/133
Aug 8 14:08:50 phs08 kernel: [ 165.636931] usb 1-3: reset high-speed USB device number 2 using xhci_hcd
Aug 8 14:08:50 phs08 kernel: [ 166.032975] usb 1-8: reset high-speed USB device number 3 using xhci_hcd
Aug 8 14:08:50 phs08 kernel: [ 168.723550] r8169 0000:25:00.0 enp37s0: link up
Aug 8 14:08:50 phs08 kernel: [ 169.227911] PM: resume of devices complete after 4181.953 msecs
Aug 8 14:08:50 phs08 kernel: [ 169.228289] PM: Finishing wakeup.
Aug 8 14:08:50 phs08 kernel: [ 169.228290] OOM killer enabled.
Aug 8 14:08:50 phs08 kernel: [ 169.228290] Restarting tasks ... done.
Aug 8 14:08:50 phs08 kernel: [ 169.236474] usb 1-3.1: USB disconnect, device number 4
Aug 8 14:08:50 phs08 kernel: [ 169.655196] nouveau 0000:27:00.0: DVI-D-1: EDID is invalid:
Aug 8 14:08:50 phs08 kernel: [ 169.655198] [00] BAD 00 ff ff ff ff ff ff 00 4c 2d 40 02 39 31 42 4d
Aug 8 14:08:50 phs08 kernel: [ 169.655198] [00] BAD 09 12 01 03 00 26 1e 78 2a ee 95 a3 54 4c 99 26
Aug 8 14:08:50 phs08 kernel: [ 169.655199] [00] BAD 0f 50 54 bf ef 80 81 80 81 40 71 4f 01 01 01 01
Aug 8 14:08:50 phs08 kernel: [ 169.655199] [00] BAD 01 01 01 01 01 01 30 2a 00 98 51 00 2a 40 30 70
Aug 8 14:08:50 phs08 kernel: [ 169.655199] [00] BAD 13 00 78 2d 11 00 00 1e 00 00 00 fd 00 38 4b 1e
Aug 8 14:08:50 phs08 kernel: [ 169.655200] [00] BAD 51 0e 00 0a 20 20 20 20 20 20 00 00 00 fc 00 53
Aug 8 14:08:50 phs08 kernel: [ 169.655200] [00] BAD 79 6e 63 4d 61 73 74 65 72 0a 20 20 00 00 00 ff
Aug 8 14:08:50 phs08 kernel: [ 169.655200] [00] BAD 00 48 56 4e 51 32 30 30 37 37 35 0a 20 20 00 40
Aug 8 14:08:50 phs08 kernel: [ 169.655202] nouveau 0000:27:00.0: DRM: DDC responded, but no EDID for DVI-D-1
Aug 8 14:08:51 phs08 kernel: [ 169.798065] nouveau 0000:27:00.0: HDMI-A-1: EDID is invalid:
Aug 8 14:08:51 phs08 kernel: [ 169.798066] [00] BAD 00 ff ff ff ff ff ff 00 4c 2d 40 02 39 31 42 4d
Aug 8 14:08:51 phs08 kernel: [ 169.798067] [00] BAD 09 12 01 03 00 26 1e 78 2a ee 95 a3 54 4c 99 26
Aug 8 14:08:51 phs08 kernel: [ 169.798067] [00] BAD 0f 50 54 bf ef 80 81 80 81 40 71 4f 01 01 01 01
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 01 01 01 01 01 01 30 2a 00 98 51 00 2a 40 30 70
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 13 00 78 2d 11 00 00 1e 00 00 00 fd 00 38 4b 1e
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 51 0e 00 0a 20 20 20 20 20 20 00 00 00 fc 00 53
Aug 8 14:08:51 phs08 kernel: [ 169.798068] [00] BAD 79 6e 63 4d 61 73 74 65 72 0a 20 20 00 00 00 ff
Aug 8 14:08:51 phs08 kernel: [ 169.798069] [00] BAD 00 48 56 4e 51 32 30 ...

Kai-Heng Feng (kaihengfeng) wrote :

Yea, I think it's normal, because the kernel version is the same. The old one with the same version will be removed before installing the new one.

I have no idea about the nouveau part though. I'll build a new one based on 4.11 to focus on the NVMe bug.

Kai-Heng Feng (kaihengfeng) wrote :
Steve Roberts (drgrumpy) wrote :

Sorry for slow feedback....

Unfortunately the issue is still there:

Aug 15 12:52:58 phs08 kernel: [ 0.000000] Linux version 4.11.0-14-generic (root@ryzen) (gcc version 6.3.0 20170618 (Ubuntu 6.3.0-19ubuntu1) ) #20 SMP Wed Aug 9 20:56:51 CST 2017 (Ubuntu 4.11.0-14.20-generic 4.11.12)
Aug 15 12:52:58 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.11.0-14-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Tue Aug 15 13:31:03 BST 2017: performing suspend
Tue Aug 15 14:11:07 BST 2017: Awake.

Aug 15 14:11:07 phs08 kernel: [ 2297.422103] nvme nvme0: disabling APST...
Aug 15 14:11:07 phs08 kernel: [ 2297.423143] nvme nvme0: setting power state to 0...
...
Aug 15 14:11:07 phs08 kernel: [ 2302.067538] PM: resume of devices complete after 1158.582 msecs
Aug 15 14:11:07 phs08 kernel: [ 2302.067889] PM: Finishing wakeup.
Aug 15 14:11:07 phs08 kernel: [ 2302.067890] Restarting tasks ... done.
Aug 15 14:11:09 phs08 kernel: [ 2304.530720] r8169 0000:25:00.0 enp37s0: link up
Aug 15 14:11:13 phs08 kernel: [ 2307.860622] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 15 14:11:13 phs08 kernel: [ 2307.864647] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 15 14:11:13 phs08 kernel: [ 2307.872508] ata1.00: configured for UDMA/133
Aug 15 14:11:13 phs08 kernel: [ 2307.876591] ata2.00: configured for UDMA/133
Aug 15 14:14:02 phs08 kernel: [ 2477.191603] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Aug 15 14:14:02 phs08 kernel: [ 2477.227261] pci_raw_set_power_state: 4 callbacks suppressed
Aug 15 14:14:02 phs08 kernel: [ 2477.227266] nvme 0000:01:00.0: Refused to change power state, currently in D3
Aug 15 14:14:02 phs08 kernel: [ 2477.227394] nvme nvme0: Removing after probe failure status: -19
Aug 15 14:14:02 phs08 kernel: [ 2477.255682] nvme0n1: detected capacity change from 500107862016 to 0

It still seems odd to me that this apparently occurs 2-3 mins after the system has woken up...

One thing that confuses me is that the suspend operations like:

"nvme nvme0: disabling APST..."

are recorded as happening after the system has woken up! Is this a matter of them not being logged until the system has woken up, or of them not actually executing until then? Either way, it is confusing to see suspend operations occurring after the system has resumed.

Kai-Heng Feng (kaihengfeng) wrote :

I guess the NVMe probably needs to sleep for the exlat time (6000 here) to let it transition back to power state 0.

Please try http://people.canonical.com/~khfeng/lp1705748-sleep/

Kai-Heng Feng (kaihengfeng) wrote :

If this kernel still doesn't work, try the latest mainline kernel with "nvme_core.default_ps_max_latency_us=1500", to disable the deepest power state.
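
To see why a limit of 1500 would rule out only the deepest state, the Python sketch below filters the power states listed in the `nvme id-ctrl` output from the bug description. It assumes the driver skips any non-operational state whose exit latency (exlat) exceeds `default_ps_max_latency_us`; that rule and the names `POWER_STATES` and `apst_candidates` are assumptions for illustration, not a copy of the driver logic.

```python
# (state, operational?, enlat_us, exlat_us) from the id-ctrl output
# in the bug description.
POWER_STATES = [
    (0, True, 0, 0),
    (1, True, 0, 0),
    (2, True, 0, 0),
    (3, False, 210, 1500),
    (4, False, 2200, 6000),
]


def apst_candidates(max_latency_us):
    """Non-operational states still usable as APST targets under the limit."""
    return [
        state
        for state, operational, _enlat, exlat in POWER_STATES
        if not operational and exlat <= max_latency_us
    ]


for limit in (0, 1500, 100000):
    print(limit, apst_candidates(limit))
```

Under that assumption a limit of 0 leaves no candidate states, 1500 leaves only ps 3, and a large limit leaves both ps 3 and ps 4.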

Steve Roberts (drgrumpy) wrote :

Unfortunately still no joy

Aug 21 18:02:54 phs08 kernel: [ 0.000000] Linux version 4.13.0-6-generic (root@linux) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)) #7~lp1705748 SMP Fri Aug 18 17:09:21 CST 2017 (Ubuntu 4.13.0-6.7~lp1705748-generic 4.13.0-rc5)
Aug 21 18:02:54 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.13.0-6-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro
...
Mon Aug 21 19:55:03 BST 2017: performing suspend
Tue Aug 22 02:17:23 BST 2017: Awake.

Aug 22 02:21:36 phs08 kernel: [ 6999.857721] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Aug 22 02:21:36 phs08 kernel: [ 6999.901881] print_req_error: I/O error, dev nvme0n1, sector 137284824
Aug 22 02:21:36 phs08 kernel: [ 6999.901892] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Aug 22 02:21:36 phs08 kernel: [ 6999.901933] print_req_error: I/O error, dev nvme0n1, sector 137286408
Aug 22 02:21:36 phs08 kernel: [ 6999.901937] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Aug 22 02:21:36 phs08 kernel: [ 6999.901970] print_req_error: I/O error, dev nvme0n1, sector 137296616
Aug 22 02:21:36 phs08 kernel: [ 6999.901974] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Aug 22 02:21:36 phs08 kernel: [ 6999.902007] print_req_error: I/O error, dev nvme0n1, sector 137284312
Aug 22 02:21:36 phs08 kernel: [ 6999.902010] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Aug 22 02:21:36 phs08 kernel: [ 6999.945333] pci_raw_set_power_state: 5 callbacks suppressed
Aug 22 02:21:36 phs08 kernel: [ 6999.945338] nvme 0000:01:00.0: Refused to change power state, currently in D3
Aug 22 02:21:36 phs08 kernel: [ 6999.945503] nvme nvme0: Removing after probe failure status: -19

Steve Roberts (drgrumpy) wrote :

Appreciate all your efforts Kai-Heng

Where do I get the latest mainline kernel ?

Thanks, Steve

Kai-Heng Feng (kaihengfeng) wrote :

The one I built should be recent enough.

You can still get one from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc6/

Steve Roberts (drgrumpy) wrote :

Using http://people.canonical.com/~khfeng/lp1705748-sleep/

with "nvme_core.default_ps_max_latency_us=1500", to disable deepest power state

Been through several suspend/resume cycles so far and seems to be ok.

Kai-Heng Feng (kaihengfeng) wrote :

I built a kernel with the quirk. Please try it without the nvme_core kernel parameter.

http://people.canonical.com/~khfeng/pm961/

Steve Roberts (drgrumpy) wrote :

I haven't yet tried with that latest kernel... but this morning, after 5 days of successful suspend/resume:

Aug 30 09:35:19 phs08 kernel: [71904.271956] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Aug 30 09:35:19 phs08 kernel: [71904.320269] print_req_error: I/O error, dev nvme0n1, sector 136098096
Aug 30 09:35:19 phs08 kernel: [71904.320280] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Aug 30 09:35:19 phs08 kernel: [71904.339992] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Aug 30 09:35:19 phs08 kernel: [71904.340163] nvme nvme0: Removing after probe failure status: -19
Aug 30 09:35:19 phs08 kernel: [71904.371974] nvme0n1: detected capacity change from 500107862016 to 0

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

But no more "nvme 0000:01:00.0: Refused to change power state, currently in D3"?

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Yes, correct I just searched the kernel log and no more
"...Refused to change..."

since I booted with:

Command line: BOOT_IMAGE=/vmlinuz-4.13.0-6-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro nvme_core.default_ps_max_latency_us=1500

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Originally the PCI status could not even be read:
Aug 15 14:14:02 phs08 kernel: [ 2477.191603] nvme 0000:01:00.0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff

Now only the controller is dead:
Aug 30 09:35:19 phs08 kernel: [71904.271956] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10

So I guess other than NVME_QUIRK_NO_DEEPEST_PS, quirk NVME_QUIRK_DELAY_BEFORE_CHK_RDY is also needed.
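
The difference between the two log lines can be spelled out. An all-ones value is what a read returns when the device does not answer at all, so `PCI_STATUS=0xffff` suggests the drive had dropped off the bus. By contrast `0x10` is a plausible status word (bit 4 is the capabilities-list flag), and only the controller register `CSTS` reads back as all ones. A rough Python sketch of that reasoning, with the function name and labels made up for illustration:

```python
def classify(csts, pci_status):
    """Guess how far a failed NVMe device is still reachable (illustrative only)."""
    if pci_status == 0xFFFF:
        # Config-space reads return all ones: the device is not answering at all.
        return "device unreachable on the PCI bus"
    if csts == 0xFFFFFFFF:
        # Config space answers, but the memory-mapped controller registers do not.
        return "PCI config space readable, controller registers unreadable"
    return "controller registers readable"


print(classify(0xFFFFFFFF, 0xFFFF))  # values from the Aug 15 log line
print(classify(0xFFFFFFFF, 0x10))    # values from the Aug 30 log line
```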

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :
Revision history for this message
Steve Roberts (drgrumpy) wrote :

Tried that latest version, same result:

Sep 1 16:56:16 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.13.0-8-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

(Note without nvme_core.default_ps_max_latency_us=1500. hope this was correct)

...
Sep 1 17:01:29 phs08 kernel: [ 318.046495] PM: Suspending system (mem)
...
Resume system:

...
Sep 1 18:07:42 phs08 kernel: [ 688.834876] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Sep 1 18:07:42 phs08 kernel: [ 688.875137] print_req_error: I/O error, dev nvme0n1, sector 134468320
Sep 1 18:07:42 phs08 kernel: [ 688.875148] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Sep 1 18:07:42 phs08 kernel: [ 688.902878] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Sep 1 18:07:42 phs08 kernel: [ 688.903048] nvme nvme0: Removing after probe failure status: -19
Sep 1 18:07:42 phs08 kernel: [ 688.930880] nvme0n1: detected capacity change from 500107862016 to 0

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Also same result with nvme_core.default_ps_max_latency_us=1500

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Just to let you know, I am still working on this issue.
Currently digging through the spec; hopefully I can find some angle for new workarounds.

Completely disabling APST should be a last resort.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

This kernel [1] resets NVMe controller before shutdown.

[1] http://people.canonical.com/~khfeng/pm961-reset/

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Tried that latest one... still fails after suspend but should I be using nvme_core.default_ps_max_latency_us=1500 ?

But as I have said before it does bother me that the suspend and resume operations seem to overlap and all are logged on resume.

Booted in the morning:

Sep 7 10:02:44 phs08 kernel: [ 0.000000] Linux version 4.12.0-14-generic (root@Linux) (gcc version 7.1.0 (Ubuntu 7.1.0-13ubuntu1) ) #15~pm961+reset SMP Wed Sep 6 13:43:27 CST 2017 (Ubuntu 4.12.0-14.15~pm961+reset-generic 4.12.10)
Sep 7 10:02:44 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.12.0-14-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

system running all day...

from pm-suspend.log:
Thu Sep 7 19:34:17 BST 2017: performing suspend
Thu Sep 7 20:26:41 BST 2017: Awake.

from kernel.log
(nothing about any suspend operations prior to wakeup time, i.e. all kernel operations relating to suspend are logged on wakeup !)

Sep 7 20:26:41 phs08 kernel: [34307.434357] PM: Syncing filesystems ... done.
Sep 7 20:26:41 phs08 kernel: [34307.498356] PM: Preparing system for sleep (mem)
...
Sep 7 20:26:41 phs08 kernel: [34311.224400] PM: resume of devices complete after 698.454 msecs
Sep 7 20:26:41 phs08 kernel: [34311.224785] PM: Finishing wakeup
...
At 20:28 I successfully saved a spreadsheet file to the nvme disk, that was open before the system was suspended (the file is timestamped with 20:28)

then....
Sep 7 20:29:56 phs08 kernel: [34506.017034] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Sep 7 20:29:56 phs08 kernel: [34506.052638] pci_raw_set_power_state: 5 callbacks suppressed
Sep 7 20:29:56 phs08 kernel: [34506.052642] nvme 0000:01:00.0: Refused to change power state, currently in D3
Sep 7 20:29:56 phs08 kernel: [34506.052779] nvme nvme0: Removing after probe failure status: -19
etc.

So the nvme disk apparently does not become inaccessible until 3 min after the system has woken up !
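
The gap can be read straight off the kernel timestamps (seconds since boot, in square brackets). Below is a small Python sketch that extracts them from two of the lines quoted above and subtracts; the regular expression is just one way to pull the number out.

```python
import re

TIMESTAMP = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def kernel_seconds(line):
    """Return the seconds-since-boot timestamp from a syslog kernel line."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no kernel timestamp in line")
    return float(match.group(1))


wakeup = "Sep 7 20:26:41 phs08 kernel: [34311.224785] PM: Finishing wakeup"
failure = ("Sep 7 20:29:56 phs08 kernel: [34506.017034] nvme nvme0: "
           "controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff")

gap = kernel_seconds(failure) - kernel_seconds(wakeup)
print(f"{gap:.1f} s after wakeup")
```

That comes to roughly 195 seconds, a little over three minutes.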

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

So is "nvme disk apparently does not become inaccessible until 3 min after the system has woken up" a new behavior?

Revision history for this message
Steve Roberts (drgrumpy) wrote :

No, this seems to have been happening all along, if you look at the timings in earlier posts...see post #25

...but what throws me and makes it more obscure is that the suspend operations are logged on resume, and until yesterday I had never managed to check that the drive was actually alive. I had assumed it wasn't, and that the apparent delay was just until it was first accessed (var, which stores the logs, is a mountpoint on one of the spinning SATA disks).

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

That's because disks need to be stopped before the chipset/CPU goes into suspend. So the internal messages are flushed *after* the disk wakes up: the disk is not available at the end of the suspend process.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :
Changed in linux (Ubuntu):
status: In Progress → Triaged
Revision history for this message
Steve Roberts (drgrumpy) wrote :

Ok thanks and for the explanation - makes sense.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Hmm, there's no response from Samsung.

Do you think we should completely turn off APST for Asus Prime B350m-A + Samsung 960 EVO?

Revision history for this message
- (bplaa.yai) wrote :

FWIW, I'm also affected by this issue, but on a completely different system (Razer Blade Stealth, Samsung PM951, Fedora 26 - kernel 4.13.4), so I don't think specifically targeting Asus Prime B350m-A + Samsung 960 EVO would do it.

Like the OP, I was having frequent random SSD disconnects (the issue appeared in late August), and thought I was affected by bug 1678184.

Using nvme_core.default_ps_max_latency_us=6000 allows the system to be stable again as long as it's awake, but still I sometimes have the issue when the system is awoken from sleep mode.

Please let me know if providing additional info would help.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

bplaa.yai,

Can you file a new bug?
SM/PM951 is an OEM version, which means it's already inside the laptop when you unbox it.
That means we can make a specific workaround for the laptop/NVMe combination.

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Re: turn off APST completely

Tricky...

So I have been running with:
BOOT_IMAGE=/vmlinuz-4.13.0-8-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro nvme_core.default_ps_max_latency_us=0

for the last month or so with no issues....

I suppose APST is about reducing power consumption, but I am not clear how much it saves.... in my (desktop) case the biggest reduction in power consumption likely comes from being able to suspend the system as a whole, so being able to suspend/resume likely outweighs the benefit of apst, on the other hand for someone with an always on system, or someone that prefers to hibernate or shutdown, having apst enabled could be beneficial...

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Steve,

I'll write a patch so that the system does not enable APST.

FWIW, my desktop uses Asus Prime B350m-A + Ryzen 7 1700, but with Intel P600. Currently I don't have any issue when APST is enabled.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Steve, please try [1] without any NVMe parameters.

[1] http://people.canonical.com/~khfeng/lp1705748-evo+ryzen/

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Kai-Heng,

In response to #66, I guess that definitely points to a firmware issue with Samsung drive

#67 I will try and let you know...

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Kai-Heng,

In reply to #66: I guess this is a definite indication that it is a firmware issue.

#67: there are only two files to download, on all previous occasions there have been four, is that correct ?

Steve

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

This time I built the kernel on top of mainline kernel, so there are just two files.

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Ah ok, so I will need to install 4.14 first ?

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

No, these two files are sufficient.

Revision history for this message
Steve Roberts (drgrumpy) wrote :

I installed the files,
removed the nvme_core.default_ps_max_latency_us=0,
did an update-grub and rebooted...
failed to boot into graphical interface and then hung... reset needed

Command line: BOOT_IMAGE=/vmlinuz-4.14.0-rc4-evo+ryzen root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Suspect this is the issue:

Oct 16 10:40:17 phs08 systemd[1]: Started Light Display Manager.
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Main process exited, code=exited, status=1/FAILURE
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Unit entered failed state.
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Failed with result 'exit-code'.
Oct 16 10:40:18 phs08 systemd[1]: lightdm.service: Service hold-off time over, scheduling restart.

Oct 16 10:40:18 phs08 gpu-manager[2291]: /etc/modprobe.d is not a file
Oct 16 10:40:18 phs08 gpu-manager[2291]: message repeated 4 times: [ /etc/modprobe.d is not a file]
Oct 16 10:40:18 phs08 gpu-manager[2291]: Error: can't open /lib/modules/4.14.0-rc4-evo+ryzen/updates/dkms

Oct 16 10:40:21 phs08 systemd[1]: gpu-manager.service: Start request repeated too quickly.
Oct 16 10:40:21 phs08 systemd[1]: Failed to start Detect the available GPUs and deal with any system changes.
Oct 16 10:40:21 phs08 systemd[1]: lightdm.service: Start request repeated too quickly.
Oct 16 10:40:21 phs08 systemd[1]: Failed to start Light Display Manager.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

The fault is mine. Didn't take out-of-tree proprietary modules into account. Please try artful kernel here:

http://people.canonical.com/~khfeng/lp1705748-artful/

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Sorry for slow response:

Command line: BOOT_IMAGE=/vmlinuz-4.13.0-17-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Resulted in spontaneous disappearance (no suspend/resume needed):

Oct 26 17:20:44 phs08 kernel: [ 3011.057465] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Oct 26 17:20:44 phs08 kernel: [ 3011.089763] print_req_error: I/O error, dev nvme0n1, sector 131608072
Oct 26 17:20:44 phs08 kernel: [ 3011.089770] print_req_error: I/O error, dev nvme0n1, sector 202463672

Will try again to see if reproducible...

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Oct 26 17:23:41 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.13.0-17-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Suspend...
Oct 26 19:00:46 phs08 systemd-sleep[5347]: Suspending system...

Resume....

Oct 26 19:57:11 phs08 kernel: [ 5841.689280] PM: resume of devices complete after 644.551 msecs
Oct 26 19:57:11 phs08 kernel: [ 5841.689579] PM: Finishing wakeup.
Oct 26 19:57:11 phs08 kernel: [ 5841.689580] OOM killer enabled.
...

Oct 26 20:00:29 phs08 kernel: [ 6039.045439] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Oct 26 20:00:29 phs08 kernel: [ 6039.089704] print_req_error: I/O error, dev nvme0n1, sector 12208016

Again it seems to be ~3mins after wakeup that the disk becomes inaccessible...

I noticed this, probably unrelated:

Oct 26 21:42:45 phs08 kernel: [ 0.100002] ACPI Error: Needed [Integer/String/Buffer], found [Region] ffff991f1e97f5a0 (20170531/exresop-424)
Oct 26 21:42:45 phs08 kernel: [ 0.100002] ACPI Exception: AE_AML_OPERAND_TYPE, Could not execute arguments for [IOB2] (Region) (20170531/nsinit-412)
Oc

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you attach the output of `nvme get-feature -f 0x0c -H /dev/nvme0`?

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Output from get-feature:

get-feature:0x0c (Autonomous Power State Transition), Current value: 0x000001
 Autonomous Power State Transition Enable (APSTE): Enabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 86 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 86 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 2]
 .................
 Idle Time Prior to Transition (ITPT): 86 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 410 ms
 Idle Transition Power State (ITPS): 4
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 5]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 6]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 7]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 8]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 9]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[10]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[11]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[12]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[13]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[14]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[15]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[16]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[17]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[18]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[19]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[20]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[21]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 ...
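
The idle times in this table look consistent with how I understand the Linux driver to fill it. Walking from the deepest state upwards, each non-operational state becomes the idle target for all shallower states. The idle time is set to about 50 times the state's combined entry and exit latency (microseconds divided by 20, rounded up, gives milliseconds). Here is a Python sketch under those assumptions; the latency figures are not shown in this output and are assumed values for a 960 EVO.

```python
# Assumed per-state data: (non_operational, entry_lat_us, exit_lat_us), index = state.
ASSUMED_PSD = [
    (False, 0, 0),
    (False, 0, 0),
    (False, 0, 0),
    (True, 210, 1500),
    (True, 2200, 6000),
]


def build_apst_table(psd):
    """Return {state: (idle_target_state, idle_ms)} under the rule sketched above."""
    table = {}
    target = (0, 0)  # no transition configured yet
    for state in range(len(psd) - 1, -1, -1):
        table[state] = target
        non_op, entry_us, exit_us = psd[state]
        if non_op:
            total_us = entry_us + exit_us
            idle_ms = (total_us + 19) // 20  # round up, about 50x the latency
            target = (state, idle_ms)
    return table


for state, (itps, itpt) in sorted(build_apst_table(ASSUMED_PSD).items()):
    print(f"Entry[{state}] ITPT {itpt} ms, ITPS {itps}")
```

With the assumed latencies this gives 86 ms to state 3 for entries 0 to 2 and 410 ms to state 4 for entry 3, which matches the values reported above.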


Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

My apologies, I typed the wrong PCI ID for the Samsung device.

Please try this one instead,
http://people.canonical.com/~khfeng/lp1705748-again/
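
For context on why a wrong PCI ID matters: a quirk like this is keyed on the drive's PCI vendor and device ID, and here also on the motherboard, so a mistyped ID simply never matches and APST stays on. A loose Python sketch of that kind of matching follows. The device ID, the board strings and the flag name are illustrative assumptions, not copied from the patch; only the vendor ID 0x144d appears in this report.

```python
NO_APST = "no-apst"  # hypothetical flag standing in for a driver quirk bit

# Hypothetical match rule: (vendor_id, device_id, board_vendor, board_name).
ASSUMED_RULE = (0x144D, 0xA804, "ASUSTeK COMPUTER INC.", "PRIME B350M-A")


def quirks_for(vendor_id, device_id, board_vendor, board_name, rule=ASSUMED_RULE):
    """Return the set of quirks that apply to this drive and board combination."""
    if (vendor_id, device_id, board_vendor, board_name) == rule:
        return {NO_APST}
    return set()


# Same drive and board, but with one digit of the device ID mistyped: no quirk applies.
print(quirks_for(0x144D, 0xA804, "ASUSTeK COMPUTER INC.", "PRIME B350M-A"))
print(quirks_for(0x144D, 0xA802, "ASUSTeK COMPUTER INC.", "PRIME B350M-A"))
```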

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Installed that latest version, can confirm that apst is disabled without the max_latency setting kernel parameter ....

I have also found a new BIOS for the m/b: 0902

So I will also try installing that and reverting to one of the kernels with apst turned on, to see if it makes any difference...

or is it possible to turn the apst back on with a kernel parameter ?

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Yes. Use "nvme_core.force_apst=1"

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Steve, I'll send the patch if you have no concerns.

Thanks for all the testing.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :
Revision history for this message
Steve Roberts (drgrumpy) wrote :

Sorry for slow response.
Yes all seems to be fine, I have had 9 days of uptime with numerous suspend and resumes without issue.

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Will this make it into the mainstream kernel updates or how do I track when it is ?
I just installed 4.13.0-19 and the issue is still there.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote : Re: [Bug 1705748] Re: Samsung SSD 960 EVO 500GB refused to change power state

> On 5 Jan 2018, at 7:28 PM, Steve Roberts <email address hidden> wrote:
>
> Will this make it into the mainstream kernel updates or how do I track when it is ?
> I just installed 4.13.0-19 and the issue is still there.

I’ll backport it to Artful kernel. Thanks for the notice.


Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :
description: updated
Seth Forshee (sforshee)
Changed in linux (Ubuntu Artful):
assignee: nobody → Kai-Heng Feng (kaihengfeng)
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu):
status: Triaged → Fix Committed
Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Steve,

I guess LP: #1746340 is the same as this one.
Can you try this kernel with "nvme_core.force_apst=1"? I want to know if a PCI reset for NVMe after resume can solve the issue.

people.canonical.com/~khfeng/lp1746340-2/

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Yes, but I am a bit confused by that other thread. Can you direct me to the download of the specific kernel you want me to try?

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :
Revision history for this message
Steve Roberts (drgrumpy) wrote :

Unable to install the file:
linux-headers-4.13.0-34-generic_4.13.0-34.37~lp1746340_amd64.deb

due to missing dependency libssl1.1

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you manually install libssl1.1? Or skip installing the header files, if you don't use any DKMS modules.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Please try [1].

The previous one was built on Bionic, so it had a higher libssl requirement. I built the new one on Xenial, so it should have the correct version number.

[1] people.canonical.com/~khfeng/lp1746340-pcireset/

Revision history for this message
Steve Roberts (drgrumpy) wrote :

Sorry for being so slow, and not sure I have run the correct tests:

Command line: BOOT_IMAGE=/vmlinuz-4.13.0-34-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Suspend and resume:
Feb 24 11:11:14 phs08 kernel: [ 7608.732297] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Feb 24 11:11:14 phs08 kernel: [ 7608.784945] print_req_error: I/O error, dev nvme0n1, sector 136096888
Feb 24 11:11:14 phs08 kernel: [ 7608.784957] BTRFS error (device nvme0n1p5): bdev /dev/nvme0n1p5 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Feb 24 11:11:14 phs08 kernel: [ 7608.944330] nvme 0000:01:00.0: RESET SUCCEEDED
Feb 24 11:11:14 phs08 kernel: [ 7608.944337] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Feb 24 11:11:14 phs08 kernel: [ 7608.944496] nvme nvme0: Removing after probe failure status: -19
Feb 24 11:11:14 phs08 kernel: [ 7608.976318] nvme0n1: detected capacity change from 500107862016 to 0
Feb 24 11:11:14 phs08 kernel: [ 7608.976521] print_req_error: I/O error, dev nvme0n1, sector 376046984

Note the force_apst parameter was not used above, tried below with the more recent kernel...

Command line: BOOT_IMAGE=/vmlinuz-4.15.0-9-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro nvme_core.force_apst=1

didn't get to being able to suspend, system hangs requiring reset, :
Feb 24 11:24:10 phs08 kernel: [ 567.842115] INVALID_DEVICE_REQUEST device=00:00.0 address=0xfffffffdf8000000 flags=0x0a00]

will try again with force_apst=1 shortly

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Hmm, I'll build a new one based on Bionic's kernel.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :
Revision history for this message
Steve Roberts (drgrumpy) wrote :

In case it helps, further test with 4.13.0-34

Feb 24 11:48:40 phs08 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.13.0-34-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro nvme_core.force_apst=1

... boots fine, appears ok then suspend overnight...

Feb 25 11:54:43 phs08 kernel: [ 775.372073] sd 0:0:0:0: [sda] Starting disk
Feb 25 11:54:43 phs08 kernel: [ 775.372077] sd 1:0:0:0: [sdb] Starting disk
Feb 25 11:54:43 phs08 kernel: [ 775.372156] serial 00:05: activated
Feb 25 11:54:43 phs08 kernel: [ 775.555973] r8169 0000:25:00.0 enp37s0: link down
Feb 25 11:54:43 phs08 kernel: [ 775.658010] nvme 0000:01:00.0: RESET SUCCEEDED

seems okay for 5 mins... then...

Feb 25 11:59:39 phs08 kernel: [ 1072.862588] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Feb 25 11:59:39 phs08 kernel: [ 1072.906660] print_req_error: I/O error, dev nvme0n1, sector 30511480
Feb 25 11:59:40 phs08 kernel: [ 1073.054471] nvme 0000:01:00.0: RESET SUCCEEDED
Feb 25 11:59:40 phs08 kernel: [ 1073.054478] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Feb 25 11:59:40 phs08 kernel: [ 1073.054650] nvme nvme0: Removing after probe failure status: -19
Feb 25 11:59:40 phs08 kernel: [ 1073.078418] nvme0n1: detected capacity change from 500107862016 to 0

... usual i/o errors and hard reset needed.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

The code is "Fix Committed" in the Artful kernel tree, but it's not yet "Fix Released" as a binary package. So the 4.13.0-34 package doesn't contain the fix yet.

Changed in linux (Ubuntu Xenial):
status: New → Fix Committed
Revision history for this message
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
tags: added: verification-needed-artful
Revision history for this message
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Revision history for this message
Steve Roberts (drgrumpy) wrote :

I am not sure if I have done the correct thing, but following the instructions above I have managed to install xenial-proposed kernel 4.13.0-38 on my Mint 18.2 XFCE system:

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.13.0-38-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

I can confirm that APST is disabled by default, and no apparent issues after several cycles of suspend and resume.
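
One way to check this is to look at the `Current value` field of `nvme get-feature -f 0x0c`: bit 0 is the APST enable flag (APSTE). Here is a small Python sketch that reads it from the header line in the format shown earlier in this thread. Other nvme-cli versions may print the field differently, so treat the pattern as an assumption.

```python
import re


def apst_enabled(get_feature_output):
    """Return True if the APSTE bit (bit 0) is set in the 'Current value' field."""
    match = re.search(r"Current value:\s*(0x[0-9a-fA-F]+)", get_feature_output)
    if match is None:
        raise ValueError("no 'Current value' field found")
    return bool(int(match.group(1), 16) & 0x1)


sample = "get-feature:0x0c (Autonomous Power State Transition), Current value: 0x000001"
print(apst_enabled(sample))
```

For the sample line, taken from the earlier output when APST was still on, this prints True. A value with bit 0 clear would give False.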

I can't see any option to add or change tags.

tags: added: verification-done-artful verification-done-xenial
removed: verification-needed-artful verification-needed-xenial
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.13.0-38.43

---------------
linux (4.13.0-38.43) artful; urgency=medium

  * linux: 4.13.0-38.43 -proposed tracker (LP: #1755762)

  * Servers going OOM after updating kernel from 4.10 to 4.13 (LP: #1748408)
    - i40e: Fix memory leak related filter programming status
    - i40e: Add programming descriptors to cleaned_count

  * [SRU] Lenovo E41 Mic mute hotkey is not responding (LP: #1753347)
    - platform/x86: ideapad-laptop: Increase timeout to wait for EC answer

  * fails to dump with latest kpti fixes (LP: #1750021)
    - kdump: write correct address of mem_section into vmcoreinfo

  * headset mic can't be detected on two Dell machines (LP: #1748807)
    - ALSA: hda/realtek - Support headset mode for ALC215/ALC285/ALC289
    - ALSA: hda - Fix headset mic detection problem for two Dell machines
    - ALSA: hda - Fix a wrong FIXUP for alc289 on Dell machines

  * CIFS SMB2/SMB3 does not work for domain based DFS (LP: #1747572)
    - CIFS: make IPC a regular tcon
    - CIFS: use tcon_ipc instead of use_ipc parameter of SMB2_ioctl
    - CIFS: dump IPC tcon in debug proc file

  * i2c-thunderx: erroneous error message "unhandled state: 0" (LP: #1754076)
    - i2c: octeon: Prevent error message on bus error

  * hisi_sas: Add disk LED support (LP: #1752695)
    - scsi: hisi_sas: directly attached disk LED feature for v2 hw

  * EDAC, sb_edac: Backport 1 patch to Ubuntu 17.10 (Fix missing DIMM sysfs
    entries with KNL SNC2/SNC4 mode) (LP: #1743856)
    - EDAC, sb_edac: Fix missing DIMM sysfs entries with KNL SNC2/SNC4 mode

  * [regression] Colour banding and artefacts appear system-wide on an Asus
    Zenbook UX303LA with Intel HD 4400 graphics (LP: #1749420)
    - drm/edid: Add 6 bpc quirk for CPT panel in Asus UX303LA

  * DVB Card with SAA7146 chipset not working (LP: #1742316)
    - vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems

  * [Asus UX360UA] battery status in unity-panel is not changing when battery is
    being charged (LP: #1661876) // AC adapter status not detected on Asus
    ZenBook UX410UAK (LP: #1745032)
    - ACPI / battery: Add quirk for Asus UX360UA and UX410UAK

  * ASUS UX305LA - Battery state not detected correctly (LP: #1482390)
    - ACPI / battery: Add quirk for Asus GL502VSK and UX305LA

  * support thunderx2 vendor pmu events (LP: #1747523)
    - perf pmu: Extract function to get JSON alias map
    - perf pmu: Pass pmu as a parameter to get_cpuid_str()
    - perf tools arm64: Add support for get_cpuid_str function.
    - perf pmu: Add helper function is_pmu_core to detect PMU CORE devices
    - perf vendor events arm64: Add ThunderX2 implementation defined pmu core
      events
    - perf pmu: Add check for valid cpuid in perf_pmu__find_map()

  * lpfc.ko module doesn't work (LP: #1746970)
    - scsi: lpfc: Fix loop mode target discovery

  * Ubuntu 17.10 crashes on vmalloc.c (LP: #1739498)
    - powerpc/mm/book3s64: Make KERN_IO_START a variable
    - powerpc/mm/slb: Move comment next to the code it's referring to
    - powerpc/mm/hash64: Make vmalloc 56T on hash

  * ethtool -p fails to light NIC LED on HiSilicon D05 systems (LP: #1748567)
    - net...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-119.143

---------------
linux (4.4.0-119.143) xenial; urgency=medium

  * linux: 4.4.0-119.143 -proposed tracker (LP: #1760327)

  * Dell XPS 13 9360 bluetooth scan can not detect any device (LP: #1759821)
    - Revert "Bluetooth: btusb: fix QCA Rome suspend/resume"

linux (4.4.0-118.142) xenial; urgency=medium

  * linux: 4.4.0-118.142 -proposed tracker (LP: #1759607)

  * Kernel panic with AWS 4.4.0-1053 / 4.4.0-1015 (Trusty) (LP: #1758869)
    - x86/microcode/AMD: Do not load when running on a hypervisor

  * CVE-2018-8043
    - net: phy: mdio-bcm-unimac: fix potential NULL dereference in
      unimac_mdio_probe()

linux (4.4.0-117.141) xenial; urgency=medium

  * linux: 4.4.0-117.141 -proposed tracker (LP: #1755208)

  * Xenial update to 4.4.114 stable release (LP: #1754592)
    - x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
    - usbip: prevent vhci_hcd driver from leaking a socket pointer address
    - usbip: Fix implicit fallthrough warning
    - usbip: Fix potential format overflow in userspace tools
    - x86/microcode/intel: Fix BDW late-loading revision check
    - x86/retpoline: Fill RSB on context switch for affected CPUs
    - sched/deadline: Use the revised wakeup rule for suspending constrained dl
      tasks
    - can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
    - can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
    - PM / sleep: declare __tracedata symbols as char[] rather than char
    - time: Avoid undefined behaviour in ktime_add_safe()
    - timers: Plug locking race vs. timer migration
    - Prevent timer value 0 for MWAITX
    - drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled
    - drivers: base: cacheinfo: fix boot error message when acpi is enabled
    - PCI: layerscape: Add "fsl,ls2085a-pcie" compatible ID
    - PCI: layerscape: Fix MSG TLP drop setting
    - mmc: sdhci-of-esdhc: add/remove some quirks according to vendor version
    - fs/select: add vmalloc fallback for select(2)
    - hwpoison, memcg: forcibly uncharge LRU pages
    - cma: fix calculation of aligned offset
    - mm, page_alloc: fix potential false positive in __zone_watermark_ok
    - ipc: msg, make msgrcv work with LONG_MIN
    - x86/ioapic: Fix incorrect pointers in ioapic_setup_resources()
    - ACPI / processor: Avoid reserving IO regions too early
    - ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
    - ACPICA: Namespace: fix operand cache leak
    - netfilter: x_tables: speed up jump target validation
    - netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed
      in 64bit kernel
    - netfilter: nf_dup_ipv6: set again FLOWI_FLAG_KNOWN_NH at flowi6_flags
    - netfilter: nf_ct_expect: remove the redundant slash when policy name is
      empty
    - netfilter: nfnetlink_queue: reject verdict request from different portid
    - netfilter: restart search if moved to other chain
    - netfilter: nf_conntrack_sip: extend request line validation
    - netfilter: use fwmark_reflect in nf_send_reset
    - ext2: Don't clear SGID when inheriting ACLs
    - reiserfs: fix race in prealloc discard
    - re...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Kai-Heng Feng (kaihengfeng) wrote :

Robert,

Can you attach `lspci -vvnn` here? Thanks!

Steve Roberts (drgrumpy) wrote :

Hi Kai-Heng, I'm guessing that request is directed at me, so here it is:

$ lspci -vvnn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1450]
 Subsystem: ASUSTeK Computer Inc. Device [1043:8747]
 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Device [1022:1451]
 Subsystem: ASUSTeK Computer Inc. Device [1043:8747]
 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Interrupt: pin ? routed to IRQ 27
 Capabilities: <access denied>

00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453] (prog-if 00 [Normal decode])
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin ? routed to IRQ 284
 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
 Memory behind bridge: f7900000-f79fffff
 Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
 BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
 Capabilities: <access denied>
 Kernel driver in use: pcieport
 Kernel modules: shpchp

00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1453] (prog-if 00 [Normal decode])
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin ? routed to IRQ 285
 Bus: primary=00, secondary=02, subordinate=08, sec-latency=0
 I/O behind bridge: 0000f000-0000ffff
 Memory behind bridge: f7500000-f77fffff
 Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
 BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
 Capabilities: <access denied>
 Kernel driver in use: pcieport
 Kernel modules: shpchp

00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1452]
 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB...

Steve Roberts (drgrumpy) wrote :

I realise I probably should have run that with sudo, so the sudo version is attached.

Kai-Heng Feng (kaihengfeng) wrote :

Hi Steve,

Can you boot with an older kernel with the original issue, suspend the system, then attach `sudo lspci -vvnn`?

I think ASPM gets set up after the system resumes from S3.

Steve Roberts (drgrumpy) wrote :

I have removed many of the older testing kernels, but I think this one does not have the fix, based on post #97.

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.13.0-34-generic root=UUID=f7ae652b-cbf6-48b8-bc6a-d3963957ab57 ro

Suspend

Resume

$ sudo lspci -vvnn > lspci.txt

output attached

Kai-Heng Feng (kaihengfeng) wrote :

Another user is facing the same issue. PCIe ASPM is enabled for the SM/PM/EVO 961 on that system, and disabling ASPM solves the issue for that user.

But ASPM is not enabled on your machine, so unfortunately it's not going to help your case.
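
For anyone wanting to check the same thing on their own machine, here is a minimal Python sketch that pulls the ASPM setting and the common clock flag out of the LnkCtl line of `lspci -vv` output. The sample text and the function name `link_state` are made up for illustration, and the regular expression assumes the usual "LnkCtl: ASPM ...; ... CommClk+/-" layout. Without sudo, lspci prints "Capabilities: <access denied>" and there is no LnkCtl line to find.

```python
import re

# Hypothetical excerpt of `sudo lspci -vv` output for one device; real output
# has many more fields. Only the LnkCtl line matters here.
SAMPLE = """\
01:00.0 Non-Volatile memory controller: Example NVMe device
\tLnkCap:\tPort #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
\tLnkCtl:\tASPM Disabled; RCB 64 bytes Disabled- CommClk+
"""

# Assumed layout of the LnkCtl line: "ASPM <setting>; ... CommClk+" or "CommClk-".
LNKCTL = re.compile(r"LnkCtl:\s*ASPM\s+(?P<aspm>[^;]+);.*?CommClk(?P<clk>[+-])")


def link_state(lspci_text):
    """Return {device_address: (aspm_setting, common_clock_enabled)}."""
    result = {}
    device = None
    for line in lspci_text.splitlines():
        # Device header lines start in column 0; detail lines are indented.
        if line and not line[0].isspace():
            device = line.split()[0]
            continue
        m = LNKCTL.search(line)
        if m and device is not None:
            result[device] = (m.group("aspm").strip(), m.group("clk") == "+")
    return result


if __name__ == "__main__":
    for dev, (aspm, commclk) in link_state(SAMPLE).items():
        print(dev, "ASPM:", aspm, "CommClk:", commclk)
```

On a live system the text would come from the file saved with `sudo lspci -vvnn > lspci.txt` rather than from the sample string.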

Steve Roberts (drgrumpy) wrote :

Hmm... so if I understand correctly, PCIe ASPM would be set in the BIOS. I updated the BIOS to a later version (3803) after the original issue (soon after the last testing at the end of Feb). So I guess it may have been disabled in the BIOS?

I guess I need to go back and try an older non-fixed kernel again and see what happens...

Kai-Heng Feng (kaihengfeng) wrote :

Steve,

I've found that the PCIe common clock may be the culprit here.
Please try the kernel [1] with kernel parameter nvme_core.force_apst=1.

[1] https://people.canonical.com/~khfeng/quirk-no-commclk/
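
To confirm that the parameter actually reached the kernel after rebooting, something like the following Python sketch can be run against the contents of /proc/cmdline. The helper name `module_params` and the example command line are hypothetical; the sketch only splits the string and does not talk to the driver.

```python
def module_params(cmdline, module="nvme_core"):
    """Collect `module.param=value` options from a kernel command line."""
    params = {}
    prefix = module + "."
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            name, value = token[len(prefix):].split("=", 1)
            params[name] = value
    return params


if __name__ == "__main__":
    # Hypothetical command line; on a live system read /proc/cmdline instead.
    example = "BOOT_IMAGE=/vmlinuz ro nvme_core.force_apst=1"
    print(module_params(example))
```

An empty result means no nvme_core option was passed on the command line, so the driver is running with its defaults.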

Steve Roberts (drgrumpy) wrote :

Ok, testing now... will report back...

Steve Roberts (drgrumpy) wrote :

$ cat /proc/cmdline
BOOT_IMAGE=/@/boot/vmlinuz-4.18.0-3-generic root=UUID=0885f579-b2ff-4c94-964c-d65c30bb7761 ro rootflags=subvol=@ nvme_core.force_apst=1

Been through several cycles of suspend and resume, without any apparent issues.

But I note:
$ sudo nvme id-ctrl /dev/nvme0n1
....
ps 0 : mp:6.04W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:5.09W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:4.08W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
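
For reading the table above, here is a rough Python sketch that lists the non-operational power states whose entry plus exit latency (enlat + exlat, in microseconds) fits within a given latency budget. Comparing that sum against the budget is an assumption about how nvme_core.default_ps_max_latency_us is applied, not a copy of the driver code, and the 100000 us figure is only an example value. The function name `apst_candidates` is made up for illustration.

```python
import re

# First line of each power-state entry from the `nvme id-ctrl` output above.
ID_CTRL = """\
ps 0 : mp:6.04W operational enlat:0 exlat:0 rrt:0 rrl:0
ps 1 : mp:5.09W operational enlat:0 exlat:0 rrt:1 rrl:1
ps 2 : mp:4.08W operational enlat:0 exlat:0 rrt:2 rrl:2
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
"""

PS_LINE = re.compile(
    r"ps\s+(\d+)\s*:\s*mp:\S+\s+(non-operational|operational)"
    r"\s+enlat:(\d+)\s+exlat:(\d+)"
)


def apst_candidates(text, max_latency_us):
    """Return [(state, entry+exit latency in us)] within the latency budget."""
    out = []
    for state, kind, enlat, exlat in PS_LINE.findall(text):
        total = int(enlat) + int(exlat)
        # Only non-operational states are idle states the drive can drop into.
        if kind == "non-operational" and total <= max_latency_us:
            out.append((int(state), total))
    return out


if __name__ == "__main__":
    print(apst_candidates(ID_CTRL, 100000))  # example budget of 100 ms
    print(apst_candidates(ID_CTRL, 0))       # budget of 0: no state qualifies
```

With a budget of 0, as in the nvme_core.default_ps_max_latency_us=0 workaround from the original report, no state qualifies, which matches the expectation that the drive then stays out of its low-power states.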

Kai-Heng Feng (kaihengfeng) wrote :

Steve,

Does the issue completely go away with my kernel?

Can you attach `sudo lspci -vvnn` with the kernel?
Thanks!

Steve Roberts (drgrumpy) wrote :

Apparently yes. File attached.

With:
$ cat /proc/cmdline
BOOT_IMAGE=/@/boot/vmlinuz-4.18.0-3-generic root=UUID=0885f579-b2ff-4c94-964c-d65c30bb7761 ro rootflags=subvol=@ nvme_core.force_apst=1

I have gone at least 8 days without rebooting, with suspend/resume at least a couple of times a day.

BUT in the meantime I have upgraded to Mint 19 XFCE, so some other issues have arisen, though probably not due to the kernel.

Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released