ubuntu hynix ssd I/O Errors after some minutes

Bug #1785715 reported by Revisor
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

Maybe similar to Bug 1678184.
I'm working on Ubuntu 18.04 (kernel 4.15.0-29-generic) on a Dell XPS 15 9560.
After 20 to 180 minutes the graphical interface hangs (I can switch workspaces, but not much else), and entering commands like "ls" yields "bash: input/output error".
Rebooting is impossible, either via the command or the button. A short press of the power button gives me a TTY-like view showing nvme0 and ext4 file system errors (I will post the exact text at the next hang).

The hangs appear randomly, but often when closing the lid or plugging in the power cable, which makes me think it might be related to SSD power states.
For that reason I added the GRUB kernel parameter "nvme_core.default_ps_max_latency_us=3000" (see Bug 1678184) and ended up fully disabling APST by setting it to 0, but the bug still occurred.
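Before drawing conclusions from a parameter change, it can help to confirm which value the kernel actually booted with. The following is a minimal sketch, not part of the original report: the helper name `ps_max_latency_from_cmdline` is hypothetical, and it only parses command-line text passed on stdin.

```shell
# Hypothetical helper: print the value of nvme_core.default_ps_max_latency_us
# found in a kernel command line read from stdin (prints nothing if absent).
# If the parameter appears more than once, the last occurrence is printed.
ps_max_latency_from_cmdline() {
    tr ' ' '\n' | sed -n 's/^nvme_core\.default_ps_max_latency_us=//p' | tail -n 1
}

# Usage on a running system (not executed here):
#   ps_max_latency_from_cmdline < /proc/cmdline
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```

Fed the ProcKernelCmdLine value recorded further down in this report, the helper should print 0, which matches the "APSTE: Disabled" output of nvme get-feature.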

sudo nvme id-ctrl /dev/nvme0 yields:
NVME Identify Controller:
vid : 0x1c5c
ssvid : 0x1c5c
sn : EJ75N621210105PBR
mn : PC300 NVMe SK hynix 1TB
fr : 20005A00
rab : 1
ieee : ace42e
cmic : 0
mdts : 5
cntlid : 0
ver : 10200
rtd3r : 90f560
rtd3e : ea60
oaes : 0
ctratt : 0
oacs : 0x16
acl : 3
aerl : 3
frmw : 0x16
lpa : 0x2
elpe : 254
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 361
cctemp : 363
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 0
unvmcap : 0
rpmbs : 0
edstt : 60
dsto : 1
fwug : 0
kas : 0
hctma : 0
mntmt : 0
mxtmt : 0
sanicap : 0
hmminds : 0
hmmaxd : 0
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 1
oncs : 0x1e
fuses : 0
fna : 0
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
subnqn :
ioccsz : 0
iorcsz : 0
icdoff : 0
ctrattr : 0
msdbd : 0
ps 0 : mp:5.87W operational enlat:5 exlat:5 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:2.40W operational enlat:30 exlat:30 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:1.90W operational enlat:100 exlat:100 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.1000W non-operational enlat:1000 exlat:1000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0060W non-operational enlat:1000 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
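The power state table above can be filtered to see which non-operational states fit under a given latency limit. This is a rough sketch under an assumption that this report does not confirm: that the driver compares each state's exit latency (exlat) against default_ps_max_latency_us. The helper name `allowed_ps` is hypothetical.

```shell
# Hypothetical helper: list non-operational power states whose exit latency
# (exlat, in microseconds) is at most the limit given as the first argument.
# Reads `nvme id-ctrl` text on stdin.
allowed_ps() {
    awk -v max="$1" '
        $1 == "ps" && /non-operational/ {
            for (i = 4; i <= NF; i++) {
                if ($i ~ /^exlat:/) {
                    split($i, kv, ":")
                    if (kv[2] + 0 <= max + 0)
                        print "ps " $2 " (exlat " kv[2] " us)"
                }
            }
        }'
}

# Usage (not executed here):
#   sudo nvme id-ctrl /dev/nvme0 | allowed_ps 3000
```

For the table in this report, a limit of 3000 would leave only ps 3 as a candidate, since ps 4 has an exit latency of 5000. A limit of 0 leaves no candidates at all.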

sudo nvme get-feature -f 0x0c -H /dev/nvme0 yields:
get-feature:0xc (Autonomous Power State Transition), Current value:00000000
 Autonomous Power State Transition Enable (APSTE): Disabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 ... and so on ...
---
ProblemType: Bug
ApportVersion: 2.20.9-0ubuntu7.2
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: revisor 1593 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 18.04
HibernationDevice: RESUME=UUID=1bc0493d-0592-40d4-b0f4-938fbe12a990
InstallationDate: Installed on 2018-07-12 (25 days ago)
InstallationMedia: Ubuntu 18.04 LTS "Bionic Beaver" - Release amd64 (20180426)
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 003: ID 04f3:24a1 Elan Microelectronics Corp.
 Bus 001 Device 002: ID 0cf3:e300 Atheros Communications, Inc.
 Bus 001 Device 004: ID 0c45:6713 Microdia
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Dell Inc. XPS 15 9560
NonfreeKernelModules: nvidia_modeset nvidia
Package: linux (not installed)
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-29-generic root=UUID=70377f6e-8c1c-4316-bded-c738410879ab ro quiet splash nvme_core.default_ps_max_latency_us=0 vt.handoff=1
ProcVersionSignature: Ubuntu 4.15.0-29.31-generic 4.15.18
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-29-generic N/A
 linux-backports-modules-4.15.0-29-generic N/A
 linux-firmware 1.173.1
Tags: bionic
Uname: Linux 4.15.0-29-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 07/05/2018
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.10.1
dmi.board.name: 05FFDN
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 10
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.10.1:bd07/05/2018:svnDellInc.:pnXPS159560:pvr:rvnDellInc.:rn05FFDN:rvrA00:cvnDellInc.:ct10:cvr:
dmi.product.family: XPS
dmi.product.name: XPS 15 9560
dmi.sys.vendor: Dell Inc.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1785715

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: bionic

Revisor (ichwillspamm) wrote : apport information

Attached files, one per comment: AlsaInfo.txt, CRDA.txt, CurrentDmesg.txt, IwConfig.txt, Lspci.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcEnviron.txt, ProcInterrupts.txt, ProcModules.txt, PulseList.txt, RfKill.txt, UdevDb.txt, WifiSyslog.txt

tags: added: apport-collected
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed

Kai-Heng Feng (kaihengfeng) wrote :

Please test this kernel, it disables common clock on the NVMe.

https://people.canonical.com/~khfeng/quirk-no-commclk/

Revisor (ichwillspamm) wrote :

Thank you Kai-Heng Feng for the quick response!

Can you please tell me how to install it? If I download the .deb files, should I install them one after another via dpkg? Or can Ubuntu Software handle them?

Kai-Heng Feng (kaihengfeng) wrote :

Download them into the same directory and run `sudo dpkg -i *deb`.

Revisor (ichwillspamm) wrote :

I did that and rebooted.
After I selected Ubuntu in the GRUB menu as usual, I got the normal login screen. After logging in I only see a violet background and a mouse pointer. Pressing keys and moving the mouse does nothing, but when I switch to TTY4 via ALT+F4 I can see a warning:
[ 368.068010] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:61]

This warning is repeated every 28 seconds.

And it seems I cannot log in via TTY anymore :/

It feels like this is a completely different problem.

What I can do is enter recovery mode and get a root shell.

Revisor (ichwillspamm) wrote :

I just found out I can log into the graphical interface if I enter the recovery mode root shell, then exit and resume normal boot. A weird but probably irrelevant thing is that the magnification settings for text and icons seem to have been reset (but maybe that is because of the changes in your kernel).

Should I run an update to maybe fix the issue that I can only boot after entering and exiting the recovery mode root shell?

Changed in linux (Ubuntu):
importance: Undecided → Medium
importance: Medium → High

Revisor (ichwillspamm) wrote :

Update: Even with the new kernel the problem still occurs (but it feels like it has become rarer).
Today it appeared three times: twice when closing the lid (in fact every time I closed it, if I remember correctly) and once while working (programming in Python).

Was anyone able to reproduce the problem on an XPS 9560 with a Hynix SSD?

Changed in linux (Ubuntu):
assignee: nobody → Kai-Heng Feng (kaihengfeng)

Kai-Heng Feng (kaihengfeng) wrote :

Can you attach `lspci -vvnn`? I'd like to check the ASPM settings.

Revisor (ichwillspamm) wrote :

Thank you for dedicating some of your time to this matter! :)
Here is the requested output:
revisor@revisor-XPS-15-9560:~$ lspci -vvnn
00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [8086:5910] (rev 05)
 Subsystem: Dell Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [1028:07be]
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort+ >SERR- <PERR- INTx-
 Latency: 0
 Capabilities: <access denied>

00:01.0 PCI bridge [0604]: Intel Corporation Skylake PCIe Controller (x16) [8086:1901] (rev 05) (prog-if 00 [Normal decode])
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Interrupt: pin A routed to IRQ 16
 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
 I/O behind bridge: 0000e000-0000efff
 Memory behind bridge: ec000000-ed0fffff
 Prefetchable memory behind bridge: 00000000c0000000-00000000d1ffffff
 Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
 BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
 Capabilities: <access denied>
 Kernel driver in use: pcieport

00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:591b] (rev 04) (prog-if 00 [VGA controller])
 Subsystem: Dell Device [1028:07be]
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Interrupt: pin A routed to IRQ 255
 Region 0: Memory at eb000000 (64-bit, non-prefetchable) [size=16M]
 Region 2: Memory at 80000000 (64-bit, prefetchable) [size=256M]
 Region 4: I/O ports at f000 [size=64]
 [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
 Capabilities: <access denied>
 Kernel modules: i915

00:04.0 Signal processing controller [1180]: Intel Corporation Skylake Processor Thermal Subsystem [8086:1903] (rev 05)
 Subsystem: Dell Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [1028:07be]
 Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Interrupt: pin A routed to IRQ 16
 Region 0: Memory at ed120000 (64-bit, non-prefetchable) [size=32K]
 Capabilities: <access denied>
 Kernel driver in use: proc_thermal
 Kernel modules: processor_thermal_device

00:14.0 USB controller [0c03]: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller [8086:a12f] (rev 31) (prog-if 30 [XHCI])
 Subsystem: Dell Sunrise Point-H USB 3.0 xHCI Controller [1028:07be]
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <P...

Revisor (ichwillspamm) wrote :

Some of these show "<access denied>".
I don't see an edit option here, so I will just write another comment.
If I run this with sudo I get the full report:

revisor@revisor-XPS-15-9560:~$ sudo lspci -vvnn
[sudo] password for revisor:
00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [8086:5910] (rev 05)
 Subsystem: Dell Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [1028:07be]
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort+ <MAbort+ >SERR- <PERR- INTx-
 Latency: 0
 Capabilities: [e0] Vendor Specific Information: Len=10 <?>

00:01.0 PCI bridge [0604]: Intel Corporation Skylake PCIe Controller (x16) [8086:1901] (rev 05) (prog-if 00 [Normal decode])
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Interrupt: pin A routed to IRQ 16
 Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
 I/O behind bridge: 0000e000-0000efff
 Memory behind bridge: ec000000-ed0fffff
 Prefetchable memory behind bridge: 00000000c0000000-00000000d1ffffff
 Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
 BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
  PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
 Capabilities: [88] Subsystem: Dell Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) [1028:07be]
 Capabilities: [80] Power Management version 3
  Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
  Address: 00000000 Data: 0000
 Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
  DevCap: MaxPayload 256 bytes, PhantFunc 0
   ExtTag- RBE+
  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
   RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
   MaxPayload 256 bytes, MaxReadReq 128 bytes
  DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
  LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <8us
   ClockPM- Surprise- LLActRep- BwNot+ ASPMOptComp+
  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
   ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
  LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt+ ABWMgmt+
  SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
   Slot #1, PowerLimit 75.000W; Interlock- NoCompl+
  SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
   Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
  SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
   Changed: MRL- PresDet+ LinkState-
  RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
  RootCap: CRSVisible-
  RootSta: PME ReqID 0000, PMEStatus- PMEPending-
  DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+...
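
Since the request was to check the ASPM settings, the relevant lines can be pulled out of the long listing. The sketch below is an illustration rather than anything used in this thread: `aspm_summary` is a hypothetical helper, and it assumes the `lspci -vv` layout shown above, where a device line starts with its bus address and the LnkCtl line carries both the ASPM state and the CommClk flag.

```shell
# Hypothetical helper: for each device in `lspci -vv` text on stdin, print
# its bus address with the ASPM state and CommClk flag from the LnkCtl line.
aspm_summary() {
    awk '
        /^[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]\.[0-7] / { dev = $1 }
        $1 == "LnkCtl:" {
            state = $0
            sub(/^.*LnkCtl:[ \t]*/, "", state)  # drop the field label
            sub(/;.*$/, "", state)              # keep only the ASPM part
            clk = "CommClk?"
            if ($0 ~ /CommClk\+/) clk = "CommClk+"
            else if ($0 ~ /CommClk-/) clk = "CommClk-"
            print dev ": " state ", " clk
        }'
}

# Usage (not executed here; root is needed to read the capabilities):
#   sudo lspci -vvnn | aspm_summary
```

For the 00:01.0 root port shown above this would print "00:01.0: ASPM Disabled, CommClk+".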

Revisor (ichwillspamm) wrote :

Unfortunately I still couldn't get rid of the problem. I'm still in the state where everything works fine as long as I do not close the lid or plug in the AC adapter. Whenever I do that it reliably hangs and I can't do anything anymore (I can't switch to another TTY or enter anything on the keyboard, so I have to hold the power button for 7 seconds to kill it). I also still cannot boot my currently installed default kernel directly (it hangs at the login screen); I have to enter the root shell, exit it and then resume boot.

Was anyone able to reproduce it?
Do you think using an older Linux kernel might fix it? (From reading the other reports, it seems like the kernel is the problem [in combination with my hardware].)

Kai-Heng Feng (kaihengfeng) wrote :

Sorry for the late reply.

Please try kernel in [1].

[1] https://people.canonical.com/~khfeng/lp1785715/

Revisor (ichwillspamm) wrote :

Just a few days ago I tried Windows 10 on the machine (more as a test run).
Strangely, I encountered a similar (maybe the same) problem.
This time I noticed that I can provoke the problem by moving the laptop at certain angles.
So I suspect that it might be a hardware problem (at least in part).
I just sent my laptop to Dell for them to take a look.

Now I'm curious what they will do about it.

Thanks, Kai-Heng Feng, and sorry that I didn't report it here immediately when I suspected it might be a hardware problem.

I will keep you informed.

Kai-Heng Feng (kaihengfeng) wrote :

Do you still see this bug?

Revisor (ichwillspamm) wrote : Re: [Bug 1785715] Re: ubuntu hynix ssd I/O Errors after some minutes

Sorry, I forgot to report back. Thank you for asking.

No, I got my SSD replaced with a new one and now it is all good.
It seems to have been a hardware-related problem.

On 10.05.2019 at 17:26, Kai-Heng Feng wrote:
> Do you still see this bug?
>

Changed in linux (Ubuntu):
assignee: Kai-Heng Feng (kaihengfeng) → nobody
Brad Figg (brad-figg)
tags: added: cscc