Samsung SSD corruption (fsck needed)

Bug #1746340 reported by Lucas Zanella
This bug affects 15 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: High
Assigned to: Unassigned
Milestone: none listed

Bug Description

Ubuntu 4.13.0-21.24-generic 4.13.13

I have a Razer Blade Stealth 2016. The first Ubuntu release I installed was 17.04, which gave me this error after 2 weeks of usage. After that, I installed 16.04 and used it for MONTHS without any problems, until it produced the same error this week. I think it has to do with the Ubuntu updates, because I did one recently and one today, just before this problem. Could be a coincidence though.

I notice the error when I try to save something to disk and it tells me that the filesystem is read-only:

lz@lz:/var/log$ touch something
touch: cannot touch 'something': Read-only file system

lz@lz:/var/log$ cat syslog
Jan 29 01:07:39 lz kernel: [62984.375393] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0

lz@lz:/var/log$ dmesg
[62984.375393] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.377374] Aborting journal on device nvme0n1p2-8.
[62984.379343] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
[62984.379516] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.381486] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.383484] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.385469] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.387278] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.389262] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.391252] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.393341] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[63285.618078] audit: type=1400 audit(1517195560.393:63): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=22495 comm="cupsd" capability=12 capname="net_admin"
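
For anyone checking their own logs for the same symptom, here is a minimal sketch that pulls the affected device out of kernel log text like the above. The function names are made up for illustration, and the patterns assume the message wording matches the lines shown here.

```shell
# Print each device that the kernel log reports as remounted read-only by ext4.
# Reads kernel log text (for example the output of dmesg) on stdin.
ro_remounted_devices() {
    sed -n 's/.*EXT4-fs (\([^)]*\)): Remounting filesystem read-only.*/\1/p' | sort -u
}

# Count the EXT4-fs error lines in kernel log text read from stdin.
count_ext4_errors() {
    grep -c 'EXT4-fs error'
}
```

Typical use would be `dmesg | ro_remounted_devices`, which for the log above should print nvme0n1p2.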

Rebooting Ubuntu gives me a black terminal where I can run fsck /dev/nvme0n1p2 (something like that), and it fixes a lot of orphaned inodes. Most of the time it then boots back into a working Ubuntu, but sometimes it boots into a broken one (no images, lots of things broken), and then I have to reinstall Ubuntu.

Every time I reinstall Ubuntu, I have to try many times until it installs without an Input/Output error. Once installed, I can use it for some hours without the problem, but if I run the software updates, it ALWAYS crashes and goes read-only, specifically while installing kernel updates.

I noticed that Ubuntu installs updates automatically when they're for security reasons. Could this be the reason my Ubuntu worked for months without the problem, but then an update was applied and it broke?

I thought that this bug was the cause: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184 and tried different nvme_core.default_ps_max_latency_us= values; all of them gave errors. I then changed it to 0 and had no error during normal use (although I didn't test for long), but I still got the error when trying to update Ubuntu.
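
To confirm which value a given boot actually used, the parameter can be read back from the kernel command line. The following is a minimal sketch; `get_kparam` is a hypothetical helper name, not an existing tool.

```shell
# Print the value of one kernel command-line parameter, or nothing if absent.
# $1: the full command line as a single string, $2: the parameter name
get_kparam() {
    printf '%s\n' "$1" | tr ' ' '\n' | awk -v key="$2" '
        # Match only words that start with "<key>=" and print what follows.
        index($0, key "=") == 1 { print substr($0, length(key) + 2) }'
}
```

On a running system this would be called as `get_kparam "$(cat /proc/cmdline)" nvme_core.default_ps_max_latency_us`. The value in effect can usually also be read from /sys/module/nvme_core/parameters/default_ps_max_latency_us, assuming the nvme_core module is loaded.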

My Samsung 512 GB SSD is:

SAMSUNG MZVLW512HMJP-00000, FW REV: CXY7501Q

on a Razer Blade Stealth.

I also asked this on Ask Ubuntu, without success: https://askubuntu.com/questions/998471/razer-blade-stealth-disk-corruption-fsck-needed-probably-samsung-ssd-bug-afte

Please help me, as I need this computer to work on lots of things :c
---
ApportVersion: 2.20.7-0ubuntu3.7
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: lz 1088 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 17.10
InstallationDate: Installed on 2018-01-30 (0 days ago)
InstallationMedia: Ubuntu 17.10 "Artful Aardvark" - Release amd64 (20180105.1)
MachineType: Razer Blade Stealth
Package: linux (not installed)
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.13.0-21-generic.efi.signed root=UUID=0ca062da-7e8f-425a-88b1-1f784fb40346 ro quiet splash button.lid_init_state=open nvme_core.default_ps_max_latency_us=0
ProcVersionSignature: Ubuntu 4.13.0-21.24-generic 4.13.13
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-21-generic N/A
 linux-backports-modules-4.13.0-21-generic N/A
 linux-firmware 1.169.1
Tags: wayland-session artful
Uname: Linux 4.13.0-21-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 01/12/2017
dmi.bios.vendor: Razer
dmi.bios.version: 6.00
dmi.board.name: Razer
dmi.board.vendor: Razer
dmi.chassis.type: 9
dmi.chassis.vendor: Razer
dmi.modalias: dmi:bvnRazer:bvr6.00:bd01/12/2017:svnRazer:pnBladeStealth:pvr2.04:rvnRazer:rnRazer:rvr:cvnRazer:ct9:cvr:
dmi.product.family: 1A586752
dmi.product.name: Blade Stealth
dmi.product.version: 2.04
dmi.sys.vendor: Razer

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.15 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1746340

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Lucas Zanella (lucaszanella) wrote : apport information

Attachments: AlsaInfo.txt, CRDA.txt, CurrentDmesg.txt, IwConfig.txt, JournalErrors.txt, Lspci.txt, Lsusb.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcEnviron.txt, ProcInterrupts.txt, ProcModules.txt, PulseList.txt, RfKill.txt, UdevDb.txt, WifiSyslog.txt

tags: added: apport-collected artful wayland-session
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Lucas Zanella (lucaszanella) wrote :

Which kernel exactly should I install, and how? I don't feel safe downloading over plain http.

Kai-Heng Feng (kaihengfeng) wrote :

This is a known issue for Samsung NVMe.

Please attach the output of `sudo nvme id-ctrl /dev/nvme0` and `sudo nvme get-feature -f 0x0c -H /dev/nvme0 | less`, Thanks!

Kai-Heng Feng (kaihengfeng) wrote :

Uhh sans the "less", thanks.

Lucas Zanella (lucaszanella) wrote :

Thank you for your answer. I'm desperate. I just installed Debian, so I can't run those commands right now, but I have output from the last time I was using Ubuntu.

I tried nvme_core.default_ps_max_latency_us=5500 and it didn't work. Then I set it to 0, which didn't work either. Well, with 0 it didn't generate errors during normal use, only while updating my machine, which always triggers it, so I don't know anymore. I remember seeing "APST Disabled" in the output, but the error always happens when I try to update my software...

Shouldn't this bug already be fixed? Or is it not fixed in my kernel? I could pay to get to the bottom of this, because I need my computer so much right now, this bug is happening every day, and I can't continue my work!

The last kernel I had on Ubuntu was 4.13.0-26-generic; now I'm on Debian with 4.9.0-4.

sudo nvme list
Node          SN              Model                       Namespace  Usage                 Format       FW Rev
------------  --------------  --------------------------  ---------  --------------------  -----------  --------
/dev/nvme0n1  S33UNX0J324060  SAMSUNG MZVLW512HMJP-00000  1          25,30 GB / 512,11 GB  512 B + 0 B  CXY7501Q

NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : S33UNX0J324060
mn : SAMSUNG MZVLW512HMJP-00000
fr : CXY7501Q
rab : 2
ieee : 002538
cmic : 0
mdts : 0
cntlid : 2
ver : 10200
rtd3r : 186a0
rtd3e : 4c4b40
oaes : 0
oacs : 0x17
acl : 7
aerl : 3
frmw : 0x16
lpa : 0x3
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 341
cctemp : 344
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 512110190592
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
subnqn :
ps 0 : mp:7.60W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:5.10W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
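
To see what a value such as the 5500 tried above would mean for this power-state table, here is a minimal sketch. It assumes the driver skips any non-operational state whose exit latency (exlat) exceeds nvme_core.default_ps_max_latency_us, and turns APST off entirely when the value is 0. That rule is an assumption rather than something stated in this thread; some kernel versions may compare entry plus exit latency instead, which gives the same answer for the values here. `apst_candidates` is a hypothetical helper name.

```shell
# List non-operational power states whose exit latency fits under a limit.
# stdin: text output of `nvme id-ctrl`; $1: latency limit in microseconds.
# A limit of 0 is treated as "APST off", so nothing is printed.
apst_candidates() {
    awk -v limit="$1" '
        /^ps +[0-9]+ *:.*non-operational/ {
            state = $2
            enlat = exlat = 0
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^enlat:/) enlat = substr($i, 7)
                if ($i ~ /^exlat:/) exlat = substr($i, 7)
            }
            # Assumed rule: a state is usable if its exit latency is within the limit.
            if (limit > 0 && exlat + 0 <= limit + 0)
                printf "ps%d exlat=%d total=%d\n", state, exlat, enlat + exlat
        }'
}
```

Fed the table above with a limit of 5500, this prints only ps3 (exit latency 1500, entry plus exit 1710); ps4 has an exit latency of 6000 and is left out, so under these assumptions 5500 would keep the drive out of its deepest state.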

Kai-Heng Feng (kaihengfeng) wrote : Re: [Bug 1746340] Re: Samsung SSD corruption (fsck needed)

Kai-Heng

> On 31 Jan 2018, at 1:38 PM, Lucas Zanella <email address hidden> wrote:
>
> Thank you for your answer. I'm desperated. I just installed debian
> therefore I'm not going to able to do it right now, but I have output
> from the last time I was using Ubuntu.
>
> I tried nvme_core.default_ps_max_latency_us=5500 and it didn't work.
> Then I've put it to 0, which didn't work too. Well, with 0 it didn't
> generate errors while using, but while trying to update my machine,
> which always happens too, so I don't know anymore. I remember seeing
> ATSP Disabled at the output, but the error always happens when I try to
> update my software…

I’d like to see the output of `sudo nvme get-feature -f 0x0c -H /dev/nvme0` when you use nvme_core.default_ps_max_latency_us=0.

>
> Shouldn't this bug be already fixed? Or not in my kernel? I could pay to
> get to the bottom of this, because I need my computer so much right now
> and this bug is happening every day and I can't continue my work!

This is more likely a low-level NVMe/PCIe issue. If possible, please try to upgrade the firmware for the NVMe.

>
> The last kernel I had on ubuntu was 4.13.0-26-generic, now I'm on debian
> and I have 4.9.0-4.

You'll get hit by this issue (again) once the next Debian release uses a newer kernel.


Lucas Zanella (lucaszanella) wrote :

Hi. I've been trying to install Windows 10 in order to try to update my SSD firmware, but I'm getting an error:

https://imgur.com/a/BM0gG

Could it be that my SSD has a real hardware problem? I tried many different pen drives in different USB ports, but I always get the same error.

I'm trying to install Ubuntu to get the output with nvme_core.default_ps_max_latency_us=0, but the installation always fails.

Lucas Zanella (lucaszanella) wrote :

Hi! I managed to install Ubuntu again. These are the outputs you asked for, with nvme_core.default_ps_max_latency_us=0:

NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : S33UNX0J324060
mn : SAMSUNG MZVLW512HMJP-00000
fr : CXY7501Q
rab : 2
ieee : 002538
cmic : 0
mdts : 0
cntlid : 2
ver : 10200
rtd3r : 186a0
rtd3e : 4c4b40
oaes : 0
oacs : 0x17
acl : 7
aerl : 3
frmw : 0x16
lpa : 0x3
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 341
cctemp : 344
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 512110190592
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
subnqn :
ps 0 : mp:7.60W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:5.10W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

get-feature:0xc (Autonomous Power State Transition), Current value:00000000
 Autonomous Power State Transition Enable (APSTE): Disabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 2]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 5]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 6]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 7]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 8]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 9]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[10]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[11]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[12]
 ...
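
The line that matters in this output is the APSTE state near the top. A minimal sketch for extracting it when comparing boots with different parameter values follows; `apste_state` is a hypothetical helper name, and the pattern assumes the human-readable (-H) output format shown above.

```shell
# Print "Enabled" or "Disabled" as reported for APSTE.
# stdin: text output of `nvme get-feature -f 0x0c -H /dev/nvme0`
apste_state() {
    sed -n 's/.*Autonomous Power State Transition Enable (APSTE): *\([A-Za-z]*\).*/\1/p' | head -n 1
}
```

For example, `sudo nvme get-feature -f 0x0c -H /dev/nvme0 | apste_state` would print Disabled for the output above.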


Lucas Zanella (lucaszanella) wrote :

I just installed 4.15.0-041500-generic

Lucas Zanella (lucaszanella) wrote :

The problem persists with 4.15.0-041500-generic; it just happened.

Kai-Heng Feng (kaihengfeng) wrote :

So you have the issue on Linux v4.15 with nvme_core.default_ps_max_latency_us=0, but not on v4.9?

APST isn't enabled on either of them.

Lucas Zanella (lucaszanella) wrote :

On Debian (4.9) I didn't notice the issue, but I didn't use it much. HOWEVER, when I ran apt-get upgrade on Debian I did get the issue. The upgrade only wrote the new kernel files; the new kernel never ran (that would have needed a reboot).

On v4.15 I don't think I set nvme_core.default_ps_max_latency_us=0. I believe I only set it before upgrading to v4.15. But I can try again.

This is all very strange.

Lucas Zanella (lucaszanella) wrote :

I forgot to mention that I reinstalled Windows and everything is fine. I even ran a benchmark on the SSD, and I'm downloading lots of files to test it.

Kai-Heng Feng (kaihengfeng) wrote :

I am not familiar with Windows. Is there any way to check its APST table? I'd like to see whether the deepest power state is enabled.

Lucas Zanella (lucaszanella) wrote :

I searched and found nothing.

So, even with APST disabled, my SSD fails on Linux. What should I do?
Does it work normally for other people when they disable it?

Lucas Zanella (lucaszanella) wrote :

I found someone with the same problem who also had a Razer Blade Stealth, but he didn't post anything more after that. He was in a thread with you. I also found some people with this same problem on the same SSD. Together with the fact that I had no problem on Windows (more than 24 hours of usage by now), I think it can be fixed in the kernel.

I had no luck updating my SSD's firmware, as it's an OEM drive and Samsung's updater won't work with it. Do you have any idea? I don't have money to buy a new SSD, and I really need to work. I'd be so grateful if you could help with a solution.

Kai-Heng Feng (kaihengfeng) wrote :

Does the issue happen after system suspend?

Lucas Zanella (lucaszanella) wrote :

Initially I noted that it'd happen after opening the lid of the notebook, so yes. But now after I install Ubuntu it immediately starts looking for software updates and that's when the problem happens for the first time, when I haven't even had time to close the notebook to suspend it.

Kai-Heng Feng (kaihengfeng) wrote :

Please try [1]. It will do a PCI reset for the NVMe device after resume.

[1] people.canonical.com/~khfeng/lp1746340/

Lucas Zanella (lucaszanella) wrote :

Thanks. What's a 'PCI reset for NVMe device after resume'?

Here's the output of running sudo dpkg -i *.deb on the 4 files:

Selecting previously unselected package linux-headers-4.15.0+.
(Reading database ... 137951 files and directories currently installed.)
Preparing to unpack linux-headers-4.15.0+_4.15.0+-2_amd64.deb ...
Unpacking linux-headers-4.15.0+ (4.15.0+-2) ...
Selecting previously unselected package linux-image-4.15.0+.
Preparing to unpack linux-image-4.15.0+_4.15.0+-2_amd64.deb ...
Unpacking linux-image-4.15.0+ (4.15.0+-2) ...
Selecting previously unselected package linux-image-4.15.0+-dbg.
Preparing to unpack linux-image-4.15.0+-dbg_4.15.0+-2_amd64.deb ...
Unpacking linux-image-4.15.0+-dbg (4.15.0+-2) ...
dpkg-deb (subprocess): decompressing archive member: lzma error: compressed data is corrupt
dpkg-deb: error: subprocess <decompress> returned error exit status 2
dpkg: error processing archive linux-image-4.15.0+-dbg_4.15.0+-2_amd64.deb (--install):
 cannot copy extracted data for './usr/lib/debug/lib/modules/4.15.0+/kernel/drivers/iio/pressure/zpa2326.ko' to '/usr/lib/debug/lib/modules/4.15.0+/kernel/drivers/iio/pressure/zpa2326.ko.dpkg-new': unexpected end of file or stream
Selecting previously unselected package linux-libc-dev.
Preparing to unpack linux-libc-dev_4.15.0+-2_amd64.deb ...
Unpacking linux-libc-dev (4.15.0+-2) ...
Setting up linux-headers-4.15.0+ (4.15.0+-2) ...
Setting up linux-image-4.15.0+ (4.15.0+-2) ...
update-initramfs: Generating /boot/initrd.img-4.15.0+
W: Possible missing firmware /lib/firmware/i915/skl_dmc_ver1_27.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_dmc_ver1_04.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_guc_ver9_39.bin for module i915
W: Possible missing firmware /lib/firmware/i915/bxt_guc_ver9_29.bin for module i915
W: Possible missing firmware /lib/firmware/i915/skl_guc_ver9_33.bin for module i915
Generating grub configuration file ...
Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
Found linux image: /boot/vmlinuz-4.15.0+
Found initrd image: /boot/initrd.img-4.15.0+
Found linux image: /boot/vmlinuz-4.13.0-21-generic
Found initrd image: /boot/initrd.img-4.13.0-21-generic
Adding boot menu entry for EFI firmware configuration
done
Setting up linux-libc-dev (4.15.0+-2) ...
Errors were encountered while processing:
 linux-image-4.15.0+-dbg_4.15.0+-2_amd64.deb

Lucas Zanella (lucaszanella) wrote :

I downloaded again and it seems that this time it wasn't corrupted.

Output:

Preparing to unpack linux-headers-4.15.0+_4.15.0+-2_amd64.deb ...
Unpacking linux-headers-4.15.0+ (4.15.0+-2) over (4.15.0+-2) ...
Preparing to unpack linux-image-4.15.0+_4.15.0+-2_amd64(1).deb ...
Unpacking linux-image-4.15.0+ (4.15.0+-2) over (4.15.0+-2) ...
Preparing to unpack linux-image-4.15.0+-dbg_4.15.0+-2_amd64(1).deb ...
Unpacking linux-image-4.15.0+-dbg (4.15.0+-2) ...
Preparing to unpack linux-libc-dev_4.15.0+-2_amd64.deb ...
Unpacking linux-libc-dev (4.15.0+-2) over (4.15.0+-2) ...
Setting up linux-headers-4.15.0+ (4.15.0+-2) ...
Setting up linux-image-4.15.0+ (4.15.0+-2) ...
update-initramfs: Generating /boot/initrd.img-4.15.0+
W: Possible missing firmware /lib/firmware/i915/skl_dmc_ver1_27.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_dmc_ver1_04.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_guc_ver9_39.bin for module i915
W: Possible missing firmware /lib/firmware/i915/bxt_guc_ver9_29.bin for module i915
W: Possible missing firmware /lib/firmware/i915/skl_guc_ver9_33.bin for module i915
Generating grub configuration file ...
Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
Found linux image: /boot/vmlinuz-4.15.0+
Found initrd image: /boot/initrd.img-4.15.0+
Found linux image: /boot/vmlinuz-4.13.0-21-generic
Found initrd image: /boot/initrd.img-4.13.0-21-generic
Adding boot menu entry for EFI firmware configuration
done
Setting up linux-image-4.15.0+-dbg (4.15.0+-2) ...
Setting up linux-libc-dev (4.15.0+-2) ...
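
Since the first attempt failed with a corrupt archive and a second download worked, it can help to check whether two copies of the same package are byte-identical before installing. A minimal sketch follows; `same_sha256` is a hypothetical helper, and no official checksums for these test builds are given in this thread, so it only compares one download against another.

```shell
# Succeed only if both files exist and have identical SHA-256 digests.
# $1, $2: paths of the two files to compare
same_sha256() {
    sum_a="$(sha256sum < "$1" | cut -d ' ' -f 1)"
    sum_b="$(sha256sum < "$2" | cut -d ' ' -f 1)"
    [ -n "$sum_a" ] && [ "$sum_a" = "$sum_b" ]
}
```

For example, `same_sha256 first.deb second.deb || echo "downloads differ"`, where both file names are placeholders. If the two copies differ, at least one of them was damaged either in transit or on disk.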

Lucas Zanella (lucaszanella) wrote :

After installing everything, I rebooted into the new kernel. I then installed updates to see whether the problem would happen (updating is the easiest way to trigger it). After the update, wireless stopped working. I restarted many times and it still doesn't work.

Could it be that the update triggered the error and this kernel's so-called PCIe reset broke the wireless?

I'm still going to use this kernel to see whether the read-only filesystem happens, though.

Lucas Zanella (lucaszanella) wrote :

I added a USB wireless receiver so I can get online and download things to see if something happens. I installed more system updates through the Ubuntu software updater. Is this OK? The kernel will still be yours, right?

Kai-Heng Feng (kaihengfeng) wrote :

> On Feb 8, 2018, at 10:19 AM, Lucas Zanella <email address hidden> wrote:
>
> I added an USB wireless receiver to use internet to download things so I
> can see if something happens. I installed more system updates through
> the ubuntu software updates. Is this ok? The kernel will still be yours,
> rigtht?

It should be. You can use `uname -r` to check the kernel version.
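
A minimal sketch of that check in script form, useful for confirming the test kernel is the one running before drawing conclusions from a test. `running_kernel_is` is a hypothetical helper name; the expected release string has to be supplied by the user.

```shell
# Succeed if the running kernel release equals the expected one.
# $1: expected release string
# $2: release to compare against (optional; defaults to the output of `uname -r`)
running_kernel_is() {
    actual="${2:-$(uname -r)}"
    [ "$actual" = "$1" ]
}
```

For the first test build in this thread, that would be `running_kernel_is '4.15.0+' || echo "not running the test kernel"`.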


Lucas Zanella (lucaszanella) wrote :

The new kernel has been running for almost a day and no problems happened (however I still have no PCIe wireless and no i915 firmware so I can't open things like kdenlive)

What does this fix of yours do and is it possible to make it work with everything?

Kai-Heng Feng (kaihengfeng) wrote :

I built another one based on Bionic. Please use this kernel instead:
people.canonical.com/~khfeng/lp1746340-2/

Lucas Zanella (lucaszanella) wrote :

Just after running sudo dpkg -i *.deb, and before rebooting, the error happened. Since the new kernel isn't running yet, I guess this current kernel still had the problem? That's strange because I've been running for more than 24 hours, downloading lots of torrents and had no problems.

I'm going to reboot now to test the new kernel

Here's the output:

sudo dpkg -i *.deb
[sudo] password for lz:
Selecting previously unselected package linux-headers-4.14.0-17.
(Reading database ... 215114 files and directories currently installed.)
Preparing to unpack linux-headers-4.14.0-17_4.14.0-17.20~lp1746340_all.deb ...
Unpacking linux-headers-4.14.0-17 (4.14.0-17.20~lp1746340) ...
Selecting previously unselected package linux-headers-4.14.0-17-generic.
Preparing to unpack linux-headers-4.14.0-17-generic_4.14.0-17.20~lp1746340_amd64.deb ...
Unpacking linux-headers-4.14.0-17-generic (4.14.0-17.20~lp1746340) ...
Selecting previously unselected package linux-image-4.14.0-17-generic.
Preparing to unpack linux-image-4.14.0-17-generic_4.14.0-17.20~lp1746340_amd64.deb ...
Examining /etc/kernel/preinst.d/
Done.
Unpacking linux-image-4.14.0-17-generic (4.14.0-17.20~lp1746340) ...
Selecting previously unselected package linux-image-extra-4.14.0-17-generic.
Preparing to unpack linux-image-extra-4.14.0-17-generic_4.14.0-17.20~lp1746340_amd64.deb ...
Unpacking linux-image-extra-4.14.0-17-generic (4.14.0-17.20~lp1746340) ...
Setting up linux-headers-4.14.0-17 (4.14.0-17.20~lp1746340) ...
Setting up linux-headers-4.14.0-17-generic (4.14.0-17.20~lp1746340) ...
Setting up linux-image-4.14.0-17-generic (4.14.0-17.20~lp1746340) ...
Running depmod.
update-initramfs: deferring update (hook will be called later)
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 4.14.0-17-generic /boot/vmlinuz-4.14.0-17-generic
run-parts: executing /etc/kernel/postinst.d/initramfs-tools 4.14.0-17-generic /boot/vmlinuz-4.14.0-17-generic
update-initramfs: Generating /boot/initrd.img-4.14.0-17-generic
run-parts: executing /etc/kernel/postinst.d/unattended-upgrades 4.14.0-17-generic /boot/vmlinuz-4.14.0-17-generic
run-parts: executing /etc/kernel/postinst.d/update-notifier 4.14.0-17-generic /boot/vmlinuz-4.14.0-17-generic
run-parts: executing /etc/kernel/postinst.d/zz-update-grub 4.14.0-17-generic /boot/vmlinuz-4.14.0-17-generic
Generating grub configuration file ...
Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
Found linux image: /boot/vmlinuz-4.15.0+
Found initrd image: /boot/initrd.img-4.15.0+
Found linux image: /boot/vmlinuz-4.14.0-17-generic
Found initrd image: /boot/initrd.img-4.14.0-17-generic
Found linux image: /boot/vmlinuz-4.13.0-32-generic
Found initrd image: /boot/initrd.img-4.13.0-32-generic
Found linux image: /boot/vmlinuz-4.13.0-21-generic
Found initrd image: /boot/initrd.img-4.13.0-21-generic
Adding boot menu entry for EFI firmware configuration
done
Setting up linux-image-extra-4.14.0-17-generic (4.14.0-17.20~lp1746340) ...
run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 4.14.0-17-generic /boot/vmlinuz-4.14.0-17-generic
run-parts: executing /etc/kernel/postinst.d/...


Lucas Zanella (lucaszanella) wrote :

I rebooted and was still on 4.15. I then brought up the GRUB menu to select a kernel version and chose 4.14... (which is yours). It then boots to a cpu_fifo_underrun error and nothing else comes up.

Lucas Zanella (lucaszanella) wrote :

initramf [drm: intel_cpu_fifo_underrun_irq_handler [i915]] *ERROR* CPU pipe A FIFO underrun

Kai-Heng Feng (kaihengfeng) wrote :

> On Feb 9, 2018, at 3:34 PM, Lucas Zanella <email address hidden>
> wrote:
>
> Just after running sudo dpkg -i *.deb, and before rebooting, the error
> happened. Since the new kernel isn't running yet, I guess this current
> kernel still had the problem? That's strange because I've been running
> for more than 24 hours, downloading lots of torrents and had no
> problems.
>
> I'm going to reboot now to test the new kernel
>

The issue happens when the disk transitions between operational and
non-operational power states.

If you are torrenting, then chances are the disk is always in operational
states, so you don’t see the issue.

Kai-Heng


Lucas Zanella (lucaszanella) wrote :

Very important update: I bought a brand new Samsung 960 EVO, and I can't even install Ubuntu: I get I/O errors during the installation.

Kai-Heng Feng (kaihengfeng) wrote :

You need to boot with the kernel parameter "nvme_core.default_ps_max_latency_us=0"

Please try this kernel after installation:
http://people.canonical.com/~khfeng/lp1746340-artful/
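
For readers following along, here is a minimal sketch of making that parameter persistent on an installed system. It assumes a stock Ubuntu GRUB setup where /etc/default/grub contains a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT="..." line. The helper name add_kernel_param is made up for illustration and is not part of any package; try it on a copy of the file first.

```shell
#!/bin/sh
# Hypothetical helper (not from this thread): append a kernel parameter to
# GRUB_CMDLINE_LINUX_DEFAULT in the given grub defaults file, unless the
# parameter is already there.
# Assumes the value sits on one line and is wrapped in double quotes, and
# that the parameter contains no "|", "&" or backslash characters.
add_kernel_param() {
    file=$1
    param=$2
    # Already present on the GRUB_CMDLINE_LINUX_DEFAULT line? Then do nothing.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Insert the parameter just before the closing quote, via a temp file.
    sed "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"|\1 ${param}\"|" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}
```

Run as root against /etc/default/grub, for example add_kernel_param /etc/default/grub nvme_core.default_ps_max_latency_us=0, and then regenerate the boot menu with sudo update-grub so the change takes effect on the next boot.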

Lucas Zanella (lucaszanella) wrote :

Is it possible to boot the live installation media with the kernel parameter? I'm having problems installing Ubuntu onto the new SSD; I always get I/O errors...

I'm gonna also try the new kernel on the old SSD though

Lucas Zanella (lucaszanella) wrote :

You mean I need to boot with the parameter on your new kernel? I'm gonna try it

Lucas Zanella (lucaszanella) wrote :

I just installed your new kernel on the old SSD and changed the nvme_core_..._us to 0

seems that a dependency is missing on the kernel:

dpkg: dependency problems prevent configuration of linux-headers-4.13.0-34-generic:
 linux-headers-4.13.0-34-generic depends on libssl1.1 (>= 1.1.0); however:
  Package libssl1.1 is not installed.

dpkg: error processing package linux-headers-4.13.0-34-generic (--install):
 dependency problems - leaving unconfigured
Setting up linux-image-4.13.0-34-generic (4.13.0-34.37~lp1746340) ...
Running depmod.
update-initramfs: deferring update (hook will be called later)
Examining /etc/kernel/postinst.d.
run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 4.13.0-34-generic /boot/vmlinuz-4.13.0-34-generic
run-parts: executing /etc/kernel/postinst.d/initramfs-tools 4.13.0-34-generic /boot/vmlinuz-4.13.0-34-generic
update-initramfs: Generating /boot/initrd.img-4.13.0-34-generic
run-parts: executing /etc/kernel/postinst.d/unattended-upgrades 4.13.0-34-generic /boot/vmlinuz-4.13.0-34-generic
run-parts: executing /etc/kernel/postinst.d/update-notifier 4.13.0-34-generic /boot/vmlinuz-4.13.0-34-generic
run-parts: executing /etc/kernel/postinst.d/zz-update-grub 4.13.0-34-generic /boot/vmlinuz-4.13.0-34-generic
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.15.0+
Found initrd image: /boot/initrd.img-4.15.0+
Found linux image: /boot/vmlinuz-4.14.0-17-generic
Found initrd image: /boot/initrd.img-4.14.0-17-generic
Found linux image: /boot/vmlinuz-4.13.0-34-generic
Found initrd image: /boot/initrd.img-4.13.0-34-generic
Found linux image: /boot/vmlinuz-4.13.0-32-generic
Found initrd image: /boot/initrd.img-4.13.0-32-generic
Found linux image: /boot/vmlinuz-4.13.0-21-generic
Found initrd image: /boot/initrd.img-4.13.0-21-generic
Adding boot menu entry for EFI firmware configuration
done
Setting up linux-image-extra-4.13.0-34-generic (4.13.0-34.37~lp1746340) ...
run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 4.13.0-34-generic /boot/vmlinuz-4.13.0-34-generic
run-parts: executing /etc/kernel/postinst.d/initramfs-tools 4.13.0-34-generic /boot/vmlinuz-4.13.0-34-generic
update-initramfs: Generating /boot/initrd.img-4.13.0-34-generic
run-parts: executing /etc/kernel/postinst.d/unattended-upgrades 4.13.0-34-generic /boot/vmlinuz-4.13.0-34-generic
run-parts: executing /etc/kernel/postinst.d/update-notifier 4.13.0-34-generic /boot/vmlinuz-4.13.0-34-generic
run-parts: executing /etc/kernel/postinst.d/zz-update-grub 4.13.0-34-generic /boot/vmlinuz-4.13.0-34-generic
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.15.0+
Found initrd image: /boot/initrd.img-4.15.0+
Found linux image: /boot/vmlinuz-4.14.0-17-generic
Found initrd image: /boot/initrd.img-4.14.0-17-generic
Found linux image: /boot/vmlinuz-4.13.0-34-generic
Found initrd image: /boot/initrd.img-4.13.0-34-generic
Found linux image: /boot/vmlinuz-4.13.0-32-generic
Found initrd image: /boot/initrd.img-4.13.0-32-generic
Found linux image: /boot/vmlinuz-4.13.0-21-generic
Found initrd image: /boot/initrd.img-4.13.0-21-generic
Adding boot menu entry for EFI firmware configuration
done
Errors were encountered while processing:
 linux-hea...


Lucas Zanella (lucaszanella) wrote :

Never mind, I installed libssl1.1 by adding the bionic repo; however, right before I could reinstall the kernel the system went into read-only mode. I'm gonna try to boot in and install the new kernel some other way.

Lucas Zanella (lucaszanella) wrote :

I think I got it: http://pastebin.com/raw/squFVnGi

I'm assuming that this 4.13.0-34 is the new one, not the old one I had. I hope it is.

Just booted and PCIe wireless is working. uname -r gives 4.13.0-34-generic.

Going to let the system rest for a while to see if anything happens; not going to download torrents again.

Lucas Zanella (lucaszanella) wrote :

I've been running for days without any problem (before, it would happen about 30 minutes after installation). So can you release the source? Will it be in mainline?

Also, how can I use this kernel with the live image? It's painful to install Ubuntu with these problems: I get I/O errors in 90% of my tries and have to try for hours until it installs properly.

Thank you so much!

Kai-Heng Feng (kaihengfeng) wrote :

Can you try again with [1]?

The one you used has the quirk NVME_QUIRK_NO_DEEPEST_PS; let's see if that quirk is unnecessary.

[1] people.canonical.com/~khfeng/lp1746340-pcireset/

Lucas Zanella (lucaszanella) wrote :

Ok, just installed it. Gonna monitor it to see if any errors come up

Lucas Zanella (lucaszanella) wrote :

Everything is OK with this new kernel. No errors.

Kai-Heng Feng (kaihengfeng) wrote :
Changed in linux (Ubuntu):
assignee: nobody → Kai-Heng Feng (kaihengfeng)
Revision history for this message
Lucas Zanella (lucaszanella) wrote :

When will there be a kernel with this patch included?

What about the live image? It's going to take months for a live installation image to have this patch. Is it possible for me to use this kernel in a live image myself?

Kai-Heng Feng (kaihengfeng) wrote :

> When will there be a kernel with this patch included?
v4.16.

> What about the live image? It's going to take months for a live installation image to have this patch. Is it possible for me to use this kernel in a live image myself?
I'll backport the patch to v4.15 so the Bionic (18.04) live image will have this fix.

Lucas Zanella (lucaszanella) wrote :

but v4.16-rc1 doesn't have "NVME_QUIRK_PCI_RESET_RESUME = (1 << 7)"

Kai-Heng Feng (kaihengfeng) wrote :

The patch hasn't been merged yet.

Lucas Zanella (lucaszanella) wrote :

I compiled a kernel myself, applying the patch and using make deb-pkg, and got these files:

    linux-headers-4.15.4_4.15.4-4_amd64.deb
    linux-image-4.15.4_4.15.4-4_amd64.deb
    linux-image-4.15.4-dbg_4.15.4-4_amd64.deb
    linux-libc-dev_4.15.4-4_amd64.deb

but you don't have image...dbg or libc, and you do have image-extra and headers...generic. What's the difference? Will mine work? If not, how do I get exactly your 4 files?

Kai-Heng Feng (kaihengfeng) wrote :
Revision history for this message
Lucas Zanella (lucaszanella) wrote :

I downloaded a fresh 4.15.0-9 kernel, applied the diff and compiled just as described in the page you sent.

I then formatted, installed Ubuntu 17.10.1, booted and enabled nvme_..._us=0, rebooted. Then I started installing updates. The error occurred in the middle of it.

I've never tried to install updates on the kernels you made me test. Could it be that it's breaking something?

:(

Lucas Zanella (lucaszanella) wrote :

I installed Ubuntu again, installed my compiled kernel and disabled updates. When installing a package (virt-manager), the error occurred again. This package touches the kernel (KVM things), but I used it before on your kernels and everything was fine (I didn't try to install it while running your kernel though).

Lucas Zanella (lucaszanella) wrote :

Here's what I did:

git clone bionic_git_url...
cd ubuntu-bionic
git checkout <tag of version 4.15.0-9>
patch -p1 < nvme_reset.diff #(from your diff file)
# gave an error about the last line, but I checked manually and it was OK (I guess it was because of the number at the end): https://pastebin.com/j4Tz1fDa
sudo apt-get build-dep linux-image-$(uname -r)
fakeroot debian/rules clean
fakeroot debian/rules binary-headers binary-generic binary-perarch

after a long time, I copied

linux-headers-4.15.0-9_4.15.0-9.10_all.deb
linux-headers-4.15.0-9-generic_4.15.0-9.10_amd64.deb
linux-image-4.15.0-9-generic_4.15.0-9.10_amd64.deb
linux-image-extra-4.15.0-9-generic_4.15.0-9.10_amd64.deb

and installed on my fresh ubuntu 17.10.1 install on the razer blade stealth by doing

sudo dpkg -i *.deb

then I added nvme_..._us=0 to grub and did

sudo update-grub

rebooted and used it for a while (confirmed using uname -r that the new kernel was running). The first time I did all this, the problem occurred while installing updates. The second time I tried, the error occurred when I tried to install virt-manager.

Since the kernel worked perfectly except for that, I can only assume that your diff didn't go through. But if I do git checkout <tag> and then apply a diff to that tag, then I can simply cd to this folder and compile and I'll be using the diff, right?
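
One way to answer that question before starting a long build is to search the patched tree for the identifier the diff introduces. A rough sketch, assuming the diff adds NVME_QUIRK_PCI_RESET_RESUME somewhere under drivers/nvme/host/ (the quirk name is from this thread; the directory is an assumption about where the NVMe host driver sources live, and quirk_in_tree is a made-up helper name):

```shell
#!/bin/sh
# Hypothetical helper: succeed if a quirk identifier appears anywhere in the
# NVMe host driver sources of the given kernel tree.
quirk_in_tree() {
    tree=$1
    quirk=$2
    # -r: recurse, -q: quiet, -F: fixed string, -w: match whole words only
    grep -rqFw -- "$quirk" "$tree/drivers/nvme/host"
}
```

For example, quirk_in_tree ~/ubuntu-bionic NVME_QUIRK_PCI_RESET_RESUME && echo "patch is in the tree" (the checkout path is a placeholder). Running git diff --stat inside the checkout after patch -p1 should also list the files the diff touched.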

Thank you for your help!

Kai-Heng Feng (kaihengfeng) wrote :

I want to know the reason behind compiling your own kernel. Is it because you still encounter some disk errors even with the kernel parameter "nvme_core.default_ps_max_latency_us=0"?

If that's true, then we need to put the patch into Bionic's kernel and make sure the daily Bionic ISO uses the new kernel.

Lucas Zanella (lucaszanella) wrote :

I added nvme_core.default_ps_max_latency_us=0 because you said to in an older comment. Is it necessary or can I take it off?

I'm compiling my own because I want to learn and also test new kernels as they are released, especially now with Spectre and Meltdown (it's going to take time for it to reach mainline and even more time for it to reach the live installer). Also it's a good practice for security reasons.

I don't see what I did wrong, my kernel should work exactly as yours.

Kai-Heng Feng (kaihengfeng) wrote :

Please remove the kernel parameter so we can make sure it works with APST enabled.
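
To confirm the parameter is really gone after the reboot, one can look at the running kernel's command line. A small sketch, assuming the usual /proc/cmdline location; has_kernel_param is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel command line file
# (normally /proc/cmdline) contains a parameter with exactly this name,
# with or without a "=value" part.
has_kernel_param() {
    cmdline_file=$1
    name=$2
    awk -v name="$name" '
        {
            for (i = 1; i <= NF; i++) {
                split($i, kv, "=")          # kv[1] is the part before "="
                if (kv[1] == name) found = 1
            }
        }
        END { exit !found }                 # exit 0 only when it was seen
    ' "$cmdline_file"
}
```

For example: has_kernel_param /proc/cmdline nvme_core.default_ps_max_latency_us || echo "parameter not set". On kernels that expose it, the value actually in effect can also be read from /sys/module/nvme_core/parameters/default_ps_max_latency_us.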

Lucas Zanella (lucaszanella) wrote :

Since you wrote the last message I recompiled the kernel and reinstalled. Tested again, and the problem occurred in about 1 hour. Then I took the kernel parameter off and started testing, and I've been running for more than 24 hours without errors. However, the error occurred inside a virtual machine. But the disk in the VM is named /dev/sda1, so it's not using the NVMe driver or anything like that. How is it possible for the error to occur inside the virtual machine but not on the main machine? Could this be due to another, completely unrelated problem?

Kai-Heng Feng (kaihengfeng) wrote :

Thanks Lucas. Sounds like the issue is gone when APST gets enabled.

It would be great if you could test it with more S3 cycles.

Regarding the VM issue, I can't be sure unless you attach the error message.

Lucas Zanella (lucaszanella) wrote :

The kernel is still good. The error happened again in the virtual machine, here's dmesg:

[ 6730.708866] EXT4-fs error (device sda1): htree_dirblock_to_tree:976: inode #418562: comm updatedb.mlocat: Directory block failed checksum
[ 6730.710121] Aborting journal on device sda1-8.
[ 6730.711514] EXT4-fs (sda1): Remounting filesystem read-only
[ 6730.713087] EXT4-fs error (device sda1): ext4_journal_check_start:60: Detected aborted journal
[ 7030.415582] audit: type=1400 audit(1519269087.344:26): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=2851 comm="cupsd" capability=12 capname="net_admin"
[67539.479651] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[67539.479670] clocksource: 'kvm-clock' wd_now: 55b3d2da2f60 wd_last: 269f11d4c146 mask: ffffffffffffffff
[67539.479673] clocksource: 'tsc' cs_now: 422f92c80a6a cs_last: 422e9dc07f56 mask: ffffffffffffffff

what do you think? My disk is /dev/sda1 on the virtual machine, so no NVME... I'm using KVM spice

Lucas Zanella (lucaszanella) wrote :

So... the error happened :(

I don't know if it's related, but I was compiling Qt 5 inside a virtual machine and went to sleep. When I woke up there was an error in the compilation about not being able to allocate virtual memory. The VM was unusable (I pressed things and they wouldn't work), so I rebooted the VM, and during the Ubuntu startup there was something about trying to write outside disk hd1. I then ran fsck and the machine kept printing lines about disk writes indefinitely (it wouldn't stop). I tried to take a screenshot but couldn't save it.

Now, on the main machine, I did touch a and the error Read-only file system appeared. Then I looked at dmesg:

[62526.097648] CPU3: Package temperature above threshold, cpu clock throttled (total events = 734393)
[62526.097650] CPU2: Package temperature above threshold, cpu clock throttled (total events = 734389)
[62526.097654] CPU0: Package temperature above threshold, cpu clock throttled (total events = 734421)
[62526.098643] CPU0: Core temperature/speed normal
[62526.098644] CPU2: Core temperature/speed normal
[62526.098644] CPU3: Package temperature/speed normal
[62526.098645] CPU1: Package temperature/speed normal
[62526.098646] CPU2: Package temperature/speed normal
[62526.098647] CPU0: Package temperature/speed normal
[62826.083664] CPU0: Core temperature/speed normal
[62826.083665] CPU2: Core temperature/speed normal
[62826.083666] CPU3: Package temperature/speed normal
[62826.083667] CPU1: Package temperature/speed normal
[62826.083667] CPU2: Package temperature/speed normal
[62826.083669] CPU0: Package temperature/speed normal
[63109.039660] CPU3: Core temperature above threshold, cpu clock throttled (total events = 122579)
[63109.039661] CPU1: Core temperature above threshold, cpu clock throttled (total events = 122586)
[63109.043637] CPU3: Core temperature/speed normal
[63109.043637] CPU1: Core temperature/speed normal
[63141.298625] CPU2: Core temperature above threshold, cpu clock throttled (total events = 685839)
[63141.298626] CPU0: Core temperature above threshold, cpu clock throttled (total events = 685861)
[63141.298628] CPU1: Package temperature above threshold, cpu clock throttled (total events = 752070)
[63141.298628] CPU3: Package temperature above threshold, cpu clock throttled (total events = 752043)
[63141.298630] CPU0: Package temperature above threshold, cpu clock throttled (total events = 752073)
[63141.298633] CPU2: Package temperature above threshold, cpu clock throttled (total events = 752042)
[63141.311665] CPU0: Core temperature/speed normal
[63141.311666] CPU2: Core temperature/speed normal
[63141.311667] CPU1: Package temperature/speed normal
[63141.311667] CPU3: Package temperature/speed normal
[63141.311668] CPU2: Package temperature/speed normal
[63141.311669] CPU0: Package temperature/speed normal
[63441.300764] CPU2: Core temperature/speed normal
[63441.300765] CPU0: Core temperature/speed normal
[63441.300766] CPU3: Package temperature/speed normal
[63441.300766] CPU1: Package temperature/speed normal
[63441.300767] CPU0: Package temperature/speed normal
[63441.300768] CPU2: Package temperature/speed normal
[63742.088404] CPU0: Core temperature above threshold, cpu cloc...


Lucas Zanella (lucaszanella) wrote :

here's a print screen:

https://imgur.com/a/ZEBAP

Kai-Heng Feng (kaihengfeng) wrote :

Do you have full dmesg in comment #75? Do you see any NVMe error?

Lucas Zanella (lucaszanella) wrote :

Nope, sorry, I rebooted after copying the dmesg. I thought that since what I copied includes [68668.595459] EXT4-fs (nvme0n1p2): Remounting filesystem read-only, it was enough, because that is where the error started. I'm gonna cat the entire file next time, but I'm afraid it'll only happen if I push my CPU too hard like I did (actually this is good, because the NVMe took a very long time to go into read-only mode; that's very good progress).

Meanwhile, I think the average rate of read-only errors inside the VMs is about 0.8/day. I always test the main machine when these errors happen and it's always fine. The only time it went wrong was the one I reported.

Lucas Zanella (lucaszanella) wrote :

Just as a reminder, inside the VM the error looks like this:

[26547.754916] EXT4-fs error (device sda1): htree_dirblock_to_tree:976: inode #31777: comm gvfsd-trash: Directory block failed checksum
[26547.756301] Aborting journal on device sda1-8.
[26547.757724] EXT4-fs (sda1): Remounting filesystem read-only
[26547.762207] EXT4-fs error (device sda1): ext4_journal_check_start:60: Detected aborted journal
[26631.771204] EXT4-fs error (device sda1): htree_dirblock_to_tree:976: inode #302034: comm gvfsd-trash: Directory block failed checksum

when outside there's no problem at all.

Kai-Heng Feng (kaihengfeng) wrote :

I don't think the EXT4 issue inside VM is the same as NVMe one.

If you no longer have the issue on host machine, then the fix works.

Lucas Zanella (lucaszanella) wrote :

Yes, the fix totally works. The only time I had a real NVMe error on the main machine was the one I reported. I don't know why, but it looks like the VM went terribly wrong. These VMs are just plain Ubuntu installs with docker, Visual Studio Code and git. Nothing fancy; I don't know why I keep getting ext4 problems.

Lucas Zanella (lucaszanella) wrote :

Well, the error happened again, and I wasn't even running any VMs

Maybe there's a rare case in which your diff correction didn't get applied?

:(

Kai-Heng Feng (kaihengfeng) wrote :

Please change the line in the patch from
"return NVME_QUIRK_PCI_RESET_RESUME;"
to
"return (NVME_QUIRK_PCI_RESET_RESUME | NVME_QUIRK_NO_DEEPEST_PS);"

And see if this still happens.

Lucas Zanella (lucaszanella) wrote :

Ok, I'm compiling it now on my other PC. Meanwhile, the error happened twice in the same day. That's odd; it usually took at least 2 days to show up again. Could it be that something changed on my PC?

Anyways, here's the error:

[ 4609.325351] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1447: inode #26610978: comm updatedb.mlocat: checksumming directory block 0
[ 4609.327443] Aborting journal on device nvme0n1p2-8.
[ 4609.329533] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
[ 4609.357281] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1447: inode #26739117: comm updatedb.mlocat: checksumming directory block 0
[ 4609.627350] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1447: inode #5901277: comm updatedb.mlocat: checksumming directory block 0
[ 4795.596378] perf: interrupt took too long (2563 > 2500), lowering kernel.perf_event_max_sample_rate to 78000
[ 4911.346846] audit: type=1400 audit(1519876882.781:29): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=4149 comm="cupsd" capability=12 capname="net_admin"

I'll test the newly compiled kernel in a few hours

Lucas Zanella (lucaszanella) wrote :

After 2 days with the new kernel, it happened again. Seems like around every 2 days it happens. Maybe some rare nvme write that you didn't cover in the quirks?

Kai-Heng Feng (kaihengfeng) wrote :

Again, can you attach full dmesg?
Because the message is not about NVMe, but EXT4.

Please fsck the rootfs before any further testing.

Lucas Zanella (lucaszanella) wrote :

The next time it happens I will post the full dmesg. But even my first message (#1):

Jan 29 01:07:39 lz kernel: [62984.375393] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0

cites EXT-4 errors. It's always been like that, nothing changed.

Do I need to fsck the rootfs now or when the error happens again? And how should I do it?

Kai-Heng Feng (kaihengfeng) wrote :

Generally I boot up a live system and run fsck on the block device. A quick Google search shows there are several ways to achieve the same thing.
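
As a rough illustration of that approach: from a live session the installed root partition is not in use, so it can be checked while unmounted. The sketch below only guards against running fsck on a mounted filesystem. is_mounted and check_partition are hypothetical helper names, and /dev/nvme0n1p2 is the partition named earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the device appears as a mount source in a
# mounts table (defaults to /proc/mounts).
is_mounted() {
    dev=$1
    mounts=${2:-/proc/mounts}
    awk -v dev="$dev" '$1 == dev { found = 1 } END { exit !found }' "$mounts"
}

# Hypothetical wrapper: force a check (-f) only when the partition is not
# mounted; otherwise refuse and explain why.
check_partition() {
    dev=$1
    if is_mounted "$dev"; then
        echo "$dev is mounted; boot a live system or unmount it first" >&2
        return 1
    fi
    sudo fsck -f "$dev"
}
```

From the live session this would be called as check_partition /dev/nvme0n1p2. A device can show up in /proc/mounts under a different name (for example a /dev/mapper path), so treat the check as a guard rather than a guarantee.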

Kai-Heng Feng (kaihengfeng) wrote :

So does the issue still happen after fsck?

Does quirk NVME_QUIRK_PCI_RESET_RESUME alone work for you?

Lucas Zanella (lucaszanella) wrote :

Sorry, I didn't test it yet. I had to travel and use the computer, so I did a fsck on nvme0n1p2 only, to get it working.

I thought fsck /dev/nvme0n1p2 was the same as fscking the rootfs. I do it every time the error happens. I didn't understand exactly what you meant.

Also I think NVME_QUIRK_NO_DEEPEST_PS had no effect when added. I'd however try to make it work with it first, and if everything goes ok I'd take it off to see if it continues.

If I need to do it before the error happens, I can just run my live ubuntu and do it.

Thank you for your help.

Lucas Zanella (lucaszanella) wrote :

Ok so I didn't know exactly what to do.

I was using my machine and even though I didn't get any errors I rebooted, entered ubuntu live image and did fsck on /dev/nvme0n1p1 and /dev/nvme0n1p2, no errors showed up

Then I continued using my machine and the error appeared. I rebooted into the live machine and did

fsck /dev/nvme0n1p2 and fsck /dev/nvme0n1p1

here are the outputs:

https://pastebin.com/jpz5SwrR

https://pastebin.com/xNMQPuVi

(on nvme0n1p1 I got a dirty bit message, I don't know what this is, and on nvme0n1p2 the output looks the same as when I ran fsck from the SSD)

The error persists.

Kai-Heng Feng (kaihengfeng) wrote :

Seems to me it's a bug in EXT4 instead of NVMe.

So seems like NVME_QUIRK_PCI_RESET_RESUME is not needed?

Lucas Zanella (lucaszanella) wrote :

Without NVME_QUIRK_PCI_RESET_RESUME the bug happens every 2 hours. With it, the bug happens every 2 days or more (I think there was a time I ran 5 to 8 days without an error).

Lucas Zanella (lucaszanella) wrote :

The error just happened again inside the VM, and upon VM reboot it said that something was trying to access outside the disk hd0 or hd1, don't remember. Then I noted that the host machine had the error too.

This already happened before and I mentioned it in comment #75

Somehow the virtual machine errors and the host machine errors are related. However, the error happens even without virtual machine usage.

Kai-Heng Feng (kaihengfeng) wrote :

Do you suspend/resume the system during your usage?

Lucas Zanella (lucaszanella) wrote :

I do it a lot. However, the errors don't happen right after waking from suspend. They take some time.

Before the NVMe quirk I tried to see if suspend/resume influenced the error. I experimented by booting the PC and leaving it open until the error occurred, which showed that suspend wasn't causing it.

However, now with the NVMe quirk the computer takes a long time to show the error, so I always end up closing it. I think there was one time the error happened right after turning the PC on, though I'm not 100% sure.

Kai-Heng Feng (kaihengfeng) wrote :

Have you ever seen error message that says the nvme device stays in D3 and refused to change to D0?

Lucas Zanella (lucaszanella) wrote :

No. The times I've read dmesg there was nothing like this, neither as an error popup. I'll grep D0 and D3 the next time though
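
A sketch of what that grep might look like. The exact wording of the kernel message is not given in this thread, so the pattern below is only a guess at a loose filter: it keeps lines that mention nvme together with D0 or D3 (including D3hot and D3cold) as whole words. filter_nvme_dstate is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical filter: read kernel log text on stdin and keep lines that
# mention nvme and a PCI power state D0 or D3. Whole-word matching (-w)
# keeps hex strings such as "55b3d2da2f60" from matching by accident.
filter_nvme_dstate() {
    grep -i 'nvme' | grep -Ew 'D0|D3(hot|cold)?'
}
```

Typical use would be dmesg | filter_nvme_dstate, run before rebooting so the messages from the failing session are still in the kernel log.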

Lucas Zanella (lucaszanella) wrote :

I just noted that almost always when I do

docker rm $(docker ps -a -q)

to remove all docker containers inside my VM, the error happens. Maybe high disk usage causes this?

The errors on my computer are taking time to happen, but in the VMs it happens every day.

I'm 100% thankful for what you've done. If you know something, I can pay; these errors are making it very hard for me to work.

Kai-Heng Feng (kaihengfeng) wrote :

Please try this one, I built it with NVMe queue depth = 2.

https://people.canonical.com/~khfeng/lp1746340-q-2/

Also please attach the dmesg, thanks!

Lucas Zanella (lucaszanella) wrote :

Would you mind posting the diff? I'm using custom kernel modifications (not related to disk and tested without them)

Kai-Heng Feng (kaihengfeng) wrote :

Ok, there's actually a kernel parameter for that, please boot with "nvme.io_queue_depth=2"
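
After booting with such a parameter, the value the driver actually picked up can usually be read back from sysfs. A minimal sketch, assuming the nvme module exposes io_queue_depth under /sys/module/nvme/parameters/ on the kernel in use; module_param is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the current value of a module parameter as
# exposed under sysfs, or fail if the file is not readable.
module_param() {
    module=$1
    param=$2
    root=${3:-/sys/module}      # overridable root, handy for dry runs
    path="$root/$module/parameters/$param"
    [ -r "$path" ] || return 1
    cat "$path"
}
```

For example, module_param nvme io_queue_depth prints the queue depth in effect, which makes it easy to confirm that the boot parameter was spelled correctly and accepted.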

Lucas Zanella (lucaszanella) wrote :

I tried this parameter and the computer got stuck at the loading screen. Had to enter recovery mode and remove the parameter to make it boot again

Kai-Heng Feng (kaihengfeng) wrote :

Can you try some value like 64? PM1725 NVMe uses this value.

Lucas Zanella (lucaszanella) wrote :

I'm trying since you wrote. No problems yet on the host machine, but the virtual machine already presented the error twice today (not related to high disk usage)

Lucas Zanella (lucaszanella) wrote :

Ok, the error just happened now on the main machine. It took many more days to happen.

I picked up my computer while it was asleep but open, moved the mouse and saw a black screen with nothing on it. Then I pressed power and something appeared: /.../libvirt .... read only file system, and then it turned off. Libvirt was running at the time, so I think it was just an error saying that libvirt was trying to write to the disk. Don't know if it's related.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

And it just happened again. It commonly happens again right after I reboot and fsck following a previous occurrence. Then it calms down for some days.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Do you see similar behavior under Windows?

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Back when I was having the error every 2 hours, I installed Windows and used it for some days without any problems, so I guess not. I also tried an old Debian and the installation went OK (on Ubuntu it fails 80% of the time; I have to try many times until I get a good installation).

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

So do you see the same issue with mainline v4.9 kernel [1]?

From what I understand, disabling APST lets you fully install Ubuntu/Debian, but after some usage you still have to fsck the disk?

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9.93/

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

I did not try to install with APST disabled. I tried to put a custom kernel in the live CD but it wouldn't boot. My current Ubuntu was installed by trial and error until it installed without any errors. The installation process is not so important for me; I can try 5 or 7 times before getting it right. I'm mainly concerned about usage afterwards.

So this is what happened: before any quirks, I was having the error every 2 hours. After your kernel quirk, I started having the error every 2 days on main machine and on average every day on the virtual machines. After the NVME queue_depth parameter it looks like it's taking 5 days, but the virtual machines continue giving the errors 1 time per day on average.

So I should try this new kernel? I suppose it already has the quirk you created. I'll install it soon. Not now because I need to backup things, because if the error happens during the kernel installation then the whole ubuntu is going to get wrecked.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

The quirk is not included.

I asked this because it looks like you didn't need to fsck under Debian with Linux v4.9.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

I didn't need to, but then I installed some updates and it broke. However, I don't think the kernel got upgraded with that update.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you use v4.9 under Ubuntu and see if this still happens? Or does your laptop need driver support from newer kernel?

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi, I'm back, sorry for the delay. I'll test it soon again. I tried and the error happened in the middle of the update and broke my ubuntu. I'll reinstall and try again

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Lucas,

Can you attach `sudo lspci -vvnn` here? Thanks!

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hello. Thank you for your continued support! I was unable to test the older kernel yet as I'm using this PC constantly and cannot lose or have it unusable for too much time, as when the system gets corrupted I have to spend hours trying to install it again without errors.

Here's the output:

00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [8086:5904] (rev 02)
 Subsystem: Razer USA Ltd. Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [1a58:6752]
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Capabilities: [e0] Vendor Specific Information: Len=10 <?>

00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 620 [8086:5916] (rev 02) (prog-if 00 [VGA controller])
 Subsystem: Razer USA Ltd. HD Graphics 620 [1a58:6752]
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Interrupt: pin A routed to IRQ 127
 Region 0: Memory at db000000 (64-bit, non-prefetchable) [size=16M]
 Region 2: Memory at 90000000 (64-bit, prefetchable) [size=256M]
 Region 4: I/O ports at f000 [size=64]
 [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
 Capabilities: [40] Vendor Specific Information: Len=0c <?>
 Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
  DevCap: MaxPayload 128 bytes, PhantFunc 0
   ExtTag- RBE+
  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
   RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
   MaxPayload 128 bytes, MaxReadReq 128 bytes
  DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
  DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
  DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
 Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
  Address: fee00018 Data: 0000
 Capabilities: [d0] Power Management version 2
  Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
  Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [100 v1] Process Address Space ID (PASID)
  PASIDCap: Exec- Priv-, Max PASID Width: 14
  PASIDCtl: Enable- Exec- Priv-
 Capabilities: [200 v1] Address Translation Service (ATS)
  ATSCap: Invalidate Queue Depth: 00
  ATSCtl: Enable-, Smallest Translation Unit: 00
 Capabilities: [300 v1] Page Request Interface (PRI)
  PRICtl: Enable- Reset-
  PRISta: RF- UPRGI- Stopped+
  Page Request Capacity: 00008000, Page Request Allocation: 00000000
 Kernel driver in use: i915
 Kernel modules: i915

00:14.0 USB controller [0c03]: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller [8086:9d2f] (rev 21) (prog-if 30 [XHCI])
 Subsystem: Razer USA Ltd. Sunrise Point-LP USB 3.0 xHCI Controller [1a58:6752]
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B+ ParErr...

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

If possible, please try this kernel:
https://people.canonical.com/~khfeng/pm961-disable-aspm/

Please also attach `sudo lspci -vvnn` with this kernel, thanks!

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Could you provide the diff file? I need to compile with other modifications.

Or better yet, is there a command to disable aspm in boot?

Thank you so much

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

For Bionic kernel.

tags: added: patch
Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Should I also add the other quirk you made which made the problem happen fewer times?

https://lkml.org/lkml/2018/2/15/347

I'm using the 4.15-23

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

No, just use the patch in #120.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

I just compiled the kernel and it presented the error just minutes after the first boot.

A reminder: my kernel parameters are still like this:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash button.lid_init_state=open nvme.io_queue_depth=64"

Here's the output you wanted:

00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [8086:5904] (rev 02)
 Subsystem: Razer USA Ltd. Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [1a58:6752]
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ >SERR- <PERR- INTx-
 Latency: 0
 Capabilities: [e0] Vendor Specific Information: Len=10 <?>

00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 620 [8086:5916] (rev 02) (prog-if 00 [VGA controller])
 Subsystem: Razer USA Ltd. HD Graphics 620 [1a58:6752]
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Interrupt: pin A routed to IRQ 124
 Region 0: Memory at db000000 (64-bit, non-prefetchable) [size=16M]
 Region 2: Memory at 90000000 (64-bit, prefetchable) [size=256M]
 Region 4: I/O ports at f000 [size=64]
 [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
 Capabilities: [40] Vendor Specific Information: Len=0c <?>
 Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
  DevCap: MaxPayload 128 bytes, PhantFunc 0
   ExtTag- RBE+
  DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
   RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
   MaxPayload 128 bytes, MaxReadReq 128 bytes
  DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
  DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
  DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
 Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
  Address: fee00018 Data: 0000
 Capabilities: [d0] Power Management version 2
  Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
  Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [100 v1] Process Address Space ID (PASID)
  PASIDCap: Exec- Priv-, Max PASID Width: 14
  PASIDCtl: Enable- Exec- Priv-
 Capabilities: [200 v1] Address Translation Service (ATS)
  ATSCap: Invalidate Queue Depth: 00
  ATSCtl: Enable-, Smallest Translation Unit: 00
 Capabilities: [300 v1] Page Request Interface (PRI)
  PRICtl: Enable- Reset-
  PRISta: RF- UPRGI- Stopped+
  Page Request Capacity: 00008000, Page Request Allocation: 00000000
 Kernel driver in use: i915
 Kernel modules: i915

00:14.0 USB controller [0c03]: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller [8086:9d2f] (rev 21) (prog-if 30 [XHCI])
 Subsystem: Razer USA Ltd. Sunrise Point-LP USB 3.0 xHCI Controller [1a58:6752]
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >T...

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

So the ASPM is indeed disabled:
  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+

Can you try disabling the deepest power state under this kernel? i.e. use "nvme-core.default_ps_max_latency_us=1500".
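
The LnkCtl check above can be repeated on any machine by filtering `lspci -vv` output. A minimal sketch, assuming the LnkCtl line has the same layout as the one quoted above; `aspm_state` is a hypothetical helper name:

```shell
# aspm_state: read `lspci -vv` text on stdin and print the ASPM setting
# found on each LnkCtl line (for example "Disabled" or "L1 Enabled").
aspm_state() {
    sed -n 's/^[[:space:]]*LnkCtl:[[:space:]]*ASPM \([^;]*\);.*/\1/p'
}
```

Typical use would be `sudo lspci -vv -s <address> | aspm_state`, with the NVMe controller's own bus address in place of `<address>`. Root is needed because lspci hides the capability list from unprivileged users.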

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

FWIW another user says that disabling ASPM fixes this issue for him.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Should I take nvme.io_queue_depth=64 out? I didn't experience the problem again yet, just right after the first boot with the new kernel. However I still experience the error inside the VMs.

I'm adding nvme-core.default_ps_max_latency_us=1500 now.

Is this a Razer Blade Stealth user? It would be good to talk to him to see if he experiences the problems inside VMs, which is very annoying as I do everything in them.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

No, the user uses an XPS 9560. I think removing the io_queue_depth parameter should be safe.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

I've tested the new kernel with this parameter for 10 days and the error didn't happen, which is a record. So I decided to open a VM yesterday and it held up well for about 12 hours. Then I went to sleep, and when I grabbed the computer again the error had happened inside and outside the VM. I don't know if it was caused by the VM itself, but for those 10 days I tested, it worked great.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you attach the error? Maybe use something else like virtual box to see if KVM is the culprit?
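
Since the root filesystem goes read-only when this happens, the kernel log usually has to be captured from the running session before rebooting. A rough sketch of a filter for the relevant lines; `storage_errors` is a hypothetical name and the pattern list is only an assumption based on the messages quoted in this report:

```shell
# storage_errors: read kernel log text on stdin and keep only lines that
# look like the ext4 and I/O failures described in this report.
storage_errors() {
    grep -E 'EXT4-fs error|Aborting journal|Remounting filesystem read-only|I/O error'
}
```

For example, `dmesg | storage_errors` prints the matching lines; redirect the output to a USB stick or another writable location so it can be attached later.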

Revision history for this message
hariprasad (hariprasad) wrote :

I can confirm. I have had the same problems since 10.11.2017. The system works for some hours and then suddenly there are serious SSD problems. I replaced the SSD four times (warranty claims) and the whole new computer once (also a warranty claim).
HW: Intel NUC7i7BNH (Intel i7), Samsung EVO 960 M.2 NVMe, OPM Crucial 16GB.
SSD formatting: Partition table GPT, Primary Partitions EXT4
SW: Ubuntu Linux 18.04

Installation UEFI, legacy has no effect.

Revision history for this message
hariprasad (hariprasad) wrote :

Additionally, I can confirm that the problem is not limited to Samsung NVMe SSDs; it occurs on Intel SSDs as well. It looks like the problem is in the SSD driver.

Revision history for this message
hariprasad (hariprasad) wrote :

There is an issue which may be our problem. It looks like the workaround is to disable TRIM. https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi hariprasad. nvme-core.default_ps_max_latency_us=1500 and ASPM disabling by kernel patching worked for me, at least for the main machine (my virtual machines inside this main machine still give the error after some time). I've been using the patch for almost a month without incidents. I had the problem inside a virtual machine after some days, but it didn't happen again yet (I'm using VMs but not for too much time like in the last time the problem happened).

I'm going to try to disable TRIM but it's going to take days for me to test if the VMs give any errors. Should sudo rm /etc/cron.weekly/fstrim be enough?

Have you experienced problems installing Ubuntu onto your SSD? My Ubuntu installation gives a disk error in 8/10 tries. That is, I need to reinstall Ubuntu 8 times on average for the installation to end without any problems. I didn't try to install Ubuntu with the kernel patch because it's a lot of work to create a live CD installation with a patched kernel, but maybe I'll do it some day.

I also need to try virtualbox in place of virt-manager as kaihengfeng suggested. Problem is that I need to leave things open for days to notice these errors, so it's going to take time.
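
On the TRIM question above: removing the weekly cron job only stops periodic TRIM where that job exists. As far as I know, newer releases such as 18.04 schedule fstrim through the systemd unit fstrim.timer instead, and a `discard` mount option in /etc/fstab enables continuous TRIM regardless of either. A small sketch for the fstab part; `discard_mounts` is a hypothetical helper:

```shell
# discard_mounts [FSTAB]
# Print the mount point of every fstab entry whose option list contains
# the exact option "discard". Defaults to /etc/fstab.
discard_mounts() {
    awk '$1 !~ /^#/ && NF >= 4 {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "discard") { print $2; break }
    }' "${1:-/etc/fstab}"
}
```

If this prints nothing, continuous TRIM is not enabled through fstab. The timer can be inspected with `systemctl status fstrim.timer` and, for testing, turned off with `sudo systemctl disable --now fstrim.timer`.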

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Lucas,

I've found that the PCIe common clock may be the culprit here.
Please try the kernel [1].

[1] https://people.canonical.com/~khfeng/quirk-no-commclk/

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Could you send me the patch (and should I use the aspm patch together with it?)

Another piece of information that you might find useful: I've been using it for 30 days without any problems on the main machine (only one incident inside the virtual machine), so today I decided to finally apply all the missing updates that I had been putting off since the problems arose. It updated more than 1000 packages, and as before it gave the Read Only error while updating the initramfs. It seems that it ALWAYS happens when updating the initramfs.

Here's the output I was able to save: https://pastebin.com/raw/PaSQwRJN

So after 30 days without any problems it happens while doing this. It's gotta explain something, because it's very unusual, and I've always had problems with initram before.

Revision history for this message
Sam (samr28) wrote :

I also have the same issue on a Razer Blade 2017 - 7500U model. My system has the exact same drive in it. I have just installed the 4.18.0-3 kernel linked above and will post here if I run into the issue again.

Revision history for this message
hariprasad (hariprasad) wrote :

Hello, my Ubuntu Linux (17.10, and later 18.04) was out of order for 8-9 months. I did many installations and tests, replaced the M.2 SSD four times (accepted warranty claims), and replaced the whole NUC7i7BNH (accepted warranty claim). I logged an issue with Intel. Finally I installed Fedora 28, and the NVMe M.2 SSD is in good condition and works properly. I used the default LVM partition format. Ubuntu on VLN failed immediately during installation, when updates were applied. The problem is specific to Ubuntu. I didn't check which driver Fedora and Ubuntu use; the Fedora kernel is '4.17.12-200.fc28.x86_64 #1 x86_64 GNU/Linux', so I cannot tell whether the cause is in the driver or in the system settings. But in the end I can say that my warranty claims were unjustified. Sadly, in that case, the easiest workaround for me is to use a different Linux.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi hariprasad. If possible, you could try our patched kernel which disables ASPM: https://people.canonical.com/~khfeng/pm961-disable-aspm/. It worked for me, but only when I added the nvme-core.default_ps_max_latency_us=1500 kernel parameter. Maybe you can try it some day. We're still investigating the issue. When Kai-Heng sends me the patch I can try it and see what changes.

Revision history for this message
hariprasad (hariprasad) wrote :

Hello Lucas, thank you for the response. Yes, badly set kernel parameters can cause very serious problems. Additionally, I can confirm that the problem is not limited to Samsung NVMe SSDs; it occurs on the Intel SSD-6 series as well. It looks like the problem is in the ASPM SSD driver kernel parameters. A few errors occurred at once. The easiest way to reproduce the initramfs error during startup/restart is to install Thunderbird and download thousands of emails from the cloud, e.g. Google Mail, to generate traffic on the SSD. Then install and start Firefox, add plugins for video (Player) and start a video. Firefox for Linux (the last version was somewhere around 57-61) is unstable on Linux (generally, not only on Ubuntu), so Firefox begins to crash and asks "Would you like to restart and recover Firefox?". This stresses the SSD, and after a few restores (about 10) Thunderbird will probably begin to crash. That is the time to restart the system. Probably it will then not be possible to start Ubuntu, and an initramfs error will occur.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

It's worth saying that the ASPM patch + 1500 kernel parameter worked for me for over a month without giving me a single error. After updating to 18.04 I now see the error every 2 or 3 days. Actually, in the middle of the update process to 18.04 it gave the error right on the initramfs update, which is where it always gives the error. This is sad; it was not working perfectly inside the VMs, but otherwise it was very stable :(

Revision history for this message
Sam (samr28) wrote :

I'm also running the ASPM patch and haven't had problems for the last month or so. Any idea when this will get merged?

Revision history for this message
Janne Peltonen (janne-peltonen) wrote :

Stumbled onto this bug from somewhere else, and noticed that I seem to have the same Samsung SSD drive, SM961/PM961 (same output from lspci -vvnn regarding the NVMe as Lucas posted). However, for me it has worked without any problems on stock Ubuntu 18.04 / Mint / Kubuntu installations. Perhaps it depends on the system configuration as well instead of just the SSD? Not sure this information helps, but I thought I'd post it anyway.

Revision history for this message
Fabian (fabiangieseke) wrote :

I have had the same issues with Ubuntu 18.04 and a Samsung MZ-V7E1T0 1000GB M.2 PCI Express 3.0 and the default installation (ext4): Plenty of errors, especially when upgrading/installing packages via apt.

I have reinstalled the whole system. Instead of the standard journaling file system (ext4), I have btrfs for the root mount point (/). System works perfectly now, no errors for a couple of days with plenty of software being installed.

Not sure, might be an ext4/kernel bug (?).

Revision history for this message
Janne Peltonen (janne-peltonen) wrote :

To add to my previous comment, I've been running ext4 all the time.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Fabian, did you have any problems installing ubuntu? Mine would give disk errors about 7/10 times I tried to install. I had to try many times until no error appeared.

I'd like to try btrfs but I don't have the time to do it right now. I also had problems with apt, but when upgrading the system. It'd always give the error in the initramfs update, or something like that.

I'll try to install a fresh ubuntu 18.04 soon too, as Janne suggested.

Revision history for this message
Fabian (fabiangieseke) wrote :

I have tried two things:

(1) Fresh install, Ubuntu 18.04 (about ten days ago), ext4. No errors during the installation. However, when installing stuff via apt afterwards (or upgrading), I got many errors along the lines described above (e.g., "compressed data is corrupt... unexpected end of file or stream"). This happened for, I guess, arbitrary packages. No errors for initramfs update for me ...

(2) Fresh install, Ubuntu 18.04 (about four days ago), btrfs for /. No errors at all.

Revision history for this message
Ole Christian Nilsen (oc-nilsen) wrote :

I have this bug (MSI laptop, Ubuntu Studio 18.04) and it's getting quite annoying to be honest. If there's anything I can do to help remedy the situation within reasonable time (I'm about to reinstall) then let me know.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi Ole Christian. First, did you have any problems in the ubuntu installation? In mine I had to try to install several times until it installed without any disk errors.

Also, you can try this kernel https://people.canonical.com/~khfeng/pm961-disable-aspm/ with this kernel parameter "nvme-core.default_ps_max_latency_us=1500". This is what worked for me, but it's not a definitive solution, I still get the error in some situations (much more rare than before though). You can read our discussion to understand it better.

I guess someone is working on this bug for a definitive solution...

Revision history for this message
Ole Christian Nilsen (oc-nilsen) wrote :

Hi Lucas! Thanks for the reply. No, I had no problems during installation. The computer just shuts down at random intervals to a black screen with all kinds of EXT4-fs errors and reports that the file system is read only. Often the disk isn't even recognized at reboot, so I have to boot into a live environment and use Gparted to fix it from there.

I do music production professionally, so if I can't get it fixed relatively easily and permanently then I'll have to look elsewhere unfortunately.

Thanks though. :)

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Ok Christian, thanks for the info. You can try the kernel for now. I also read that using Ubuntu with btrfs instead of ext4 solves the problem; you could try that.

Revision history for this message
Ole Christian Nilsen (oc-nilsen) wrote :

I may try that. Are we sure it's a kernel bug though? I can't remember having this problem when I used Solus OS for a while. But I may not have used it for long enough, since I discovered it didn't support Jack2 and was pretty much unusable to me.

Revision history for this message
pleban (marek-zebrowski-gmail) wrote :

I can confirm this bug with two different NVMe drives, a Samsung EVO 970 and a WD Black, on the 4.18.0-10-generic and 4.15.0-20-generic kernels. The motherboard has an Intel H270 chipset.

Revision history for this message
Ole Christian Nilsen (oc-nilsen) wrote :

I have the WD Black 256 Gb drive.

Revision history for this message
pleban (marek-zebrowski-gmail) wrote :

The attachment contains the error from the dmesg output. For me the reproduction steps are: write a large amount (>10G) of data to the NVMe SSD.
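
A reproduction along those lines can be scripted so that corruption is detected by checksum rather than by waiting for the filesystem to go read-only. This is only a sketch under assumptions: `write_and_verify` is a hypothetical helper, it relies on GNU dd and sha256sum, and the read-back may be served from the page cache instead of the drive.

```shell
# write_and_verify DIR SIZE_MB
# Write SIZE_MB MiB of random data into DIR, read the file back and compare
# SHA-256 checksums. Prints OK or MISMATCH; returns non-zero on mismatch.
write_and_verify() {
    dir=$1
    size_mb=$2
    file="$dir/nvme-stress.$$"
    # Checksum the stream while it is being written to the file.
    written=$(dd if=/dev/urandom bs=1M count="$size_mb" iflag=fullblock 2>/dev/null \
        | tee "$file" | sha256sum | cut -d' ' -f1)
    sync
    # Checksum what the filesystem returns for the same file.
    readback=$(sha256sum < "$file" | cut -d' ' -f1)
    rm -f "$file"
    if [ "$written" = "$readback" ]; then
        echo OK
    else
        echo MISMATCH
        return 1
    fi
}
```

For the scenario above that would be something like `write_and_verify /path/on/nvme 10240`. To make the read-back hit the drive, the page cache can be dropped first with `echo 3 | sudo tee /proc/sys/vm/drop_caches`.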

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

What's the PCI ID for EVO 970 and WD Black?

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

If you use Samsung (144d:a804) or Sk Hynix (1c5c:1285), please try kernel in [1].

[1] https://people.canonical.com/~khfeng/lp1785715/

Revision history for this message
pleban (marek-zebrowski-gmail) wrote :

My Samsung is indeed [144d:a808]. I'll check WD later on - it's not connected at this time.
I was not able to reproduce this bug using Clear Linux current kernel (4.18.16-645).
I checked kernel https://people.canonical.com/~khfeng/lp1785715/ with no nvme-core.default_ps_max_latency_us= settings and I was not able to reproduce the issue with my "copy lots of data" scenario that triggered the bug every time yesterday.
So it looks like success! I'll keep using that kernel for now and report if any problems arise.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

The kernel doesn't do anything special for 144d:a808, it's for 144d:a804.

Revision history for this message
pleban (marek-zebrowski-gmail) wrote :

Then I'm puzzled. I'll retest later with WD.

Revision history for this message
Janne Peltonen (janne-peltonen) wrote :

Here is the output of lspci -vvnn on my computer. It's from the 256GB version of the samsung NVMe.
On my systems I've never had any corruption problems, even moving large (60GB+) VM files and installing OS on ext4 multiple times. Currently running stock LM 19. Hope this helps.

Revision history for this message
Ole Christian Nilsen (oc-nilsen) wrote :

lshw output:

*-storage
                description: Non-Volatile memory controller
                product: Sandisk Corp
                vendor: Sandisk Corp
                physical id: 0
                bus info: pci@0000:04:00.0
                version: 00
                width: 64 bits
                clock: 33MHz
                capabilities: storage pm pciexpress msix nvm_express bus_master cap_list
                configuration: driver=nvme latency=0
                resources: irq:16 memory:df100000-df103fff

lspci output:

04:00.0 Non-Volatile memory controller: Sandisk Corp WD Black NVMe SSD

Revision history for this message
Ole Christian Nilsen (oc-nilsen) wrote :

lspci -vvnn:

04:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black NVMe SSD [15b7:5001] (prog-if 02 [NVM Express])
        Subsystem: Marvell Technology Group Ltd. WD Black NVMe SSD [1b4b:1093]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        NUMA node: 0
        Region 0: Memory at df100000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: nvme
        Kernel modules: nvme

Revision history for this message
Ole Christian Nilsen (oc-nilsen) wrote :

I should probably mention that while it is installed in my laptop it is not currently being used as I had to revert to using an ordinary HDD.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Any news on this problem? I'm still having it.

Revision history for this message
Richard Grieves (trickydickie) wrote :

I too am having an SSD corruption issue with Ubuntu 18.04, with exactly the same symptoms. I have a Kingston 480 GB SSD, not NVMe, connected over SATA. My PC is a desktop; I have attached the output of lspci -vvnn. I have to do a manual fsck every 1.5 weeks or so. When I am using my PC, it occasionally freezes for about 15 seconds with very high SSD I/O usage. I have attached an iotop log which recorded a freeze at around 18:03:22 (the log records every 1 second, and you will see a gap between the recordings at 18:03:22 and 18:03:35, which indicates the freeze, followed by 90%+ I/O). I have included my SSD SMART info as well as my current lsblk output below:

=== START OF INFORMATION SECTION ===
Device Model: KINGSTON SA400S37480G
Serial Number: 50026B76825B4FA0
LU WWN Device Id: 5 0026b7 6825b4fa0
Firmware Version: SBFKB1C2
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Feb 25 17:57:34 2019 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

lsblk:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0    7:0    0   3.7M  1 loop /snap/gnome-system-monitor/57
loop1    7:1    0    13M  1 loop /snap/gnome-characters/103
loop2    7:2    0    91M  1 loop /snap/core/6350
loop3    7:3    0   3.7M  1 loop /snap/gnome-system-monitor/51
loop4    7:4    0   2.3M  1 loop /snap/gnome-calculator/180
loop5    7:5    0 140.7M  1 loop /snap/gnome-3-26-1604/78
loop6    7:6    0 270.5M  1 loop /snap/pycharm-community/112
loop7    7:7    0  86.9M  1 loop /snap/core/4917
loop8    7:8    0    91M  1 loop /snap/core/6405
loop9    7:9    0  14.5M  1 loop /snap/gnome-logs/45
loop10   7:10   0 140.7M  1 loop /snap/gnome-3-26-1604/74
loop11   7:11   0    13M  1 loop /snap/gnome-characters/139
loop12   7:12   0  14.5M  1 loop /snap/gnome-logs/37
loop13   7:13   0   2.3M  1 loop /snap/gnome-calculator/260
loop14   7:14   0  34.7M  1 loop /snap/gtk-common-themes/319
loop15   7:15   0  34.6M  1 loop /snap/gtk-common-themes/818
loop16   7:16   0 140.9M  1 loop /snap/gnome-3-26-1604/70
loop17   7:17   0  34.8M  1 loop /snap/gtk-common-themes/1122
sda      8:0    0 447.1G  0 disk
└─sda1   8:1    0 447.1G  0 part /

Revision history for this message
Richard Grieves (trickydickie) wrote :

Here is the iotop log I mentioned above (attached)

Revision history for this message
Richard Grieves (trickydickie) wrote :

My issue has been resolved by upgrading the firmware of my SSD from SBFKB1C2 to SBFKB1C3.

https://askubuntu.com/questions/1107053/ubutnu-18-04-ssd-sometimes-freeze-for-seconds
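
Others who want to compare firmware revisions can pull the field out of the same `smartctl -i` information section shown earlier. A minimal sketch, assuming the "Firmware Version:" label appears as in that output; `firmware_version` is a hypothetical helper name:

```shell
# firmware_version: read `smartctl -i` text on stdin and print the value
# of the "Firmware Version:" field.
firmware_version() {
    sed -n 's/^Firmware Version:[[:space:]]*//p'
}
```

Usage would be along the lines of `sudo smartctl -i /dev/sda | firmware_version`, substituting the device in question.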

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Just tried Ubuntu 19 today and the problem persists (can't even install ubuntu because it gives io error)

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi Kai-Heng Feng, do you have any news on this problem? It'd be great to know.

Thank you so much!

Revision history for this message
Fabian (fabiangieseke) wrote :

Hi,

a little update from my side: It seems that faulty memory was the reason for the data corruption in my case. I have replaced the memory module and everything seems to work fine now. I was quite surprised though that the memory was defective, since I did test it carefully for many hours with memtest (20+ passes without any errors). The errors only occurred when running Ubuntu ...

The memory was the only thing I have changed, so I am very sure that this was the cause ...

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Lucas,
Do you still have this issue on mainline kernel?

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

I tried the Ubuntu 19.04 installer and I couldn't even install it because of I/O errors. Does the Ubuntu 19.04 installer use the new kernel?

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi Kai-Heng Feng, I just installed kernel 5.1.1 and the error still happens

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Disable ASPM. Only compile tested.

Revision history for this message
WinEunuchs2Unix (ricklee518) wrote :

I've been using NVMe M.2 Samsung Pro 960 for 18 months and never had a problem.
Ubuntu 16.04.6 LTS, Kernel 4.14.114 LTS, Skylake 6700HQ, nVidia 970m
UEFI, GPT, AHCI (Intel Raid off), Secure Boot off

`/etc/fstab`:

UUID=b40b3925-70ef-447f-923e-1b05467c00e7 / ext4 errors=remount-ro 0 1
UUID=D656-F2A8 /boot/efi vfat umask=0077 0 1
UUID=b4512bc6-0ec8-4b17-9edd-88db0f031332 none swap sw 0 0

`/etc/default/grub`:
GRUB_CMDLINE_LINUX_DEFAULT="noplymouth fastboot acpiphp.disable=1 pcie_aspm=force vt.handoff=7 i915.fastboot=1 nopti nospectre_v2 nospec"

I've never had a single fsck error ever. Granted the `grub` boot option `fastboot` means `fsck` is not run on boot but I can check once FS is mounted RW with:

$ sudo fsck -n /dev/nvme0n1p6
fsck from util-linux 2.27.1
e2fsck 1.42.13 (17-May-2015)
Warning! /dev/nvme0n1p6 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
New_Ubuntu_16.04: clean, 712096/2953920 files, 5733245/11829504 blocks

Assuming your `/etc/fstab` is the same, the two important `grub` boot parameters are: `acpiphp.disable=1 pcie_aspm=force`. If memory serves me correctly, though, these were set up for suspend/resume reasons.
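
To confirm which of these parameters the running kernel actually received, one option is to parse the kernel command line. The following is a minimal Python sketch, not a tool from this thread: the helper names `parse_cmdline` and `read_booted_params` are made up for illustration, and the simple whitespace split assumes no parameter value contains quoted spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {parameter: value}.

    Bare flags such as "fastboot" map to None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_booted_params(path="/proc/cmdline"):
    """Parse the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return parse_cmdline(f.read())


# Example with the GRUB line quoted above; no file access needed.
example = parse_cmdline(
    "noplymouth fastboot acpiphp.disable=1 pcie_aspm=force vt.handoff=7"
)
print(example["pcie_aspm"], example["acpiphp.disable"])
```

On a live system, `read_booted_params().get("pcie_aspm")` would show whether the ASPM setting from `/etc/default/grub` actually reached the kernel.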

I hope this helps those affected by this bug a little, but more importantly, that people realize the vast majority of NVMe installations work fine in Linux.

Changed in linux (Ubuntu):
assignee: Kai-Heng Feng (kaihengfeng) → nobody
Brad Figg (brad-figg)
tags: added: cscc
Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi, I didn't find the root of the problem, **BUT**...

Using Qubes OS I was able to run for more than 7 days without any problems! It normally would occur in the first hour of usage.

I guess Qubes's Xen drivers proxy the PCIe requests, and therefore the failing NVMe/PCIe drivers aren't used. So it at least sheds some light on the problem.

Maybe someone with better understanding of how Xen works can reason about which drivers are being used and which are not and discover the root of the problem!

Revision history for this message
Juan Carlos Carvajal Bermúdez (jucajuca) wrote :

I have exactly the same drive:

/dev/nvme0n1 S444NY0K600040 SAMSUNG MZVLB256HAHQ-00000 1 81.09 GB / 256.06 GB 512 B + 0 B EXD7101Q

and exactly the same problem.

I filed a bug before deactivating AER (pci=noaer):

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852479

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi Juan, what computer are you on?

If you really want to use your computer with Linux, the only thing that solved it for me was using Qubes OS.

Revision history for this message
trong luu (tronglx) wrote :

Hi Lucas, I have the same problem. My laptop is a MateBook X Pro 2018 with a LITEON CA3-8D512 NVMe drive. I work on Linux, and it is an unpleasant experience. Currently, when the issue occurs, I power the laptop off and on (once or more) and it works normally for a few hours. Do you have any other suggestions about Linux distributions?

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi tronglx, the only way I found was to install Qubes OS

Revision history for this message
trong luu (tronglx) wrote :

Thanks Lucas, have you tried Arch Linux? Qubes OS is very unfamiliar to me. I'm a developer, and the OS community is very important to me.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi tronglx,

I didn't try Arch, but I THINK I tried Manjaro, which is based on it. If I did, it didn't work. I remember trying lots of Linux distributions and all of them failed.

Qubes OS works because it doesn't run the Linux kernel directly on the hardware; it runs on the Xen hypervisor, so somehow it avoids the bug. You can install Arch as a Qubes VM: there's a script for it, you just run it and it generates an image that you can install. They also provide Ubuntu, Kali and others.

Since this bug is rare, I don't think they'll try to fix it; the person who was helping here gave up.

Revision history for this message
trong luu (tronglx) wrote :

Thanks Lucas,
It just happened on my laptop. I will try to find a solution.

Revision history for this message
trong luu (tronglx) wrote :

I switched to recovery mode and ran: mount -o remount,rw /. The problem no longer appears; it seems to be fixed.

Revision history for this message
trong luu (tronglx) wrote :

The error still happens.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi trong luu, for me the error happened every day, which is why I ended up using Qubes. It's the only way that I could find, other than Windows.

You can try older kernels, but that didn't work for me. Remember that downloading older Ubuntu releases will still give you a recent kernel; you have to downgrade it yourself. However, no Ubuntu or Debian kernels fixed the problem for me.

Revision history for this message
trong luu (tronglx) wrote :

I think another type of SSD is the last option. But I really want to find the root cause of the problem. As I understand it, the system boots with Opts: errors=remount-ro. Then something goes wrong, and the system switches to read-only mode to protect the file system. Have you checked the system log? Does it show anything abnormal? NVMe is becoming more and more popular, so this is a big problem for Linux users.

Revision history for this message
Juan Carlos Carvajal Bermúdez (jucajuca) wrote :

For anyone struggling with this hideous bug, try the following:

add "nvme_core.default_ps_max_latency_us=250" in /etc/default/grub, for example:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off nvme_core.default_ps_max_latency_us=250"

then run "update-grub"
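
The edit described above can be sketched in Python for anyone who prefers to script it. This is only an illustration under stated assumptions: the function name `add_kernel_params` is hypothetical, it works on the text of `/etc/default/grub` rather than writing the file, and it assumes the `GRUB_CMDLINE_LINUX_DEFAULT` value is on one line in double quotes.

```python
import re

# Matches the quoted value of GRUB_CMDLINE_LINUX_DEFAULT at the start of a line.
CMDLINE_RE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"\n]*)(")', re.MULTILINE)


def add_kernel_params(grub_text, new_params):
    """Return grub_text with new_params merged into GRUB_CMDLINE_LINUX_DEFAULT.

    A parameter that is already present (same name before "=") is replaced
    in place, so running the function twice gives the same result.
    """
    def merge(match):
        params = match.group(2).split()
        for new in new_params:
            key = new.split("=", 1)[0]
            for i, old in enumerate(params):
                if old.split("=", 1)[0] == key:
                    params[i] = new
                    break
            else:
                params.append(new)
        return match.group(1) + " ".join(params) + match.group(3)

    return CMDLINE_RE.sub(merge, grub_text)


before = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
after = add_kernel_params(
    before, ["pcie_aspm=off", "nvme_core.default_ps_max_latency_us=250"]
)
print(after)
```

The result still has to be written back to `/etc/default/grub` as root and followed by "update-grub"; keeping a backup copy of the original file first is a sensible precaution.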

My laptop has been running smoothly for a week now. (/dev/nvme0n1 S444NY0K600040 SAMSUNG MZVLB256HAHQ-00000 1 81.09 GB / 256.06 GB 512 B + 0 B EXD7101Q)

see more infos here: https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe

@kernel developers, would it not be great to detect such disks and automatically lower nvme_core.default_ps_max_latency_us? This bug is really hard to detect and solve because there are NO logs whatsoever. The disk goes read-only, you know?

Revision history for this message
trong luu (tronglx) wrote :

Hi Juan, I have tried your suggestion many times, but the problem still happens. I also tried nvme_core.default_ps_max_latency_us=0, with no luck. I'm not sure APST is actually disabled. How can I check the APST status once the system has booted?
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/core.c#n2282

Revision history for this message
Juan Carlos Carvajal Bermúdez (jucajuca) wrote :

Hi
try:

cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

sudo nvme get-feature -f 0x0c -H /dev/nvme0

Please read the link provided carefully. The info is there.
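
The first check above can also be scripted. Below is a small Python sketch that reads the same sysfs file and interprets the value; the function names are hypothetical, and the interpretation (0 turns APST off, any other value is an upper latency limit in microseconds) reflects my understanding of the `nvme_core` parameter rather than anything confirmed in this thread.

```python
from pathlib import Path

# Same sysfs file as in the "cat" command above.
APST_PARAM = "/sys/module/nvme_core/parameters/default_ps_max_latency_us"


def parse_latency(text):
    """Convert the file content to an int, or None if it is not a number."""
    try:
        return int(text.strip())
    except ValueError:
        return None


def read_apst_max_latency(path=APST_PARAM):
    """Return the configured limit in microseconds, or None if unreadable."""
    try:
        return parse_latency(Path(path).read_text())
    except OSError:
        return None


def describe(latency_us):
    """Turn the raw value into a short human-readable statement."""
    if latency_us is None:
        return "nvme_core parameter not readable (module not loaded?)"
    if latency_us == 0:
        return "APST disabled by the driver (limit is 0)"
    return f"APST allowed, latency limit {latency_us} us"


print(describe(read_apst_max_latency()))
```

This only reports what the driver was asked to do. The `nvme get-feature -f 0x0c` command remains the way to see what the controller itself reports.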

Revision history for this message
trong luu (tronglx) wrote :

After running the cat /sys/module/nvme_core/parameters/default_ps_max_latency_us command, the output is 0.
sudo nvme get-feature -f 0x0c -H /dev/nvme0n1p2
get-feature:0xc (Autonomous Power State Transition), Current value:00000000
 Autonomous Power State Transition Enable (APSTE): Disabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 2]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 5]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 6]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 7]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 8]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 9]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[10]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[11]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[12]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[13]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[14]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[15]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[16]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[17]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[18]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[19]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[20]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
...

Revision history for this message
trong luu (tronglx) wrote :

Hi Lucas, would it be OK to install Windows and use Ubuntu in VMware?

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi trong luu, I didn't test it, but I think that it depends on the way VMware virtualizes access to the disk. There may be multiple ways, one of which will work.

Revision history for this message
trong luu (tronglx) wrote :

Hi Lucas, have you tried a new SSD? I don't think this is a hardware issue. My SSD's Power Cycles count is only 807. Eventually, if there is no other solution, I think I will buy a new SSD. Do you know which type of SSD works properly with Linux?
Smartctl output:
sudo smartctl -t long -a /dev/nvme0n1p2
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.0.0-37-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: LITEON CA3-8D512
Serial Number: 0028104000DN
Firmware Version: C49640A
PCI Vendor ID: 0x14a4
PCI Vendor Subsystem ID: 0x1b4b
IEEE OUI Identifier: 0x002303
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Thu Dec 19 08:54:28 2019 +07
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x001f): Security Format Frmw_DL NS_Mngmt *Other*
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 32 Pages
Warning Comp. Temp. Threshold: 83 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
 0 + 8.00W - - 0 0 0 0 0 0
 1 + 4.50W - - 1 1 1 1 5 5
 2 + 3.00W - - 2 2 2 2 5 5
 3 - 0.0700W - - 3 3 3 3 1000 5000
 4 - 0.0100W - - 4 4 4 4 5000 45000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
 0 - 512 0 1
 1 - 4096 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning: 0x00
Temperature: 47 Celsius
Available Spare: 100%
Available Spare Threshold: 0%
Percentage Used: 0%
Data Units Read: 5,773,150 [2.95 TB]
Data Units Written: 6,405,757 [3.27 TB]
Host Read Commands: 78,674,228
Host Write Commands: 91,754,035
Controller Busy Time: 10,405
Power Cycles: 807
Power On Hours: 312
Unsafe Shutdowns: 104
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 47 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
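
The "Supported Power States" table above can be related to the `nvme_core.default_ps_max_latency_us` values tried earlier in the thread. The Python sketch below is a simplified model, not the driver's code: it assumes that APST only targets non-operational states (marked `-` in the `Op` column) whose exit latency does not exceed the limit, and that a limit of 0 turns APST off. That rule is my reading of the driver and may differ between kernel versions; the names `PowerState` and `apst_candidates` are made up for illustration.

```python
from typing import NamedTuple


class PowerState(NamedTuple):
    index: int
    operational: bool   # "+" in the Op column
    entry_lat_us: int   # Ent_Lat column
    exit_lat_us: int    # Ex_Lat column


# Values copied from the smartctl table above (LITEON CA3-8D512).
LITEON_STATES = [
    PowerState(0, True, 0, 0),
    PowerState(1, True, 5, 5),
    PowerState(2, True, 5, 5),
    PowerState(3, False, 1000, 5000),
    PowerState(4, False, 5000, 45000),
]


def apst_candidates(states, max_latency_us):
    """Return indices of states APST could target under the assumed rule."""
    if max_latency_us == 0:
        return []  # assumed: 0 disables APST entirely
    return [
        s.index
        for s in states
        if not s.operational and s.exit_lat_us <= max_latency_us
    ]


for limit in (0, 250, 100000):
    print(limit, apst_candidates(LITEON_STATES, limit))
```

Under this model, limits of 0 and 250 both leave no candidate state for this drive, so either setting should keep it out of its two deep power states.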

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi trong luu. I did buy a new SSD because I thought mine was faulty. However, I bought one of the same brand (Samsung). It didn't occur to me to buy another brand. Anyway, the brand new SSD also has the problem.

In my case it is definitely not a hardware problem. With Linux the problem happens every day, sometimes more than once per day. With Windows the error never happened, and with Qubes I have been running for more than 2 months without any problems. So it's not hardware; something is definitely wrong with the Linux kernel.

Revision history for this message
trong luu (tronglx) wrote :

Thanks Lucas, I think I will buy another type of SSD. Do you have any suggestions?

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

The other 2 good brands I know are Corsair and WD Black. Don't buy Samsung, the majority of people with this problem have Samsung

Revision history for this message
trong luu (tronglx) wrote :

Thank you. My SSD is a LITEON CA3-8D512, not a Samsung. So, should I buy another type of SSD? A non-NVMe one?

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Non-NVMe SSDs are pretty slow, like 8 times slower. Stick with NVMe, and if nothing works, install Qubes.

Revision history for this message
trong luu (tronglx) wrote :

Thanks Lucas.

Revision history for this message
Craigums Carlonious (craigsidcarlson) wrote :

It's 2020, is there still no solution to this problem? Getting this error with ubuntu 18 LTS and 19

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Craigums Carlonious,

Is the system exactly the same?

Revision history for this message
Craigums Carlonious (craigsidcarlson) wrote :

Hi, yes, I am also trying to install onto the Razer Blade Stealth, like a lot of the other people above who have the SAMSUNG MZVLB256HAHQ-00000 NVMe SSD. I'm getting the I/O error and have tried most of the fixes mentioned above, but no luck, and I would rather continue using Windows than switch to Qubes.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you please attach `sudo nvme id-ctrl /dev/nvme0`?

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Hi Kai-Heng Feng, please note that after I installed Qubes, I never ever had the problem again. It may be useful in the debug process, and maybe the way Xen, PCIe and Linux work together in Qubes can give a hint on what's happening. Thank you for all your help to this day.

Revision history for this message
Juan Carlos Carvajal Bermúdez (jucajuca) wrote :

an update on this:

it was actually pcie_aspm=off that helped to solve the problem.

I think the problem is related to the power management of PCIe ports.

Without pcie_aspm=off I started seeing errors like the following ones:

- [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=7841 end=7842) time 291 us, min 1063, max 1079, scanline start 1044, end 1092

 pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
 pcieport 0000:00:1d.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
 pcieport 0000:00:1d.0: AER: device [8086:a330] error status/mask=00000001/00002000

I think the bug is not with the nvme controller but somewhere in ASPM. But I am not a kernel developer.
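
For anyone who wants to check whether their own logs show the same pattern, the corrected AER errors can be counted per PCIe port. This is a rough Python sketch: the function name `count_corrected_aer` is made up, and the regular expression only matches the message format quoted above, which may be worded differently on other kernel versions.

```python
import re
from collections import Counter

# Matches lines like:
#   pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
AER_CORRECTED = re.compile(r"pcieport (\S+): AER: Corrected error received")


def count_corrected_aer(log_text):
    """Count corrected AER errors per reporting PCIe port in dmesg-style text."""
    return Counter(m.group(1) for m in AER_CORRECTED.finditer(log_text))


sample = """\
pcieport 0000:00:1d.0: AER: Corrected error received: 0000:00:1d.0
pcieport 0000:00:1d.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
pcieport 0000:00:1d.0: AER: device [8086:a330] error status/mask=00000001/00002000
"""
print(count_corrected_aer(sample))
```

Feeding it the output of `dmesg` before and after adding pcie_aspm=off would show whether the corrected errors on a given port actually stop.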

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Juan, which hardware are you on? Razer?

Revision history for this message
Juan Carlos Carvajal Bermúdez (jucajuca) wrote :

No, I have a laptop from XMG, a German brand.

Revision history for this message
Ramon Fontes (ramonreisfontes) wrote :

Hello all!

I'm experiencing the same problem with an ADATA SU800NS38. My SSD works fine with the 4.17.0-041700-generic kernel version, but unfortunately this is the only kernel version with which it works perfectly. Besides other 4.x kernel versions, I also tried 5.0 through 5.5. The disk becomes read-only during use and I need to run fsck whenever I start the system.

Revision history for this message
Ramon Fontes (ramonreisfontes) wrote :

BTW, pcie_aspm=off and nvme_core.default_ps_max_latency_us=5500 didn't work.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

Ramon, which hardware is yours? Razer?

Revision history for this message
Ramon Fontes (ramonreisfontes) wrote :

I have a Dell Inspiron 14 5000 Series-5480. The strangest thing is that I bought my laptop about a year ago and installed Ubuntu 18.04.1 with kernel 4.15.0-65-xxx (default installation), and everything worked as expected. However, the same problem happened with any other kernel version (including 4.17.0-041700-generic).

Then, after having some problems with my system, I installed Ubuntu 18.04.4. The kernel version installed with the system was 5.3, and after observing the same problem with the disk I tried installing 4.15.0-65, but that did not solve the problem (I don't remember exactly which kernel version I had the first time, i.e. what the xxx was). Finally, I found that 4.17.0-041700-generic works, and I don't know why. It didn't work with Ubuntu 18.04.1 and it works with 18.04.4. This is really strange, and I need to use more recent kernel versions because I need some features I've implemented for v5.5-rc1.

[1] https://github.com/torvalds/linux/commit/b5764696ac409523414f70421c13b7e7a9309454#diff-21081ef83e1374560c2e244926168e49
[2] https://github.com/torvalds/linux/commit/7dfd8ac327301f302b03072066c66eb32578e940#diff-21081ef83e1374560c2e244926168e49

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

I also had this experience: it worked for a year, then I updated it and it stopped working. Then I rolled back the kernel and it still wouldn't work.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Ramon,
Please file a separate bug since it's platform specific.

Revision history for this message
Ramon Fontes (ramonreisfontes) wrote :

I thought I could help in some way with more information. By the way, I've found the solution and my SSD works fine right now. You may want to take a look at https://bugzilla.kernel.org/show_bug.cgi?id=201685. Comment #294 (https://bugzilla.kernel.org/show_bug.cgi?id=201685#c294), in particular, helped me to solve the problem.

Revision history for this message
Lucas Zanella (lucaszanella) wrote :

I just want to say that after 2 years I remembered I had an SSD of a different brand than Samsung, a Kingston one. I installed it in my Razer and it worked perfectly for days; I ran several SSD stress tests with no errors.

The error is definitely with Samsung AND Linux. And it's not a faulty SSD, because it happens on both of my Samsung SSDs. It does not happen on Windows or Qubes with any SSD.

I tested the latest Ubuntu 21.04 and the problem still happens on Samsung SSDs right on the installation screen.

Anyway, I'm not even using this computer anymore; I bought a Dell XPS 13. But the error persists, and it's either Samsung's or Linux's fault. Probably Samsung's, since other brands work OK with Linux.

Revision history for this message
Anthony Durity (anthony-durity) wrote :

I've hit this "bug". I've a nice Clevo ODM based laptop and luckily I have two nvme drives in it so it's not a show-stopper for me but obv. it's a concern. I have an Intel one which is the boot drive and a Samsung one which is the data drive. I have a dual-boot setup. So two data points to note. The Intel nvme works in both Windows and Linux. The Samsung works in Windows, but not in Linux. When I say that it doesn't work in Linux I should say that the system brings the drive up, I can mount it read-write, everything looks good but as soon as I try and write files to it it craps out with nothing written:

[369.798910] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[369.798916] nvme nvme0: Does your device have a faulty power saving mode enabled?
[369.798918] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[369.870912] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[369.871064] nvme nvme0: Removing after probe failure status: -19
[369.890931] nvme0n1: detected capacity change from 1953525168 to 0

Output of `dmesg` attached.
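
Since several reports in this thread hinge on spotting a few specific kernel messages, here is a small Python sketch that scans dmesg-style text for them. The signature names and the function `find_signatures` are made up for illustration. The patterns cover only the NVMe messages quoted in this comment and the ext4 read-only remount from the bug description, so treat it as a starting point rather than a complete detector.

```python
import re

# Message patterns taken from logs quoted in this bug report.
SIGNATURES = {
    "nvme_controller_down": re.compile(r"nvme \S+: controller is down; will reset"),
    "nvme_removed": re.compile(r"nvme \S+: Removing after probe failure"),
    "ext4_remount_ro": re.compile(r"EXT4-fs \(\S+\): Remounting filesystem read-only"),
}


def find_signatures(log_text):
    """Return {signature name: [matching lines]} for dmesg-style text."""
    hits = {}
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.setdefault(name, []).append(line.strip())
    return hits


sample = """\
[369.798910] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[369.870912] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[369.871064] nvme nvme0: Removing after probe failure status: -19
[62984.379343] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
"""
for name, lines in find_signatures(sample).items():
    print(name, len(lines))
```

Run against a saved `dmesg` output, this gives a quick indication of whether a machine shows the controller reset, the read-only remount, or both.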

Revision history for this message
tetebueno (tetebueno) wrote :

Bump:

PC: Lenovo Legion Y520-15IKBN

SSD: Samsung SM951 M.2 PCIe SSD Drive (MZ-HPV256)

OS: Elementary OS 7.1 Horus (Ubuntu 22.04.1)

Kernel: 6.5.0-14-generic

---

lshw relevant parts:

computer
     description: Notebook
    product: 80WK (LENOVO_MT_80WK_BU_idea_FM_Lenovo Y520-15IKBN)
    vendor: LENOVO
    version: Lenovo Y520-15IKBN
    serial: PF0UM7F3
    width: 64 bits
    capabilities: smbios-3.0.0 dmi-3.0.0 smp vsyscall32
    configuration: administrator_password=disabled boot=normal chassis=notebook family=IDEAPAD frontpanel_password=disabled keyboard_password=disabled power-on_password=disabled sku=LENOVO_MT_80WK_BU_idea_FM_Lenovo Y520-15IKBN uuid=e974c0b6-54a9-11ef-8ff5-54e13d454041
(...)
              *-disk
                   description: ATA Disk
                   product: SAMSUNG MZHPV256
                   physical id: 0.0.0
                   bus info: scsi@3:0.0.0
                   logical name: /dev/sdb
                   version: 500Q
                   serial: S1X2NYAG810617
                   size: 238GiB (256GB)
                   capabilities: gpt-1.00 partitioned partitioned:gpt
                   configuration: ansiversion=5 guid=094df7bd-71f3-4587-b36f-8cb0bd3ba964 logicalsectorsize=512 sectorsize=512

---

Update: I first tried the changes in comment #190, but that didn't work; the error persisted. Then I tried adding only the pcie_aspm=off parameter (removing the nvme_core.default_ps_max_latency_us parameter) and things got better: I was without errors for about three weeks straight, then the error manifested again. One thing to note is that the day of the failure was the only one on which the computer was suspended intentionally. For now I'll keep this configuration, as it seems to be the most "stable" one.
