Samsung SSD corruption (fsck needed)

Bug #1746340 reported by Lucas Zanella on 2018-01-30
This bug affects 9 people
Affects: linux (Ubuntu)
Importance: High
Assigned to: Kai-Heng Feng

Bug Description

Ubuntu 4.13.0-21.24-generic 4.13.13

I have a Razer Blade Stealth 2016. The first Ubuntu release I installed was 17.04, which gave me this error after 2 weeks of usage. After that, I installed 16.04 and used it for MONTHS without any problems, until it produced the same error this week. I think it has to do with the Ubuntu updates, because I ran one recently and another today, just before this problem appeared. It could be a coincidence, though.

I notice the error when I try to save something to disk and it tells me that the file system is read-only:

lz@lz:/var/log$ touch something
touch: cannot touch 'something': Read-only file system

lz@lz:/var/log$ cat syslog
Jan 29 01:07:39 lz kernel: [62984.375393] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0

lz@lz:/var/log$ dmesg
[62984.375393] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.377374] Aborting journal on device nvme0n1p2-8.
[62984.379343] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
[62984.379516] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.381486] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.383484] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.385469] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.387278] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.389262] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.391252] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.393341] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[63285.618078] audit: type=1400 audit(1517195560.393:63): apparmor="DENIED" operation="capable" profile="/usr/sbin/cupsd" pid=22495 comm="cupsd" capability=12 capname="net_admin"
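
The log above repeats one EXT4 error many times, followed by a read-only remount. As a minimal sketch (not part of the original report), the following Python snippet counts such errors per device and notes which devices were remounted read-only. The function name `summarize_ext4_errors` is made up for illustration, and the regular expressions assume the message format shown in the dmesg excerpt above.

```python
import re

# Patterns assume the message format seen in the dmesg excerpt above.
EXT4_ERROR = re.compile(r"EXT4-fs error \(device (?P<dev>[^)]+)\)")
READ_ONLY = re.compile(r"EXT4-fs \((?P<dev>[^)]+)\): Remounting filesystem read-only")


def summarize_ext4_errors(log_text):
    """Count EXT4-fs errors per device and collect devices remounted read-only."""
    errors = {}
    remounted = set()
    for line in log_text.splitlines():
        match = EXT4_ERROR.search(line)
        if match:
            dev = match.group("dev")
            errors[dev] = errors.get(dev, 0) + 1
            continue
        match = READ_ONLY.search(line)
        if match:
            remounted.add(match.group("dev"))
    return errors, remounted


# Sample lines copied from the report.
sample = """\
[62984.375393] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
[62984.377374] Aborting journal on device nvme0n1p2-8.
[62984.379343] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
[62984.379516] EXT4-fs error (device nvme0n1p2): ext4_find_entry:1442: inode #26607929: comm updatedb.mlocat: checksumming directory block 0
"""
errors, remounted = summarize_ext4_errors(sample)
print("errors per device:", errors)
print("remounted read-only:", remounted)
```

On a live system the same function could be fed the text of `dmesg` or `/var/log/syslog` instead of the sample string.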

Rebooting Ubuntu gives me a black terminal where I can run fsck /dev/nvme0n1p2 (something like that), and it will fix a lot of orphaned inodes. Most of the time it then boots back into a working Ubuntu, but sometimes it boots into a broken Ubuntu (no images, lots of things broken). Then I have to reinstall Ubuntu.

Every time I reinstall Ubuntu, I have to try many times until it installs without an Input/Output error. Once it is installed, I can use it for some hours without hitting the problem, but if I run the software updates, it ALWAYS crashes and goes into read-only mode, specifically while installing kernel updates.

I noticed that Ubuntu installs updates automatically when they're for security reasons. Could this be the reason my Ubuntu worked for months without the problem, but then an update was applied and it broke?

I thought that this bug was happening: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184 and tried different nvme_core.default_ps_max_latency_us= values; all of them gave errors. I then changed it to 0 and had no errors while using Ubuntu (however, I didn't test for long), but I still got the error when trying to update Ubuntu.
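
When testing different values of this parameter, it helps to confirm which value the running kernel actually booted with. The sketch below (an illustration, not something from the thread) extracts a parameter from a kernel command line string. The helper name `kernel_param` is hypothetical, and treating `-` and `_` in the parameter name as interchangeable is an assumption about how the kernel matches parameter names.

```python
def kernel_param(cmdline, name):
    """Return the value of `name` from a kernel command line, or None if absent.

    Dashes and underscores in the name are treated as equivalent; this is an
    assumption about kernel behaviour, made so that both spellings are found.
    """
    wanted = name.replace("-", "_")
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key.replace("-", "_") == wanted:
            return value
    return None


# Command line taken from the ProcKernelCmdLine field of this report.
cmdline = (
    "BOOT_IMAGE=/boot/vmlinuz-4.13.0-21-generic.efi.signed "
    "root=UUID=0ca062da-7e8f-425a-88b1-1f784fb40346 ro quiet splash "
    "button.lid_init_state=open nvme_core.default_ps_max_latency_us=0"
)
value = kernel_param(cmdline, "nvme_core.default_ps_max_latency_us")
print("default_ps_max_latency_us =", value)
```

On a running machine, the string can be read from `/proc/cmdline` instead of being hard-coded.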

My Samsung 512 GB SSD is:

SAMSUNG MZVLW512HMJP-00000, FW REV: CXY7501Q

on a Razer Blade Stealth.

I also asked this on ask ubuntu, without success: https://askubuntu.com/questions/998471/razer-blade-stealth-disk-corruption-fsck-needed-probably-samsung-ssd-bug-afte

Please help me, as I need this computer to work on lots of things :c
---
ApportVersion: 2.20.7-0ubuntu3.7
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: lz 1088 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 17.10
InstallationDate: Installed on 2018-01-30 (0 days ago)
InstallationMedia: Ubuntu 17.10 "Artful Aardvark" - Release amd64 (20180105.1)
MachineType: Razer Blade Stealth
Package: linux (not installed)
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.13.0-21-generic.efi.signed root=UUID=0ca062da-7e8f-425a-88b1-1f784fb40346 ro quiet splash button.lid_init_state=open nvme_core.default_ps_max_latency_us=0
ProcVersionSignature: Ubuntu 4.13.0-21.24-generic 4.13.13
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-21-generic N/A
 linux-backports-modules-4.13.0-21-generic N/A
 linux-firmware 1.169.1
Tags: wayland-session artful
Uname: Linux 4.13.0-21-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 01/12/2017
dmi.bios.vendor: Razer
dmi.bios.version: 6.00
dmi.board.name: Razer
dmi.board.vendor: Razer
dmi.chassis.type: 9
dmi.chassis.vendor: Razer
dmi.modalias: dmi:bvnRazer:bvr6.00:bd01/12/2017:svnRazer:pnBladeStealth:pvr2.04:rvnRazer:rnRazer:rvr:cvnRazer:ct9:cvr:
dmi.product.family: 1A586752
dmi.product.name: Blade Stealth
dmi.product.version: 2.04
dmi.sys.vendor: Razer

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.15 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1746340

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected artful wayland-session
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Lucas Zanella (lucaszanella) wrote :

Which kernel exactly should I install, and how? I don't feel safe downloading over plain HTTP.

Kai-Heng Feng (kaihengfeng) wrote :

This is a known issue for Samsung NVMe.

Please attach the output of `sudo nvme id-ctrl /dev/nvme0` and `sudo nvme get-feature -f 0x0c -H /dev/nvme0 | less`. Thanks!

Kai-Heng Feng (kaihengfeng) wrote :

Uhh sans the "less", thanks.

Lucas Zanella (lucaszanella) wrote :

Thank you for your answer. I'm desperate. I just installed Debian, so I'm not going to be able to do it right now, but I have output from the last time I was using Ubuntu.

I tried nvme_core.default_ps_max_latency_us=5500 and it didn't work. Then I set it to 0, which didn't work either. Well, with 0 it didn't generate errors during normal use, only while trying to update my machine, which always fails anyway, so I don't know anymore. I remember seeing APST Disabled in the output, but the error always happens when I try to update my software...

Shouldn't this bug be already fixed? Or not in my kernel? I could pay to get to the bottom of this, because I need my computer so much right now and this bug is happening every day and I can't continue my work!

The last kernel I had on Ubuntu was 4.13.0-26-generic; now I'm on Debian with 4.9.0-4.

sudo nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S33UNX0J324060       SAMSUNG MZVLW512HMJP-00000               1         25,30 GB / 512,11 GB       512 B + 0 B      CXY7501Q

NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : S33UNX0J324060
mn : SAMSUNG MZVLW512HMJP-00000
fr : CXY7501Q
rab : 2
ieee : 002538
cmic : 0
mdts : 0
cntlid : 2
ver : 10200
rtd3r : 186a0
rtd3e : 4c4b40
oaes : 0
oacs : 0x17
acl : 7
aerl : 3
frmw : 0x16
lpa : 0x3
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 341
cctemp : 344
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 512110190592
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
subnqn :
ps 0 : mp:7.60W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:5.10W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
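
The power-state table at the end of this output is what the latency parameter is compared against. As a rough sketch, the Python below parses the `ps N :` lines and lists which non-operational states fit under a given limit. The helper names are hypothetical. The rule used here (a state is allowed when its exit latency does not exceed the limit) is a simplifying assumption; the kernel's real rule may also count entry latency, depending on the kernel version. The latencies are taken to be in microseconds.

```python
import re

# Matches the first line of each power-state entry in `nvme id-ctrl` output.
PS_LINE = re.compile(
    r"ps\s+(?P<ps>\d+)\s*:\s*mp:(?P<mp>[\d.]+)W\s+"
    r"(?P<kind>non-operational|operational)\s+"
    r"enlat:(?P<enlat>\d+)\s+exlat:(?P<exlat>\d+)"
)


def parse_power_states(text):
    """Return one dict per power state found in the text."""
    states = []
    for match in PS_LINE.finditer(text):
        states.append({
            "ps": int(match.group("ps")),
            "max_power_w": float(match.group("mp")),
            "operational": match.group("kind") == "operational",
            "enlat_us": int(match.group("enlat")),
            "exlat_us": int(match.group("exlat")),
        })
    return states


def states_within(states, max_latency_us):
    """Non-operational states whose exit latency fits the limit (simplified rule)."""
    return [
        s["ps"] for s in states
        if not s["operational"] and s["exlat_us"] <= max_latency_us
    ]


# Lines copied from the id-ctrl output in this report.
sample = """\
ps 0 : mp:7.60W operational enlat:0 exlat:0 rrt:0 rrl:0
ps 1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
ps 2 : mp:5.10W operational enlat:0 exlat:0 rrt:2 rrl:2
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
"""
states = parse_power_states(sample)
for limit in (0, 1500, 5500):
    print(limit, "->", states_within(states, limit))
```

Under this simplified rule, both 1500 and 5500 exclude the deepest state (ps 4) on this drive, while 0 excludes every non-operational state.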

Kai-Heng Feng (kaihengfeng) wrote :

> On 31 Jan 2018, at 1:38 PM, Lucas Zanella <email address hidden> wrote:
>
> Thank you for your answer. I'm desperated. I just installed debian
> therefore I'm not going to able to do it right now, but I have output
> from the last time I was using Ubuntu.
>
> I tried nvme_core.default_ps_max_latency_us=5500 and it didn't work.
> Then I've put it to 0, which didn't work too. Well, with 0 it didn't
> generate errors while using, but while trying to update my machine,
> which always happens too, so I don't know anymore. I remember seeing
> ATSP Disabled at the output, but the error always happens when I try to
> update my software…

I’d like to see the output of `sudo nvme get-feature -f 0x0c -H /dev/nvme0` when you use nvme_core.default_ps_max_latency_us=0.

>
> Shouldn't this bug be already fixed? Or not in my kernel? I could pay to
> get to the bottom of this, because I need my computer so much right now
> and this bug is happening every day and I can't continue my work!

This is more likely a low-level NVMe/PCIe issue. If possible, please try to upgrade the firmware for the NVMe.

>
> The last kernel I had on ubuntu was 4.13.0-26-generic, now I'm on debian
> and I have 4.9.0-4.

You’ll get hit by this issue (again) once the next Debian release uses a newer kernel.

Lucas Zanella (lucaszanella) wrote :

Hi. I've been trying to install Windows 10 in order to try to update my SSD firmware, but I'm getting an error:

https://imgur.com/a/BM0gG

Could it be that my SSD has a real hardware problem? I tried many different pen drives, in different USB ports, but I always get the same error.

I'm trying to install Ubuntu to get the output with nvme_core.default_ps_max_latency_us=0, but the installation always fails.

Lucas Zanella (lucaszanella) wrote :

Hi! I managed to install Ubuntu again. These are the outputs you asked for, with nvme_core.default_ps_max_latency_us=0:

NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : S33UNX0J324060
mn : SAMSUNG MZVLW512HMJP-00000
fr : CXY7501Q
rab : 2
ieee : 002538
cmic : 0
mdts : 0
cntlid : 2
ver : 10200
rtd3r : 186a0
rtd3e : 4c4b40
oaes : 0
oacs : 0x17
acl : 7
aerl : 3
frmw : 0x16
lpa : 0x3
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 341
cctemp : 344
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 512110190592
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
subnqn :
ps 0 : mp:7.60W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:5.10W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

get-feature:0xc (Autonomous Power State Transition), Current value:00000000
 Autonomous Power State Transition Enable (APSTE): Disabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 2]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 5]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 6]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 7]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 8]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 9]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[10]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[11]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[12]
 ...
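
The human-readable get-feature output above is long but regular. A small sketch like the following (not from the thread) pulls out the APSTE flag and the per-entry values so that two runs can be compared quickly. The function name `parse_apst` is made up, and the parser assumes the layout shown above, including the assumption that an enabled table prints the word "Enabled".

```python
import re


def parse_apst(text):
    """Return (enabled, entries) from `nvme get-feature -f 0x0c -H` style text.

    `enabled` is True/False, or None if no APSTE line is present.
    `entries` maps entry index -> {"itpt_ms": int, "itps": int}.
    """
    enabled = None
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.search(r"\(APSTE\):\s*(\w+)", line)
        if match:
            enabled = match.group(1) == "Enabled"
            continue
        match = re.match(r"Entry\[\s*(\d+)\]", line)
        if match:
            current = int(match.group(1))
            entries[current] = {}
            continue
        match = re.match(r"Idle Time Prior to Transition \(ITPT\):\s*(\d+)\s*ms", line)
        if match and current is not None:
            entries[current]["itpt_ms"] = int(match.group(1))
            continue
        match = re.match(r"Idle Transition Power State\s*\(ITPS\):\s*(\d+)", line)
        if match and current is not None:
            entries[current]["itps"] = int(match.group(1))
    return enabled, entries


# First lines of the output posted above.
sample = """\
get-feature:0xc (Autonomous Power State Transition), Current value:00000000
 Autonomous Power State Transition Enable (APSTE): Disabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
"""
enabled, entries = parse_apst(sample)
active = {i: e for i, e in entries.items() if e.get("itpt_ms") or e.get("itps")}
print("APST enabled:", enabled)
print("entries with non-zero settings:", active)
```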

Lucas Zanella (lucaszanella) wrote :

I just installed 4.15.0-041500-generic

Lucas Zanella (lucaszanella) wrote :

Problem persists with 4.15.0-041500-generic, just happened

Kai-Heng Feng (kaihengfeng) wrote :

So you have the issue on Linux v4.15 with nvme_core.default_ps_max_latency_us=0, but not on v4.9?

APST is not enabled on either of them.

Lucas Zanella (lucaszanella) wrote :

On Debian (4.9) I didn't notice the issue, but I didn't use it much. HOWEVER, when I run apt-get upgrade on Debian I do get the issue. It had only updated the kernel files; it wasn't running the new kernel (that would require a reboot).

On v4.15 I don't think I set nvme_core.default_ps_max_latency_us=0. I think I set it before upgrading to v4.15. But I can try again.

This is all very strange

Lucas Zanella (lucaszanella) wrote :

I forgot to mention that I reinstalled Windows and everything is fine. I even ran a benchmark on the SSD, and I'm downloading lots of files to test.

Kai-Heng Feng (kaihengfeng) wrote :

I am not familiar with Windows. Is there any way to check its APST table? I'd like to see whether the deepest power state is enabled or not.

Lucas Zanella (lucaszanella) wrote :

I searched and found nothing.

So, even with APST disabled my SSD will fail on Linux. What should I do?
Does it work normally for other people when they disable it?

Lucas Zanella (lucaszanella) wrote :

I found someone with the same problem who also had a Razer Blade Stealth, but he didn't post anything more after that. He was in a thread with you. I also found some people with this same problem on the same SSD. Together with the fact that I had no problem on Windows (more than 24 hours of usage by now), I think it can be fixed in the kernel.

I had no luck updating my SSD's firmware as it's OEM and Samsung's updater won't work for it. Do you have any idea? I don't have money to buy a new SSD, and I really need to work. I'd be so grateful if you could help with a solution.

Kai-Heng Feng (kaihengfeng) wrote :

Does the issue happen after system suspend?

Lucas Zanella (lucaszanella) wrote :

Initially I noticed that it would happen after opening the lid of the notebook, so yes. But now, right after I install Ubuntu, it immediately starts looking for software updates, and that's when the problem happens for the first time, before I've even had time to close the notebook to suspend it.

Kai-Heng Feng (kaihengfeng) wrote :

Please try [1]. It will do a PCI reset for the NVMe device after resume.

[1] people.canonical.com/~khfeng/lp1746340/

Lucas Zanella (lucaszanella) wrote :

Thanks. What's a 'PCI reset for NVMe device after resume'?

Here's the output of running sudo dpkg -i *.deb on the 4 files:

Selecting previously unselected package linux-headers-4.15.0+.
(Reading database ... 137951 files and directories currently installed.)
Preparing to unpack linux-headers-4.15.0+_4.15.0+-2_amd64.deb ...
Unpacking linux-headers-4.15.0+ (4.15.0+-2) ...
Selecting previously unselected package linux-image-4.15.0+.
Preparing to unpack linux-image-4.15.0+_4.15.0+-2_amd64.deb ...
Unpacking linux-image-4.15.0+ (4.15.0+-2) ...
Selecting previously unselected package linux-image-4.15.0+-dbg.
Preparing to unpack linux-image-4.15.0+-dbg_4.15.0+-2_amd64.deb ...
Unpacking linux-image-4.15.0+-dbg (4.15.0+-2) ...
dpkg-deb (subprocess): decompressing archive member: lzma error: compressed data is corrupt
dpkg-deb: error: subprocess <decompress> returned error exit status 2
dpkg: error processing archive linux-image-4.15.0+-dbg_4.15.0+-2_amd64.deb (--install):
 cannot copy extracted data for './usr/lib/debug/lib/modules/4.15.0+/kernel/drivers/iio/pressure/zpa2326.ko' to '/usr/lib/debug/lib/modules/4.15.0+/kernel/drivers/iio/pressure/zpa2326.ko.dpkg-new': unexpected end of file or stream
Selecting previously unselected package linux-libc-dev.
Preparing to unpack linux-libc-dev_4.15.0+-2_amd64.deb ...
Unpacking linux-libc-dev (4.15.0+-2) ...
Setting up linux-headers-4.15.0+ (4.15.0+-2) ...
Setting up linux-image-4.15.0+ (4.15.0+-2) ...
update-initramfs: Generating /boot/initrd.img-4.15.0+
W: Possible missing firmware /lib/firmware/i915/skl_dmc_ver1_27.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_dmc_ver1_04.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_guc_ver9_39.bin for module i915
W: Possible missing firmware /lib/firmware/i915/bxt_guc_ver9_29.bin for module i915
W: Possible missing firmware /lib/firmware/i915/skl_guc_ver9_33.bin for module i915
Generating grub configuration file ...
Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
Found linux image: /boot/vmlinuz-4.15.0+
Found initrd image: /boot/initrd.img-4.15.0+
Found linux image: /boot/vmlinuz-4.13.0-21-generic
Found initrd image: /boot/initrd.img-4.13.0-21-generic
Adding boot menu entry for EFI firmware configuration
done
Setting up linux-libc-dev (4.15.0+-2) ...
Errors were encountered while processing:
 linux-image-4.15.0+-dbg_4.15.0+-2_amd64.deb

Lucas Zanella (lucaszanella) wrote :

I downloaded again and it seems that this time it wasn't corrupted.
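
Since the first download was corrupt and the second was not, one way to tell the copies apart before running dpkg is to hash them. The sketch below is illustrative only: the helper names are hypothetical, and because the thread mentions no published checksum for these packages, it compares two local copies with each other rather than against a reference value. The demo uses small temporary files in place of real .deb downloads.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)


# Demo with stand-in files: two identical copies and one truncated copy.
with tempfile.TemporaryDirectory() as tmp:
    good_1 = os.path.join(tmp, "copy1.deb")
    good_2 = os.path.join(tmp, "copy2.deb")
    truncated = os.path.join(tmp, "copy3.deb")
    payload = b"example package payload" * 1000
    for path, data in ((good_1, payload), (good_2, payload), (truncated, payload[:-10])):
        with open(path, "wb") as f:
            f.write(data)
    identical = same_content(good_1, good_2)
    differs = not same_content(good_1, truncated)

print("two good copies match:", identical)
print("truncated copy detected:", differs)
```

If two downloads of the same file hash differently, at least one of them is damaged, whether by the network or by the disk it was written to.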

Output:

Preparing to unpack linux-headers-4.15.0+_4.15.0+-2_amd64.deb ...
Unpacking linux-headers-4.15.0+ (4.15.0+-2) over (4.15.0+-2) ...
Preparing to unpack linux-image-4.15.0+_4.15.0+-2_amd64(1).deb ...
Unpacking linux-image-4.15.0+ (4.15.0+-2) over (4.15.0+-2) ...
Preparing to unpack linux-image-4.15.0+-dbg_4.15.0+-2_amd64(1).deb ...
Unpacking linux-image-4.15.0+-dbg (4.15.0+-2) ...
Preparing to unpack linux-libc-dev_4.15.0+-2_amd64.deb ...
Unpacking linux-libc-dev (4.15.0+-2) over (4.15.0+-2) ...
Setting up linux-headers-4.15.0+ (4.15.0+-2) ...
Setting up linux-image-4.15.0+ (4.15.0+-2) ...
update-initramfs: Generating /boot/initrd.img-4.15.0+
W: Possible missing firmware /lib/firmware/i915/skl_dmc_ver1_27.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_dmc_ver1_04.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_guc_ver9_39.bin for module i915
W: Possible missing firmware /lib/firmware/i915/bxt_guc_ver9_29.bin for module i915
W: Possible missing firmware /lib/firmware/i915/skl_guc_ver9_33.bin for module i915
Generating grub configuration file ...
Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
Found linux image: /boot/vmlinuz-4.15.0+
Found initrd image: /boot/initrd.img-4.15.0+
Found linux image: /boot/vmlinuz-4.13.0-21-generic
Found initrd image: /boot/initrd.img-4.13.0-21-generic
Adding boot menu entry for EFI firmware configuration
done
Setting up linux-image-4.15.0+-dbg (4.15.0+-2) ...
Setting up linux-libc-dev (4.15.0+-2) ...

Lucas Zanella (lucaszanella) wrote :

After installing everything, I rebooted to use the new kernel. I then installed updates on the machine to see if the problem would happen (the easiest way to trigger it is to run an update). After the update, wireless stopped working. I restarted many times and it is still not working.

Could it be that the update triggered the error and the so-called PCIe reset in this kernel broke the wireless?

I'm gonna still use this kernel to see if the read only filesystem happens though

Lucas Zanella (lucaszanella) wrote :

I added a USB wireless receiver so I can use the internet to download things and see if something happens. I installed more system updates through the Ubuntu software updater. Is this OK? The kernel will still be yours, right?

Changed in linux (Ubuntu):
assignee: nobody → Kai-Heng Feng (kaihengfeng)
tags: added: patch
(95 of 175 comments hidden)
Sam (samr28) wrote :

I also have the same issue on a Razer Blade 2017 - 7500U model. My system has the exact same drive in it. I have just installed the 4.18.0-3 kernel linked above and will post here if I run into the issue again.

hariprasad (hariprasad) wrote :

Hello, my Ubuntu Linux system was out of order for 8-9 months, first on 17.10 and later on 18.04. I did many installations and tests, had the M.2 SSD replaced four times (warranty claims accepted), and had the whole NUC7i7BNH replaced (warranty claim accepted). I logged an issue with Intel. Finally I installed Fedora 28, and the NVMe M.2 SSD is in good condition and works properly. I used the default LVM partition layout. Ubuntu on LVM failed immediately during installation, when updates were applied. The problem is tied specifically to Ubuntu. I didn't check which driver Fedora and Ubuntu use; the Fedora kernel is '4.17.12-200.fc28.x86_64 #1 x86_64 GNU/Linux', so I cannot tell whether the cause is in the driver or in the system settings. But in the end I can say that my warranty claims were unjustified. Sadly, in that case, the easiest workaround for me is to use a different Linux distribution.

Lucas Zanella (lucaszanella) wrote :

Hi hariprasad. If possible, you could try our patched kernel, which disables ASPM: https://people.canonical.com/~khfeng/pm961-disable-aspm/. It worked for me, but only when I added the nvme-core.default_ps_max_latency_us=1500 kernel parameter. Maybe you can try it some day. We're still investigating the issue. When Kai-Heng sends me the patch I can try it and see what changes.

hariprasad (hariprasad) wrote :

Hello Lucas, thank you for the response. Yes, badly set kernel parameters can cause very serious problems. Additionally, I can confirm that the problem is not limited to Samsung NVMe SSDs; it occurs on Intel SSD 6 series drives as well. It looks like the problem is in the SSD driver's ASPM handling or in the kernel parameters. Several errors show up at once. The easiest way to reproduce the initramfs error during startup/restart is to install Thunderbird and download thousands of emails from a cloud service, e.g. Google Mail, to generate traffic on the SSD. Then install and start Firefox, add a video player plugin, and start a video. Firefox for Linux (the latest versions, roughly 57-61) is unstable on Linux (in general, not only on Ubuntu), so Firefox begins to crash and asks "Would you like to restart and recover Firefox?". This stresses the SSD, and after a few restores (about 10) Thunderbird will probably begin to crash too. That is the time to restart the system. There will probably be a failure: Ubuntu cannot start and an initramfs error occurs.

Lucas Zanella (lucaszanella) wrote :

It's worth saying that the ASPM patch + 1500 kernel parameter worked for me for over a month without a single error. After updating to 18.04 I now see the error every 2 or 3 days. Actually, in the middle of the upgrade to 18.04 it gave the error right at the initramfs update, which is where it always gives the error. This is sad; it was working perfectly, except inside the VMs, and it was very stable :(

Sam (samr28) wrote :

I'm also running the ASPM patch and haven't had problems for the last month or so. Any idea when this will get merged?

I stumbled onto this bug from somewhere else and noticed that I seem to have the same Samsung SM961/PM961 SSD (same lspci -vvnn output for the NVMe as Lucas posted). However, for me it has worked without any problems on stock Ubuntu 18.04 / Mint / Kubuntu installations. Perhaps it depends on the system configuration as well, not just the SSD? I'm not sure this information helps, but I thought I'd post it anyway.

Fabian (fabiangieseke) wrote :

I have had the same issues with Ubuntu 18.04 and a Samsung MZ-V7E1T0 1000GB M.2 PCI Express 3.0 and the default installation (ext4): Plenty of errors, especially when upgrading/installing packages via apt.

I have reinstalled the whole system. Instead of the standard journaling file system (ext4), I have btrfs for the root mount point (/). System works perfectly now, no errors for a couple of days with plenty of software being installed.

Not sure, might be an ext4/kernel bug (?).

To add to my previous comment, I've been running ext4 all the time.

Lucas Zanella (lucaszanella) wrote :

Fabian, did you have any problems installing ubuntu? Mine would give disk errors about 7/10 times I tried to install. I had to try many times until no error appeared.

I'd like to try btrfs but I don't have the time to do it right now. I also had problems with apt, but when upgrading the system. It'd always give the error in the initramfs update, or something like that.

I'll try to install a fresh ubuntu 18.04 soon too, as Janne suggested.

Fabian (fabiangieseke) wrote :

I have tried two things:

(1) Fresh install, Ubuntu 18.04 (about ten days ago), ext4. No errors during the installation. However, when installing stuff via apt afterwards (or upgrading), I got many errors along the lines described above (e.g., "compressed data is corrupt... unexpected end of file or stream"). This happened for, I guess, arbitrary packages. No errors for initramfs update for me ...

(2) Fresh install, Ubuntu 18.04 (about four days ago), btrfs for /. No errors at all.

I have this bug (MSI laptop, Ubuntu Studio 18.04) and it's getting quite annoying to be honest. If there's anything I can do to help remedy the situation within reasonable time (I'm about to reinstall) then let me know.

Lucas Zanella (lucaszanella) wrote :

Hi Ole Christian. First, did you have any problems in the ubuntu installation? In mine I had to try to install several times until it installed without any disk errors.

Also, you can try this kernel https://people.canonical.com/~khfeng/pm961-disable-aspm/ with this kernel parameter "nvme-core.default_ps_max_latency_us=1500". This is what worked for me, but it's not a definitive solution, I still get the error in some situations (much more rare than before though). You can read our discussion to understand it better.

I guess someone is working on this bug for a definitive solution...

Hi Lucas! Thanks for the reply. No, I had no problems during installation. The computer just shuts down at random intervals to a black screen with all kinds of EXT4-fs errors and reports that the file system is read only. Often the disk isn't even recognized at reboot, so I have to boot into a live environment and use Gparted to fix it from there.

I do music production professionally, so if I can't get it fixed relatively easily and permanently then I'll have to look elsewhere unfortunately.

Thanks though. :)

Lucas Zanella (lucaszanella) wrote :

Ok Christian, thanks for the info. You can try the kernel for now. I also read that using Ubuntu with btrfs instead of ext4 solves the problem, so you could try that too.

I may try that. Are we sure it's a kernel bug though? I can't remember having this problem when I used Solus OS for a while. But I may not have used it for long enough, since I discovered it didn't support Jack2 and was pretty much unusable to me.

I can confirm this bug with two different NVMe drives, a Samsung EVO970 and a WD Black, on the 4.18.0-10-generic and 4.15.0-20-generic kernels. The motherboard has an Intel H270 chipset.

I have the WD Black 256 GB drive.

The attachment contains the error from the dmesg output. For me, the reproduction step is: write a large amount of data (>10G) to the NVMe SSD.
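
The "write a lot of data" reproduction could be scripted roughly as follows. This is only a sketch under assumptions: the function name and chunk size are made up for illustration, and the read-back may be served from the page cache rather than from the drive, so a size larger than RAM (or dropping caches between the two phases) is needed for the comparison to say anything about the SSD itself.

```python
import hashlib
import os


def write_and_verify(path, total_bytes, chunk_size=1024 * 1024):
    """Write total_bytes of random data to path, re-read it, and compare SHA-256 digests."""
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            f.write(chunk)
            written.update(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the kernel to push the data to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            read_back.update(chunk)

    # True means the bytes read back match the bytes written.
    return written.hexdigest() == read_back.hexdigest()
```

For example, write_and_verify("/mnt/nvme/testfile", 10 * 1024**3) would write about 10 GiB; the path here is hypothetical. On an affected system the EXT4-fs errors in dmesg may well show up before the comparison finishes.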

Kai-Heng Feng (kaihengfeng) wrote :

What's the PCI ID for EVO 970 and WD Black?

Kai-Heng Feng (kaihengfeng) wrote :

If you use Samsung (144d:a804) or Sk Hynix (1c5c:1285), please try kernel in [1].

[1] https://people.canonical.com/~khfeng/lp1785715/
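
To check a drive against the two IDs named above, a small sketch along these lines could be used on a line of `lspci -nn` output. The helper names are hypothetical, and the set only contains the IDs quoted in this comment, not a complete list of affected devices.

```python
import re

# Vendor:device IDs quoted in the comment above (not an exhaustive list).
LISTED_IDS = {("144d", "a804"), ("1c5c", "1285")}

# Matches "[vvvv:dddd]"; the class code such as "[0108]" has no colon and is skipped.
_ID_RE = re.compile(r"\[([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\]")


def pci_id(lspci_line):
    """Return (vendor, device) from one line of `lspci -nn` output, or None."""
    match = _ID_RE.search(lspci_line)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


def is_listed(lspci_line):
    """True if the line's PCI ID is one of the IDs quoted in this thread."""
    return pci_id(lspci_line) in LISTED_IDS
```

Only the first "[vvvv:dddd]" pair on the line is used, which in `lspci -nn` output is the device's own ID.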

My Samsung is indeed [144d:a808]. I'll check WD later on - it's not connected at this time.
I was not able to reproduce this bug using Clear Linux current kernel (4.18.16-645).
I checked kernel https://people.canonical.com/~khfeng/lp1785715/ with no nvme-core.default_ps_max_latency_us= settings and I was not able to reproduce the issue with my "copy lots of data" scenario that triggered the bug every time yesterday.
So it looks like success! I'll keep using that kernel for now and report if any problems arise.

Kai-Heng Feng (kaihengfeng) wrote :

The kernel doesn't do anything special for 144d:a808, it's for 144d:a804.

Then I'm puzzled. I'll retest later with WD.

Here is the output of lspci -vvnn on my computer. It's from the 256GB version of the Samsung NVMe.
On my systems I've never had any corruption problems, even when moving large (60GB+) VM files and installing the OS on ext4 multiple times. Currently running stock LM 19. Hope this helps.

lshw output:

*-storage
                description: Non-Volatile memory controller
                product: Sandisk Corp
                vendor: Sandisk Corp
                physical id: 0
                bus info: pci@0000:04:00.0
                version: 00
                width: 64 bits
                clock: 33MHz
                capabilities: storage pm pciexpress msix nvm_express bus_master cap_list
                configuration: driver=nvme latency=0
                resources: irq:16 memory:df100000-df103fff

lspci output:

04:00.0 Non-Volatile memory controller: Sandisk Corp WD Black NVMe SSD

lspci -vvnn:

04:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black NVMe SSD [15b7:5001] (prog-if 02 [NVM Express])
        Subsystem: Marvell Technology Group Ltd. WD Black NVMe SSD [1b4b:1093]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 16
        NUMA node: 0
        Region 0: Memory at df100000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: nvme
        Kernel modules: nvme

I should probably mention that while it is installed in my laptop it is not currently being used as I had to revert to using an ordinary HDD.

Lucas Zanella (lucaszanella) wrote :

Any news on this problem? I'm still having it.

I too am having an SSD corruption issue with Ubuntu 18.04, with exactly the same symptoms. I have a Kingston 480 GB SSD, not NVMe, connected over SATA. My PC is a desktop; I have attached the output of lspci -vvnn. I have to do a manual fsck every 1.5 weeks or so. When I am using my PC, it will occasionally freeze for about 15 seconds with very high SSD I/O usage. I have attached an iotop log which recorded a freeze at around 18:03:22 (the log records every second, and you will see a gap between the recordings at 18:03:22 and 18:03:35, which indicates the freeze, followed by 90%+ I/O). I have included my SSD SMART info as well as my current lsblk output below:

=== START OF INFORMATION SECTION ===
Device Model: KINGSTON SA400S37480G
Serial Number: 50026B76825B4FA0
LU WWN Device Id: 5 0026b7 6825b4fa0
Firmware Version: SBFKB1C2
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Feb 25 17:57:34 2019 SAST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

lsblk:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
loop0      7:0  0   3.7M  1 loop /snap/gnome-system-monitor/57
loop1      7:1  0    13M  1 loop /snap/gnome-characters/103
loop2      7:2  0    91M  1 loop /snap/core/6350
loop3      7:3  0   3.7M  1 loop /snap/gnome-system-monitor/51
loop4      7:4  0   2.3M  1 loop /snap/gnome-calculator/180
loop5      7:5  0 140.7M  1 loop /snap/gnome-3-26-1604/78
loop6      7:6  0 270.5M  1 loop /snap/pycharm-community/112
loop7      7:7  0  86.9M  1 loop /snap/core/4917
loop8      7:8  0    91M  1 loop /snap/core/6405
loop9      7:9  0  14.5M  1 loop /snap/gnome-logs/45
loop10    7:10  0 140.7M  1 loop /snap/gnome-3-26-1604/74
loop11    7:11  0    13M  1 loop /snap/gnome-characters/139
loop12    7:12  0  14.5M  1 loop /snap/gnome-logs/37
loop13    7:13  0   2.3M  1 loop /snap/gnome-calculator/260
loop14    7:14  0  34.7M  1 loop /snap/gtk-common-themes/319
loop15    7:15  0  34.6M  1 loop /snap/gtk-common-themes/818
loop16    7:16  0 140.9M  1 loop /snap/gnome-3-26-1604/70
loop17    7:17  0  34.8M  1 loop /snap/gtk-common-themes/1122
sda        8:0  0 447.1G  0 disk
└─sda1     8:1  0 447.1G  0 part /

Here is the iotop log I mentioned above (attached)
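
Spotting such a freeze in a per-second log can be automated by looking for jumps between consecutive timestamps. The sketch below assumes the timestamps have already been pulled out of the log as HH:MM:SS strings (the attached log's exact layout isn't shown here), and the function name and 2-second threshold are arbitrary choices for illustration. It does not handle logs that cross midnight.

```python
from datetime import datetime


def find_gaps(timestamps, max_gap_seconds=2):
    """Return (previous, current, seconds) for each jump larger than max_gap_seconds."""
    gaps = []
    previous = None
    for stamp in timestamps:
        current = datetime.strptime(stamp, "%H:%M:%S")
        if previous is not None:
            delta = (current - previous).total_seconds()
            if delta > max_gap_seconds:
                gaps.append((previous.strftime("%H:%M:%S"), stamp, int(delta)))
        previous = current
    return gaps
```

With the two times quoted above, find_gaps(["18:03:22", "18:03:35"]) reports a single 13-second gap.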

My issue has been resolved by upgrading the firmware of my SSD from SBFKB1C2 to SBFKB1C3.

https://askubuntu.com/questions/1107053/ubutnu-18-04-ssd-sometimes-freeze-for-seconds

Lucas Zanella (lucaszanella) wrote :

Just tried Ubuntu 19 today and the problem persists (I can't even install Ubuntu because it gives I/O errors).

Lucas Zanella (lucaszanella) wrote :

Hi Kai-Heng Feng, do you have any news on this problem? It'd be great to know.

Thank you so much!

Fabian (fabiangieseke) wrote :

Hi,

a little update from my side: It seems that faulty memory was the reason for the data corruption in my case. I have replaced the memory module and everything seems to work fine now. I was quite surprised, though, that the memory was defective, since I tested it carefully for many hours with memtest (20+ passes without any errors). The errors only occurred when running Ubuntu ...

The memory was the only thing I have changed, so I am very sure that this was the cause ...

Kai-Heng Feng (kaihengfeng) wrote :

Lucas,
Do you still have this issue on mainline kernel?

Lucas Zanella (lucaszanella) wrote :

I tried the Ubuntu 19.04 installer and I couldn't even install it because of I/O errors. Does the installer of Ubuntu 19.04 use the new kernel?

Lucas Zanella (lucaszanella) wrote :

Hi Kai-Heng Feng, I just installed kernel 5.1.1 and the error still happens
