Root filesystem frequently becomes read-only, freezing the system

Bug #1927866 reported by anoopjohn
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone:

Bug Description

I have a brand-new Lenovo IdeaPad Flex 5 14 that came with Windows 10, and I have installed Ubuntu 21.04 alongside it in a dual-boot setup.

I keep running into a problem where the root filesystem becomes read-only and the system freezes with a screen full of errors about being unable to write to the filesystem.

I booted from a live USB and ran fsck and e2fsck but did not see any errors. I read online that this could be caused by I/O errors, but neither smartctl nor fsck shows any.

So I was wondering whether this could be a software bug rather than a hardware issue. Please let me know if you need any further information from the system.

=============================================================
=============================================================

fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 476.94 GiB, 512110190592 bytes, 1000215216 sectors
Disk model: WDC PC SN530 SDBPMPZ-512G-1101
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: E8803E28-C4F7-4FA5-9B6D-0F3CAB7027A5

Device              Start        End   Sectors   Size Type
/dev/nvme0n1p1       2048     534527    532480   260M EFI System
/dev/nvme0n1p2     534528     567295     32768    16M Microsoft reserved
/dev/nvme0n1p3     567296  199446527 198879232  94.8G Microsoft basic data
/dev/nvme0n1p4  998166528 1000214527   2048000  1000M Windows recovery environmen
/dev/nvme0n1p5  199446528  203352063   3905536   1.9G Linux filesystem
/dev/nvme0n1p6  203352064  219353087  16001024   7.6G Linux swap
/dev/nvme0n1p7  219353088  414664703 195311616  93.1G Linux filesystem
/dev/nvme0n1p8  414664704  998166527 583501824 278.2G Microsoft basic data

=============================================================
=============================================================

smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.11.0-16-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: WDC PC SN530 SDBPMPZ-512G-1101
Serial Number: 205135806243
Firmware Version: 21160001
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 8b484daff7
Local Time is: Sun May 9 12:45:05 2021 EDT
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 80 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
 0 + 3.50W 2.10W - 0 0 0 0 0 0
 1 + 2.40W 1.60W - 0 0 0 0 0 0
 2 + 1.90W 1.50W - 0 0 0 0 0 0
 3 - 0.0250W - - 3 3 3 3 3900 11000
 4 - 0.0050W - - 4 4 4 4 5000 39000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
 0 + 512 0 2
 1 - 4096 0 1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 35 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 624,535 [319 GB]
Data Units Written: 526,042 [269 GB]
Host Read Commands: 4,462,163
Host Write Commands: 4,686,788
Controller Busy Time: 25
Power Cycles: 20
Power On Hours: 76
Unsafe Shutdowns: 9
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
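
Note for readers cross-checking the SMART figures above: NVMe reports "Data Units Read/Written" in units of 1000 blocks of 512 bytes, so the bracketed sizes can be reproduced by hand. A minimal Python sketch (the function name is made up for illustration):

```python
# NVMe SMART log: one "data unit" is 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512


def data_units_to_bytes(units: int) -> int:
    """Convert an NVMe 'Data Units Read/Written' counter to bytes."""
    return units * BYTES_PER_DATA_UNIT


if __name__ == "__main__":
    # Counter values copied from the smartctl output above.
    for label, units in (("read", 624_535), ("written", 526_042)):
        print(f"{label}: {data_units_to_bytes(units) / 1e9:.1f} GB")
```

Both counters are small for a drive with 76 power-on hours, which fits the "Percentage Used: 0%" line and does not by itself point at media wear.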

=============================================================
=============================================================

cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/nvme0n1p7 during installation
UUID=3167581c-60b0-4db8-972c-a8dfafc882c5 / ext4 errors=remount-ro 0 1
# /boot was on /dev/nvme0n1p5 during installation
UUID=9c06f7c8-ad3b-4b90-b32c-4d4099b4548d /boot ext4 defaults 0 2
# /boot/efi was on /dev/nvme0n1p1 during installation
UUID=6C28-8155 /boot/efi vfat umask=0077 0 1
# swap was on /dev/nvme0n1p6 during installation
UUID=38034881-4f37-4d16-8cc5-e343f61b4ef7 none swap sw 0 0
/dev/disk/by-uuid/0F7EA5A832F72C99 /mnt/0F7EA5A832F72C99 auto nosuid,nodev,nofail,x-gvfs-show 0 0
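
Note on the fstab above: only the root entry carries errors=remount-ro, which is the option that makes ext4 remount / read-only when it detects an error. The sketch below is a small, hypothetical Python helper (names are illustrative, not from any tool) that pulls that option out of an fstab line. When the option is absent from fstab, ext4 falls back to the default error behaviour stored in the superblock, so removing it from fstab does not necessarily change what happens.

```python
from typing import NamedTuple, Optional


class FstabEntry(NamedTuple):
    spec: str
    mount_point: str
    fs_type: str
    options: list
    dump: int
    passno: int


def parse_fstab_line(line: str) -> Optional[FstabEntry]:
    """Parse one fstab line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    spec, mount_point, fs_type, options = fields[:4]
    # The dump and pass fields are optional and default to 0 (see fstab(5)).
    dump = int(fields[4]) if len(fields) > 4 else 0
    passno = int(fields[5]) if len(fields) > 5 else 0
    return FstabEntry(spec, mount_point, fs_type, options.split(","), dump, passno)


def errors_option(entry: FstabEntry) -> Optional[str]:
    """Return the value of the errors= mount option, or None if fstab does not set it."""
    for opt in entry.options:
        if opt.startswith("errors="):
            return opt.split("=", 1)[1]
    return None
```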

=============================================================
=============================================================

nvme error-log shows 64 entries, all identical to the one below, so I am pasting just one.

nvme error-log /dev/nvme0
Error Log Entries for device:nvme0 entries:64
.................
 Entry[ 0]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
---
ProblemType: Bug
ApportVersion: 2.20.11-0ubuntu65
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: anoopjohn 1695 F.... pulseaudio
 /dev/snd/pcmC1D0p: anoopjohn 1695 F...m pulseaudio
 /dev/snd/controlC0: anoopjohn 1695 F.... pulseaudio
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 21.04
InstallationDate: Installed on 2021-04-30 (11 days ago)
InstallationMedia: Ubuntu 21.04 "Hirsute Hippo" - Release amd64 (20210420)
MachineType: LENOVO 82HU
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.11.0-16-generic root=UUID=3167581c-60b0-4db8-972c-a8dfafc882c5 ro quiet splash vt.handoff=7
ProcVersionSignature: Ubuntu 5.11.0-16.17-generic 5.11.12
RelatedPackageVersions:
 linux-restricted-modules-5.11.0-16-generic N/A
 linux-backports-modules-5.11.0-16-generic N/A
 linux-firmware 1.197
Tags: hirsute wayland-session
Uname: Linux 5.11.0-16-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin lxd plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 02/23/2021
dmi.bios.release: 1.19
dmi.bios.vendor: LENOVO
dmi.bios.version: GJCN19WW
dmi.board.asset.tag: No Asset Tag
dmi.board.name: LNVNB161216
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40709 WIN
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 31
dmi.chassis.vendor: LENOVO
dmi.chassis.version: IdeaPad Flex 5 14ALC05
dmi.ec.firmware.release: 1.12
dmi.modalias: dmi:bvnLENOVO:bvrGJCN19WW:bd02/23/2021:br1.19:efr1.12:svnLENOVO:pn82HU:pvrIdeaPadFlex514ALC05:rvnLENOVO:rnLNVNB161216:rvrSDK0J40709WIN:cvnLENOVO:ct31:cvrIdeaPadFlex514ALC05:
dmi.product.family: IdeaPad Flex 5 14ALC05
dmi.product.name: 82HU
dmi.product.sku: LENOVO_MT_82HU_BU_idea_FM_IdeaPad Flex 5 14ALC05
dmi.product.version: IdeaPad Flex 5 14ALC05
dmi.sys.vendor: LENOVO

anoopjohn (anoop.john) wrote :
affects: launchpad → ubuntu
anoopjohn (anoop.john)
description: updated
anoopjohn (anoop.john) wrote :

I ran the short and long NVMe self-tests. Results below.

nvme device-self-test /dev/nvme0 -n 1 -s 1
nvme device-self-test /dev/nvme0 -n 1 -s 2

nvme self-test-log /dev/nvme0
Device Self Test Log for NVME device:nvme0
Current operation : 0
Current Completion : 0%
Self Test Result[0]:
  Operation Result : 0
  Self Test Code : 2
  Valid Diagnostic Information : 0
  Power on hours (POH) : 0x4c
  Vendor Specific : 0 0
Self Test Result[1]:
  Operation Result : 0
  Self Test Code : 1
  Valid Diagnostic Information : 0
  Power on hours (POH) : 0x4c
  Vendor Specific : 0 0
Self Test Result[2]:
  Operation Result : 0xf
Self Test Result[3]:
  Operation Result : 0xf
Self Test Result[4]:
  Operation Result : 0xf
Self Test Result[5]:
  Operation Result : 0xf
Self Test Result[6]:
  Operation Result : 0xf
Self Test Result[7]:
  Operation Result : 0xf
Self Test Result[8]:
  Operation Result : 0xf
Self Test Result[9]:
  Operation Result : 0xf
Self Test Result[10]:
  Operation Result : 0xf
Self Test Result[11]:
  Operation Result : 0xf
Self Test Result[12]:
  Operation Result : 0xf
Self Test Result[13]:
  Operation Result : 0xf
Self Test Result[14]:
  Operation Result : 0xf
Self Test Result[15]:
  Operation Result : 0xf
Self Test Result[16]:
  Operation Result : 0xf
Self Test Result[17]:
  Operation Result : 0xf
Self Test Result[18]:
  Operation Result : 0xf
Self Test Result[19]:
  Operation Result : 0xf
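
Note on reading the log above: in the NVMe self-test log, an Operation Result of 0 means the test completed without error and 0xf marks an unused slot, while Self Test Code 1 and 2 are the short and extended tests. The Python sketch below decodes the text format shown above under those assumptions; the function names are illustrative and only the codes that actually appear in this log are mapped. The hex 0x4c is 76, matching "Power On Hours: 76" in the smartctl output.

```python
import re

# Only the values that appear in the log above are listed here.
RESULT_MEANING = {0x0: "completed without error", 0xF: "entry not used"}
TEST_CODE_MEANING = {1: "short", 2: "extended"}


def parse_self_test_log(text: str) -> list:
    """Split `nvme self-test-log` text into one dict per 'Self Test Result[n]'."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Self Test Result\[(\d+)\]:", line)
        if header:
            current = {"index": int(header.group(1))}
            entries.append(current)
        elif current is not None and ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            current[key] = value
    return entries


def describe(entry: dict) -> str:
    """Render one parsed entry as a short human-readable line."""
    result = int(entry["Operation Result"], 0)  # base 0 accepts "0" and "0xf"
    if result == 0xF:
        return f"[{entry['index']}] {RESULT_MEANING[0xF]}"
    code = int(entry["Self Test Code"], 0)
    hours = int(entry["Power on hours (POH)"], 0)
    return (f"[{entry['index']}] {TEST_CODE_MEANING.get(code, code)} test, "
            f"{RESULT_MEANING.get(result, hex(result))}, at {hours} power-on hours")
```

Read this way, the log shows one extended and one short test, both passing, and eighteen empty slots.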

anoopjohn (anoop.john) wrote :

I have more updates.

I booted into Windows and ran Lenovo's updater, which pulled down a BIOS update. That seems to have changed the frequency of the issue. Earlier it was happening every 30 minutes or so, but it has now happened only once in the last 24 hours, while the system was left unattended overnight. Could this have something to do with power management?

I came across this bug about a Samsung SSD: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184. I was wondering whether I should try what is suggested there, but the freeze is now harder to reproduce, so I am not sure.

I also ran into another issue after the BIOS update: my Wi-Fi adapter disappeared in Ubuntu. I booted into Windows, disabled APM on the Wi-Fi adapter, and did a cold restart into Ubuntu, and the adapter came back.

The BIOS version is now GJCN19WW, build date 2/23/2021, Phoenix Technologies Ltd.

Chris Guiver (guiverc) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better.

Bug reporting is about finding & fixing problems thus preventing future users from hitting the same bug.

I suspect a Support site would be more appropriate, eg. https://answers.launchpad.net/ubuntu. You can also find help with your problem in the support forum of your local Ubuntu community http://loco.ubuntu.com/ or asking at https://askubuntu.com or https://ubuntuforums.org, or for more support options please look at https://discourse.ubuntu.com/t/community-support/709

affects: ubuntu → linux (Ubuntu)
Chris Guiver (guiverc) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. Please execute the following command only once, as it will automatically gather debugging information, in a terminal:

apport-collect 1927866

When reporting bugs in the future please use apport by using 'ubuntu-bug' and the name of the package affected. You can learn more about this functionality at https://wiki.ubuntu.com/ReportingBugs.

anoopjohn (anoop.john) wrote : apport information

Attached: AlsaInfo.txt, CRDA.txt, CurrentDmesg.txt, IwConfig.txt, Lspci.txt, Lspci-vt.txt, Lsusb.txt, Lsusb-t.txt, Lsusb-v.txt, PaInfo.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcInterrupts.txt, ProcModules.txt, PulseList.txt, RfKill.txt, UdevDb.txt, WifiSyslog.txt, acpidump.txt

tags: added: apport-collected hirsute wayland-session
description: updated

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
anoopjohn (anoop.john) wrote :

Thanks for looking into this issue, Chris.

Thanks also for pointing out ubuntu-bug, apport, and the bug reporting guidelines. I will use them in the future.

anoopjohn (anoop.john) wrote :

I just had another freeze, but unlike in the past I was still in the desktop and was not dropped into the terminal view. I was able to switch to a terminal, but I could not run any commands such as 'cat /var/log/dmesg' or 'dmesg'. I have seen this state before: all the icons, including those in the notification area and the favorites bar, become blank, and I cannot do anything. I tried to run ubuntu-bug from the terminal, but it did not start and reported an I/O error.

Is there any command I should run when I get into this state?

I rebooted into a live USB and ran fsck and badblocks. No errors.

sudo badblocks -nsv /dev/nvme0n1p7
Checking for bad blocks in non-destructive read-write mode
From block 0 to 97655807
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: done
Pass completed, 0 bad blocks found. (0/0/0 errors)

sudo e2fsck -fcck /dev/nvme0n1p7
e2fsck 1.45.7 (28-Jan-2021)
Checking for bad blocks (non-destructive read-write test)
Testing with random pattern: done
/dev/nvme0n1p7: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/nvme0n1p7: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p7: 248682/6111232 files (1.4% non-contiguous), 4843265/24413952 blocks
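
Note for readers comparing the tool outputs: fdisk, badblocks, and e2fsck each describe the size of /dev/nvme0n1p7 in different units, and the three figures can be checked against each other. The Python sketch below does that arithmetic. The 1024-byte badblocks block size and the 4096-byte ext4 block size are assumptions (common defaults) and are not shown anywhere in this thread; the function names are illustrative.

```python
SECTOR_SIZE = 512        # from the fdisk output ("Units: sectors of 1 * 512")
BADBLOCKS_BLOCK = 1024   # assumption: badblocks default block size
EXT4_BLOCK = 4096        # assumption: typical ext4 block size


def bytes_from_fdisk(sectors: int) -> int:
    """Partition size from the fdisk 'Sectors' column."""
    return sectors * SECTOR_SIZE


def bytes_from_badblocks(last_block: int) -> int:
    """badblocks prints 'From block 0 to N', so N + 1 blocks were tested."""
    return (last_block + 1) * BADBLOCKS_BLOCK


def bytes_from_e2fsck(total_blocks: int) -> int:
    """Filesystem size from the e2fsck 'used/total blocks' summary."""
    return total_blocks * EXT4_BLOCK


def percent_used(used_blocks: int, total_blocks: int) -> float:
    """Share of filesystem blocks in use, as a percentage."""
    return 100.0 * used_blocks / total_blocks
```

Under those assumptions all three tools agree on 99,999,547,392 bytes, so badblocks and e2fsck did cover the whole partition, and the filesystem was a little under 20% full.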

anoopjohn (anoop.john) wrote :

Yesterday I added the kernel parameter nvme_core.default_ps_max_latency_us=0 to the GRUB defaults, based on this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184

However, I just ran into the freeze again. The system had been unattended for a while, and I found the login screen with all the icons blanked out.

sudo nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0x15b7
ssvid : 0x15b7
sn : 205135806243
mn : WDC PC SN530 SDBPMPZ-512G-1101
fr : 21160001
rab : 4
ieee : 001b44
cmic : 0
mdts : 7
cntlid : 0x1
ver : 0x10400
rtd3r : 0x7a120
rtd3e : 0xf4240
oaes : 0x200
ctratt : 0x2
rrls : 0
cntrltype : 1
fguid :
crdt1 : 0
crdt2 : 0
crdt3 : 0
oacs : 0x17
acl : 4
aerl : 7
frmw : 0x14
lpa : 0x1e
elpe : 255
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 353
cctemp : 358
mtfa : 50
hmpre : 51200
hmmin : 823
tnvmcap : 512110190592
unvmcap : 0
rpmbs : 0
edstt : 57
dsto : 1
fwug : 1
kas : 0
hctma : 0x1
mntmt : 273
mxtmt : 358
sanicap : 0x60000002
hmminds : 0
hmmaxd : 8
nsetidmax : 0
endgidmax : 0
anatt : 0
anacap : 0
anagrpmax : 0
nanagrpid : 0
pels : 1
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 1
oncs : 0x5f
fuses : 0
fna : 0
vwc : 0x7
awun : 0
awupf : 0
nvscc : 1
nwpc : 0
acwu : 0
sgls : 0
mnan : 0
subnqn : nqn.2018-01.com.wdc:guid:E8238FA6BF53-0001-001B448B484DAFF7
ioccsz : 0
iorcsz : 0
icdoff : 0
ctrattr : 0
msdbd : 0
ps 0 : mp:3.50W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:2.10W
ps 1 : mp:2.40W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:1.60W
ps 2 : mp:1.90W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:1.50W
ps 3 : mp:0.0250W non-operational enlat:3900 exlat:11000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:0.0250W active_power:-
ps 4 : mp:0.0050W non-operational enlat:5000 exlat:39000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:0.0050W active_power:-
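
Note on a few of the raw fields above: ver packs the NVMe version into one integer, wctemp and cctemp are temperatures in kelvins, and rtd3r and rtd3e are latencies in microseconds. The Python sketch below decodes them (function names are illustrative). The results line up with the smartctl output earlier in the report: version 1.4, warning threshold 80 Celsius, critical threshold 85 Celsius.

```python
def decode_nvme_version(ver: int) -> str:
    """VER layout: major in bits 31:16, minor in bits 15:8, tertiary in bits 7:0."""
    return f"{ver >> 16}.{(ver >> 8) & 0xFF}.{ver & 0xFF}"


def kelvin_to_celsius(kelvin: int) -> int:
    """WCTEMP and CCTEMP are in kelvins; the tools in this thread subtract 273."""
    return kelvin - 273


def microseconds_to_seconds(us: int) -> float:
    """RTD3R and RTD3E are reported in microseconds."""
    return us / 1_000_000


if __name__ == "__main__":
    # Values copied from the id-ctrl output above.
    print("version:", decode_nvme_version(0x10400))
    print("warning temp:", kelvin_to_celsius(353), "C")
    print("critical temp:", kelvin_to_celsius(358), "C")
    print("rtd3r:", microseconds_to_seconds(0x7A120), "s")
    print("rtd3e:", microseconds_to_seconds(0xF4240), "s")
```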

===========================================
===========================================

sudo nvme get-feature -f 0x0c -H /dev/nvme0n1
get-feature:0xc (Autonomous Power State Transition), Current value:00000000
 Autonomous Power State Transition Enable (APSTE): Disabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 2]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 4]
 .................
 Idle Time P...

anoopjohn (anoop.john) wrote :

The system was crashing every few minutes today. I took a backup of the relevant data and changed the fstab mount options for the root partition to

UUID=3167581c-60b0-4db8-972c-a8dfafc882c5 / ext4 defaults

That is, I took out the errors=remount-ro option.

I am hoping to see what really triggers the error in dmesg, even if that corrupts some data in the process.

The only challenge now is that I may not notice the I/O error that used to freeze the system, since root will no longer become read-only.

What should I specifically look for in dmesg?

If I move /var/log to a separate partition on the same disk, mounted without errors=remount-ro, and keep errors=remount-ro on root, would I be able to see the error that leads to the read-only remount?
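
Note on what to search for: one possible way to narrow down a long dmesg or syslog capture is to filter it for storage-related messages. The Python sketch below is only an illustration. The first two patterns are messages the reporter quotes from the screen later in this thread; the remaining three are assumptions about typical ext4 and block-layer wording and may not match every kernel version. The function name is hypothetical.

```python
import re

# Case-insensitive substrings to look for in kernel log text.
PATTERNS = [
    r"error while async write back metadata",  # quoted later in this thread
    r"failed to set APST feature",             # quoted later in this thread
    r"I/O error",                              # assumption: generic block-layer wording
    r"Remounting filesystem read-only",        # assumption: ext4 wording
    r"EXT4-fs error",                          # assumption: ext4 wording
]
_REGEX = re.compile("|".join(PATTERNS), re.IGNORECASE)


def suspicious_lines(log_text: str) -> list:
    """Return the log lines that match any of the patterns above, in order."""
    return [line for line in log_text.splitlines() if _REGEX.search(line)]
```

For example, `suspicious_lines(open("/var/log/syslog").read())` would list the matching lines from a saved syslog. This only helps if the messages reached a log file at all, which may not be the case once the filesystem holding /var/log has stopped accepting writes.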

Kai-Heng Feng (kaihengfeng) wrote :

Can you please try kernel parameter "nvme-core.default_ps_max_latency_us=11000"?

anoopjohn (anoop.john) wrote :

Thanks for looking into this, @kaihengfeng. I will try that and post the results.

I have been running the system for a few days without hitting the error. The usual symptom when I hit it is that every icon on the desktop, including those in the favorites bar and the notification area, turns into the same blank/broken icon.

In addition to the fstab change above (root no longer remounts read-only on errors), I had also turned off fast boot in Windows. I hoped that would stop Windows from leaving unnecessary hardware settings in place when it shuts down.

However, I ran into the same blank-icon state today. The system had been in standby: I closed the laptop (without shutting down) yesterday and opened it today. I worked on it for a few minutes and then the crash happened. I tried to cat /var/log/syslog, but that did not work. What commands should I try the next time I get into this state?

I have uploaded /var/log/syslog here.

I am also going to make the change suggested by kaihengfeng and will keep the thread posted.

anoopjohn (anoop.john) wrote :

@kaihengfeng - I have set the parameter.

cat /etc/default/grub

GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=11000"
GRUB_CMDLINE_LINUX=""

cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
11000
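
Note on the spelling of the parameter: the request above writes it as nvme-core.default_ps_max_latency_us while the GRUB line uses nvme_core. The kernel treats dashes and underscores in parameter names as equivalent, so both forms select the same setting. The Python sketch below (a hypothetical helper, not part of any tool) looks up a parameter on a kernel command line with that same normalisation.

```python
import shlex
from typing import Optional


def kernel_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of `name` on a kernel command line, or None if absent.

    Dashes and underscores in the parameter name are treated as equivalent.
    A flag without '=' (for example 'quiet') yields an empty string.
    """
    wanted = name.replace("-", "_")
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        if key.replace("-", "_") == wanted:
            return value if sep else ""
    return None
```

Calling it on the contents of /proc/cmdline after a reboot is one way to confirm that an edit to /etc/default/grub actually reached the running kernel, alongside the /sys/module check shown above.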

anoopjohn (anoop.john) wrote :

I just got the crash again. The system had locked and I was at the lock screen when it happened. The icons on the login screen turned blank and I was not able to log in.

anoopjohn (anoop.john) wrote :

It has crashed multiple times since then today. I also saw the read-only filesystem error again, where the whole screen switches to the terminal view with the list of I/O errors, as in my original report.

What else needs to be looked at?

If it crashes too frequently, I will probably change default_ps_max_latency_us back to 0.

Kai-Heng Feng (kaihengfeng) wrote :

Can you please try kernel parameter "pcie_aspm=force pcie_aspm.policy=performance"?

anoopjohn (anoop.john) wrote :

Thanks @kaihengfeng. I have added this.

cat /sys/module/pcie_aspm/parameters/policy
default [performance] powersave powersupersave
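
Note on reading that line: the sysfs file lists every available ASPM policy and marks the active one with square brackets. A small Python sketch of how a script might extract it (function names are illustrative):

```python
import re
from typing import Optional


def active_policy(policy_line: str) -> Optional[str]:
    """Return the bracketed (currently selected) entry, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", policy_line)
    return match.group(1) if match else None


def available_policies(policy_line: str) -> list:
    """Return every policy name listed, with the brackets stripped."""
    return [word.strip("[]") for word in policy_line.split()]
```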

This is the current line in /etc/default/grub

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=11000 pcie_aspm=force pcie_aspm.policy=performance"

Have not changed anything else.

anoopjohn (anoop.john) wrote :

The system just crashed: the filesystem became read-only and I saw the console screen with the I/O errors. By the way, I do not have errors=remount-ro in the fstab.

I have attached the syslog here.

Is there anything else I should do specifically when the system crashes next time?

anoopjohn (anoop.john) wrote :

Crashed again. I have trimmed the syslog to a shorter version to include just the last couple of hours of logs.

anoopjohn (anoop.john) wrote :

I changed to pcie_aspm=force

Rebooted and logged in and the system crashed again.

anoopjohn (anoop.john) wrote :

It crashed multiple times, so the symptom is now occurring frequently. I have not had a crash on Windows so far, but I have not used Windows much, so I am going to use it for a while to see whether it crashes there at all.

Kai-Heng Feng (kaihengfeng) wrote :

Can you please attach the output of `sudo nvme get-feature -f 0x0c -H /dev/nvme0n1` when there's no kernel parameter?

anoopjohn (anoop.john) wrote :

I finally gave up on the laptop. I am going to return it and get another model, because the crashes were making it unusable for my work. I have wiped the system clean, but I still have the device, so I booted from a live USB and collected the details below.

Thanks to everybody who tried to help on this thread. I am sorry I had to give up, as it was making it difficult for me to work. I hope the thread will help somebody else move the research on this bug forward.

If there is anything else that needs to be collected before I return the laptop, please let me know and I will be happy to do that.

sudo nvme get-feature -f 0x0c -H /dev/nvme0n1
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
 Autonomous Power State Transition Enable (APSTE): Enabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 745 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 745 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 2]
 .................
 Idle Time Prior to Transition (ITPT): 745 ms
 Idle Transition Power State (ITPS): 3
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 2200 ms
 Idle Transition Power State (ITPS): 4
 .................

All subsequent entries, up to entry 31, show 0 ms and power state 0.
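
Note on where 745 ms and 2200 ms come from: they can be reproduced from the power-state latencies in the `nvme id-ctrl` output and the default_ps_max_latency_us setting. The Python sketch below is a re-implementation based on my understanding of the heuristic in the Linux nvme driver of that era, not code taken from the kernel, so treat the rule as an assumption: a non-operational state is eligible if its exit latency does not exceed the limit, and the idle time before entering it is (entry latency + exit latency) / 20, expressed in ms and rounded up.

```python
from typing import NamedTuple


class PowerState(NamedTuple):
    index: int
    operational: bool
    enlat_us: int  # entry latency in microseconds
    exlat_us: int  # exit latency in microseconds


# Values copied from the `nvme id-ctrl` output in this thread.
SN530_STATES = [
    PowerState(0, True, 0, 0),
    PowerState(1, True, 0, 0),
    PowerState(2, True, 0, 0),
    PowerState(3, False, 3900, 11000),
    PowerState(4, False, 5000, 39000),
]


def apst_table(states, max_latency_us: int) -> dict:
    """Return {state index: (idle_ms, target_state)}; (0, 0) means no transition."""
    table = {}
    target = (0, 0)
    # Walk from the deepest state upwards; each entry points at the
    # shallowest eligible state that is deeper than itself.
    for state in sorted(states, key=lambda s: s.index, reverse=True):
        table[state.index] = target
        if state.operational:
            continue  # only non-operational states are transition targets
        if state.exlat_us > max_latency_us:
            continue  # waking up would take longer than allowed
        total_us = state.enlat_us + state.exlat_us
        idle_ms = (total_us + 19) // 20  # 50x the total latency, in ms
        target = (idle_ms, state.index)
    return dict(sorted(table.items()))
```

With a limit of 100000 the sketch yields entries 0 to 2 pointing at state 3 after 745 ms and entry 3 pointing at state 4 after 2200 ms, matching the output above. With the suggested limit of 11000, state 4 (exit latency 39000) would be excluded and only state 3 would remain in use; with 0 no transitions are programmed at all.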

Kai-Heng Feng (kaihengfeng) wrote :

Thanks. Please give this kernel a try:
https://people.canonical.com/~khfeng/lp1927866/

anoopjohn (anoop.john) wrote :

Thanks for the updated kernel, @kaihengfeng.

I couldn't resist the urge to reinstall Ubuntu, even though I had already wiped the system clean for return.

I reinstalled Ubuntu. The last time I installed the system I had not checked the secure boot option, so no proprietary firmware would have been installed. This time I checked it and installed the proprietary firmware.

I am waiting for the crash to happen before I try the new kernel.

In the meantime, I took a look at default_ps_max_latency_us. It looks like the default value on a fresh install, without any changes, is actually 100000.

cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
100000

anoopjohn (anoop.john) wrote :

The system ran for a whole day without crashing, even though it went into screen lock multiple times. Last night I disconnected the power and closed the laptop. When I opened it today, the lock screen came up, I entered my password, and it crashed at that point.

The screen scrolled too much and I couldn't get a screenshot. However the first line was

error while async write back metadata

I also recollect the third line which said

failed to set APST feature

I think I saw -19 after that, not sure though.

I ran this

sudo nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0x15b7
ssvid : 0x15b7
sn : 205135806243
mn : WDC PC SN530 SDBPMPZ-512G-1101
fr : 21160001
rab : 4
ieee : 001b44
cmic : 0
mdts : 7
cntlid : 0x1
ver : 0x10400
rtd3r : 0x7a120
rtd3e : 0xf4240
oaes : 0x200
ctratt : 0x2
rrls : 0
cntrltype : 1
fguid :
crdt1 : 0
crdt2 : 0
crdt3 : 0
oacs : 0x17
acl : 4
aerl : 7
frmw : 0x14
lpa : 0x1e
elpe : 255
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 353
cctemp : 358
mtfa : 50
hmpre : 51200
hmmin : 823
tnvmcap : 512110190592
unvmcap : 0
rpmbs : 0
edstt : 57
dsto : 1
fwug : 1
kas : 0
hctma : 0x1
mntmt : 273
mxtmt : 358
sanicap : 0x60000002
hmminds : 0
hmmaxd : 8
nsetidmax : 0
endgidmax : 0
anatt : 0
anacap : 0
anagrpmax : 0
nanagrpid : 0
pels : 1
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 1
oncs : 0x5f
fuses : 0
fna : 0
vwc : 0x7
awun : 0
awupf : 0
nvscc : 1
nwpc : 0
acwu : 0
sgls : 0
mnan : 0
subnqn : nqn.2018-01.com.wdc:guid:E8238FA6BF53-0001-001B448B484DAFF7
ioccsz : 0
iorcsz : 0
icdoff : 0
ctrattr : 0
msdbd : 0
ps 0 : mp:3.50W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:2.10W
ps 1 : mp:2.40W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:1.60W
ps 2 : mp:1.90W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:1.50W
ps 3 : mp:0.0250W non-operational enlat:3900 exlat:11000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:0.0250W active_power:-
ps 4 : mp:0.0050W non-operational enlat:5000 exlat:39000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:0.0050W active_power:-

I am going to try to replicate the error by going through the same scenario: letting the laptop suspend while disconnected from power, and then logging in after that.

Kai-Heng Feng (kaihengfeng) wrote :

So the original issue is solved, the issue now only happens after sleep?

anoopjohn (anoop.john) wrote :

I tried to re-create the scenario that causes the crash but have not been able to do so reliably. It crashed again a few minutes ago, and even now I am not sure what exactly triggers it. What I have seen so far, since the reinstall, is that the crashes happen after I resume the system from sleep/suspend, log in, and start doing something. I still have not been able to reproduce this exactly, though. I wish there were some way to capture more logging. Could I keep a USB drive mounted and use that for /var/log?

I am going to go ahead and install the updated kernel now.

anoopjohn (anoop.john) wrote :

I am running the custom kernel, but I have just hit the error again. As before, the first line was

Error while async write back metadata

I couldn't take a photo of the screen with the remaining lines. The sequence this time was as follows. I was working in Ubuntu and decided to check something in Windows, so I rebooted and picked Windows. It started booting and came up at the Windows repair screen (I had wiped and restored Windows earlier while getting the system ready to be returned). I clicked restart instead of repair, as I didn't want to set up a Windows user and so on. The laptop warm-restarted back into Ubuntu, and the system crashed on the login screen when I entered my password.

I will try to replicate this error now.

anoopjohn (anoop.john) wrote :

I ran into this error again. The system did not go into the console mode with full errors but it was in the login screen with all images replaced by blank icons and I couldn't do anything.

There was no immediate reboot before this.

I wish we could have a mechanism where the specific scenario that triggers this error could be logged.

anoopjohn (anoop.john) wrote :

Ran into this again. Was connected to power, closed lid, disconnected from power, kept for a while. Opened, logged in, opened browser, started working on an email and boom. Crashed.

Isn't there anything we can do to get more logging on this? Mount an external usb drive as /var/log?

anoopjohn (anoop.john) wrote :

I am going to give it one last shot. I am going to try Ubuntu 20.04.2.0

anoopjohn (anoop.john) wrote :

Was able to replicate the crash in Ubuntu 20.04.2.0 as well.
Used for a while, closed, opened, logged in, open browser, click new tab. Crash.
I think I am going to return this laptop now.

Kai-Heng Feng (kaihengfeng) wrote :

Does this issue happen _before_ suspend?

anoopjohn (anoop.john) wrote :

Thanks for your reply, @kaihengfeng. The issue has happened before suspend as well. I have seen it more often after closing the laptop, that is, ever since I started deliberately doing that to re-create the error. What is interesting is that even that is not fully predictable: it has not happened every time I suspend and resume.

Kai-Heng Feng (kaihengfeng) wrote :

If you haven't returned the laptop, please give this kernel a try:
https://people.canonical.com/~khfeng/lp1927866/

anoopjohn (anoop.john) wrote :

I have 20.04 running on the system now. Install this on that?

anoopjohn (anoop.john) wrote :

I went ahead and tried to install it on 20.04. It looks like it ran into issues during the install. Is there anything else I need to do to get this to install on 20.04?

sudo dpkg -i *.deb
Selecting previously unselected package linux-buildinfo-5.13.0-8-generic.
(Reading database ... 183555 files and directories currently installed.)
Preparing to unpack linux-buildinfo-5.13.0-8-generic_5.13.0-8.8_amd64.deb ...
Unpacking linux-buildinfo-5.13.0-8-generic (5.13.0-8.8) ...
Selecting previously unselected package linux-headers-5.13.0-8-generic.
Preparing to unpack linux-headers-5.13.0-8-generic_5.13.0-8.8_amd64.deb ...
Unpacking linux-headers-5.13.0-8-generic (5.13.0-8.8) ...
Selecting previously unselected package linux-image-unsigned-5.13.0-8-generic.
Preparing to unpack linux-image-unsigned-5.13.0-8-generic_5.13.0-8.8_amd64.deb ...
Unpacking linux-image-unsigned-5.13.0-8-generic (5.13.0-8.8) ...
Selecting previously unselected package linux-modules-5.13.0-8-generic.
Preparing to unpack linux-modules-5.13.0-8-generic_5.13.0-8.8_amd64.deb ...
Unpacking linux-modules-5.13.0-8-generic (5.13.0-8.8) ...
Selecting previously unselected package linux-modules-extra-5.13.0-8-generic.
Preparing to unpack linux-modules-extra-5.13.0-8-generic_5.13.0-8.8_amd64.deb ...
Unpacking linux-modules-extra-5.13.0-8-generic (5.13.0-8.8) ...
Selecting previously unselected package linux-unstable-headers-5.13.0-8.
Preparing to unpack linux-unstable-headers-5.13.0-8_5.13.0-8.8_all.deb ...
Unpacking linux-unstable-headers-5.13.0-8 (5.13.0-8.8) ...
Setting up linux-buildinfo-5.13.0-8-generic (5.13.0-8.8) ...
dpkg: dependency problems prevent configuration of linux-headers-5.13.0-8-generic:
 linux-headers-5.13.0-8-generic depends on libc6 (>= 2.33); however:
  Version of libc6:amd64 on system is 2.31-0ubuntu9.2.

dpkg: error processing package linux-headers-5.13.0-8-generic (--install):
 dependency problems - leaving unconfigured
Setting up linux-unstable-headers-5.13.0-8 (5.13.0-8.8) ...
Setting up linux-image-unsigned-5.13.0-8-generic (5.13.0-8.8) ...
I: /boot/vmlinuz.old is now a symlink to vmlinuz-5.8.0-55-generic
I: /boot/initrd.img.old is now a symlink to initrd.img-5.8.0-55-generic
I: /boot/vmlinuz is now a symlink to vmlinuz-5.13.0-8-generic
I: /boot/initrd.img is now a symlink to initrd.img-5.13.0-8-generic
Setting up linux-modules-5.13.0-8-generic (5.13.0-8.8) ...
Setting up linux-modules-extra-5.13.0-8-generic (5.13.0-8.8) ...
Processing triggers for linux-image-unsigned-5.13.0-8-generic (5.13.0-8.8) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-5.13.0-8-generic
/etc/kernel/postinst.d/zz-update-grub:
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.13.0-8-generic
Found initrd image: /boot/initrd.img-5.13.0-8-generic
Found linux image: /boot/vmlinuz-5.8.0-55-generic
Found initrd image: /boot/initrd.img-5.8.0-55-generic
Found linux image: /boot/vmlinuz-5.8.0-43-generic
Found initrd image: /boot/initrd.img-5.8.0-43-generic
Found Windows Boot Manager on /dev/nvme0n1p1@/EFI/Microsoft/Boot/bootmgfw.efi
Adding boot menu entry for UEFI Firmware...


Kai-Heng Feng (kaihengfeng) wrote :

The headers are not needed. Please just boot the newly installed kernel.
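
As a rough illustration of the advice above, the sketch below picks out only the image and module packages from the downloaded files and prints a `dpkg` command for them. It assumes that these are the packages needed to boot the test kernel. The `BOOT_PREFIXES` tuple and the `select_boot_debs` helper are hypothetical names, not part of any Ubuntu tooling.

```python
from typing import List

# Package-name prefixes assumed to be enough to boot the test kernel.
# The headers packages are left out: one of them failed on libc6 (>= 2.33).
BOOT_PREFIXES = ("linux-image-", "linux-modules-")


def select_boot_debs(filenames: List[str]) -> List[str]:
    """Return the .deb files whose names start with a boot-related prefix."""
    return sorted(
        name
        for name in filenames
        if name.endswith(".deb") and name.startswith(BOOT_PREFIXES)
    )


# File names taken from the dpkg output earlier in this thread.
debs = [
    "linux-buildinfo-5.13.0-8-generic_5.13.0-8.8_amd64.deb",
    "linux-headers-5.13.0-8-generic_5.13.0-8.8_amd64.deb",
    "linux-image-unsigned-5.13.0-8-generic_5.13.0-8.8_amd64.deb",
    "linux-modules-5.13.0-8-generic_5.13.0-8.8_amd64.deb",
    "linux-modules-extra-5.13.0-8-generic_5.13.0-8.8_amd64.deb",
    "linux-unstable-headers-5.13.0-8_5.13.0-8.8_all.deb",
]

# Print the command rather than running it, so it can be reviewed first.
print("sudo dpkg -i " + " ".join(select_boot_debs(debs)))
```

Installing only this subset would avoid the unconfigured `linux-headers` package, which matters only when building out-of-tree modules.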

anoopjohn (anoop.john) wrote :

Ok. Booted with the new kernel

uname -a
Linux aj-2 5.13.0-8-generic #8 SMP Wed Jun 16 16:38:14 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

Will take it through the different scenarios and see if it crashes.

anoopjohn (anoop.john) wrote :

Have been using the new kernel on Ubuntu 20.04 since the 17th. Have tested the following scenarios:

On power
On battery
Closed Lid
Suspend from Shutdown menu
Cold reboot

Have not seen the crash since then. Anything else to do? Install 21.04 and the custom kernel and try again?

Kai-Heng Feng (kaihengfeng) wrote :
anoopjohn (anoop.john) wrote :

Thanks for the update @kaihengfeng

I think I spoke too soon. The system just crashed again :(

There was a set of system updates yesterday from Ubuntu for 20.04 that I ran. Not sure if that contributed.

Can we add some debug capability to capture a little more info on what is happening?

Kai-Heng Feng (kaihengfeng) wrote :

It's only possible if rootfs is on another storage.

anoopjohn (anoop.john) wrote :

Will this work if we boot and run from a USB drive?

Kai-Heng Feng (kaihengfeng) wrote :

Yes it will work.

anoopjohn (anoop.john) wrote :

Great. If you can direct me on what to configure to capture debugging info, and what to do with the Ubuntu live CD boot, I can do that.

BTW, I have been using the system since the 25th and it has not crashed since then. There has been just that one crash, on the 25th, since the 17th.

I will give it one last shot to debug this further.
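
For anyone attempting the capture discussed here, one possible approach is to copy kernel messages to a file and force every line to disk, so that the last messages before a freeze have a chance of surviving. The following is a minimal sketch, assuming Ubuntu is installed on and booted from a USB drive (as suggested in this thread), that `journalctl -k -f` is available to follow kernel messages, and that the log path is writable. The function names and the path are hypothetical.

```python
import os
import subprocess
from typing import Iterable, TextIO


def copy_lines_durably(lines: Iterable[str], out: TextIO) -> int:
    """Append each line to `out` and force it to disk before reading the next."""
    count = 0
    for line in lines:
        out.write(line if line.endswith("\n") else line + "\n")
        out.flush()             # hand Python's buffer to the kernel
        os.fsync(out.fileno())  # ask the kernel to write it to the device
        count += 1
    return count


def follow_kernel_log(log_path: str) -> None:
    """Follow kernel messages via journalctl and append them to `log_path`.

    Runs until interrupted. Assumes journalctl is installed and that
    `log_path` lives on storage other than the failing NVMe drive.
    """
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f"], stdout=subprocess.PIPE, text=True
    )
    with open(log_path, "a", encoding="utf-8") as out:
        copy_lines_durably(proc.stdout, out)
```

Calling `follow_kernel_log("/var/log/kernel-capture.log")` with sufficient privileges would keep appending until the process is stopped. Syncing after every line is slow, but it narrows the window in which messages can be lost. Whether anything useful is logged before the freeze is not guaranteed.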

Kai-Heng Feng (kaihengfeng) wrote :

I think it's better if Ubuntu is installed on the USB drive. A live USB has its limitations.

Personally I would suggest replacing it with another laptop if it's your daily driver.

anoopjohn (anoop.john) wrote :

For anybody coming across this thread: I was not able to bring this to closure. I returned the laptop :(
