BUG: Bad rss-counter state mm:000000002ddfedce idx:2 val:-1

Bug #1765838 reported by Malachi de AElfweald
This bug affects 1 person
Affects          Status   Importance  Assigned to  Milestone
linux (Ubuntu)   Expired  Medium      Unassigned

Bug Description

Booted. Started firefox. A couple seconds later it was back on the lock screen. Logged in again and it hung.

This is a fresh "Erase disk and reinstall" of the minimal install from yesterday's bionic ISO.

Ubuntu 4.15.0-15.16-generic 4.15.15

ProblemType: KernelOops
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-15-generic 4.15.0-15.16
ProcVersionSignature: Ubuntu 4.15.0-15.16-generic 4.15.15
Uname: Linux 4.15.0-15-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
Annotation: Your system might become unstable now and might need to be restarted.
ApportVersion: 2.20.9-0ubuntu5
Architecture: amd64
Date: Fri Apr 20 12:13:12 2018
Failure: oops
InstallationDate: Installed on 2018-04-19 (0 days ago)
InstallationMedia:

MachineType: System manufacturer System Product Name
OopsText:
 BUG: Bad rss-counter state mm:000000002ddfedce idx:2 val:-1
 TaskSchedulerFo[35882]: segfault at 5cf3c85816d8 ip 0000557fa5ed3ed0 sp 00007ff500037420 error 4 in chrome[557fa4a32000+5cd4000]
 traps: wget[35886] general protection ip:7fbe54e7d2ff sp:7ffc7aa235a0 error:0 in ld-2.27.so[7fbe54e70000+27000]
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-15-generic.efi.signed root=UUID=d9b05c55-71bb-4f4f-bdfd-43dd79de4c1d ro reboot=pci pcie_aspm=off
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions: kerneloops-daemon N/A
SourcePackage: linux
Title: BUG: Bad rss-counter state mm:000000002ddfedce idx:2 val:-1
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 12/21/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0902
dmi.board.asset.tag: Default string
dmi.board.name: ROG ZENITH EXTREME
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0902:bd12/21/2017:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnROGZENITHEXTREME:rvrRev1.xx:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer
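
For readers trying to interpret the OopsText lines above, here is a minimal sketch that decodes them. It is not part of the original report. It assumes the per-mm counter order of 4.15-era kernels (MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, MM_SHMEMPAGES) and the standard x86 page-fault error-code bits; the function names `decode_rss_bug` and `decode_fault_error` are made up for illustration. Check the enum against the source of the kernel actually running before relying on the result.

```python
import re

# Assumption: counter order as in 4.15-era kernels; verify against your kernel source.
MM_COUNTER_NAMES = ["MM_FILEPAGES", "MM_ANONPAGES", "MM_SWAPENTS", "MM_SHMEMPAGES"]

RSS_BUG_RE = re.compile(
    r"BUG: Bad rss-counter state mm:(?P<mm>[0-9a-f]+) idx:(?P<idx>\d+) val:(?P<val>-?\d+)"
)


def decode_rss_bug(line):
    """Parse a 'Bad rss-counter state' line; return None if it does not match."""
    match = RSS_BUG_RE.search(line)
    if match is None:
        return None
    idx = int(match.group("idx"))
    name = MM_COUNTER_NAMES[idx] if idx < len(MM_COUNTER_NAMES) else "unknown"
    return {
        "mm": match.group("mm"),
        "idx": idx,
        "counter": name,
        "val": int(match.group("val")),
    }


def decode_fault_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 0x1),  # 0 means the page was not present
        "write": bool(code & 0x2),         # 0 means the access was a read
        "user_mode": bool(code & 0x4),     # 1 means the fault came from user space
    }


print(decode_rss_bug("BUG: Bad rss-counter state mm:000000002ddfedce idx:2 val:-1"))
print(decode_fault_error(4))
```

Under these assumptions, idx:2 with val:-1 would mean the swap-entry counter was off by one when the address space was torn down, and "error 4" in the chrome segfault line would be a user-mode read of a page that was not present.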

Malachi de AElfweald (malachid) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.16 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17-rc2

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Malachi de AElfweald (malachid) wrote :

Unfortunately, the system is unusable this morning. Still trying to recover it. May have to flatline it again.

It seems I have gotten myself stuck in a loop:
1. try to reboot and that causes kernel panic
2. after that happens a few times, the NVMe drive needs to be fsck'd because of corrupt group descriptors
3. `fsck -CVvfy` the drive (twice for the ext partition and once for the EFI)
4. after doing 1-3 a few times, packages and symlinks start getting broken. I try to manually repair them until eventually I can't get into the system anymore.

I tried to run memtest. If it is set to 1 CPU at a time, it runs without error until it eventually hangs on a random (inconsistent) test. If I run with all CPUs, it shows tons of errors pretty quickly. Always on the same bit of every bank (e.g., 80808080 -> 8080A080) and always off by two. But again, it doesn't do that unless multiple CPUs are running at the same time. I thought it could be the other security features (interleaving, memory encryption, etc.) that the BIOS has set to auto.
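
As a small worked check of the memtest pattern quoted above (not part of the original comment), XOR-ing the expected and observed words shows which bits differ. The two values are the ones from the comment, read as 32-bit hexadecimal words, which is an assumption about how memtest displayed them.

```python
# Values quoted in the comment, assumed to be 32-bit hex words.
expected = 0x80808080
observed = 0x8080A080

# XOR leaves a 1 in every bit position where the two words differ.
diff = expected ^ observed
flipped_bits = [bit for bit in range(32) if (diff >> bit) & 1]

print(f"xor = {diff:#010x}, flipped bits = {flipped_bits}")
# prints "xor = 0x00002000, flipped bits = [13]"
```

So the hex digit going from 8 to A ("off by two") corresponds to a single flipped bit, bit 13 of the word.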

Launching the live usb and just sitting at a terminal with `journalctl --follow`, the last thing that happens before it hangs is usually cleaning temp files; but I haven't run that enough to know if it is a pattern.

From the BIOS, I can set it to auto overclock or manual -- there is no option to disable overclocking; so I cleared the CMOS and tried again immediately after that without any change.

I have attempted 44 bionic installs this month. 4 of those went through to completion. Two normal and two minimal. The rest failed during ubiquity.

grub-install almost always succeeds when acpi=off and almost always hangs when it isn't.

I also have to set pcie_aspm=off or the system is spammed with errors and crashes quickly. Others have reported the same thing for Threadripper.

I have tried with and without livepatch enabled.

The system is stable when mining or gaming, and seems unstable when underutilized -- so I tried disabling the C-states in the BIOS. I have tried disabling every form of power management I could find in the OS and in the BIOS. I am sure I have missed quite a few.

I have tried manually updating the kernel (per your requests) as well as using ukuu. Since it is my primary machine, I tend to have things installed that have to then be uninstalled for that to work well (like nvidia drivers, virtualbox, etc).

I am seeing a ton of segfaults, even from the live usb. It more often happens when the machine is sitting idle for a few minutes (which is what had me thinking about power management). I thought it could be the memory, but since they don't fail memtest (if I run them 1 CPU at a time)....

I know that "Erase disk and reinstall" will not solve the problem. It would be nice to figure out how to solve the problem before I do that again.

So... I'm not sure how I can try a new kernel for you. If there is some way for me to update a live usb with an alternate kernel from a live usb, that might work, since I see errors on the daily bionic iso as well.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired