Comment 4 for bug 1765838

Malachi de AElfweald (malachid) wrote :

Unfortunately, the system is unusable this morning. Still trying to recover it. May have to flatline it again.

It seems I have gotten myself stuck in a loop:
1. I try to reboot, and that causes a kernel panic
2. after that happens a few times, the NVMe drive needs to be fsck'd because of corrupt group descriptors
3. I `fsck -CVvfy` the drive (twice for the ext partition and once for the EFI partition)
4. after doing 1-3 a few times, packages and symlinks start getting broken. I try to manually repair them until eventually I can't get into the system anymore.

I tried to run memtest. If it is set to 1 CPU at a time, it runs without error until it eventually hangs on a random (inconsistent) test. If I run it with all CPUs, it shows tons of errors pretty quickly, always on the same bit of every bank (e.g. 80808080 -> 8080A080) and always off by two. But again, it doesn't do that unless multiple CPUs are running at the same time. I thought it could be the other security features (interleaving, memory encryption, etc.) that the BIOS has set to auto.
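To pin down which bit that is, here is a minimal sketch that XORs the expected and actual words from the memtest example above. The helper name `flipped_bits` is only for illustration; the two values are the ones quoted from the memtest output.

```python
def flipped_bits(expected: int, actual: int) -> list[int]:
    """Return the bit positions (0 = least significant) where the two words differ."""
    diff = expected ^ actual  # a 1 in the XOR marks every differing bit
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


if __name__ == "__main__":
    # Values quoted from the memtest error report above.
    expected, actual = 0x80808080, 0x8080A080
    print(f"xor mask: 0x{expected ^ actual:08X}")
    print("flipped bits:", flipped_bits(expected, actual))
```

For this pair the mask is 0x00002000, i.e. a single flipped bit (bit 13), which is why the affected hex digit reads as 8 -> A, "off by two".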

Launching the live usb and just sitting at a terminal with `journalctl --follow`, the last thing that happens before it hangs is usually cleaning temp files; but I haven't run that enough to know if it is a pattern.

From the BIOS, I can set it to auto overclock or manual -- there is no option to disable overclocking; so I cleared the CMOS and tried again immediately after that without any change.

I have attempted 44 bionic installs this month. 4 of those went through to completion. Two normal and two minimal. The rest failed during ubiquity.

grub-install almost always succeeds when acpi=off and almost always hangs when it isn't.

I also have to have pcie_aspm=off or the system is spammed with errors and crashes quickly. Others have reported the same thing for Threadripper.
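Since it is easy to lose track of which boot had which options, here is a small sketch for confirming that both parameters actually reached the running kernel. It assumes a Linux system that exposes `/proc/cmdline`; the helper name `missing_params` and the `REQUIRED` tuple are hypothetical, with the parameter strings taken from the text above.

```python
from pathlib import Path

# Parameters mentioned above; adjust to whatever is being tested.
REQUIRED = ("acpi=off", "pcie_aspm=off")


def missing_params(cmdline: str, required=REQUIRED) -> list[str]:
    """Return the required parameters that do not appear on the kernel command line."""
    present = set(cmdline.split())  # parameters are whitespace-separated
    return [param for param in required if param not in present]


if __name__ == "__main__":
    path = Path("/proc/cmdline")  # only exists on Linux
    if path.exists():
        missing = missing_params(path.read_text())
        print("missing:", missing if missing else "none")
    else:
        print("/proc/cmdline not available on this system")
```

Running it after each boot (including from the live USB) would at least rule out a boot entry that silently dropped one of the options.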

I have tried with and without livepatch enabled.

The system is stable when mining or gaming, and seems unstable when underutilized -- so I tried disabling the C-states in the BIOS. I have tried disabling every form of power management I could find in the OS and in the BIOS. I am sure I have missed quite a few.

I have tried manually updating the kernel (per your requests) as well as using ukuu. Since it is my primary machine, I tend to have things installed that have to then be uninstalled for that to work well (like nvidia drivers, virtualbox, etc).

I am seeing a ton of segfaults, even from the live usb. They happen more often when the machine has been sitting idle for a few minutes (which is what had me thinking about power management). I thought it could be the memory, but the modules don't fail memtest (if I run them 1 cpu at a time)....

I know that "Erase disk and reinstall" will not solve the problem. It would be nice to figure out how to solve the problem before I do that again.

So... I'm not sure how I can try a new kernel for you. If there is some way for me to update a live usb with an alternate kernel while running from a live usb, that might work, since I see errors on the daily bionic iso as well.