[Asus TUF Gaming FA506IV] Random system stops

Bug #1939966 reported by Kuroš Taheri-Golværzi
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
linux-hwe-5.11 (Ubuntu) | New | Undecided | Unassigned |
nvidia-graphics-drivers-470 (Ubuntu) | New | Undecided | Unassigned |

Bug Description

Every once in a while (it's been three times this week), the entire system will just stop. There's still electricity running. The keyboard is lit up, and the monitor is black (electricity-running black, and not shutdown-black). There's no picture and no sound. Just immediately before it does, I know it's about to happen because whatever audio is playing at the time will begin repeating, and then the computer goes down. I don't actually know if it's an Xorg problem. It just sounded like the closest possibility, based on the way the computer actually behaves.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: xorg 1:7.7+19ubuntu14
ProcVersionSignature: Ubuntu 5.11.0-25.27~20.04.1-generic 5.11.22
Uname: Linux 5.11.0-25-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
.proc.driver.nvidia.capabilities.gpu0: Error: path was not a regular file.
.proc.driver.nvidia.capabilities.mig: Error: path was not a regular file.
.proc.driver.nvidia.gpus.0000.01.00.0: Error: path was not a regular file.
.proc.driver.nvidia.registry: Binary: ""
.proc.driver.nvidia.suspend: suspend hibernate resume
.proc.driver.nvidia.suspend_depth: default modeset uvm
.proc.driver.nvidia.version:
 NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.57.02 Tue Jul 13 16:14:05 UTC 2021
 GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
ApportVersion: 2.20.11-0ubuntu27.18
Architecture: amd64
BootLog: Error: [Errno 13] Permission denied: '/var/log/boot.log'
CasperMD5CheckResult: skip
CompizPlugins: No value set for `/apps/compiz-1/general/screen0/options/active_plugins'
CompositorRunning: None
CurrentDesktop: XFCE
Date: Sat Aug 14 19:34:11 2021
DistUpgraded: Fresh install
DistroCodename: focal
DistroVariant: ubuntu
DkmsStatus:
 nvidia, 470.57.02, 5.11.0-25-generic, x86_64: installed
 nvidia, 470.57.02, 5.8.0-63-generic, x86_64: installed
 virtualbox, 6.1.22, 5.11.0-25-generic, x86_64: installed
 virtualbox, 6.1.22, 5.8.0-63-generic, x86_64: installed
ExtraDebuggingInterest: Yes
GraphicsCard:
 NVIDIA Corporation TU106 [GeForce RTX 2060] [10de:1f15] (rev a1) (prog-if 00 [VGA controller])
   Subsystem: ASUSTeK Computer Inc. Device [1043:1e21]
 Advanced Micro Devices, Inc. [AMD/ATI] Renoir [1002:1636] (rev f0) (prog-if 00 [VGA controller])
   Subsystem: ASUSTeK Computer Inc. Renoir [1043:1e21]
InstallationDate: Installed on 2021-03-20 (147 days ago)
InstallationMedia: Xubuntu 20.04.2 LTS "Focal Fossa" - Release amd64 (20210204)
MachineType: ASUSTeK COMPUTER INC. TUF Gaming FA506IV_TUF506IV
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.11.0-25-generic root=UUID=0fc8340c-98de-4434-9806-ec613e241e8c ro quiet splash vt.handoff=7
SourcePackage: xorg
Symptom: display
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 07/02/2020
dmi.bios.release: 5.16
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: FA506IV.309
dmi.board.asset.tag: ATN12345678901234567
dmi.board.name: FA506IV
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: 1.0
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: ASUSTeK COMPUTER INC.
dmi.chassis.version: 1.0
dmi.ec.firmware.release: 3.9
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrFA506IV.309:bd07/02/2020:br5.16:efr3.9:svnASUSTeKCOMPUTERINC.:pnTUFGamingFA506IV_TUF506IV:pvr1.0:rvnASUSTeKCOMPUTERINC.:rnFA506IV:rvr1.0:cvnASUSTeKCOMPUTERINC.:ct10:cvr1.0:
dmi.product.family: TUF Gaming FA506IV
dmi.product.name: TUF Gaming FA506IV_TUF506IV
dmi.product.version: 1.0
dmi.sys.vendor: ASUSTeK COMPUTER INC.
version.compiz: compiz N/A
version.libdrm2: libdrm2 2.4.102-1ubuntu1~20.04.1
version.libgl1-mesa-dri: libgl1-mesa-dri 20.2.6-0ubuntu0.20.04.1
version.libgl1-mesa-glx: libgl1-mesa-glx N/A
version.nvidia-graphics-drivers: nvidia-graphics-drivers-* N/A
version.xserver-xorg-core: xserver-xorg-core 2:1.20.9-2ubuntu1.2~20.04.2
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev N/A
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:19.1.0-1
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.99.917+git20200226-1
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:1.0.16-1

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :
affects: ubuntu → xorg (Ubuntu)
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Thanks for the bug report. Next time the problem happens, please:

1. Wait 10 seconds.

2. Reboot.

3. Run:

   journalctl -b-1 > prevboot.txt

4. Attach the resulting text file here.

5. Also check for crashes using these instructions: https://wiki.ubuntu.com/Bugs/Responses#Missing_a_crash_report_or_having_a_.crash_attachment
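
Before attaching the file, it can help to skim it for the kernel messages that commonly accompany a freeze. What follows is only a minimal sketch: the function name `kernel_red_flags` and its pattern list are assumptions made for illustration, and the list is not exhaustive. Only the `journalctl` invocation comes from the steps above.

```shell
# Print only the log lines that contain common kernel failure signatures.
# The pattern list is illustrative and can be extended as needed.
kernel_red_flags() {
    grep -E -i 'soft lockup|hard lockup|NULL pointer dereference|kernel panic|pci bus timeout'
}

# Typical use after rebooting from a freeze (requires a persistent journal):
#   journalctl -b -1 > prevboot.txt
#   kernel_red_flags < prevboot.txt
```

An empty result does not prove the log is clean. It only means none of the listed signatures were written before the machine stopped.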

affects: xorg (Ubuntu) → ubuntu
Changed in ubuntu:
status: New → Incomplete
Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Okay, for sure. I'll do that.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Done and done. Please let me know if there's anything else I can do to help.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Thanks. I can't see any relevant problems in that log. Please try following all of those steps again and, importantly, only after the "stop" happens. Please also remember step 5.

P.S. Is there any reason you might think this is a different issue to bug 1906449?

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

I did. The last time my computer stopped responding, I waited a few minutes before finally turning it off, just to make sure to give it enough time. Is there anything else I can do?

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

1. Please check for crashes using these instructions: https://wiki.ubuntu.com/Bugs/Responses#Missing_a_crash_report_or_having_a_.crash_attachment

2. Is there any reason you might think this is a different issue to bug 1906449?

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

In /var/crash/, the most recent file is from two days ago, and my computer actually just did it again. I looked inside /var/lib/whoopsie/whoopsie-id, and there's just what appears to be a hexadecimal string, not a link of any kind. And in bug #1906449, it's simply a frozen screen. In my case, the screen actually goes black (electricity-running black, and not powered-down black), and there's no more sound. I don't know what to do.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

The hexadecimal string is what we need, please :)
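
For reference, the linked wiki instructions use that identifier to look up the machine's uploaded crash reports on errors.ubuntu.com. A minimal sketch of building the lookup URL is below. The function name `whoopsie_url` is hypothetical, and the optional file argument exists only so the function can be tried against a test file. Reading the real file may require `sudo`.

```shell
# Build the errors.ubuntu.com lookup URL from a whoopsie ID file.
# Defaults to the system location mentioned above.
whoopsie_url() {
    id_file=${1:-/var/lib/whoopsie/whoopsie-id}
    [ -r "$id_file" ] || { echo "cannot read $id_file" >&2; return 1; }
    # Strip whitespace, including the trailing newline, around the ID.
    id=$(tr -d '[:space:]' < "$id_file")
    printf 'https://errors.ubuntu.com/user/%s\n' "$id"
}
```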

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Ah! Okay, here you go:

33b53024b6560a3f0dadd0a3a5d1ad8822eb6f710ba414e81e95f3d16e6364a25093a5802b05a6d529c5e1af30a072e3c4eae81025a134ac4ea39a6b8be4dc79

Revision history for this message
Daniel van Vugt (vanvugt) wrote :
affects: ubuntu → linux (Ubuntu)
Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

I recorded a video (using my phone) of what the computer looks like once the crash has happened:
https://www.youtube.com/watch?v=36kg68mfrds

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

It's gotten to the point where this happens at least once a day. I'm lost for answers.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

We should get some hints from the system log usually... Each time the crash happens please:

1. Wait 10 seconds.

2. Reboot.

3. Run:

   journalctl -b-1 > prevboot.txt

4. Attach the resulting text file here.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

I've tried that. As you've noted, that doesn't show any crashes. Is there anything that would prevent the system from writing a crash log?

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Sometimes the log will be incomplete which is why we need you to repeat those steps, please.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

It wasn't quite the same thing that happened, but in any case, I just got a +wall message saying there was a "soft lockup" (I have no idea what that means), so I figured I'd wait a couple minutes and then reboot the system, and then do the journalctl thing. So, here's the record of the previous boot.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I agree that doesn't sound like quite the same issue but it might be the same issue...

Sep 15 16:42:04 HajnaliCsillag kernel: watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]

affects: linux (Ubuntu) → linux-hwe-5.11 (Ubuntu)
tags: added: amdgpu nvidia
Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

It did the blackscreen crash thing again, so I ran journalctl. Hopefully, this one has some more answers.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Comment #19 seems to have a bunch of:

  kernel: rtw_8822ce 0000:03:00.0: pci bus timeout, check dma status

which can also be seen in the log of comment #17. And that message relates to:

  Realtek Semiconductor Co., Ltd. RTL8822CE 802.11ac PCIe Wireless Network Adapter

It's unclear to me whether that card is causing the problems or just a victim of some deeper bus/motherboard issue.
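
To judge how often a device reports this across several saved logs, the messages can be tallied per driver and PCI address. This is a rough sketch that assumes the journal line layout quoted above. The function name `pci_timeout_tally` is made up for illustration.

```shell
# Count "pci bus timeout" messages per driver and PCI address.
# Input:  journal text on stdin.
# Output: "<count> <driver> <pci-address>", one line per device.
pci_timeout_tally() {
    sed -n 's/.*kernel: \([^ ]*\) \([0-9a-f:.]*\): pci bus timeout.*/\1 \2/p' |
        sort | uniq -c | awk '{ print $1, $2, $3 }'
}

# Usage with the files produced by the steps in this thread:
#   cat prevboot*.txt | pci_timeout_tally
```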

Changed in linux-hwe-5.11 (Ubuntu):
status: Incomplete → New
Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Is there anything I can do to test my hardware to check?

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Also, I don't know if this is relevant, but I have two SSD drives. My main one (the one that keeps crashing and going blackscreen) is Xubuntu, and my secondary one is Kubuntu which, incidentally, works just fine. In fact, I actually do all my gaming on my Kubuntu installation (I've logged over 400 hours playing "The Outer Worlds") and it's never given me any issue. Meanwhile, my Xubuntu installation goes blackscreen almost every day.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Just crashed again. Maybe this report has more information.

summary: - Random system stops
+ [Asus TUF Gaming FA506IV] Random system stops
Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Is there anything I can do to narrow down the problem?

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

From yesterday

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

From yesterday

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

And, from today

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Okay, not exactly the same thing. It didn't do the blackscreen crash. What happened this time was that it stopped accepting input from the keyboard. I have no idea if it's related, but it might be, so I'm sending this one anyway. Maybe it might have some helpful answers.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

And another blackscreen from just now. Do these logs say anything useful? It happens at least once a day, at the absolute minimum. These constant crashes are getting seriously infuriating. It doesn't happen on Kubuntu. Just Xubuntu. Is there anything I can do to help you guys diagnose the problem?

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Just did it again. This time, it didn't do a blackscreen outright. Rather, the audio just suddenly started insta-repeating, and the system clenched up and froze, and became completely unresponsive. Is there anything I can do to help figure out what the problem is? Any sort of tests I can run? Having my computer Epstein itself multiple times a day is getting seriously infuriating.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Comment #30 is a crash in the 'nvidia' kernel module. To work around that I can only recommend uninstalling the 470 driver and installing the 460 or 465 driver instead.
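
After changing drivers and rebooting, it is worth confirming which version of the kernel module is actually loaded. A small sketch that extracts the version from the "NVRM version" line in /proc/driver/nvidia/version (the same line quoted in the bug description). The function name `nvidia_module_version` is an assumption for illustration.

```shell
# Extract the driver version from an "NVRM version:" line.
# Reads stdin and prints just the version, e.g. "470.57.02".
nvidia_module_version() {
    sed -n 's/.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Usage on the affected machine:
#   nvidia_module_version < /proc/driver/nvidia/version
```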

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

I wonder if it was the crash Nvidia mentioned in their latest release:

https://www.nvidia.com/en-us/drivers/results/180475/

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

I tried purging `nvidia-*` and installed `nvidia-driver-460` (since when I kept trying to install -465, it kept installing -470). So, I'll find out how that goes.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Also, here's a report from yesterday's third crash. Hopefully, there's some useful information in it.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

And, I'm honestly hesitant to suspect Nvidia, because I also use Nvidia 470 on my Kubuntu installation. I very clearly remember installing 470 because I use Kubuntu for gaming, and I had to update my Nvidia drivers for a couple of games to work properly. I'll find out how it goes, though.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Comment #34 appears to be a *different* kernel bug again:

  watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [apt-esm-hook:2930]

although what came before that is suspicious too:

  kernel: BUG: kernel NULL pointer dereference, address: 000000000000006c
  kernel: #PF: supervisor write access in kernel mode
  kernel: #PF: error_code(0x0002) - not-present page

So I would recommend trying:

 * Uninstalling VirtualBox and its kernel drivers to see if the crashes stop happening.
 * Testing the RAM overnight with https://www.memtest.org/ or https://www.memtest86.com/
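
To confirm which out-of-tree modules are still built for the running kernel after uninstalling, the `dkms status` output can be filtered by kernel version. This is a sketch that assumes the "name, version, kernel, arch: state" layout shown in the DkmsStatus section of this report. Other DKMS releases may print a different layout, and the function name `dkms_modules_for_kernel` is hypothetical.

```shell
# List DKMS modules built for a given kernel.
# Input:  `dkms status` text on stdin, one "name, version, kernel, arch: state" per line.
# Output: unique module names, one per line.
dkms_modules_for_kernel() {
    awk -F', ' -v k="$1" '
        $3 == k {
            sub(/^[ \t]+/, "", $1)   # tolerate indented lines, as in the report
            print $1
        }' | sort -u
}

# Usage:
#   dkms status | dkms_modules_for_kernel "$(uname -r)"
```

If `virtualbox` no longer appears for the running kernel, its kernel drivers are not being built for it anymore.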

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Just happened again with Nvidia-460. I'll try uninstalling VirtualBox and see what happens.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

So, I'd uninstalled VirtualBox and did a `sudo apt autoremove` and installed Nvidia 460, and it still blackscreened. I'll try installing Nvidia 450 and see what happens.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Nope. Deleting VirtualBox and its drivers and downgrading to Nvidia 450 didn't work either. This is happening multiple times a day now, and it never, ever happens with Kubuntu; only Xubuntu. I'm at a loss for ideas.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Second one today. Hopefully this has more information.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

And third one of the day from last night. Hopefully, this has some useful information. I'm seriously *this* close to just reinstalling. I can't work this way. Kubuntu never does this. I use my Kubuntu for gaming, and I'll play "The Outer Worlds" (a game that's very system-heavy) for like 10 - 16 hours at a time with no issue whatsoever on Kubuntu. Meanwhile, Xubuntu blackscreens an absolute minimum of once a day, usually around three times to five times a day. This is getting seriously infuriating.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Today's first crash report.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Second blackscreen crash today. Hopefully, this has some useful information in it.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

So, out of pure frustration, I wondered if maybe it was the particular SSD slot, so I decided to simply switch my SSDs across the slots on my motherboard (because the Kubuntu installation never crashed). Suddenly, now, that one was crashing, and the Xubuntu installation worked fine (for a while), so I thought it was the SSD slot. Then, the "good" slot started crashing randomly, too.

I started looking around for my specific computer, and it turns out that the "soft lockup" is from a misconfiguration in the grub setup. I thought to myself, "I haven't touched my grub. The only time I ever did anything was the `sudo apt dist-upgrade` about three weeks ago." And then it hit me: the crashes started about a day or two after I did a dist-upgrade. So I started looking around, and I found this:

https://github.com/jfinancial/linux/blob/main/AMD_Linux_Build.md

and subsequently this:

https://askubuntu.com/questions/1234299/amd-ryzen-5-3600-ubuntu-20-04-problems/1241636#1241636

so I went into /etc/default/grub and changed the line to:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=assign-busses apicmaintimer idle=poll reboot=cold,hard"

and did a `sudo update-grub`. So far, my computer has an uptime of around 13 hours, so this is looking promising. Also, I am never, ever, ever updating my kernel ever again, ever. Every time I do, something absolutely vitally, crucially important breaks. This is especially true about Arch and all Arch-based distros (which I'll never use again).

I'll try leaving my computer turned on for a week and see if changing that grub line helps. Here's to hoping.
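
Since the edit only takes effect after `sudo update-grub` and a reboot, it may be worth checking that the running kernel actually received the new parameters. A minimal sketch follows. The helper name `has_kernel_param` is made up for illustration, and /proc/cmdline is the standard place to read the active command line.

```shell
# Succeed if a kernel command line contains the given parameter,
# either bare ("quiet") or with a value ("idle=poll" matches "idle").
# Note: the command line is split on whitespace, so quoted values with
# spaces are not handled.
has_kernel_param() {
    for tok in $1; do
        case $tok in
            "$2"|"$2"=*) return 0 ;;
        esac
    done
    return 1
}

# Usage after rebooting:
#   has_kernel_param "$(cat /proc/cmdline)" idle && echo "idle= is active"
```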

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Alrighty, so, after having adjusted the grub configuration, this time around, it lasted about 16½ hours before coming to a freeze. Plus, actually, this time, it didn't outright blackscreen. What happened was: it froze and the audio began looping. I let the audio loop for a while in the hopes that the computer would have enough time to write the errors to the log. Hopefully, this has some answers.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Okay, so, after having changed the GRUB configuration, this time around, it lasted 1 day, 4 hours, and about 45 minutes before finally blackscreening. Also, during that time, at around 1 day and a little bit over 2 hours in, there was a +wall message across all my terminals saying there was a "soft lockup". Maybe this log has some useful info.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Blackscreened again. Hopefully this log has something useful.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Tried using:

nvme_core.default_ps_max_latency_us=5500

as inspired by this thread:

https://forums.linuxmint.com/viewtopic.php?t=307471

to create the line:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=assign-busses apicmaintimer idle=poll reboot=cold,hard nvme_core.default_ps_max_latency_us=5500"

in /etc/default/grub

This time, it lasted 11 hours before blackscreening. This time, I'll try setting it to:

nvme_core.default_ps_max_latency_us=0

and see what happens. Hopefully, this log has something useful in it.
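
When trying several values for the same parameter, the old value has to be replaced rather than appended a second time. A small sketch of a helper that does this on a command-line string is below. The name `set_kernel_param` is hypothetical, and the result still has to be written into /etc/default/grub by hand and followed by `sudo update-grub`.

```shell
# Return a kernel command line with key=value set exactly once.
# Any existing occurrence of the key is dropped and the new setting is
# appended at the end. Splits on whitespace only.
set_kernel_param() {
    out=""
    for tok in $1; do
        case $tok in
            "$2"|"$2"=*) ;;                    # drop the old setting
            *) out="${out:+$out }$tok" ;;
        esac
    done
    printf '%s\n' "${out:+$out }$2=$3"
}

# Usage:
#   set_kernel_param "quiet splash nvme_core.default_ps_max_latency_us=5500" \
#       nvme_core.default_ps_max_latency_us 0
# After rebooting, the active value can be read from
#   /sys/module/nvme_core/parameters/default_ps_max_latency_us
```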

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Lasted 16 hours before blackscreening (a new record). I'm all out of ideas.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Results of doing a memtest.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Today's crash.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Ever since I turned off the ability to manage CPU cores, these "soft lockups" have been happening a lot less often. Now, it's only once or twice a week, instead of 3 - 5 times a day. Today's crash report.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Today's report.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Also today's report.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Today's first report. Is there anything useful in these logs?

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Today's 2nd crash report. Is there anything useful in these logs?

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Today's 3rd crash report. Is there anything useful in these logs?

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Report from last night. Is there anything useful in these recent ones?

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Today's first crash, which happened WHILE I was taking the Linux Foundation Certified SysAdmin (LFCS) online exam. Is there anything helpful in these error report logs? Anything? Anything at all? Am I wasting my time?

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

From a few days ago

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Also from a few days ago

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

From a couple days ago

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

Also from a couple days ago

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

From yesterday

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

From today.

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

journalctl -p emerg

yields a "soft lockup" on all CPU cores, and then a "hard lockup" on almost all CPU cores until there's only one still left alive, which I'm guessing is when it crashes (i.e., that one last-remaining core locks up). Is there any way I can test this?
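
One way to check the "every core eventually locks up" reading is to list which CPUs reported a soft lockup in a given boot. This is only a sketch: it relies on the "soft lockup - CPU#N" message format quoted earlier in this thread, it does not cover hard lockup messages, and the function name `soft_lockup_cpus` is made up for illustration.

```shell
# List the CPU numbers that reported a soft lockup,
# one per line, in ascending order, without duplicates.
soft_lockup_cpus() {
    sed -n 's/.*soft lockup - CPU#\([0-9][0-9]*\).*/\1/p' | sort -n -u
}

# Usage:
#   journalctl -b -1 -p emerg | soft_lockup_cpus
```

Comparing the output against the number of cores in the machine shows whether the lockups really spread to every CPU before the crash.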

Revision history for this message
Kuroš Taheri-Golværzi (ktaherig) wrote :

I've been looking for possible fixes regarding the terms "NMI watchdog", "soft lockup", and "hard lockup", and nothing has worked. Does anybody reading this have any ideas?

Revision history for this message
Popa Adrian Marius (mapopa) wrote :

This seems to be a nouveau issue. Please install the official NVIDIA drivers from the Ubuntu repositories and blacklist the nouveau driver at boot.
