Repeatable hang within 5 minutes using stress-ng + sleep + usb mouse

Bug #1862281 reported by Luke Barone-Adesi
This bug affects 1 person
Affects: xorg-server (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (not set)

Bug Description

If I run a thermal transition test script (30 seconds stress-ng, 30 seconds sleep, in a loop) and move a local USB mouse, Kubuntu reliably crashes, usually in the first couple of runs and almost 100% of the time by run 6.

This appears to be hardware-linked, but not due to a specific piece of bad hardware: I have swapped literally every piece of hardware in the system.

It shows up (while running the script at the end):
- On both an MSI B450 Gaming Plus Max and an MSI MPG X570 Gaming Plus mainboard.
- On both an AMD Ryzen 5 3600 and 3600X CPU.
- With one or two sticks of RAM. I've tested both sticks individually, in more than one mainboard slot.
- Regardless of whether the mainboard is in/attached to a case.
- Regardless of whether there is an m.2 SSD installed or I'm running off a live Kubuntu 19.10 USB stick with no hard disk attached.
- Regardless of which of two mice I use (an old Logitech one, or a GTX 133 Gaming mouse).
- Regardless of whether I'm using a Corsair VS650 or Corsair AX850 PSU.
- Regardless of whether I'm using an AMD RX 5700 XT or a Gigabyte Nvidia GeForce RTX 2070 Super (with open source drivers in both cases).
- Regardless of whether I'm using KDE or XFCE.
- Regardless of whether I'm using the default KDE DM or switch to GDM3 and set WaylandEnable=false.
- Regardless of whether I use the default 5.3.0-29-generic kernel or 5.4.17-050417-generic.
- Regardless of whether I go directly into the graphical environment or start in runlevel 3 and then manually run startx.
- Regardless of whether it's on the rising or falling edge of the stress-script's temperature changes.
- Regardless of BIOS version on the X570 mainboard (the one it shipped with, or the newest one released in January 2020).
- Regardless of whether XMP is on or off in the BIOS.
- Regardless of whether I use the default or set "Global C-state Control" to "Disabled" in the BIOS.
- Regardless of whether I add processor.max_cstate=5 idle=halt to the kernel command line in grub.
- Regardless of whether or not speakers are plugged in.
- Regardless of whether I'm using a USB port that is directly on the motherboard or is on the front of the case.
- Regardless of which monitor it is attached to.
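When testing kernel parameters such as processor.max_cstate=5 idle=halt, it may be worth confirming that they actually reached the running kernel. The following is a minimal sketch, assuming the parameters are visible in /proc/cmdline; has_param and check_cmdline are hypothetical helper names, not existing tools.

```shell
#!/bin/bash
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole word in CMDLINE.
has_param() {
    local cmdline=$1 param=$2 word
    local -a words
    read -r -a words <<< "$cmdline"
    for word in "${words[@]}"; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}

# check_cmdline PARAM...: report each expected parameter against the
# command line of the running kernel (hypothetical helper).
check_cmdline() {
    local cmdline param
    cmdline=$(cat /proc/cmdline 2>/dev/null) || cmdline=""
    for param in "$@"; do
        if has_param "$cmdline" "$param"; then
            echo "present: $param"
        else
            echo "missing: $param"
        fi
    done
}
```

For the parameters mentioned above, the call would be check_cmdline processor.max_cstate=5 idle=halt. A "missing" line would suggest the grub change was not applied (for example, update-grub was not run before rebooting).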

It doesn't show up:
- On an old i7-4771 machine I have, also running Kubuntu 19.10, while running the test script.
- When I use a mouse remotely with ssh -Y [ip of the machine I am reporting this from] xeyes, while running the test script.
- When I do non-mouse USB input, i.e. via a USB keyboard or USB wifi dongle, including under saturated network load, while running the test script.
- During stress tests of the GPU, CPU, etc. Tools like memtest, mprime, Unigine Superposition, repeated kernel compiles, etc. run stably overnight.
- When the system is entirely idle aside from mouse movement.
- When I start in runlevel 3 and run the same test script, using the mouse with gpm.
- When I run the same test script without mouse movement: this was stable overnight, then crashed within a couple of minutes of moving the mouse.

It also shows up under loads other than the stress-ng+sleep script, but much less reliably. I'm writing this bug report on the affected machine, with Firefox open. Under these conditions crashes occur at least once a week, but not frequently.

Crashes occur with sensor-reported CPU temperatures of 32 to 41 degrees Celsius. Nothing is overheating, and the system is stable at much higher temperatures under sustained stress tests.

The symptoms of the crash: the display stops updating and the system does not respond to any further input, including via the network or magic sysrq key. There is nothing related to it in syslog or journalctl, including when I'm running journalctl -f at the time of the crash.

The test script:
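#!/bin/bash
# Thermal transition test: alternate 30 s of full CPU load with 30 s of idle.
for x in {1..10000}
do
        echo "Run $x at $(date)"
        # 12 workers: one per hardware thread of the 6-core/12-thread Ryzen 5 3600/3600X
        stress-ng --cpu 12 --cpu-method all --verify -t 30s --metrics-brief
        sleep 30
done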
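
Because the hang leaves nothing in syslog or journalctl, a variant of the script that records each phase change to disk might help narrow down when the hang occurs. This is only a sketch under assumptions: PHASE_LOG, log_phase and run_cycle are hypothetical names, and the log path is arbitrary (it needs to be on persistent storage, so it would not help on a live USB session without persistence).

```shell
#!/bin/bash
# Hypothetical log location; any path on persistent storage would do.
PHASE_LOG=${PHASE_LOG:-${HOME:-/tmp}/thermal-phase.log}

# log_phase RUN PHASE: append a timestamped line and flush it to disk,
# so the entry has a chance of surviving a hard hang.
log_phase() {
    printf '%s run=%s phase=%s\n' "$(date '+%F %T')" "$1" "$2" >> "$PHASE_LOG"
    sync
}

# run_cycle RUN: one load/idle cycle, using the same stress-ng
# invocation as the original test script.
run_cycle() {
    log_phase "$1" stress
    stress-ng --cpu 12 --cpu-method all --verify -t 30s --metrics-brief
    log_phase "$1" sleep
    sleep 30
}
```

The loop then becomes: for x in {1..10000}; do run_cycle "$x"; done. After a forced reboot, the last line of the log shows which run and which phase (load or idle) was in progress, and its timestamp shows how far into that phase the system got before the hang.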

ProblemType: Bug
DistroRelease: Ubuntu 19.10
Package: xorg 1:7.7+19ubuntu12
Uname: Linux 5.4.17-050417-generic x86_64
ApportVersion: 2.20.11-0ubuntu8.2
Architecture: amd64
BootLog: Error: [Errno 13] Permission denied: '/var/log/boot.log'
CompositorRunning: None
CurrentDesktop: KDE
Date: Fri Feb 7 00:02:22 2020
DistUpgraded: Fresh install
DistroCodename: eoan
DistroVariant: ubuntu
GraphicsCard:
 NVIDIA Corporation Device [10de:1e84] (rev a1) (prog-if 00 [VGA controller])
   Subsystem: Gigabyte Technology Co., Ltd Device [1458:4008]
InstallationDate: Installed on 2020-01-30 (7 days ago)
InstallationMedia: Kubuntu 19.10 "Eoan Ermine" - Release amd64 (20191017)
MachineType: Micro-Star International Co., Ltd. MS-7C37
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.4.17-050417-generic root=UUID=1a1a3bcc-cc59-4982-a1f5-f721ef6fe937 ro quiet splash acpi_enforce_resources=lax vt.handoff=7
SourcePackage: xorg
Symptom: display
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 01/08/2020
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: A.71
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: MPG X570 GAMING PLUS (MS-7C37)
dmi.board.vendor: Micro-Star International Co., Ltd.
dmi.board.version: 2.0
dmi.chassis.asset.tag: To be filled by O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: Micro-Star International Co., Ltd.
dmi.chassis.version: 2.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrA.71:bd01/08/2020:svnMicro-StarInternationalCo.,Ltd.:pnMS-7C37:pvr2.0:rvnMicro-StarInternationalCo.,Ltd.:rnMPGX570GAMINGPLUS(MS-7C37):rvr2.0:cvnMicro-StarInternationalCo.,Ltd.:ct3:cvr2.0:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: MS-7C37
dmi.product.sku: To be filled by O.E.M.
dmi.product.version: 2.0
dmi.sys.vendor: Micro-Star International Co., Ltd.
version.compiz: compiz N/A
version.libdrm2: libdrm2 2.4.99-1ubuntu1
version.libgl1-mesa-dri: libgl1-mesa-dri 19.2.8-0ubuntu0~19.10.2
version.libgl1-mesa-glx: libgl1-mesa-glx N/A
version.xserver-xorg-core: xserver-xorg-core 2:1.20.5+git20191008-0ubuntu1
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev N/A
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:19.0.1-1ubuntu1
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.99.917+git20190815-1
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:1.0.16-1

Luke Barone-Adesi (baluke) wrote :
description: updated
affects: ubuntu → xorg (Ubuntu)
Daniel van Vugt (vanvugt) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It sounds like some part of the system has crashed. To help us find the cause of the crash please follow these steps:

1. Look in /var/crash for crash files and if found run:
    ubuntu-bug YOURFILE.crash
Then tell us the ID of the newly-created bug.

2. If step 1 failed then look at https://errors.ubuntu.com/user/ID where ID is the content of file /var/lib/whoopsie/whoopsie-id on the machine. Do you find any links to recent problems on that page? If so then please send the links to us.

3. If step 2 also failed then apply the workaround from bug 994921, reboot, reproduce the crash, and retry step 1.

Please take care to avoid attaching .crash files to bugs as we are unable to process them as file attachments. It would also be a security risk for yourself.
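
Steps 1 and 2 can be sketched in shell roughly as follows. The paths /var/crash and /var/lib/whoopsie/whoopsie-id come from the steps above; list_crash_files and errors_url are hypothetical helper names, and reading the whoopsie ID may require root.

```shell
#!/bin/bash
# list_crash_files [DIR]: print any .crash files in DIR (default /var/crash).
list_crash_files() {
    local dir=${1:-/var/crash} f
    for f in "$dir"/*.crash; do
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

# errors_url [ID_FILE]: print the errors.ubuntu.com page for this machine.
errors_url() {
    local id_file=${1:-/var/lib/whoopsie/whoopsie-id}
    if [ ! -r "$id_file" ]; then
        echo "cannot read $id_file (root may be required)" >&2
        return 1
    fi
    echo "https://errors.ubuntu.com/user/$(cat "$id_file")"
}
```

Each path printed by list_crash_files can be passed to ubuntu-bug as in step 1. If nothing is printed, errors_url gives the page to check in step 2.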

tags: added: nouveau
tags: removed: nouveau
affects: xorg (Ubuntu) → ubuntu
Changed in ubuntu:
status: New → Incomplete
Daniel van Vugt (vanvugt) wrote :

4. If all of the above steps fail then please reproduce the crash, reboot and then immediately run:

   journalctl -b-1 > prevboot.txt

   and attach the file 'prevboot.txt'.

Luke Barone-Adesi (baluke) wrote :

All of the above steps fail, though I've reported two unrelated bugs from /var/crash now.

I've reproduced the problem three times in the last 15 minutes.
* The first reproduction was under KDE with the script mentioned in my first bug report.
* The second reproduction was under KDE while trying to report the first reproduction.
* The third reproduction was under XFCE, to underscore the point that nothing useful for diagnosing this problem shows up near the end of journalctl's output. The last entries are warnings from nm-applet at 19:41:06, but the system crashed 27 seconds later, at 19:41:33.
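
The gap between the last journal entry and the hang can be checked directly from the two timestamps. A small sketch in bash; to_seconds and gap_seconds are hypothetical helper names, and both times are assumed to fall on the same day.

```shell
#!/bin/bash
# to_seconds HH:MM:SS: convert a time of day to seconds since midnight.
to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that fields like "08" are not parsed as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# gap_seconds START END: seconds from START to END on the same day.
gap_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

gap_seconds 19:41:06 19:41:33   # prints 27
```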

As I said during the initial report:
"The symptoms of the crash: the display stops updating and the system does not respond to any further input, including via the network or magic sysrq key. There is nothing related to it in syslog or journalctl, including when I'm running journalctl -f at the time of the crash."

Daniel van Vugt (vanvugt) wrote :

Thanks. It sounds like the problem is either in Xorg or in the kernel.

Please log into "Ubuntu on Wayland" or install 'weston' and try that. We need to know if the same bug occurs in a Wayland compositor without any Xorg server running.

affects: ubuntu → xorg-server (Ubuntu)
Changed in linux (Ubuntu):
status: New → Incomplete
Luke Barone-Adesi (baluke) wrote :

I've installed weston, run telinit 3, and from the terminal run weston-launch. 20 minutes of moving the mouse later, this appears to be stable (just as gpm in a console after booting into runlevel 3 is).

Daniel van Vugt (vanvugt) wrote :

Thanks. Please also test "Ubuntu on Wayland" to be sure. It sounds like the problem is specific to Xorg but we need to be sure it's not the kernel.

Luke Barone-Adesi (baluke) wrote :

The problem does not show up on "Ubuntu on Wayland" (as judged by 20 minutes of moving the mouse while running the stress-ng script in the first post).
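
To be sure that a test like this really ran without an Xorg server, the session type can be checked from a terminal inside the session. This is a sketch that assumes XDG_SESSION_TYPE is set by the login manager and that the X server process is named Xorg; describe_session and xorg_running are hypothetical helper names.

```shell
#!/bin/bash
# describe_session [TYPE]: describe a session type, defaulting to
# the XDG_SESSION_TYPE of the current login session.
describe_session() {
    case "${1:-${XDG_SESSION_TYPE:-}}" in
        wayland) echo "Wayland session" ;;
        x11)     echo "X11 session" ;;
        tty)     echo "text console" ;;
        *)       echo "unknown session type" ;;
    esac
}

# xorg_running: succeed if a process named exactly Xorg exists.
# Xwayland, which Wayland sessions may start for X11 clients, does not match.
xorg_running() {
    pgrep -x Xorg > /dev/null 2>&1
}
```

Running describe_session followed by xorg_running && echo "Xorg is running" inside the session gives a quick confirmation of which display server is in use.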

no longer affects: linux (Ubuntu)
Changed in xorg-server (Ubuntu):
status: Incomplete → New
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in xorg-server (Ubuntu):
status: New → Confirmed