Gpu watchdog segfault and video+kbd+mouse freeze on optiplex 7060 intel gpu

Bug #1861294 reported by Bogdan Harjoc
This bug affects 11 people
Affects: linux (Ubuntu)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

Running up-to-date Ubuntu-18.04.3 with kernel 5.3.0-26 on a Dell Optiplex 7060 with an i7-8700 CPU and Intel UHD Graphics 630 (Coffeelake 3x8 GT2).

I had Chrome, Slack and vmware-player running in Gnome. During a git clone, the screen, mouse and keyboard froze for about 2 minutes, after which Xorg and everything else recovered. I saw this in dmesg:

kernel: show_signal_msg: 2 callbacks suppressed
kernel: GpuWatchdog[20399]: segfault at 0 ip 0000556fd1665ded sp 00007efbf17e46c0 error 6 in chrome[556fcd72a000+7171000]
kernel: Code: 48 c1 c9 03 48 81 f9 af 00 00 00 0f 87 c9 00 00 00 48 8d 15 a9 5a 9c fb f6 04 11 20 0f 84 b8 00 00 00 be 01 00 00 00 ff 50 30 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 c1 6d
kernel: nvme nvme0: I/O 202 QID 6 timeout, aborting
kernel: nvme nvme0: I/O 203 QID 6 timeout, aborting
kernel: nvme nvme0: I/O 204 QID 6 timeout, aborting
kernel: nvme nvme0: I/O 205 QID 6 timeout, aborting
kernel: nvme nvme0: Abort status: 0x0
kernel: nvme nvme0: Abort status: 0x0
kernel: nvme nvme0: Abort status: 0x0
kernel: nvme nvme0: Abort status: 0x0
kernel: nvme nvme0: I/O 202 QID 6 timeout, reset controller
kernel: nvme nvme0: 12/0/0 default/read/poll queues
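
For anyone comparing reports, the segfault line above can be decoded mechanically. The following is a minimal Python sketch, not an official tool: the function name `parse_segfault` and the returned field names are made up for illustration. It assumes the usual x86 page-fault error code layout (bit 0 = page present, bit 1 = write access, bit 2 = user mode) and that the kernel prints the error code in hex. It also computes the offset of the faulting instruction inside the mapped binary, which is the value to give a symbolizer if you have matching debug symbols.

```python
import re

# Matches lines such as:
#   GpuWatchdog[20399]: segfault at 0 ip ... sp ... error 6 in chrome[base+size]
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    err = int(m.group("err"), 16)
    return {
        "comm": m.group("comm"),
        "object": m.group("obj"),
        "fault_addr": int(m.group("addr"), 16),
        "offset": ip - base,             # instruction offset inside the mapping
        "page_present": bool(err & 1),   # 0 means the page was not mapped
        "write": bool(err & 2),          # 1 means the access was a write
        "user_mode": bool(err & 4),      # 1 means the fault came from user space
    }


if __name__ == "__main__":
    sample = (
        "kernel: GpuWatchdog[20399]: segfault at 0 ip 0000556fd1665ded "
        "sp 00007efbf17e46c0 error 6 in chrome[556fcd72a000+7171000]"
    )
    info = parse_segfault(sample)
    print(info["comm"], hex(info["offset"]), info)
```

For the line in this report, "error 6" reads as a user-mode write to an unmapped page, and "segfault at 0" means the target address was NULL.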

While writing this bug report, the system froze again, and this time it didn't recover. After a cold reset I didn't see any other GpuWatchdog messages in journalctl.

Ubuntu applied a BIOS firmware update before the first freeze, so my BIOS was updated as part of the cold reset I did. Not sure if this is relevant to reproducing the freeze.

Tags: bionic amd64
Revision history for this message
Bogdan Harjoc (harjoc-gmail) wrote :

Issue occurred again after BIOS update, during make -j12. I also had chrome and vmplayer running. Dmesg errors from journalctl:

kernel: pcieport 0000:00:1b.0: AER: Multiple Corrected error received: 0000:01:00.0
kernel: pcieport 0000:00:1b.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
kernel: pcieport 0000:00:1b.0: AER: device [8086:a340] error status/mask=00001000/00002000
kernel: pcieport 0000:00:1b.0: AER: [12] Timeout
kernel: nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
kernel: nvme 0000:01:00.0: AER: device [1344:5410] error status/mask=00000040/00002000
kernel: nvme 0000:01:00.0: AER: [ 6] BadTLP
kernel: nvme 0000:01:00.0: AER: Error of this Agent is reported first
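
The "error status/mask" pairs in the AER lines above are bit fields. Here is a small Python sketch for decoding them, assuming the standard bit positions of the PCIe Correctable Error Status register; the helper name `decode_aer` and the table name are hypothetical, and the long bit names are the spec-style names rather than the abbreviations the kernel prints (for example "Timeout" and "BadTLP").

```python
# Bit positions in the PCIe AER Correctable Error Status / Mask registers.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_aer(status_mask):
    """Decode a 'status/mask' string such as '00001000/00002000'.

    Returns (reported, masked): the error names set in the status word and
    the error names set in the mask word.
    """
    status_hex, mask_hex = status_mask.split("/")
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    reported = [n for b, n in CORRECTABLE_BITS.items() if (status >> b) & 1]
    masked = [n for b, n in CORRECTABLE_BITS.items() if (mask >> b) & 1]
    return reported, masked


if __name__ == "__main__":
    for pair in ("00001000/00002000", "00000040/00002000"):
        print(pair, decode_aer(pair))
```

This agrees with the "[12] Timeout" and "[ 6] BadTLP" lines the kernel printed: the root port saw a replay timer timeout and the NVMe device saw a bad TLP, both corrected link-level errors.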

Revision history for this message
Timo Aaltonen (tjaalton) wrote :

I'd say your hardware is falling apart

affects: intel-gpu-tools (Ubuntu) → ubuntu
Revision history for this message
Timo Aaltonen (tjaalton) wrote :

Hmm actually, are you able to reproduce this? Do you have an earlier 5.3.0 kernel to try (-24 for instance)?

affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Timo Aaltonen (tjaalton) wrote :

-24 is available, install it with:

sudo apt install linux-image-5.3.0-24-generic linux-modules-5.3.0-24-generic linux-modules-extra-5.3.0-24-generic

then select it from the grub menu

Revision history for this message
Bogdan Harjoc (harjoc-gmail) wrote :

Similar issue on Red Hat Bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=1616364

I added the boot cmdline options mentioned there, so far so good.

On Thu, Jan 30, 2020 at 2:10 PM Timo Aaltonen <email address hidden> wrote:
>
> I'd say your hardware is falling apart

Revision history for this message
Timo Aaltonen (tjaalton) wrote :

so you have radeon graphics?

Revision history for this message
Bogdan Harjoc (harjoc-gmail) wrote :

I have intel graphics as described in the initial report. The machine
just crashed again while doing apt install, no nvme or any other
relevant errors in journalctl this time. I downgraded to
linux-image-5.0.0-23-generic and if it's stable I will update to
5.3.0-24 and report back.

On Thu, Jan 30, 2020 at 6:50 PM Timo Aaltonen <email address hidden> wrote:
>
> so you have radeon graphics?

Revision history for this message
Bogdan Harjoc (harjoc-gmail) wrote :

No crashes in the last 4 days using 5.0.0-23-generic, will try with 5.3.0-24.

Revision history for this message
Bogdan Harjoc (harjoc-gmail) wrote :

5.3.0-24 crashed after 2 regular work days.

Revision history for this message
Bogdan Harjoc (harjoc-gmail) wrote :

Issue still reproduces on 5.3.0-40. Currently trying https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.5.6/

Revision history for this message
Alberto Pretto (alberto-pretto) wrote :

Same problem here:

Kernel: GpuWatchdog[19121]: segfault at 0 ip 000055d34d79afa2 sp 00007f3c50a286c0 error 6 in chrome[55d349854000+7287000]
Kernel: Code: 83 c3 e8 75 e9 41 8b 85 00 01 00 00 85 c0 0f 84 99 00 00 00 48 8d 3d f3 60 4b fb be 01 00 00 00 ba 03 00 00 00 e8 be 17 a6 fe <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 fc 76 b9 03 01 80 7d 8f 00

Running up-to-date Ubuntu 18.04 with kernel 5.3.0-40-generic on a Lenovo P50 with an Intel Xeon E3-1505M v5 2.80GHz CPU and Nvidia Quadro M2000M (Nvidia driver version 435.21).
The Gnome environment froze during normal activities, with no way to recover. I was only able to switch to a console with Ctrl+Alt+Fx in order to copy the message above from dmesg.
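
The "Code:" lines in these reports all have the same bytes at the faulting instruction (the byte in angle brackets): c7 04 25 00 00 00 00 37 13 00 00. The Python sketch below decodes that one x86-64 encoding, a 32-bit immediate store to an absolute address; the function name `decode_faulting_store` is hypothetical and it does not disassemble anything else. If my reading is right, the instruction writes 0x1337 to address 0, which looks like a deliberate crash by the browser's GPU watchdog after it decided the GPU process had hung, rather than a stray pointer bug. Treat that interpretation as an assumption until someone confirms it against the Chromium source.

```python
import re


def decode_faulting_store(code_line):
    """Decode `mov dword ptr [disp32], imm32` at the <..> marker.

    Returns (address, value), or None if the marked instruction is not
    this particular encoding.
    """
    m = re.search(r"<([0-9a-f]{2})>((?: [0-9a-f]{2})+)", code_line)
    if m is None:
        return None
    insn = bytes.fromhex(m.group(1) + m.group(2).replace(" ", ""))
    # Opcode C7 /0 with ModRM=0x04 and SIB=0x25 means: no base register,
    # no index register, 32-bit absolute displacement, 32-bit immediate.
    if len(insn) < 11 or insn[:3] != b"\xc7\x04\x25":
        return None
    address = int.from_bytes(insn[3:7], "little")
    value = int.from_bytes(insn[7:11], "little")
    return address, value


if __name__ == "__main__":
    line = (
        "Code: be 01 00 00 00 ba 03 00 00 00 e8 be 17 a6 fe "
        "<c7> 04 25 00 00 00 00 37 13 00 00 c6 05 fc 76 b9 03 01"
    )
    address, value = decode_faulting_store(line)
    print(hex(address), hex(value))
```

If that holds, the segfault is a symptom of the freeze, not its cause, which would fit the reports here that span Intel and Nvidia GPUs and several Chromium-based applications.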

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you please test 5.3.0-45.37?

Revision history for this message
Bogdan Harjoc (harjoc-gmail) wrote :

The optiplex-7060 I was testing on is in the office, I'm working from
home and can't test at the moment unfortunately.

On Tue, Mar 31, 2020 at 9:25 AM Kai-Heng Feng
<email address hidden> wrote:
>
> Can you please test 5.3.0-45.37?

Revision history for this message
andrej (4ndrej) wrote :

@alberto-pretto: any news? My wife's Lenovo with Intel GPU started to freeze from time to time...

Revision history for this message
andrej (4ndrej) wrote :

It seems to be related to the kernel version. The latest kernel-5.6.14-300 crashed at least daily.
Booting kernel-5.0.9-301 helps a lot - no crash for 2 days.
Before upgrade I used kernel-5.3.11-100 for 5 months and it was stable with no GpuWatchdog segfault in chrome errors at all.

Revision history for this message
Lucas Teske (teske) wrote :

I'm also experiencing this problem.

Linux nblucas 5.4.0-33-generic #37-Ubuntu SMP Thu May 21 12:53:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[44262.937772] GpuWatchdog[1602155]: segfault at 0 ip 00005616ee370587 sp 00007f75ed8b64d0 error 6 in mqtt-explorer.bin[5616eb195000+53d8000]
[44262.937777] Code: 7d b7 00 79 09 48 8b 7d a0 e8 05 51 d3 fe 8b 83 00 01 00 00 85 c0 0f 84 91 00 00 00 48 8b 03 48 89 df be 01 00 00 00 ff 50 68 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 57 33 70 02 01 80 7d 87 00

With an Nvidia GTX 980m and official Nvidia drivers 440.64. It wasn't happening before with Ubuntu 18.04 but happens with Ubuntu 20.04.

Also some side notes:

* It only happens with chrome-like instances (like Google Chrome, Chromium, Brave, Electron) open
* Occasionally it recovers after being frozen for about 10 minutes, but only rarely
* Happens at least once per day (I've been using it for two weeks now, and every day it freezes at least once, which sometimes gets really annoying)
* I use XFCE4 with Compton

The laptop is an Avell (Clevo rebrand) with an i7 6820HK. The temperatures are always low before freezing, but after it freezes all fans go to maximum (and the GPU temperature reaches about 70 degrees Celsius).

Revision history for this message
Stefano Forli (ntropia) wrote :

Very similar behavior and error message on Xeon E5-1620 and NVIDIA card (Quadro M2000).
Kernel 4.19.0-9-amd (Debian)

Revision history for this message
Mohammad Ali Toufighi (alitou) wrote :

Same issue with 5.4.0-58-generic kernel on Ubuntu 20.04 and Intel UHD graphics.

This can be seen in the logs:
kernel: [58329.813068] GpuWatchdog[9872]: segfault at 0 ip 00005623a08ff439 sp 00007f46ae892680 error 6 in code[56239d2cf000+57ee000]
kernel: [58329.813084] Code: 00 79 09 48 8b 7d c0 e8 45 3d c0 fe c7 45 c0 aa aa aa aa 0f ae f0 41 8b 84 24 e0 00 00 00 89 45 c0 48 8d 7d c0 e8 97 50 9d fc <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 38 5b 41 5c 41 5d 41 5e

Revision history for this message
Kienyew (kienyew) wrote :

This has affected me many times. I'm using an Acer Swift-3 laptop with Intel UHD graphics.

Jan 15 17:37:22 KY-PC kernel: GpuWatchdog[82704]: segfault at 0 ip 000055c640e6f7a9 sp 00007f1f32e4c4e0 error 6 in Typora[55c63d51f000+5cbc000]
Jan 15 17:37:22 KY-PC kernel: Code: 00 79 09 48 8b 7d c0 e8 d5 f6 bd fe c7 45 c0 aa aa aa aa 0f ae f0 41 8b 84 24 e0 00 00 00 89 45 c0 48 8d 7d c0 e8 67 5c 6b fc <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 38 5b 41 5c 41 5d 41 5e
Jan 15 17:37:22 KY-PC kernel: audit: type=1701 audit(1610703442.770:463): auid=1000 uid=1000 gid=1000 ses=2 pid=82684 comm="GpuWatchdog" exe="/usr/share/typora/Typora" sig=11 res=1
Jan 15 17:37:22 KY-PC audit[82684]: ANOM_ABEND auid=1000 uid=1000 gid=1000 ses=2 pid=82684 comm="GpuWatchdog" exe="/usr/share/typora/Typora" sig=11 res=1

Revision history for this message
Kienyew (kienyew) wrote :

It seems to be a bug in the driver from the xf86-video-intel package. Once I uninstalled it, I never encountered the issue again.

Revision history for this message
Ken Sharp (kennybobs) wrote :

Are you still having this issue?

tags: added: amd64 bionic