kernel panic in Radeon driver while screen blank

Bug #1763273 reported by Chris Darroch on 2018-04-12
This bug affects 2 people
Affects          Status  Importance  Assigned to  Milestone
linux (Ubuntu)           Undecided   Unassigned
xorg (Ubuntu)            Undecided   Unassigned

Bug Description

I maintain a set of older kiosks running a mix of stock Ubuntu 14.04 LTS and 16.04. As we have gradually transitioned them to 16.04, we have noticed that the machines running 16.04.1 now regularly exhibit problems restoring from the blank screen which appears after a period of inactivity.

This only occurs on the machines where we've installed 16.04. All of the 16.04.1 installations are "fresh", that is, complete re-installs from scratch, and we install all security and other updates on a regular basis. This problem has persisted right from the beginning when we started using 16.04.

Following a period of user inactivity, the screen goes blank; this is, of course, expected. When the user tries to restore the session by typing a key or moving the mouse, one of three outcomes occurs.

Sometimes, the session restores normally. Other times, the screen remains blank despite any normal keyboard input, until a magic SysRq sequence is performed. After Alt-SysRq-k, either the virtual console resets to a login screen (as expected after this SysRq sequence), or we get a kernel panic text screen (see attached screen photo) and can only recover via a hard power-cycle reboot.

I will keep trying to capture better trace output from one of these conditions, perhaps from /var/log/kern.log or by installing linux-crashdump (although these old machines may not have enough memory for the stock package). In the meantime, I'm attaching a photo of the kernel panic we see, which suggests the problem may lie in the Radeon driver. That seems plausible, given the general blank-screen, no-resume problem on these systems.

Any advice on how to further capture any other needed details would be appreciated. Thanks very much!

Chris Darroch (cdarroch) on 2018-04-12
description: updated
Chris Darroch (cdarroch) on 2018-04-12
description: updated
Simon Quigley (tsimonq2) on 2018-04-12
no longer affects: ubuntu

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1763273

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Chris Darroch (cdarroch) wrote :

I'm only on-site with these kiosks once a week, but the next time I'm there I will run apport-collect on one of the affected systems, and see if we can't capture the problem. I'll also investigate whether there are any messages in kernel log files, as best I can.

RBL Admin (rbladmin) wrote :

Further details from apport-bug are in #1764211; I'll copy them into this report using my "cdarroch" personal Launchpad account in a few hours, when I get home.

RBL Admin (rbladmin) wrote :

Further debugging today while I'm on-site with the kiosks. The system froze, as usual, after normal idle screen blanking. I was able to login via ssh and retrieve some details.

First, it appears that the basic screen blanking problem is somehow related to the Radeon driver. I will attach the Xorg.0.log file and a gdb backtrace of the running but non-responsive Xorg process. No keyboard or mouse activity will unblank the screen.

Here's the output from lspci | grep VGA:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV516 [Radeon X1300/X1550 Series]

I'll also attach the PCI ROM dump from the /sys/devices rom file for this device.

This suggests that when we use Alt-SysRq-k to restore the system, sometimes we trigger *another* bug in the driver which results in the overall system crash. See the attached apr1.full.kern.log file for what details I could recover from the crash on April 1st; that corresponds to the screen photo I took that day (the radeon_kernel_panic1.jpg attachment).
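An extract like apr1.full.kern.log (every /var/log/kern.log* entry from one day) could be reproduced with something like the sketch below. It assumes rsyslog's default timestamp prefix, in which a single-digit day is padded with a space (`Apr  1`), and that older rotated logs may be gzip-compressed; the function name `extract_day` is hypothetical.

```shell
#!/bin/sh
# Sketch: print every kernel log line from one calendar day, across rotated logs.
# Assumes rsyslog's default "Mmm dd hh:mm:ss" prefix, with single-digit days
# padded by a space (e.g. "Apr  1"). Lines come out in argument order, so
# list the oldest file first for chronological output.
# Usage: extract_day MONTH DAY FILE...
extract_day() {
    month=$1
    day=$2
    shift 2
    # %2d pads the day to two columns, matching the syslog timestamp.
    prefix=$(printf '%s %2d ' "$month" "$day")
    for f in "$@"; do
        # gzip -cdf decompresses .gz files and passes plain files through unchanged.
        gzip -cdf -- "$f"
    done | grep -e "^$prefix"
}
```

For example, `extract_day Apr 1 /var/log/kern.log.2.gz /var/log/kern.log.1 /var/log/kern.log > apr1.full.kern.log`; which rotated files hold that day depends on how far logrotate has advanced.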

The Xorg "hang" bug which precedes this appears similar, but not identical, to what is documented in #1664979:

https://bugs.launchpad.net/ubuntu/+source/xorg/+bug/1664979

I will keep trying to study the live Xorg process to see if I can determine where it's hung, but I may not have time this evening. Let me know if there are specific further debugging steps I can take to examine either the Xorg hang or the subsequent crash that (sometimes) follows an attempt to recover with Alt-SysRq-k. (And note that I'll copy the files apport-bug generated in #1764211 into this report later tonight, using my "cdarroch" personal account, and then I'll try to close that extra bug report.)

Thanks very much,
Chris.

RBL Admin (rbladmin) wrote :

While logged in via ssh to the machine with the "hung" Xorg (screen blank, no way to recover except Alt-SysRq-k), I dumped the output of Alt-SysRq-t into the syslog -- see attached file. Not sure this is much use, however. Just trying to be complete while I'm on-site with a hung machine.
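Two SysRq details may matter when repeating this over ssh: writing a command letter to /proc/sysrq-trigger as root (for example `echo t | sudo tee /proc/sysrq-trigger`) is always allowed, whereas the keyboard combinations are filtered by the kernel.sysrq bitmask. The sketch below checks that bitmask, using the bit values from the kernel's SysRq documentation (4 covers keyboard control, including SAK, i.e. Alt-SysRq-k; 8 covers debugging dumps, including the Alt-SysRq-t task dump). The helper name `sysrq_allows` is hypothetical.

```shell
#!/bin/sh
# Sketch: does a kernel.sysrq value permit a given keyboard SysRq function?
# Bit values per the kernel's SysRq documentation:
#   4 = keyboard control (SAK, unraw), e.g. Alt-SysRq-k
#   8 = debugging dumps of processes, e.g. Alt-SysRq-t
# A value of exactly 1 enables every function; 0 disables them all.
# Usage: sysrq_allows VALUE BIT
sysrq_allows() {
    value=$1
    bit=$2
    # Exactly 1 means "everything enabled"; it is not treated as a bitmask.
    [ "$value" -eq 1 ] && return 0
    # Otherwise the function is allowed only if its bit is set.
    [ $(( value & bit )) -ne 0 ]
}
```

On a live system, `sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 4 && echo "Alt-SysRq-k enabled"` reads the current setting without changing it.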

Chris Darroch (cdarroch) wrote :

Here are the details from apport-bug, as promised (copied from #1764211 because I had to use a different Launchpad account):

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: xorg 1:7.7+13ubuntu3
ProcVersionSignature: Ubuntu 4.4.0-119.143-generic 4.4.114
Uname: Linux 4.4.0-119-generic x86_64
.tmp.unity_support_test.0:

ApportVersion: 2.20.1-0ubuntu2.15
Architecture: amd64
BootLog: /dev/sda1: clean, 257395/9510912 files, 1925432/38014720 blocks
CompizPlugins: No value set for `/apps/compiz-1/general/screen0/options/active_plugins'
CompositorRunning: compiz
CompositorUnredirectDriverBlacklist: '(nouveau|Intel).*Mesa 8.0'
CompositorUnredirectFSW: true
Date: Sun Apr 15 18:28:07 2018
DistUpgraded: Fresh install
DistroCodename: xenial
DistroVariant: ubuntu
EcryptfsInUse: Yes
ExtraDebuggingInterest: Yes
GpuHangFrequency: Several times a week
GpuHangReproducibility: Occurs more often under certain circumstances
GpuHangStarted: Immediately after installing this version of Ubuntu
GraphicsCard:
 Advanced Micro Devices, Inc. [AMD/ATI] RV516 [Radeon X1300/X1550 Series] [1002:7187] (prog-if 00 [VGA controller])
   Subsystem: PC Partner Limited / Sapphire Technology RV516 [Radeon X1300/X1550 Series] [174b:e020]
   Subsystem: PC Partner Limited / Sapphire Technology RV516 [Radeon X1300/X1550 Series] (Secondary) [174b:e021]
InstallationDate: Installed on 2017-07-24 (265 days ago)
InstallationMedia: Ubuntu 16.04.1 LTS "Xenial Xerus" - Release amd64 (20160719)
ProcEnviron:
 LANGUAGE=en_CA:en
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_CA.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-119-generic root=UUID=fe381ac7-304a-4c28-b497-bdefa6602251 ro quiet splash vt.handoff=7
SourcePackage: xorg
Symptom: display
Title: Xorg freeze
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 01/24/2007
dmi.bios.vendor: LENOVO
dmi.bios.version: 2JKT30AUS
dmi.board.name: LENOVO
dmi.chassis.vendor: LENOVO
dmi.modalias: dmi:bvnLENOVO:bvr2JKT30AUS:bd01/24/2007:svnLENOVO:pn8811BJ8:pvrThinkCentreM55:rvnLENOVO:rnLENOVO:rvr:cvnLENOVO:ct7:cvr:
dmi.product.name: 8811BJ8
dmi.product.version: ThinkCentre M55
dmi.sys.vendor: LENOVO
version.compiz: compiz 1:0.9.12.2+16.04.20160823-0ubuntu1
version.ia32-libs: ia32-libs N/A
version.libdrm2: libdrm2 2.4.83-1~16.04.1
version.libgl1-mesa-dri: libgl1-mesa-dri 17.2.8-0ubuntu0~16.04.1
version.libgl1-mesa-dri-experimental: libgl1-mesa-dri-experimental N/A
version.libgl1-mesa-glx: libgl1-mesa-glx 17.2.8-0ubuntu0~16.04.1
version.xserver-xorg-core: xserver-xorg-core 2:1.18.4-0ubuntu0.7
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev 1:2.10.1-1ubuntu2
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:7.7.0-1
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.99.917+git20160325-1ubuntu1.2
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:1.0.12-1build2
xserver.bootTime: Sun Apr 15 17:45:51 2018
xserver.configfile: default
xserver.devices:
 input Power Button KEYBOARD, id 6
 input Power Button KEYBOARD, id 7
 input Dell Dell USB Keyboard KEYBOARD, id 8
 input USB Optical Mouse MOUSE, id 9
xserver.errors:

xserver.logfile: /var/log/Xorg.0.log
xser...

tags: added: amd64 apport-bug compiz-0.9 freeze ubuntu xenial
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in xorg (Ubuntu):
status: New → Confirmed
Chris Darroch (cdarroch) wrote :

OK ... I think that's all I can do for today. I've uploaded as attachments to this original report all the files generated by apport-bug today, and marked the other ticket as a duplicate.

(I apologize for that, but I can't use my personal Launchpad account from the kiosks, and I'm only there once a week, so I used the "RBL Admin" account to generate the requested apport-bug files. I've marked the extra ticket as a duplicate of this one now that I've copied over all the files.)

As I noted earlier when I had direct access to the machines, this seems to be a pair of issues, both possibly in the Radeon driver. The "unable to unblank after idle" problem shows up in today's Xorg.0.log file, and also in the XorgLogOld.txt file from apport-bug, as Xorg errors reporting that keyboard and mouse events are being dropped.
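To compare how often input is being dropped in each attached log, one option is to count the relevant messages per file. This sketch assumes the dropped-event errors are the X server's `[mi] EQ overflowing` event-queue messages; the exact wording is only in the attached logs, so adjust the pattern if it differs. The function name `count_eq_overflow` is hypothetical.

```shell
#!/bin/sh
# Sketch: count X server event-queue overflow messages in each given log.
# Assumption: the dropped-input errors contain the text "EQ overflowing".
# Usage: count_eq_overflow LOG...
count_eq_overflow() {
    for log in "$@"; do
        # grep -c prints 0 but exits 1 when nothing matches, so ignore its status.
        n=$(grep -c -e 'EQ overflowing' "$log") || true
        printf '%s: %s\n' "$log" "$n"
    done
}
```

For example, `count_eq_overflow /var/log/Xorg.0.log /var/log/Xorg.0.log.old`.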

And, when we do a "hard resume" by issuing Alt-SysRq-k, that sometimes works to get back to the login screen, and sometimes generates a kernel crash in the radeon driver -- as shown in the apr1.full.kern.log (all the /var/log/kern.log* entries from April 1st) and the accompanying kernel stacktrace photo in radeon_kernel_panic1.jpg.

Please let me know if there's any further data I can try to collect next week when I'm on-site again. Thanks very much,

Chris.

Chris Darroch (cdarroch) wrote :

One further historical detail: we first encountered this unable-to-unblank problem when these machines were running 14.04.3 and we made the mistake of accepting the Hardware Enablement (HWE) upgrade:

https://wiki.ubuntu.com/1404_HWE_EOL

That upgrade replaced the old fglrx driver with the radeon one.

Fortunately, at that time we were able to roll back Xorg to 1.16 and keep things sort-of working:

https://askubuntu.com/questions/815591/ubuntu-14-04-5-16-04-and-newer-on-amd-graphics
https://askubuntu.com/questions/676216/downgrade-xorg-server

When we performed a clean, full re-install with 16.04 we were hoping this problem would be resolved, but as I've documented here it continues to affect the systems.

Chris Darroch (cdarroch) wrote :

I think I've filled in all the necessary info -- and I have another screenshot to upload too, once I get it off my phone, from another kernel panic in the same context.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Chris Darroch (cdarroch) wrote :

Hit another kernel panic this weekend. Again the screen went blank and could not be unblanked by any normal method; I then tried Alt-SysRq-k and got this panic (with a different stack trace from the first one, but also going through the radeon driver and radeon_crtc_handle_flip).

I'll try to get the kern.log file matching this panic next weekend.

Kai-Heng Feng (kaihengfeng) wrote :

Can you try the latest mainline kernel in [0]?

You need these two binaries:
linux-image-unsigned*generic*.deb
linux-modules*generic*.deb

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17-rc2/
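A dry-run sketch for picking those two packages out of a download directory and showing the install command. It uses the two filename patterns given above, refuses to guess if a pattern matches more than one file (for example if builds for two architectures were downloaded), and prints the `dpkg -i` command rather than running it. The helper names `pick_one` and `mainline_install_cmd` are hypothetical.

```shell
#!/bin/sh
# Sketch: print (do not run) the dpkg command for a mainline kernel test.
# Print the single file in directory $1 matching glob pattern $2; fail on 0 or 2+ matches.
pick_one() {
    n=0
    match=
    # "$1" is quoted (literal); $2 is unquoted so its glob characters expand.
    for f in "$1"/$2; do
        [ -e "$f" ] || continue
        n=$((n + 1))
        match=$f
    done
    [ "$n" -eq 1 ] || return 1
    printf '%s\n' "$match"
}

# Usage: mainline_install_cmd DIR
mainline_install_cmd() {
    image=$(pick_one "$1" 'linux-image-unsigned*generic*.deb') || return 1
    modules=$(pick_one "$1" 'linux-modules*generic*.deb') || return 1
    printf 'sudo dpkg -i %s %s\n' "$image" "$modules"
}
```

Running `mainline_install_cmd ~/Downloads` and reviewing the printed command before executing it keeps the lowlatency or other-architecture packages from being installed by accident.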

Chris Darroch (cdarroch) wrote :

Yes, I can test those this weekend ... if you have any specific advice on updating the kernel, let me know, otherwise I'll use the notes here:

https://wiki.ubuntu.com/Kernel/MainlineBuilds

Thanks very much!
