Comment 6 for bug 708998

Bryce Harrington (bryce) wrote :

Hi Rick,

We've had a bunch of bug reports about GPU freezes lately, which I've been focusing on for most of this past week. This particular report doesn't have the instrumentation data needed to diagnose it, but I figured I could give you a rundown of all the freezes I know about at this point.

By and large, all of the freezes are actually kernel bugs in the drm code, so they need the kernel team to fix them. We've been assisting on the X side with triaging and, where possible, identifying kernel patches to make the kernel team's life easier. Ultimately, though, it's going to be a matter of waiting for newer, more stable kernel releases. The GPU hangs may magically go away (or be replaced by exciting new ones) as the kernel gets new RCs.

Anyway, here's some of the common "classes" of freezes I know about:

1. vesafb conflict causing lockup during boot. Bug #702090. Basically, the kernel has a generic video driver loaded as a fallback for Plymouth and boot prettiness. When it comes time to load the Intel kernel driver, the Intel driver locks up. Interestingly, most of the time the kernel is able to reset the GPU and keep on truckin', but it's enough to trigger apport. So it's sort of a false positive in those cases; however, apw and I suspect this could also be triggering other issues, so it needs to be sorted out. You can tell it's this kind of bug by a line in dmesg like "ERROR* EIR stuck: 0x00000010, masking".

2. Freeze with black screen switching from Plymouth to X during boot. Bug #712173. Removing 'quiet/splash' makes things work right. So this is probably Plymouth messing up the GPU and leaving it in a busted state, so that when it comes time for X to load, it can't and faults. It doesn't seem to be terribly widespread but has come up in ISO testing. Really makes me wish we didn't have to use Plymouth. ;-)

3. ESR 0x00000001 random freeze during usage. I don't know what leads to this freeze; it seems to be of the "random lockup" variety. The distinguishing characteristic is that the GPU dump shows 0x00000001 for the ESR parameter. I don't know if this means they're all dupes of the same root cause or just very similar kinds of failures.

4. There are also a few one-off freezes that appear to be unique to specific individuals and their hardware, and perhaps to a given kernel version. At least, the dmesg errors and GPU dumps don't match up to anyone else's bug reports. A lot of times these mysteriously go away after an update or two.
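
For anyone sorting through their own logs, the checks above could be sketched roughly like this in Python. This is only an illustration: the function name and the regular expressions are my own assumptions, in particular the guess that the ESR value shows up in the dump as "ESR" followed by the hex value. Class 2 can't be recognized from logs this way, since it's identified by whether removing 'quiet/splash' helps.

```python
import re

# Signature quoted in class 1 above; the exact regex is an assumption.
EIR_STUCK = re.compile(r"EIR stuck: 0x[0-9a-fA-F]{8}")
# Assumed layout of the ESR line in a GPU dump, e.g. "ESR: 0x00000001".
ESR_ONE = re.compile(r"\bESR\b\W*0x00000001\b")


def classify_freeze(dmesg_text, gpu_dump_text=""):
    """Return a rough guess at which class of freeze the logs point to."""
    if EIR_STUCK.search(dmesg_text):
        return "class 1: vesafb conflict during boot (bug #702090)"
    if ESR_ONE.search(gpu_dump_text):
        return "class 3: ESR 0x00000001 random freeze"
    # Classes 2 and 4 have no log signature described in this comment.
    return "unclassified: compare dmesg and GPU dump by hand"
```

Something like classify_freeze(open("dmesg.txt").read(), open("gpu_dump.txt").read()) would give a first guess, but a match only means the logs resemble that class, not that the root cause is the same.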

Generally, we have fair luck pinpointing things when apport catches the hang and we get GPU dumps. Sometimes just seeing the dmesg from when it is hung is enough. Unfortunately, in the case of this bug report the dmesg seems to be from when the system was working OK, so it doesn't indicate what the freeze was.

Are you still experiencing the freeze? If you are, and apport isn't collecting the bug, then what we need collected is the output of 'intel_gpu_dump' and 'dmesg', and the /sys/kernel/debug/dri/0/i915_error_state file if it exists. dmesg usually includes a drm error message in this case, and the GPU dump has error codes that point to how the driver was locked up.
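
If it helps, here is a minimal Python sketch of gathering those three things into one directory while the system is still in the hung state (over ssh, say). The function name, the output file names, and the choice to silently skip anything missing are my own assumptions; it will most likely need to run as root, since debugfs is normally not readable by ordinary users.

```python
import shutil
import subprocess
from pathlib import Path

# Path and commands as named in the comment above.
ERROR_STATE = Path("/sys/kernel/debug/dri/0/i915_error_state")
COMMANDS = {
    "dmesg.txt": ["dmesg"],
    "intel_gpu_dump.txt": ["intel_gpu_dump"],
}


def collect_hang_data(out_dir, commands=COMMANDS, error_state=ERROR_STATE):
    """Save command output and the error state file; return saved file names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for filename, argv in commands.items():
        if shutil.which(argv[0]) is None:
            continue  # tool not installed; skip rather than fail
        result = subprocess.run(argv, capture_output=True, text=True,
                                errors="replace")
        (out / filename).write_text(result.stdout + result.stderr)
        saved.append(filename)
    try:
        state = Path(error_state).read_text(errors="replace")
    except OSError:
        state = None  # file absent or not readable (e.g. not root)
    if state is not None:
        (out / "i915_error_state.txt").write_text(state)
        saved.append("i915_error_state.txt")
    return saved
```

Running collect_hang_data("gpu-hang-logs") and attaching whatever lands in that directory should cover what's asked for above.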