nouveau hard lockup in nouveau_gem_ioctl_cpu_fini

Bug #539655 reported by Stefan Bader
This bug affects 2 people
Affects                   Status     Importance  Assigned to  Milestone
Release Notes for Ubuntu  Invalid    High        Unassigned
linux (Ubuntu)            Invalid    High        Manoj Iyer
Lucid                     Won't Fix  High        Unassigned

Bug Description

Binary package hint: plymouth

Single-CPU, 64-bit system.

After upgrading today, I see the plymouth boot screen with the little dots moving, then a flicker, after which the screen stays with all 4 dots filled and no more activity. The keyboard no longer responds. But ssh login works and shows a plymouthd process (in state D), the call to stop it, and an X process running.
I was unable to kill the plymouthd process, but after killing the X process, both were gone and gdm started up with the login screen (the system is normally configured to log in automatically).

01:00.0 VGA compatible controller [0300]: nVidia Corporation NV41.1 [GeForce 6800] [10de:00c1] (rev a2)
 Subsystem: LeadTek Research Inc. Device [107d:2a0c]

Stefan Bader (smb) wrote :
Scott James Remnant (Canonical) (canonical-scott) wrote :

The apport bug is sadly unrelated

Changed in plymouth (Ubuntu):
status: New → Incomplete
tags: removed: apport-bug shutdown-hang
description: updated
Changed in plymouth (Ubuntu):
importance: Undecided → Medium
Scott James Remnant (Canonical) (canonical-scott) wrote :

Jumping to the four dots means "plymouth deactivate" has been called

Scott James Remnant (Canonical) (canonical-scott) wrote :

<Keybuk> could you attach this "ps" output to the bug
 also I don't suppose you could gdb attach to plymouthd and get a backtrace?
<smb> Keybuk, Hm, not using gdb too often... How would I break to a gdb command line after attaching to the process
<Keybuk> does it not break?
<Keybuk> strace instead of gdb?
<smb> No it just tells me attached to process x and stays there
<smb> I don't think it will show much as it seems to be stuck

<smb> Keybuk, strace seems to lockup too, but maybe /proc/x/wchan = nouveau_gem_ioctl_cpu_fini helps?

summary: - Boot screen locks up while X starts
+ nouveau hard lockup in nouveau_gem_ioctl_cpu_fini
affects: plymouth (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Scott James Remnant (Canonical) (canonical-scott) wrote :

I think that it's X that's hanging here btw; since plymouth is deactivated (all four dots showing) X has the ball with the drm

Stefan Bader (smb) wrote :

This is the ps -ax output while the hang is occurring. Neither strace nor gdb shows anything, but the wchan entry in /proc says "nouveau_gem_ioctl_cpu_fini".
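
For anyone wanting to collect the same information in one go, here is a minimal sketch that lists every task in state D together with its wchan. It only reads the standard /proc/<pid>/stat and /proc/<pid>/wchan files; the helper names and the proc_root parameter are made up for illustration and are not part of any tool mentioned in this bug.

```python
from pathlib import Path


def read_wchan(pid, proc_root="/proc"):
    """Return the kernel symbol a task is sleeping in, or None if unavailable."""
    try:
        text = Path(proc_root, str(pid), "wchan").read_text().strip()
    except OSError:
        return None
    # A value of "0" (or an empty file) means the task is not blocked.
    return text if text and text != "0" else None


def uninterruptible_tasks(proc_root="/proc"):
    """Yield (pid, comm, wchan) for every task currently in state D."""
    for entry in sorted(Path(proc_root).iterdir()):
        if not entry.name.isdigit():
            continue
        try:
            stat = (entry / "stat").read_text()
        except OSError:
            continue  # the task may have exited while we were scanning
        # comm sits in parentheses and may contain spaces;
        # the state letter is the first field after the last ')'.
        comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        state = stat[stat.rindex(")") + 2 :].split()[0]
        if state == "D":
            yield int(entry.name), comm, read_wchan(entry.name, proc_root)


if __name__ == "__main__":
    for pid, comm, wchan in uninterruptible_tasks():
        print(pid, comm, wchan)
```

Run over ssh while the screen is stuck, this should show the same symbol that was read by hand from /proc/x/wchan above.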

tags: added: kernel-series-unknown
Stefan Bader (smb) wrote :

I think this needs one of the nouveau developers to take a look. The issue happens to me on every boot with plymouth enabled. With plymouth disabled, there is a chance that either the vt switch does not work or X comes up and is terminated on the first press of <enter>.
The attached log was taken with plymouth in use, and I added the contents of wchan for the X process.

tags: added: lucid
removed: kernel-series-unknown
Stefan Bader (smb) wrote :

To clarify: for a while now the lockup has moved to ttm_bo_wait_cpu (as seen in the last Xorg.wedge.log). I hacked around a bit with function tracing and got the following trace (currently limited to the two functions):

# tracer: function
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
       plymouthd-316 [000] 16.812883: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
       plymouthd-316 [000] 16.812938: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860119: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860316: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860319: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860320: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860322: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860324: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860326: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860333: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860335: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860337: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860338: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860340: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860342: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.860384: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
           <...>-1069 [000] 17.880233: ttm_bo_wait_cpu <-validate_init
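
A trace in this format is easy to summarise mechanically. The following is a rough sketch that parses function-tracer lines like the ones above and counts calls per PID and function; the regular expression is written against the lines shown here and is an assumption about the format, and the helper names are invented for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   plymouthd-316 [000] 16.812883: nouveau_gem_ioctl_cpu_fini <-drm_ioctl
TRACE_LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<func>\S+)\s+<-(?P<caller>\S+)\s*$"
)


def parse_trace(text):
    """Return a list of (pid, timestamp, function, caller) tuples."""
    events = []
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip the tracer header
        m = TRACE_LINE.match(line)
        if m:
            events.append((int(m["pid"]), float(m["ts"]), m["func"], m["caller"]))
    return events


def calls_per_pid(events):
    """Count how often each (pid, function) pair appears."""
    return Counter((pid, func) for pid, _ts, func, _caller in events)


def last_event(events):
    """Return the final traced call, i.e. the last thing seen before the hang."""
    return events[-1] if events else None
```

Feeding it the trace above would report the ttm_bo_wait_cpu call from validate_init in PID 1069 as the last event, which matches where the task appears to get stuck.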

Robert Hooker (sarvatt) wrote :

Can you boot with drm.debug=0x05 and attach the dmesg?
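
For reference, drm.debug is a bitmask. The sketch below decodes a value under the assumption that bit 0x01 selects core DRM messages, 0x02 driver messages and 0x04 KMS messages, as defined in the kernel's DRM headers of that period; other bits are reported as unknown rather than guessed. The function name is made up for illustration.

```python
# Assumed bit meanings (DRM_UT_CORE, DRM_UT_DRIVER, DRM_UT_KMS in the kernel headers).
DRM_DEBUG_BITS = {0x01: "core", 0x02: "driver", 0x04: "kms"}


def decode_drm_debug(mask):
    """Return (list of enabled category names, leftover unknown bits)."""
    names = [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]
    unknown = mask & ~sum(DRM_DEBUG_BITS)
    return names, unknown


if __name__ == "__main__":
    print(decode_drm_debug(0x05))
```

Under those assumptions, 0x05 would enable core and KMS messages but not the driver category.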

Robert Hooker (sarvatt)
description: updated
Stefan Bader (smb) wrote :
description: updated
Stefan Bader (smb) wrote :

I added hooks to catch ttm_bo_synccpu_write_grab and ttm_bo_synccpu_write_release as well (including the path through which they are called). There seems to be some imbalance there: it looks as though both plymouth and X go through cpu_prep more often than they go through cpu_fini...
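
One way to check for such an imbalance is to tally grabs against releases per process. This is only a sketch: it assumes every ttm_bo_synccpu_write_grab is expected to be paired with a later ttm_bo_synccpu_write_release, it counts per PID rather than per buffer object (the trace format above does not carry the object), and the event list in the usage example is invented, not taken from the attached logs.

```python
from collections import defaultdict

GRAB = "ttm_bo_synccpu_write_grab"
RELEASE = "ttm_bo_synccpu_write_release"


def outstanding_grabs(events):
    """Given (pid, function) pairs, return {pid: grabs minus releases} where non-zero."""
    balance = defaultdict(int)
    for pid, func in events:
        if func == GRAB:
            balance[pid] += 1
        elif func == RELEASE:
            balance[pid] -= 1
    return {pid: n for pid, n in balance.items() if n != 0}


if __name__ == "__main__":
    # Hypothetical events: PID 316 grabs twice but releases once.
    demo = [(316, GRAB), (316, RELEASE), (316, GRAB), (1069, GRAB), (1069, RELEASE)]
    print(outstanding_grabs(demo))
```

A positive count for a PID means it took more grabs than it released, which is the pattern described in the comment above.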

Stefan Bader (smb) wrote :

So let's add the results of the latest experiment. Scott suggested making the plymouth drm renderer unavailable (by moving the library). This came up with a splash screen whose mode seemed to have a higher resolution than the area actually drawn to, but it came up without problems. The difference in resolutions might actually be a hint: the card always seems to report a connected TV output with a max resolution of 1024x768, in addition to the DVI port with the attached monitor, which has 1280x1024.

Scott James Remnant (Canonical) (canonical-scott) wrote :

Or, to put it simply, your card pretends to have more than one output at all times ;-)

This proves, to me, that this bug is the *cause* of bug #533135

Steve Langasek (vorlon)
Changed in linux (Ubuntu Lucid):
importance: Medium → High
Steve Langasek (vorlon)
Changed in ubuntu-release-notes:
status: New → Triaged
importance: Undecided → High
Steve Langasek (vorlon) wrote :

Worked around for plymouth in release by falling back from drm to framebuffer on nouveau; no need to release note, we just need the actual bug fixed. :)

Changed in ubuntu-release-notes:
status: Triaged → Invalid
tags: added: kernel-graphics kernel-reviewed
Manoj Iyer (manjo) wrote :

Stefan, can you please check whether you are able to reproduce this bug with the latest kernel upgrade?

Changed in linux (Ubuntu):
assignee: nobody → Manoj Iyer (manjo)
status: Triaged → Incomplete
Stefan Bader (smb) wrote :

The fatal effect of locking up is verified to be gone in current Lucid and Maverick (daily live CD 2010/07/12). If I understand the earlier comments correctly, this is achieved by automatically not using DRM rendering when more than one active output is present. The visual result on this particular hardware is that the boot logo is rendered in the wrong size, but no hard lockups occur.

So I would say the hard lockup is solved and I would close this bug, but there is still a much less critical issue of fixing the case of two outputs on boot (likely when they do not have the same resolution, as they would in mirror mode). But that should be a different bug.

Changed in linux (Ubuntu Lucid):
status: Triaged → Fix Released
Stefan Bader (smb) wrote :

Changed my mind. For the linux package it is probably "won't fix" at the moment rather than "fix released". There was no change in the kernel to address the issue. Rather, the change was in plymouth, to not use the DRM renderer.

Changed in linux (Ubuntu Lucid):
status: Fix Released → Won't Fix
Steve Langasek (vorlon) wrote :

For the lucid kernel package that's ok, but the changes to plymouth to avoid the DRM driver are a gross hack that needs to be reverted in the long term; so it would be great if we could get to the bottom of the real bug.

Colin Watson (cjwatson) wrote :

I've changed plymouth 0.8.2-2ubuntu18 in natty so that, instead of disabling drm on nouveau at compile-time, it now accepts a 'plymouth:force-drm' parameter on the kernel command line: if that's omitted, then the behaviour is as in lucid and maverick (i.e. fall back to the plymouth frame-buffer renderer), but if it's present then plymouth will use the drm renderer again.

I'd greatly appreciate it if people who were affected by this bug back in lucid could try with the new plymouth and this kernel parameter, and report whether it still breaks. If it turns out that it's been fixed in the kernel in the meantime, then we can switch back to the drm renderer.
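
When reporting results, it helps to confirm that the parameter actually reached the kernel command line. A minimal sketch follows; it treats the parameter as a whitespace-separated token in /proc/cmdline, which is an assumption for illustration and not necessarily how plymouth itself parses it. The function names are hypothetical.

```python
def has_boot_flag(cmdline, flag="plymouth:force-drm"):
    """Return True if flag appears as its own token on the kernel command line."""
    return flag in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    print(has_boot_flag(read_cmdline()))
```

Note that the token uses a colon; a misspelling with a different separator would not match in this check.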

Stefan Bader (smb) wrote :

I did a bit of testing today with the installation fully updated. Using "plymouth:force-drm" at least no longer locks up the system. However, the output does not look good: while I get an (incorrectly scaled) logo with the fb renderer, the drm renderer produces only corrupted output (black/white vertical stripes of various sizes).
It might be that my particular graphics card is especially broken, as it always claims to have a TV connected. But in the end this should just amount to testing that mode (two screens attached on boot).

dino99 (9d9) wrote :

This version has expired

Changed in linux (Ubuntu):
status: Incomplete → Invalid