Weston on Nouveau displays black screen due to page flip issues

Bug #1330785 reported by Nerdopolis
This bug affects 1 person
Affects: Linux | Status: Fix Released | Importance: Medium
Affects: linux (Ubuntu) | Status: Fix Released | Importance: Medium | Assigned to: Unassigned

Bug Description

Hi.

On Ubuntu 13.04, Weston displays a black screen that stays black, apparently due to page flipping issues in the Nouveau driver, as reported here: https://bugs.freedesktop.org/show_bug.cgi?id=75761

These two patches apply to 3.13 and fix the pageflip issues; they worked on my NV36 card (I had to slightly alter one file myself):

http://cgit.freedesktop.org/~airlied...751f25ebe60f28
http://cgit.freedesktop.org/~airlied...5b438629d6858a

I haven't been able to test on newer Nvidia cards because I don't have any, but I might hear back from others who do have newer cards.

Is it possible for these to be officially backported to Ubuntu 13.04?

In , Soundx94 (soundx94) wrote :

Created attachment 95103
The log of weston after quitting.

Hi.

I'm trying to start weston with weston-launch on Arch Linux with the latest packages from the official repositories.
I've got an Nvidia card with Nouveau, nouveau-fw from the Arch User Repository, the latest mesa packages, and KMS enabled.
weston-launch seems to start correctly, but the weston interface doesn't show up; the output is only a black screen. I can quit weston using Ctrl+Alt+Backspace, but nothing else.

When I start just weston from an X session, everything works fine, but weston-launch is failing somehow.

Weston log attached.

In , Ppaalanen (ppaalanen) wrote :

Could you check if weston-simple-shm gets anything to screen?

You can try it first when you run weston under X to see what it is supposed to show, and then when using weston-launch. VT switching should work, but if it does not, you might need to use ssh for remote access or use a delay to start weston-simple-shm in another VT.

If weston-simple-shm seems to run, but still nothing on screen, then we should start suspecting an issue with Nouveau.

Did you check the kernel log for Nouveau errors?

You are not trying to use weston-launch from an X terminal, right?

In , Soundx94 (soundx94) wrote :

So I tried running weston-simple-shm under weston in an X session and it worked fine. (I'm guessing this multi-colored circle stuff is what is expected to show up.)
After quitting the X session, I ran weston-launch in the first VT and weston-simple-shm in a second one; the first VT was still black, and there is nothing nasty in the log.
Of course I'm NOT trying to run weston-launch from X.

In , Ppaalanen (ppaalanen) wrote :

Ok, it sounds like it would have something to do with Nouveau then. I vaguely recall a few people having such problems with Nouveau in the past. Maybe look into Nouveau's bugzilla, if there are any reports about Weston?

In , Soundx94 (soundx94) wrote :

Okay, thanks for your help :)
If anything turns up or a solution is found, I'll post it here.

In , Lubosz Sarnecki (lubosz) wrote :

I can confirm this bug. Also Arch Linux.

I tried downgrading mesa to 10.0.3, but it did not help, as described in this post http://www.maui-project.org/news/2014/04/02/how-to-fix-mesa-on-arch-linux/

Under X I get the following nouveau error (not in a tty):

libEGL debug: driver does not expose __driDriverGetExtensions_nouveau(): /usr/lib/xorg/modules/dri/nouveau_dri.so: undefined symbol: __driDriverGetExtensions_nouveau

In , Zachcook1991 (zachcook1991) wrote :

Also can confirm, running Arch with nouveau, on linux 3.14.

I just installed linux-lts (currently at version 3.10.x) and it runs weston fine on a tty, so this is clearly a nouveau kernel issue from a fairly recent change.

In , Ppaalanen (ppaalanen) wrote :

Alright, thanks Zach. Anything relevant in kernel logs? I assume linux-lts refers just to the kernel?

Would anyone be up for bisecting the kernel? You can probably limit the bisection to the drivers/gpu/drm/nouveau directory.

We could reassign this bug to Nouveau, but I think they would just ask for a bisection as the first thing, since this is apparently a regression.

Lubosz, the libEGL message is harmless.

In , Lubosz Sarnecki (lubosz) wrote :

Kernel 3.10 also works for me. Tested on NVC8 and NV50. I will bisect the kernel.

In , Kristian Hoegsberg (krh-bitplanet) wrote :

Thanks for investigating. Closing as NOTOURBUG for now; please reopen if it turns out to be a Weston issue after all.

In , Ppaalanen (ppaalanen) wrote :

Trying to move this to Nouveau, BZ is being difficult...

In , Ppaalanen (ppaalanen) wrote :

Reopening as Nouveau bug...

In , Ppaalanen (ppaalanen) wrote :

Setting status to NEW, since this is a new Nouveau bug report, moved from Weston.
Said to be a kernel regression after 3.10.

In , Lubosz Sarnecki (lubosz) wrote :

My bisection process so far is pretty frustrating,
since the majority of the commits between 3.12 and 3.13 do not boot.
I get corrupted output before being able to log in.

The screen looks like this:

http://i.imgur.com/C517qpP.jpg

My current progress:

# v3.12 good
# 0fef9d8a59abcd699761cb054b6c37a2bea9e31a skip / won't boot
# 48ae0b355f21533145133002854de89a0537408d skip / won't boot
# c17f5bb529221c7f4c0736e69ceb614da0df2838 bad
# 8df1d0c07f18bd84ea7d8c7bc2cff45ba2b09680 skip / won't boot
# 16c4f227ffc556a4851518092e2b5979da1280c1 skip / won't boot
# 5fa7543041cbc2d3139e8d2178df61a33ac3f9ac skip / won't boot
# 6d8d163132d7df6ca701efcde7832046ecb2f040 skip / won't boot
# 98706ea99f2da8afdac69686de4ff982aca6a5c7 skip / won't boot
# v3.13 bad
# v3.14 bad

In , Lubosz Sarnecki (lubosz) wrote :

I am only bisecting the nouveau directory, like so:

git bisect start v3.14 v3.10 -- drivers/gpu/drm/nouveau

In , Ilia Mirkin (imirkin) wrote :

(In reply to comment #13)
> My bisection process so far is pretty frustrating,
> since the majority of the commits between 3.12 and 3.13 do not boot.
> I get corrupted output before being able to login.

What HW are you using?

In , Lubosz Sarnecki (lubosz) wrote :

Chipset: GF110 (NVC8)

In , Ilia Mirkin (imirkin) wrote :

(In reply to comment #16)
> Chipset: GF110 (NVC8)

There are known issues with NVC8 that go away with blob pgraph fw, perhaps worth trying with that. (See bug #54437.)

In , Lubosz Sarnecki (lubosz) wrote :

I was able to boot all commits with the following boot option:
nouveau.config=NvMSI=0

Thanks to xexaxo on the IRC for pointing out this option.

f074d733866628973eca0ddb0c534ef4561da9e0 is the first bad commit
commit f074d733866628973eca0ddb0c534ef4561da9e0
Author: Maarten Lankhorst <email address hidden>
Date: Wed Nov 20 15:14:31 2013 +1000

    drm/nouveau/kms: send timestamp data for correct head in flip completion events

    Signed-off-by: Ben Skeggs <email address hidden>

:040000 040000 99b978c435e8bb3e7bd91e30daf953940dcfe99c dcff6c807e51e53d4de0064678c46bfd68e0a1ca M drivers

Canonical broke Weston for me ;)

In , Lubosz Sarnecki (lubosz) wrote :

Reverting the patch on current Linux master (4a4389abdd9822fdf3cc2ac6ed87eb811fd43acc) fixes the issue for me.

Apparently s->crtc needs to be hardcoded to -1 for the screen not to stay black when launching weston.

In , Soundx94 (soundx94) wrote :

Thanks for bisecting the kernel. Reverting that patch on the latest kernel also fixes the issue for me.
(Using NVE6: GTX 660)

In , Ppaalanen (ppaalanen) wrote :

(In reply to comment #20)
> Thanks for bisecting the kernel. Reverting that patch on the latest kernel
> also fixes the issue for me.
> (Using NVE6: GTX 660)

Hey, wait! Why did you close this bug?

Has upstream merged a fix, or did this problem never occur on upstream (hard to imagine)?

It's not enough that you personally find a hack that fixes it for you alone; we want it fixed for everybody.

In , Soundx94 (soundx94) wrote :

I thought WORKSFORME was the proper status because it works for me at the moment with this modification and it can get merged upstream. Sorry if I misunderstood the status label; do you want me to reopen it?

In , Ppaalanen (ppaalanen) wrote :

(In reply to comment #22)
> I thought WORKSFORME is the proper status because it works for me at the
> moment with this modification and it can get merged into upstream. Sorry if
> i misunderstood the status label, do you want me to reopen it ?

WORKSFORME means that maintainers or developers cannot reproduce the bug at all and cannot find anything wrong.

I reopened this for you.

In , Lubosz Sarnecki (lubosz) wrote :

Since Weston and Wayland 1.5.0 were released today, I wanted to ask about the status of this bug.

Does the patch which causes the regression have enough legitimacy to stay and break Weston?

Does someone understand the problem on a large scale?

In , Ilia Mirkin (imirkin) wrote :

(In reply to comment #24)
> Since Weston and Wayland 1.5.0 were released today, I wanted to ask about
> the status of this bug.
>
> Does the patch which causes the regression have enough legitimacy to stay
> and break Weston?

It seems like a pretty correct fix, at first blush. And was necessary for something else, IIRC.

>
> Does someone understand the problem on a large scale?

It'll take someone with KMS API and Wayland experience to work out what's going on. Probably would have to get a Wayland developer involved to figure this one out.

In , Ppaalanen (ppaalanen) wrote :

What did the -1 as crtc mean?

I'm not sure how this is weston-specific, though. Weston programs a pageflip on a certain crtc, and then relies on that particular crtc to send back a pageflip event as requested.

So if that's the culprit, is Nouveau on some particular hardware sending back pageflip events on the wrong crtc? Or maybe mixing up crtcs so that no event gets sent?

This would be easy to test by adding a weston_log("got pageflip event\n") to src/compositor-drm.c:745, in function page_flip_handler(), and another at line 723 in vblank_handler(). The vblank_handler should be unused at the moment, I think.

If Weston never gets the pageflip event, its repaint loop would be stuck, which would explain the symptoms. A side effect would be that clients would never get frame callbacks, which means that e.g. weston-simple-shm would not be consuming any CPU. Normally it should take at least a few percent of one core. Otherwise everything would seem to just work, since Weston does not hang but only stops repainting.
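
To illustrate why a missing pageflip event would stall things without hanging the compositor, here is a minimal sketch in C. It is a toy model, not Weston code: the struct and every function name in it are hypothetical, and "queueing a flip" is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified model of one compositor output.
 * None of these names come from Weston itself. */
struct model_output {
    bool flip_pending;   /* a flip was queued, completion event not yet seen */
    int frames_painted;  /* repaints actually performed */
    int frame_callbacks; /* frame callbacks delivered to clients */
};

/* Repaint once, unless a previous flip is still outstanding. */
static void model_repaint(struct model_output *out)
{
    if (out->flip_pending)
        return;               /* stays here forever if no event arrives */
    out->frames_painted++;
    out->flip_pending = true; /* stands in for queueing the page flip */
}

/* Called when the kernel reports the flip as completed. */
static void model_page_flip_event(struct model_output *out)
{
    assert(out->flip_pending);
    out->flip_pending = false;
    out->frame_callbacks++;   /* clients may now draw their next frame */
}

/* Run `ticks` loop iterations, with or without event delivery. */
static struct model_output model_run(int ticks, bool events_delivered)
{
    struct model_output out = { false, 0, 0 };
    for (int i = 0; i < ticks; i++) {
        model_repaint(&out);
        if (events_delivered && out.flip_pending)
            model_page_flip_event(&out);
    }
    return out;
}
```

In this model, model_run(5, true) paints five frames and delivers five frame callbacks, while model_run(5, false) paints once and then waits indefinitely with no callbacks, which matches the idle-client symptom described above.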

In , Nerdopolis1 (nerdopolis1) wrote :

I pulled out my old Dell desktop with a Pentium 4 processor and an AGP Nvidia card (NV36), and it does the same thing. Mark45 on Phoronix said this issue affected all 3 of his Nvidia cards when he tried to use my Wayland Live CD.

I added the "got pageflip event" lines as suggested, and it seems that they appear multiple times in the log, and seem to keep going until I tty switch, but the screen is still black. Using a vanilla Ubuntu kernel, and passing nouveau.config=NvMSI=0 did not work for me, or for Mark45 on any of his Nvidia cards.

Weston-desktop-shell starts fine, and I also tried weston-simple-shm, which did not appear, but it did run.

It can't be a mesa issue, as Weston is using pixman here in this case...

I wish I had tested this sooner, but I wasn't sure if my results on the older desktop would reflect on newer Nvidia hardware.

In , Ppaalanen (ppaalanen) wrote :

I wonder if there are two different bugs in play here.

If you do get the vblank events right, as it sounds like Nerdopolis does, then does reverting the blamed nouveau kernel patch make any difference? I would be surprised if it did.

This other bug would be that when Weston changes the video mode, it does not call drmModeCrtcSetGamma. IIRC Michel Dänzer told me that not calling it may lead to black output if the color translation tables are all zero due to not being explicitly set. So the main question here is: does Weston change the video mode, or does it use the mode that is already set for fbcon?

The Weston code base already has a call to drmModeCrtcSetGamma(), but it is used only for color-managed outputs, I think. Yet it is an example of how to call it, so if someone having this problem could test that, it would be very useful. If missing this call is indeed the problem, I think it should be a new Weston bug report, as the Nouveau patch blamed here should not make a difference AFAIU.
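
For anyone wanting to test this, a linear (identity) table is the obvious thing to program. The following is a minimal sketch of building one; fill_linear_gamma_ramp is a hypothetical helper name, not something from Weston or libdrm, and the sketch stops short of touching the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill `ramp` with an identity (linear) lookup table of `size` entries
 * spanning the full 16-bit range. Hypothetical helper, not Weston code. */
static void fill_linear_gamma_ramp(uint16_t *ramp, size_t size)
{
    assert(ramp != NULL && size >= 2);
    for (size_t i = 0; i < size; i++)
        ramp[i] = (uint16_t)((i * 0xffffu) / (size - 1));
}
```

The same table could then be passed as the red, green and blue arrays to libdrm's drmModeCrtcSetGamma(fd, crtc_id, size, red, green, blue), with the size presumably taken from the CRTC's gamma_size. Whether that actually cures the black screen here is exactly what would need testing.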

One more note: if you test reverting the blamed Nouveau patch, make sure *both* of your kernels, with and without the patch, are built by you and launched the same way, so that the fbcon video mode is really the same on both. If the fbcon video mode is different, then that could mask the drmModeCrtcSetGamma issue.

In , Ppaalanen (ppaalanen) wrote :

Hmm, a third remote possibility came to mind. If the timestamp delivered with the DRM pageflip event does not advance, I suppose Weston's fade-in animation might not advance either and therefore it never fades in. However this is hard for me to believe, as I think it would have caused problems with Xorg, too... wouldn't it?

Or maybe Xorg is only looking at the vblank counters, and you'd actually need a program using GLX_OML_sync_control to even see the timestamps?

One could check the timestamps by:
weston_log("%s: time %u.%06u frame %u\n", __func__, sec, usec, frame);
in compositor-drm.c page_flip_handler() function.

Mm, I'd like to see a weston log with that when you have the black screen problem.
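
To make the fade-in theory concrete, here is a small sketch of an animation whose progress depends only on the pageflip timestamp. The function names and the millisecond conversion are assumptions for illustration; Weston's real animation code is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Convert the sec/usec pair of a page flip event to milliseconds. */
static uint32_t flip_time_ms(unsigned int sec, unsigned int usec)
{
    return (uint32_t)sec * 1000u + (uint32_t)usec / 1000u;
}

/* Hypothetical fade-in: alpha goes from 0.0 (black) to 1.0 (fully shown)
 * over duration_ms, driven only by the event timestamps. */
static double fade_alpha(uint32_t start_ms, uint32_t now_ms,
                         uint32_t duration_ms)
{
    assert(duration_ms > 0);
    uint32_t elapsed = now_ms - start_ms;
    if (elapsed >= duration_ms)
        return 1.0;
    return (double)elapsed / (double)duration_ms;
}
```

If every event carries the same timestamp, elapsed stays 0 and alpha never leaves 0.0, so the output would remain black even though flips keep completing.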

In , Nerdopolis1 (nerdopolis1) wrote :

The only interesting thing that I can find in the dmesg logs is this:

[ 1.509268] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 1.509269] [drm] No driver support for vblank timestamp query.

In , Nerdopolis1 (nerdopolis1) wrote :

Created attachment 100199
weston log with pixman

In , Ppaalanen (ppaalanen) wrote :

(In reply to comment #31)
> Created attachment 100199
> weston log with pixman

Something is seriously wrong here:
[09:58:29.402] queueing pageflip failed: Permission denied

But I see you have another weston session afterwards, which seems to run, but:
[12:26:36.371] page_flip_handler: time 0.000000 frame 603
[12:26:36.371] got pageflip event
[12:26:36.388] page_flip_handler: time 0.000000 frame 603
[12:26:36.388] got pageflip event
[12:26:36.404] page_flip_handler: time 0.000000 frame 603
[12:26:36.404] got pageflip event
[12:26:36.421] page_flip_handler: time 0.000000 frame 603
[12:26:36.421] got pageflip event
[12:26:36.437] page_flip_handler: time 0.000000 frame 603
[12:26:36.438] got pageflip event
[12:26:36.454] page_flip_handler: time 0.000000 frame 603

That means that the kernel DRM driver does not provide timestamps, and it does not even advance the vblank counter(?) if I understood right.

So at least for you, the problem is that the kernel DRM does not report proper pageflip timestamps. I would consider this a kernel driver bug, but if the kernel developers insist, I suppose we could also hack around it in Weston by providing fake timestamps, disabling the Presentation extension (somewhat similar to X11 Present, not yet merged) and yelling loudly. But I'd rather the driver got fixed.

As far as I see, bad timestamps are enough to make Weston stay black (fade-in animation not advancing), yet appear to work just fine. So at least we (Weston devs) should detect when the kernel driver is faulty, and abort Weston.
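
As a rough idea of what such a check might look like, here is a sketch that flags a run of pageflip events whose timestamps never move forward. The struct, the function name and the heuristic itself are assumptions for illustration only, not the detection actually proposed for Weston.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One page flip completion: time sec.usec plus the frame counter. */
struct flip_sample {
    unsigned int sec;
    unsigned int usec;
    unsigned int frame;
};

/* Return true if no sample is later in time than its predecessor.
 * Hypothetical heuristic; a real check would likely be more careful. */
static bool flip_timestamps_stalled(const struct flip_sample *s, size_t n)
{
    assert(s != NULL && n >= 2);
    for (size_t i = 1; i < n; i++) {
        bool later = s[i].sec > s[i - 1].sec ||
                     (s[i].sec == s[i - 1].sec && s[i].usec > s[i - 1].usec);
        if (later)
            return false;
    }
    return true;
}
```

Fed with repeated "time 0.000000 frame 603" samples, this returns true; a healthy sequence with advancing times returns false.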

Weston's design, and also the intention in Wayland, is to rely on proper pageflip timestamps from the kernel for timing... well, everything visual, so we really would like it to work.

In , Ppaalanen (ppaalanen) wrote :

I filed bug #79502 for adding the detection to Weston.

In , Nerdopolis1 (nerdopolis1) wrote :

Just an FYI, I think the

[09:58:29.402] queueing pageflip failed: Permission denied

came from when I killed the previous Weston session to have the waylandloginmanager reload Weston after I recompiled it with the diag lines on my live session. I was in another TTY.

That is the only time I have ever seen that happen so far.

In , Ilia Mirkin (imirkin) wrote :

A few pageflip-related commits landed in the repo:

http://cgit.freedesktop.org/~darktama/nouveau/

http://cgit.freedesktop.org/~darktama/nouveau/commit/?id=ff527544b0048d41a56097a66ff954fb9b0a2c75

Seems related to what was going on here. Give it a shot... (you'll need to build it against the latest drm-next tree from http://cgit.freedesktop.org/~airlied/linux/log/?h=drm-next). Or try to apply the patches to whatever kernel tree you're on -- should probably work.

In , Nerdopolis1 (nerdopolis1) wrote :

I'm trying to compile the module... it seems that it's gone from the log now???

How do I disable werror too?

In , Nerdopolis1 (nerdopolis1) wrote :

I found out how to turn off werror, but I guess I need to know what patches I need to apply... Is that the only commit that I need to try the fix?

In , Ilia Mirkin (imirkin) wrote :

All the patches are now in drm-next, just use that.

In , Nerdopolis1 (nerdopolis1) wrote :

I would at least like to try applying them against the 3.13 nouveau module first, instead of rebuilding the whole kernel. So is that the only patch I would need?

I'm still on 3.13 (which is also affected), as I have been doing all of my testing on the Weston side of things...

In , Ilia Mirkin (imirkin) wrote :

(In reply to comment #39)
> I would at least like to try to just try to apply them against the 3.13
> nouveau module, instead of trying to rebuild the whole kernel first? So is
> that the only patch I would need?
>
> I'm still on 3.13 (which is also affected), as I have been doing all of my
> testing on the Weston side of things...

Pretty sure that's the only patch. There were 2 other patches, but they only matter for pre-nv50 hardware, which I assume you're not using (GeForce 7 and earlier).

In , Nerdopolis1 (nerdopolis1) wrote :

I found the second patch too, and I applied that as well. It works now, even after I applied them to 3.13, and built just the Nouveau module.

I have an older Nvidia card, I think it's NV36.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1330785

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Nerdopolis (bluescreen-avenger) wrote :

Hi. I am unable to run the command, but this issue has been confirmed upstream https://bugs.freedesktop.org/show_bug.cgi?id=75761

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
cecilpierce (piercececil-deactivatedaccount) wrote :

Affects me too.

Changed in linux (Ubuntu):
importance: Undecided → Medium
Joseph Salisbury (jsalisbury) wrote :

13.04 is now EOL. I see the following commit has been cc'd to upstream stable:

mc/nv50-: fix kms pageflip events by reordering irq handling order.

The fix will make its way into trusty through the normal stable update process.

tags: added: kernel-da-key trusty
Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Nerdopolis (bluescreen-avenger) wrote :

Is http://cgit.freedesktop.org/~airlied/linux/patch/?id=af4870e406126b7ac0ae7c7ce5751f25ebe60f28 in there to be merged into the Trusty kernel?

Sorry, when I said 13.04, I made a mistake. I meant 14.04.

In , Ilia Mirkin (imirkin) wrote :

All those patches should have made it to 3.16 (and were cc'd to earlier stable kernels). Does 3.16 work OK for people?

In , Martin-peres (martin-peres) wrote :

Works for me on 3.16 and 3.17. Finally :)

Changed in linux:
status: Confirmed → Fix Released
Changed in linux (Ubuntu):
status: Confirmed → Fix Released