Libdrm compiled with gcc 4.8 makes card hang on resume from s2disk

Bug #1247607 reported by Ronald on 2013-11-03
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Nouveau Xorg driver
Fix Released
Medium
gcc
New
Undecided
Unassigned
libdrm (Ubuntu)
High
Maarten Lankhorst
Saucy
High
Maarten Lankhorst

Bug Description

[Impact]
 * Relocations on nv04-nv4f are completely broken, any kind of memory stress will result in userspace failure.

[Test Case]
 * Suspend on a system with an affected nvidia card.
 * Resume should complete without any errors in FIFO.
 * Other test is to run max-texture-size from piglit, system should survive.
 * For m-a same change: confirm libdrm-dev:amd64 and libdrm-dev:i386 coinstall.

[Regression Potential]
 * Original bug report was caused by undefined behavior in gcc:
   - *push->cur++ = pushbuf_krel(push)
   - pushbuf_krel uses (push->cur - push->bgn) to calculate offset.
   - It's undefined whether push->cur is incremented before calling pushbuf_krel or not.
   - New gcc changed the order, causing the bug.
   - Fixed by calling push->cur++ in a separate statement.
 * Bug reports will be watched for any new problems that may pop up.

[Original bug report]
My initial bugreport was here: https://bugs.freedesktop.org/show_bug.cgi?id=71116

- Recompiling libdrm with gcc 4.7 fixed the bug.

Upstream GCC mail thread here: http://gcc.gnu.org/ml/gcc-help/2013-07/msg00103.html

This was fixed thanks to Emil Velikov from the nouveau project.

Created attachment 88462
Attempt to resume

[ 0.197865] nouveau [ DEVICE][0000:01:00.0] Chipset: NV34 (NV34)
[ 0.197869] nouveau [ DEVICE][0000:01:00.0] Family : NV30

After *only upgrading userspace* from Ubuntu 13.04 to Ubuntu 13.10 the card fails to properly resume from s2disk.

Important upgraded packages:

xserver-xorg-core:i386 2:1.13.3-0ubuntu6.2 -> 2:1.14.3-3ubuntu2
xserver-xorg-video-nouveau:i386 1:1.0.7-0ubuntu1 -> 1:1.0.9-2ubuntu1
libdrm2:i386 2.4.43-0ubuntu1.1 -> 2.4.46-1
libpciaccess0:i386 0.13.1-2 -> 0.13.2-1

Created attachment 88463
Attempt to resume (nouveau.agpmode=4)

Suggested by imirkin. No dice.

Hi Ronald

AFAICS there are two routes you/we can take this
* Find out which package update caused the breakage
* Disable AGP with nouveau (similar to what you did with the blob)

If you're going for the first one, here is a list of packages that you'll need to check (other may not me related):
* xserver-xorg-core
* xserver-xorg-video-nouveau
* All *mesa*

The above can be rather messy and hard, as you'll need to track that the dependencies are met and/or rebuild packages.

The latter route is much easier
* Append nouveau.agpmode=0 to your kernel command line (i.e. grub/lilo)

Created attachment 88546
2_Attempt to resume (nouveau.agpmode=0)

Survives one cycle since I did this from SSH with `pm-hibernate`. Apparently Xorg only locksup on resume because of xscreensaver if I press the button in XFCE.

(In reply to comment #6)
> Created attachment 88546 [details]
> 2_Attempt to resume (nouveau.agpmode=0)
>
> Survives one cycle since I did this from SSH with `pm-hibernate`. Apparently
> Xorg only locksup on resume because of xscreensaver if I press the button in
> XFCE.
I had a strange feeling that mesa is involved :\

If you're up for it you can {re,}move nouveau_dri.so (the 3g/gl "driver") and confirm it that fixes your s2ram problems.
Note your CPU usage may sky-rocket (the system will fall back to software GL rendering).

(In reply to comment #9)
[...]
> I'm not totally against bisecting these source packages, but that will then
> be a longterm thing. Furthermore, is all of mesa required? (i.e. I'm not
> using a compositing desktop for that matter (just XFCE)).
>
Before bisecting anything I would suggest that you absolutely make sure that you know which package broke things - mesa, xserver-xorg-core, xserver-xorg-video-nouveau and/or other.

When going through mesa, make sure that you keep all the packages at the same version.

Here is a crash course of what the different mesa packages provide
Essential parts
* libgl1-mesa-glx -> libGL provider
* libgl1-mesa-dri -> provides the "hardware accelerated drivers"
* libglapi-mesa -> dispatch library

Extra (your system may require one or both depending on what software you have)
* libgbm1
* libegl1-mesa -> libEGL provider

Other
* libopenvg1-mesa -> libOpenVG, afaik noone uses this
* mesa-common-dev -> headers

Not part of mesa (but have mesa in the package name :P)
* libglu1-mesa -> minor suspect
* mesa-utils -> highly unlikely to be related

Have fun :)

I relocated /usr/lib/i386-linux-gnu/dri/nouveau_dri.so to nouveau_dri_off.so and restarted the computer with AGP. I confirmed the AIGLX error in that it switches to swrast.

Did not fix the s2disk issue (dmesg attached).

So, it's not mesa nor the kernel. All that remains is (just...) the X server, xf86-video-nouveau and libdrm. Right?

(In reply to comment #11)
> I relocated /usr/lib/i386-linux-gnu/dri/nouveau_dri.so to nouveau_dri_off.so
> and restarted the computer with AGP. I confirmed the AIGLX error in that it
> switches to swrast.
>
> Did not fix the s2disk issue (dmesg attached).
>
> So, it's not mesa nor the kernel. All that remains is (just...) the X
> server, xf86-video-nouveau and libdrm. Right?

Sounds like it.
Do you know which gcc version is/was used to compile the new libdrm ? Might it be related to this http://gcc.gnu.org/ml/gcc-help/2013-07/msg00103.html?

13.10 (saucy) is using 4.8. 4.7 seems available, rebuilding libdrm with the source package should be easy. Just need to modify CC var I think...

Ronald all the thanks goes to the Andreas Radke (Arch developer) for finding and bisecting the issue. FWIW Arch has been using clang for ~3 months to build libdrm and it seems to work great.

Ideally one of the Ubuntu/GCC devs can take a look and provide more bit more clarity than
"This hints at something depending on undefined evaluation order between sequence points" [1]

Would that be an issue with nouveau code, gcc and/or both will hopefully be determined soon.

Cheers
Emil
[1] http://gcc.gnu.org/ml/gcc-help/2013-07/msg00151.html

Changed in nouveau:
importance: Unknown → Medium
status: Unknown → Confirmed

Wow this is definitely a weird interesting bug.

It affects suspend-to-mem on my pci-e nv43 too.

<after resume from suspend to memory>
[ 40.107810] nouveau E[ PGRAPH][0000:01:00.0] ERROR nsource: DATA_ERROR nstatus: BAD_ARGUMENT
[ 40.108059] nouveau E[ PGRAPH][0000:01:00.0] ch 3 [0x000f7000 compiz[1713]] subc 2 class 0x0039 mthd 0x0314 data 0x0051f000
(error repeats)

I can confirm that rebuilding only libdrm-nouveau2 with the raring toolchain fixes it.

So, we have nv34, nv43 and nv44 affected by this. I updated the bugreport title so hopefully more people will stumble on it earlier (hope that is okay).

- nv44: GCC thread: During normal operation.
- nv43: Resuming from suspend-to-mem
- nv34: Resuming from suspend-to-disk

Created attachment 88817
hack to force defined behavior

As I feared, recompiling only pushbuf.c with gcc-4.7 fixes this bug. So something in pushbuf.c is miscompiled. Unfortunately it's also the biggest abuser of post increment ops.

A simple workaround seems to be this patch.

Compiled code with this patch:

00000000000015b0 <nouveau_pushbuf_reloc>:
    15b0: 55 push %rbp
    15b1: 48 8b 6f 30 mov 0x30(%rdi),%rbp
    15b5: 53 push %rbx
    15b6: 48 89 fb mov %rdi,%rbx
    15b9: e8 42 ea ff ff callq 0 <pushbuf_krel>
    15be: 89 45 00 mov %eax,0x0(%rbp)
    15c1: 48 83 43 30 04 addq $0x4,0x30(%rbx)
    15c6: 5b pop %rbx
    15c7: 5d pop %rbp
    15c8: c3 retq

Without:
0000000000001650 <nouveau_pushbuf_reloc>:
    1650: 48 89 5c 24 f0 mov %rbx,-0x10(%rsp)
    1655: 48 89 6c 24 f8 mov %rbp,-0x8(%rsp)
    165a: 48 83 ec 10 sub $0x10,%rsp
    165e: 48 8b 6f 30 mov 0x30(%rdi),%rbp
    1662: 48 89 fb mov %rdi,%rbx
    1665: e8 96 e9 ff ff callq 0 <pushbuf_krel>
    166a: 89 45 00 mov %eax,0x0(%rbp)
    166d: 48 83 c5 04 add $0x4,%rbp
    1671: 48 89 6b 30 mov %rbp,0x30(%rbx)
    1675: 48 8b 1c 24 mov (%rsp),%rbx
    1679: 48 8b 6c 24 08 mov 0x8(%rsp),%rbp
    167e: 48 83 c4 10 add $0x10,%rsp
    1682: c3 retq

It seems that the nouveau_pushbuf_reloc difference is harmless. Code is equivalent, except that without the patch applied it uses another temp variable causing less optimal code generation.

Remaining candidates: pushbuf_validate, nouveau_pushbuf_data, nouveau_pushbuf_del.

Oops, so gcc is broken here after all, look at this..

0000000000003860 <nouveau_pushbuf_reloc>:
    3860: 53 push %rbx
    3861: 48 8b 5f 30 mov 0x30(%rdi),%rbx
    3865: 48 8d 43 04 lea 0x4(%rbx),%rax
    3869: 48 89 47 30 mov %rax,0x30(%rdi)
    386d: e8 3e ea ff ff callq 22b0 <pushbuf_krel>
    3872: 89 03 mov %eax,(%rbx)
    3874: 5b pop %rbx
    3875: c3 retq

Source:

void
nouveau_pushbuf_reloc(struct nouveau_pushbuf *push, struct nouveau_bo *bo,
        uint32_t data, uint32_t flags, uint32_t vor, uint32_t tor)
{
 *push->cur++ = pushbuf_krel(push, bo, data, flags, vor, tor);
}

gcc-4.8 evaluates it as:

void
nouveau_pushbuf_reloc(struct nouveau_pushbuf *push, struct nouveau_bo *bo,
        uint32_t data, uint32_t flags, uint32_t vor, uint32_t tor)
{
 push->cur++;
 *push->cur[-1] = pushbuf_krel(push, bo, data, flags, vor, tor);
}

while gcc-4.7 evaluates it as:

void
nouveau_pushbuf_reloc(struct nouveau_pushbuf *push, struct nouveau_bo *bo,
        uint32_t data, uint32_t flags, uint32_t vor, uint32_t tor)
{
 *push->cur = pushbuf_krel(push, bo, data, flags, vor, tor);
 push->cur++;
}

Because the function also increments the push->cur pointer, it's a subtle but real difference:

gcc-4.7:

u32 ret = func(); // May change push->cur ptr
*push->cur = ret;
push->cur++;

gcc-4.8:
u32 *ptr = push->cur;
push->cur++;

*ptr = func(); // Already sees the push->cur ptr

I'm not a language expert, so no idea if I'm right, but it seems the updated gcc-4.8 behavior is wrong here.

Well, pedantically gcc-4.7 behavior is this:

u32 *ptr = push->cur;
*ptr = func();
push->cur++;

But since func only reads push->cur it's the same.

tags: added: saucy trusty
Changed in libdrm (Ubuntu):
assignee: nobody → Maarten Lankhorst (mlankhorst)
Changed in libdrm (Ubuntu Saucy):
assignee: nobody → Maarten Lankhorst (mlankhorst)
Changed in libdrm (Ubuntu):
importance: Undecided → High
Changed in libdrm (Ubuntu Saucy):
importance: Undecided → High

I am pretty sure the order of execution is considered undefined behavior here in C, thus the nouveau code is wrong (it is a side effect). Check out chapter 2.12 in "The C Programming Language" if you have it.

Maarten Lankhorst (mlankhorst) wrote :

Fixed with the upload to 2.4.46-4

Changed in libdrm (Ubuntu):
status: New → Fix Committed
Changed in libdrm (Ubuntu):
status: Fix Committed → Fix Released
Changed in libdrm (Ubuntu Saucy):
status: New → In Progress
Changed in nouveau:
status: Confirmed → Fix Released
description: updated

Hello Ronald, or anyone else affected,

Accepted libdrm into saucy-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/libdrm/2.4.46-1ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in libdrm (Ubuntu Saucy):
status: In Progress → Fix Committed
tags: added: verification-needed
Maarten Lankhorst (mlankhorst) wrote :

Confirmed fixed on saucy-amd64. :)

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libdrm - 2.4.46-1ubuntu1

---------------
libdrm (2.4.46-1ubuntu1) saucy-proposed; urgency=low

  [ Colin Watson ]
  * Declare libdrm-dev Multi-Arch: same.

  [ Maarten Lankhorst ]
  * Cherry-pick upstream patch to fix relocations for all cards <nv50.
    (LP: #1247607)
 -- Maarten Lankhorst <email address hidden> Mon, 11 Nov 2013 13:02:28 +0100

Changed in libdrm (Ubuntu Saucy):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for libdrm has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Hello Ronald, or anyone else affected,

Accepted libdrm into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/libdrm/2.4.46-1ubuntu0.0.0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: removed: verification-done
tags: added: verification-needed
Chris Halse Rogers (raof) wrote :

Hello Ronald, or anyone else affected,

Accepted libdrm into quantal-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/libdrm/2.4.46-1ubuntu0.0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Chris Halse Rogers (raof) wrote :

Hello Ronald, or anyone else affected,

Accepted libdrm into raring-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/libdrm/2.4.46-1ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Maarten Lankhorst (mlankhorst) wrote :

I verified that suspend still works on raring. This wasn't a bug on precise/quantal/raring, but it was mentioned when updating the changelogs to the saucy version.

tags: added: verification-done
removed: verification-needed
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.