[nvidia] Xorg crashed with SIGBUS in _dl_fixup() from _dl_runtime_resolve_xsavec() from create_bits_picture() from image_from_pict_internal() from wfb_image_from_pict()

Bug #1760450 reported by Sosha
This bug affects 8 people
Affects               Status     Importance  Assigned to  Milestone
Fedora                Unknown    Unknown
xorg-server (Ubuntu)  Confirmed  Medium      Unassigned

Bug Description

Crash occurred after logging in to the session (resuming from suspend mode).

ProblemType: Crash
DistroRelease: Ubuntu 18.04
Package: xserver-xorg-core 2:1.19.6-1ubuntu3
ProcVersionSignature: Ubuntu 4.15.0-13.14-generic 4.15.10
Uname: Linux 4.15.0-13-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.9-0ubuntu2
Architecture: amd64
Date: Sun Apr 1 20:34:29 2018
DistroCodename: bionic
DistroVariant: ubuntu
ExecutablePath: /usr/lib/xorg/Xorg
InstallationDate: Installed on 2017-09-02 (211 days ago)
InstallationMedia: Ubuntu-GNOME 17.04 "Zesty Zapus" - Release amd64 (20170412)
ProcCmdline: /usr/lib/xorg/Xorg vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -noreset -keeptty -verbose 3
ProcEnviron:

Signal: 7
SourcePackage: xorg-server
StacktraceTop:
 _dl_fixup (l=0x559c10faa5a0, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:84
 _dl_runtime_resolve_xsavec () at ../sysdeps/x86_64/dl-trampoline.h:125
 ?? () from /usr/lib/xorg/modules/libwfb.so
 wfbComposite () from /usr/lib/xorg/modules/libwfb.so
 ?? () from /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
Title: Xorg crashed with signal 7 in _dl_fixup()
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip libvirt lpadmin plugdev sambashare sudo

Revision history for this message
Sosha (soshaw) wrote :
information type: Private → Public
Apport retracing service (apport) wrote :

StacktraceTop:
 _dl_fixup (l=0x559c10faa5a0, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:84
 _dl_runtime_resolve_xsavec () at ../sysdeps/x86_64/dl-trampoline.h:125
 create_bits_picture (yoff=0x7fff0a6c9594, xoff=0x7fff0a6c9590, has_clip=1, pict=0x559c129a29d0) at ../../../../fb/fbpict.c:325
 image_from_pict_internal (pict=pict@entry=0x559c129a29d0, has_clip=has_clip@entry=1, xoff=xoff@entry=0x7fff0a6c9590, yoff=yoff@entry=0x7fff0a6c9594, is_alpha_map=is_alpha_map@entry=0) at ../../../../fb/fbpict.c:457
 wfb_image_from_pict (pict=pict@entry=0x559c129a29d0, has_clip=has_clip@entry=1, xoff=xoff@entry=0x7fff0a6c9590, yoff=yoff@entry=0x7fff0a6c9594) at ../../../../fb/fbpict.c:487

Apport retracing service (apport) wrote : Stacktrace.txt
Apport retracing service (apport) wrote : StacktraceSource.txt
Apport retracing service (apport) wrote : ThreadStacktrace.txt
Changed in xorg-server (Ubuntu):
importance: Undecided → Medium
tags: removed: need-amd64-retrace
Alan Jenkins (aj504) wrote : Re: Xorg crashed with signal 7 in _dl_fixup()

Hi Ubuntu users! Signal 7 is SIGBUS. SIGBUS should be relatively unusual on x86 [1].

[1] https://stackoverflow.com/questions/2089167/debugging-sigbus-on-x86-linux

I'm excited to inform you that Fedora Linux users also started seeing the same root problem. It is tied to the upgrade from kernel v4.14 to v4.15.

Fedora bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1553979

Arch Linux independently identified this as caused by the kernel upgrade:
https://bbs.archlinux.org/viewtopic.php?id=235027

It can happen after resume from suspend: not every time, but maybe once every three days. We have reports of both Xwayland and Xorg getting a fatal SIGBUS in _dl_fixup(). (This one is actually a secondary crash in xorg_backtrace(), but we have a load of SIGBUS traces that share the same primary trace.)

Notice the specific faulting instruction in the disassembly you captured: it is not performing a memory access!

=> 0x559c102a4060 <ErrorFSigSafe>: sub $0xd8,%rsp

Instead, notice that this is the first instruction in the function ErrorFSigSafe. This is a big common factor in our traces. (We actually have several different traces captured, with the failing function varying, often along the same call chain).

What's happening is a fault on the instruction fetch. You should be able to confirm this by looking at the address that generated the fault (the si_addr field of struct siginfo; I don't know where the Ubuntu crash collector saves this information).
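
As a rough sketch of that check: gdb exposes the signal details of a core dump through its `$_siginfo` convenience variable, so the fault address can be printed next to the program counter and compared. The helper name `is_fetch_fault` and the same-page heuristic are assumptions made for illustration; they are not part of Apport or any other crash tooling.

```shell
# Hypothetical helper: succeed if the fault address (si_addr, $1) lies in the
# same 4 KiB page as the program counter ($2). If it does, and the instruction
# at pc makes no memory access of its own, the fault was most likely on the
# instruction fetch.
is_fetch_fault() {
    # Shell arithmetic accepts 0x-prefixed hex; shifting by 12 gives the page number.
    [ $(( $1 >> 12 )) -eq $(( $2 >> 12 )) ]
}

# The two values can be read from a core dump with gdb, for example:
#   gdb -batch -ex 'p $_siginfo._sifields._sigfault.si_addr' -ex 'p $pc' \
#       /usr/lib/xorg/Xorg /path/to/core
```

For the trace in this report, pc is 0x559c102a4060 (the first instruction of ErrorFSigSafe), so `is_fetch_fault <si_addr> 0x559c102a4060` would be the comparison to run once si_addr is known.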

The kernel failed to load in the page which holds the program code at this point. That's the real problem: some sort of transient IO error during wakeup. Users sometimes see other symptoms of these IO errors as well:

PM: resume devices took 1.017 seconds
Restarting tasks ...
Read-error on swap-device (253:1:836184)
PM: suspend exit
systemd-coredump[755]: Process 1356 (Xwayland) of user 42 dumped core.

and

PM: suspend exit
EXT4-fs error (device dm-2): ext4_find_entry:1436: inode #5514052: comm thunderbird: reading directory lblock 0
Buffer I/O error on dev dm-2, logical block 0, lost sync page write
WARNING: CPU: 1 PID: 748 at fs/buffer.c:1108 mark_buffer_dirty+0xd4/0xe0
 (and a kernel backtrace)
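
If you want to look for these symptoms on your own machine, a minimal sketch is to filter the kernel log for the messages quoted above. The function name `find_resume_io_errors` is made up for this example, and the pattern list only covers the three messages seen in this report, so an empty result does not prove the kernel is unaffected.

```shell
# Hypothetical helper: read kernel log text on stdin and print only the lines
# that match the I/O error messages quoted in this report.
find_resume_io_errors() {
    grep -E 'Read-error on swap-device|Buffer I/O error on dev|EXT4-fs error'
}

# Example (needs permission to read the kernel log):
#   journalctl -k -b | find_resume_io_errors
```

The function exits non-zero when nothing matches, so it can also be used in an `if` test after a resume.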

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in xorg-server (Ubuntu):
status: New → Confirmed
Daniel van Vugt (vanvugt) wrote : Re: [nvidia] Xorg crashed with signal 7 in _dl_fixup() from _dl_runtime_resolve_xsavec() called from nvidia_drv.so

I found it hard to follow your log files because of so many errors from extension "<email address hidden>". Maybe try removing your extensions if they're that buggy/noisy.

Also, what version of the nvidia driver are you using?

summary: - Xorg crashed with signal 7 in _dl_fixup()
+ Xorg crashed with signal 7 in _dl_fixup() from
+ _dl_runtime_resolve_xsavec()
summary: - Xorg crashed with signal 7 in _dl_fixup() from
- _dl_runtime_resolve_xsavec()
+ [nvidia] Xorg crashed with signal 7 in _dl_fixup() from
+ _dl_runtime_resolve_xsavec() called from nvidia_drv.so
tags: added: nvidia
Alan Jenkins (aj504) wrote :

Uh, if anyone else is affected by this, there's a trivial fix upstream already (and a workaround). Hop to it, Ubuntu. gregkh is looking disappointed at you :-). I checked, and it looks like you didn't apply it to your 4.15 tree. See the end for links to the fix etc.

For users: The workaround is to add "scsi_mod.scan=sync" on the kernel command line (i.e. edit /etc/default/grub and run `update-grub`).
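
After rebooting, it is worth confirming that the parameter actually reached the kernel. A minimal sketch, assuming the parameters in /proc/cmdline are separated by single spaces; the function name `cmdline_has_param` is invented for this example:

```shell
# Hypothetical helper: succeed if the kernel command line ($1) contains the
# exact parameter ($2) as a whole space-separated word, so a mistyped
# "csi_mod.scan=sync" does not count as a match.
cmdline_has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# After rebooting:
#   cmdline_has_param "$(cat /proc/cmdline)" scsi_mod.scan=sync && echo "workaround active"
```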

Please note

1. AFAICT this is near-universal.
   It affects all desktop users of kernel 4.15/4.16 who use suspend
   (and whose workloads use all their RAM).
   It could be avoided by not using SCSI, but it does affect all systems with root on SATA.

2. Although this is horrible when it happens (X crash) and can happen on a near-daily basis,
   it can be quite difficult for users to analyze and report. For example, the crash doesn't
   have one specific backtrace in Xorg. It tends to generate several different backtraces,
   non-deterministically. Sometimes, making a coredump fails, presumably due to the same bug
   that causes the crash.

   I remember that Sosha had to make two attempts at reporting this bug
   (though I don't remember what was wrong with the first one).

   Also, it's triggered by a medium-term SIGALRM timer in Xorg.
   This made it really annoying to reproduce, at the time when I didn't know the root cause.
   I was able to reproduce the memory pressure needed, but it didn't happen
   when testing suspend+resume... only when I broke for lunch and left the machine
   suspended for long enough :).

Fix: "block: do not use interruptible wait anywhere"

in kernel 4.17: https://github.com/torvalds/linux/commit/1dc3039bc87ae7d19a990c3ee71cfd8a9068f428

in kernel 4.16.8: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.16.y&id=7859056bc73dea2c3714b00c83b253d4c22bf7b6

lack of fix in 4.15.0-23.25 (ubuntu bionic): https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/block/blk-core.c?id=Ubuntu-4.15.0-23.25#n856
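
For a quick first guess at whether a given kernel is in the affected range, here is a rough sketch based only on the version numbers mentioned in this thread (4.15.x, and 4.16.x before 4.16.8). The function name `kernel_possibly_affected` is hypothetical, and the check is deliberately pessimistic: distro kernels such as Ubuntu's 4.15.0-N builds may carry a backport of the fix, which a version string alone cannot reveal.

```shell
# Hypothetical helper: succeed if a kernel release string such as
# "4.15.0-13-generic" falls in the range this thread describes as affected.
# A match means "possibly affected", not "definitely affected".
kernel_possibly_affected() {
    local base major rest minor patch
    base=${1%%-*}                  # "4.15.0-13-generic" -> "4.15.0"
    major=${base%%.*}              # "4"
    rest=${base#*.}                # "15.0"
    minor=${rest%%.*}              # "15"
    case $rest in
        *.*) patch=${rest#*.} ;;   # "0"
        *)   patch=0 ;;            # no third field, e.g. "4.16"
    esac
    patch=${patch%%[!0-9]*}        # drop any trailing non-numeric suffix
    patch=${patch:-0}
    # Refuse to guess if the first two fields are not plain numbers.
    case $major in ''|*[!0-9]*) return 1 ;; esac
    case $minor in ''|*[!0-9]*) return 1 ;; esac
    if [ "$major" -eq 4 ] && [ "$minor" -eq 15 ]; then
        return 0
    fi
    [ "$major" -eq 4 ] && [ "$minor" -eq 16 ] && [ "$patch" -lt 8 ]
}

# Example:
#   kernel_possibly_affected "$(uname -r)" && echo "consider the workaround"
```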

Sosha (soshaw) wrote :

@Daniel

I use nvidia-driver-396.

Julien Olivier (julo) wrote :

I confirm that adding "scsi_mod.scan=sync" does indeed fix this bug.

Alan Jenkins (aj504) wrote :

Thanks for your confirmation, Julien. I have asked Ubuntu to import the proper fix from upstream and they responded very promptly. See:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776887

They have posted a test kernel. I don't have an Ubuntu install to test it on, only a VM which cannot suspend. It *might* be useful if someone wants to volunteer to try the test kernel.

You could test that your normal suspend still works, and test the command mentioned in the commit. I.e.

$ sudo -i
# dd if=/dev/sda of=/dev/null iflag=direct & \
  while killall -SIGUSR1 dd; do sleep 0.1; done & \
  echo mem > /sys/power/state ; \
  sleep 5; killall dd # stop after 5 seconds

On a "bad" kernel, any time you run this command it should show a message about an IO error. On a "good" kernel, the system will appear to suspend and resume, but there should be no IO error.

Vedant Bhatia (vedant19) wrote :

Hi. I faced a similar issue and when I tried to add "scsi_mod.scan=sync" to /etc/default/grub and then run update-grub I got the following error:
/usr/sbin/grub-mkconfig: 36: /etc/default/grub: scsi_mod.scan=sync: not found

Would appreciate any help, thanks.

Vedant Bhatia (vedant19) wrote :

The problem hasn't occurred so far after adding "scsi_mod.scan=sync".

Alan Jenkins (aj504) wrote :

Hi Vedant. Change the line in /etc/default/grub e.g.

GRUB_CMDLINE_LINUX_DEFAULT="quiet"

to

GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.scan=sync"

and then re-run update-grub.
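
The "not found" error comes from putting the parameter on a line of its own: grub-mkconfig reads /etc/default/grub as a shell script, so a bare `scsi_mod.scan=sync` line is run as a command. If you would rather script the edit than do it by hand, something like the following sketch should work. The function name `add_grub_param` is made up for this example, and it assumes GNU sed, a double-quoted value on a single line, and a parameter containing no `/`, `&` or backslash characters.

```shell
# Hypothetical helper: append a parameter ($2) inside the quotes of the
# GRUB_CMDLINE_LINUX_DEFAULT line in the given file ($1), unless that line
# already contains it.
add_grub_param() {
    local file=$1 param=$2
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0    # already present, nothing to do
    fi
    # Capture everything up to the closing quote, then re-add the quote after
    # the new parameter.
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 ${param}\"/" "$file"
}

# Try it on a copy first:
#   cp /etc/default/grub /tmp/grub.test
#   add_grub_param /tmp/grub.test scsi_mod.scan=sync
```

Once the copy looks right, apply the same change to /etc/default/grub as root and re-run update-grub.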

summary: - [nvidia] Xorg crashed with signal 7 in _dl_fixup() from
- _dl_runtime_resolve_xsavec() called from nvidia_drv.so
+ [nvidia] Xorg crashed with SIGBUS in _dl_fixup() from
+ _dl_runtime_resolve_xsavec() from create_bits_picture() from
+ image_from_pict_internal() from wfb_image_from_pict()