[nvidia] Xorg crashed with SIGBUS in _dl_fixup() from _dl_runtime_resolve_xsavec() from create_bits_picture() from image_from_pict_internal() from wfb_image_from_pict()

Bug #1760450 reported by Sosha on 2018-04-01
This bug affects 8 people
Affects: Fedora | Status: Unknown | Importance: Unknown
Affects: xorg-server (Ubuntu) | Importance: Medium | Assigned to: Unassigned

Bug Description

Happened after logging back in to the session (after suspend).

ProblemType: Crash
DistroRelease: Ubuntu 18.04
Package: xserver-xorg-core 2:1.19.6-1ubuntu3
ProcVersionSignature: Ubuntu 4.15.0-13.14-generic 4.15.10
Uname: Linux 4.15.0-13-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.9-0ubuntu2
Architecture: amd64
Date: Sun Apr 1 20:34:29 2018
DistroCodename: bionic
DistroVariant: ubuntu
ExecutablePath: /usr/lib/xorg/Xorg
InstallationDate: Installed on 2017-09-02 (211 days ago)
InstallationMedia: Ubuntu-GNOME 17.04 "Zesty Zapus" - Release amd64 (20170412)
ProcCmdline: /usr/lib/xorg/Xorg vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -noreset -keeptty -verbose 3
ProcEnviron:

Signal: 7
SourcePackage: xorg-server
StacktraceTop:
 _dl_fixup (l=0x559c10faa5a0, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:84
 _dl_runtime_resolve_xsavec () at ../sysdeps/x86_64/dl-trampoline.h:125
 ?? () from /usr/lib/xorg/modules/libwfb.so
 wfbComposite () from /usr/lib/xorg/modules/libwfb.so
 ?? () from /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
Title: Xorg crashed with signal 7 in _dl_fixup()
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip libvirt lpadmin plugdev sambashare sudo

Sosha (soshaw) wrote :
information type: Private → Public

StacktraceTop:
 _dl_fixup (l=0x559c10faa5a0, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:84
 _dl_runtime_resolve_xsavec () at ../sysdeps/x86_64/dl-trampoline.h:125
 create_bits_picture (yoff=0x7fff0a6c9594, xoff=0x7fff0a6c9590, has_clip=1, pict=0x559c129a29d0) at ../../../../fb/fbpict.c:325
 image_from_pict_internal (pict=pict@entry=0x559c129a29d0, has_clip=has_clip@entry=1, xoff=xoff@entry=0x7fff0a6c9590, yoff=yoff@entry=0x7fff0a6c9594, is_alpha_map=is_alpha_map@entry=0) at ../../../../fb/fbpict.c:457
 wfb_image_from_pict (pict=pict@entry=0x559c129a29d0, has_clip=has_clip@entry=1, xoff=xoff@entry=0x7fff0a6c9590, yoff=yoff@entry=0x7fff0a6c9594) at ../../../../fb/fbpict.c:487

Changed in xorg-server (Ubuntu):
importance: Undecided → Medium
tags: removed: need-amd64-retrace

Hi Ubuntu users! Signal 7 is SIGBUS. SIGBUS should be relatively unusual on x86 [1].

[1] https://stackoverflow.com/questions/2089167/debugging-sigbus-on-x86-linux

I'm excited to inform you that Fedora Linux users also started seeing the same root problem. It is tied to the upgrade from kernel v4.14 to v4.15.

Fedora bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1553979

Arch Linux independently identified this as caused by the kernel upgrade:
https://bbs.archlinux.org/viewtopic.php?id=235027

It can happen after resume from suspend: not every time, but maybe once every three days. We have reports of both Xwayland and Xorg getting a fatal SIGBUS in _dl_fixup(). (This is actually a secondary crash in xorg_backtrace(), but we have a load of SIGBUS traces that all share the same primary trace.)

Notice the specific faulting instruction in the disassembly you captured: it is not performing a memory access!

=> 0x559c102a4060 <ErrorFSigSafe>: sub $0xd8,%rsp

Instead, notice that this is the first instruction in the function ErrorFSigSafe. This is a big common factor in our traces. (We actually have several different traces captured, with the failing function varying, often along the same call chain).

What's happening is a fault on the instruction fetch. You should be able to confirm this by looking at the address which generated the fault (the si_addr field of struct siginfo; I don't know where the Ubuntu crash collector saves this information).

The kernel failed to load in the page which holds the program code at this point. That's the real problem: some sort of transient IO error during wakeup. Users sometimes see other symptoms of these IO errors as well:

PM: resume devices took 1.017 seconds
Restarting tasks ...
Read-error on swap-device (253:1:836184)
PM: suspend exit
systemd-coredump[755]: Process 1356 (Xwayland) of user 42 dumped core.

and

PM: suspend exit
EXT4-fs error (device dm-2): ext4_find_entry:1436: inode #5514052: comm thunderbird: reading directory lblock 0
Buffer I/O error on dev dm-2, logical block 0, lost sync page write
WARNING: CPU: 1 PID: 748 at fs/buffer.c:1108 mark_buffer_dirty+0xd4/0xe0
 (and a kernel backtrace)
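
To check whether a machine shows the same symptoms, one option is to filter the kernel log for the messages quoted above. This is only a minimal sketch: the function name `find_resume_io_errors` is made up for illustration, and the patterns are taken directly from the two excerpts, so they are not an exhaustive list of possible I/O error messages.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this example): read kernel log text
# on stdin and print only the lines that look like the I/O error symptoms
# quoted in this report.
find_resume_io_errors() {
    grep -E 'Read-error on swap-device|Buffer I/O error on dev|EXT4-fs error'
}

# Example usage (commented out; reading the kernel log may need root):
#   dmesg | find_resume_io_errors
#   journalctl -k -b | find_resume_io_errors
```

A match shortly after a "PM: suspend exit" line would be consistent with the problem described here, but these messages can also come from unrelated disk faults.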

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in xorg-server (Ubuntu):
status: New → Confirmed

I found it hard to follow your log files because of so many errors from extension "<email address hidden>". Maybe try removing your extensions if they're that buggy/noisy.

Also, what version of the nvidia driver are you using?

summary: - Xorg crashed with signal 7 in _dl_fixup()
+ Xorg crashed with signal 7 in _dl_fixup() from
+ _dl_runtime_resolve_xsavec()
summary: - Xorg crashed with signal 7 in _dl_fixup() from
- _dl_runtime_resolve_xsavec()
+ [nvidia] Xorg crashed with signal 7 in _dl_fixup() from
+ _dl_runtime_resolve_xsavec() called from nvidia_drv.so
tags: added: nvidia

Alan Jenkins (aj504) wrote :

Uh, if anyone else is affected by this, there's a trivial fix upstream already (and a workaround). Hop to it, Ubuntu. gregkh is looking disappointed at you :-). I checked, and it looks like you didn't apply it to your 4.15 tree. See the end of this comment for links to the fix etc.

For users: The workaround is to add "scsi_mod.scan=sync" on the kernel command line (i.e. edit /etc/default/grub and run `update-grub`).
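
After rebooting, one way to confirm that the parameter actually reached the kernel is to look at /proc/cmdline. A minimal sketch, assuming a POSIX shell; the function name `has_scsi_scan_sync` is invented for this example, and the optional file argument exists only so the check can be tried against a sample file.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this example): succeed if the given
# kernel command line file (default: /proc/cmdline) contains the workaround
# parameter as a complete, separate word.
has_scsi_scan_sync() {
    cmdline_file="${1:-/proc/cmdline}"
    # Put each parameter on its own line, then require an exact line match.
    tr ' ' '\n' < "$cmdline_file" | grep -qx 'scsi_mod.scan=sync'
}

# Example usage (commented out):
#   if has_scsi_scan_sync; then echo "workaround active"; else echo "workaround not active"; fi
```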

Please note

1. AFAICT this is near-universal.
   It affects all desktop users of kernel 4.15/4.16 who use suspend
   (and whose workloads use all their RAM).
   It could be avoided by not using SCSI, but it does affect all systems with root on SATA.

2. Although this is horrible when it happens (X crash) and can happen on a near-daily basis,
   it can be quite difficult for users to analyze and report. For example, the crash doesn't
   have one specific backtrace in Xorg. It tends to generate several different backtraces,
   non-deterministically. Sometimes, making a coredump fails, presumably due to the same bug
   that causes the crash.

   I remember that Sosha had to make two attempts at reporting this bug
   (though I don't remember what was wrong with the first one).

   Also, it's triggered by a medium-term SIGALRM timer in Xorg.
   This made it really annoying to reproduce, at the time when I didn't know the root cause.
   I was able to reproduce the memory pressure needed, but it didn't happen
   when testing suspend+resume... only when I broke for lunch and left the machine
   suspended for long enough :).

Fix: "block: do not use interruptible wait anywhere"

in kernel 4.17: https://github.com/torvalds/linux/commit/1dc3039bc87ae7d19a990c3ee71cfd8a9068f428

in kernel 4.16.8: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.16.y&id=7859056bc73dea2c3714b00c83b253d4c22bf7b6

lack of fix in 4.15.0-23.25 (ubuntu bionic): https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/block/blk-core.c?id=Ubuntu-4.15.0-23.25#n856

Sosha (soshaw) wrote :

@Daniel

i use nvidia-driver-396.

Julien Olivier (julo) wrote :

I confirm that adding "scsi_mod.scan=sync" does indeed fix this bug.

Alan Jenkins (aj504) wrote :

Thanks for your confirmation, Julien. I have asked Ubuntu to import the proper fix from upstream and they responded very promptly. See:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776887

They have posted a test kernel. I don't have an Ubuntu install to test it on, only a VM which cannot suspend. It *might* be useful if someone wants to volunteer to try using the test kernel.

You could test that your normal suspend still works, and test the command mentioned in the commit. I.e.

$ sudo -i
# dd if=/dev/sda of=/dev/null iflag=direct & \
  while killall -SIGUSR1 dd; do sleep 0.1; done & \
  echo mem > /sys/power/state ; \
  sleep 5; killall dd # stop after 5 seconds

On a "bad" kernel, any time you run this command it should show a message about an IO error. On a "good" kernel, the system will appear to suspend and resume, but there should be no IO error.

Vedant Bhatia (vedant19) wrote :

Hi. I faced a similar issue and when I tried to add "scsi_mod.scan=sync" to /etc/default/grub and then run update-grub I got the following error:
/usr/sbin/grub-mkconfig: 36: /etc/default/grub: scsi_mod.scan=sync: not found

Would appreciate any help, thanks.

Vedant Bhatia (vedant19) wrote :

The problem hasn't occurred so far after adding "scsi_mod.scan=sync".

Alan Jenkins (aj504) wrote :

Hi Vedant. Change the existing line in /etc/default/grub, e.g.

GRUB_CMDLINE_LINUX_DEFAULT="quiet"

to

GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.scan=sync"

and then re-run update-grub.
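
The same edit can be scripted. The following is a rough sketch only: the function name `add_scan_sync` is made up for this example, it assumes the GRUB_CMDLINE_LINUX_DEFAULT value is wrapped in double quotes as shown above, and it relies on `sed -i.bak` to leave a backup copy next to the file. The key point is that the parameter goes inside the quotes of the existing line; /etc/default/grub is read as a shell script, so a bare `scsi_mod.scan=sync` line is treated as a command to run, which would explain the "not found" error.

```shell
#!/bin/sh
# Hypothetical helper (name invented for this example): add
# scsi_mod.scan=sync inside the quotes of GRUB_CMDLINE_LINUX_DEFAULT in the
# given file, unless it is already there.
add_scan_sync() {
    grub_file="$1"
    # Nothing to do if the parameter is already on that line.
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*scsi_mod\.scan=sync' "$grub_file"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote;
    # the original file is kept as "$grub_file.bak".
    sed -i.bak 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 scsi_mod.scan=sync"/' "$grub_file"
}

# Example usage (commented out; try it on a copy first):
#   cp /etc/default/grub /tmp/grub.test && add_scan_sync /tmp/grub.test
```

Once the result looks right, the same change can be applied to /etc/default/grub as root, followed by update-grub.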

summary: - [nvidia] Xorg crashed with signal 7 in _dl_fixup() from
- _dl_runtime_resolve_xsavec() called from nvidia_drv.so
+ [nvidia] Xorg crashed with SIGBUS in _dl_fixup() from
+ _dl_runtime_resolve_xsavec() from create_bits_picture() from
+ image_from_pict_internal() from wfb_image_from_pict()