Ubuntu
xorg-server package

[nvidia] Xorg crashed with SIGBUS in _dl_fixup() from _dl_runtime_resolve_xsavec() from create_bits_picture() from image_from_pict_internal() from wfb_image_from_pict()

Bug #1760450 reported by Sosha on 2018-04-01

This bug affects 8 people

Affects               Status     Importance  Assigned to           Milestone
Fedora                Unknown    Unknown     redhat-bugs #1553979
xorg-server (Ubuntu)  Confirmed  Medium      Unassigned

Bug Description

After logging in to the session (resuming from suspend mode).

ProblemType: Crash
DistroRelease: Ubuntu 18.04
Package: xserver-xorg-core 2:1.19.6-1ubuntu3
ProcVersionSignature: Ubuntu 4.15.0-13.14-generic 4.15.10
Uname: Linux 4.15.0-13-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.9-0ubuntu2
Architecture: amd64
Date: Sun Apr 1 20:34:29 2018
DistroCodename: bionic
DistroVariant: ubuntu
ExecutablePath: /usr/lib/xorg/Xorg
InstallationDate: Installed on 2017-09-02 (211 days ago)
InstallationMedia: Ubuntu-GNOME 17.04 "Zesty Zapus" - Release amd64 (20170412)
ProcCmdline: /usr/lib/xorg/Xorg vt2 -displayfd 3 -auth /run/user/1000/gdm/Xauthority -background none -noreset -keeptty -verbose 3
ProcEnviron:

Signal: 7
SourcePackage: xorg-server
StacktraceTop:
_dl_fixup (l=0x559c10faa5a0, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:84
_dl_runtime_resolve_xsavec () at ../sysdeps/x86_64/dl-trampoline.h:125
?? () from /usr/lib/xorg/modules/libwfb.so
wfbComposite () from /usr/lib/xorg/modules/libwfb.so
?? () from /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
Title: Xorg crashed with signal 7 in _dl_fixup()
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip libvirt lpadmin plugdev sambashare sudo

Tags:

Revision history for this message

Sosha (soshaw) wrote on 2018-04-01:

Dependencies.txt (3.7 KiB, text/plain; charset="utf-8")
Disassembly.txt (981 bytes, text/plain; charset="utf-8")
JournalErrors.txt (158.3 KiB, text/plain; charset="utf-8")
ProcCpuinfoMinimal.txt (1.2 KiB, text/plain; charset="utf-8")
ProcMaps.txt (30.3 KiB, text/plain; charset="utf-8")
ProcStatus.txt (1.3 KiB, text/plain; charset="utf-8")
Registers.txt (839 bytes, text/plain; charset="utf-8")
Stacktrace.txt (1.1 KiB, text/plain; charset="utf-8")
ThreadStacktrace.txt (2.1 KiB, text/plain; charset="utf-8")

information type:

Private → Public

Revision history for this message

Apport retracing service (apport) wrote on 2018-04-01:

StacktraceTop:
_dl_fixup (l=0x559c10faa5a0, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:84
_dl_runtime_resolve_xsavec () at ../sysdeps/x86_64/dl-trampoline.h:125
create_bits_picture (yoff=0x7fff0a6c9594, xoff=0x7fff0a6c9590, has_clip=1, pict=0x559c129a29d0) at ../../../../fb/fbpict.c:325
image_from_pict_internal (pict=pict@entry=0x559c129a29d0, has_clip=has_clip@entry=1, xoff=xoff@entry=0x7fff0a6c9590, yoff=yoff@entry=0x7fff0a6c9594, is_alpha_map=is_alpha_map@entry=0) at ../../../../fb/fbpict.c:457
wfb_image_from_pict (pict=pict@entry=0x559c129a29d0, has_clip=has_clip@entry=1, xoff=xoff@entry=0x7fff0a6c9590, yoff=yoff@entry=0x7fff0a6c9594) at ../../../../fb/fbpict.c:487

Revision history for this message

Apport retracing service (apport) wrote on 2018-04-01: Stacktrace.txt

Stacktrace.txt (3.0 KiB, text/plain)

Revision history for this message

Apport retracing service (apport) wrote on 2018-04-01: StacktraceSource.txt

StacktraceSource.txt (4.1 KiB, text/plain)

Revision history for this message

Apport retracing service (apport) wrote on 2018-04-01: ThreadStacktrace.txt

ThreadStacktrace.txt (15.5 KiB, text/plain)

Changed in xorg-server (Ubuntu):
importance:	Undecided → Medium
tags:	removed: need-amd64-retrace

Revision history for this message

Alan Jenkins (aj504) wrote on 2018-04-04: Re: Xorg crashed with signal 7 in _dl_fixup()

Hi Ubuntu users! Signal 7 is SIGBUS. SIGBUS should be relatively unusual on x86 [1].

[1] https://stackoverflow.com/questions/2089167/debugging-sigbus-on-x86-linux

I'm excited to inform you that Fedora Linux users also started seeing the same root problem. It is tied to the upgrade from kernel v4.14 to v4.15.

Fedora bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1553979

Arch Linux independently identified this as caused by the kernel upgrade:
https://bbs.archlinux.org/viewtopic.php?id=235027

It can happen after resume from suspend, not every time but maybe once every three days. We have reports for both Xwayland and Xorg getting a fatal SIGBUS in _dl_fixup(). (While this is actually a secondary crash in xorg_backtrace(), we have a load of SIGBUS traces that have the same primary trace as each other).

Notice the specific faulting instruction in the disassembly you captured: it is not performing a memory access!

=> 0x559c102a4060 <ErrorFSigSafe>: sub $0xd8,%rsp

Instead, notice that this is the first instruction in the function ErrorFSigSafe. This is a big common factor in our traces. (We actually have several different traces captured, with the failing function varying, often along the same call chain).

What's happening is a fault on the instruction fetch. You should be able to confirm this if you look at the address which generates the fault. (This is the si_addr field of struct siginfo; I don't know where the Ubuntu crash collector saves this information.)

The kernel failed to load in the page which holds the program code at this point. That's the real problem: some sort of transient IO error during wakeup. Users sometimes see other symptoms of these IO errors as well:

PM: resume devices took 1.017 seconds
Restarting tasks ...
Read-error on swap-device (253:1:836184)
PM: suspend exit
systemd-coredump[755]: Process 1356 (Xwayland) of user 42 dumped core.

and

PM: suspend exit
EXT4-fs error (device dm-2): ext4_find_entry:1436: inode #5514052: comm thunderbird: reading directory lblock 0
Buffer I/O error on dev dm-2, logical block 0, lost sync page write
WARNING: CPU: 1 PID: 748 at fs/buffer.c:1108 mark_buffer_dirty+0xd4/0xe0
(and a kernel backtrace)
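
To check whether a system shows the same symptoms, the kernel log can be searched for these messages after a resume. The following is only a sketch: the function name find_resume_io_errors is made up for illustration, and the patterns are taken from the two log excerpts above, so I/O errors worded differently would be missed.

```shell
#!/bin/sh
# Filter kernel log text (read from stdin) for the I/O error symptoms
# quoted in the excerpts above. The function name is hypothetical.
find_resume_io_errors() {
    grep -E 'Read-error on swap-device|Buffer I/O error on dev|EXT4-fs error'
}

# Example, using kernel messages from the current boot:
#   journalctl -k -b --no-pager | find_resume_io_errors
# or, where the journal is not available:
#   dmesg | find_resume_io_errors
```

No output means none of these particular messages were logged; it does not prove that no page-in failed.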

Revision history for this message

Launchpad Janitor (janitor) wrote on 2018-05-29:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in xorg-server (Ubuntu):
status:	New → Confirmed

Revision history for this message

Daniel van Vugt (vanvugt) wrote on 2018-05-29: Re: [nvidia] Xorg crashed with signal 7 in _dl_fixup() from _dl_runtime_resolve_xsavec() called from nvidia_drv.so

I found it hard to follow your log files because of so many errors from extension "<email address hidden>". Maybe try removing your extensions if they're that buggy/noisy.

Also, what version of the nvidia driver are you using?

summary:

- Xorg crashed with signal 7 in _dl_fixup()
+ Xorg crashed with signal 7 in _dl_fixup() from
+ _dl_runtime_resolve_xsavec()

summary:

- Xorg crashed with signal 7 in _dl_fixup() from
- _dl_runtime_resolve_xsavec()
+ [nvidia] Xorg crashed with signal 7 in _dl_fixup() from
+ _dl_runtime_resolve_xsavec() called from nvidia_drv.so
tags:	added: nvidia

Revision history for this message

Alan Jenkins (aj504) wrote on 2018-05-29:

Uh, if anyone else is affected by this, there's a trivial fix upstream already (and a workaround). Hop to it, Ubuntu. gregkh is looking disappointed at you :-). I checked, and it looks like you didn't apply it to your 4.15 tree. See end for links to the fix etc.

For users: The workaround is to add "scsi_mod.scan=sync" on the kernel command line (i.e. edit /etc/default/grub and run `update-grub`).
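
After rebooting, it is worth confirming that the parameter actually reached the kernel. A minimal sketch, assuming the parameter appears as a separate whitespace-delimited word in /proc/cmdline; the helper name has_sync_scan is invented for this example:

```shell
#!/bin/sh
# Succeed if the given kernel command line contains the workaround
# as a whole word. The function name is hypothetical.
has_sync_scan() {
    case " $1 " in
        *" scsi_mod.scan=sync "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: check the running kernel.
#   if has_sync_scan "$(cat /proc/cmdline)"; then
#       echo "workaround active"
#   else
#       echo "workaround NOT active"
#   fi
```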

Please note

1. AFAICT this is near-universal.
   It affects all desktop users of kernel 4.15/4.16 who use suspend
   (and whose workloads use all their RAM).
   It could be avoided by not using SCSI, but it does affect all systems with root on SATA.

2. Although this is horrible when it happens (X crash) and can happen on a near-daily basis,
   it can be quite difficult for users to analyze and report. For example, the crash doesn't
   have one specific backtrace in Xorg. It tends to generate several different backtraces,
   non-deterministically. Sometimes, making a coredump fails, presumably due to the same bug
   that causes the crash.

   I remember that Sosha had to make two attempts at reporting this bug
   (though I don't remember what was wrong with the first one).

   Also, it's triggered by a medium-term SIGALRM timer in Xorg.
   This made it really annoying to reproduce, at the time when I didn't know the root cause.
   I was able to reproduce the memory pressure needed, but it didn't happen
   when testing suspend+resume... only when I broke for lunch and left the machine
   suspended for long enough :).

Fix: "block: do not use interruptible wait anywhere"

in kernel 4.17: https://github.com/torvalds/linux/commit/1dc3039bc87ae7d19a990c3ee71cfd8a9068f428

in kernel 4.16.8: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-4.16.y&id=7859056bc73dea2c3714b00c83b253d4c22bf7b6

lack of fix in 4.15.0-23.25 (ubuntu bionic): https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/block/blk-core.c?id=Ubuntu-4.15.0-23.25#n856
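
As a rough first check, the running kernel's version can be compared against the first upstream releases listed above. This is a sketch only: version_ge is a made-up helper, it relies on `sort -V` (GNU coreutils), and a version number alone is not conclusive for distribution kernels, which may carry the fix as a backport under an older version number.

```shell
#!/bin/sh
# True if version $1 is greater than or equal to version $2,
# using the version ordering of `sort -V`. Hypothetical helper.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: strip the distribution suffix from `uname -r`
# (e.g. 4.15.0-13-generic -> 4.15.0) and compare with 4.16.8.
#   running=$(uname -r | cut -d- -f1)
#   if version_ge "$running" 4.16.8; then
#       echo "at or past the upstream release with the fix"
#   else
#       echo "older than 4.16.8: check whether your distribution backported the fix"
#   fi
```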

Revision history for this message

Sosha (soshaw) wrote on 2018-05-31:

#10

@Daniel

I use nvidia-driver-396.

Revision history for this message

Julien Olivier (julo) wrote on 2018-06-14:

#11

I confirm that adding "scsi_mod.scan=sync" does indeed fix this bug.

Revision history for this message

Alan Jenkins (aj504) wrote on 2018-06-15:

#12

Thanks for your confirmation, Julien. I have asked Ubuntu to import the proper fix from upstream and they responded very promptly. See:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776887

They have posted a test kernel. I don't have an Ubuntu install to test it on, only a VM which cannot suspend. It *might* be useful if someone wants to volunteer to try using the test kernel.

You could test that your normal suspend still works, and test the command mentioned in the commit. I.e.

$ sudo -i
# dd if=/dev/sda of=/dev/null iflag=direct & \
  while killall -SIGUSR1 dd; do sleep 0.1; done & \
  echo mem > /sys/power/state ; \
  sleep 5; killall dd # stop after 5 seconds

On a "bad" kernel, any time you run this command it should show a message about an IO error. On a "good" kernel, the system will appear to suspend and resume, but there should be no IO error.

Revision history for this message

Vedant Bhatia (vedant19) wrote on 2018-08-06:

#13

Hi. I faced a similar issue and when I tried to add "scsi_mod.scan=sync" to /etc/default/grub and then run update-grub I got the following error:
/usr/sbin/grub-mkconfig: 36: /etc/default/grub: scsi_mod.scan=sync: not found

Would appreciate any help, thanks.

Revision history for this message

Vedant Bhatia (vedant19) wrote on 2018-08-07:

#14

The problem hasn't occurred so far after adding "scsi_mod.scan=sync".

Revision history for this message

Alan Jenkins (aj504) wrote on 2018-08-07:

#15

Hi Vedant. Change the existing line in /etc/default/grub, e.g. from

GRUB_CMDLINE_LINUX_DEFAULT="quiet"

to

GRUB_CMDLINE_LINUX_DEFAULT="quiet scsi_mod.scan=sync"

and then re-run update-grub.
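
The "not found" error in #13 is what you would expect if the parameter was put on a line of its own: /etc/default/grub is read as a shell script, so a bare scsi_mod.scan=sync line is run as a command. The edit can also be scripted. The following is a sketch under assumptions: add_sync_scan is an invented helper, it expects a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line to exist already, and it is best tried on a copy of the file first.

```shell
#!/bin/sh
# Append scsi_mod.scan=sync inside the quotes of the
# GRUB_CMDLINE_LINUX_DEFAULT line of the file given as $1.
# Hypothetical helper; does nothing if the parameter is already there.
add_sync_scan() {
    file=$1
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*scsi_mod\.scan=sync' "$file"; then
        return 0
    fi
    # Write to a temporary file, then replace the original.
    sed -E 's/^(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*)"/\1 scsi_mod.scan=sync"/' \
        "$file" > "$file.new" && mv "$file.new" "$file"
}

# Example (as root, after making a backup):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_sync_scan /etc/default/grub
#   update-grub
```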

Daniel van Vugt (vanvugt) on 2018-09-17

summary:

- [nvidia] Xorg crashed with signal 7 in _dl_fixup() from
- _dl_runtime_resolve_xsavec() called from nvidia_drv.so
+ [nvidia] Xorg crashed with SIGBUS in _dl_fixup() from
+ _dl_runtime_resolve_xsavec() from create_bits_picture() from
+ image_from_pict_internal() from wfb_image_from_pict()
