bionic desktop does not boot with external monitor attached - [drm:ironlake_crtc_enable [i915]] *ERROR* mode set failed: pipe A stuck / vblank wait timed out on crtc 1

Bug #1751268 reported by Jean-Baptiste Lallement
This bug affects 3 people
Affects          Status      Importance  Assigned to  Milestone
Linux            Won't Fix   Medium
linux (Ubuntu)   Incomplete  High        Unassigned
  Bionic         Incomplete  High        Unassigned

Bug Description

Ubuntu Desktop Bionic up to date

System doesn't boot with an external monitor attached.

There is this message in the journal (full journal attached for this boot)

fbcon: inteldrmfb (fb0) is primary device
[drm:ironlake_crtc_enable [i915]] *ERROR* mode set failed: pipe A stuck
vblank wait timed out on crtc 1
------------[ cut here ]------------
WARNING: CPU: 3 PID: 201 at /build/linux-UKCsxy/linux-4.13.0/drivers/gpu/drm/drm_vblank.c:1090 drm_wait_one_vblank+0x19b/0x1b0 [drm]
Modules linked in: hid_lenovo uas usb_storage rtsx_usb_sdmmc rtsx_usb hid_generic usbhid hid i915 mxm_wmi i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse ahci libahci drm wmi video
CPU: 3 PID: 201 Comm: kworker/u8:6 Tainted: G U 4.13.0-32-generic #35-Ubuntu
Hardware name: ASUSTeK COMPUTER INC. UX32VD/UX32VD, BIOS UX32VD.214 01/29/2013
Workqueue: events_unbound async_run_entry_fn
task: ffff9760d81cae80 task.stack: ffffb01dc1fb8000
RIP: 0010:drm_wait_one_vblank+0x19b/0x1b0 [drm]
RSP: 0018:ffffb01dc1fbb7c8 EFLAGS: 00010282
RAX: 000000000000001f RBX: ffff9760d7080000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000246
RBP: ffffb01dc1fbb828 R08: 000000000000001f R09: 000000000002af84
R10: ffffb01dc1fbb7c8 R11: 000000000000041b R12: 0000000000000001
R13: 0000000000000170 R14: 0000000000000001 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff9760eef80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f156d6d68f0 CR3: 000000024660a002 CR4: 00000000001606e0
Call Trace:
 ? wait_woken+0x80/0x80
 ironlake_crtc_enable+0x477/0xc00 [i915]
 ? gen6_write8+0x190/0x190 [i915]
 intel_update_crtc+0x4b/0xe0 [i915]
 intel_update_crtcs+0x5b/0x80 [i915]
 intel_atomic_commit_tail+0x254/0xf90 [i915]
 ? __schedule+0x293/0x880
 intel_atomic_commit+0x3d5/0x490 [i915]
 ? drm_atomic_check_only+0x37b/0x540 [drm]
 drm_atomic_commit+0x4b/0x50 [drm]
 restore_fbdev_mode+0x15e/0x270 [drm_kms_helper]
 drm_fb_helper_restore_fbdev_mode_unlocked+0x2e/0x80 [drm_kms_helper]
 drm_fb_helper_set_par+0x2d/0x60 [drm_kms_helper]
 intel_fbdev_set_par+0x1a/0x70 [i915]
 fbcon_init+0x484/0x650
 visual_init+0xd6/0x130
 do_bind_con_driver+0x1fc/0x410
 do_take_over_console+0x82/0x1a0
 do_fbcon_takeover+0x5c/0xb0
 fbcon_event_notify+0x587/0x780
 notifier_call_chain+0x4a/0x70
 blocking_notifier_call_chain+0x43/0x60
 fb_notifier_call_chain+0x1b/0x20
 register_framebuffer+0x24d/0x360
 drm_fb_helper_initial_config+0x249/0x400 [drm_kms_helper]
 intel_fbdev_initial_config+0x18/0x30 [i915]
 async_run_entry_fn+0x36/0x150
 process_one_work+0x1e7/0x410
 worker_thread+0x4b/0x420
 kthread+0x125/0x140
 ? process_one_work+0x410/0x410
 ? kthread_create_on_node+0x70/0x70
 ret_from_fork+0x1f/0x30
Code: ff e8 da 3b 4c d6 48 8b 7d a0 48 8d 75 a8 e8 dd 93 50 d6 45 85 ff 0f 85 0c ff ff ff 44 89 e6 48 c7 c7 d8 eb 3d c0 e8 b6 46 52 d6 <0f> ff e9 f6 fe ff ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00
---[ end trace a1db7ccd0b5e4dfc ]---
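
For anyone triaging similar journals, here is a minimal Python sketch that scans log lines for the error signatures quoted in this report. The regular expressions are assumptions derived from the messages in this thread, not an exhaustive list of i915 failure strings, and `scan_log` is a hypothetical helper name.

```python
import re

# Signatures copied from messages quoted in this report; the list is
# illustrative only and is not a complete catalogue of i915 failures.
SIGNATURES = {
    "pipe_stuck": re.compile(r"mode set failed: pipe (\w) stuck"),
    "vblank_timeout": re.compile(r"vblank wait timed out on crtc (\d+)"),
    "flip_done_timeout": re.compile(r"\[CRTC:(\d+):pipe (\w)\] flip_done timed out"),
}


def scan_log(lines):
    """Return (line_number, signature_name, captured_groups) for each match."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                hits.append((number, name, match.groups()))
    return hits


# The first three lines of the journal excerpt above.
sample = [
    "fbcon: inteldrmfb (fb0) is primary device",
    "[drm:ironlake_crtc_enable [i915]] *ERROR* mode set failed: pipe A stuck",
    "vblank wait timed out on crtc 1",
]

for hit in scan_log(sample):
    print(hit)
```

The same function can be fed the lines of a saved journal or dmesg file, for example `scan_log(open(path))` with a path of your choosing.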

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-10-generic 4.15.0-10.11
ProcVersionSignature: Ubuntu 4.15.0-10.11-generic 4.15.3
Uname: Linux 4.15.0-10-generic x86_64
ApportVersion: 2.20.8-0ubuntu10
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC2: j-lallement 3671 F.... pulseaudio
 /dev/snd/controlC0: j-lallement 3671 F.... pulseaudio
 /dev/snd/controlC1: j-lallement 3671 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
Date: Fri Feb 23 13:58:39 2018
InstallationDate: Installed on 2013-09-03 (1633 days ago)
InstallationMedia: Ubuntu 13.10 "Saucy Salamander" - Alpha amd64 (20130902)
MachineType: ASUSTeK COMPUTER INC. UX32VD
ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-10-generic root=UUID=1004226d-a9db-46c7-bd28-eca0806c12f2 ro pcie_aspm=force drm.vblankoffdelay=1 i915.semaphores=1 init=/lib/systemd/systemd-bootchart
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-10-generic N/A
 linux-backports-modules-4.15.0-10-generic N/A
 linux-firmware 1.171
SourcePackage: linux
UpgradeStatus: Upgraded to bionic on 2018-01-26 (27 days ago)
dmi.bios.date: 01/29/2013
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: UX32VD.214
dmi.board.asset.tag: ATN12345678901234567
dmi.board.name: UX32VD
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: 1.0
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: ASUSTeK COMPUTER INC.
dmi.chassis.version: 1.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvrUX32VD.214:bd01/29/2013:svnASUSTeKCOMPUTERINC.:pnUX32VD:pvr1.0:rvnASUSTeKCOMPUTERINC.:rnUX32VD:rvr1.0:cvnASUSTeKCOMPUTERINC.:ct10:cvr1.0:
dmi.product.family: UX
dmi.product.name: UX32VD
dmi.product.version: 1.0
dmi.sys.vendor: ASUSTeK COMPUTER INC.

Revision history for this message
Jean-Baptiste Lallement (jibel) wrote :
summary: bionic desktop does not boot with external monitor attached -
[drm:ironlake_crtc_enable [i915]] *ERROR* mode set failed: pipe A stuck
- vblank wait timed out on crtc 1
+ / vblank wait timed out on crtc 1
tags: added: rls-bb-incoming
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.16 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16-rc3

Changed in linux (Ubuntu):
importance: Undecided → Medium
importance: Medium → High
tags: added: kernel-key
Revision history for this message
Jean-Baptiste Lallement (jibel) wrote :

4.13.0-32.35: Boots
4.15.0-10.11: Does not boot
4.16.0-041600rc3.201802261351: Does not boot

tags: added: kernel-bug-exists-upstream
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

We can perform a kernel bisect to identify the commit that introduced this regression. We first need to identify the last kernel that did not have this bug and the first that did.

Can you next test the following kernels:

v4.14-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc1/
v4.14 Final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14/
v4.15-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc1/

You don't have to test all of them. Just until you hit the first version that exhibits the bug.
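
The "last good, first bad" bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: `bisect_bounds` is a hypothetical helper, and it assumes the results are listed oldest to newest and that the regression does not come and go between versions.

```python
def bisect_bounds(results):
    """Find the bisect starting points from ordered test results.

    results: list of (version, boots) pairs, ordered oldest to newest.
    Returns (last_good, first_bad). Either side is None if it was
    not observed before or at the first failing kernel.
    """
    last_good = None
    for version, boots in results:
        if boots:
            last_good = version
        else:
            return last_good, version
    return last_good, None


# Results as reported earlier in this bug.
results = [
    ("4.13.0-32.35", True),
    ("4.15.0-10.11", False),
    ("4.16.0-041600rc3.201802261351", False),
]

print(bisect_bounds(results))
```

Each additional kernel tested from the list above would be added to `results` in version order, which narrows the range a commit-level bisect has to cover.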

Changed in linux (Ubuntu Bionic):
assignee: nobody → Joseph Salisbury (jsalisbury)
status: Confirmed → In Progress
Changed in linux (Ubuntu Bionic):
status: In Progress → Incomplete
tags: added: kernel-da-key
removed: kernel-key
Revision history for this message
In , Ahabig (ahabig) wrote :

Created attachment 138431
A photo of some of the kernel spew.

On my Fedora 27 Thinkpad T430, I have been unable to boot any of the 4.15.x kernels (4.14.18 and previous work just fine). The kernel crashes at the initial kms only 3s into the boot, before any logging to disk happens, making this one hard to provide information on. Adding "nomodeset" to the boot lets the boot process go further: but as soon as a buffer flush tries to write to the system disk, it hangs (so again, no syslog info beyond a string of ^@'s). kdump doesn't help: that needs to write to disk, and this bug seems to be taking out that ability as it flails.

Complicating matters further is that the screen's backlight is turned off at the start of the kms (in a normal boot it comes back on later), rendering console messages hard to read. Luckily, I found out that with an external monitor, it will eventually (a minute or two later) light up, resulting in the attached screenshot which points to the i915 driver.

The initial bug was posted at Fedora's bugzilla:

  https://bugzilla.redhat.com/show_bug.cgi?id=1557416

and they redirected me here (delayed by travel, sorry).

I see this Ubuntu bug, which shares a similar backtrace:

  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1751268

although my crash is not dependent on an external monitor.

Another similar backtrace is on your site here:

  https://bugs.freedesktop.org/show_bug.cgi?id=104573

but that one doesn't seem fatal.

Ideas? bios update to the laptop didn't change anything. xorg update didn't change anything.

F27 is now on xorg-x11-drv-intel-2.99.917-31-20171025. All 4.15.x kernels they've put out do the same thing, the latest being 4.15.12-301

Thanks in advance - this has been bugging me for months, and all my usual debugging techniques have been stymied!

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :
Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Revision history for this message
In , Jani-saarinen-g (jani-saarinen-g) wrote :

Hi, can you try with the latest drm-tip (https://cgit.freedesktop.org/drm-tip), add the drm.debug=14 module parameter, and attach the dmesg from boot up to the point where the problem reproduces? Thanks.
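
For readers wondering what the value 14 selects: drm.debug is a bitmask of debug categories. The sketch below decodes such a mask in Python. The bit-to-name table is an assumption based on the drm module's parameter description as commonly documented for kernels of this era, so check it against the kernel you are running; `decode_drm_debug` is a hypothetical helper.

```python
# Assumed category bits for the drm.debug module parameter; verify
# against the documentation of the kernel actually being tested.
DRM_DEBUG_BITS = [
    (0x01, "CORE"),
    (0x02, "DRIVER"),
    (0x04, "KMS"),
    (0x08, "PRIME"),
    (0x10, "ATOMIC"),
    (0x20, "VBL"),
]


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled in mask."""
    return [name for bit, name in DRM_DEBUG_BITS if mask & bit]


print(decode_drm_debug(14))
print(decode_drm_debug(0x1E))
```

Under that assumed table, 14 enables the driver, KMS and PRIME categories, and the 0x1e value requested later in this thread additionally enables atomic modesetting messages.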

Revision history for this message
In , Ahabig (ahabig) wrote :

Will try. Am not optimistic we'll get much, however, as the crash occurs before syslogd can write anything.

Revision history for this message
In , Ahabig (ahabig) wrote :

Created attachment 138444
dmesg from initial boot into single user mode

Revision history for this message
In , Ahabig (ahabig) wrote :

Created attachment 138445
dmesg from boot into runlevel 3 after ntfs filesystems removed

Revision history for this message
In , Ahabig (ahabig) wrote :

Created attachment 138446
dmesg including logging after X started

same boot as dmesg-booted.tx, but with some additional logging after X was started.

Revision history for this message
In , Ahabig (ahabig) wrote :

ok, boot success! First try fell into single user mode due to ntfs filesystems the drm-tip kernel didn't like (the first dmesg attachment).

Commenting those out from fstab got me into runlevel three (the second dmesg attachment). X started just fine (I command line that using startx), additional debugging logs are appended (the 3rd dmesg file).

Note that it didn't crash, so these logs, as happy as I am to get them, don't illuminate the problem (although they surely have lots of info about the intel chip in the system with the problem!). Potential reasons include: Maybe 4.16 fixed the problem; or maybe the drm-tip kernel doesn't include some offending bit of code running in the Fedora kernel (eg, it obviously doesn't have ntfs support enabled).

Revision history for this message
In , Ahabig (ahabig) wrote :

Narrowed down the possibilities: just fished the 4.16.0-rc7 rpms for f28 out of koji: same crash, same lack of any entries into /var/log/messages whilst doing so. So, the problem isn't fixed in the newer kernel: it then must be something that's in the Fedora kernel and not in the drm-tip kernel.

Ideas for how to narrow this down?

Revision history for this message
In , Jani-saarinen-g (jani-saarinen-g) wrote :

Jani, any advice here?

Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

[ 3.562059] bbswitch: loading out-of-tree module taints kernel.
[ 3.562266] bbswitch: version 0.8
[ 3.562272] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.VID_
[ 3.562278] bbswitch: Found discrete VGA device 0000:01:00.0: \_SB_.PCI0.PEG_.VID_
[ 3.562286] ACPI Warning: \_SB.PCI0.PEG.VID._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20180105/nsarguments-100)
[ 3.562634] bbswitch: detected an Optimus _DSM function
[ 3.562644] pci 0000:01:00.0: enabling device (0000 -> 0003)
[ 3.562676] bbswitch: Succesfully loaded. Discrete card 0000:01:00.0 is on

Can you disable the discrete card and run without out-of-tree stuff?

Revision history for this message
In , Ahabig (ahabig) wrote :

Good idea Jani!

Removed bumblebee and the proprietary nvidia driver. This didn't fix things, but it did allow me to go further before things locked up. Including actual system logging!

Attached a dmesg output I got before things fell over. Upon booting, it seems the nouveau driver kms fails, then it falls over to i915. You can see nouveau initializing around 2.2 seconds in the log, then timing out at around 16 seconds. The first actual kernel crashes come later, starting with ieee80211 complaints. I was able to dmesg to a logfile and scp it away before things really unraveled. Hitting ctl-alt-del (all on console) induces a hard hang.

After blacklisting nouveau and rebuilding the initfs, it's back to the original problem of not even booting happily. Disabling the discrete card in bios does not change this behavior. The tail of its problems again shows up eventually on my second monitor, looking similar to the initial attachments where drm_atomic_helper is timing out. Wait... as I type this it eventually got to single user mode, second dmesg dump retrieved before it hung (again on ctl-alt-del). But, dumping that to a text file didn't make it to disk, the crash didn't flush the disk write buffer :(

Revision history for this message
In , Ahabig (ahabig) wrote :

Created attachment 139108
dmesg from boot with nouveau instead of proprietary driver.

Revision history for this message
In , Ahabig (ahabig) wrote :

Slight correction - my first attempt to boot with discrete graphics enabled was foiled by there being a second bios setting that tries to autoset the graphics cards, despite me having just manually set it. REALLY forcing discrete graphics off got me back to the exact same "hang, doesn't write to syslog, but I could take a photo of it" problem as started this thread. Same kernel dump about ironlake and drm.

Revision history for this message
In , Ahabig (ahabig) wrote :

... and, to close the loop: 4.14.18 continues to work great (well, without bumblebee now, because it's uninstalled), as does windows 7 (dual-booted).

Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

(In reply to Alec Habig from comment #11)
> Created attachment 139108 [details]
> dmesg from boot with nouveau instead of proprietary driver.

Please add drm.debug=14 module parameter to get detailed debugging, and get the dmesg again. The attached dmesg doesn't seem to have anything out of the ordinary in graphics, the only warns come from wifi.

Changed in linux:
status: Confirmed → Incomplete
Revision history for this message
In , Ahabig (ahabig) wrote :

ok. First, a note that yesterday and today's test were with kernel-4.15.17-300.fc27.x86_64

Couldn't get past a few seconds into the boot (the original failure mode) till I remembered that I'd taken the nouveau modules out of the initfs yesterday. Put them back, booted with drm.debug=14.

It certainly seems that without nouveau taking the hit for whatever goes wrong in the initial kms, things are toast. nouveau times out, hands off to i915, which finishes the kms. However, if there's no nouveau in the kernel, i915 fails the initial mode switch entirely.

Anyway, this boot went clear to runlevel 3 with no errors this time, so I logged on console as root, saved a dmesg (attached). Then exit'd from root's bash.... and things hung. A bit later these errors came to console:

Apr 26 14:47:24 entropy kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:37:pipe A] flip_done timed out
Apr 26 14:47:24 entropy kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:pipe B] flip_done timed out

after which I leaned on the power switch to shut down.

Since it didn't go down cleanly, "journalctl -b -1" can't count its instances correctly. But, things did go to /var/log/messages this time, so the complete /var/log/messages for that session is also attached. To sync the two different time series, the dmesg ends around 14:44:44 in messages.
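
Lining up the two logs by hand gets tedious, so here is a minimal Python sketch of the idea: dmesg stamps are seconds since boot, while /var/log/messages uses wall-clock time, and one known matching pair is enough to convert between them. The anchor uptime of 200.0 seconds and the year 2018 are hypothetical values chosen for illustration; only the 14:44:44 wall-clock time comes from the comment above.

```python
from datetime import datetime, timedelta


def make_converter(anchor_uptime, anchor_wallclock):
    """Build a function mapping dmesg uptime seconds to wall-clock time.

    anchor_uptime: dmesg timestamp (seconds since boot) of a known line.
    anchor_wallclock: wall-clock time of that same line in the syslog.
    """
    boot_time = anchor_wallclock - timedelta(seconds=anchor_uptime)

    def to_wallclock(uptime):
        return boot_time + timedelta(seconds=uptime)

    return to_wallclock


# Hypothetical anchor: assume the last dmesg line carries the stamp
# [200.000000]; the comment says it matches 14:44:44 in messages.
# The year is an assumption, since syslog lines do not record it.
to_wallclock = make_converter(200.0, datetime(2018, 4, 26, 14, 44, 44))

print(to_wallclock(3.5))
```

This ignores clock adjustments made during the session (for example an NTP step), so treat the converted times as approximate.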

Revision history for this message
In , Ahabig (ahabig) wrote :

Created attachment 139146
dmesg from drm.debug=14 clean boot (which then hung after this log was copied...)

Revision history for this message
In , Ahabig (ahabig) wrote :

Created attachment 139147
/var/log/messages from that corresponding session, including an unfortunately sparse record of the actual hang. Last line in the dmesg corresponds to 14:44:44 in this full log.

Revision history for this message
In , Jani-saarinen-g (jani-saarinen-g) wrote :

There are similar issues seen on many bugs eg. on IVB: https://bugs.freedesktop.org/show_bug.cgi?id=104573.
Ville, any idea for those?

Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

(In reply to Jani Saarinen from comment #18)
> There are similar issues seen on many bugs eg. on IVB:
> https://bugs.freedesktop.org/show_bug.cgi?id=104573.
> Ville, any idea for those?

Nothing similar about that AFAICT.

Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

I also don't see an actual crash or hang here anyway. Sure, the display may be toast, but per the logs systemd gracefully shuts down on power button press. Can you ssh into the machine?

Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

Finally, the Optimus stuff has always been a pain. Life suddenly feels too short when debugging issues with it. I know it sucks for you as a user, but please understand that it's really hard for us too. We only have control over one piece of the puzzle, without many clues about the whole.

Revision history for this message
In , Ahabig (ahabig) wrote :

sshing in let me continue to work on shell in the ssh session, after the console hung upon exiting. dmesg reports continued "flip timeout" errors as kms flails.

"shutdown -r now" at this point kicks out the ssh session promptly as expected, but takes a few minutes to complete; it seems to be waiting for the kms errors to finish. This clean reboot did allow me to get a journalctl log of the whole process (will attach that).

I understand that optimus is a nasty mess. But, frustratingly, things work just fine with the fedora 4.14.* series, and just fine with your debugging tree kernel. As a user I suppose I can always go back to the bad old days of hand installing kernels rather than just loading them from the yum repository. Last time I did that regularly was going on 20 years ago :)

The fact that the errors happen with either fedora's 4.15.* tree or their 4.16 devel tree is telling us that something in the deltas they apply to their kernel build is the source of the problem. I suppose at this point, I go back to the original fedora bugzilla and put the ball back in their court? bisecting their configs and patches sounds like the next (tedious) step.

Before I go, though: what are these "flip timeout" errors? If I had to guess, any number of things horribly wrong in there could make the i915 module get into a funny state that eventually throws this error: but any insight as I move forward would be appreciated.

Revision history for this message
In , Ahabig (ahabig) wrote :

Created attachment 139175
full journalctl -b -1 -l dump obtained after being able to issue a clean reboot via ssh.

Revision history for this message
In , Ahabig (ahabig) wrote :

I just asked the Fedora bugzilla about what patches they're putting on to their rpms, and they suggested:

  It's probably the opposite, there is something in the Free Desktop tree that
  hasn't yet made it into Linus' tree. If you can narrow down which patch it is
  we can bring it into Fedora.

Revision history for this message
In , Jani-saarinen-g (jani-saarinen-g) wrote :

Alec, any updates from you on this?

Revision history for this message
In , Ahabig (ahabig) wrote :

No update since the Fedora suggestion above.

How different is your development tree from the main kernel release?

Revision history for this message
In , Jani-saarinen-g (jani-saarinen-g) wrote :
Revision history for this message
In , Ahabig (ahabig) wrote :

Interesting development here. I decided to try the latest drm-tip (4.17.0-rc4) and see what happens. It died: but looking at it harder, it seems my working drm-tip (4.16.0-rc7) had a much simpler .config file. I think the difference is I compiled the new one as root, which picked up all the config settings from my currently running fedora 4.14 kernel, while doing it as a normal user for 4.16 did not.

Anyway - copying over the 4.16 .config into the 4.17 build area, and being conservative with all the new config options the 4.17 build presents, it does boot again: although it does spend 30+ seconds at the boot flailing: before succeeding rather than failing.

So: something in fedora's kernel config is triggering my problem. So I can try bisecting kernel options till I find the specific troublemaker. There are a lot of them though.

But: it's not just that. Hopefully the additional debugging info in the 4.17 dmesg (attached) offers clues. You can see the drm module having problems: I do not know the severity of the problems, just that it did the same "spend 30-odd seconds early in the boot timing out" problem as it does when it fails, even though it then succeeds (with a simpler .config).
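
A first step toward narrowing down the config differences is simply listing them. Below is a minimal Python sketch, assuming the usual kernel .config line forms (`CONFIG_FOO=value` and `# CONFIG_FOO is not set`). The sample contents are made up for illustration and are not the actual Fedora or drm-tip configs; `parse_config` and `diff_configs` are hypothetical helper names.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Parse kernel .config text into {option: value}; unset maps to 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(good, bad):
    """Return {option: (good_value, bad_value)} for options that differ.

    An option missing from one file is treated as 'n', which is a
    simplification: it may just be hidden by an unmet dependency.
    """
    names = sorted(set(good) | set(bad))
    return {
        name: (good.get(name, "n"), bad.get(name, "n"))
        for name in names
        if good.get(name, "n") != bad.get(name, "n")
    }


# Made-up sample contents, for illustration only.
good_text = "CONFIG_DRM_I915=m\n# CONFIG_DRM_DEBUG_SELFTEST is not set\n"
bad_text = "CONFIG_DRM_I915=m\nCONFIG_DRM_DEBUG_SELFTEST=m\nCONFIG_NTFS_FS=m\n"

for name, values in diff_configs(parse_config(good_text), parse_config(bad_text)).items():
    print(name, values)
```

On real files, pass the contents of the working and the failing .config; the resulting list is the search space for any option-level bisect.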

Revision history for this message
In , Ahabig (ahabig) wrote :

Created attachment 139469
dmesg from a successful boot with drm-tip 4.17.0-rc4, and a simpler .config.

Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

Per logs you have selftests enabled:

[ 0.377402] drm_mm: Testing DRM range manger (struct drm_mm), with random_seed=0x9292e37 max_iterations=8192 max_prime=128
[ 0.377756] drm_mm: igt_sanitycheck - ok!
[ 0.377962] igt_debug 0x0000000000000000-0x0000000000000200: 512: free
[ 0.377963] igt_debug 0x0000000000000200-0x0000000000000600: 1024: used
[ 0.377963] igt_debug 0x0000000000000600-0x0000000000000a00: 1024: free
[ 0.377964] igt_debug 0x0000000000000a00-0x0000000000000e00: 1024: used
[ 0.377965] igt_debug 0x0000000000000e00-0x0000000000001000: 512: free
[ 0.377966] igt_debug total: 4096, used 2048 free 2048

Use

CONFIG_DRM_DEBUG_SELFTEST=n
CONFIG_DRM_I915_DEBUG=n

Revision history for this message
In , Ahabig (ahabig) wrote :

Yes, that self-test thing was the delay early in boot with the simply-configured drm-tip kernel. Thanks!

Working on testing various kernel configs to see which combination is the lethal one. This will take a while... (although less now that it boots faster :)

Revision history for this message
In , Btmckee9 (btmckee9) wrote :

I don't know if this will help, but I have an old Centrino i5 mobile running 4.18.1 as a media pc and if I connect an external monitor and boot I get a hard crash of the PC. I managed to get a complete dmesg just before the hang.

This PC is intel only, not dual video.

https://pastebin.com/raw/g5n2J90k

I'm using this PC at the moment, on the external monitor which works fine if I plug it in after X has started.

I'll try to keep tabs on this thread.

This is a Gentoo install.

Revision history for this message
In , Lakshminarayana-vudum (lakshminarayana-vudum) wrote :

Alec, do you still have the boot failure?

Revision history for this message
In , Ahabig (ahabig) wrote :

I do: even with the current Fedora Kernel (4.17.19) and the latest Lenovo BIOS (which got updated recently, presumably for spectre mitigation). Just verified this when booting this morning (accidentally chose the wrong kernel in grub).

I have made little progress on narrowing down which kernel features cause the problem when enabled. Your vanilla testing kernel works, the same version of the fedora configured kernel does not. But there's a lot of parameter space to search, each "tweak/compile/boot/fail" iteration takes an hour, and this is the laptop I use for actual work. So slow progress :(

Revision history for this message
In , Ahabig (ahabig) wrote :

So since last week, kernel-4.18.7-100.fc27.x86_64 came out on the Fedora updates repo. Tried it just now in the process of looking at the next batch of options differences... and it didn't crash! And all the other drivers I need on this laptop are enabled (which wasn't the case with your drm-tip kernel).

Typing at you now in an X session while running the new kernel and its video modules.

Sometimes bigger numbers really are better :)

Will go read release notes and changelogs to see if there are clues to what might have changed (for the good this time), and update this bug for future reference if there's news.

Revision history for this message
In , Ahabig (ahabig) wrote :

ok, all's not wine and roses. When exiting X, it goes into the same style lockup as used to happen on boot. Which includes crashing so hard that often times (but not all the time) the BIOS can't find the boot disk after the next power cycle.

At least the bug moving further back in the chain means I can do some work, and investigate a running system. Thank goodness I didn't uninstall the (still happily working) kernel-4.14.18-300.fc27.x86_64

Revision history for this message
In , Lakshminarayana-vudum (lakshminarayana-vudum) wrote :

(In reply to Alec Habig from comment #36)
> ok, all's not wine and roses. When exiting X, it goes into the same style
> lockup as used to happen on boot. Which includes crashing so hard that
> often times (but not all the time) the BIOS can't find the boot disk after
> the next power cycle.
>
> At least the bug moving further back in the chain means I can do some work,
> and investigate a running system. Thank goodness I didn't uninstall the
> (still happily working) kernel-4.14.18-300.fc27.x86_64

Alec, sorry for the delay...
Do you have the dmesg from boot? Attached logs are old. Have you verified with latest drm-tip? Please attach the latest dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M.
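
Before capturing a dmesg it is worth confirming that the requested parameters actually reached the kernel. A minimal Python sketch, assuming a simple space-separated command line without quoted values; the sample command line is hypothetical, and `parse_cmdline` and `missing_params` are illustrative helper names.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline, wanted):
    """Return the wanted parameters that are absent or set to another value."""
    params = parse_cmdline(cmdline)
    return {key: value for key, value in wanted.items() if params.get(key) != value}


# Parameters requested in the comment above.
wanted = {"drm.debug": "0x1e", "log_buf_len": "4M"}

# Hypothetical command line for illustration.
cmdline = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro drm.debug=14"

print(missing_params(cmdline, wanted))
```

On a running system the string would come from /proc/cmdline. Values are compared as text, so `drm.debug=30` would be reported as a mismatch even though it is numerically equal to 0x1e.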

Revision history for this message
In , Ahabig (ahabig) wrote :

Sorry for the long silence. Going has been slow: it takes a while to do a tweak/test/reboot/hang cycle. Especially since the disk controller gets munged upon the bug happening, even persisting across a power cycle (!) about 50% of the time (I have no idea how such a thing is even possible). And, this device is my "get work done" laptop, using the 4.14 kernel which is perfectly happy.

Anyway: tried a different approach. Threw in a spare disk, and tried a clean install of F29 and Kubuntu 18.04.1, in case cruft was to blame.

F29 installed from dvd... but the kernel installed to disk didn't boot once the first install pass finished.
Kubuntu's install dvd wouldn't even boot past the spinny "it's starting to boot" graphics screen.

In both cases, since things are "graphical boot" and "quiet", there's no feedback as to exactly what went wrong, and the system hangs too hard to switch to a different vtty, but it felt the same as the hangs described above.

Trying with the latest/last F27 kernel (yes, it's EOL now) 4.18.19-100, tried all permutations of integrated, discrete, and optimus bios settings, excluding nouveau drivers, and the nvidia proprietary blob. Sometimes it can get a clean boot - but then the sata controller goes out to lunch as soon as a write cache flush happens. Which makes me think the kms problem which started this thread is a symptom rather than the problem, just the one which usually triggers first. This is consistent with the problems getting log traces of the problems described above, because a lunched sata controller can't log errors.

Went back to drm-tip. See that it's kernel 5.0 now, cool.

This minimally configured kernel continues to work. I've enabled the extra features needed to run the laptop, no problems.

So: the bug is at the very least triggered by one of the (myriad) of enabled kernel options in the distro stock kernels. It feels to me like the old days of unprotected flat memory space where you could POKE random values into random addresses and watch the system fall apart: with the initial kms call being the most sensitive to it, and things unravel into thrashing the disk controller.

Parameter space is too vast for me to find the culprit with intermittent effort and a logging system that's often the first victim of the bug. So, I'm ready to punt, documenting this here in case someone else with more clues googles it is the only remaining thing I can do.

Time to just return to the 1990's and compile my own kernel :( At least git now makes tracking updates easier than it used to be in the Bad Old Days.

Changed in linux (Ubuntu):
assignee: Joseph Salisbury (jsalisbury) → nobody
Changed in linux (Ubuntu Bionic):
assignee: Joseph Salisbury (jsalisbury) → nobody
Revision history for this message
Jonathan Hirschman (jonathan-l) wrote :

Not sure if helpful, but I had similar issue - machine would not boot if (only) using external monitors. My computer is a Dell Latitude 5590; it had previously booted, upon power up, with the machine closed, thus using only external monitors.

In order to diagnose, I started to roll back updates, and struck lucky the first time: I rolled back my BIOS (which had been updated by Ubuntu), and then it all Just Worked like it used to.

Not sure if related, but hopefully helpful.

Revision history for this message
In , Ahabig (ahabig) wrote :

Just accidentally booted into fedora's 4.20.13-200.fc29.x86_64 (have been happily running the drm-tip)... and it worked!

Will try reading changelogs to get some clues.

Revision history for this message
In , Ahabig (ahabig) wrote :

nuts. Going to 4.20.15-200.fc29.x86_64 failed again... and backing up to 4.20.13-200 also failed. Tricked by a race condition into thinking that that version actually worked? Bad interaction with some other update applied since? Dunno.

drm-tip (now at 5.1-rc3) continues to work.

Fedora 29 is now at 5.0.5-200... yep, same problems as always.

Nothing in the changelogs lights me up as an obvious thing, but there's a lot there and I'm not a kernel expert.

At least I can compile my own from drm-tip's clean tree and get work done.

Revision history for this message
In , Jani-nikula (jani-nikula) wrote :

drm-tip works, nothing in dmesg to indicate failure. Pretty much the only way to get progress is to get a bisect result.

Revision history for this message
In , Lakshminarayana-vudum (lakshminarayana-vudum) wrote :

Alec, any updates here? Did you get any chance to bisect the results?
Priority is updated as the issue isn't in drm-tip.

Revision history for this message
In , Ahabig (ahabig) wrote :

Bisecting is a manual process, since it's some option (or combination thereof) enabled in Fedora's .config that's the problem rather than a bad commit which git can help with.
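
The manual option bisect described here can at least be organised as a halving search. The Python sketch below is an illustration under strong assumptions: exactly one option is responsible, options can be toggled independently, and `boots` is a hypothetical callback standing in for the build, boot and observe cycle. If a combination of options is to blame, as the comment allows, this simple search can give a misleading answer.

```python
def find_culprit(candidates, boots):
    """Halving search for a single option that breaks boot.

    candidates: option names enabled only in the failing config.
    boots(enabled): hypothetical callback; returns True if a kernel built
        from the working config plus the options in `enabled` boots.
    Assumes exactly one culprit and independent options.
    """
    candidates = list(candidates)
    if not candidates:
        raise ValueError("no candidate options to search")
    while len(candidates) > 1:
        middle = len(candidates) // 2
        first_half = candidates[:middle]
        if boots(first_half):
            # First half is harmless, so the culprit is in the rest.
            candidates = candidates[middle:]
        else:
            candidates = first_half
    return candidates[0]


# Simulated run: option names and the culprit are invented for the demo.
options = ["CONFIG_A", "CONFIG_B", "CONFIG_C", "CONFIG_D", "CONFIG_E"]


def simulated_boots(enabled):
    return "CONFIG_D" not in enabled


print(find_culprit(options, simulated_boots))
```

The number of builds grows with the base-2 logarithm of the candidate count, so a hypothetical 1000 differing options would need about 10 build and boot cycles rather than 1000.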

You're right in that it's not drm-tip itself. In fact, while the original weirdness manifested itself in drm, it also trashed the disk controller. In the last couple stock fedora kernels (5.1.6 and 5.1.8), the problem has shifted to the keyboard controller (or maybe it's usb, haven't disentangled it). So something in there is stomping on things it shouldn't, and for the longest time, that was the video driver.

Since I'm now pretty certain it's not your bug, closing this one is fine: I'll keep poking at it till I solve it, or until I finally upgrade this T430, whichever comes first :)

Revision history for this message
In , Lakshminarayana-vudum (lakshminarayana-vudum) wrote :

Closing this issue as NOTOURBUG.

Changed in linux:
status: Incomplete → Won't Fix
Brad Figg (brad-figg)
tags: added: cscc