HP EliteBook 745 G5 (Ryzen 2500U) fails to boot unless `mce=off` is set on command line

Bug #1796443 reported by John Clemens on 2018-10-06
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
Linux | Fix Released | Medium | |
linux (Ubuntu) | | Medium | Unassigned |
Bionic | | Medium | Unassigned |
Cosmic | | Undecided | Unassigned |
Disco | | Medium | Unassigned |

Bug Description

=== SRU Justification ===
[Impact]
System doesn't boot without "mce=off".

[Fix]
Quote from the commit log:
"Clear the "Counter Present" bit in the Instruction Fetch bank's
MCA_MISC0 register. This will prevent enabling MCA thresholding on this
bank which will prevent the high interrupt rate due to this error."
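
As a rough illustration of what the quoted fix does at the bit level, the sketch below clears a single flag in a 64-bit register value. It assumes that "Counter Present" is bit 62 and "Valid" is bit 63 of MCA_MISC0; the constant and function names are made up for this example. The real fix operates on the MSR inside the kernel, not on a Python integer.

```python
# Assumption: "Counter Present" is bit 62 and "Valid" is bit 63 of MCA_MISC0.
CNTP_BIT = 62   # hypothetical constant name, for illustration only
VALID_BIT = 63  # hypothetical constant name, for illustration only
MASK_64 = 0xFFFFFFFFFFFFFFFF


def clear_counter_present(misc0: int) -> int:
    """Return the 64-bit register value with the Counter Present bit cleared."""
    return misc0 & ~(1 << CNTP_BIT) & MASK_64


def thresholding_possible(misc0: int) -> bool:
    """Model of the check: no counter present means thresholding is not set up."""
    return bool(misc0 & (1 << CNTP_BIT))


# Example register value with both Valid and Counter Present set.
before = (1 << VALID_BIT) | (1 << CNTP_BIT)
after = clear_counter_present(before)
print(hex(before), thresholding_possible(before))
print(hex(after), thresholding_possible(after))
```

Every other bit is left untouched, so the bank still reports as valid; only the thresholding counter is hidden.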

[Test]
The affected user reported these commits fix the issue.

[Regression Potential]
Low. Upstream stable commits. I don't see any regression on my
unaffected AMD systems.

=== Original Bug Report ===
My new EliteBook, running the latest BIOS (1.03.01), refuses to boot any kernel from 4.10 onward unless mce=off is appended to the kernel command line. There are no kernel messages at all after GRUB, even with quiet and splash removed from the command line. Perhaps it crashes before efifb kicks in?

The system operates fine if mce=off is added to the kernel command line. (I also add iommu=soft, but that is a separate issue; without it the boot fails too, but with visible kernel output.)
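
For anyone who wants to confirm that the workaround is actually active on a booted system, here is a minimal sketch that parses a kernel command line and looks for mce=off. The function names are made up for this example, and the parser is deliberately naive (it does not handle quoted values containing spaces).

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def has_workaround(cmdline: str) -> bool:
    """True if the command line disables machine check handling via mce=off."""
    return parse_cmdline(cmdline).get("mce") == "off"


# Command line taken from the ProcKernelCmdLine field of this report.
example = (
    "BOOT_IMAGE=/boot/vmlinuz-4.18.0-8-generic "
    "root=UUID=5cf73665-d2a3-4203-80fd-659faf1afea4 "
    "ro quiet splash iommu=soft mce=off"
)
print(has_workaround(example))
```

On a running Linux system the same check can be applied to the contents of /proc/cmdline.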

I opened an upstream bug here: https://bugzilla.kernel.org/show_bug.cgi?id=201291

I bisected the problem down to this commit (and the few before it, which also added extra MCE output, but didn't actually crash):

    18807ddb7f88d4ac3797302bafb18143d573e66f is the first bad commit
    commit 18807ddb7f88d4ac3797302bafb18143d573e66f
    Author: Yazen Ghannam <email address hidden>
    Date: Tue Nov 15 15:13:53 2016 -0600

    x86/mce/AMD: Reset Threshold Limit after logging error

    The error count field in MCA_MISC does not get reset by hardware when the
    threshold has been reached. Software is expected to reset it. Currently,
    the threshold limit only gets reset during init or when a user writes to
    sysfs.

    If the user is not monitoring threshold interrupts and resetting
    the limit then the user will only see 1 interrupt when the limit is first
    hit. So if, for example, the limit is set to 10 then only 1 interrupt will
    be recorded after 10 errors even if 100 errors have occurred. The user may
    then assume that only 10 errors have occurred.
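
The behaviour described in that commit message can be modelled with a toy counter. This is only a simplified sketch: the real error counter is a hardware field in MCA_MISC, and the function name and the plain integer counter here are assumptions made for illustration.

```python
def count_interrupts(errors: int, limit: int, reset_after_interrupt: bool) -> int:
    """Count threshold interrupts raised while `errors` errors arrive.

    The counter fires once when it reaches `limit`. It is never reset by
    "hardware" in this model, so without a software reset it stays at the
    limit and no further interrupts are raised.
    """
    count = 0
    interrupts = 0
    for _ in range(errors):
        if count < limit:
            count += 1
            if count == limit:
                interrupts += 1
                if reset_after_interrupt:
                    count = 0  # the software reset introduced by the commit
    return interrupts


# The example from the commit message: limit of 10, 100 errors.
print(count_interrupts(100, 10, reset_after_interrupt=False))
print(count_interrupts(100, 10, reset_after_interrupt=True))
```

With the reset in place, the interrupt count scales with the error rate, which fits the later observation in this report that the thresholding interrupts arrive very frequently on affected CPUs.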

There are threads online suggesting this is related to the latest BIOS. The upstream bug has an acpidump attached.

ProblemType: Bug
DistroRelease: Ubuntu 18.10
Package: linux-image-4.18.0-8-generic 4.18.0-8.9
ProcVersionSignature: Ubuntu 4.18.0-8.9-generic 4.18.7
Uname: Linux 4.18.0-8-generic x86_64
ApportVersion: 2.20.10-0ubuntu11
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: john 2015 F.... pulseaudio
 /dev/snd/pcmC1D0p: john 2015 F...m pulseaudio
 /dev/snd/controlC0: john 2015 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
Date: Fri Oct 5 23:24:45 2018
InstallationDate: Installed on 2018-09-30 (5 days ago)
InstallationMedia: Ubuntu 18.10 "Cosmic Cuttlefish" - Beta amd64 (20180927)
Lsusb:
 Bus 005 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 004 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 003 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: HP HP EliteBook 745 G5
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.18.0-8-generic root=UUID=5cf73665-d2a3-4203-80fd-659faf1afea4 ro quiet splash iommu=soft mce=off
RelatedPackageVersions:
 linux-restricted-modules-4.18.0-8-generic N/A
 linux-backports-modules-4.18.0-8-generic N/A
 linux-firmware 1.175
RfKill:
 1: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
StagingDrivers: r8822be
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 07/26/2018
dmi.bios.vendor: HP
dmi.bios.version: Q81 Ver. 01.03.01
dmi.board.name: 83D5
dmi.board.vendor: HP
dmi.board.version: KBC Version 08.47.00
dmi.chassis.asset.tag: 5CG838305Y
dmi.chassis.type: 10
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrQ81Ver.01.03.01:bd07/26/2018:svnHP:pnHPEliteBook745G5:pvr:rvnHP:rn83D5:rvrKBCVersion08.47.00:cvnHP:ct10:cvr:
dmi.product.family: 103C_5336AN HP EliteBook
dmi.product.name: HP EliteBook 745 G5
dmi.product.sku: 2MG23AV
dmi.sys.vendor: HP

Created attachment 278845
ACPI dump

New HP EliteBook 745 G5, BIOS version 1.03.01, Ryzen PRO 2500U.

Booting any modern kernel (4.10+) hangs at boot on this system with no kernel messages displayed unless you disable MCE support (via mce=off).

Knowing that Debian's 4.9 kernel boots fine, I bisected Linus's tree, and it appears that commit 18807ddb7f88d4ac3797302bafb18143d573e66f ("x86/mce/AMD: Reset Threshold Limit after logging error", Yazen Ghannam, Tue Nov 15 2016) is the first bad commit.

However, the few commits before this one are also all related to MCE support on AMD systems, so the failure may be the culmination of several commits.

Created attachment 278847
dmesg from booting from last good commit

Created attachment 278849
dmesg from normal debian 4.9 boot

John Clemens (clemej) wrote :

Note: this bug also affects 18.04. Debian stable works, as it is based on 4.9.

description: updated

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed

Hello,

Original Report:
https://bugs.launchpad.net/bugs/1796443

Best regards,
--
Cristian Aravena Romero (caravena)

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.19 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc7

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
tags: added: kernel-da-key
John Clemens (clemej) wrote :

Kernel linux-image-unsigned-4.19.0-999-generic_4.19.0-999.201810062200_amd64 still has this issue.

John Clemens (clemej) wrote :

Changing as requested.

Changed in linux (Ubuntu):
status: Triaged → Confirmed
tags: added: kernel-bug-exists-upstream

I think it would be better to email the patch author and CC the x86 mailing list.

*** Bug 201213 has been marked as a duplicate of this bug. ***

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed

https://marc.info/?l=linux-edac&m=154331383121359&w=2

[PATCH] x86/mce/AMD: Make sure banks were initialized before accessing them

A proper fix has been provided by Borislav:

https://marc.info/?t=154334682000003&r=1&w=2

[PATCH] x86/MCE/AMD: Fix the thresholding machinery initialization order

Fixed in Linus's tree with commit 60c8144afc28 ("x86/MCE/AMD: Fix the thresholding machinery initialization order").

Changed in linux:
status: Confirmed → Fix Released

Fix became part of the following releases:
1) 4.20-rc5 (commit 60c8144afc28)
2) 4.19.7 (commit 00f91adf52af)
3) 4.14.86 (commit 855eefd9124a)
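
The release list above can be turned into a small check for upstream version strings. This is a sketch under stated assumptions: the names are made up for this example, stable series that are not listed (for instance 4.15 through 4.18) are treated as not containing the fix, and distribution kernels that backport the commit under their own version numbers are not covered.

```python
import re

# Minimum patch level per stable series, taken from the list above.
FIXED_IN_SERIES = {(4, 14): 86, (4, 19): 7}
# First mainline series containing the fix, and the release candidate it landed in.
FIRST_FIXED_MAINLINE = (4, 20)
FIRST_FIXED_RC = 5


def has_init_order_fix(release: str) -> bool:
    """Guess whether an upstream kernel release contains commit 60c8144afc28."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(?:-rc(\d+))?", release)
    if not m:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    series = (int(m.group(1)), int(m.group(2)))
    patch = int(m.group(3) or 0)
    rc = int(m.group(4)) if m.group(4) else None

    if series == FIRST_FIXED_MAINLINE and rc is not None:
        return rc >= FIRST_FIXED_RC
    if series >= FIRST_FIXED_MAINLINE:
        return True
    needed = FIXED_IN_SERIES.get(series)
    return needed is not None and patch >= needed


for version in ("4.14.85", "4.14.86", "4.19.6", "4.19.7", "4.20-rc4", "4.20-rc5"):
    print(version, has_init_order_fix(version))
```

A negative result for a distribution kernel only means that the upstream version number gives no guarantee; the package changelog is the place to confirm a backport.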

Jon Grimm (jgrimm) wrote :

Just ran across this bug in LP.

Note: 60c8144afc28 fixes a real bug, but it only masks the underlying issue. The bug is being hit in this specific instance because an AMD CPU erratum causes spurious MCEs early enough in boot to trigger the condition that this commit fixes.

While the crash is fixed, the thresholding interrupts will still arrive at a very high rate, so it is better to disable them on affected CPUs, as done by the following commits:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=45d4b7b9cb88526f6d5bd4c03efab88d75d10e4f

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=71a84402b93e5fbd8f817f40059c137e10171788

If the above 2 commits are in place, 60c8144afc28 becomes less critical, as you should no longer hit that condition.

Jon Grimm (jgrimm) wrote :

Adding dmesg/serial output from a failing system (an AMD test board) as another example of a failing system besides the original reporter's.

Kai-Heng Feng (kaihengfeng) wrote :
Jon Grimm (jgrimm) wrote :

Thank you! Will do.

Jon Grimm (jgrimm) wrote :

All 3 test kernels look good. Tested on internal Ryzen development boards that previously showed this symptom.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
description: updated
Stefan Bader (smb) on 2019-07-09
Changed in linux (Ubuntu Cosmic):
status: New → Won't Fix
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
Changed in linux (Ubuntu Disco):
importance: Undecided → Medium
Changed in linux (Ubuntu Bionic):
status: New → Fix Committed

Please note that all commits requested for Disco have already been applied as part of LP: #1836614 ("Disco update: 5.0.18 upstream stable release").

Changed in linux (Ubuntu Disco):
status: New → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Tom Lendacky (tlendacky) on 2019-07-25
tags: added: verification-done-bionic
removed: verification-needed-bionic

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Tom Lendacky (tlendacky) on 2019-08-07
tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-58.64

---------------
linux (4.15.0-58.64) bionic; urgency=medium

  * unable to handle kernel NULL pointer dereference at 000000000000002c (IP:
    iget5_locked+0x9e/0x1f0) (LP: #1838982)
    - Revert "ovl: set I_CREATING on inode being created"
    - Revert "new primitive: discard_new_inode()"

linux (4.15.0-57.63) bionic; urgency=medium

  * CVE-2019-1125
    - x86/cpufeatures: Carve out CQM features retrieval
    - x86/cpufeatures: Combine word 11 and 12 into a new scattered features word
    - x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
    - x86/speculation: Enable Spectre v1 swapgs mitigations
    - x86/entry/64: Use JMP instead of JMPQ
    - x86/speculation/swapgs: Exclude ATOMs from speculation through SWAPGS

  * Packaging resync (LP: #1786013)
    - update dkms package versions

linux (4.15.0-56.62) bionic; urgency=medium

  * bionic/linux: 4.15.0-56.62 -proposed tracker (LP: #1837626)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log
    - [Packaging] update helper scripts

  * CVE-2019-2101
    - media: uvcvideo: Fix 'type' check leading to overflow

  * hibmc-drm Causes Unreadable Display for Huawei amd64 Servers (LP: #1762940)
    - [Config] Set CONFIG_DRM_HISI_HIBMC to arm64 only
    - SAUCE: Make CONFIG_DRM_HISI_HIBMC depend on ARM64

  * Bionic: support for Solarflare X2542 network adapter (sfc driver)
    (LP: #1836635)
    - sfc: make mem_bar a function rather than a constant
    - sfc: support VI strides other than 8k
    - sfc: add Medford2 (SFC9250) PCI Device IDs
    - sfc: improve PTP error reporting
    - sfc: update EF10 register definitions
    - sfc: populate the timer reload field
    - sfc: update MCDI protocol headers
    - sfc: support variable number of MAC stats
    - sfc: expose FEC stats on Medford2
    - sfc: expose CTPIO stats on NICs that support them
    - sfc: basic MCDI mapping of 25/50/100G link speeds
    - sfc: support the ethtool ksettings API properly so that 25/50/100G works
    - sfc: add bits for 25/50/100G supported/advertised speeds
    - sfc: remove tx and MCDI handling from NAPI budget consideration
    - sfc: handle TX timestamps in the normal data path
    - sfc: add function to determine which TX timestamping method to use
    - sfc: use main datapath for HW timestamps if available
    - sfc: only enable TX timestamping if the adapter is licensed for it
    - sfc: MAC TX timestamp handling on the 8000 series
    - sfc: on 8000 series use TX queues for TX timestamps
    - sfc: only advertise TX timestamping if we have the license for it
    - sfc: simplify RX datapath timestamping
    - sfc: support separate PTP and general timestamping
    - sfc: support second + quarter ns time format for receive datapath
    - sfc: support Medford2 frequency adjustment format
    - sfc: add suffix to large constant in ptp
    - sfc: mark some unexported symbols as static
    - sfc: update MCDI protocol headers
    - sfc: support FEC configuration through ethtool
    - sfc: remove ctpio_dmabuf_start from stats
    - sfc: stop the TX queue before pushing new buffers

  * [18.04 FEAT] zKVM: Add hardwar...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released