CPU call trace on AMD Raven Ridge after S3

Bug #1732894 reported by Timo Aaltonen
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Undecided
Timo Aaltonen
Artful
Fix Released
Medium
Unassigned

Bug Description

[Impact]

Dmesg shows a call trace after S3 cycle.

This has been fixed in v4.14-rc1 with

commit 9662d43f523dfc0dc242ec29c2921c43898d6ae5
Author: Yazen Ghannam <email address hidden>
Date: Mon Jul 24 12:12:28 2017 +0200

    x86/mce/AMD: Allow any CPU to initialize the smca_banks array

    Current SMCA implementations have the same banks on each CPU with the
    non-core banks only visible to a "master thread" on each die. Practically,
    this means the smca_banks array, which describes the banks, only needs to
    be populated once by a single master thread.

    CPU 0 seemed like a good candidate to do the populating. However, it's
    possible that CPU 0 is not enabled in which case the smca_banks array won't
    be populated.

    Rather than try to figure out another master thread to do the populating,
    we should just allow any CPU to populate the array.

    Drop the CPU 0 check and return early if the bank was already initialized.
    Also, drop the WARNing about an already initialized bank, since this will
    be a common, expected occurrence.

    The smca_banks array is only populated at boot time and CPUs are brought
    online sequentially. So there's no need for locking around the array.

    If the first CPU up is a master thread, then it will populate the array
    with all banks, core and non-core. Every CPU afterwards will return
    early. If the first CPU up is not a master thread, then it will populate
    the array with all core banks. The first CPU afterwards that is a master
    thread will skip populating the core banks and continue populating the
    non-core banks.

    Signed-off-by: Yazen Ghannam <email address hidden>
    Signed-off-by: Borislav Petkov <email address hidden>
    Acked-by: Jack Miller <email address hidden>
    Cc: Linus Torvalds <email address hidden>
    Cc: Peter Zijlstra <email address hidden>
    Cc: Thomas Gleixner <email address hidden>
    Cc: Tony Luck <email address hidden>
    Cc: linux-edac <email address hidden>
    Link: http://<email address hidden>
    Signed-off-by: Ingo Molnar <email address hidden>

[Test case]

Suspend/resume Raven Ridge, see that it doesn't show the call trace any more.

[Regression potential]

4.14 released with it, and it fixes code which "for now" checked only CPU 0, so regressions potential is slim to none.

Timo Aaltonen (tjaalton)
Changed in linux (Ubuntu):
assignee: nobody → Timo Aaltonen (tjaalton)
status: New → Confirmed
Stefan Bader (smb)
Changed in linux (Ubuntu Artful):
importance: Undecided → Medium
status: New → Fix Committed
Revision history for this message
Khaled El Mously (kmously) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-artful' to 'verification-done-artful'. If the problem still exists, change the tag 'verification-needed-artful' to 'verification-failed-artful'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-artful
Revision history for this message
Zhang Junwei (zjunweihit) wrote :

Hi all,

Verified with hwe-edge-4.13.0-18 and hwe-edge-4.13.0-19.
Both of them fix the CPU(mce) calltrace during S3.

Jerry

Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

Hi @zjunweihit,

Thank you for doing the verification!

tags: added: verification-done-artful
removed: verification-needed-artful
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (15.0 KiB)

This bug was fixed in the package linux - 4.13.0-19.22

---------------
linux (4.13.0-19.22) artful; urgency=low

  * linux: 4.13.0-19.22 -proposed tracker (LP: #1736118)

  * CVE-2017-1000405
    - mm, thp: Do not make page table dirty unconditionally in touch_p[mu]d()

linux (4.13.0-18.21) artful; urgency=low

  * linux: 4.13.0-18.21 -proposed tracker (LP: #1733530)

  * NVMe timeout is too short (LP: #1729119)
    - nvme: update timeout module parameter type

  * CPU call trace on AMD Raven Ridge after S3 (LP: #1732894)
    - x86/mce/AMD: Allow any CPU to initialize the smca_banks array

  * Set PANIC_TIMEOUT=10 on Power Systems (LP: #1730660)
    - [Config]: Set PANIC_TIMEOUT=10 on ppc64el

  * Cannot pair BLE remote devices when using combo BT SoC (LP: #1731467)
    - Bluetooth: increase timeout for le auto connections

  * enable CONFIG_SND_SOC_INTEL_BYT_CHT_NOCODEC_MACH easily confuse users
    (LP: #1732627)
    - [Config] CONFIG_SND_SOC_INTEL_BYT_CHT_NOCODEC_MACH=n

  * Plantronics P610 does not support sample rate reading (LP: #1719853)
    - ALSA: usb-audio: Add sample rate quirk for Plantronics P610

  * Allow drivers to use Relaxed Ordering on capable root ports (LP: #1721365)
    - Revert commit 1a8b6d76dc5b ("net:add one common config...")
    - net: ixgbe: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

  * support GICv3 ITS save/restore & migration (LP: #1710019)
    - KVM: arm/arm64: vgic-its: Fix return value for device table restore

  * Device hotplugging with MPT SAS cannot work for VMWare ESXi (LP: #1730852)
    - scsi: mptsas: Fixup device hotplug for VMWare ESXi

  * Artful update to 4.13.13 stable release (LP: #1732726)
    - netfilter: nat: Revert "netfilter: nat: convert nat bysrc hash to
      rhashtable"
    - netfilter: nft_set_hash: disable fast_ops for 2-len keys
    - workqueue: Fix NULL pointer dereference
    - crypto: ccm - preserve the IV buffer
    - crypto: x86/sha1-mb - fix panic due to unaligned access
    - crypto: x86/sha256-mb - fix panic due to unaligned access
    - KEYS: fix NULL pointer dereference during ASN.1 parsing [ver #2]
    - ACPI / PM: Blacklist Low Power S0 Idle _DSM for Dell XPS13 9360
    - ARM: 8720/1: ensure dump_instr() checks addr_limit
    - ALSA: timer: Limit max instances per timer
    - ALSA: usb-audio: support new Amanero Combo384 firmware version
    - ALSA: hda - fix headset mic problem for Dell machines with alc274
    - ALSA: seq: Fix OSS sysex delivery in OSS emulation
    - ALSA: seq: Avoid invalid lockdep class warning
    - MIPS: Fix CM region target definitions
    - MIPS: BMIPS: Fix missing cbr address
    - MIPS: AR7: Defer registration of GPIO
    - MIPS: AR7: Ensure that serial ports are properly set up
    - KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT
      updates
    - Input: elan_i2c - add ELAN060C to the ACPI table
    - rbd: use GFP_NOIO for parent stat and data requests
    - drm/vmwgfx: Fix Ubuntu 17.10 Wayland black screen issue
    - Revert "x86: CPU: Fix up "cpu MHz" in /proc/cpuinfo"
    - can: sun4i: handle overrun in RX FIFO
    - can: peak: Add support for new PCIe/M2 CAN FD interfaces
    - can: ifi: Fix transmitter del...

Changed in linux (Ubuntu Artful):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (14.0 KiB)

This bug was fixed in the package linux - 4.13.0-25.29

---------------
linux (4.13.0-25.29) artful; urgency=low

  * linux: 4.13.0-25.29 -proposed tracker (LP: #1741955)

  * CVE-2017-5754
    - Revert "UBUNTU: [Config] updateconfigs to enable PTI"
    - [Config] Enable PTI with UNWINDER_FRAME_POINTER

linux (4.13.0-24.28) artful; urgency=low

  * linux: 4.13.0-24.28 -proposed tracker (LP: #1741745)

  * CVE-2017-5754
    - x86/cpu, x86/pti: Do not enable PTI on AMD processors

linux (4.13.0-23.27) artful; urgency=low

  * linux: 4.13.0-23.27 -proposed tracker (LP: #1741556)

  [ Kleber Sacilotto de Souza ]
  * CVE-2017-5754
    - x86/mm: Add the 'nopcid' boot option to turn off PCID
    - x86/mm: Enable CR4.PCIDE on supported systems
    - x86/mm: Document how CR4.PCIDE restore works
    - x86/entry/64: Refactor IRQ stacks and make them NMI-safe
    - x86/entry/64: Initialize the top of the IRQ stack before switching stacks
    - x86/entry/64: Add unwind hint annotations
    - xen/x86: Remove SME feature in PV guests
    - x86/xen/64: Rearrange the SYSCALL entries
    - irq: Make the irqentry text section unconditional
    - x86/xen/64: Fix the reported SS and CS in SYSCALL
    - x86/paravirt/xen: Remove xen_patch()
    - x86/traps: Simplify pagefault tracing logic
    - x86/idt: Unify gate_struct handling for 32/64-bit kernels
    - x86/asm: Replace access to desc_struct:a/b fields
    - x86/xen: Get rid of paravirt op adjust_exception_frame
    - x86/paravirt: Remove no longer used paravirt functions
    - x86/entry: Fix idtentry unwind hint
    - x86/mm/64: Initialize CR4.PCIDE early
    - objtool: Add ORC unwind table generation
    - objtool, x86: Add facility for asm code to provide unwind hints
    - x86/unwind: Add the ORC unwinder
    - x86/kconfig: Consolidate unwinders into multiple choice selection
    - objtool: Upgrade libelf-devel warning to error for CONFIG_ORC_UNWINDER
    - x86/ldt/64: Refresh DS and ES when modify_ldt changes an entry
    - x86/mm: Give each mm TLB flush generation a unique ID
    - x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
    - x86/mm: Rework lazy TLB mode and TLB freshness tracking
    - x86/mm: Implement PCID based optimization: try to preserve old TLB entries
      using PCID
    - x86/mm: Factor out CR3-building code
    - x86/mm/64: Stop using CR3.PCID == 0 in ASID-aware code
    - x86/mm: Flush more aggressively in lazy TLB mode
    - Revert "x86/mm: Stop calling leave_mm() in idle code"
    - kprobes/x86: Set up frame pointer in kprobe trampoline
    - x86/tracing: Introduce a static key for exception tracing
    - x86/boot: Add early cmdline parsing for options with arguments
    - mm, x86/mm: Fix performance regression in get_user_pages_fast()
    - x86/asm: Remove unnecessary \n\t in front of CC_SET() from asm templates
    - objtool: Don't report end of section error after an empty unwind hint
    - x86/head: Remove confusing comment
    - x86/head: Remove unused 'bad_address' code
    - x86/head: Fix head ELF function annotations
    - x86/boot: Annotate verify_cpu() as a callable function
    - x86/xen: Fix xen head ELF annotations
    - x86/xen: Add unwind hint anno...

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.