Kernel panic with AWS 4.4.0-1053 / 4.4.0-1015 (Trusty)

Bug #1758869 reported by Po-Hsu Lin on 2018-03-26
This bug affects 1 person
Affects          Importance  Assigned to
linux (Ubuntu)   High        Kamal Mostafa
Xenial           Undecided   Kamal Mostafa

Bug Description

This is a POTENTIAL REGRESSION.

This issue occurs on only one of the AWS test instances, "c3.large".

Reproduction rate: 100% (every Jenkins deployment job failed on this instance)

Steps:
  1. Deploy this instance with AWS kernel (4.4.0-1052)
  2. Enable -proposed, upgrade it and reboot.

Result:
  * The system boots with 4.4.0-1052, but it is no longer accessible after rebooting into 4.4.0-1053. Output from the "aws ec2 get-console-output" command indicates that it hangs with a kernel panic on boot. It looks like the intel-microcode is causing this issue.

Output from the command: https://pastebin.ubuntu.com/p/3JvWNk5CTs/ (the full output is hard to capture because the instance keeps rebooting itself)

Po-Hsu Lin (cypressyew) on 2018-03-26
description: updated

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1758869

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Po-Hsu Lin (cypressyew) on 2018-03-26
tags: added: xenial

This also affects Trusty + AWS 4.4.0-1015 in proposed; log here:
https://pastebin.ubuntu.com/p/6CZdpPnMKw/

summary: - Kernel panic with AWS 4.4.0-1053
+ Kernel panic with AWS 4.4.0-1053 / 4.4.0-1015 (Trusty)
Changed in linux (Ubuntu):
importance: Undecided → High
status: Incomplete → Triaged
tags: added: kernel-da-key
Changed in linux (Ubuntu):
assignee: nobody → Kamal Mostafa (kamalmostafa)
status: Triaged → In Progress
Kamal Mostafa (kamalmostafa) wrote :

This is caused by this commit from mainline:

linux-aws: 9f182bd x86/microcode/intel: Extend BDW late-loading further with LLC size check
mainline: 7e702d1 x86/microcode/intel: Extend BDW late-loading further with LLC size check

which adds a check involving this computation:
+ do_div(llc_size, c->x86_max_cores);

But dmesg on a c3.large instance yields this interesting line:
[ 0.156084] smpboot: x86_max_cores == zero !?!?
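
The failure mode can be sketched in user-space C. With x86_max_cores reported as zero, the division above has a zero divisor; on x86 an integer division by zero raises a divide error, which would be consistent with the panic on boot. This is only a minimal sketch, not the kernel code: llc_size_per_core() and its out-parameter are hypothetical names, and a plain division stands in for do_div().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for the per-core LLC size computation.
 * Returns false, leaving *out untouched, when the core count is zero,
 * which is the value reported in the dmesg line above.
 */
bool llc_size_per_core(uint64_t llc_size, uint32_t max_cores, uint64_t *out)
{
    assert(out != NULL);

    if (max_cores == 0)
        return false;            /* dividing here would trap on x86 */

    *out = llc_size / max_cores; /* plain division stands in for do_div() */
    return true;
}
```

A caller would treat a false return as "core count unknown" and skip the LLC-size check rather than divide.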

Kamal Mostafa (kamalmostafa) wrote :

Issue and patches described here:

https://lkml.org/lkml/2018/2/6/320

Kamal Mostafa (kamalmostafa) wrote :

Also affects Xenial generic kernel (4.4.0-117.141).

Kamal Mostafa (kamalmostafa) wrote :

The attached backport of mainline a15a7535 fixes the problem (tested on c3.large instances) for the Xenial generic and linux-aws 4.4-based kernels.
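
The changelog for this fix names the commit "x86/microcode/AMD: Do not load when running on a hypervisor". As a rough illustration of that idea, and assuming the check keys off the CPUID hypervisor-present bit (leaf 1, ECX bit 31), the decision can be sketched as below. The helper names are hypothetical and the ECX value is passed in rather than read from the CPU, so this is not the kernel's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, ECX bit 31: reserved for hypervisors to signal their presence. */
#define CPUID_1_ECX_HYPERVISOR (UINT32_C(1) << 31)

/* Hypothetical helper: true when the given CPUID(1).ECX value has the hypervisor bit set. */
bool running_on_hypervisor(uint32_t cpuid_1_ecx)
{
    return (cpuid_1_ecx & CPUID_1_ECX_HYPERVISOR) != 0;
}

/* Hypothetical policy: skip microcode loading entirely inside a guest. */
bool should_load_microcode(uint32_t cpuid_1_ecx)
{
    return !running_on_hypervisor(cpuid_1_ecx);
}
```

Skipping the loader in a guest means the code path containing the LLC-size division is never reached on an instance such as c3.large.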

tags: added: patch
Changed in linux (Ubuntu Xenial):
status: New → Fix Committed
assignee: nobody → Kamal Mostafa (kamalmostafa)

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Kamal Mostafa (kamalmostafa) wrote :

Verified fixed in proposed kernels:
  linux (4.4.0-118.142)
  linux-aws (4.4.0-1054.63)

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-119.143

---------------
linux (4.4.0-119.143) xenial; urgency=medium

  * linux: 4.4.0-119.143 -proposed tracker (LP: #1760327)

  * Dell XPS 13 9360 bluetooth scan can not detect any device (LP: #1759821)
    - Revert "Bluetooth: btusb: fix QCA Rome suspend/resume"

linux (4.4.0-118.142) xenial; urgency=medium

  * linux: 4.4.0-118.142 -proposed tracker (LP: #1759607)

  * Kernel panic with AWS 4.4.0-1053 / 4.4.0-1015 (Trusty) (LP: #1758869)
    - x86/microcode/AMD: Do not load when running on a hypervisor

  * CVE-2018-8043
    - net: phy: mdio-bcm-unimac: fix potential NULL dereference in
      unimac_mdio_probe()

linux (4.4.0-117.141) xenial; urgency=medium

  * linux: 4.4.0-117.141 -proposed tracker (LP: #1755208)

  * Xenial update to 4.4.114 stable release (LP: #1754592)
    - x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
    - usbip: prevent vhci_hcd driver from leaking a socket pointer address
    - usbip: Fix implicit fallthrough warning
    - usbip: Fix potential format overflow in userspace tools
    - x86/microcode/intel: Fix BDW late-loading revision check
    - x86/retpoline: Fill RSB on context switch for affected CPUs
    - sched/deadline: Use the revised wakeup rule for suspending constrained dl
      tasks
    - can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
    - can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
    - PM / sleep: declare __tracedata symbols as char[] rather than char
    - time: Avoid undefined behaviour in ktime_add_safe()
    - timers: Plug locking race vs. timer migration
    - Prevent timer value 0 for MWAITX
    - drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled
    - drivers: base: cacheinfo: fix boot error message when acpi is enabled
    - PCI: layerscape: Add "fsl,ls2085a-pcie" compatible ID
    - PCI: layerscape: Fix MSG TLP drop setting
    - mmc: sdhci-of-esdhc: add/remove some quirks according to vendor version
    - fs/select: add vmalloc fallback for select(2)
    - hwpoison, memcg: forcibly uncharge LRU pages
    - cma: fix calculation of aligned offset
    - mm, page_alloc: fix potential false positive in __zone_watermark_ok
    - ipc: msg, make msgrcv work with LONG_MIN
    - x86/ioapic: Fix incorrect pointers in ioapic_setup_resources()
    - ACPI / processor: Avoid reserving IO regions too early
    - ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
    - ACPICA: Namespace: fix operand cache leak
    - netfilter: x_tables: speed up jump target validation
    - netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed
      in 64bit kernel
    - netfilter: nf_dup_ipv6: set again FLOWI_FLAG_KNOWN_NH at flowi6_flags
    - netfilter: nf_ct_expect: remove the redundant slash when policy name is
      empty
    - netfilter: nfnetlink_queue: reject verdict request from different portid
    - netfilter: restart search if moved to other chain
    - netfilter: nf_conntrack_sip: extend request line validation
    - netfilter: use fwmark_reflect in nf_send_reset
    - ext2: Don't clear SGID when inheriting ACLs
    - reiserfs: fix race in prealloc discard
    - re...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released