qemu-efi-aarch64 in >= artful can't boot xenial cloud images

Bug #1744754 reported by dann frazier on 2018-01-22
This bug affects 2 people
Affects                Importance  Assigned to
cloud-images           Undecided   Unassigned
edk2 (Ubuntu)          Undecided   Unassigned
edk2 (Ubuntu Xenial)   Undecided   Unassigned
linux (Ubuntu)         Undecided   Unassigned
linux (Ubuntu Xenial)  Undecided   dann frazier

Bug Description

[Impact]
After upgrading an Ubuntu/arm64 KVM host past xenial, your xenial-based guests will fail to boot.

[Test Case]
Boot a xenial cloud image with qemu-efi-aarch64 from artful/bionic.

[Regression Risk]
I've tested booting a xenial cloud image in bionic (ACPI mode), and regression tested w/ xenial's qemu-efi (DTB mode). I've regression tested on a Cavium ThunderX CRB1S, a Cavium ThunderX CRB2S and an APM X-Gene 2 Merlin board.

Patches 1-5 change only code in the GICv3 driver. The xenial GA kernel only supported 2 GICv3 systems - the 1 socket and 2 socket variants of the Cavium ThunderX CRB - and I've regression tested on those systems.

Patch 6 only adds new macro definitions.

Patch 7 is restricted to devicetree code, except for a change to earlycon.c:param_setup_earlycon(). In the case that 'earlycon' is passed on the cmdline (vs. earlycon=something), this function used to return 0 - but now it will return -ENODEV on non-devicetree systems, which is a subtle API change. However, according to kernel-parameters.txt (and the code itself), 'earlycon' by itself is only valid on devicetree systems. Just to be sure, I booted an x86 system up w/ 'earlycon' with and without this series, and observed no difference.
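
The return-value change described for Patch 7 can be modeled roughly as below. This is a minimal Python sketch of the behavior as described in this report, not the kernel's C code: the function names and the has_devicetree flag are hypothetical, and returning 0 on a devicetree system is an assumption made for illustration.

```python
import errno


def bare_earlycon_before(has_devicetree: bool) -> int:
    """Model of the old behavior for a bare 'earlycon' parameter.

    As described above, the function used to return 0 regardless of
    whether the system was booted with a devicetree.
    """
    return 0


def bare_earlycon_after(has_devicetree: bool) -> int:
    """Model of the new behavior for a bare 'earlycon' parameter.

    Assumption: 0 stands in for "a devicetree console lookup was
    attempted"; on non-devicetree systems the result is -ENODEV.
    """
    return 0 if has_devicetree else -errno.ENODEV


if __name__ == "__main__":
    for dt in (True, False):
        print(dt, bare_earlycon_before(dt), bare_earlycon_after(dt))
```

Only the non-devicetree case differs between the two models, which is why the x86 boot test with a bare 'earlycon' is the relevant regression check.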

Patch 8 adds the SPCR table parser, but no caller to it yet. It also modifies the same earlycon code as Patch 7 - here it avoids earlycon init in the case that the devicetree-specific 'earlycon' was passed. As mentioned in my analysis of Patch 7, this codepath is only supported for devicetree systems, and has been regression tested on x86.

Patch 9 turns on CONFIG_ACPI_SPCR_TABLE - however, this driver will only be built for arm64. TBH, I'm not 100% sure how Kconfig knows not to build this for other archs - but I checked the logs, and there's no spcr.o built on other archs. (Not that that should be a problem - they would just grow a bit of unused code).

Patch 10 only touches arm64-specific code, adding the call to parse_spcr(), so risk is limited to arm64.

Patch 11 adds a new match method to the ARM-specific pl011 console driver, so regression risk to other architectures is negligible.

dann frazier (dannf) wrote :

I bisected this down to the following edk2 commit:

commit 78c41ff519b187d8979cda7074f007a6323f9acd (refs/bisect/bad)
Author: Ard Biesheuvel
Date: Thu Mar 9 16:59:34 2017 +0100

    ArmVirtPkg/FdtClientDxe: make DT table installation !ACPI dependent

    Instead of having a build time switch to prevent the FDT configuration
    table from being installed, make this behavior dependent on whether we
    are passing ACPI tables to the OS. This is done by looking for the
    ACPI 2.0 configuration table, and only installing the FDT one if the
    ACPI one cannot be found.

Which makes sense - arm64/ACPI support wasn't baked yet upstream in v4.4.
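
To make the failure mode concrete, here is a minimal Python sketch of the rule in that commit message and its effect on a guest kernel. It is a simplified model, not edk2 or kernel code: the function names, the table labels and the kernel_has_acpi_support flag are hypothetical, and real boots depend on much more than this one decision.

```python
ACPI_TABLE = "ACPI 2.0"
FDT_TABLE = "FDT"


def config_tables_after_commit(acpi_tables_present: bool) -> set:
    """Tables the firmware exposes to the OS under the new rule.

    The FDT configuration table is installed only when the ACPI 2.0
    configuration table cannot be found.
    """
    tables = {ACPI_TABLE} if acpi_tables_present else set()
    if ACPI_TABLE not in tables:
        tables.add(FDT_TABLE)
    return tables


def guest_can_describe_hardware(tables: set, kernel_has_acpi_support: bool) -> bool:
    """Simplified check: the kernel needs a hardware description it can parse."""
    if FDT_TABLE in tables:
        return True
    return kernel_has_acpi_support and ACPI_TABLE in tables


if __name__ == "__main__":
    acpi_mode = config_tables_after_commit(acpi_tables_present=True)
    # A kernel without usable arm64 ACPI support, given only ACPI tables:
    print(guest_can_describe_hardware(acpi_mode, kernel_has_acpi_support=False))
```

In this model a kernel without arm64 ACPI support only finds a usable description when the firmware falls back to installing the FDT table, which matches the DTB-mode versus ACPI-mode distinction used in the testing notes.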

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1744754

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Xenial):
status: New → Incomplete
tags: added: xenial
dann frazier (dannf) wrote :

Here are the 3 options for dealing with this I've come up with:

1) Disable ACPI in edk2 builds. That'd be an easy fix - but switching away from upstream defaults isn't very future-proof. It also seems like a pretty big regression risk for existing artful/bionic users who may be depending on ACPI.

2) Switch the xenial cloud-images to use the HWE kernel. There's precedent for this (we did it for trusty to add support for GICv3 systems). Existing VMs using 4.4 will still break after upgrading qemu-efi past xenial.

3) Backport the necessary ACPI support to xenial's 4.4? That would fix it everywhere, but I'm not sure how feasible/SRUable that would be.

Robert C Jennings (rcj) wrote :

Dann,

Regarding the 2nd option you listed: "Switch the xenial cloud-images to use the HWE kernel. There's precedent for this (we did it for trusty to add support for GICv3 systems). Existing VMs using 4.4 will still break after upgrading qemu-efi past xenial."

As you know, the prior change to the Trusty images was made after determining that there were no supported systems on the market with GICv2 hardware; the impact of that change was deemed minimal. A change in cloud images to move from the GA to HWE kernel for an arm xenial image could be disruptive and we would need to weigh the risks and impact of that. I think you know that well already; I just wanted to reinforce that position and encourage edk2 and kernel developers to investigate remediation in those areas first. Thanks.

dann frazier (dannf) wrote :

@Robert: Actually, X-Gene is supported on trusty, which is non-GICv3, but your point is taken. I personally think option #3 would be the best option, but wasn't sure of the feasibility. However, I've been working on the necessary kernel backports in the background over the past week, and I now have something working that looks pretty clean/minimal. I'm doing some regression testing now, and plan to submit to the kernel team later this week.

Changed in linux (Ubuntu Xenial):
status: Incomplete → In Progress
Changed in linux (Ubuntu):
status: Incomplete → Fix Released
Changed in edk2 (Ubuntu Xenial):
status: New → Invalid
dann frazier (dannf) on 2018-02-01
description: updated
Changed in linux (Ubuntu Xenial):
assignee: nobody → dann frazier (dannf)
description: updated
dann frazier (dannf) on 2018-02-02
description: updated
tags: added: id-5a674ab10c375ca04060ea9a
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in edk2 (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
dann frazier (dannf) wrote :

My backport for option #3 has been merged, so no need for a change in the cloud images or edk2.

Changed in cloud-images:
status: New → Invalid
Changed in edk2 (Ubuntu):
status: Confirmed → Invalid
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
dann frazier (dannf) wrote :

After updating to the proposed kernel, I was able to boot a xenial cloud image with qemu-efi from both xenial (DTB mode) and bionic (ACPI mode).

dann frazier (dannf) wrote :
tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-119.143

---------------
linux (4.4.0-119.143) xenial; urgency=medium

  * linux: 4.4.0-119.143 -proposed tracker (LP: #1760327)

  * Dell XPS 13 9360 bluetooth scan can not detect any device (LP: #1759821)
    - Revert "Bluetooth: btusb: fix QCA Rome suspend/resume"

linux (4.4.0-118.142) xenial; urgency=medium

  * linux: 4.4.0-118.142 -proposed tracker (LP: #1759607)

  * Kernel panic with AWS 4.4.0-1053 / 4.4.0-1015 (Trusty) (LP: #1758869)
    - x86/microcode/AMD: Do not load when running on a hypervisor

  * CVE-2018-8043
    - net: phy: mdio-bcm-unimac: fix potential NULL dereference in
      unimac_mdio_probe()

linux (4.4.0-117.141) xenial; urgency=medium

  * linux: 4.4.0-117.141 -proposed tracker (LP: #1755208)

  * Xenial update to 4.4.114 stable release (LP: #1754592)
    - x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
    - usbip: prevent vhci_hcd driver from leaking a socket pointer address
    - usbip: Fix implicit fallthrough warning
    - usbip: Fix potential format overflow in userspace tools
    - x86/microcode/intel: Fix BDW late-loading revision check
    - x86/retpoline: Fill RSB on context switch for affected CPUs
    - sched/deadline: Use the revised wakeup rule for suspending constrained dl
      tasks
    - can: af_can: can_rcv(): replace WARN_ONCE by pr_warn_once
    - can: af_can: canfd_rcv(): replace WARN_ONCE by pr_warn_once
    - PM / sleep: declare __tracedata symbols as char[] rather than char
    - time: Avoid undefined behaviour in ktime_add_safe()
    - timers: Plug locking race vs. timer migration
    - Prevent timer value 0 for MWAITX
    - drivers: base: cacheinfo: fix x86 with CONFIG_OF enabled
    - drivers: base: cacheinfo: fix boot error message when acpi is enabled
    - PCI: layerscape: Add "fsl,ls2085a-pcie" compatible ID
    - PCI: layerscape: Fix MSG TLP drop setting
    - mmc: sdhci-of-esdhc: add/remove some quirks according to vendor version
    - fs/select: add vmalloc fallback for select(2)
    - hwpoison, memcg: forcibly uncharge LRU pages
    - cma: fix calculation of aligned offset
    - mm, page_alloc: fix potential false positive in __zone_watermark_ok
    - ipc: msg, make msgrcv work with LONG_MIN
    - x86/ioapic: Fix incorrect pointers in ioapic_setup_resources()
    - ACPI / processor: Avoid reserving IO regions too early
    - ACPI / scan: Prefer devices without _HID/_CID for _ADR matching
    - ACPICA: Namespace: fix operand cache leak
    - netfilter: x_tables: speed up jump target validation
    - netfilter: arp_tables: fix invoking 32bit "iptable -P INPUT ACCEPT" failed
      in 64bit kernel
    - netfilter: nf_dup_ipv6: set again FLOWI_FLAG_KNOWN_NH at flowi6_flags
    - netfilter: nf_ct_expect: remove the redundant slash when policy name is
      empty
    - netfilter: nfnetlink_queue: reject verdict request from different portid
    - netfilter: restart search if moved to other chain
    - netfilter: nf_conntrack_sip: extend request line validation
    - netfilter: use fwmark_reflect in nf_send_reset
    - ext2: Don't clear SGID when inheriting ACLs
    - reiserfs: fix race in prealloc discard
    - re...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Po-Hsu Lin (cypressyew) wrote :

I can reproduce this in Bionic ARM64, bug 1765668

dann frazier (dannf) wrote :

This works fine for me now, see attached. bug 1765668 is a different issue, unrelated to the kernel.
