[VMs on] arm64 eMAG system often failing to reboot with "IRQ Exception at 0x000000009BC11F38"

Bug #1926955 reported by Iain Lane
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
edk2 (Ubuntu)   New        Undecided   Unassigned
linux (Ubuntu)  Confirmed  Undecided   Unassigned
qemu (Ubuntu)   New        Undecided   Unassigned

Bug Description

We recently installed some new arm64 Ampere eMAG hardware in the ScalingStack region which we use to run autopkgtests.

Guests (currently impish VMs on OpenStack) which we spawn often fail to reboot, printing this message over and over:

  IRQ Exception at 0x000000009BC11F38

Steps to reproduce:

1. Boot a current impish daily cloud image (have not confirmed other releases) in a VM on an Ampere eMAG (have not verified bare metal).
2. Wait for it to come up, and then issue `sudo reboot`
3. If the machine comes back up again, repeat step 2 a few times. The failure is intermittent.
3a. If it fails, check `openstack console log show <machine id>` for the IRQ exception, which repeats indefinitely.
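
The check in step 3a can be automated once the console text has been captured. The following is a minimal sketch, not part of the original report: the function name `looks_stuck` and the `min_repeats` threshold are hypothetical, and it assumes the console output is already available as a string.

```python
import re

# Matches the firmware message quoted in this report, e.g.
# "IRQ Exception at 0x000000009BC11F38".
IRQ_LINE = re.compile(r"IRQ Exception at 0x[0-9A-Fa-f]{16}")


def looks_stuck(console_text: str, min_repeats: int = 3) -> bool:
    """Return True if the last non-empty console lines are all IRQ exceptions.

    min_repeats is an arbitrary threshold chosen for this sketch.
    """
    lines = [ln.strip() for ln in console_text.splitlines() if ln.strip()]
    tail = lines[-min_repeats:]
    return len(tail) == min_repeats and all(IRQ_LINE.fullmatch(ln) for ln in tail)
```

Because the failure is intermittent, a check like this would have to run after every reboot in a loop rather than once.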

(stand by, I will get something to attach shortly)
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 May 3 12:45 seq
 crw-rw---- 1 root audio 116, 33 May 3 12:45 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu65
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CasperMD5CheckResult: unknown
DistroRelease: Ubuntu 21.10
Ec2AMI: ami-0000fd9f
Ec2AMIManifest: FIXME
Ec2AvailabilityZone: nova
Ec2InstanceType: m1.small
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lspci-vt:
 -[0000:00]-+-00.0 Red Hat, Inc. QEMU PCIe Host bridge
            \-01.0-[01-02]----01.0-[02]--
Lsusb: Error: command ['lsusb'] failed with exit code 1:
Lsusb-t:

Lsusb-v: Error: command ['lsusb', '-v'] failed with exit code 1:
MachineType: QEMU KVM Virtual Machine
Package: linux (not installed)
PciMultimedia:

ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.11.0-16-generic root=UUID=6bd452c7-16ff-4771-b2e8-36045e00cc68 ro nohz=off
ProcVersionSignature: Ubuntu 5.11.0-16.17-generic 5.11.12
RelatedPackageVersions:
 linux-restricted-modules-5.11.0-16-generic N/A
 linux-backports-modules-5.11.0-16-generic N/A
 linux-firmware 1.197
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: impish ec2-images
Uname: Linux 5.11.0-16-generic aarch64
UnreportableReason: This report is about a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: False
acpidump:

dmi.bios.date: 02/06/2015
dmi.bios.release: 0.0
dmi.bios.vendor: EFI Development Kit II / OVMF
dmi.bios.version: 0.0.0
dmi.chassis.type: 1
dmi.chassis.vendor: QEMU
dmi.chassis.version: 1.0
dmi.modalias: dmi:bvnEFIDevelopmentKitII/OVMF:bvr0.0.0:bd02/06/2015:br0.0:svnQEMU:pnKVMVirtualMachine:pvr1.0:cvnQEMU:ct1:cvr1.0:
dmi.product.name: KVM Virtual Machine
dmi.product.version: 1.0
dmi.sys.vendor: QEMU

Iain Lane (laney) wrote : CRDA.txt

apport information

tags: added: apport-collected ec2-images impish
description: updated
Iain Lane (laney) wrote : CurrentDmesg.txt

apport information

Iain Lane (laney) wrote : Lspci.txt

apport information

Iain Lane (laney) wrote : ProcCpuinfoMinimal.txt

apport information

Iain Lane (laney) wrote : ProcEnviron.txt

apport information

Iain Lane (laney) wrote : ProcInterrupts.txt

apport information

Iain Lane (laney) wrote : ProcModules.txt

apport information

Iain Lane (laney) wrote : UdevDb.txt

apport information

Iain Lane (laney) wrote : WifiSyslog.txt

apport information

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: hirsute
Iain Lane (laney)
description: updated
Iain Lane (laney) wrote : Re: arm64 eMAG system often failng to reboot with "IRQ Exception at 0x000000009BC11F38"

I captured the console log when the bug happens. I was hoping there would be something interesting just before the exception starts spamming the console, but sadly not.

[ OK ] Finished Reboot.
[ OK ] Reached target Reboot.
[ 17.856242] reboot: Restarting system

IRQ Exception at 0x00000000BBC11F38

IRQ Exception at 0x00000000BBC11F38

IRQ Exception at 0x00000000BBC11F38

IRQ Exception at 0x00000000BBC11F38

(keeps spamming this message)

Seth Forshee (sforshee) wrote :

I couldn't find anywhere in the kernel that might be printing this message, so I went searching. It looks like it is probably coming from the firmware:

https://github.com/tianocore/edk2/blob/master/ArmPlatformPkg/PrePeiCore/AArch64/ArchPrePeiCore.c#L29

Iain Lane (laney)
summary: - arm64 eMAG system often failng to reboot with "IRQ Exception at
- 0x000000009BC11F38"
+ [VMs on] arm64 eMAG system often failing to reboot with "IRQ Exception
+ at 0x000000009BC11F38"
Iain Lane (laney) wrote :

FTR, I just checked with Canonical IS and the affected host systems are running Xenial.

(We also have some different hardware which is running the same versions too and doesn't suffer from this bug, so I do suspect it is somehow hw specific.)

Iain Lane (laney) wrote :

I've just noticed a few failing with "Synchronous Exception at 0x00000000984FA4A8" instead of the IRQ Exception seen earlier.
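
With two kinds of exception now in play, it may help to tally which kind and address each failed reboot produces. This is a rough sketch only: the function name `tally_exceptions` is hypothetical, and the pattern covers just the two message forms quoted in this report.

```python
import re
from collections import Counter

# Covers only the two message forms seen in this report.
EXC = re.compile(r"(IRQ|Synchronous) Exception at 0x([0-9A-Fa-f]{16})")


def tally_exceptions(console_text: str) -> Counter:
    """Count (kind, address) pairs among firmware exception lines."""
    return Counter(
        (m.group(1), int(m.group(2), 16)) for m in EXC.finditer(console_text)
    )
```

For what it's worth, the two IRQ addresses quoted in this report (0x9BC11F38 and 0xBBC11F38) differ by exactly 0x20000000. Whether that means anything is not established here.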

Seth Forshee (sforshee) wrote :

A synchronous exception is a direct result of the executing software -- illegal memory access, invalid instruction, etc. So this looks different from the IRQ exceptions, which are caused by interrupts from devices external to the CPU. However, since this is a VM guest context, the exceptions are presumably all generated by the hypervisor, and I'm not currently familiar with exactly how that works.

dann frazier (dannf) wrote :

On ARM, accelerated guests have the host interrupt controller somewhat "passed through". I say that while vigorously waving my hands because I don't recall the details well - but I do know that we needed to teach the entire stack (host kernel -> QEMU -> edk2) about GICv3 and that was new in the xenial release. AIUI, the existing hosts in ScalingStack would've been X-Gene w/ older GICv2m controllers. I suspect the root cause of these issues lies somewhere in the GICv3 support in QEMU and/or edk2 - see bug 1675522 for a similar report. I'd definitely suggest upgrading to >= bionic if possible. Or, if not, maybe consider enabling the cloud archive to get a newer virt stack. If neither is possible, we could also of course try and bisect down the fixes for the xenial stack and try and SRU them back - but that will certainly take some time.
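
One quick way to see which interrupt controller a given host or guest reports is to look at the chip-name column of /proc/interrupts. The sketch below is illustrative only: the function name `irq_chips` is hypothetical, and it assumes the chip name appears as a whitespace-separated field starting with "GIC" (for example "GICv3"). The exact names and column layout vary between kernel versions.

```python
from collections import Counter


def irq_chips(proc_interrupts_text: str) -> Counter:
    """Count numbered interrupt lines per chip name starting with 'GIC'."""
    chips = Counter()
    for line in proc_interrupts_text.splitlines():
        fields = line.split()
        # Skip the CPU header row and summary rows such as "IPI0:" or "Err:".
        if not fields or not fields[0].rstrip(":").isdigit():
            continue
        for field in fields[1:]:
            if field.startswith("GIC"):
                chips[field] += 1
                break
    return chips
```

It could be fed the contents of /proc/interrupts read on the machine in question, or a saved copy such as the ProcInterrupts.txt attachment.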

dann frazier (dannf) wrote :

Note that I can reproduce on a Hi1616 (GICv3) system - I get the same "IRQ Exception at 0x00000000BBC11F38" on some reboots. I'll see if I can use this system to track down the problem - or at least identify the problematic component(s).

Iain Lane (laney) wrote : Re: [Bug 1926955] Re: [VMs on] arm64 eMAG system often failing to reboot with "IRQ Exception at 0x000000009BC11F38"

On Wed, May 05, 2021 at 03:12:43PM -0000, dann frazier wrote:
> On ARM, accelerated guests have the host interrupt controller somewhat
> "passed through". I say that while vigorously waving my hands because I
> don't recall the details well - but I do know that we needed to teach
> the entire stack (host kernel -> QEMU -> edk2) about GICv3 and that was
> new in the xenial release. AIUI, the existing hosts in ScalingStack
> would've been X-Gene w/ older GICv2m controllers. I suspect the root
> cause of these issues lie somewhere in the GICv3 support in QEMU and/or
> edk2 - see bug 1675522 for a similar report. I'd definitely suggest
> upgrading to >= bionic if possible. Or, if not, maybe consider enabling
> the cloud archive to get a newer virt stack. If neither is possible, we
> could also of course try and bisect down the fixes for the xenial stack
> and try and SRU them back - but that will certainly take some time.
>

Thanks for the analysis and for helping to look at this, it's really
appreciated.

I suspect getting ScalingStack dist-upgraded to a newer release will be
a difficult task... upgrading the virt stack alone *may* be possible, we
can check that (the cloud archive doesn't contain edk2 though). It's
also running Launchpad buildds though. I would expect reluctance to do
this from their side due to the risks and, annoying as it is, think we
should plan on identifying & doing the backports. Sorry :(

I'll file a ticket with IS for their input...

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Iain Lane (laney) wrote :

> I'll file a ticket with IS for their input...

Canonical staff can see this at rt #130839

dann frazier (dannf) wrote :

Thanks. I'm pretty confident this is the same issue as bug 1675522, so I'll mark this as a duplicate and update that bug with my progress.
