Ubuntu
linux package

[VMs on] arm64 eMAG system often failing to reboot with "IRQ Exception at 0x000000009BC11F38"

Bug #1926955 reported by Iain Lane on 2021-05-03

This bug report is a duplicate of: Bug #1675522: arm64: Synchronous Exception at 0x00000000BBC129BC.

This bug affects 1 person

Affects	Status	Importance	Assigned to
edk2 (Ubuntu)	New	Undecided	Unassigned
linux (Ubuntu)	Confirmed	Undecided	Unassigned
qemu (Ubuntu)	New	Undecided	Unassigned

Bug Description

We recently installed some new arm64 Ampere eMAG hardware in the ScalingStack region which we use to run autopkgtests.

Guests (currently impish VMs on OpenStack) which we spawn often fail to reboot, repeating this over and over:

IRQ Exception at 0x000000009BC11F38

Steps to reproduce:

1. Boot a current impish daily cloud image (have not confirmed other releases) in a VM on an Ampere eMAG (have not verified bare metal).
2. Wait for it to come up, and then issue `sudo reboot`
3. If the machine comes back up again, repeat #2 a few times. This is intermittent.
3a. If it fails, check "openstack console log show <machine id>" for the IRQ exception, which repeats indefinitely.
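
Step 3a could be checked mechanically. The following is a minimal sketch, assuming the console output has already been saved as plain text; the names `find_firmware_exceptions` and `looks_stuck` and the repeat threshold are illustrative and not part of any existing tool.

```python
import re

# Matches the firmware messages quoted in this report, e.g.
# "IRQ Exception at 0x000000009BC11F38".
EXCEPTION_RE = re.compile(
    r"^[ \t]*(\w+) Exception at 0x([0-9A-Fa-f]+)[ \t\r]*$", re.MULTILINE
)


def find_firmware_exceptions(console_log):
    """Return a (kind, address) tuple for every exception line in the log."""
    return [(kind, int(addr, 16)) for kind, addr in EXCEPTION_RE.findall(console_log)]


def looks_stuck(console_log, min_repeats=3):
    """Heuristic (an assumption, not a rule from this report): treat the guest
    as stuck if the last `min_repeats` non-blank lines are all exception lines."""
    lines = [line for line in console_log.splitlines() if line.strip()]
    tail = lines[-min_repeats:]
    return len(tail) == min_repeats and all(EXCEPTION_RE.match(line) for line in tail)
```

A test harness could call `looks_stuck()` on the saved console log after each reboot attempt and stop at the first failure.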

(stand by, I will get something to attach shortly)
---
ProblemType: Bug
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 May 3 12:45 seq
crw-rw---- 1 root audio 116, 33 May 3 12:45 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu65
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CasperMD5CheckResult: unknown
DistroRelease: Ubuntu 21.10
Ec2AMI: ami-0000fd9f
Ec2AMIManifest: FIXME
Ec2AvailabilityZone: nova
Ec2InstanceType: m1.small
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lspci-vt:
 -[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
            \-01.0-[01-02]----01.0-[02]--
Lsusb: Error: command ['lsusb'] failed with exit code 1:
Lsusb-t:

Lsusb-v: Error: command ['lsusb', '-v'] failed with exit code 1:
MachineType: QEMU KVM Virtual Machine
Package: linux (not installed)
PciMultimedia:

ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.11.0-16-generic root=UUID=6bd452c7-16ff-4771-b2e8-36045e00cc68 ro nohz=off
ProcVersionSignature: Ubuntu 5.11.0-16.17-generic 5.11.12
RelatedPackageVersions:
linux-restricted-modules-5.11.0-16-generic N/A
linux-backports-modules-5.11.0-16-generic N/A
linux-firmware 1.197
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: impish ec2-images
Uname: Linux 5.11.0-16-generic aarch64
UnreportableReason: This report is about a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: False
acpidump:

dmi.bios.date: 02/06/2015
dmi.bios.release: 0.0
dmi.bios.vendor: EFI Development Kit II / OVMF
dmi.bios.version: 0.0.0
dmi.chassis.type: 1
dmi.chassis.vendor: QEMU
dmi.chassis.version: 1.0
dmi.modalias: dmi:bvnEFIDevelopmentKitII/OVMF:bvr0.0.0:bd02/06/2015:br0.0:svnQEMU:pnKVMVirtualMachine:pvr1.0:cvnQEMU:ct1:cvr1.0:
dmi.product.name: KVM Virtual Machine
dmi.product.version: 1.0
dmi.sys.vendor: QEMU

Iain Lane (laney) wrote on 2021-05-03: CRDA.txt

CRDA.txt (468 bytes, text/plain)

apport information

tags:	added: apport-collected ec2-images impish
description:	updated

Iain Lane (laney) wrote on 2021-05-03: CurrentDmesg.txt

CurrentDmesg.txt (24.7 KiB, text/plain)

apport information

Iain Lane (laney) wrote on 2021-05-03: Lspci.txt

Lspci.txt (2.0 KiB, text/plain)

apport information

Iain Lane (laney) wrote on 2021-05-03: ProcCpuinfoMinimal.txt

ProcCpuinfoMinimal.txt (185 bytes, text/plain)

apport information

Iain Lane (laney) wrote on 2021-05-03: ProcEnviron.txt

ProcEnviron.txt (248 bytes, text/plain)

apport information

Iain Lane (laney) wrote on 2021-05-03: ProcInterrupts.txt

ProcInterrupts.txt (650 bytes, text/plain)

apport information

Iain Lane (laney) wrote on 2021-05-03: ProcModules.txt

ProcModules.txt (2.5 KiB, text/plain)

apport information

Iain Lane (laney) wrote on 2021-05-03: UdevDb.txt

UdevDb.txt (53.9 KiB, text/plain)

apport information

Iain Lane (laney) wrote on 2021-05-03: WifiSyslog.txt

WifiSyslog.txt (28.8 KiB, text/plain)

apport information

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2021-05-03: Status changed to Confirmed

#10

This change was made by a bot.

Changed in linux (Ubuntu):
status:	New → Confirmed
tags:	added: hirsute

Iain Lane (laney) on 2021-05-03

description:	updated

Iain Lane (laney) wrote on 2021-05-03: Re: arm64 eMAG system often failng to reboot with "IRQ Exception at 0x000000009BC11F38"

#11

I captured the console log when the bug happens. I was hoping there would be something interesting just before the exception starts spamming the console, but sadly there isn't.

[ OK ] Finished Reboot.
[ OK ] Reached target Reboot.
[ 17.856242] reboot: Restarting system

IRQ Exception at 0x00000000BBC11F38

(keeps spamming this message)

Seth Forshee (sforshee) wrote on 2021-05-03:

#12

I couldn't find anywhere in the kernel that might be printing this message, so I went searching. It looks like it is probably coming from the firmware:

https://github.com/tianocore/edk2/blob/master/ArmPlatformPkg/PrePeiCore/AArch64/ArchPrePeiCore.c#L29

Iain Lane (laney) on 2021-05-03

summary:

- arm64 eMAG system often failng to reboot with "IRQ Exception at
- 0x000000009BC11F38"
+ [VMs on] arm64 eMAG system often failing to reboot with "IRQ Exception
+ at 0x000000009BC11F38"

Iain Lane (laney) wrote on 2021-05-03:

#13

FTR, I just checked with Canonical IS and the affected host systems are running Xenial.

(We also have some different hardware which is running the same versions too and doesn't suffer from this bug, so I do suspect it is somehow hw specific.)

Iain Lane (laney) wrote on 2021-05-05:

#14

I've just noticed a few failing with "Synchronous Exception at 0x00000000984FA4A8" instead of the IRQ Exception seen earlier.

Seth Forshee (sforshee) wrote on 2021-05-05:

#15

A synchronous exception is a direct result of the executing software -- illegal memory access, invalid instruction, etc. So this looks different from the IRQ exceptions, which are caused by interrupts from devices external to the CPU. However, since this is a VM guest context, the exceptions are presumably all generated by the hypervisor, and I'm not currently familiar with exactly how that works.
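
To see how often each failure mode shows up across many guests, the messages could be tallied by exception type. This is a minimal sketch, assuming one collected message per failed reboot; `tally_exceptions` is a hypothetical helper, and the sample messages are the ones quoted in this report.

```python
import re
from collections import Counter

MESSAGE_RE = re.compile(r"(\w+) Exception at 0x([0-9A-Fa-f]+)")


def tally_exceptions(messages):
    """Count messages per exception kind and collect the distinct addresses seen."""
    counts = Counter()
    addresses = {}
    for message in messages:
        match = MESSAGE_RE.search(message)
        if match is None:
            continue  # not a firmware exception line
        kind, addr = match.group(1), int(match.group(2), 16)
        counts[kind] += 1
        addresses.setdefault(kind, set()).add(addr)
    return counts, addresses


observed = [
    "IRQ Exception at 0x000000009BC11F38",
    "IRQ Exception at 0x00000000BBC11F38",
    "Synchronous Exception at 0x00000000984FA4A8",
]
counts, addresses = tally_exceptions(observed)
print(dict(counts))  # {'IRQ': 2, 'Synchronous': 1}
```

Keeping the addresses per kind makes it easy to notice whether a given kind always faults at the same location or at several.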

dann frazier (dannf) wrote on 2021-05-05:

#16

On ARM, accelerated guests have the host interrupt controller somewhat "passed through". I say that while vigorously waving my hands because I don't recall the details well - but I do know that we needed to teach the entire stack (host kernel -> QEMU -> edk2) about GICv3, and that was new in the xenial release. AIUI, the existing hosts in ScalingStack would've been X-Gene w/ older GICv2m controllers. I suspect the root cause of these issues lies somewhere in the GICv3 support in QEMU and/or edk2 - see bug 1675522 for a similar report. I'd definitely suggest upgrading to >= bionic if possible. Or, if not, maybe consider enabling the cloud archive to get a newer virt stack. If neither is possible, we could of course also try to bisect down the fixes for the xenial stack and SRU them back - but that will certainly take some time.

dann frazier (dannf) wrote on 2021-05-05:

#17

Note that I can reproduce on a Hi1616 (GICv3) system - I get the same "IRQ Exception at 0x00000000BBC11F38" on some reboots. I'll see if I can use this system to track down the problem - or at least identify the problematic component(s).

Iain Lane (laney) wrote on 2021-05-05: Re: [Bug 1926955] Re: [VMs on] arm64 eMAG system often failing to reboot with "IRQ Exception at 0x000000009BC11F38"

#18

On Wed, May 05, 2021 at 03:12:43PM -0000, dann frazier wrote:
> On ARM, accelerated guests have the host interrupt controller somewhat
> "passed through". I say that while vigorously waving my hands because I
> don't recall the details well - but I do know that we needed to teach
> the entire stack (host kernel -> QEMU -> edk2) about GICv3 and that was
> new in the xenial release. AIUI, the existing hosts in ScalingStack
> would've been X-Gene w/ older GICv2m controllers. I suspect the root
> cause of these issues lie somewhere in the GICv3 support in QEMU and/or
> edk2 - see bug 1675522 for a similar report. I'd definitely suggest
> upgrading to >= bionic if possible. Or, if not, maybe consider enabling
> the cloud archive to get a newer virt stack. If neither is possible, we
> could also of course try and bisect down the fixes for the xenial stack
> and try and SRU them back - but that will certainly take some time.
>

Thanks for the analysis and for helping to look at this, it's really
appreciated.

I suspect getting ScalingStack dist-upgraded to a newer release will be
a difficult task... upgrading the virt stack alone *may* be possible, we
can check that (the cloud archive doesn't contain edk2 though). It's
also running Launchpad buildds though. I would expect reluctance to do
this from their side due to the risks and, annoying as it is, think we
should plan on identifying & doing the backports. Sorry :(

I'll file a ticket with IS for their input...


Iain Lane (laney) wrote on 2021-05-05:

#19

> I'll file a ticket with IS for their input...

Canonical staff can see this at rt #130839

dann frazier (dannf) wrote on 2021-05-06:

#20

Thanks. I'm pretty confident this is the same issue as bug 1675522, so I'll mark this as a duplicate and update that bug with my progress.
