UEFI in ovmf package causes kernel panic

Bug #1821729 reported by Riccardo Pittau
Affects         Status    Importance  Assigned to  Milestone
edk2 (Ubuntu)   Invalid   Undecided   Unassigned
  Bionic        Invalid   Undecided   Unassigned
  Cosmic        Invalid   Undecided   Unassigned

Bug Description

UBUNTU info
Description: Ubuntu 18.04.1 LTS
Release: 18.04

PACKAGE info
ovmf:
  Installed: 0~20180205.c0d9813c-2
  Candidate: 0~20180205.c0d9813c-2
  Version table:
 *** 0~20180205.c0d9813c-2 500
        500 http://nova.clouds.archive.ubuntu.com/ubuntu bionic/universe amd64 Packages
        100 /var/lib/dpkg/status

Expected:
Virtual machines booting from the UEFI firmware provided by the ovmf package should work fine.
This works correctly on Ubuntu Xenial with ovmf package 0~20160408.ffea0a2c-2, which provides EFI v2.60 by EDK II.

What happens:
Virtual machines booting from this UEFI firmware crash with a kernel panic.
Provided UEFI version: EFI v2.70 by EDK II.

To boot a VM correctly on bionic, we downgraded the package.
Here's an example from a successful automated devstack job:
https://review.openstack.org/647687 -> ironic-tempest-ipa-partition-uefi-pxe_ipmitool-tinyipa
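
The downgrade can be done along these lines (a rough sketch, not the exact commands from the CI job; it assumes adding the xenial archive as an extra apt source and holding the old build):

  $ echo 'deb http://archive.ubuntu.com/ubuntu xenial main universe' | \
      sudo tee /etc/apt/sources.list.d/xenial.list
  $ sudo apt-get update
  $ sudo apt-get install ovmf=0~20160408.ffea0a2c-2
  $ sudo apt-mark hold ovmf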

Revision history for this message
dann frazier (dannf) wrote :

Please provide the console log and libvirt xml for a failing VM.
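
Something like the following should capture both, assuming a libvirt-managed domain (the domain name here is just an example):

  $ virsh dumpxml node-0 > node-0.xml
  $ virsh console node-0 | tee node-0-console.log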

Changed in edk2 (Ubuntu):
status: New → Incomplete
Revision history for this message
Riccardo Pittau (rpittau) wrote :

console log attached

Revision history for this message
Riccardo Pittau (rpittau) wrote :

libvirt xml of virtual node attached

Revision history for this message
dann frazier (dannf) wrote :

@Riccardo: Thanks, I can reproduce by using machine='pc-1.0' in my XML (pc-i440fx-2.12 worked fine). I also found that the version of ovmf in disco works w/ pc-1.0, so I'll try to bisect down what fixed it.
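
For reference, an equivalent direct-QEMU invocation for testing would look roughly like this (a sketch; the firmware paths assume the stock bionic ovmf package):

  $ cp /usr/share/OVMF/OVMF_VARS.fd .
  $ qemu-system-x86_64 -machine pc-1.0 -m 1024 -nographic \
      -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd \
      -drive if=pflash,format=raw,file=OVMF_VARS.fd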

Revision history for this message
dann frazier (dannf) wrote :

Bisection led me to the following fix:
 https://github.com/tianocore/edk2/commit/272658b9971865812f70e494d0198f13df8b841b

I've uploaded test builds to a PPA; can you verify that the 18.04 ('bionic') one fixes the problem for you?
  https://launchpad.net/~dannf/+archive/ubuntu/test
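
To pull the test build in (sketch):

  $ sudo add-apt-repository ppa:dannf/test
  $ sudo apt-get update
  $ sudo apt-get install --only-upgrade ovmf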

Changed in edk2 (Ubuntu Cosmic):
status: New → Incomplete
Changed in edk2 (Ubuntu Bionic):
status: New → Incomplete
Revision history for this message
Riccardo Pittau (rpittau) wrote :

@Dann: thanks for checking that. I ran some tests and, although I can no longer reproduce the kernel panic, I'm seeing another issue that was also happening before with virtio: the VM is not booting via iPXE.

This is the correct output on bionic using virtio + ovmf from xenial:
http://logs.openstack.org/87/647687/5/check/ironic-tempest-ipa-partition-uefi-pxe_ipmitool-tinyipa/a16650a/controller/logs/ironic-bm-logs/node-0_no_ansi_2019-03-27-14:18:35_log.txt.gz

This is a test done with virtio and the new package on bionic:
http://logs.openstack.org/07/645507/7/check/ironic-tempest-ipa-partition-uefi-pxe_ipmitool-tinyipa/2660087/controller/logs/ironic-bm-logs/node-0_no_ansi_2019-04-01-09:57:23_log.txt.gz

This is the output changing the driver to e1000 and using the default rom on bionic:
http://logs.openstack.org/38/648938/1/check/ironic-tempest-ipa-partition-uefi-pxe_ipmitool-tinyipa/c943520/controller/logs/ironic-bm-logs/node-0_no_ansi_2019-04-01-11:01:55_log.txt.gz

This is the output using e1000 driver using the rom from ipxe-qemu on bionic:
http://logs.openstack.org/38/648938/2/check/ironic-tempest-ipa-partition-uefi-pxe_ipmitool-tinyipa/1c53407/controller/logs/ironic-bm-logs/node-0_no_ansi_2019-04-01-13:15:39_log.txt.gz

I reproduced all the tests locally to confirm, and also swapped ipxe-qemu on bionic for the xenial version to check whether the regression was in that package, but that didn't help, so I think this is another problem in recent versions of ovmf.

Revision history for this message
dann frazier (dannf) wrote :

@Riccardo: Sorry, I'm not sure how to read your update. Are you stating that you are unable to reproduce the panic *after installing a PPA build*, or for some other reason?

Can you confirm that the ipxe thing is a separate issue? If so, please file a new bug for that instead of overloading this one.

Revision history for this message
Riccardo Pittau (rpittau) wrote :

@Dann: what I'm saying is that I can't reproduce the kernel panic after installing a PPA build because it doesn't load from ipxe anymore.

I can't confirm or deny that the kernel panic issue is actually fixed.

If you prefer, I can open another bug; it does seem like a regression to me, but it is of course a different issue.

Revision history for this message
dann frazier (dannf) wrote : Re: [Bug 1821729] Re: UEFI in ovmf package causes kernel panic

On Mon, Apr 1, 2019 at 9:50 AM Riccardo Pittau
<email address hidden> wrote:
>
> @Dann: what I'm saying is that I can't reproduce the kernel panic after
> installing a PPA build because it doesn't load from ipxe anymore.
>
> I can't confirm or deny that the kernel panic issue is actually
> fixed.
>
> If you prefer, I can open another bug; it does seem like a regression
> to me, but it is of course a different issue.

OK, I understand now - thanks for clarifying.

  -dann

Revision history for this message
dann frazier (dannf) wrote :

@Riccardo: The PPA builds I provided had both a fix for this issue and fixes for some security issues, some related to networking code. To help determine whether it is the security fixes or the fix for this bug that is causing your iPXE regression, I've uploaded new packages *without* the security fixes. Could you test those? They are in the same PPA.

Unfortunately, I'm unable to reproduce the iPXE issue myself. With ovmf 0~20180205.c0d9813c-2ubuntu0.1~ppa.1 from my PPA, I'm able to boot into iPXE just fine:

https://paste.ubuntu.com/p/T5Pp7GCmrH/

I haven't configured iPXE to actually boot a kernel, so it stops there - but yours appears to hang well before that. I've actually only used iPXE a few times - what I'm doing is using the internal virtio PXE support from ovmf to PXE boot the iPXE payload. I don't see any messages in your log prior to iPXE - are you doing the same thing?
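
Roughly, that kind of chain can be reproduced with something like the following (a sketch; the TFTP root and the ipxe.efi payload location are placeholders, and OVMF_VARS.fd is a local copy of /usr/share/OVMF/OVMF_VARS.fd):

  $ qemu-system-x86_64 -m 1024 -nographic \
      -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd \
      -drive if=pflash,format=raw,file=OVMF_VARS.fd \
      -netdev user,id=net0,tftp=/srv/tftp,bootfile=ipxe.efi \
      -device virtio-net-pci,netdev=net0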

Revision history for this message
Riccardo Pittau (rpittau) wrote :

@Dann: thanks for continuing to look at this. I tested the new package both locally and in our CI, using the virtio and e1000 drivers, but unfortunately I'm still seeing the issue where the VM can't load from iPXE.
The only thing we change for the test is the ovmf package; the rest of the configuration is exactly the same.
I'm quite puzzled at this point, seeing that you were actually able to load from iPXE correctly; in my case the VM seems to hang before getting an IP during the PXE boot phase.

Revision history for this message
dann frazier (dannf) wrote :

@Riccardo: I found some time to try and reproduce the ipxe issue again today but, unfortunately, I don't remember what the failure looked like and the logs in comment #6 are no longer accessible. Would you be able to attach some logs here?

Revision history for this message
Riccardo Pittau (rpittau) wrote :
Revision history for this message
dann frazier (dannf) wrote :

Thanks Riccardo. I'm still unable to reproduce, but I'm not sure I'm doing the right thing. Can you elaborate on how iPXE is getting loaded in the "good" case? i.e., are you loading it from disk, or chainloading to it from OVMF's built-in PXE client? In your "bad" logs, it looks like you are booting using OVMF's built-in PXE, and that is just hanging. But in your "good" logs, I see no evidence of the OVMF PXE running ("Start PXE over IPv4."); iPXE just seems to start directly.

Also - while I agree this does look like a regression, can you confirm if this is a regression between ovmf/xenial and ovmf/bionic (vs. a regression between ovmf/bionic and ovmf/my ppa)? That is, does the official bionic build have the same problem as my PPA build?

Revision history for this message
dann frazier (dannf) wrote :

On Mon, May 20, 2019 at 2:15 AM Riccardo Pittau
<email address hidden> wrote:
>
> Hi Dann,
>
> Please find recent logs here:
>
> This is the correct output on bionic using virtio + ovmf package from xenial:
> http://logs.openstack.org/90/644590/8/gate/ironic-tempest-ipa-partition-uefi-pxe_ipmitool-tinyipa/6232932/controller/logs/ironic-bm-logs/node-0_no_ansi_2019-05-18-12:09:41_log.txt.gz
>
> Job on bionic using virtio + new package from your repo:
> http://logs.openstack.org/07/645507/10/check/ironic-tempest-ipa-partition-uefi-pxe_ipmitool-tinyipa/d9a0a64/controller/logs/ironic-bm-logs/node-0_no_ansi_2019-05-18-10:13:23_log.txt.gz
>
> Job on bionic using e1000 + new package from your repo:
> http://logs.openstack.org/38/648938/5/check/ironic-tempest-ipa-partition-uefi-pxe_ipmitool-tinyipa/90c5126/controller/logs/ironic-bm-logs/node-0_no_ansi_2019-05-18-10:08:08_log.txt.gz

(Attaching these logs here for safe keeping)

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for edk2 (Ubuntu Bionic) because there has been no activity for 60 days.]

Changed in edk2 (Ubuntu Bionic):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for edk2 (Ubuntu) because there has been no activity for 60 days.]

Changed in edk2 (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for edk2 (Ubuntu Cosmic) because there has been no activity for 60 days.]

Changed in edk2 (Ubuntu Cosmic):
status: Incomplete → Expired
Revision history for this message
Riccardo Pittau (rpittau) wrote :

Hi Dann,

unfortunately I was diverted to other things and couldn't test again until now.

I don't see your files anymore, but I did some more testing with packages up to eoan.

I can confirm that the regression is between xenial and bionic versions:
0~20160813.de74668f-2 works (tested in bionic)
0~20180205.c0d9813c-2ubuntu0.1 does not work -> kernel panic (tested in bionic)
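
In case anyone wants to repeat the comparison, the old binaries can be fetched from Launchpad roughly like this (assuming pull-lp-debs from ubuntu-dev-tools; the exact tool is an assumption, any way of grabbing the old .deb works):

  $ pull-lp-debs ovmf 0~20160813.de74668f-2
  $ sudo dpkg -i ovmf_0~20160813.de74668f-2_all.deb
  $ sudo apt-mark hold ovmf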

Changed in edk2 (Ubuntu):
status: Expired → Incomplete
Changed in edk2 (Ubuntu Bionic):
status: Expired → Incomplete
Revision history for this message
Riccardo Pittau (rpittau) wrote :
Revision history for this message
Riccardo Pittau (rpittau) wrote :

Hi Dann,

I tried to narrow down where the failure happens, and it seems to be between versions 0~20161202.7bbe0b3e-1 (Zesty) and 0~20170911.5dfba97c-1 (Artful).
Just to clarify, version 0~20161202.7bbe0b3e-1 (Zesty) seems to work fine in my environment, while version 0~20170911.5dfba97c-1 (Artful) starts giving a kernel panic.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Setting back to "new" since the requested information was provided.

Changed in edk2 (Ubuntu):
status: Incomplete → New
Changed in edk2 (Ubuntu Bionic):
status: Incomplete → New
Revision history for this message
Riccardo Pittau (rpittau) wrote :

Hi, just wondering if we can provide any more info or help to move the investigation forward.
Thanks.

Revision history for this message
Bryce Harrington (bryce) wrote :

Hi Riccardo,

Thanks for following up, hopefully Dann will chime in with some pertinent directions. I don't know edk2 myself, but can give some general advice on helping move bugs like this one forward.

You've identified two boundary points for the issue, and at least have a local way to reproduce the bug reliably on your own hardware. Given that, one often effective technique is to do a git bisect search. There appear to be 1577 commits to sort through:

  $ git clone https://github.com/tianocore/edk2.git
  $ cd edk2/
  $ git rev-list 7bbe0b3e..5dfba97c --count
  1577

Details of the commits are here:

  $ git shortlog 7bbe0b3e..5dfba97c
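
The bisect loop itself would look roughly like this (the build step is only a sketch; OvmfPkg/build.sh needs a working edk2 build environment):

  $ git bisect start
  $ git bisect bad 5dfba97c          # first known-bad snapshot (artful)
  $ git bisect good 7bbe0b3e         # last known-good snapshot (zesty)
  # at each step: build OVMF, boot the failing VM with it, then mark the result
  $ OvmfPkg/build.sh -a X64 -b RELEASE
  $ git bisect good                  # or: git bisect bad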

An alternative approach, since this is a compiled C program, might be to set up traces on the process and identify the point in the edk2 codebase that triggers the kernel fault. I am not sure if this is easy or hard to do in your environment, so it may or may not be feasible.

Barring that, the next most helpful information is usually to have detailed steps to reproduce, or a simplified test case, that would enable other engineers to trigger the fault and then use one of the aforementioned procedures to isolate the fault and/or fix.

Revision history for this message
dann frazier (dannf) wrote :

I think I've figured out how to simulate the OVMF PXE -> iPXE chaining you are doing, which had me confused in Comment #14. I also see the problem being introduced between the versions you identified, and was able to bisect it down to the following commit:

# first bad commit: [6e5e544f227f031d0b45828b56cec5668dd1bf5b] OvmfPkg: Install BGRT ACPI table

Indeed, only in your "bad" console log does Linux report the BGRT table:

ACPI: RSDP 0x000000001FBFA014 000024 (v02 INTEL )
ACPI: XSDT 0x000000001FBF90E8 000034 (v01 INTEL EDK2 00000002 01000013)
ACPI: BGRT 0x000000001FBF8000 000038 (v01 INTEL EDK2 00000002 01000013)

It also crashes immediately after initializing the ACPI interpreter, adding more evidence that this is the problem:

ACPI: Core revision 20170728
BUG: unable to handle kernel paging request at ffff88201f015b34
IP: 0xffffffff81311d4c
PGD 2c68067 P4D 2c68067 PUD 0

This led me to wonder if there's just a bug in your kernel dealing with the BGRT. In fact, the very first Linux commit from this search shows a fix for a crash like this one (hard to be sure, this kernel doesn't do symbol resolution for us):

$ git log --reverse --oneline v4.4..origin/master --grep BGRT | head -1
50a0cb565246f x86/efi-bgrt: Fix kernel panic when mapping BGRT data

If you have a way to rebuild your kernel to apply that patch, I'd suggest trying that. I also see that there's a new tinycore kernel out - maybe you can upgrade?
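
Roughly (a sketch; substitute the tag your tinycore kernel is actually built from, and your usual kernel config):

  $ git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
  $ cd linux && git checkout <your-kernel-tag>
  $ git cherry-pick 50a0cb565246f    # x86/efi-bgrt: Fix kernel panic when mapping BGRT data
  $ make olddefconfig && make -j"$(nproc)" bzImage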

I'll go ahead and close this out as Invalid for edk2 - but feel free to reopen if you find evidence that my conclusion is incorrect.

Changed in edk2 (Ubuntu Cosmic):
status: Expired → Invalid
Changed in edk2 (Ubuntu Bionic):
status: New → Invalid
Changed in edk2 (Ubuntu):
status: New → Invalid
Revision history for this message
Riccardo Pittau (rpittau) wrote :

Hey Dann, thanks a lot for the thorough analysis, the reasoning makes sense.
I'm just a bit sceptical because we use tinycore 9.x at the moment, with kernel 4.14, which already includes the patch for the kernel panic you mentioned (the code was just moved under drivers/firmware/efi/efi-bgrt.c), so I'm not sure that could be the explanation for the error.
We're in the process of upgrading tinycore to 10.x, which will bring in kernel 4.19, although I don't see any substantial changes in the efi-bgrt handling.
I wonder if there was a regression at some point, but that seems unlikely.
I'll let you know how it goes with tinycore 10.x, in the meantime if you have any other ideas or advice, please let me know.
Thanks!
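
For reference, this can be double-checked in a mainline Linux clone; the fix landed well before 4.14:

  $ git merge-base --is-ancestor 50a0cb565246f v4.14 && echo "fix is included"
  fix is included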
