Random boot failure with Ubuntu 20.04 / grub 2.04 and Hyper-V 2012r2

Bug #1918265 reported by ben
This bug affects 10 people
Affects: grub2 (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Hi,

We are experiencing some weird boot issues with Ubuntu 20.04 and Hyper-V 2012r2.

In two cases the installation completed successfully, as did the VM's initial boot, and then problems started to appear: the boot process began to fail.

GRUB starts, the menu pops up, and then, just after loading the kernel and initrd (I tried putting some echoes in the config to narrow it down), the VM somehow resets and the menu pops up again.

After a few of those, either the VM shuts down or the boot succeeds. It seems completely random.

Here it is captured on video:
  https://www.youtube.com/watch?v=5Bk3S-YGDZk

If I set up a direct EFI-stub boot of the kernel+initrd, the boot process works every time.

I suspected the "save_env/load_env" calls for a while, but a stripped-down grub.cfg gives the same result:

> insmod gzio
> insmod part_gpt
> insmod ext2
> set root='hd0,gpt2'
> if [ x$feature_platform_search_hint = xy ]; then
> search --no-floppy --fs-uuid --set=root --hint-bios=hd0,gpt2 --hint-efi=hd0,gpt2 --hint-baremetal=ahci0,gpt2 94ebc17e-6aca-4e42-b489-b3eaa8a32d90
> else
> search --no-floppy --fs-uuid --set=root 94ebc17e-6aca-4e42-b489-b3eaa8a32d90
> fi
> linux /vmlinuz-5.8.0-44-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro nosplash elevator=noop
> initrd /initrd.img-5.8.0-44-generic
> boot

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

It would be interesting to know:

- whether Secure Boot is on or off (if it is supported at all)

- which shim version is installed

- what more detailed debug messages GRUB produces with its debugging increased

- whether vmlinuz and/or the initrd are corrupted, or whether the disk itself is in need of an fsck

- whether using `linux-azure` yields better results instead of `linux-generic`

We could also escalate this to Azure / Microsoft.
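
For reference, a few commands that can be used to gather some of this information on a standard Ubuntu 20.04 UEFI guest (the package names assume a default install):

  # from the GRUB menu, press 'c' for a console, then enable verbose output
  set debug=all
  # inside the guest: installed shim/GRUB packages
  dpkg -l shim-signed grub-efi-amd64-signed
  # Secure Boot state as seen by the firmware
  mokutil --sb-state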

Revision history for this message
ben (benoit+one) wrote :

Hi,

So Secure Boot is off; the system doesn't boot with it on (a compatible key is not included with 2012R2's Hyper-V, I suppose).

The shim used was the one set up by the 20.04.2 install image; technically it's not the same binary as the one in the "shim" package:

78415fb8fb9b909f8029858113f1335f /boot/efi/EFI/ubuntu/shimx64.efi
9bdc83ad343e8745e1f3d55c36cf2df6 /usr/lib/shim/shimx64.efi

But some build information extracted from the binaries seems to say otherwise:

$Version: 15 $
$BuildMachine: Linux x86_64 x86_64 x86_64 GNU/Linux $
$Commit: a4a1fbe728c9545fc5647129df0cf1593b953bec $

$Version: 15 $
$BuildMachine: Linux x86_64 x86_64 x86_64 GNU/Linux $
$Commit: a4a1fbe728c9545fc5647129df0cf1593b953bec $
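
For reference, those build strings can be pulled from both binaries with something along these lines (the exact pattern is just an illustration):

  strings /boot/efi/EFI/ubuntu/shimx64.efi /usr/lib/shim/shimx64.efi | grep -E '\$(Version|Commit|BuildMachine):'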

I have attached a still frame captured from the GRUB boot with "set debug=all". It's the frame just before the VM reboots.

As for the linux-azure kernel, I see no difference in the boot process with any of the three following kernels:
  vmlinuz-5.8.0-44-generic
  vmlinuz-5.4.0-1041-azure
  vmlinuz-5.4.0-67-generic

It may be interesting to know, however, that when the linux-azure-5.4.0.1041.21 kernel was introduced into Ubuntu 18.04, I started to experience weird boot issues that led me to uninstall it and fall back to the 5.0.0-1036-azure kernel.

There might be some common issue here.

Revision history for this message
ben (benoit+one) wrote :

Forgot to add the part about data corruption: the kernel images and initrds used by GRUB and systemd-boot are absolutely identical:

4188eea45cdb76ebe2b313aef0ac9d7d3f2772cdcf536798b2eeef64fd33a810 vmlinuz-5.8.0-44-generic
4188eea45cdb76ebe2b313aef0ac9d7d3f2772cdcf536798b2eeef64fd33a810 efi/1c1f1f7eb0df44baa7b6399299ee251a/5.8.0-44-generic/linux

4fba2fb3cf0a46cfb839687d5431850473bcc702a1334ffc9dfb7aa06bd76694 initrd.img-5.8.0-44-generic
4fba2fb3cf0a46cfb839687d5431850473bcc702a1334ffc9dfb7aa06bd76694 efi/1c1f1f7eb0df44baa7b6399299ee251a/5.8.0-44-generic/initrd

And the file systems seem clean:

root@ubuntu-template:/# e2fsck /dev/disk/by-uuid/94ebc17e-6aca-4e42-b489-b3eaa8a32d90
e2fsck 1.45.5 (07-Jan-2020)
/dev/disk/by-uuid/94ebc17e-6aca-4e42-b489-b3eaa8a32d90: clean, 313/65536 files, 63289/262144 blocks

root@ubuntu-template:/# fsck.vfat /dev/disk/by-uuid/3AA9-F317
fsck.fat 4.1 (2017-01-24)
/dev/disk/by-uuid/3AA9-F317: 34 files, 69881/130812 clusters

Also, the issue is happening on another 2012R2 cluster, but on that one I cannot even start the installation process, which is very strange.

Revision history for this message
Dmitry Zakharov (yadimaz) wrote :

Hi.
I have the same problem when booting a VM with Ubuntu 20.04.2 on Hyper-V 2016, but only when VLAN 120 is specified in the VM configuration.
If no VLAN is specified, or it differs from 120, the VM boots correctly.

Changed in grub2 (Ubuntu):
assignee: nobody → Dmitry Zakharov (yadimaz)
assignee: Dmitry Zakharov (yadimaz) → nobody
Revision history for this message
Chad M (cmmikuta) wrote (last edit ):

Hello,
We are experiencing the same issue running 20.04.2, with the exception of the crashing part. We have fully updated 2012R2 hosts, and when we use VLANs 20 or 40 there is a good chance the VM will get stuck in a kind of boot loop when we try to restart it. This doesn't happen every time either, which is odd. When we use other VLANs, for example 21 or 41, the VM seems to boot fine every time. I have tried this on different hosts with different hardware as well, with no change in behavior. We've also noticed the same behavior with CentOS 8.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in grub2 (Ubuntu):
status: New → Confirmed
Revision history for this message
Paul Chilton (pchilton) wrote :

I'm also getting the same problem, across 3x Windows 2016 machines running Hyper-V. I had this issue with Ubuntu 18.04 and now 20.04, but only when the VMs are connected to a VLAN. My VLAN assignments are 10, 40, 100, 110, 120, 200; I'm going to test next with an odd-numbered VLAN, as that seems to be a pattern in all the above reports... I have tested and I get the same behaviour whether the VLAN is assigned in the Hyper-V UI, from PowerShell, or even when using a default VLAN on the network switch and leaving Hyper-V unaware of it.

Sometimes you can get the machine to boot by hitting Enter on the GRUB boot screen (often 10+ attempts). A sure-fire way to get it to boot is to go into the Hyper-V VM settings, untag the VLAN from the network interface, hit Enter in the console to get it to boot, and then re-add the VLAN assignment. This gets very boring very quickly when repeated on multiple machines during maintenance, and I would love a fix or workaround.

Revision history for this message
Paul Chilton (pchilton) wrote :

Having re-tested this with one particular VM that was failing, it appears to be happy on any VLAN but 10 (which most of my clients are on at the moment, including the Hyper-V server itself).

To add some more detail in case it's relevant: on each server I have 2x LAN teams, one for the Windows 2016 host's own use with a static IP on VLAN 10, and the other dedicated to Hyper-V, set up as a trunk port. Each LAN team has 2x 10 Gbps links.

Revision history for this message
Evgeny (evgeny-b) wrote :

Same problem here with Ubuntu Server 20.04 and Hyper-V on Windows Server 2016 with generation 2 VMs.
We also noticed that if you set the virtual switch to "Not connected" for the affected VM, it will boot every time, and you can set the required switch back after boot. But when any other switch is selected, the VM won't boot most of the time, or at all, with the same symptoms.

Any progress in solving this?

Revision history for this message
Lucas Navarezi (lucas-navarezi) wrote :

We are currently facing the same problems with Linux Virtual Machines running Ubuntu 18.04 and 20.04, both LTS releases.

Upon examining the Event Viewer on one of the host machines, we found a critical error message:

Event ID: 18602
Source: Hyper-V-Worker

<VM_NAME> has encountered a fatal error and a memory dump has been generated. The guest operating system reported that it failed with the following error code: 0x1E. If the problem persists, contact Product Support for the guest operating system. (Virtual machine ID XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX)

These are some fixes we have tried so far:
   - Install linux-azure packages and kernel;
   - Enable Secure boot on Guest Machine;
   - Scan the disk for corruption;
   - Try different kernels.

We couldn't try other fixes like updating the Hyper-V hosts to 2019 or disabling Secure Boot on the host machines, the reason being that we have around 200 VMs across 8 hosts.

Current configuration:
   - 2 Clusters;
   - 4 Hosts per cluster;
   - All hosts are running Hyper-V 2016.

Unfortunately, we are not the only ones with this issue; the bug also affects the Debian and RHEL distros:

https://access.redhat.com/solutions/4796261
https://docs.microsoft.com/en-us/answers/questions/530961/windows-server-2016-hyper-v-boot-error-w-virtual-s.html

These ones describe the same issue:
https://docs.microsoft.com/en-us/answers/questions/52937/failure-to-boot-on-red-hat-enterprise-linux-rhel-o.html
https://social.technet.microsoft.com/Forums/en-US/1da1f987-52f0-4304-84f1-2c0ab52f3586/failure-to-boot-on-red-hat-enterprise-linux-rhel-or-centos-8-using-hyperv-2016?forum=linuxintegrationservices
https://social.technet.microsoft.com/Forums/en-US/3c48c962-a28d-44bb-bd80-5b7a902404d8/failure-to-boot-on-red-hat-enterprise-linux-rhel-or-centos-8-using-hyperv-2016?forum=winserverhyperv
https://www.reddit.com/r/HyperV/comments/hx2cps/failure_to_boot_on_red_hat_enterprise_linux_rhel/

Revision history for this message
Dave Barnum (dbwycl) wrote :

Same issue with a generation 2 VM, Ubuntu 20.04.3, Hyper-V 2016, linux-azure kernel in use. It does not happen with our generation 1 VMs. The same workarounds of unselecting the VLAN ID checkbox and/or setting the NIC to 'Not connected' work for us as well.

Revision history for this message
Damian Kramer (jesuscollege) wrote :

I can confirm the same problem has just hit us: a Gen 2 VM with the linux-generic kernel running on Hyper-V 2016. It had been happily running for many months; we did a reboot and it failed to boot. After disconnecting all of the interfaces from the virtual switch, it booted. This particular VM is our DHCP server, so it is fairly critical. We think we're going to need to look at something other than Ubuntu now, as we need to rely on this in production :(

Revision history for this message
Damian Kramer (jesuscollege) wrote :

The Reddit link posted by Lucas above suggests that Hyper-V 2019 doesn't have these issues. Can anyone confirm?

Revision history for this message
Lucas Navarezi (lucas-navarezi) wrote :

Hello everyone,

Apparently Debian 11 (5.10.0-9-amd64) solved these issues.
We currently have some VMs running it.
I tested with a cron job that reboots one VM every 5 minutes, and it neither produced the 0x1E error in Event Viewer nor shut itself off.
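
(For anyone wanting to repeat that test, a root crontab entry along these lines will do it, assuming /sbin/reboot is available:)

  */5 * * * * /sbin/reboot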

If anyone can test and confirm this, maybe we can compare the differences between the kernels, packages, etc.

Revision history for this message
Manuel Meitinger (meitinger) wrote :

Hello,
Since 5.4.0-92 we're also experiencing this issue on Ubuntu 20.04.3 VMs on two Hyper-V 2012r2 servers.
On the same servers there are also other VMs with this kernel (and 5.4.0-94) booting without issues.
As described earlier, booting sometimes works and sometimes doesn't.

Setting `debug=all` for GRUB and adding `boot_delay=5000 panic=0` to the kernel params reveals that after

`loader/efi/linux.c:96: kernel_addr: 0x2bfa1000 handover_offset: 0x190 params: 0x3ec0f000`

the system just reboots.

I have, however, not found the event that @lucas-navarezi mentioned in the Hyper-V-Worker log.
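
For anyone wanting to capture the same trace, a rough way to apply those kernel parameters persistently on Ubuntu (assuming the stock /etc/default/grub layout) is:

  # add the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.
  #   GRUB_CMDLINE_LINUX_DEFAULT="boot_delay=5000 panic=0"
  # then regenerate grub.cfg
  sudo update-grub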

Revision history for this message
Luiz Agostinho (luiz-agostinho) wrote :

Hi,

I'm having the same problem when booting a generation 2 VM with Ubuntu 20.04.3 LTS, but only when using VLAN tags 700, 702 and 703; with VLAN tag 120 the VM boots normally.

Tested with kernels 5.4.0-81, 5.4.0-94, 5.4.0-96, 5.4.0-97 and 5.11.0-1028-azure.

Revision history for this message
Matthew Bassett (matthewbassett) wrote :

Agreed, this has been making me crazy. Hyper-V Server 2012R2, fully up to date, with an Ubuntu 20.04 VM. If I have VLAN 10 tagged on a NIC, it will boot-loop. If I untag the VM, it boots fine, and then halfway through the boot I can add the VLAN 10 tag back and it works fine, until it automatically reboots for updates and then it's stuck offline again.

Revision history for this message
Hans Schraven (ha100) wrote :

I have the same problem. Is there a fix?

Revision history for this message
ben (benoit+one) wrote :

For my part I've switched to the bootloader from the systemd project (systemd-boot), which is working fine with Hyper-V, but there is little activity here apart from the "me too(s)".

Maybe we should eventually try taking this upstream.
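
For anyone wanting to try the same workaround, here is a minimal sketch of the switch on an Ubuntu UEFI guest; the kernel-install step and the assumption that bootctl can auto-detect the ESP under /boot/efi are mine, and the exact setup used above may differ:

  # install systemd-boot into the EFI system partition
  sudo bootctl install
  # copy the running kernel and initrd onto the ESP and create a loader entry
  # (the kernel command line is typically taken from /etc/kernel/cmdline if present)
  sudo kernel-install add "$(uname -r)" "/boot/vmlinuz-$(uname -r)" "/boot/initrd.img-$(uname -r)"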

Revision history for this message
Austin (austin-intervision) wrote :

I have a client who is exhibiting this same issue, up through Ubuntu 20.04.2. Has there been any remedy or diagnosis for this situation?

We have had success with removing the network connection and reconnecting it once at the GRUB menu. Also, we aren't currently using the VLAN function, so that doesn't seem to be a factor for us.

Revision history for this message
ben (benoit+one) wrote :

I don't use this environment anymore, but I ended up switching to the "systemd-boot" bootloader on the Ubuntu template, and that was working perfectly.

Revision history for this message
greg long (gelowe) wrote :

Thank you for posting the fact that EFI-stub boot allows the system to boot.
This is also a problem with all RHEL 8 or 9 variants on Hyper-V 2016.
I set up scripts for EFI-stub boot and posted them here:
https://github.com/gee456/efistub-boot-ol8

Revision history for this message
Luiz Agostinho (luiz-agostinho) wrote :

Hi,

Greg's solution fixed it in my environment; I used the commands below on Ubuntu 22.04:

# mkdir /boot/efi/EFI/custom/
# cp /boot/initrd.img-$(uname -r) /boot/efi/EFI/custom/
# cp /boot/vmlinuz-$(uname -r) /boot/efi/EFI/custom/vmlinuz-$(uname -r).efi
# efibootmgr -c -d /dev/sda -p 1 -L "Ubuntu Linux EUFI Direct Boot" -l "\EFI\custom\vmlinuz-$(uname -r).efi" -u "root=/dev/mapper/vg-root ro crashkernel=auto resume=/swap.img rd.lvm.lv=vg/root initrd=\EFI\custom\initrd.img-$(uname -r)"
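
A quick way to confirm the new entry took effect is `efibootmgr -v`; if needed, the entry can be moved to the front of the boot order (the numbers below are placeholders for the actual entry IDs):

  efibootmgr -v
  efibootmgr -o 0003,0000,0001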

In Greg's post there is a service that runs to deal with Kernel updates.

Thanks Greg!

Revision history for this message
Rodrigo Huntermann (huntermann) wrote (last edit ):

Hi,

I have the same problem: if I add a virtual network card, Ubuntu 20.04 does not start; removing the virtual card and adding it back after boot works!

Does anyone have a solution?

PS: the solution from Luiz Agostinho did not work for me.

Thank you.

Revision history for this message
Givrix (geoffrey-mosini) wrote :

Installing the azure kernel (default version 6.2.0-1012) was a fix in my case.
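
(On Ubuntu that flavour can be pulled in with the meta-package, e.g.:)

  sudo apt install linux-azure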
