Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal

Bug #1946149 reported by Ian May
This bug affects 1 person
Affects          Status        Importance  Assigned to
linux (Ubuntu)   Invalid       Undecided   Unassigned
  Bionic         Fix Released  High        Unassigned

Bug Description

[ Impact ]
The bionic 4.15 kernels are failing to boot on r5.metal instances on AWS. The default kernel is bionic/linux-aws-5.4 (5.4.0-1056-aws); when changing to bionic/linux-aws (4.15.0-1113-aws) or bionic/linux (4.15.0-160.168), the machine fails to boot the 4.15 kernel.

This problem only appears on metal instances, which use NVMe instead of XVDA devices.

[ Fix ]
It was discovered that after reverting the following two commits from upstream stable the 4.15 kernels can be booted again on the affected AWS metal instance:

PCI/MSI: Enforce that MSI-X table entry is masked for update
PCI/MSI: Enforce MSI[X] entry updates to be visible

[ Test Case ]
Deploy an r5.metal instance on AWS with a bionic image, which should initially boot with bionic/linux-aws-5.4. Install bionic/linux or bionic/linux-aws (4.15-based) and reboot the system.
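A minimal sketch of these steps as a script, assuming the instance is already running the default bionic/linux-aws-5.4 kernel. The DRY_RUN guard is an addition here (the real commands reboot the machine), and the bionic/linux package name follows the usual Ubuntu naming pattern:

```shell
# Leave DRY_RUN at its default ("echo") to just print the commands;
# run with DRY_RUN="" on a real r5.metal instance to execute them.
DRY_RUN="${DRY_RUN-echo}"
$DRY_RUN sudo apt-get update
# 4.15-based kernel named in this report; for bionic/linux the package
# would be linux-image-4.15.0-160-generic instead.
$DRY_RUN sudo apt-get install -y linux-image-4.15.0-1113-aws
$DRY_RUN sudo reboot
# Without the fix the instance never comes back from the reboot; with it,
# boot takes the usual 10-15 minutes for r5.metal.
```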

[ Where problems could occur ]
These two commits are part of a larger patchset fixing PCI/MSI issues that was backported to some upstream stable releases. By reverting only part of the set we might end up with MSI issues that were not present before the whole set was applied. The regression potential can be minimized by testing the kernels with these two patches reverted on all available platforms.

[ Original Description ]
When creating an r5.metal instance on AWS, the default kernel is bionic/linux-aws-5.4 (5.4.0-1056-aws); when changing to bionic/linux-aws (4.15.0-1113-aws), the machine fails to boot the 4.15 kernel.

If I remove these patches the instance correctly boots the 4.15 kernel

https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html

With that being said, after successfully updating to 4.15 without those patches applied, I can then upgrade to a 4.15 kernel with the above patches included, and the instance will boot properly.

This problem only appears on metal instances, which use NVMe instead of XVDA devices.

AWS instances also use the 'discard' mount option with ext4, so I thought there could be a race condition between ext4 discard and journal flush. I removed 'discard' from the mount options and rebooted the 5.4 kernel prior to installing the 4.15 kernel, but the instance still wouldn't boot after the 4.15 kernel was installed.

I have been unable to capture a stack trace using 'aws ec2 get-console-output'. After enabling kdump I was unable to replicate the failure, so there must be some sort of race with ext4 and/or nvme.
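For reference, the console fetch attempted above looks roughly like this with the EC2 CLI. This is a sketch: the instance id is a placeholder, and the DRY_RUN guard just prints the command instead of calling AWS:

```shell
# Sketch only: i-0123456789abcdef0 is a placeholder instance id.
DRY_RUN="${DRY_RUN-echo}"
# --latest asks for the most recent output rather than the boot-time snapshot.
$DRY_RUN aws ec2 get-console-output \
    --instance-id i-0123456789abcdef0 --latest --output text
```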


Ian May (ian-may)
affects: ubuntu → linux-aws (Ubuntu)
Ian May (ian-may)
description: updated
Ian May (ian-may)
description: updated
description: updated
description: updated
Ian May (ian-may)
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hey Ian, thanks for the bug report! I'm checking this on AWS.

Revision history for this message
Ian May (ian-may) wrote :

Confirmed it does work to first upgrade bionic/linux-aws-5.4 from 5.4.0-1056-aws to 5.4.0-1058-aws and then update to 4.15.0-1113-aws.

Ian May (ian-may)
description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

It looks like it's not a problem with the patchset in general,
maybe it's specific to aws 4.15?

The patchset is in 5.4.0-1058-aws and it booted fine here too.
I'll check the patchset in 4.15.0-1113-aws.

A difference from your comment is that I could _not_ boot it
after 5.4.0-1058-aws, which worked for you. (r5.metal usually
boots in ~15min? and it's been ~20min already and no response;
serial console/screenshot is not available on bare metal, sadly.)

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

BTW, do you know of the differences between r5.metal and r5.24xlarge?

Per the specs they seem to be the same as in cpu/ram/nic/_nvme_ storage,
but differ in baremetal vs nitro hypervisor?

The reason I ask is because downgrading from 5.4.0-1056-aws to 4.15.0-1113-aws
worked/booted fine on r5.24xlarge, differently from r5.metal.

I got to test it while looking for a similar instance type that had
serial console support, but it didn't repro the problem, actually.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

For the record, 4.15.0-1113-aws works in r5.metal w/ kexec.

Booted it 10 times successfully from both 5.4.0-1058-aws
and 4.15.0-1113-aws (itself.)

(not that it was expected to make a difference as the issue
happens on normal boot, which doesn't have previous kernel.)

Right after that, in the same instance, trying a normal boot
fails.

And it had kdump installed/enabled (ie, crashkernel in cmdline),
w/ which Ian mentioned that he couldn't reproduce the problem.
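The kexec path used in those boots can be sketched roughly as follows (kernel version from this report; the DRY_RUN guard is an addition so the sketch only prints the commands instead of rebooting the machine):

```shell
KVER=4.15.0-1113-aws
DRY_RUN="${DRY_RUN-echo}"
# Load the target kernel, reusing the running kernel's command line, then
# switch into it directly, skipping firmware/normal boot entirely.
$DRY_RUN sudo kexec -l "/boot/vmlinuz-$KVER" \
    --initrd="/boot/initrd.img-$KVER" --reuse-cmdline
$DRY_RUN sudo systemctl kexec
```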

---

It also works on normal boot w/ r5d.metal (note 'd'), which
should be the same as r5.metal but w/ four local nvme disks.
(still boots from EBS/nvme disk in the same way as r5.metal)

---

Similarly, it works on r4.24xlarge (not a metal type), which
also boots from an EBS/nvme disk.

---

So it seems like there's no problem with the patchset as in
4.15.0-1113-aws as it boots fine in several types w/ approx
the same hardware config, just differing on normal/kexec in
the r5.metal type (problem report.)

- r5.metal: normal boot fails / kexec boot works
- r5d.metal: normal boot works.
- r5.24xlarge: normal boot works.

The kexec boot worked ~20 times, so it wouldn't seem like a
race condition is in place, as that should be enough runs,
considering it failed every time on normal boot.

Also, Ian mentioned that he couldn't reproduce w/ crashdump
installed. Well, I think the only difference it would cause
_before_ mounting the rootfs (assuming that's what doesn't
work/allow machine to boot, as we have no serial console)
is the crashkernel reservation?

---

So, all this is a bit confusing, but seem to indicate again
that there's no problem w/ the patchset per se, but perhaps
something in booting this particular kernel on a particular
instance type (r5.metal) which _might_ be related to normal/
kexec/crashkernel boot differences.

More tomorrow.

Revision history for this message
Ian May (ian-may) wrote :

Thanks for the in-depth update Mauricio! Is there any investigation you'd like me to specifically target?

Revision history for this message
Ian May (ian-may) wrote :

Just want to add an update. I haven't been able to replicate successfully booting 4.15.0-1113-aws from 5.4.0-1058-aws, so I'm questioning whether I made a mistake the time I thought it was successful.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Today I wanted to try and instrument the boot process a bit,
since we have no serial console in the nitro metal instances.

I was looking for pstore_blk (hoping we could panic_on_warn
or panic_on_oops), but it seems it's only available in 5.8+.

So I decided to start with grub, and keep a progress variable
in grubenv, and use grub-reboot to boot 4.15.0-1113-aws _once_
(as it's expected to fail), then (force) stop and start again,
and check grubenv in 5.4.0-*-aws (which works.)
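That progress-variable idea can be sketched like this, using a plain key=value file to stand in for /boot/grub/grubenv (on the real system, grub-editenv and save_env in grub.cfg play these roles; the file path and the mark function are illustrative):

```shell
GRUBENV=/tmp/grubenv.demo
: > "$GRUBENV"
# Each boot stage appends a marker; after a failed boot, the markers that
# survive show how far the previous attempt got.
mark() { echo "progress_$1=reached" >> "$GRUBENV"; }
mark grub_menu           # would be set from grub.cfg via save_env
mark kernel_4_15_loaded  # never reached if the kernel hangs early
grep -c '^progress_' "$GRUBENV"   # count of stages reached
```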

Interestingly, in one of such attempts 4.15.0-1113-aws WORKED.

In another attempt, I could see the progress variable for the
4.15 _and_ 5.4 kernels, so it seems that grub booted 4.15
but it didn't make it to the fully booted system. (i.e., grub
seems to be working correctly.)

In the other attempts I noticed that once we try to boot 4.15,
the system seems to become weird and not react quickly even
to the 'Force stop' method (after you try 'Stop' that doesn't
work.)

...

So, since 4.15 worked/booted once, and the systems seem weird,
and Ian just posted that he had a different result/questioned
previous result (ie, it might well be a _different_ result),
I wonder if somehow this particular instance type is acting up.

Given that 4.15 worked/booted ~20 times under kexec, it's not
unreasonable to consider there might be something going on in
normal boot.

I think we should probably engage AWS Support to try and ask
for a console log using an internally available method (seen
it elsewhere iirc), and also to clarify differences in boot
disk among instace types r5.metal (fail), r5d.metal (works),
and r5d.24xlarge (works) -- they all have EBS/nvme as '/'.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Ian,

Do you/team have contacts in here or AWS that could help with that?

I think that other lines of investigation now, after our findings
and apparent inconsistencies, would be based on speculation, and
we're better trying to get real information/logs from the system
with AWS Support.

cheers,
Mauricio

Revision history for this message
Ian May (ian-may) wrote :

Mauricio,

Interesting update. I agree that we need more info about the state the instance is in when it won't boot after switching to the new 4.15 kernel. I'll check with my team in the morning and see if we can get additional info from AWS.

I was trying a few more scenarios this evening the first being the most interesting.

Scenario 1
start with 5.4.0-1056-aws
install 5.4.0-1058-aws
reboot
confirm 5.4.0-1058-aws booted
reboot AGAIN
install 4.15.0-1113-aws
reboot
machine booted 4.15.0-1113-aws successfully

Scenario 2
start with 5.4.0-1056-aws
install 4.15.0-1112-aws
reboot
install 4.15.0-1113-aws
reboot
confirmed 4.15.0-1113-aws booted
then booted back into 5.4.0-1056-aws
removed 4.15.0-1112-aws and 4.15.0-1113-aws
rebooted again for good measure
confirmed still running 5.4.0-1056-aws
installed 4.15.0-1113-aws
rebooted
4.15.0-1113-aws successfully loaded

Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

This issue is also affecting the bionic/linux generic kernel.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi Kleber,

Thanks for the info.

The impact on bionic/generic is also exclusively on aws r5.metal or broader?

Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

Hi Mauricio,

We are seeing the issue only on r5.metal.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hey Kleber,

Thanks for confirming.

I guess there might be something wrong with the boot process on r5.metal, specifically:
- there's no issue with kexec boot, just with normal boot (same code and from/to versions)
- there's no issue with normal boot on similar instance types (r5d.metal, r5.24xlarge)
- rarely, but sometimes, it works with normal boot.

I'm looking forward to hearing back from Ian on whether he could
get additional information from AWS, or how to reach out to them,
about getting the serial console log from the r5.metal instances
internally, via AWS support, so we know what the actual issue is.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

We've got a serial console log from AWS Support through our Support team
(special thanks to Pedro Principeza and our former colleague Mark Thomas.)

The problem is definitely not the ext4/jbd2 patchset as suspected
(although it's unclear how reverting it caused the kernel to boot;
maybe build environment differences?)

Early in the kernel boot, before even trying to mount the rootfs,
there are blocked swapper tasks, and they just keep recurring.

(full log attached.)

```
         Starting Reboot...
...
[ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-1113-aws root=UUID=db937f23-4ed7-4c4b-8058-b23a860fae08 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295
...
[ 0.000000] gran_size: 64K chunk_size: 256M num_reg: 10 lose cover RAM: 737G
...
[ 2.742455] clocksource: Switched to clocksource tsc
[ 242.656089] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[ 363.488083] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[ 484.320066] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[ 605.152061] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[ 725.984054] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[ 846.816051] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[ 967.648055] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[ 1088.480033] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[ 1209.312036] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
<end of log>
```

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Steps to reproduce:
---

Ubuntu 18.04 image in AWS r5.metal instance type.

$ lsb_release -cs
bionic

$ dmesg | grep DMI:
[ 0.000000] DMI: Amazon EC2 r5.metal/Not Specified, BIOS 1.0 10/16/2017

$ uname -rv
5.4.0-1045-aws #47~18.04.1-Ubuntu SMP Tue Apr 13 15:58:14 UTC 2021

$ sudo add-apt-repository ppa:canonical-kernel-team/ppa

$ sudo apt install linux-image-4.15.0-1113-aws

$ sudo grub-reboot 'Advanced options for Ubuntu>Ubuntu, with Linux 4.15.0-1113-aws'

$ sudo grub-editenv /boot/grub/grubenv list
next_entry=Advanced options for Ubuntu>Ubuntu, with Linux 4.15.0-1113-aws

$ sudo reboot

The system should not come back from the reboot (which usually takes 10-15 minutes on r5.metal).

Revision history for this message
Ian May (ian-may) wrote :

Hi Mauricio,

Thanks for getting this info. This is very helpful! I see a few potential patches between 4.15.0-159.167 and 4.15.0-160.168 that could be related to the hang. This will help greatly with the bisect.

Ian

Revision history for this message
Ian May (ian-may) wrote :

As I was bisecting the commits, I was attempting to take advantage of parallelism: while my test kernel was building I would deploy a clean AWS r5.metal instance. I started seeing test kernels boot that I wouldn't expect to boot. So as a sanity test, I deployed an r5.metal instance, let it sit idle for 20 minutes, and then installed the known problematic 4.15.0-1113-aws kernel. Sure enough, it booted fine. I tried the same thing again, letting it sit idle for 20 minutes, and it worked again.

So this does appear to be a race condition, which I think also explains some of the erratic test results I've seen while looking at this bug. Fortunately the console output gave us some definitive proof as to where the problem was occurring.

With that being said, it appears I have found the offending commits.

PCI/MSI: Enforce that MSI-X table entry is masked for update
PCI/MSI: Enforce MSI[X] entry updates to be visible

https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-aws/+git/bionic/commit/?id=27571f5ea1dd074924b41a455c50dc2278e8c2b7

https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-aws/+git/bionic/commit/?id=2478f358c2b35fea04e005447ce99ad8dc53fd5d

More specifically, the hang is introduced by 'PCI/MSI: Enforce that MSI-X table entry is masked for update', but it isn't a clean revert without also reverting the other commit. So for a quick test confirmation I reverted both.

I have not had a chance to determine why these commits are causing the problem, but with both reverted in a test build on top of 4.15.0-1113-aws, I can migrate from 5.4 to 4.15 as soon as the instance is available. I've done at least 6 attempts now and all have passed, while the same steps without the reverts all hang (unless I wait 20 minutes).
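The revert order matters here: the later commit ("...entry updates to be visible") has to come out first, since "...masked for update" does not revert cleanly on its own. A self-contained demonstration of that ordering on a scratch repo (the file name and commit contents are stand-ins; on the real tree the two commit ids above would be used instead of HEAD references):

```shell
set -e
rm -rf /tmp/revert-demo
git init -q /tmp/revert-demo && cd /tmp/revert-demo
git config user.email demo@example.com && git config user.name demo
# Stand-in history: a base file plus the two commits in application order.
echo base > msi.c && git add msi.c && git commit -qm base
echo mask >> msi.c && git commit -qam "PCI/MSI: Enforce that MSI-X table entry is masked for update"
echo visible >> msi.c && git commit -qam "PCI/MSI: Enforce MSI[X] entry updates to be visible"
# Revert in reverse order: "visible" first, then "masked for update".
git revert --no-edit HEAD   >/dev/null
git revert --no-edit HEAD~2 >/dev/null
cat msi.c   # the file is back to its pre-patchset content
```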

description: updated
description: updated
Stefan Bader (smb)
Changed in linux-aws (Ubuntu Bionic):
importance: Undecided → High
status: New → In Progress
affects: linux-aws (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
status: New → Invalid
Stefan Bader (smb)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Revision history for this message
Kleber Sacilotto de Souza (kleber-souza) wrote :

I confirm that bionic/linux 4.15.0-161.169 and bionic/linux-aws 4.15.0-1114.121 are not experiencing the reported boot issues on AWS r5.metal or on any other platform/instance.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-161.169

---------------
linux (4.15.0-161.169) bionic; urgency=medium

  * bionic/linux: 4.15.0-161.169 -proposed tracker (LP: #1947358)

  * Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on
    r5.metal (LP: #1946149)
    - SAUCE: Revert "PCI/MSI: Enforce MSI[X] entry updates to be visible"
    - SAUCE: Revert "PCI/MSI: Enforce that MSI-X table entry is masked for update"

linux (4.15.0-160.168) bionic; urgency=medium

  * bionic/linux: 4.15.0-160.168 -proposed tracker (LP: #1944933)

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2021.09.27)

  * ext4 journal recovery fails w/ data=journal + mmap (LP: #1847340)
    - jbd2: introduce/export functions
      jbd2_journal_submit|finish_inode_data_buffers()
    - jbd2, ext4, ocfs2: introduce/use journal callbacks
      j_submit|finish_inode_data_buffers()
    - ext4: data=journal: fixes for ext4_page_mkwrite()
    - ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
    - ext4: fix mmap write protection for data=journal mode

  * CVE-2021-40490
    - ext4: fix race writing to an inline_data file while its xattrs are changing

  * Bionic update: upstream stable patchset 2021-09-22 (LP: #1944600)
    - iio: humidity: hdc100x: Add margin to the conversion time
    - iio: adc: Fix incorrect exit of for-loop
    - ASoC: intel: atom: Fix reference to PCM buffer address
    - i2c: dev: zero out array used for i2c reads from userspace
    - ACPI: NFIT: Fix support for virtual SPA ranges
    - ASoC: cs42l42: Correct definition of ADC Volume control
    - ASoC: cs42l42: Don't allow SND_SOC_DAIFMT_LEFT_J
    - ASoC: cs42l42: Fix inversion of ADC Notch Switch control
    - ASoC: cs42l42: Remove duplicate control for WNF filter frequency
    - net: dsa: mt7530: add the missing RxUnicast MIB counter
    - ppp: Fix generating ifname when empty IFLA_IFNAME is specified
    - psample: Add a fwd declaration for skbuff
    - net: Fix memory leak in ieee802154_raw_deliver
    - net: bridge: fix memleak in br_add_if()
    - tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B
      packets
    - xen/events: Fix race in set_evtchn_to_irq
    - vsock/virtio: avoid potential deadlock when vsock device remove
    - powerpc/kprobes: Fix kprobe Oops happens in booke
    - x86/tools: Fix objdump version check again
    - x86/resctrl: Fix default monitoring groups reporting
    - PCI/MSI: Enable and mask MSI-X early
    - PCI/MSI: Do not set invalid bits in MSI mask
    - PCI/MSI: Correct misleading comments
    - PCI/MSI: Use msi_mask_irq() in pci_msi_shutdown()
    - PCI/MSI: Protect msi_desc::masked for multi-MSI
    - PCI/MSI: Mask all unused MSI-X entries
    - PCI/MSI: Enforce that MSI-X table entry is masked for update
    - PCI/MSI: Enforce MSI[X] entry updates to be visible
    - vmlinux.lds.h: Handle clang's module.{c,d}tor sections
    - mac80211: drop data frames without key on encrypted links
    - x86/fpu: Make init_fpstate correct with optimized XSAVE
    - ath: Use safer key clearing with key cache entries
    - ath9k: Clear key cache explicitly ...


Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-kvm/4.15.0-1102.104 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
tags: added: verification-done-bionic
removed: verification-needed-bionic