Broken network on some AWS instances with focal/impish kernels

Bug #1961968 reported by Kleber Sacilotto de Souza
This bug affects 1 person
Affects          Status         Importance   Assigned to                 Milestone
linux (Ubuntu)   Fix Released   Undecided    Unassigned
Focal            Fix Released   Critical     Kleber Sacilotto de Souza
Impish           Fix Released   Critical     Kleber Sacilotto de Souza

Bug Description

[Impact]
With the latest focal/linux (5.4.0-101.114) and impish/linux (5.13.0-31.34) kernels built for SRU cycle 2022.02.21, some AWS instances fail to boot. The affected instance types are mostly c4.large, c3.xlarge and x1e.xlarge, although not every instance deployed on those types fails. c4.large is hit hardest, failing in about 80-90% of deployments.

The failure was traced to the network interface not coming up. The following console log snippets from 5.4.0-101-generic on a c4.large give some hints about what is going on:

[...]
[ 3.990368] unchecked MSR access error: RDMSR from 0xc90 at rIP: 0xffffffff8ea733c8 (native_read_msr+0x8/0x40)
[ 3.998463] Call Trace:
[ 4.001164] ? set_rdt_options+0x91/0x91
[ 4.004864] resctrl_late_init+0x592/0x63c
[ 4.008711] ? set_rdt_options+0x91/0x91
[ 4.012452] do_one_initcall+0x4a/0x200
[ 4.016115] kernel_init_freeable+0x1c0/0x263
[ 4.020402] ? rest_init+0xb0/0xb0
[ 4.024889] kernel_init+0xe/0x110
[ 4.029245] ret_from_fork+0x35/0x40
[...]
[ 7.718268] ena: The ena device sent a completion but the driver didn't receive a MSI-X interrupt (cmd 8), autopolling mode is OFF
[ 7.727036] ena: Failed to submit get_feature command 12 error: -62
[ 7.731691] ena 0000:00:03.0: Cannot init indirect table
[ 7.735636] ena 0000:00:03.0: Cannot init RSS rc: -62
[ 7.740700] ena: probe of 0000:00:03.0 failed with error -62
[...]
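
For anyone triaging many instances, the two telltale lines in the log above (the missed MSI-X interrupt and the failed ena probe) can be searched for mechanically. The following is only a minimal sketch: the patterns are copied from this log, while the function name find_ena_failures and the shape of its return value are made up for illustration and are not part of any Ubuntu or AWS tooling. It assumes the console log is already available as text.

```python
import re

# Patterns copied from the console log in this report. The function name and
# the shape of the returned dict are hypothetical, chosen for illustration.
MSIX_MISSED = re.compile(r"ena: .*didn't receive a MSI-X interrupt")
PROBE_FAILED = re.compile(
    r"ena: probe of (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"failed with error (?P<err>-\d+)"
)


def find_ena_failures(console_log: str) -> dict:
    """Report whether the log shows the missed-interrupt and probe-failure lines."""
    failed = [
        (m.group("pci"), int(m.group("err")))
        for m in PROBE_FAILED.finditer(console_log)
    ]
    return {
        "missed_msix": MSIX_MISSED.search(console_log) is not None,
        "failed_probes": failed,
    }
```

A log that contains the lines shown above would be reported with missed_msix set and one failed probe for 0000:00:03.0 with error -62. A match only indicates the same symptom, not necessarily the same root cause.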

[Fix]
Reverting the following upstream stable commit fixes the issue:

83dbf898a2d4 PCI/MSI: Mask MSI-X vectors only on success

[Test Case]
Boot an affected AWS instance type with the focal/linux (5.4.0-101.114) and impish/linux (5.13.0-31.34) kernels with the mentioned patch reverted, then boot it with the original kernels. The instance should boot successfully with the patch reverted and fail with the original kernels.
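
Because the failure is intermittent (about 80-90% of c4.large deployments, per the impact section), a single successful boot is weak evidence that a kernel is fixed. The sketch below estimates how likely a still-broken kernel is to pass a run of boots by luck. It assumes each deployment fails independently at a fixed rate, which this report does not establish, and the function names are hypothetical.

```python
import math


def chance_all_boots_pass(failure_rate: float, boots: int) -> float:
    """Probability that `boots` independent boots all succeed on a broken kernel."""
    return (1.0 - failure_rate) ** boots


def boots_needed(failure_rate: float, max_false_pass: float) -> int:
    """Smallest number of clean boots so that a broken kernel would pass them
    all with probability at most `max_false_pass`.

    Both arguments must be strictly between 0 and 1.
    """
    return math.ceil(math.log(max_false_pass) / math.log(1.0 - failure_rate))


# Using the lower bound of the 80-90% c4.large failure rate from the report.
print(chance_all_boots_pass(0.8, 5))
print(boots_needed(0.8, 0.001))
```

Under that independence assumption, five clean boots on c4.large leave roughly a 0.03% chance that the kernel is still affected and merely got lucky.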

[Regression Potential]
The patch description mentions fixing an MSI-X issue with a Marvell NVMe device, which doesn't seem to follow the PCIe specification. Reverting this commit will leave the issue unfixed on systems with that particular NVMe device.
There is currently no follow-up fix for this commit upstream. We might need to keep an eye on any changes and re-apply it if a fix is found.

Changed in linux (Ubuntu Focal):
status: New → Confirmed
Changed in linux (Ubuntu Impish):
status: New → Confirmed
description: updated
Kleber Sacilotto de Souza (kleber-souza) wrote :
Changed in linux (Ubuntu Focal):
status: Confirmed → In Progress
Changed in linux (Ubuntu Impish):
status: Confirmed → In Progress
Changed in linux (Ubuntu Focal):
importance: Undecided → Critical
Changed in linux (Ubuntu Impish):
importance: Undecided → Critical
Changed in linux (Ubuntu Focal):
assignee: nobody → Kleber Sacilotto de Souza (kleber-souza)
Changed in linux (Ubuntu Impish):
assignee: nobody → Kleber Sacilotto de Souza (kleber-souza)
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1961968

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Kleber Sacilotto de Souza (kleber-souza) wrote :

This was already fixed in Jammy/linux by reverting the same patch (bug 1956780).

Changed in linux (Ubuntu):
status: Incomplete → Fix Released
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Impish):
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.13.0-32.35 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-impish' to 'verification-done-impish'. If the problem still exists, change the tag 'verification-needed-impish' to 'verification-failed-impish'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-impish
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.4.0-102.115 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Kleber Sacilotto de Souza (kleber-souza) wrote :

Confirmed that impish/linux 5.13.0-32.35 and focal/linux 5.4.0-102.115 are not failing to boot on any AWS instance.

tags: added: verification-done-focal
removed: verification-needed-focal
tags: added: verification-done-impish
removed: verification-needed-impish
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.13.0-37.42

---------------
linux (5.13.0-37.42) impish; urgency=medium

  * impish/linux: 5.13.0-37.42 -proposed tracker (LP: #1964959)

  * CVE-2022-0742
    - ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report()

linux (5.13.0-36.41) impish; urgency=medium

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis
    - debian/dkms-versions -- update from kernel-versions (main/2022.02.21)

  * Broken network on some AWS instances with focal/impish kernels
    (LP: #1961968)
    - SAUCE: Revert "PCI/MSI: Mask MSI-X vectors only on success"

  * [SRU]PCI: vmd: Do not disable MSI-X remapping if interrupt remapping is
    enabled by IOMMU (LP: #1937295)
    - PCI: vmd: Do not disable MSI-X remapping if interrupt remapping is enabled
      by IOMMU

  * [UBUNTU 20.04] kernel: Add support for CPU-MF counter second version 7
    (LP: #1960182)
    - s390/cpumf: Support for CPU Measurement Facility CSVN 7
    - s390/cpumf: Support for CPU Measurement Sampling Facility LS bit

  * [UBUNTU 21.10] s390/cio: verify the driver availability for path_event call
    (LP: #1960875)
    - s390/cio: verify the driver availability for path_event call

  * Impish update: upstream stable patchset 2022-02-14 (LP: #1960861)
    - devtmpfs regression fix: reconfigure on each mount
    - orangefs: Fix the size of a memory allocation in orangefs_bufmap_alloc()
    - remoteproc: qcom: pil_info: Don't memcpy_toio more than is provided
    - perf: Protect perf_guest_cbs with RCU
    - KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest
    - KVM: s390: Clarify SIGP orders versus STOP/RESTART
    - 9p: only copy valid iattrs in 9P2000.L setattr implementation
    - video: vga16fb: Only probe for EGA and VGA 16 color graphic cards
    - media: uvcvideo: fix division by zero at stream start
    - rtlwifi: rtl8192cu: Fix WARNING when calling local_irq_restore() with
      interrupts enabled
    - firmware: qemu_fw_cfg: fix sysfs information leak
    - firmware: qemu_fw_cfg: fix NULL-pointer deref on duplicate entries
    - firmware: qemu_fw_cfg: fix kobject leak in probe error path
    - KVM: x86: remove PMU FIXED_CTR3 from msrs_to_save_all
    - ALSA: hda/realtek: Add speaker fixup for some Yoga 15ITL5 devices
    - ALSA: hda/realtek - Fix silent output on Gigabyte X570 Aorus Master after
      reboot from Windows
    - ALSA: hda: ALC287: Add Lenovo IdeaPad Slim 9i 14ITL5 speaker quirk
    - ALSA: hda/realtek: Add quirk for Legion Y9000X 2020
    - ALSA: hda/realtek: Re-order quirk entries for Lenovo
    - powerpc/pseries: Get entry and uaccess flush required bits from
      H_GET_CPU_CHARACTERISTICS
    - mtd: fixup CFI on ixp4xx
    - KVM: x86: don't print when fail to read/write pv eoi memory
    - remoteproc: qcom: pas: Add missing power-domain "mxc" for CDSP
    - perf annotate: Avoid TUI crash when navigating in the annotation of
      recursive functions
    - ALSA: hda/realtek: Use ALC285_FIXUP_HP_GPIO_LED on another HP laptop
    - ALSA: hda/tegra: Fix Tegra194 HDA reset failure

  * CVE-2022-0516
    - KVM: s390: Return error on SIDA memop on normal guest

  * CVE-2022-04...

Changed in linux (Ubuntu Impish):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-105.119

---------------
linux (5.4.0-105.119) focal; urgency=medium

  * CVE-2022-0847
    - lib/iov_iter: initialize "flags" in new pipe_buffer

  * Broken network on some AWS instances with focal/impish kernels
    (LP: #1961968)
    - SAUCE: Revert "PCI/MSI: Mask MSI-X vectors only on success"

  * [UBUNTU 20.04] kernel: Add support for CPU-MF counter second version 7
    (LP: #1960182)
    - s390/cpumf: Support for CPU Measurement Facility CSVN 7
    - s390/cpumf: Support for CPU Measurement Sampling Facility LS bit

  * Hipersocket page allocation failure on Ubuntu 20.04 based SSC environments
    (LP: #1959529)
    - s390/qeth: use memory reserves to back RX buffers

  * CVE-2022-0516
    - KVM: s390: Return error on SIDA memop on normal guest

  * CVE-2022-0435
    - tipc: improve size validations for received domain records

  * CVE-2022-0492
    - cgroup-v1: Require capabilities to set release_agent

  * Recalled NFSv4 files delegations overwhelm server (LP: #1957986)
    - NFSv4: Fix delegation handling in update_open_stateid()
    - NFSv4: nfs4_callback_getattr() should ignore revoked delegations
    - NFSv4: Delegation recalls should not find revoked delegations
    - NFSv4: fail nfs4_refresh_delegation_stateid() when the delegation was
      revoked
    - NFS: Rename nfs_inode_return_delegation_noreclaim()
    - NFSv4: Don't remove the delegation from the super_list more than once
    - NFSv4: Hold the delegation spinlock when updating the seqid
    - NFSv4: Clear the NFS_DELEGATION_REVOKED flag in
      nfs_update_inplace_delegation()
    - NFSv4: Update the stateid seqid in nfs_revoke_delegation()
    - NFSv4: Revoke the delegation on success in nfs4_delegreturn_done()
    - NFSv4: Ignore requests to return the delegation if it was revoked
    - NFSv4: Don't reclaim delegations that have been returned or revoked
    - NFSv4: nfs4_return_incompatible_delegation() should check delegation
      validity
    - NFSv4: Fix nfs4_inode_make_writeable()
    - NFS: nfs_inode_find_state_and_recover() fix stateid matching
    - NFSv4: Fix races between open and delegreturn
    - NFSv4: Handle NFS4ERR_OLD_STATEID in delegreturn
    - NFSv4: Don't retry the GETATTR on old stateid in nfs4_delegreturn_done()
    - NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING
    - NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned
    - NFSv4: Try to return the delegation immediately when marked for return on
      close
    - NFSv4: Add accounting for the number of active delegations held
    - NFSv4: Limit the total number of cached delegations
    - NFSv4: Ensure the delegation is pinned in nfs_do_return_delegation()
    - NFSv4: Ensure the delegation cred is pinned when we call delegreturn

  * Focal update: v5.4.174 upstream stable release (LP: #1960566)
    - HID: uhid: Fix worker destroying device without any protection
    - HID: wacom: Reset expected and received contact counts at the same time
    - HID: wacom: Ignore the confidence flag when a touch is removed
    - HID: wacom: Avoid using stale array indicies to read contact count
    - f2fs: fix to ...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released