aws: properly support instance types with > 255 cpu cores

Bug #1913739 reported by Andrea Righi on 2021-01-29
This bug affects 1 person
Affects             Importance  Assigned to
linux-aws (Ubuntu)  Undecided   Unassigned
  Xenial            Medium      Unassigned
  Bionic            Medium      Unassigned
  Focal             Medium      Unassigned
  Groovy            Medium      Andrea Righi

Bug Description

[Impact]

To properly support instance types in AWS with > 255 cores we need to make sure that commit c40aaaac1018ff1382f2d35df5129a6bcea3df6b ("iommu/vt-d: Gracefully handle DMAR units with no supported address widths") is applied across all our linux-aws kernels.

[Test Case]

Tests performed by AWS.

[Fix]

Apply/backport commit c40aaaac1018ff1382f2d35df5129a6bcea3df6b.

For Groovy it's a clean cherry-pick.

For Focal/Bionic the backport only requires applying the patch to the differently named file in those trees.

Only Xenial requires a little more backporting work: the condition in free_iommu() has to check whether iommu->iommu_dev is defined instead of checking iommu->iommu.ops, which is not available in the 4.4 kernel.

[Regression potential]

This change allows the kernel to use DMAR units that were not used before (for interrupt remapping). Theoretically this should provide a performance improvement, but we should keep an eye out for potential performance regressions in drivers that use the IOMMU, in case they use this new behavior incorrectly.

Moreover, the Xenial backport required an adjustment in free_iommu() to decide when iommu_device_destroy() needs to be called. If the adjustment is not 100% correct, we could introduce a memory leak.

Andrea Righi (arighi) on 2021-01-29
Changed in linux-aws (Ubuntu Bionic):
importance: Undecided → Medium
importance: Medium → High
Changed in linux-aws (Ubuntu Focal):
importance: Undecided → High
Changed in linux-aws (Ubuntu Bionic):
importance: High → Medium
Changed in linux-aws (Ubuntu Focal):
importance: High → Medium
Stefan Bader (smb) on 2021-02-02
Changed in linux-aws (Ubuntu Xenial):
importance: Undecided → Medium
status: New → In Progress
Changed in linux-aws (Ubuntu Focal):
status: New → In Progress
Changed in linux-aws (Ubuntu Bionic):
status: New → In Progress
Changed in linux-aws (Ubuntu Groovy):
importance: Undecided → Medium
status: New → In Progress
assignee: nobody → Andrea Righi (arighi)
Andrea Righi (arighi) on 2021-02-02
description: updated
no longer affects: linux-aws (Ubuntu Hirsute)
Ian (ian-may) on 2021-02-04
Changed in linux-aws (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux-aws (Ubuntu Bionic):
status: In Progress → Fix Committed
Changed in linux-aws (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux-aws (Ubuntu Groovy):
status: In Progress → Fix Committed
Chris Newcomer (cnewcomer) wrote :

Validation results: the proper number of vCPUs showed up in Bionic, Focal, Groovy, and Xenial-4.15. The Xenial 4.4 kernel only saw 256 vCPUs.

ubuntu@ip-172-31-5-48:~$ uname -r && nproc
4.4.0-1121-aws
144

ubuntu@ip-172-31-5-48:~$ uname -r && nproc
4.4.0-1122-aws
256

ubuntu@ip-172-31-5-48:~$ uname -r && nproc
4.15.0-1094-aws
448
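
The check shown in these transcripts can be scripted. This is a minimal sketch, assuming a hypothetical helper named check_vcpus (not part of any Ubuntu or AWS tool) and an expected count supplied by the caller; 448 is used below only because it is the value reported for 4.15.0-1094-aws.

```shell
#!/bin/sh
# check_vcpus EXPECTED [ACTUAL]
# Hypothetical helper (not an Ubuntu or AWS tool).
# EXPECTED: vCPU count the instance type should expose (caller-supplied).
# ACTUAL:   count to check; defaults to the live output of nproc.
check_vcpus() {
    expected=$1
    actual=${2:-$(nproc)}
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $actual CPUs visible"
        return 0
    fi
    echo "FAIL: $actual CPUs visible, expected $expected"
    return 1
}
```

Running `check_vcpus 448` on the instance compares against the live nproc output. Passing a second argument replays a recorded value, for example `check_vcpus 448 256` for the 4.4.0-1122-aws result.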

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws - 5.4.0-1038.40

---------------
linux-aws (5.4.0-1038.40) focal; urgency=medium

  * focal/linux-aws: 5.4.0-1038.40 -proposed tracker (LP: #1913135)

  * Focal update: v5.4.85 upstream stable release (LP: #1910817)
    - [Config] aws: updateconfigs for USB_SISUSBVGA_CON

  * Focal update: v5.4.84 upstream stable release (LP: #1910816)
    - [Config] aws: updateconfigs for PGTABLE_MAPPING

  * Focal update: v5.4.80 upstream stable release (LP: #1908561)
    - [Config] aws: updateconfigs for INFINIBAND_VIRT_DMA

  * Disable Bluetooth in cloud kernels (LP: #1840488)
    - aws: [Config] disable CONFIG_BT
    - aws: [Config] remove disabled BT modules

  * aws: properly support instance types with > 255 cpu cores (LP: #1913739)
    - iommu/vt-d: Gracefully handle DMAR units with no supported address widths

  [ Ubuntu: 5.4.0-66.74 ]

  * focal/linux: 5.4.0-66.74 -proposed tracker (LP: #1913152)
  * Add support for selective build of special drivers (LP: #1912789)
    - [Packaging] Add support for ODM drivers
    - [Packaging] Turn on ODM support for amd64
  * Packaging resync (LP: #1786013)
    - update dkms package versions
    - update dkms package versions
  * Introduce the new NVIDIA 460-server series and update the 460 series
    (LP: #1913200)
    - [Config] dkms-versions -- drop NVIDIA 435 455 and 440-server
    - [Config] dkms-versions -- add the 460-server nvidia driver
  * Enable mute and micmute LED on HP EliteBook 850 G7 (LP: #1910102)
    - ALSA: hda/realtek: Enable mute and micmute LED on HP EliteBook 850 G7
  * SYNA30B4:00 06CB:CE09 Mouse on HP EliteBook 850 G7 not working at all
    (LP: #1908992)
    - HID: multitouch: Enable multi-input for Synaptics pointstick/touchpad device
  * HD Audio Device PCI ID for the Intel Cometlake-R platform (LP: #1912427)
    - SAUCE: ALSA: hda: Add Cometlake-R PCI ID
  * switch to an autogenerated nvidia series based core via dkms-versions
    (LP: #1912803)
    - [Packaging] nvidia -- use dkms-versions to define versions built
    - [Packaging] update-version-dkms -- maintain flags fields
    - [Config] dkms-versions -- add transitional/skip information for nvidia
      packages
  * udpgro.sh in net from ubuntu_kernel_selftests seems not reflecting sub-test
    result (LP: #1908499)
    - selftests: fix the return value for UDP GRO test
  * qede: Kubernetes Internal DNS Failure due to QL41xxx NIC not supporting IPIP
    tx csum offload (LP: #1909062)
    - qede: fix offload for IPIP tunnel packets
  * Use DCPD to control HP DreamColor panel (LP: #1911001)
    - SAUCE: drm/dp: Another HP DreamColor panel brigntness fix
  * kvm: Windows 2k19 with Hyper-v role gets stuck on pending hypervisor
    requests on cascadelake based kvm hosts (LP: #1911848)
    - KVM: x86: Set KVM_REQ_EVENT if run is canceled with req_immediate_exit set
  * Ubuntu 20.10 four needed fixes to 'Add driver for Mellanox Connect-IB
    adapters' (LP: #1905574)
    - net/mlx5: Fix a race when moving command interface to polling mode
  * Fix right sounds and mute/micmute LEDs for HP ZBook Fury 15/17 G7 Mobile
    Workstation (LP: #1910561)
    - ALSA: hda/realtek: fix right sounds and m...

Changed in linux-aws (Ubuntu Focal):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws - 5.8.0-1024.26

---------------
linux-aws (5.8.0-1024.26) groovy; urgency=medium

  * groovy/linux-aws: 5.8.0-1024.26 -proposed tracker (LP: #1914790)

  * Groovy update: upstream stable patchset 2021-01-13 (LP: #1911476)
    - [Config] aws: update config for ZSMALLOC_PGTABLE_MAPPING
    - [Config] aws: update config for USB_SISUSBVGA_CON

  * Groovy update: upstream stable patchset 2021-01-12 (LP: #1911235)
    - [Config] aws: update config for INFINIBAND_VIRT_DMA

  * Groovy update: upstream stable patchset 2020-12-14 (LP: #1908150)
    - [Config] aws: update configs for DW_APB_TIMER

  * Disable Bluetooth in cloud kernels (LP: #1840488)
    - aws: [Config] disable CONFIG_BT
    - aws: [Config] remove disabled BT modules

  * aws: properly support instance types with > 255 cpu cores (LP: #1913739)
    - iommu/vt-d: Gracefully handle DMAR units with no supported address widths

  [ Ubuntu: 5.8.0-44.50 ]

  * groovy/linux: 5.8.0-44.50 -proposed tracker (LP: #1914805)
  * Packaging resync (LP: #1786013)
    - update dkms package versions
    - update dkms package versions
  * Introduce the new NVIDIA 460-server series and update the 460 series
    (LP: #1913200)
    - [Config] dkms-versions -- drop NVIDIA 435 455 and 440-server
    - [Config] dkms-versions -- add the 460-server nvidia driver
  * [SRU][G/H/U/OEM-5.10] re-enable s0ix of e1000e (LP: #1910541)
    - Revert "UBUNTU: SAUCE: e1000e: bump up timeout to wait when ME un-configure
      ULP mode"
    - e1000e: Only run S0ix flows if shutdown succeeded
    - Revert "e1000e: disable s0ix entry and exit flows for ME systems"
    - e1000e: Export S0ix flags to ethtool
  * suspend only works once on ThinkPad X1 Carbon gen 7 (LP: #1865570) //
    [SRU][G/H/U/OEM-5.10] re-enable s0ix of e1000e (LP: #1910541)
    - e1000e: bump up timeout to wait when ME un-configures ULP mode
  * Cannot probe sata disk on sata controller behind VMD: ata1.00: failed to
    IDENTIFY (I/O error, err_mask=0x4) (LP: #1894778)
    - PCI: vmd: Offset Client VMD MSI-X vectors
  * Enable mute and micmute LED on HP EliteBook 850 G7 (LP: #1910102)
    - ALSA: hda/realtek: Enable mute and micmute LED on HP EliteBook 850 G7
  * SYNA30B4:00 06CB:CE09 Mouse on HP EliteBook 850 G7 not working at all
    (LP: #1908992)
    - HID: multitouch: Enable multi-input for Synaptics pointstick/touchpad device
  * HD Audio Device PCI ID for the Intel Cometlake-R platform (LP: #1912427)
    - SAUCE: ALSA: hda: Add Cometlake-R PCI ID
  * switch to an autogenerated nvidia series based core via dkms-versions
    (LP: #1912803)
    - [Packaging] nvidia -- use dkms-versions to define versions built
    - [Packaging] update-version-dkms -- maintain flags fields
    - [Config] dkms-versions -- add transitional/skip information for nvidia
      packages
  * udpgro.sh in net from ubuntu_kernel_selftests seems not reflecting sub-test
    result (LP: #1908499)
    - selftests: fix the return value for UDP GRO test
  * [UBUNTU 21.04] vfio: pass DMA availability information to userspace
    (LP: #1907421)
    - vfio/type1: Refactor vfio_iommu_type1_ioctl()
    - vfio iommu: Add dma available capability
  *...

Changed in linux-aws (Ubuntu Groovy):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws - 4.15.0-1094.101

---------------
linux-aws (4.15.0-1094.101) bionic; urgency=medium

  * bionic/linux-aws: 4.15.0-1094.101 -proposed tracker (LP: #1913097)

  * Bionic update: upstream stable patchset 2021-01-12 (LP: #1911331)
    - [Config] aws: Update config for USB_SISUSBVGA_CON

  * Disable Bluetooth in cloud kernels (LP: #1840488)
    - aws: [Config] disable CONFIG_BT
    - aws: [Config] remove disabled BT modules

  * aws: properly support instance types with > 255 cpu cores (LP: #1913739)
    - iommu/vt-d: Gracefully handle DMAR units with no supported address widths

  [ Ubuntu: 4.15.0-136.140 ]

  * bionic/linux: 4.15.0-136.140 -proposed tracker (LP: #1913117)
  * Packaging resync (LP: #1786013)
    - update dkms package versions
    - update dkms package versions
  * Introduce the new NVIDIA 460-server series and update the 460 series
    (LP: #1913200)
    - [Config] dkms-versions -- drop NVIDIA 435 455 and 440-server
    - [Config] dkms-versions -- add the 460-server nvidia driver
  * switch to an autogenerated nvidia series based core via dkms-versions
    (LP: #1912803)
    - [Packaging] nvidia -- use dkms-versions to define versions built
    - [Packaging] update-version-dkms -- maintain flags fields
    - [Config] dkms-versions -- add transitional/skip information for nvidia
      packages
  * DMI entry syntax fix for Pegatron / ByteSpeed C15B (LP: #1910639)
    - Input: i8042 - unbreak Pegatron C15B
  * CVE-2020-29372
    - mm: check that mm is still valid in madvise()
  * update ENA driver, incl. new ethtool stats (LP: #1910291)
    - net: ena: change num_queues to num_io_queues for clarity and consistency
    - net: ena: ethtool: get_channels: use combined only
    - net: ena: ethtool: support set_channels callback
    - net: ena: ethtool: remove redundant non-zero check on rc
    - net/amazon: Ensure that driver version is aligned to the linux kernel
    - net: ena: ethtool: clean up minor indentation issue
    - net: ena: remove code that does nothing
    - net: ena: add unmask interrupts statistics to ethtool
    - net: ena: cosmetic: change ena_com_stats_admin stats to u64
    - net: ena: cosmetic: remove unnecessary code
    - net: ena: ethtool: convert stat_offset to 64 bit resolution
    - net: ena: ethtool: Add new device statistics
    - net: ena: Change license into format to SPDX in all files
    - net: ena: Change RSS related macros and variables names
  * CVE-2020-29374
    - gup: document and work around "COW can break either way" issue
  * Bionic update: upstream stable patchset 2021-01-12 (LP: #1911331)
    - spi: bcm2835aux: Fix use-after-free on unbind
    - spi: bcm2835aux: Restore err assignment in bcm2835aux_spi_probe
    - iwlwifi: pcie: limit memory read spin time
    - arm64: dts: rockchip: Assign a fixed index to mmc devices on rk3399 boards.
    - iwlwifi: mvm: fix kernel panic in case of assert during CSA
    - ARC: stack unwinding: don't assume non-current task is sleeping
    - scsi: ufs: Make sure clk scaling happens only when HBA is runtime ACTIVE
    - soc: fsl: dpio: Get the cpumask through cpumask_of(cpu)
    - platform/x86: acer-wmi: add automa...

Changed in linux-aws (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws - 4.4.0-1122.136

---------------
linux-aws (4.4.0-1122.136) xenial; urgency=medium

  * xenial/linux-aws: 4.4.0-1122.136 -proposed tracker (LP: #1914129)

  * Xenial update: v4.4.249 upstream stable release (LP: #1910139)
    - [Config] updateconfigs for SPI_DYNAMIC

  * aws: properly support instance types with > 255 cpu cores (LP: #1913739)
    - iommu/vt-d: Gracefully handle DMAR units with no supported address widths

  [ Ubuntu: 4.4.0-203.235 ]

  * xenial/linux: 4.4.0-203.235 -proposed tracker (LP: #1914140)
  * Ubuntu 16.04 kernel 4.4.0-202 basic commands hanging (LP: #1913853)
    - SAUCE: Revert "mm: check that mm is still valid in madvise()"

  [ Ubuntu: 4.4.0-202.234 ]

  * xenial/linux: 4.4.0-202.234 -proposed tracker (LP: #1913086)
  * DMI entry syntax fix for Pegatron / ByteSpeed C15B (LP: #1910639)
    - Input: i8042 - unbreak Pegatron C15B
  * CVE-2020-29372
    - mm: check that mm is still valid in madvise()
  * errinjct open fails on IBM POWER LPAR (LP: #1908710)
    - powerpc/rtas: Fix typo of ibm, open-errinjct in RTAS filter
  * 4.4 kernel panics in kvm wake_up() handler (LP: #1908428)
    - kvm: vmx: rename vmx_pre/post_block to pi_pre/post_block
    - KVM: VMX: extract __pi_post_block
    - KVM: VMX: avoid double list add with VT-d posted interrupts
  * restore reverted commit "crypto: arm64/sha - avoid non-standard inline asm
    tricks" (LP: #1907489)
    - crypto: arm64/sha - avoid non-standard inline asm tricks
  * CVE-2020-29374
    - gup: document and work around "COW can break either way" issue
  * Xenial update: v4.4.249 upstream stable release (LP: #1910139)
    - spi: bcm2835aux: Fix use-after-free on unbind
    - spi: bcm2835aux: Restore err assignment in bcm2835aux_spi_probe
    - ARC: stack unwinding: don't assume non-current task is sleeping
    - platform/x86: acer-wmi: add automatic keyboard background light toggle key
      as KEY_LIGHTS_TOGGLE
    - Input: cm109 - do not stomp on control URB
    - Input: i8042 - add Acer laptops to the i8042 reset list
    - [Config] updateconfigs for SPI_DYNAMIC
    - spi: Prevent adding devices below an unregistering controller
    - net/mlx4_en: Avoid scheduling restart task if it is already running
    - tcp: fix cwnd-limited bug for TSO deferral where we send nothing
    - net: stmmac: delete the eee_ctrl_timer after napi disabled
    - net: bridge: vlan: fix error return code in __vlan_add()
    - USB: dummy-hcd: Fix uninitialized array use in init()
    - USB: add RESET_RESUME quirk for Snapscan 1212
    - ALSA: usb-audio: Fix potential out-of-bounds shift
    - ALSA: usb-audio: Fix control 'access overflow' errors from chmap
    - xhci: Give USB2 ports time to enter U3 in bus suspend
    - USB: sisusbvga: Make console support depend on BROKEN
    - [Config] updateconfigs for USB_SISUSBVGA_CON
    - ALSA: pcm: oss: Fix potential out-of-bounds shift
    - serial: 8250_omap: Avoid FIFO corruption caused by MDR1 access
    - USB: serial: cp210x: enable usb generic throttle/unthrottle
    - scsi: bnx2i: Requires MMU
    - can: softing: softing_netdev_open(): fix error handling
    - RDMA/cm: Fix an attempt to use non-vali...

Changed in linux-aws (Ubuntu Xenial):
status: Fix Committed → Fix Released
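
The four changelogs above give the first linux-aws version per series that carries the patch. As a rough sketch, a running kernel can be compared against those versions. The helper names fixed_in and has_fix are hypothetical, and the comparison assumes a sort that supports -V (GNU coreutils does). Carrying the patch does not guarantee that every vCPU is visible: the 4.4.0-1122-aws validation above still reported 256.

```shell
#!/bin/sh
# fixed_in SERIES
# Prints the first linux-aws version listed in this report as
# containing the fix for SERIES. Hypothetical helper.
fixed_in() {
    case "$1" in
        xenial) echo 4.4.0-1122 ;;
        bionic) echo 4.15.0-1094 ;;
        focal)  echo 5.4.0-1038 ;;
        groovy) echo 5.8.0-1024 ;;
        *)      return 1 ;;
    esac
}

# has_fix SERIES KERNEL_RELEASE
# KERNEL_RELEASE is a string as printed by uname -r, e.g. 4.4.0-1122-aws.
# Returns 0 if the release is at or above the fixed version,
# 1 if it is older, 2 if the series is not listed above.
has_fix() {
    min=$(fixed_in "$1") || return 2
    ver=${2%-aws}
    # The smaller of the two versions sorts first under version sort.
    lowest=$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}
```

For example, `has_fix xenial "$(uname -r)"` succeeds on 4.4.0-1122-aws and fails on 4.4.0-1121-aws.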
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws - 5.8.0-1025.27+21.04.1

---------------
linux-aws (5.8.0-1025.27+21.04.1) hirsute; urgency=medium

  * hirsute/linux-aws: 5.8.0-1025.27+21.04.1 -proposed tracker (LP: #1916127)

  * Disable Bluetooth in cloud kernels (LP: #1840488)
    - [Config] aws-21.04: remove disabled BT modules

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  [ Ubuntu: 5.8.0-1025.27 ]

  * groovy/linux-aws: 5.8.0-1025.27 -proposed tracker (LP: #1916128)
  * Please trust Canonical Livepatch Service kmod signing key (LP: #1898716)
    - aws: [Config] enable CONFIG_MODVERSIONS=y
    - aws: [Packaging] build canonical-certs.pem from branch/arch certs
    - aws: [Config] Allow ASM_MODVERSIONS
  * groovy/linux: 5.8.0-45.51 -proposed tracker (LP: #1916143)
  * Please trust Canonical Livepatch Service kmod signing key (LP: #1898716)
    - [Config] enable CONFIG_MODVERSIONS=y
    - [Packaging] build canonical-certs.pem from branch/arch certs
    - [Config] add Canonical Livepatch Service key to SYSTEM_TRUSTED_KEYS
    - [Config] add ubuntu-drivers key to SYSTEM_TRUSTED_KEYS
    - [Config] Allow ASM_MODVERSIONS and MODULE_REL_CRCS
  * CVE-2021-20194
    - bpf, cgroup: Fix optlen WARN_ON_ONCE toctou
    - bpf, cgroup: Fix problematic bounds check
  * Missing device id for Intel TGL-H ISH [8086:43fc] in intel-ish-hid driver
    (LP: #1914543)
    - HID: intel-ish-hid: ipc: Add Tiger Lake H PCI device ID
  * Prevent thermal shutdown during boot process (LP: #1906168)
    - thermal/core: Emit a warning if the thermal zone is updated without ops
    - thermal/core: Add critical and hot ops
    - thermal/drivers/acpi: Use hot and critical ops
    - thermal/drivers/rcar: Remove notification usage
    - thermal: int340x: Fix unexpected shutdown at critical temperature
    - thermal: intel: pch: Fix unexpected shutdown at critical temperature
  * geneve overlay network on vlan interface broken with offload enabled
    (LP: #1914447)
    - net/mlx5e: Fix SWP offsets when vlan inserted by driver
  * Groovy update: upstream stable patchset 2021-02-11 (LP: #1915473)
    - net: cdc_ncm: correct overhead in delayed_ndp_size
    - net: hns3: fix the number of queues actually used by ARQ
    - net: hns3: fix a phy loopback fail issue
    - net: stmmac: dwmac-sun8i: Balance internal PHY resource references
    - net: stmmac: dwmac-sun8i: Balance internal PHY power
    - net: vlan: avoid leaks on register_vlan_dev() failures
    - net/sonic: Fix some resource leaks in error handling paths
    - net: ipv6: fib: flush exceptions when purging route
    - tools: selftests: add test for changing routes with PTMU exceptions
    - net: fix pmtu check in nopmtudisc mode
    - net: ip: always refragment ip defragmented packets
    - octeontx2-af: fix memory leak of lmac and lmac->name
    - nexthop: Fix off-by-one error in error path
    - nexthop: Unlink nexthop group entry in error path
    - s390/qeth: fix L2 header access in qeth_l3_osa_features_check()
    - net: dsa: lantiq_gswip: Exclude RMII from modes that report 1 GbE
    - net/mlx5: Use port_num 1 instead of 0 when delete a RoCE address
    - net/mlx5e: ethtool, Fix restrictio...

Changed in linux-aws (Ubuntu):
status: New → Fix Released