[Hyper-V] Fix IRQ spreading on NVMe devices with lower numbers of channels

Bug #1802358 reported by Joshua R. Poulson
This bug affects 1 person
Affects                Status         Importance  Assigned to  Milestone
linux-azure (Ubuntu)   Fix Released   Undecided   Unassigned
  Bionic               Fix Released   Undecided   Unassigned
  Cosmic               Fix Released   Undecided   Unassigned

Bug Description

1. Patch to kernel/irq/affinity.c: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/kernel/irq/affinity.c?h=next-20181108&id=b82592199032bf7c778f861b936287e37ebc9f62

2. Patches to kernel/irq/matrix.c. There are three patches for this. The first two are from Fujitsu; the third, from Long, is the one that makes the previous two work correctly.
a. https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/kernel/irq/matrix.c?h=next-20181108&id=8ffe4e61c06a48324cfd97f1199bb9838acce2f2
b. https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/kernel/irq/matrix.c?h=next-20181108&id=76f99ae5b54d48430d1f0c5512a84da0ff9761e0
c. https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=irq/core&id=e8da8794a7fd9eef1ec9a07f0d4897c68581c72b

We expect the tip patch (2c) to be applied to linux-next soon.

genirq/affinity: Spread IRQs to all available NUMA nodes

If the number of NUMA nodes exceeds the number of MSI/MSI-X interrupts
which are allocated for a device, the interrupt affinity spreading code
fails to spread them across all nodes.

The reason is that the spreading code starts from node 0 and continues up
to the number of interrupts requested for allocation. This leaves the nodes
past the last interrupt unused.

This results in interrupt concentration on the first nodes which violates
the assumption of the block layer that all nodes are covered evenly. As a
consequence the NUMA nodes above the number of interrupts are all assigned
to hardware queue 0 and therefore NUMA node 0, which results in bad
performance and has CPU hotplug implications, because queue 0 gets shut
down when the last CPU of node 0 is offlined.

Go over all NUMA nodes and assign them round-robin to all requested
interrupts to solve this.
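
A minimal sketch of the round-robin idea described above, written in Python rather than kernel C. It is not the code from kernel/irq/affinity.c; the function name spread_nodes and the list-of-lists result are hypothetical and exist only for illustration.

```python
def spread_nodes(num_nodes, num_vectors):
    """Assign every NUMA node to an interrupt vector, round-robin.

    Returns one list of node numbers per vector (illustrative model only).
    """
    if num_vectors <= 0:
        raise ValueError("need at least one vector")
    vectors = [[] for _ in range(num_vectors)]
    for node in range(num_nodes):
        # Wrap around so nodes past the last vector are still covered.
        vectors[node % num_vectors].append(node)
    return vectors


# Four nodes, two vectors: nodes 2 and 3 no longer all fall back to vector 0.
print(spread_nodes(4, 2))
```

With four nodes and two vectors, each vector ends up serving two nodes, which is the even coverage the block layer assumes.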

irq/matrix: Split out the CPU selection code into a helper

Linux finds the CPU which has the lowest vector allocation count to spread
out the non-managed interrupts across the possible target CPUs, but does
not do so for managed interrupts.

Split out the CPU selection code into a helper function for reuse. No
functional change.

irq/matrix: Spread managed interrupts on allocation

Linux spreads out non-managed interrupts across the possible target CPUs
to avoid vector space exhaustion.

Managed interrupts are treated differently, as for them the vectors are
reserved (with guarantee) when the interrupt descriptors are initialized.

When the interrupt is requested a real vector is assigned. The assignment
logic uses the first CPU in the affinity mask for assignment. If the
interrupt has more than one CPU in the affinity mask, which happens when a
multi-queue device has fewer queues than CPUs, then doing the same search as
for non-managed interrupts makes sense as it puts the interrupt on the
least interrupt-plagued CPU. For single-CPU affine vectors that's obviously
a NOOP.

Restructure the matrix allocation code so it does the 'best CPU' search, add
the sanity check for an empty affinity mask and adapt the call site in the
x86 vector management code.
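
The 'best CPU' search with the empty-mask sanity check can be sketched as follows. This is a Python model under stated assumptions, not the kernel helper: find_best_cpu, the dict of available counts, and the use of ValueError for an empty mask are all hypothetical choices made for the example.

```python
def find_best_cpu(available, affinity_mask):
    """Return the CPU in affinity_mask with the most available vectors.

    available:     dict mapping CPU number -> free vector count (assumed shape)
    affinity_mask: set of CPU numbers the interrupt may target
    """
    if not affinity_mask:
        # Sanity check for an empty affinity mask.
        raise ValueError("empty affinity mask")
    # sorted() makes the tie-break deterministic: the lowest CPU number wins.
    return max(sorted(affinity_mask), key=lambda cpu: available[cpu])


available = {0: 3, 1: 7, 2: 7, 3: 1}
print(find_best_cpu(available, {0, 1, 2}))  # several candidates in the mask
print(find_best_cpu(available, {3}))        # single-CPU mask: nothing to choose
```

For a single-CPU mask the search has only one candidate, which is why the text calls that case a NOOP.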

genirq/matrix: Improve target CPU selection for managed interrupts.

On large systems with multiple devices of the same class (e.g. NVMe disks,
using managed interrupts), the kernel can affinitize these interrupts to a
small subset of CPUs instead of spreading them out evenly.

irq_matrix_alloc_managed() tries to select the CPU in the supplied cpumask
of possible target CPUs which has the lowest number of interrupt vectors
allocated.

This is done by searching for the CPU with the highest number of available
vectors. While this is correct for non-managed interrupts, it can select the
wrong CPU for managed interrupts. Under certain constellations this results in
affinitizing the managed interrupts of several devices to a single CPU in
a set.

The bookkeeping of available vectors works as follows:

 1) Non-managed interrupts:

    available is decremented when the interrupt is actually requested by
    the device driver and a vector is assigned. It's incremented when the
    interrupt and the vector are freed.

 2) Managed interrupts:

    Managed interrupts guarantee vector reservation when the MSI/MSI-X
    functionality of a device is enabled, which is achieved by reserving
    vectors in the bitmaps of the possible target CPUs. This reservation
    decrements the available count on each possible target CPU.

    When the interrupt is requested by the device driver then a vector is
    allocated from the reserved region. The operation is reversed when the
    interrupt is freed by the device driver. Neither of these operations
    affects the available count.

    The reservation persists up to the point where the MSI/MSI-X
    functionality is disabled and only this operation increments the
    available count again.
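
The two bookkeeping schemes above can be modelled in a few lines. This is an illustrative Python sketch, not the kernel's struct cpumap; the class and function names are hypothetical, and each call handles exactly one vector.

```python
from dataclasses import dataclass


@dataclass
class CpuMap:
    available: int
    allocated: int = 0
    managed: int = 0  # vectors reserved for managed interrupts


def alloc_nonmanaged(cm):
    # 1) Non-managed: requesting the interrupt consumes an available vector.
    cm.available -= 1
    cm.allocated += 1


def free_nonmanaged(cm):
    cm.available += 1
    cm.allocated -= 1


def reserve_managed(cpus):
    # 2) Managed: enabling MSI/MSI-X reserves a vector on every possible CPU.
    for cm in cpus:
        cm.available -= 1
        cm.managed += 1


def alloc_managed(cm):
    # Taken from the reserved region, so "available" does not change.
    cm.allocated += 1


def free_managed(cm):
    cm.allocated -= 1


def release_managed(cpus):
    # Only disabling MSI/MSI-X gives the reserved vectors back.
    for cm in cpus:
        cm.available += 1
        cm.managed -= 1
```

The point to notice is that alloc_managed and free_managed never touch the available field, which is what makes it a poor selection criterion for managed interrupts.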

For non-managed interrupts the available count is the correct selection
criterion because the guaranteed reservations need to be taken into
account. Using the allocated counter could lead to a failing allocation in
the following situation (total vector space of 10 assumed):

                    CPU0  CPU1
 available:            2     0
 allocated:            5     3   <--- CPU1 is selected, but available space = 0
 managed reserved:     3     7

 while available yields the correct result.

For managed interrupts the available count is not the appropriate
selection criterion because as explained above the available count is not
affected by the actual vector allocation.

The following example illustrates that. Total vector space of 10
assumed. The starting point is:

                    CPU0  CPU1
 available:            5     4
 allocated:            2     3
 managed reserved:     3     3

 Allocating vectors for three non-managed interrupts will result in
 affinitizing the first two to CPU0 and the third one to CPU1 because the
 available count is adjusted with each allocation:

                    CPU0  CPU1
 available:            5     4   <- Select CPU0 for 1st allocation
 --> allocated:        3     3

 available:            4     4   <- Select CPU0 for 2nd allocation
 --> allocated:        4     3

 available:            3     4   <- Select CPU1 for 3rd allocation
 --> allocated:        4     4

 But the allocation of three managed interrupts starting from the same
 point will affinitize all of them to CPU0 because the available count is
 not affected by the allocation (see above). So the end result is:

                    CPU0  CPU1
 available:            5     4
 allocated:            5     3

Introduce a "managed_allocated" field in struct cpumap to track the vector
allocation for managed interrupts separately. Use this information to
select the target CPU when a vector is allocated for a managed interrupt,
which results in more evenly distributed vector assignments. The above
example results in the following allocations:

                      CPU0  CPU1
 managed_allocated:      0     0   <- Select CPU0 for 1st allocation
 --> allocated:          3     3

 managed_allocated:      1     0   <- Select CPU1 for 2nd allocation
 --> allocated:          3     4

 managed_allocated:      1     1   <- Select CPU0 for 3rd allocation
 --> allocated:          4     4

The allocation of non-managed interrupts is not affected by this change and
is still evaluating the available count.

The overall distribution of interrupt vectors for both types of interrupts
might still not be perfectly even depending on the number of non-managed
and managed interrupts in a system, but due to the reservation guarantee
for managed interrupts this cannot be avoided.

Expose the new field in debugfs as well.
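
The worked example in the commit message above can be replayed with a small simulation. This is a Python sketch, not the kernel implementation: the dict layout, the allocate and run helpers, and the use_managed_count switch are hypothetical, and breaking ties in favour of the lowest CPU number is an assumption chosen to match the tables in the text.

```python
def allocate(cpus, managed, use_managed_count=True):
    """Allocate one vector and return the index of the chosen CPU."""
    indices = range(len(cpus))
    if managed and use_managed_count:
        # New criterion: fewest managed vectors allocated so far.
        target = min(indices, key=lambda i: cpus[i]["managed_allocated"])
    else:
        # Old criterion (still used for non-managed): most available vectors.
        target = max(indices, key=lambda i: cpus[i]["available"])
    cpu = cpus[target]
    cpu["allocated"] += 1
    if managed:
        cpu["managed_allocated"] += 1  # "available" is deliberately untouched
    else:
        cpu["available"] -= 1
    return target


def start():
    # Starting point from the example: total vector space of 10 per CPU.
    return [
        {"available": 5, "allocated": 2, "managed_allocated": 0},
        {"available": 4, "allocated": 3, "managed_allocated": 0},
    ]


def run(managed, use_managed_count=True):
    """Allocate three vectors; return (chosen CPUs, final allocated counts)."""
    cpus = start()
    picks = [allocate(cpus, managed, use_managed_count) for _ in range(3)]
    return picks, [c["allocated"] for c in cpus]


print(run(managed=False))                          # non-managed interrupts
print(run(managed=True, use_managed_count=False))  # managed, before the change
print(run(managed=True))                           # managed, after the change
```

Under these assumptions, the non-managed run picks CPU0, CPU0, CPU1; the managed run using the available count picks CPU0 three times and ends at allocated 5/3; the managed run using managed_allocated alternates CPU0, CPU1, CPU0 and ends at 4/4.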

Joshua R. Poulson (jrp)
Changed in linux-azure (Ubuntu):
status: New → Confirmed
Marcelo Cerri (mhcerri)
Changed in linux-azure (Ubuntu Bionic):
status: New → Fix Committed
Changed in linux-azure (Ubuntu):
status: Confirmed → In Progress
tags: added: kernel-da-key kernel-hyper-v
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 4.15.0-1032.33~16.04.1

---------------
linux-azure (4.15.0-1032.33~16.04.1) xenial; urgency=medium

  * linux-azure: 4.15.0-1032.33~16.04.1 -proposed tracker (LP: #1802588)

  * Packaging resync (LP: #1786013)
    - [Package] add support for specifying the primary makefile

  [ Ubuntu: 4.15.0-1032.33 ]

  * linux-azure: 4.15.0-1032.33 -proposed tracker (LP: #1802503)
  * [Hyper-V] Fix IRQ spreading on NVMe devices with lower numbers of channels
    (LP: #1802358)
    - SAUCE: genirq/affinity: Spread IRQs to all available NUMA nodes
    - SAUCE: irq/matrix: Split out the CPU selection code into a helper
    - SAUCE: irq/matrix: Spread managed interrupts on allocation
    - SAUCE: genirq/matrix: Improve target CPU selection for managed interrupts.
  * linux-azure: fix systemd ADT test failure (LP: #1722226)
    - [Packaging] Move scsi_debug to the linux-image package

 -- Marcelo Henrique Cerri <email address hidden> Fri, 09 Nov 2018 18:59:03 -0200

Changed in linux-azure (Ubuntu):
status: In Progress → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 4.15.0-1032.33

---------------
linux-azure (4.15.0-1032.33) bionic; urgency=medium

  * linux-azure: 4.15.0-1032.33 -proposed tracker (LP: #1802503)

  * [Hyper-V] Fix IRQ spreading on NVMe devices with lower numbers of channels
    (LP: #1802358)
    - SAUCE: genirq/affinity: Spread IRQs to all available NUMA nodes
    - SAUCE: irq/matrix: Split out the CPU selection code into a helper
    - SAUCE: irq/matrix: Spread managed interrupts on allocation
    - SAUCE: genirq/matrix: Improve target CPU selection for managed interrupts.

  * linux-azure: fix systemd ADT test failure (LP: #1722226)
    - [Packaging] Move scsi_debug to the linux-image package

 -- Marcelo Henrique Cerri <email address hidden> Fri, 09 Nov 2018 11:12:54 -0200

Changed in linux-azure (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 4.18.0-1006.6

---------------
linux-azure (4.18.0-1006.6) cosmic; urgency=medium

  * linux-azure: 4.18.0-1006.6 -proposed tracker (LP: #1805244)

  * Accelerated networking (SR-IOV VF) broken in 18.10 daily (LP: #1794477)
    - [Packaging] Move pci-hyperv and autofs4 back to linux-modules

linux-azure (4.18.0-1005.5) cosmic; urgency=medium

  * linux-azure: 4.18.0-1005.5 -proposed tracker (LP: #1802752)

  * [Hyper-V] Fix IRQ spreading on NVMe devices with lower numbers of channels
    (LP: #1802358)
    - SAUCE: genirq/affinity: Spread IRQs to all available NUMA nodes
    - SAUCE: irq/matrix: Split out the CPU selection code into a helper
    - SAUCE: irq/matrix: Spread managed interrupts on allocation
    - SAUCE: genirq/matrix: Improve target CPU selection for managed interrupts.

  [ Ubuntu: 4.18.0-12.13 ]

  * linux: 4.18.0-12.13 -proposed tracker (LP: #1802743)
  * [FEAT] Guest-dedicated Crypto Adapters (LP: #1787405)
    - s390/zcrypt: Add ZAPQ inline function.
    - s390/zcrypt: Review inline assembler constraints.
    - s390/zcrypt: Integrate ap_asm.h into include/asm/ap.h.
    - s390/zcrypt: fix ap_instructions_available() returncodes
    - KVM: s390: vsie: simulate VCPU SIE entry/exit
    - KVM: s390: introduce and use KVM_REQ_VSIE_RESTART
    - KVM: s390: refactor crypto initialization
    - s390: vfio-ap: base implementation of VFIO AP device driver
    - s390: vfio-ap: register matrix device with VFIO mdev framework
    - s390: vfio-ap: sysfs interfaces to configure adapters
    - s390: vfio-ap: sysfs interfaces to configure domains
    - s390: vfio-ap: sysfs interfaces to configure control domains
    - s390: vfio-ap: sysfs interface to view matrix mdev matrix
    - KVM: s390: interface to clear CRYCB masks
    - s390: vfio-ap: implement mediated device open callback
    - s390: vfio-ap: implement VFIO_DEVICE_GET_INFO ioctl
    - s390: vfio-ap: zeroize the AP queues
    - s390: vfio-ap: implement VFIO_DEVICE_RESET ioctl
    - KVM: s390: Clear Crypto Control Block when using vSIE
    - KVM: s390: vsie: Do the CRYCB validation first
    - KVM: s390: vsie: Make use of CRYCB FORMAT2 clear
    - KVM: s390: vsie: Allow CRYCB FORMAT-2
    - KVM: s390: vsie: allow CRYCB FORMAT-1
    - KVM: s390: vsie: allow CRYCB FORMAT-0
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-1
    - KVM: s390: vsie: allow guest FORMAT-1 CRYCB on host FORMAT-2
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-2
    - KVM: s390: device attrs to enable/disable AP interpretation
    - KVM: s390: CPU model support for AP virtualization
    - s390: doc: detailed specifications for AP virtualization
    - KVM: s390: fix locking for crypto setting error path
    - KVM: s390: Tracing APCB changes
    - s390: vfio-ap: setup APCB mask using KVM dedicated function
    - [Config:] Enable CONFIG_S390_AP_IOMMU and set CONFIG_VFIO_AP to module.
  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts
  * CVE-2018-18955: nested user namespaces with more than fiv...

Changed in linux-azure (Ubuntu Cosmic):
status: New → Fix Released
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Andy Whitcroft (apw)
tags: added: kernel-fixup-verification-needed-bionic
removed: verification-needed-bionic
Brad Figg (brad-figg)
tags: added: verification-needed-bionic
Andy Whitcroft (apw) wrote :

This bug was erroneously marked for verification in bionic; verification is not required and verification-needed-bionic is being removed.

tags: removed: verification-needed-bionic
tags: added: verification-done-bionic