[SRU] irqbalance stops balancing Mellanox VF IRQs due to unsigned integer overflow
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
irqbalance (Ubuntu) |
Fix Released
|
Undecided
|
Unassigned | ||
Jammy |
Fix Released
|
High
|
Mauricio Faria de Oliveira |
Bug Description
[Impact]
* irqbalance in Jammy might stop rebalancing IRQs
due to an unsigned integer overflow bug.
* This impacts the key functionality of irqbalance,
and impacts high-performance networking traffic
(eg, network IRQs remain unbalanced in same CPUs).
* Only Jammy is impacted:
- Kinetic and later have the fix.
- Focal and earlier do not have the bug.
* There are no fix-up commits upstream (the fix is
the last commit in the file since August, 2021).
* The fix just uses an equivalent comparison:
- Before: A - B < C (ie, overflows if A < B)
- After: A < C + B
[Test Case]
* Scenario described by the original reporter:
- VM with 4 vCPUs and VF from Mellanox NIC.
(comment #6)
It reproduces with virtio-net multiqueue too.
(comment #7)
- Detach the VF (some maintenance, for example)
- Attach the VF back.
- VF IRQs are mapped to 4 vCPUs during attach.
- VF IRQs are mapped to 1 vCPU by irqbalance.
- VF network receive (RX) traffic is increased.
Now:
- Without the fix: that 1 vCPU keeps at high
CPU utilization in ksoftirqd (network RX),
and the VF IRQs are not rebalanced to vCPUs.
The total network RX speed is reduced
(limited to the performance of 1 vCPU).
- With the fix: the VF IRQs are rebalanced to
other vCPUs eventually, and CPU util is not
too high on any vCPUs.
The total network RX speed is not reduced
(limited to other factors, but not 1 vCPU).
* The original reporter verified a test package
in ppa:mfo/sf00370397, and confirmed it works
as expected.
[Regression Potential]
* Theoretically, the updated condition might hit
other issues, but it should not regress existing
cases since it's mathematically equivalent to the
original condition.
* If an issue were to occur, it might prevent again
irqbalance from balancing IRQs, as this if-block
has `else return;` before calling `migrate_irq()`
(which is the issue hit by this bug, actually).
* Fortunately, not only the condition is equivalent,
but this change has been in place upstream since
Aug 2021, and in Ubuntu since Kinetic, and has not
received additional fix-ups or reverts in either.
[Other Info]
* More details about the fix and scope in comment #1.
The fix commit, dated Aug 2021:
$ git show --no-patch 2a66a666d3e20 ec5b1a4309447e3 2d5f292871
commit 2a66a666d3e202d
Author: liuchao173 <email address hidden>
Date: Tue Aug 24 20:50:18 2021 +0800
fix unsigned integer subtraction sign overflow
Min_load, adjustment_load and load are unsigned integers, so it overflows when (lb_info->min_load + info->load) < (lb_info- >adjustment_ load - info->load). The result will be greater than zero. Therefore the irq cannot be selected to rebalanced.
- No fixups upstream; it's the last commit in the file (since Aug 2021):
$ git log --oneline -1 origin/master -- irqlist.c
2a66a666d3e2 fix unsigned integer subtraction sign overflow
Only needed in Ubuntu Jammy (v1.8.0-based):
- Introduced upstream for v1.9.0:
$ git describe --contains 2a66a666d3e20
v1.9.0~7^2
- Present in Ubuntu Lunar (1.9.2) and later:
$ rmadison -a source irqbalance 14.04.4 | trusty-updates | source 1ubuntu0. 18.04.1 | bionic-updates | source
irqbalance | 1.0.6-2 | trusty | source
irqbalance | 1.0.6-2ubuntu0.
irqbalance | 1.1.0-2ubuntu1 | xenial | source
irqbalance | 1.3.0-0.1 | bionic | source
irqbalance | 1.3.0-0.
irqbalance | 1.6.0-3ubuntu1 | focal | source
irqbalance | 1.8.0-1build1 | jammy | source
irqbalance | 1.9.2-1 | lunar | source
irqbalance | 1.9.2-1 | mantic | source
- First introduced in Ubuntu Kinetic (1.9.0-1), so it's been tested since. /launchpad. net/ubuntu/ +source/ irqbalance/ 1.9.0-1
https:/
- Focal (v1.6.0-based) and earlier do not it, because it fixes commit
e9e28114036a ("improve irq migrate rule to avoid high irq load") that
was introduced upstream for v1.7.0.
The code change is relatively simple math, for an equivalent comparison:
- if ((lb_info->min_load + info->load) - (lb_info- >adjustment_ load - info->load) < delta_load) { >adjustment_ load - info->load)) {
+ if ((lb_info->min_load + info->load) < delta_load + (lb_info-
i.e.,
- if ((A) - (B) < C) {
+ if ((A) < C + (B)) {
And, indeed, for unsigned integer math, if "A is less than B", then "A - B" (above) is not negative (as it's unsigned), but actually overflows to a very large positive integer, which does not meet the condition "< delta_load".