mlx5_core reports hardware checksum error for padded packets on Mellanox NICs
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
linux (Ubuntu) | Fix Released | Undecided | Unassigned | |
Bionic | Fix Released | Medium | Matthew Ruffell | |
Bug Description
BugLink: https:/
[Impact]
On machines equipped with Mellanox NICs (in this particular case, Mellanox 5 series NICs using the mlx5_core driver), the following kernel splat appears after installing 4.15.0-56 or later:
bond0: hw csum failure
CPU: 63 PID: 2473 Comm: in:imklog Tainted: P OE 4.15.0-58-generic #64~16.04.1-Ubuntu
Call Trace:
<IRQ>
dump_stack+
netdev_
__skb_checksum_
nf_ip_checksum+
tcp_error+
? tcp_v4_
nf_conntrack_
ipv4_conntrack_
nf_hook_
? skb_send_
ip_rcv+0x30f/0x370
? inet_del_
__netif_
? tcp4_gro_
__netif_
? __netif_
netif_receive_
napi_gro_
mlx5e_handle_
mlx5e_poll_
mlx5e_napi_
net_rx_
__do_softirq+
irq_exit+0xca/0xd0
do_IRQ+0x57/0xe0
common_
</IRQ>
In 4.15.0-56, a commit was added from upstream -stable that introduced an optimisation for checksumming packets which have had zero bytes padded to the end of the packet.
commit 88078d98d1bb085
Author: Eric Dumazet <email address hidden>
Date: Wed Apr 18 11:43:15 2018 -0700
subject: net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends
You can read it here:
https:/
It was discussed in this bugzilla link:
https:/
This commit causes problems with a number of NIC devices, including Mellanox.
This is best described by the maintainer, Dimitris Michailidis:
> > > MLNX devices have an issue with packets that are padded past the end of
> > > the L3 payload with bytes that aren't all 0s. They use a mode of checksum
> > > reporting which should be including the padding bytes but MLNX devices
> > > leave those out. When the padding bytes aren't all 0 this omission causes
> > > a checksum error. This device behavior has existed for a long time but it
> > > has begun causing errors only this year. Before a padded packet had its HW
> > > checksum ignored so it wasn't material what HW had reported. More recently
> > > padded packet checksums started using the HW value and now it is
> > > noticeable when that value isn't right.
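Now, some routers occasionally place additional information in the padding bytes, so the padding is no longer all zeros and the checksum the hardware should report changes. Since the hardware checksum was ignored for padded packets until 88078d98d1bb085 landed in 4.15.0-56, the incorrect value went unnoticed before then.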
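The arithmetic behind the quoted explanation can be sketched in a few lines of Python. This is a simplified illustration, not driver code: the payload and padding bytes are made up, and the sum is taken over the payload alone rather than over the exact region a real NIC covers. It only shows that leaving the padding out of a 16-bit ones'-complement sum is harmless when the padding is all zeros and gives a different result otherwise.

```python
def ones_complement_sum(data: bytes) -> int:
    """16-bit ones'-complement sum (RFC 1071 style) over data."""
    if len(data) % 2:
        data += b"\x00"  # odd length: pad with a zero byte for summing
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


# Hypothetical 40-byte L3 packet plus 6 bytes of trailing padding.
payload = bytes(range(1, 41))
zero_pad = bytes(6)
nonzero_pad = b"\x00\x00\x12\x34\x00\x00"  # e.g. data a router left in the padding

# What a device reports if it leaves the padding bytes out of the sum.
hw_reported = ones_complement_sum(payload)

# What the sum is when the padding is included, as the stack expects.
zero_pad_matches = hw_reported == ones_complement_sum(payload + zero_pad)
nonzero_pad_matches = hw_reported == ones_complement_sum(payload + nonzero_pad)

print("zero padding matches:", zero_pad_matches)
print("non-zero padding matches:", nonzero_pad_matches)
```

With all-zero padding the two sums agree, which is why the device behaviour was invisible for so long; with non-zero padding they differ, which is the condition that leads to the "hw csum failure" message.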
[Fix]
This was fixed for Mellanox 4 and 5 series drivers recently.
Mellanox 4: 74abc07dee61308
https:/
Mellanox 5: e8c8b53ccaff568
https:/
This customer hit the issue with mlx5_core driver, so the fix is:
commit e8c8b53ccaff568
Author: Cong Wang <email address hidden>
Date: Mon Dec 3 22:14:04 2018 -0800
subject: net/mlx5e: Force CHECKSUM_
This is actually present in 4.15.0-59, which is currently sitting in -proposed.
The commits are part of the 4.9.156, 4.14.99 and 4.19.21 upstream -stable releases, and have been pulled into bionic as part of LP #1837664.
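As a rough aid for triage, the version window described in this report (regression introduced in 4.15.0-56, fix present in 4.15.0-59) can be checked against a kernel release string such as the output of `uname -r`. The helper names below are hypothetical, and the check only covers the Ubuntu 4.15.0 series discussed here; it says nothing about other kernels.

```python
import re


def abi_number(release: str):
    """Return the ABI number N from a '4.15.0-N-...' release string, else None."""
    m = re.match(r"^4\.15\.0-(\d+)", release)
    return int(m.group(1)) if m else None


def in_affected_window(release: str) -> bool:
    """True if the release falls in the 4.15.0-56 .. 4.15.0-58 window from this report."""
    abi = abi_number(release)
    return abi is not None and 56 <= abi < 59


# The kernel from the splat above is inside the window.
print(in_affected_window("4.15.0-58-generic"))
# The kernel in -proposed that carries the fix is not.
print(in_affected_window("4.15.0-59-generic"))
```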
[Testcase]
Bring an interface up on a machine with Mellanox 5 series NICs.
When a packet arrives that is smaller than the required minimum size and has had padding added, the problem will be triggered.
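To see which packets get padded at all, the sketch below computes how many padding bytes a frame needs to reach the Ethernet minimum. It assumes a 14-byte Ethernet header with no VLAN tag and a 60-byte minimum frame size excluding the FCS; the function name is hypothetical.

```python
ETH_HEADER_LEN = 14   # destination MAC + source MAC + EtherType, no VLAN tag assumed
ETH_MIN_FRAME = 60    # minimum Ethernet frame size, excluding the 4-byte FCS


def padding_len(ip_total_len: int) -> int:
    """Padding bytes appended after an IP packet of the given total length."""
    return max(0, ETH_MIN_FRAME - (ETH_HEADER_LEN + ip_total_len))


# A bare TCP ACK: 20-byte IPv4 header + 20-byte TCP header, no payload.
print(padding_len(40))
# A full-size packet needs no padding.
print(padding_len(1500))
```

Small packets such as bare TCP ACKs therefore carry a few padding bytes, and those are the frames on which the behaviour described above can show up.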
The 4.15.0-59 from -proposed has been tested by the customer, and resolves the issue.
[Regression Potential]
This patch has a low chance of regression, since it fixes a regression introduced by 88078d98d1bb085.
The changes are limited to mlx5_core driver, and have been well tested and accepted by the community due to their selection for upstream -stable.
description: updated
Changed in linux (Ubuntu Bionic):
status: New → Fix Committed
importance: Undecided → Medium
assignee: nobody → Matthew Ruffell (mruffell)
tags: added: sts
The customer installed 4.15.0-59 from -proposed to a machine with Mellanox Ethernet CX4LX cards, using the mlx5_core kernel module.
Checksums are now calculated correctly and the kernel splat does not occur when the devices are brought up.
Marking this as verified.