Activity log for bug #1840854

Date Who What changed Old value New value Message
2019-08-20 23:40:35 Matthew Ruffell bug added bug
2019-08-20 23:41:03 Matthew Ruffell description

Old value:

BugLink: https://bugs.launchpad.net/bugs/

[Impact]

On machines equipped with Mellanox NICs (in this particular case, Mellanox 5 series NICs using the mlx5_core driver), the following kernel splat appears after installing 4.15.0-56 or later:

bond0: hw csum failure
CPU: 63 PID: 2473 Comm: in:imklog Tainted: P OE 4.15.0-58-generic #64~16.04.1-Ubuntu
Call Trace:
 <IRQ>
 dump_stack+0x63/0x8b
 netdev_rx_csum_fault+0x38/0x40
 __skb_checksum_complete+0xc0/0xd0
 nf_ip_checksum+0xca/0xf0
 tcp_error+0xe0/0x1a0 [nf_conntrack]
 ? tcp_v4_rcv+0x7c6/0xa70
 nf_conntrack_in+0xde/0x520 [nf_conntrack]
 ipv4_conntrack_in+0x1c/0x20 [nf_conntrack_ipv4]
 nf_hook_slow+0x48/0xd0
 ? skb_send_sock+0x50/0x50
 ip_rcv+0x30f/0x370
 ? inet_del_offload+0x40/0x40
 __netif_receive_skb_core+0x879/0xba0
 ? tcp4_gro_receive+0x117/0x1b0
 __netif_receive_skb+0x18/0x60
 ? __netif_receive_skb+0x18/0x60
 netif_receive_skb_internal+0x45/0xf0
 napi_gro_receive+0xd0/0xf0
 mlx5e_handle_rx_cqe_mpwrq+0x4a1/0x8a0 [mlx5_core]
 mlx5e_poll_rx_cq+0xc3/0x880 [mlx5_core]
 mlx5e_napi_poll+0x9b/0x280 [mlx5_core]
 net_rx_action+0x265/0x3b0
 __do_softirq+0xf5/0x2a8
 irq_exit+0xca/0xd0
 do_IRQ+0x57/0xe0
 common_interrupt+0x8c/0x8c
 </IRQ>

In 4.15.0-56, a commit was added from upstream -stable that introduced an optimisation for checksumming packets which have had zero bytes padded to the end of the packet.

commit 88078d98d1bb085d72af8437707279e203524fa5
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Apr 18 11:43:15 2018 -0700
subject: net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends

You can read it here:
https://github.com/torvalds/linux/commit/88078d98d1bb085d72af8437707279e203524fa5

It was discussed in this bugzilla link:
https://bugzilla.kernel.org/show_bug.cgi?id=201849

This commit causes problems with a number of NIC devices, including Mellanox. This is best described by the maintainer, Dimitris Michailidis:

> > > MLNX devices have an issue with packets that are padded past the end of
> > > the L3 payload with bytes that aren't all 0s. They use a mode of checksum
> > > reporting which should be including the padding bytes but MLNX devices
> > > leave those out. When the padding bytes aren't all 0 this omission causes
> > > a checksum error. This device behavior has existed for a long time but it
> > > has begun causing errors only this year. Before a padded packet had its HW
> > > checksum ignored so it wasn't material what HW had reported. More recently
> > > padded packet checksums started using the HW value and now it is
> > > noticeable when that value isn't right.

Now, some routers occasionally place additional information in the padding section, so the padding is not all zeros and the checksum over the full frame changes. Since the hardware checksum was ignored until 4.15.0-56 with 88078d98d1bb085d72af8437707279e203524fa5, this wasn't an issue. But with the optimisation, we start running into trouble since the hardware checksums no longer match what the kernel is expecting.

[Fix]

This was fixed for Mellanox 4 and 5 series drivers recently.

Mellanox 4: 74abc07dee613086f9c0ded9e263ddc959a6de04
https://github.com/torvalds/linux/commit/74abc07dee613086f9c0ded9e263ddc959a6de04

Mellanox 5: e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
https://github.com/torvalds/linux/commit/e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a

This customer hit the issue with the mlx5_core driver, so the fix is:

commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date: Mon Dec 3 22:14:04 2018 -0800
subject: net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames

This is actually present in 4.15.0-59, which is currently sitting in -proposed. The commits are a part of the 4.9.156, 4.14.99 and 4.19.21 upstream -stable releases, and have been pulled into bionic as a part of LP #1837664.

[Testcase]

Bring an interface up on a machine with Mellanox series 5 NICs. When a packet comes through which is smaller than required and padding is added, the problem will be triggered.

The 4.15.0-59 kernel from -proposed has been tested by the customer, and resolves the issue.

[Regression Potential]

This patch has a low chance of regression since it fixes a regression introduced by 88078d98d1bb085d72af8437707279e203524fa5 in 4.15.0-56. The changes are limited to the mlx5_core driver, and have been well tested and accepted by the community due to their selection for upstream -stable.

New value:

BugLink: https://bugs.launchpad.net/bugs/1840854

[Impact]

On machines equipped with Mellanox NICs (in this particular case, Mellanox 5 series NICs using the mlx5_core driver), the following kernel splat appears after installing 4.15.0-56 or later:

bond0: hw csum failure
CPU: 63 PID: 2473 Comm: in:imklog Tainted: P OE 4.15.0-58-generic #64~16.04.1-Ubuntu
Call Trace:
 <IRQ>
 dump_stack+0x63/0x8b
 netdev_rx_csum_fault+0x38/0x40
 __skb_checksum_complete+0xc0/0xd0
 nf_ip_checksum+0xca/0xf0
 tcp_error+0xe0/0x1a0 [nf_conntrack]
 ? tcp_v4_rcv+0x7c6/0xa70
 nf_conntrack_in+0xde/0x520 [nf_conntrack]
 ipv4_conntrack_in+0x1c/0x20 [nf_conntrack_ipv4]
 nf_hook_slow+0x48/0xd0
 ? skb_send_sock+0x50/0x50
 ip_rcv+0x30f/0x370
 ? inet_del_offload+0x40/0x40
 __netif_receive_skb_core+0x879/0xba0
 ? tcp4_gro_receive+0x117/0x1b0
 __netif_receive_skb+0x18/0x60
 ? __netif_receive_skb+0x18/0x60
 netif_receive_skb_internal+0x45/0xf0
 napi_gro_receive+0xd0/0xf0
 mlx5e_handle_rx_cqe_mpwrq+0x4a1/0x8a0 [mlx5_core]
 mlx5e_poll_rx_cq+0xc3/0x880 [mlx5_core]
 mlx5e_napi_poll+0x9b/0x280 [mlx5_core]
 net_rx_action+0x265/0x3b0
 __do_softirq+0xf5/0x2a8
 irq_exit+0xca/0xd0
 do_IRQ+0x57/0xe0
 common_interrupt+0x8c/0x8c
 </IRQ>

In 4.15.0-56, a commit was added from upstream -stable that introduced an optimisation for checksumming packets which have had zero bytes padded to the end of the packet.

commit 88078d98d1bb085d72af8437707279e203524fa5
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Apr 18 11:43:15 2018 -0700
subject: net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends

You can read it here:
https://github.com/torvalds/linux/commit/88078d98d1bb085d72af8437707279e203524fa5

It was discussed in this bugzilla link:
https://bugzilla.kernel.org/show_bug.cgi?id=201849

This commit causes problems with a number of NIC devices, including Mellanox. This is best described by the maintainer, Dimitris Michailidis:

> > > MLNX devices have an issue with packets that are padded past the end of
> > > the L3 payload with bytes that aren't all 0s. They use a mode of checksum
> > > reporting which should be including the padding bytes but MLNX devices
> > > leave those out. When the padding bytes aren't all 0 this omission causes
> > > a checksum error. This device behavior has existed for a long time but it
> > > has begun causing errors only this year. Before a padded packet had its HW
> > > checksum ignored so it wasn't material what HW had reported. More recently
> > > padded packet checksums started using the HW value and now it is
> > > noticeable when that value isn't right.

Now, some routers occasionally place additional information in the padding section, so the padding is not all zeros and the checksum over the full frame changes. Since the hardware checksum was ignored until 4.15.0-56 with 88078d98d1bb085d72af8437707279e203524fa5, this wasn't an issue. But with the optimisation, we start running into trouble since the hardware checksums no longer match what the kernel is expecting.

[Fix]

This was fixed for Mellanox 4 and 5 series drivers recently.

Mellanox 4: 74abc07dee613086f9c0ded9e263ddc959a6de04
https://github.com/torvalds/linux/commit/74abc07dee613086f9c0ded9e263ddc959a6de04

Mellanox 5: e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
https://github.com/torvalds/linux/commit/e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a

This customer hit the issue with the mlx5_core driver, so the fix is:

commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date: Mon Dec 3 22:14:04 2018 -0800
subject: net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames

This is actually present in 4.15.0-59, which is currently sitting in -proposed. The commits are a part of the 4.9.156, 4.14.99 and 4.19.21 upstream -stable releases, and have been pulled into bionic as a part of LP #1837664.

[Testcase]

Bring an interface up on a machine with Mellanox series 5 NICs. When a packet comes through which is smaller than required and padding is added, the problem will be triggered.

The 4.15.0-59 kernel from -proposed has been tested by the customer, and resolves the issue.

[Regression Potential]

This patch has a low chance of regression since it fixes a regression introduced by 88078d98d1bb085d72af8437707279e203524fa5 in 4.15.0-56. The changes are limited to the mlx5_core driver, and have been well tested and accepted by the community due to their selection for upstream -stable.

2019-08-20 23:41:12 Matthew Ruffell nominated for series Ubuntu Bionic
2019-08-20 23:41:12 Matthew Ruffell bug task added linux (Ubuntu Bionic)
2019-08-20 23:41:20 Matthew Ruffell linux (Ubuntu Bionic): status New Fix Committed
2019-08-20 23:41:23 Matthew Ruffell linux (Ubuntu Bionic): importance Undecided Medium
2019-08-20 23:41:27 Matthew Ruffell linux (Ubuntu Bionic): assignee Matthew Ruffell (mruffell)
2019-08-20 23:41:37 Matthew Ruffell tags sts
2019-08-20 23:49:59 Matthew Ruffell tags sts sts verification-done-bionic
2019-08-21 00:00:07 Ubuntu Kernel Bot linux (Ubuntu): status New Incomplete
2019-08-21 00:00:10 Ubuntu Kernel Bot tags sts verification-done-bionic bionic sts verification-done-bionic
2019-09-23 16:49:08 Mauricio Faria de Oliveira linux (Ubuntu Bionic): status Fix Committed Fix Released
2019-10-03 07:15:07 Po-Hsu Lin linux (Ubuntu): status Incomplete Fix Released
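
The checksum mismatch described in the [Impact] section of the bug description can be illustrated with a small sketch. This is not kernel or driver code: it is a Python model, assuming a plain 16-bit ones' complement sum, of a device that leaves padding bytes out of its reported checksum while the stack expects them to be covered. All function names (csum16, hw_reported_csum, stack_expected_csum, rx_checksum_mode) are hypothetical, the kernel's CHECKSUM_* modes are represented as plain strings, and the short-frame threshold used in rx_checksum_mode is an illustrative assumption about how "short ethernet frames" might be detected.

```python
import struct

ETH_HLEN = 14      # Ethernet header length
ETH_ZLEN = 60      # minimum Ethernet frame length, excluding the FCS
ETH_FCS_LEN = 4    # length of the frame check sequence


def csum16(data: bytes) -> int:
    """Folded 16-bit ones' complement sum of data (not inverted)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def hw_reported_csum(l3_packet: bytes, padding: bytes) -> int:
    """Assumed device behaviour from the quote: padding is left out."""
    return csum16(l3_packet)


def stack_expected_csum(l3_packet: bytes, padding: bytes) -> int:
    """What a full-packet checksum should cover: L3 packet plus padding."""
    return csum16(l3_packet + padding)


def rx_checksum_mode(frame_len: int) -> str:
    """Hypothetical model of the fix: do not report a full-packet
    checksum for frames short enough to carry padding.
    The threshold is an assumption made for illustration."""
    if frame_len <= ETH_ZLEN + ETH_FCS_LEN:
        return "CHECKSUM_UNNECESSARY"
    return "CHECKSUM_COMPLETE"


l3_packet = bytes(range(40))            # stand-in for a 40-byte IP + TCP header
zero_pad = bytes(6)                     # 14 + 40 + 6 = 60 bytes on the wire
junk_pad = b"\x00\x00\x12\x34\x00\x00"  # padding that is not all zeros

# All-zero padding: leaving it out makes no difference to the sum.
print(hw_reported_csum(l3_packet, zero_pad) == stack_expected_csum(l3_packet, zero_pad))
# Non-zero padding: the two sums disagree, which models the "hw csum failure".
print(hw_reported_csum(l3_packet, junk_pad) == stack_expected_csum(l3_packet, junk_pad))
# The padded frame is short, so the modelled fix avoids the full-packet checksum.
print(rx_checksum_mode(ETH_HLEN + len(l3_packet) + len(junk_pad)))
```

With all-zero padding the two sums agree, so the omission goes unnoticed. With non-zero padding they differ, which models the situation in which the kernel logs "hw csum failure". The last call shows why treating every short frame the same way sidesteps the problem without having to parse headers to find out whether padding is actually present.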