bnx2x driver causes 100% CPU load
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) |
Fix Released
|
High
|
Guilherme G. Piccoli | ||
Xenial |
Fix Released
|
High
|
Guilherme G. Piccoli | ||
Bionic |
Fix Released
|
High
|
Guilherme G. Piccoli | ||
Cosmic |
Won't Fix
|
High
|
Guilherme G. Piccoli | ||
Disco |
Fix Released
|
High
|
Guilherme G. Piccoli | ||
Eoan |
Fix Released
|
High
|
Guilherme G. Piccoli | ||
Focal |
Fix Released
|
High
|
Guilherme G. Piccoli |
Bug Description
[Impact]
* The PTP feature in bnx2x driver is implemented in a way that if the NIC firmware takes some time to perform the timestamping - which is observed as a bad register read in bnx2x_ptp_task() - then the ptp worker function will reschedule itself indefinitely until the value read from the register is meaningful. With that behavior, if an userspace tool request a bad configured RX filter to bnx2x (or if NIC firmware has any other issue in timestamping), the function bnx2x_ptp_task() will be rescheduled forever and cause a unbound resource consumption. This manifests as a kworker thread consuming 100% of CPU.
* The dmesg log will show the following message regarding other packets being skipped on timestamp routine due to a packet getting stuck in the timestamping "pipeline":
"bnx2x: [bnx2x_
outstanding packet to timestamp, this packet will not be timestamped"
Also, by using ftrace user can notice that function bnx2x_ptp_task() is being called a lot, and by enabling bnx2x PTP debugging log (ethtool -s <iface> msglvl 16777216) it's possible to observe the following message flooding the kernel log:
"bnx2x: [bnx2x_
* The patch proposed in this SRU request is accepted upstream and is available currently (2019-07-03) in David Miller's linux-net tree:
git.kernel.
Besides fixing the issue, it also adds an ethtool statistics for accounting the ptp errors and reduces message flooding in case of errors.
[Test case]
Reproducing the problem is not difficult; we've used chrony in Bionic to trigger the problem. The steps are:
a) Install chrony on Bionic in a system with working NIC managed by bnx2x;
b) Edit chrony configuration and add: "hwtimestamp *" to the top of its conf file;
c) Restart chrony service
Check dmesg for the "[...]single outstanding packet" message and the overall CPU workload using a tool like "top" to observe a kthread consuming 100% of CPU.
[Regression potential]
The patch scope is restricted to bnx2x ptp handler, and was validated by the driver maintainer. If there's any possibility of regressions, we believe the worst would be an issue affecting the packet timestamping, not messing with the regular xmit path for the driver.
information type: | Public → Private |
information type: | Private → Public |
description: | updated |
Changed in linux (Ubuntu Cosmic): | |
status: | In Progress → Won't Fix |
Changed in linux (Ubuntu Bionic): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Disco): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Xenial): | |
status: | In Progress → Fix Committed |
tags: | added: cscc |
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1832082
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.