We have a user who has been running successfully under load with the test kernel provided here, which was patched with the following two commits:
"i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled" Commit: fa38e30ac73fbb01d7e5d0fd1b12d412fa3ac3ee
"i40e: prevent overlapping tx_timeout recover" Commit: d5585b7b6846a6d0f9517afe57be3843150719da
The issue was hit while running on 4.15.0-38-generic #41~16.04.1-Ubuntu on Xenial (the hwe kernel).
Symptoms include messages in the kernel log of the form:
[4733544.982116] i40e 0000:18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0
[4733544.982119] i40e 0000:18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6
[4733572.116270] i40e 0000:18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 2, NTC: 0x49, HWB: 0x123, NTU: 0x123, TAIL: 0x123, INT: 0x0
[4733572.116272] i40e 0000:18:00.1 eno2: tx_timeout recovery level 1, hung_queue 2
These timeouts lead to Kafka server issues, among other problems.
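For anyone wanting to check whether their own machines show the same symptom, here is a rough Python sketch that counts i40e tx_timeout events per interface and queue in saved kernel log output. The regular expression is only inferred from the sample messages in this comment, and the names `TX_TIMEOUT_RE`, `count_tx_timeouts` and `SAMPLE` are made up for illustration; other driver versions may format the message differently.

```python
import re
from collections import Counter

# Pattern inferred from the sample log lines in this report (an assumption,
# not taken from the driver source). It matches only the "tx_timeout:" detail
# line, not the following "tx_timeout recovery level" line.
TX_TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i40e\s+(?P<pci>\S+)\s+(?P<iface>\S+):\s+"
    r"tx_timeout:\s+VSI_seid:\s+(?P<vsi>\d+),\s+Q\s+(?P<queue>\d+),"
)


def count_tx_timeouts(lines):
    """Return a Counter keyed by (interface, queue) for each tx_timeout event."""
    counts = Counter()
    for line in lines:
        match = TX_TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("iface"), int(match.group("queue")))] += 1
    return counts


# The four sample lines quoted in this report.
SAMPLE = [
    "[4733544.982116] i40e 0000:18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 6, "
    "NTC: 0x1a0, HWB: 0x66, NTU: 0x66, TAIL: 0x66, INT: 0x0",
    "[4733544.982119] i40e 0000:18:00.1 eno2: tx_timeout recovery level 1, hung_queue 6",
    "[4733572.116270] i40e 0000:18:00.1 eno2: tx_timeout: VSI_seid: 390, Q 2, "
    "NTC: 0x49, HWB: 0x123, NTU: 0x123, TAIL: 0x123, INT: 0x0",
    "[4733572.116272] i40e 0000:18:00.1 eno2: tx_timeout recovery level 1, hung_queue 2",
]

if __name__ == "__main__":
    for (iface, queue), n in sorted(count_tx_timeouts(SAMPLE).items()):
        print(f"{iface} queue {queue}: {n} tx_timeout event(s)")
```

To run it against a real machine, the output of `dmesg` or the contents of a kernel log file could be passed in line by line instead of `SAMPLE`.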
We are fairly confident this is the same issue the original reporter hit, and we'd like to use this bug to proceed with the stable release update process.