BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device seems to be dropping data

Bug #1853638 reported by diarmuid on 2019-11-22
This bug affects 2 people
Affects                    Importance  Assigned to
linux (Ubuntu)             Critical    Unassigned
network-manager (Ubuntu)   Critical    Unassigned

Bug Description

The BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device appears to be dropping data.

We are dropping samples, as the output of the benchmark tool shows:

tcdforge@x310a:/usr/local/lib/lib/uhd/examples$ ./benchmark_rate --rx_rate 10e6 --tx_rate 10e6 --duration 300
[INFO] [UHD] linux; GNU C++ version 5.4.0 20160609; Boost_105800; UHD_3.14.1.1-0-g98c7c986
[WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
Please see the general application notes in the manual for instructions.
EnvironmentError: OSError: error in pthread_setschedparam

[00:00:00.000007] Creating the usrp device with: ...
[INFO] [X300] X300 initialization sequence...
[INFO] [X300] Maximum frame size: 1472 bytes.
[INFO] [X300] Radio 1x clock: 200 MHz
[INFO] [GPS] Found an internal GPSDO: LC_XO, Firmware Rev 0.929a
[INFO] [0/DmaFIFO_0] Initializing block control (NOC ID: 0xF1F0D00000000000)
[INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1308 MB/s)
[INFO] [0/DmaFIFO_0] BIST passed (Throughput: 1316 MB/s)
[INFO] [0/Radio_0] Initializing block control (NOC ID: 0x12AD100000000001)
[INFO] [0/Radio_1] Initializing block control (NOC ID: 0x12AD100000000001)
[INFO] [0/DDC_0] Initializing block control (NOC ID: 0xDDC0000000000000)
[INFO] [0/DDC_1] Initializing block control (NOC ID: 0xDDC0000000000000)
[INFO] [0/DUC_0] Initializing block control (NOC ID: 0xD0C0000000000000)
[INFO] [0/DUC_1] Initializing block control (NOC ID: 0xD0C0000000000000)
Using Device: Single USRP:
  Device: X-Series Device
  Mboard 0: X310
  RX Channel: 0
    RX DSP: 0
    RX Dboard: A
    RX Subdev: SBX-120 RX
  RX Channel: 1
    RX DSP: 0
    RX Dboard: B
    RX Subdev: SBX-120 RX
  TX Channel: 0
    TX DSP: 0
    TX Dboard: A
    TX Subdev: SBX-120 TX
  TX Channel: 1
    TX DSP: 0
    TX Dboard: B
    TX Subdev: SBX-120 TX

[00:00:04.305374] Setting device timestamp to 0...
[WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
Please see the general application notes in the manual for instructions.
EnvironmentError: OSError: error in pthread_setschedparam
[00:00:04.310990] Testing receive rate 10.000000 Msps on 1 channels
[WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected.
Please see the general application notes in the manual for instructions.
EnvironmentError: OSError: error in pthread_setschedparam
[00:00:04.318356] Testing transmit rate 10.000000 Msps on 1 channels
[00:00:06.693119] Detected Rx sequence error.
D[00:00:09.402843] Detected Rx sequence error.
DD[00:00:40.927978] Detected Rx sequence error.
D[00:01:44.982243] Detected Rx sequence error.
D[00:02:11.400692] Detected Rx sequence error.
D[00:02:14.805292] Detected Rx sequence error.
D[00:02:41.875596] Detected Rx sequence error.
D[00:03:06.927743] Detected Rx sequence error.
D[00:03:47.967891] Detected Rx sequence error.
D[00:03:58.233659] Detected Rx sequence error.
D[00:03:58.876588] Detected Rx sequence error.
D[00:04:03.139770] Detected Rx sequence error.
D[00:04:45.287465] Detected Rx sequence error.
D[00:04:56.425845] Detected Rx sequence error.
D[00:04:57.929209] Detected Rx sequence error.
[00:05:04.529548] Benchmark complete.
Benchmark rate summary:
  Num received samples: 2995435936
  Num dropped samples: 4622800
  Num overruns detected: 0
  Num transmitted samples: 3008276544
  Num sequence errors (Tx): 0
  Num sequence errors (Rx): 15
  Num underruns detected: 0
  Num late commands: 0
  Num timeouts (Tx): 0
  Num timeouts (Rx): 0
Done!

tcdforge@x310a:/usr/local/lib/lib/uhd/examples$
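
To put the summary above in proportion, here is a minimal Python sketch that pulls the counters out of a saved benchmark_rate summary and computes the dropped fraction. Defining the fraction as dropped / (received + dropped) is my assumption, not something the tool reports, and the function names are hypothetical.

```python
import re


def parse_benchmark_summary(text):
    """Return a dict of the 'Num ...: N' counters in a benchmark_rate summary."""
    counters = {}
    for match in re.finditer(r"^\s*Num (.+?):\s+(\d+)\s*$", text, re.MULTILINE):
        counters[match.group(1)] = int(match.group(2))
    return counters


def dropped_fraction(counters):
    """Dropped samples as a fraction of received plus dropped samples.

    This definition is an assumption made for illustration.
    """
    received = counters["received samples"]
    dropped = counters["dropped samples"]
    return dropped / (received + dropped)


# Counters copied from the summary in this report.
SUMMARY = """\
  Num received samples: 2995435936
  Num dropped samples: 4622800
  Num sequence errors (Rx): 15
"""

counters = parse_benchmark_summary(SUMMARY)
print(f"dropped fraction: {dropped_fraction(counters):.4%}")
```

By that definition the loss is roughly 0.15% of the samples, spread over 15 sequence errors in 300 seconds.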

In this particular case, the nodes are USRP x310s. However, we see the same dropped samples with N210 nodes connected to the BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet device.

There is no problem with the USRPs themselves, as we have tested them with normal 1G network cards and have no dropped samples.
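
For scale, the requested sample rate is far below 10G line rate. The sketch below works that out, assuming 4 bytes per complex sample on the wire (16-bit I plus 16-bit Q). That sample format is an assumption on my part and is not stated in this report, and packet headers are ignored.

```python
def payload_bits_per_second(sample_rate_sps, bytes_per_sample=4, channels=1):
    """Sample payload rate on the wire, ignoring packet headers.

    bytes_per_sample=4 assumes 16-bit I + 16-bit Q samples; this is an
    assumption, not something stated in the report.
    """
    return sample_rate_sps * bytes_per_sample * channels * 8


rate = payload_bits_per_second(10e6)  # --rx_rate 10e6 on one channel
print(f"{rate / 1e6:.0f} Mbit/s per direction")
print(f"{rate / 1e9:.1%} of a 1G link, {rate / 10e9:.1%} of a 10G link")
```

Under that assumption the test needs about 320 Mbit/s in each direction, which is consistent with it running cleanly on a 1G card.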

Personally, I think it's something to do with the 10G network card, possibly an Ubuntu driver issue.

Note, Dell have said there is no hardware problem with the 10G interfaces

I have followed the troubleshooting information at this link to try to determine the problem: https://files.ettus.com/manual/page_usrp_x3x0_config.html
- There is no firewall on that port (disabled).
- I tried setting the CPU frequency governor but got "no or unknown cpufreq driver is active on this CPU".
- I also changed the cable connecting the USRPs to the 10G SRIOV port to Cat6a, and I get the same issue.

This is from the VM with connected USRP x310
tcdforge@x310a:~$ lspci -nn | grep -i ethernet
00:03.0 Ethernet controller [0200]: Red Hat, Inc. Virtio network device [1af4:1000]
00:05.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme-E Ethernet Virtual Function [14e4:16dc]
tcdforge@x310a:~$

5e:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller [14e4:16d8] (rev 01)
        Subsystem: Dell BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller [1028:1fea]
        Flags: bus master, fast devsel, latency 0, IRQ 50, NUMA node 0
        Memory at b9a10000 (64-bit, prefetchable) [size=64K]
        Memory at b9100000 (64-bit, prefetchable) [size=1M]
        Memory at b9aa2000 (64-bit, prefetchable) [size=8K]
        Expansion ROM at b9c00000 [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: bnxt_en
        Kernel modules: bnxt_en

We get this info from the server:
scamallra@rack9:~$ cpupower frequency-info
analyzing CPU 0:
  no or unknown cpufreq driver is active on this CPU
  CPUs which run at the same hardware frequency: Not Available
  CPUs which need to have their frequency coordinated by software: Not Available
  maximum transition latency: Cannot determine or is not supported.
Not Available
  available cpufreq governors: Not Available
  Unable to determine current policy
  current CPU frequency: Unable to call hardware
  current CPU frequency: Unable to call to kernel
  boost state support:
    Supported: yes
    Active: yes

lsb_release -rd
Description: Ubuntu 18.04.3 LTS
Release: 18.04

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: network-manager 1.10.6-2ubuntu1.1
ProcVersionSignature: Ubuntu 4.15.0-70.79-generic 4.15.18
Uname: Linux 4.15.0-70-generic x86_64
ApportVersion: 2.20.9-0ubuntu7.9
Architecture: amd64
Date: Fri Nov 22 17:39:21 2019
NetworkManager.state:
 [main]
 NetworkingEnabled=true
 WirelessEnabled=true
 WWANEnabled=true
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: network-manager
UpgradeStatus: No upgrade log present (probably fresh install)
nmcli-con: NAME UUID TYPE TIMESTAMP TIMESTAMP-REAL AUTOCONNECT AUTOCONNECT-PRIORITY READONLY DBUS-PATH ACTIVE DEVICE STATE ACTIVE-PATH SLAVE
nmcli-nm:
 RUNNING VERSION STATE STARTUP CONNECTIVITY NETWORKING WIFI-HW WIFI WWAN-HW WWAN
 running 1.10.6 connected started unknown enabled enabled enabled enabled enabled

diarmuid (diarmuidcire) wrote :

I have reports of the same device appearing to drop packets and incur a greater number of retransmissions under certain circumstances which we're still trying to nail down.

I'm using this bug for now until proven to be a different problem.

This is causing issues in a production environment.

Changed in network-manager (Ubuntu):
status: New → Confirmed
importance: Undecided → Critical
tags: added: sts

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1853638

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

We suspect this is a device (hw/fw) issue, however, not NetworkManager
or the kernel (driver bnxt_en). I've added the kernel for the driver impact
(just in case, for now). This is really to eliminate all other causes
and confirm whether the device is the root cause.

NIC
--------
Product Name: Broadcom Adv. Dual 10G SFP+ Ethernet
5e:00.1 Ethernet controller: Broadcom Inc. and subsidiaries
BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

NIC Driver/FW
-------------------
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31
expansion-rom-version:
bus-info: 0000:5e:00.1
supports-statistics: yes

Kernel
---------
5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019

(appears to be an issue on all kernel versions)

Environment Configuration
-----------------------------
active-backup bonding mode

(having the active backup up *might* potentially be the problem,
 but it might just be the device itself).

The exact same distro, kernel, applications and configuration
works fine with a different NIC (Broadcom 10g bnx2x).

There were quite a few total tpa_abort stats counts (1118473)
during the duration of a 2 minute iperf test.

Hoping to get more information from other users seeing the
same issue.

(active interface)

> cat ethtool-S-enp94s0f1d1 | grep abort
     [0]: tpa_aborts: 19775497
     [1]: tpa_aborts: 26758635
     [2]: tpa_aborts: 12008147
     [3]: tpa_aborts: 15829167
     [4]: tpa_aborts: 25099500
     [5]: tpa_aborts: 3292554
     [6]: tpa_aborts: 2863692
     [7]: tpa_aborts: 20224692

(backup interface)
> cat ethtool-S-enp94s0f0 | grep abort
     [0]: tpa_aborts: 3158584
     [1]: tpa_aborts: 1670319
     [2]: tpa_aborts: 1749371
     [3]: tpa_aborts: 1454301
     [4]: tpa_aborts: 123020
     [5]: tpa_aborts: 1403509
     [6]: tpa_aborts: 1298383
     [7]: tpa_aborts: 1858753

Netted out from previous capture, there were

*f0 = 2014 tpa_aborts
*d1 = 1118473 tpa_aborts
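
For anyone wanting to net out counters the same way, here is a rough Python sketch that sums a per-ring counter from two saved `ethtool -S` captures and reports the increase. It assumes the per-ring line format shown above ("[N]: tpa_aborts: VALUE"); the function names and the sample numbers are made up for illustration.

```python
import re


def sum_counter(ethtool_s_text, counter="tpa_aborts"):
    """Sum a per-ring counter from saved `ethtool -S` output.

    Matches lines such as '     [0]: tpa_aborts: 19775497'.
    """
    pattern = re.compile(
        r"^\s*\[\d+\]:\s*" + re.escape(counter) + r":\s*(\d+)\s*$",
        re.MULTILINE,
    )
    return sum(int(value) for value in pattern.findall(ethtool_s_text))


def counter_delta(before_text, after_text, counter="tpa_aborts"):
    """Increase in the summed counter between two captures."""
    return sum_counter(after_text, counter) - sum_counter(before_text, counter)


# Illustrative captures (made-up numbers, not taken from this report).
before = "     [0]: tpa_aborts: 100\n     [1]: tpa_aborts: 250\n"
after = "     [0]: tpa_aborts: 160\n     [1]: tpa_aborts: 300\n"
print(counter_delta(before, after))
```

The delta is only meaningful if both captures come from the same interface and bracket a single test run.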

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
importance: Undecided → Critical
Edwin Peer (espeer) wrote :

I am an engineer at Broadcom and have been assigned to investigate this issue. To that end, I have a few clarifying questions:

1a) What is the benchmark tool you are using and could you provide a link to where I can get it?

 b) What kind of network traffic is it sending?

2a) In what units are the data rate parameters "--rx_rate 10e6 --tx_rate 10e6" specified?

 b) What data rate are you attempting to send? The report notes that the platform can't be the issue at 1G, but are you attempting to utilize 10G?

3) Perhaps stating the obvious here, but has anybody looked into the warning?

"[WARNING] [UHD] Unable to set the thread priority. Performance may be negatively affected."

which is probably related to this error:

"EnvironmentError: OSError: error in pthread_setschedparam"

4 a) I am personally unfamiliar with the USRP x310, could you provide some more information about it? Googling for it seems to indicate it is some kind of software defined radio platform?

  b) Is there a way to get access to one to reproduce and diagnose this issue?
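
On the thread-priority warning in question 3: pthread_setschedparam generally fails for an unprivileged process that is not allowed to use a real-time scheduling policy. A quick way to check from the affected account is sketched below in Python. It only inspects the RLIMIT_RTPRIO resource limit, so treat it as a hint rather than a definitive test, and the function names are hypothetical. If memory serves, the UHD manual's remedy is to grant an rtprio limit in /etc/security/limits.conf, but that should be confirmed against the manual.

```python
import os
import resource


def realtime_priority_limit():
    """Return the (soft, hard) RLIMIT_RTPRIO pair, or None if unsupported."""
    if not hasattr(resource, "RLIMIT_RTPRIO"):
        return None  # platform does not expose this limit
    return resource.getrlimit(resource.RLIMIT_RTPRIO)


def can_request_realtime():
    """True if this process is likely allowed to use a real-time policy."""
    if hasattr(os, "geteuid") and os.geteuid() == 0:
        return True  # root normally bypasses the limit
    limit = realtime_priority_limit()
    if limit is None:
        return False
    soft, _hard = limit
    return soft == resource.RLIM_INFINITY or soft > 0


print("RLIMIT_RTPRIO:", realtime_priority_limit())
print("likely able to set real-time priority:", can_request_realtime())
```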

Edwin Peer (espeer) wrote :

Could you also please dump the ethtool statistics for the NIC?

Hello, Edwin,

We have two separate users/customers filing reports, and I can answer for
one of them. I'll ask the original poster separately as well to reply.

With respect to one of these situations, this is the following system:

Dell PowerEdge R440/0XP8V5, BIOS 2.2.11 06/14/2019

Note that a similar system does not have any issues:

Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.3.4 11/08/2016

So the NIC in the "bad" environment is:

BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
Product Name: Broadcom Adv. Dual 10G SFP+ Ethernet

The NIC in the "good" environment is:

Broadcom Inc. and subsidiaries NetXtreme II BCM57810
10 Gigabit Ethernet [14e4:1006]
Product Name: QLogic 57810 10 Gigabit Ethernet

I'll have to scrub some files and see what I can attach,
apologies, I'll have it here by tomorrow.

Unfortunately, we don't have an easy reproducer.
A single iperf and netperf test (both UDP and TCP) showed identical
results from both "good" and "bad" environments.

What we have is an identical kernel, network configuration and
stack with the "bad" system showing double, triple the latency
to the systems from a remote server. I'll have more information
for you shortly here regarding the exact k8 cmd.

Note that iperf was identical whereas netperf and mtr showed
differences (so it's possibly sporadic as well, not continuous).

1. iperf tcp test
----------------------
GoodSystem.........9.84 Gbits/sec
BadSystem1............8.37 Gbits/sec
BadSystem2...........9.85 Gbits/sec

2. iperf udp test
----------------------
GoodSystem.........1.05 Mbits/sec
BadSystem2...........1.05 Mbits/sec

3. mtr ping test
-----------------------
GoodSystem..........0.0% Loss; 0.2 Avg; 0.1 Best, 0.9 Worst, 0.1 StdDev
BadSystem2...........11.7% Loss; 0.1 Avg; 0.1 Best, 0.2 Worst, 0.0 StdDev

4. netperf tcp_rr 1/1 bytes
------------------------------------
GoodSystem..........17921.83 t/sec
BadSystem1.............13912.45 t/sec
BadSystem2............

5. netperf tcp_rr 64/64 bytes
------------------------------------
GoodSystem..........16987.48 t/sec
BadSystem1.............13355.93 t/sec
BadSystem2............

6. netperf tcp_rr 128/8192 bytes
-----------------------------------
GoodSystem..........2396.45 t/sec
BadSystem1.............1678.54 t/sec
BadSystem2............

diarmuid (diarmuidcire) wrote :

Here is the Ettus benchmark tool
https://kb.ettus.com/Verifying_the_Operation_of_the_USRP_Using_UHD_and_GNU_Radio

You would need an Ettus device to run those tests.

I can't test the affected node now as it is in production, unfortunately.

Edwin Peer (espeer) wrote :

> With respect to one of these situations, this is the following system:

> Dell PowerEdge R440/0XP8V5, BIOS 2.2.11 06/14/2019
>
> Note that a similar system does not have any issues:
>
> Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.3.4 11/08/2016
>
> So the NIC in the "bad" environment is:
>
> BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
> Product Name: Broadcom Adv. Dual 10G SFP+ Ethernet
>
> The NIC in the "good" environment is:
>
> Broadcom Inc. and subsidiaries NetXtreme II BCM57810
> 10 Gigabit Ethernet [14e4:1006]
> Product Name: QLogic 57810 10 Gigabit Ethernet

There is more than one variable at play here. Does the problem follow the NIC if you swap the NICs between systems? Are OS / kernel and driver versions the same on both systems?

Edwin Peer (espeer) wrote :

> 3. mtr ping test
> -----------------------
> GoodSystem..........0.0% Loss; 0.2 Avg; 0.1 Best, 0.9 Worst, 0.1 StdDev
> BadSystem2...........11.7% Loss; 0.1 Avg; 0.1 Best, 0.2 Worst, 0.0 StdDev

The mtr packet loss is an interesting result. What mtr options did you use? Is this a UDP or ICMP test?

> The mtr packet loss is an interesting result. What mtr options did you use? Is this a UDP or ICMP test?

The mtr command was:

mtr --no-dns --report --report-cycles 60 $IP_ADDR

so ICMP was going out.

> There is more than one variable at play here.
> Does the problem follow the NIC if you swap the
> NICs between systems? Are OS / kernel and driver
> versions the same on both systems?

Unfortunately, I've not been able to get them to try
permutations or switches, as yet, as this is still a
production system/environment.

I'll try and obtain more information about it.

Thanks very much for helping on this, Edwin! Please let me
know if there's anything specific you need.

I'm asking them to disable any IPv6, LLDP traffic in their environment,
and retest and collect information again.

Also, I'd like to disable tpa, would this be at all useful:

modprobe bnx disable_tpa=1

??

> NICs between systems? Are OS / kernel and driver
> versions the same on both systems?

Yes, identical distro release, kernel, and most of the software
stack (I have not obtained and examined the full sw stack).

Configuration of networking settings is also the same.

Edwin Peer (espeer) wrote :

I don't think bnxt_en exposes the disable_tpa parameter. Be that as it may, I think the tpa_aborts may be a red herring. TPA aggregates TCP flows and you are seeing the issue with ICMP.

In which direction(s) of traffic flow do you see the losses?

Hey Edwin, sorry, I didn't see your last question.

I'll try and confirm but I've seen loss in both
directions but it's not clear whether that's significant
enough or not yet.

e.g., TCP traffic is retransmitted, so it could be segments
lost while outgoing or acks lost incoming.

4407 retransmitted TCP segments
130 TCP timeouts

in stats collected about 5 mins apart - which isn't
sufficient a sample size, we're trying to get a new
collection of stats, logs using the netperf TCP_RR test.

In our case, note, we're more concerned (and have more solid
data) of latency issues than dropped packets (which I expect
some of with heavy network testing).

For example, the netperf TCP_RR transaction rate (the inverse of
round-trip latency) is about 70-78% of that of the older systems for
1/1 request/response byte sizes as well as 64/64, 100/200 and
128/8192 sizes.
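
To make the comparison concrete, the Python sketch below recomputes the bad/good ratios from the netperf TCP_RR figures posted earlier in this thread and converts each rate to a mean time per transaction. The conversion assumes one outstanding request at a time, so that 1/rate approximates the round-trip time. That is my assumption about how the tests were run.

```python
# Transactions per second from the netperf TCP_RR results earlier in this thread.
RESULTS = {
    "1/1": (17921.83, 13912.45),  # (GoodSystem, BadSystem1)
    "64/64": (16987.48, 13355.93),
    "128/8192": (2396.45, 1678.54),
}


def mean_round_trip_us(transactions_per_sec):
    """Mean time per request/response pair, in microseconds.

    Assumes a single outstanding transaction at a time.
    """
    return 1e6 / transactions_per_sec


for sizes, (good, bad) in RESULTS.items():
    print(
        f"{sizes:>9}: bad/good rate = {bad / good:.1%}, "
        f"round trip {mean_round_trip_us(good):.1f} us -> "
        f"{mean_round_trip_us(bad):.1f} us"
    )
```

The ratios come out at roughly 70% to 79%, in line with the range quoted above.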

I'll update here as soon as we have more data from the production
environment.

Hello Edwin,

Here is more information on the issue we are seeing wrt dropped
packets and other connectivity issues with this NIC.

The problem is *only* seen when the second port on the NIC is
chosen as the active interface of an active-backup configuration.

So on the "bad" system with the interfaces:

enp94s0f0 -> when chosen as active, all OK
enp94s0f1d1 -> when chosen as active, not OK

I'll see if the reporters can confirm that on the "good" systems,
there was no problem when the second interface is active.

The second port on the NIC definitely works as the active
interface in an active-backup bonding configuration on the
other NICs.

At the moment, it's only this particular NIC that is seeing
this problem that we know of.

We have narrowed it down to a flaw in a specific configuration setting
on this NIC, so we're comparing the good and bad configurations now.

Primary port: enp94s0f0
Secondary port: enp94s0f1d1

A] Good config for fault-tolerance (active-backup) bonding mode:
--------------------------------------------------------------
Primary port = active interface; Secondary port = backup

B] Bad config for fault-tolerance (active-backup) bonding mode:
--------------------------------------------------------------
Primary port = backup interface; Secondary port = active

We are consistently able to reproduce a drop rate difference
with UDP pkts, for the above good/bad cases:

Good Case UDP MTR Test Result
---------------------------------
mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST
Start: 2020-02-10T10:14:01+0000
HOST: hostname Loss% Snt Last Avg Best Wrst StDev
  1.|-- nn.nn.nnn.nnn 0.0% 60 0.3 0.2 0.2 0.3 0.0

Bad Case UDP MTR Test Result
-------------------------------
mtr --no-dns --report --report-cycles 60 --udp -s 1428 $DEST
Start: 2020-02-10T14:10:52+0000
HOST: hostname Loss% Snt Last Avg Best Wrst StDev
  1.|-- nn.nn.nnn.nnn 8.3% 60 0.3 0.3 0.2 0.4 0.0
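
Since each mtr report only sends 60 probes, the percentages correspond to a handful of packets. The Python sketch below parses a hop line from `mtr --report` output in the column layout shown above and converts the loss percentage back to a packet count. The function name is hypothetical.

```python
def parse_mtr_hop(line):
    """Parse one hop line of `mtr --report` output into a dict."""
    fields = line.split()
    # fields: ['1.|--', host, 'Loss%', Snt, Last, Avg, Best, Wrst, StDev]
    loss_pct = float(fields[2].rstrip("%"))
    sent = int(fields[3])
    return {
        "host": fields[1],
        "loss_pct": loss_pct,
        "sent": sent,
        # mtr rounds the percentage, so round back to whole packets.
        "lost": round(loss_pct * sent / 100),
    }


good = parse_mtr_hop("  1.|-- nn.nn.nnn.nnn 0.0% 60 0.3 0.2 0.2 0.3 0.0")
bad = parse_mtr_hop("  1.|-- nn.nn.nnn.nnn 8.3% 60 0.3 0.3 0.2 0.4 0.0")
print(good["lost"], "lost in good case;", bad["lost"], "lost in bad case")
```

So the bad case is 5 lost probes out of 60 against none in the good case. Longer runs (a larger --report-cycles value) would give a steadier estimate.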

Edwin Peer (espeer) wrote :

Hi Nivedita,

I have been away on PTO the last week and am picking this up again now. Please could you post the full bonding configuration?

Regards,
Edwin Peer

"Bad" Configuration for active-backup mode:
--------------------------------------------------------

----
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp94s0f1d1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: enp94s0f1d1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 4c:d9:8f:48:08:da
Slave queue ID: 0

Slave Interface: enp94s0f0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 2
Permanent HW addr: 4c:d9:8f:48:08:d9
Slave queue ID: 0

---
$ cat uname-rv
5.3.0-28-generic #30~18.04.1-Ubuntu SMP Fri Jan 17 06:14:09 UTC 2020

---
Scrubbed /etc/netplan/50-cloud-init.yaml:
network:
    bonds:
        bond0:
            addresses:
            - 0.0.235.177/25
            gateway4: 0.0.235.129
            interfaces:
            - enp94s0f0
            - enp94s0f1d1
            macaddress: 00:00:00:48:08:00
            mtu: 9000
            nameservers:
                addresses:
                - 0.0.235.171
                - 0.0.235.172
                search:
                - maas
            parameters:
                down-delay: 0
                gratuitious-arp: 1
                mii-monitor-interval: 100
                mode: active-backup
                transmit-hash-policy: layer2
                up-delay: 0
    ethernets:
        eno1:
            match:
                macaddress: 00:00:00:76:6e:ca
            mtu: 1500
            set-name: eno1
        eno2:
            match:
                macaddress: 00:00:00:76:6e:cb
            mtu: 1500
            set-name: eno2
        enp94s0f0:
            match:
                macaddress: 00:00:00:48:08:00
            mtu: 9000
            set-name: enp94s0f0
        enp94s0f1d1:
            match:
                macaddress: 00:00:00:48:08:da
            mtu: 9000
            set-name: enp94s0f1d1
    version: 2

Good System/Good NIC (all configurations work) Comparison
------------------------------------------------------------
NIC: NetXtreme II BCM57000 10 Gigabit Ethernet QLogic 57000
System: Dell
Kernel: 5.0.0-25-generic #26~18.04.1-Ubuntu

/proc/net/bonding/bond0
-----------------------
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp5s0f1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: enp5s0f1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:00:00:00:73:e2
Slave queue ID: 0

Slave Interface: enp5s0f0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:00:00:00:73:e0
Slave queue ID: 0

/etc/netplan/50-cloud-init.yaml
--------------------------------
network:
    bonds:
        bond0:
            addresses:
            - 00.00.235.182/25
            gateway4: 00.00.235.129
            interfaces:
            - enp5s0f0
            - enp5s0f1
            macaddress: 00:00:00:00:73:e0
            mtu: 9000
            nameservers:
                addresses:
                - 00.00.235.172
                - 00.00.235.171
                search:
                - maas
            parameters:
                down-delay: 0
                gratuitious-arp: 1
                mii-monitor-interval: 100
                mode: active-backup
                transmit-hash-policy: layer2
                up-delay: 0
    ethernets:
        ...(snip)..
        enp5s0f0:
            match:
                macaddress: 00:00:00:00:73:e0
            mtu: 9000
            set-name: enp5s0f0
        enp5s0f1:
            match:
                macaddress: 00:00:00:00:73:e2
            mtu: 9000
            set-name: enp5s0f1
    version: 2

"Bad" System/NIC:

NIC: BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
System: Dell
Kernel: 5.3.0-28-generic #30~18.04.1-Ubuntu

(Note, this issue has been seen on prior kernels as well, upgraded
 to latest to see if various problems were resolved)

Attaching stats/config files from nics from this system (seeing issue).

ethtool-enp94s0f0
----------------------
Settings for enp94s0f0:
 Supported ports: [ FIBRE ]
 Supported link modes: 10000baseT/Full
 Supported pause frame use: Symmetric Receive-only
 Supports auto-negotiation: Yes
 Supported FEC modes: Not reported
 Advertised link modes: Not reported
 Advertised pause frame use: No
 Advertised auto-negotiation: No
 Advertised FEC modes: Not reported
 Speed: 10000Mb/s
 Duplex: Full
 Port: FIBRE
 PHYAD: 1
 Transceiver: internal
 Auto-negotiation: off
 Supports Wake-on: g
 Wake-on: d
 Current message level: 0x00000000 (0)

 Link detected: yes

ethtool-i-enp94s0f0
--------------------------
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31
expansion-rom-version:
bus-info: 0000:5e:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

ethtool-c-enp94s0f0
---------------------
Coalesce parameters for enp94s0f0:
Adaptive RX: off TX: off
stats-block-usecs: 1000000
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 10
rx-frames: 15
rx-usecs-irq: 1
rx-frames-irq: 1

tx-usecs: 28
tx-frames: 30
tx-usecs-irq: 2
tx-frames-irq: 2

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

ethtool-g-enp94s0f0
------------------------
Ring parameters for enp94s0f0:
Pre-set maximums:
RX: 2047
RX Mini: 0
RX Jumbo: 8191
TX: 2047
Current hardware settings:
RX: 511
RX Mini: 0
RX Jumbo: 2044
TX: 511

ethtool-k-enp94s0f0
---------------------
Features for enp94s0f0:
rx-checksumming: on
tx-checksumming: on
 tx-checksum-ipv4: on
 tx-checksum-ip-generic: off [fixed]
 tx-checksum-ipv6: on
 tx-checksum-fcoe-crc: off [fixed]
 tx-checksum-sctp: off [fixed]
scatter-gather: on
 tx-scatter-gather: on
 tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
 tx-tcp-segmentation: on
 tx-tcp-ecn-segmentation: off [fixed]
 tx-tcp-mangleid-segmentation: off
 tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: on
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: on
tls-hw-record: off [fixed]

Edwin, let me know if you can get in touch with me via the contact email
on my Launchpad page. Thanks for all the help!

Additional observations.

MAAS is being used to deploy the system and configure
the bond interface and settings.

MAAS allows you to specify which is the primary interface, with
the other being the backup, for the active-backup bonding mode.
However, it does not appear to be working: it's not passing along
a primary primitive, for instance, in the netplan yaml or otherwise
resulting in this being honored (still need to confirm).

MAAS allows you to enter a mac address for the bond interface,
but if not supplied, by default it will use the mac address of
the "primary" interface, as configured.

MAAS then populates the /etc/netplan/50-cloud-init.yaml, including
a macaddr= line with the default.

netplan then passes that along to systemd-networkd.

The bonding driver, however, will use as the active interface
whichever interface is first attached to the bond (i.e., whichever
completes getting attached to the bond interface first) in the
absence of a primary= directive.

The bonding driver will, however, use the mac addr supplied
as an override.

So let's say the active interface was configured in MAAS to be
f0, and its mac is used as the mac address of the bond,
but f1 (the second port of the NIC) actually gets attached
first to the bond and is used as the active interface by the
bond.

We have a situation where f0 = backup, f1 = active, and bond0
is using the mac of f0. While this should work, there is a
potential for problems depending on the circumstances.

It's likely this has nothing to do with our current issue, but
here for completeness. Will see if we can test/confirm.
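
To spot the scenario described above on a running system, a minimal Python sketch follows. It parses the text of /proc/net/bonding/bond0 and notes when no primary is set, and when the bond's MAC is not the permanent address of the currently active slave. The function names are hypothetical, and the bond MAC passed in the example is an assumption (f0's permanent address, as the scenario supposes), because the netplan file above was scrubbed. Neither note is necessarily a fault.

```python
def parse_bonding(text):
    """Parse /proc/net/bonding/<bond> text into bond-level and per-slave dicts."""
    bond, slaves, current = {}, {}, None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without a 'key: value' pair
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = slaves.setdefault(value, {})
        elif current is None:
            bond[key] = value
        else:
            current[key] = value
    return bond, slaves


def active_backup_notes(bond, slaves, bond_mac=None):
    """Return observations about the bond; none of them is proof of a fault."""
    notes = []
    if bond.get("Primary Slave") == "None":
        notes.append("no primary set; active slave depends on attach order")
    active = bond.get("Currently Active Slave")
    if bond_mac and active in slaves:
        permanent = slaves[active].get("Permanent HW addr", "")
        if permanent.lower() != bond_mac.lower():
            notes.append(
                f"bond MAC {bond_mac} is not the permanent address "
                f"of active slave {active}"
            )
    return notes


# Trimmed from the "bad" /proc/net/bonding/bond0 output in this thread.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp94s0f1d1

Slave Interface: enp94s0f1d1
MII Status: up
Permanent HW addr: 4c:d9:8f:48:08:da

Slave Interface: enp94s0f0
MII Status: up
Permanent HW addr: 4c:d9:8f:48:08:d9
"""

bond, slaves = parse_bonding(SAMPLE)
# Assumed bond MAC: f0's permanent address (hypothetical, see note above).
for note in active_backup_notes(bond, slaves, bond_mac="4c:d9:8f:48:08:d9"):
    print(note)
```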

Edwin Peer (espeer) wrote :

I have tried, unsuccessfully, to reproduce this issue internally. Details of my setup below.

1) I have a pair of Dell R210 servers racked (u072 and u073 below), each with a BCM57416 installed:

root@u072:~# lspci | grep BCM57416
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller (rev 01)

2) I've matched the firmware version to one that Nivedita reported in a bad system:

root@u072:~# ethtool -i enp1s0f0np0
driver: bnxt_en
version: 1.10.0
firmware-version: 214.0.253.1/pkg 21.40.25.31
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

3) Matched Ubuntu release and kernel version:

root@u072:~# lsb_release -dr
Description: Ubuntu 18.04.3 LTS
Release: 18.04

root@u072:~# uname -a
Linux u072 5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

4) Configured the interface into an active-backup bond:

root@u072:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: enp1s0f1np1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: enp1s0f1np1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:0a:f7:a7:10:61
Slave queue ID: 0

Slave Interface: enp1s0f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:0a:f7:a7:10:60
Slave queue ID: 0

5) Run the provided mtr and netperf test cases with the 1st port selected as active:

root@u072:~# ip l set enp1s0f1np1 down
root@u072:~# ip l set enp1s0f1np1 up
root@u072:~# cat /proc/net/bonding/bond0 | grep Active
Currently Active Slave: enp1s0f0np0

a) initiated on u072:

root@u072:~# mtr --no-dns --report --report-cycles 60 192.168.1.2
Start: 2020-02-13T20:48:01+0000
HOST: u072 Loss% Snt Last Avg Best Wrst StDev
  1.|-- 192.168.1.2 0.0% 60 0.2 0.2 0.2 0.2 0.0

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 1,1
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec

16384 131072 1 1 10.00 29040.91
16384 87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 64,64
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.1.2 () port 0 AF_INET : demo : first burst 0
Local /Remote
Socket Size Request Resp. Elapsed Trans.
Send Recv Size Size Time Rate
bytes Bytes bytes bytes secs. per sec

16384 131072 64 64 10.00 28633.36
16384 87380

root@u072:~# netperf -t TCP_RR -H 192.168.1.2 -- -r 128,8192
MIGR...


Edwin,

Do you happen to notice any IPv6 or LLDP or other link-local traffic
on the interfaces? (including backup interface).

The MTR loss % is purely a capture of their packets xmitted
and responses received, so for that UDP MTR test, this is saying
that UDP packets were lost, somewhere.

The NIC does not have any drops showing via ethtool -S
stats but I'm hunting down which are the right pair of before/afters.

Other than the tpa_abort counts, there were no errors that I saw.
I can't tell what the tpa_abort means for the frame - is it purely
a failure only to coalesce, or does it end up dropping packets at
some point in that functionality? I'm assuming not, as whatever the
reason, those would be counted as drops, I hope, and printed in
the interface stats.

I'll attach all the stats here once I get them sorted out, I thought
I had a clean diff of before and after from the tester, but after
looking through, I don't think the file I have is from before/after
the mtr test, as there was negligible UDP traffic. I'll try and
get clarification from the reporter.

Note that when the provision of primary= is used to configure
which interface is primary, and when the primary port is used
as the active interface for the bond, no problems are seen (and
that works deterministically to set the correct active interface).
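
To confirm on a given host that the configured primary really is the active interface, the bonding driver's sysfs attributes can be read directly. Below is a minimal Python sketch. The attribute names (mode, primary, active_slave under /sys/class/net/<bond>/bonding/) are the ones I know from the kernel bonding documentation. The function names and the default bond name are assumptions.

```python
from pathlib import Path


def read_bond_state(bond="bond0", sysfs_root="/sys/class/net"):
    """Read mode, primary and active_slave for a bond from sysfs.

    Attributes that are not present are returned as None.
    """
    base = Path(sysfs_root) / bond / "bonding"
    state = {}
    for name in ("mode", "primary", "active_slave"):
        path = base / name
        state[name] = path.read_text().strip() if path.exists() else None
    return state


def primary_is_active(state):
    """True only if a primary is configured and it is the current active slave."""
    return bool(state["primary"]) and state["primary"] == state["active_slave"]


if __name__ == "__main__":
    state = read_bond_state()
    print(state)
    print("primary is active:", primary_is_active(state))
```

The sysfs_root parameter exists only so the function can be pointed at a copy of the directory tree for testing.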

Edwin Peer (espeer) wrote :

The tpa_aborts shouldn't be a concern. They merely indicate that a TCP flow could not be aggregated. That could have a performance impact, of course, but if packets were being lost as a result, that should manifest as counted drops somewhere.

Importantly, the tpa_aborts only apply to TCP traffic, but you see the problem for ICMP and UDP too.

Note, the tpa_aborts also appear to be evident with the primary as the active interface, while things are working as expected. A difference in the magnitude of tpa_aborts from one test run to another may be a clue about something else that's happening though, but I'm not sure that we are comparing apples to apples with respect to the ethtool -S dumps posted thus far (when were they captured relative to the test runs, which interface was active at the time, etc.?).

Edwin Peer (espeer) wrote :

Regarding your question about LLDP and IPv6...

The default Ubuntu 18.04.3 configuration has an IPv6 enabled kernel, but the interface only has the default link local address configured. I've seen it do router solicitation on link state changes and periodically thereafter. I think I recall seeing somewhere above that you were going to try without IPv6. Have you seen different behavior with IPv6 disabled?

I don't expect to see LLDP because I have the two NICs wired directly to each other, back to back. I'm not running any LLDP daemon on the host.
