OVN provider network type vlan packets cannot go outside the bond on Intel E810-XXV card

Bug #2008781 reported by Bartosz Woronicz
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

Ubuntu 20.04.5 LTS
ubuntu@compute-09:~$ uname -a
Linux compute-09 5.4.0-139-generic #156-Ubuntu SMP Fri Jan 20 17:27:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@compute-09:~$ sudo update-pciids
ubuntu@compute-09:~$ lspci |grep Intel|grep -i Ether
31:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
31:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
ca:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
ca:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)

The test instance, which has floating IP 10.99.0.213 on the provider network, cannot reach the provider network gateway (10.99.0.254). The instance was created with:
openstack server create --key-name ubuntu-keypair --image auto-sync/ubuntu-jammy-22.04-amd64-server-20230210-disk1.img --flavor m1.small --net provider1-private-net ubuntu-provider1

ubuntu@compute-05:~$ sudo -E ip netns exec ovnmeta-fcd1b354-6f41-42dc-ae73-87df28856ee5 ssh ubuntu@192.168.100.123
ubuntu@ubuntu-provider1:~$ ping 10.99.0.254
PING 10.99.0.254 (10.99.0.254) 56(84) bytes of data.
^C
--- 10.99.0.254 ping statistics ---
419 packets transmitted, 0 received, 100% packet loss, time 428035ms

I identified the compute node through which the external traffic leaves (compute-09).
There I see ARP requests for the gateway on bond1, but no replies:
compute-09:~$ sudo tcpdump -vteni bond1 '(vlan 300)'
tcpdump: listening on bond1, link-type EN10MB (Ethernet), capture size 262144 bytes
fa:16:3e:ab:87:ad > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 300, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.99.0.254 tell 10.99.0.88, length 28
fa:16:3e:ab:87:ad > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 300, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.99.0.254 tell 10.99.0.88, length 28
fa:16:3e:ab:87:ad > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 300, p 0, ethertype ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 10.99.0.254 tell 10.99.0.88, length 28
To reproduce, you can leave a ping to 10.99.0.254 running indefinitely.
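
To make the "requests but no replies" observation easier to check on a longer capture, here is a minimal sketch that tallies ARP requests and replies per VLAN from saved `tcpdump -vten` text output. The function name `tally_arp` and the regular expression are my own assumptions based on the line format shown above; they are not part of tcpdump.

```python
import re
from collections import Counter

# Picks the VLAN ID and the ARP operation out of one line of
# `tcpdump -vten` output (format assumed from the capture in this report).
ARP_LINE = re.compile(
    r"vlan (?P<vlan>\d+),.*ethertype ARP.*?(?P<kind>Request|Reply)"
)


def tally_arp(lines):
    """Count ARP requests and replies per VLAN ID in tcpdump text output."""
    counts = Counter()
    for line in lines:
        match = ARP_LINE.search(line)
        if match:
            counts[(int(match.group("vlan")), match.group("kind"))] += 1
    return counts


# The three request lines captured on bond1 in this report.
sample = [
    "fa:16:3e:ab:87:ad > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), "
    "length 46: vlan 300, p 0, ethertype ARP, Ethernet (len 6), "
    "IPv4 (len 4), Request who-has 10.99.0.254 tell 10.99.0.88, length 28"
] * 3

counts = tally_arp(sample)
print("requests:", counts[(300, "Request")], "replies:", counts[(300, "Reply")])
```

A non-zero request count with a zero reply count on VLAN 300 matches the symptom described here.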

The TX error count grows on bond1 and on its member interface ens2f0 (the one that happens to carry the traffic):
ubuntu@compute-09:~$ sudo ethtool -S ens2f0|grep error
     tx_errors: 12
     tx_errors.nic: 0
     rx_length_errors.nic: 0
     rx_crc_errors.nic: 0
ubuntu@compute-09:~$ ifconfig ens2f0
ens2f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST> mtu 9000
        ether b4:83:51:00:83:d1 txqueuelen 1000 (Ethernet)
        RX packets 53784 bytes 22064970 (22.0 MB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 52163 bytes 18393142 (18.3 MB)
        TX errors 12 dropped 0 overruns 0 carrier 0 collisions 0
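
To show that a counter is actually growing rather than just non-zero, two `ethtool -S` snapshots can be compared. The following is a rough sketch; the helper names are hypothetical, the first snapshot is the output quoted above, and the second snapshot (tx_errors: 15) is an invented value used only for illustration.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S <iface>` output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(": ")
        # Skip headers such as "NIC statistics:" and non-numeric values.
        if sep and value.strip().lstrip("-").isdigit():
            stats[name] = int(value)
    return stats


def grown_error_counters(before, after):
    """Return {name: delta} for error counters that increased."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if "error" in name and after[name] > before.get(name, 0)
    }


# Snapshot taken from this report.
before_text = """\
     tx_errors: 12
     tx_errors.nic: 0
     rx_length_errors.nic: 0
     rx_crc_errors.nic: 0
"""
# HYPOTHETICAL later snapshot, for illustration only.
after_text = before_text.replace("tx_errors: 12", "tx_errors: 15")

delta = grown_error_counters(
    parse_ethtool_stats(before_text), parse_ethtool_stats(after_text)
)
print(delta)
```

Taking the snapshots a few seconds apart while the ping is running should show whether tx_errors tracks the dropped ARP requests.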

If I create a VLAN interface directly on bond1, I can ping the gateway without any problem,
which gives us
WORKAROUND 1: set the network type to flat and carry the traffic over VLAN interfaces created on the computes, using those as the physnet device
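
For reference, a VLAN interface on top of the bond can be created with standard iproute2 commands. The sketch below only builds and prints the commands by default; the function names and the `bond1.300` naming scheme are my assumptions, and any IP address has to be supplied by the operator since the report does not say which one was used.

```python
import subprocess


def vlan_interface_commands(parent, vlan_id, address=None):
    """Build the iproute2 commands for a VLAN sub-interface on `parent`."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be between 1 and 4094")
    name = f"{parent}.{vlan_id}"  # naming scheme is an assumption
    commands = [
        ["ip", "link", "add", "link", parent, "name", name,
         "type", "vlan", "id", str(vlan_id)],
        ["ip", "link", "set", "dev", name, "up"],
    ]
    if address:  # optional, e.g. an unused address in the provider subnet
        commands.append(["ip", "addr", "add", address, "dev", name])
    return commands


def apply(commands, dry_run=True):
    """Print the commands, or run them (needs root) when dry_run is False."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


commands = vlan_interface_commands("bond1", 300)
apply(commands)
```

With the default dry run this prints the two `ip link` commands for bond1 and VLAN 300 without changing the system.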

Another thing I tried was installing the HWE kernel:

ubuntu@compute-09:~$ uname -a
Linux compute-09 5.15.0-60-generic #66~20.04.1-Ubuntu SMP Wed Jan 25 09:41:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Fortunately, after the reboot the traffic was still leaving via compute-09,
and the HWE kernel fixed the issue,
so we have WORKAROUND 2: run the HWE kernel (5.15.0-60)
ubuntu@ubuntu-provider2:~$ ping 10.99.0.254
PING 10.99.0.254 (10.99.0.254) 56(84) bytes of data.
64 bytes from 10.99.0.254: icmp_seq=1 ttl=63 time=2.15 ms
64 bytes from 10.99.0.254: icmp_seq=2 ttl=63 time=0.896 ms
64 bytes from 10.99.0.254: icmp_seq=3 ttl=63 time=1.12 ms
^C
ubuntu@infra-1:~$ ping 10.99.0.213
PING 10.99.0.213 (10.99.0.213) 56(84) bytes of data.
64 bytes from 10.99.0.213: icmp_seq=1 ttl=62 time=5.12 ms
64 bytes from 10.99.0.213: icmp_seq=2 ttl=62 time=2.17 ms
64 bytes from 10.99.0.213: icmp_seq=3 ttl=62 time=0.948 ms
64 bytes from 10.99.0.213: icmp_seq=4 ttl=62 time=1.00 ms
64 bytes from 10.99.0.213: icmp_seq=5 ttl=62 time=0.891 ms
64 bytes from 10.99.0.213: icmp_seq=6 ttl=62 time=1.05 ms

Now I can ping in both directions.

However, I am afraid that we may hit the same issue that was seen on Jammy with these cards at boot, since it occurs randomly with the same kernel version, 5.15.0-60.
Here is the bug I am referring to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2004262
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Feb 27 13:33 seq
 crw-rw---- 1 root audio 116, 33 Feb 27 13:33 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu27.25
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
CasperMD5CheckResult: skip
DistroRelease: Ubuntu 20.04
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 004: ID 1604:10c0 Tascam
 Bus 001 Device 003: ID 1604:10c0 Tascam
 Bus 001 Device 002: ID 1604:10c0 Tascam
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Lsusb-t:
 /: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/10p, 5000M
 /: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/16p, 480M
     |__ Port 14: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
         |__ Port 1: Dev 3, If 0, Class=Hub, Driver=hub/4p, 480M
         |__ Port 4: Dev 4, If 0, Class=Hub, Driver=hub/4p, 480M
MachineType: Dell Inc. PowerEdge R650
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
Package: linux (not installed)
PciMultimedia:

ProcFB: 0 mgag200drmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.4.0-139-generic root=UUID=70799655-ec36-47e0-a10b-d647a84ac9be ro
ProcVersionSignature: Ubuntu 5.4.0-139.156-generic 5.4.224
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-139-generic N/A
 linux-backports-modules-5.4.0-139-generic N/A
 linux-firmware 1.187.36
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: focal uec-images
Uname: Linux 5.4.0-139-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 09/14/2022
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.8.2
dmi.board.name: 0PJ7YJ
dmi.board.vendor: Dell Inc.
dmi.board.version: A01
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.8.2:bd09/14/2022:svnDellInc.:pnPowerEdgeR650:pvr:rvnDellInc.:rn0PJ7YJ:rvrA01:cvnDellInc.:ct23:cvr:
dmi.product.family: PowerEdge
dmi.product.name: PowerEdge R650
dmi.product.sku: SKU=0912;ModelName=PowerEdge R650
dmi.sys.vendor: Dell Inc.

description: updated
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2008781

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Bartosz Woronicz (mastier1) wrote :

I also checked the OVN trace and it is correct: the packets should reach the destination if the driver were behaving correctly.

ubuntu@juju-7639a4-3-lxd-6:~$ network="provider1-net-external" inport="4679fbb8-3d4d-4dcd-b986-4ec4c0fb9000" ip4_src="10.99.0.88" ip4_dst="10.99.0.254" eth_src="fa:16:3e:ab:87:ad"
ubuntu@juju-7639a4-3-lxd-6:~$ sudo ovn-trace "$network" "inport == \"$inport\" && ip4.src == "$ip4_src" && ip4.dst == "$ip4_dst" && eth.src == "$eth_src" && ip.ttl == 64 && icmp4.type == 8"
# icmp,reg14=0x3,vlan_tci=0x0000,dl_src=fa:16:3e:ab:87:ad,dl_dst=00:00:00:00:00:00,nw_src=10.99.0.88,nw_dst=10.99.0.254,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0

ingress(dp="provider1-net-external", inport="4679fb")
---------------------------------------------------
 0. ls_in_port_sec_l2 (northd.c:5516): inport == "4679fb", priority 50, uuid 03e0a90b
 next;
 6. ls_in_pre_lb (northd.c:5663): ip && inport == "4679fb", priority 110, uuid f5e890b2
 next;
24. ls_in_l2_lkup (northd.c:7577): 1, priority 0, uuid 3aba4e5b
 outport = get_fdb(eth.dst);
 next;
25. ls_in_l2_unknown (northd.c:7581): outport == "none", priority 50, uuid 0bf357af
 outport = "_MC_unknown";
 output;

multicast(dp="provider1-net-external", mcgroup="_MC_unknown")
-----------------------------------------------------------

 egress(dp="provider1-net-external", inport="4679fb", outport="provnet-8d6ece")
 ----------------------------------------------------------------------------
      0. ls_out_pre_lb (northd.c:5666): ip && outport == "provnet-8d6ece", priority 110, uuid ddd2f5b5
         next;
      9. ls_out_port_sec_l2 (northd.c:5613): outport == "provnet-8d6ece", priority 50, uuid 966e0b90
         output;
         /* output to "provnet-8d6ece", type "localnet" */
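
When the trace is long, the final delivery points can be pulled out of the text automatically. This is a small sketch that assumes the `/* output to "...", type "..." */` comment format seen in the trace above; the function name `final_outputs` is hypothetical.

```python
import re

# Matches the delivery comment that ovn-trace prints for each output action
# (format assumed from the trace in this report).
OUTPUT_RE = re.compile(
    r'/\* output to "(?P<port>[^"]+)", type "(?P<type>[^"]*)" \*/'
)


def final_outputs(trace_text):
    """Return (port, type) pairs for every output recorded in the trace."""
    return [
        (match.group("port"), match.group("type"))
        for match in OUTPUT_RE.finditer(trace_text)
    ]


# Tail of the egress pipeline from the trace in this report.
trace_tail = '''
      9. ls_out_port_sec_l2 (northd.c:5613): outport == "provnet-8d6ece", priority 50, uuid 966e0b90
         output;
         /* output to "provnet-8d6ece", type "localnet" */
'''

outputs = final_outputs(trace_tail)
print(outputs)
```

An output to a port of type "localnet" means the logical pipeline hands the packet to the physical network, which is consistent with the problem being below OVN.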

Bartosz Woronicz (mastier1) attached the following apport information files:
AudioDevicesInUse.txt, CRDA.txt, CurrentDmesg.txt, Lspci.txt, Lspci-vt.txt, Lsusb-v.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcEnviron.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, WifiSyslog.txt, acpidump.txt

tags: added: apport-collected focal uec-images
description: updated

Bartosz Woronicz (mastier1) wrote :
Changed in linux (Ubuntu):
status: Incomplete → New
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed