instance tap port has dead vlan when parent port for trunk

Bug #2069543 reported by Tobias Urdin
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
In Progress
High
Bence Romsics
os-vif
In Progress
Undecided
Bence Romsics

Bug Description

zed release, version 21.2.1

i have not tried to reproduce on a newer version

when creating a trunk the tap interface for the instance is still in dead vlan

    Bridge tbr-d0aed4fd-1
        datapath_type: system
        Port tbr-d0aed4fd-1
            Interface tbr-d0aed4fd-1
                type: internal
        Port tap2f072c63-b6
            tag: 4095
            Interface tap2f072c63-b6
        Port tpt-2f072c63-b6
            tag: 0
            Interface tpt-2f072c63-b6
                type: patch
                options: {peer=tpi-2f072c63-b6}

if I issue set tag=[] to force the tap port into being a trunk with:
ovs-vsctl set port tap2f072c63-b6 tag=[]

traffic starts to flow which makes me think the port binding for the tap is not done correctly

logs: https://paste.opendev.org/show/bstDgfi43ManfW1RijJo/

trunk ID: d0aed4fd-193d-4164-ba48-7d1d137787e1
parent port ID: 2f072c63-b6ce-45d3-96b6-27b4f46d397b

Tags: trunk
Revision history for this message
Tobias Urdin (tobias-urdin) wrote (last edit ):
Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

it looks like port binding never touches tap2f072c63-b6, if i toggle for example port security on the port 2f072c63-b6ce-45d3-96b6-27b4f46d397b the port binding code processes that on the tpi-2f072c63-b6 interface
https://paste.opendev.org/show/bbqcN14JEm7kckh0mWUd/

Revision history for this message
Tobias Urdin (tobias-urdin) wrote (last edit ):

we use this patch to workaround the issue (formatting on below patch is broken when posting as LP comment)

diff --git a/neutron/services/trunk/drivers/openvswitch/agent/trunk_manager.py b/neutron/services/trunk/drivers/openvswitch/agent/trunk_manager.py
index bb9559b739..fbd0411588 100644
--- a/neutron/services/trunk/drivers/openvswitch/agent/trunk_manager.py
+++ b/neutron/services/trunk/drivers/openvswitch/agent/trunk_manager.py
@@ -127,6 +127,10 @@ class TrunkParentPort(object):
             txn.add(ovsdb.db_set('Port', self.patch_port_trunk_name,
                                  ('vlan_mode', 'access'),
                                  ('tag', tag)))
+ # TODO(tobias.urdin): Hack, fix so that the instance tap port
+ # is a trunk and not on dead vlan
+ tap_port_name = ("tap%s" % self.port_id[:11])
+ txn.add(ovsdb.db_set('Port', tap_port_name, ('tag', [])))

tags: added: trunk
Changed in neutron:
importance: Undecided → Medium
status: New → Triaged
Changed in neutron:
assignee: nobody → Bence Romsics (bence-romsics)
Revision history for this message
Bence Romsics (bence-romsics) wrote :
Download full text (4.0 KiB)

I tried reproducing this on the tip of unmaintained/zed (c32eb5669a) and on current master, but I could not.

Is this always happening for you? If not, how frequent is it? Do you see any factors influencing its frequency? Do you have steps to reproduce it?

On a system where the trunk ends up working properly I get the same logs as you:

rubasov@devstack1:/opt/stack/neutron$ sudo journalctl -o short-iso-precise -u devstack@q-agt -S -1h | egrep tap3bb2174c-e9 | head -10
2024-06-20T08:14:50.593451+0000 devstack1 neutron-openvswitch-agent[1018]: DEBUG neutron.agent.common.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["4834efe7-27fc-4f8a-8217-8b925414dc5a","insert","tap3bb2174c-e9",["set",[]],["map",[["attached-mac","fa:16:3e:6f:92:ef"],["iface-id","3bb2174c-e959-4f29-a165-6a69ea731da7"],["iface-status","active"],["vm-uuid","c2584e6e-f6e7-4e18-b106-39f7c2533e30"]]]]],"headings":["row","action","name","ofport","external_ids"]} {{(pid=1018) _read_stdout /opt/stack/neutron/neutron/agent/common/async_process.py:268}}
2024-06-20T08:14:50.596984+0000 devstack1 neutron-openvswitch-agent[1018]: DEBUG neutron.agent.common.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["4834efe7-27fc-4f8a-8217-8b925414dc5a","old",null,["set",[]],null],["","new","tap3bb2174c-e9",-1,["map",[["attached-mac","fa:16:3e:6f:92:ef"],["iface-id","3bb2174c-e959-4f29-a165-6a69ea731da7"],["iface-status","active"],["vm-uuid","c2584e6e-f6e7-4e18-b106-39f7c2533e30"]]]]],"headings":["row","action","name","ofport","external_ids"]} {{(pid=1018) _read_stdout /opt/stack/neutron/neutron/agent/common/async_process.py:268}}
2024-06-20T08:14:50.941812+0000 devstack1 neutron-openvswitch-agent[1018]: DEBUG neutron.agent.common.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["4834efe7-27fc-4f8a-8217-8b925414dc5a","old",null,-1,null],["","new","tap3bb2174c-e9",1,["map",[["attached-mac","fa:16:3e:6f:92:ef"],["iface-id","3bb2174c-e959-4f29-a165-6a69ea731da7"],["iface-status","active"],["vm-uuid","c2584e6e-f6e7-4e18-b106-39f7c2533e30"]]]]],"headings":["row","action","name","ofport","external_ids"]} {{(pid=1018) _read_stdout /opt/stack/neutron/neutron/agent/common/async_process.py:268}}
2024-06-20T08:14:51.112215+0000 devstack1 neutron-openvswitch-agent[1018]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): DbSetCommand(table=Interface, record=tap3bb2174c-e9, col_values=(('external_ids', {'attached-mac': 'fa:16:3e:6f:92:ef', 'iface-id': '3bb2174c-e959-4f29-a165-6a69ea731da7', 'iface-status': 'active', 'vm-uuid': 'c2584e6e-f6e7-4e18-b106-39f7c2533e30', 'bridge_name': 'tbr-141d065f-0', 'trunk_id': '141d065f-0e7c-4ad1-86c0-47c6a0329298', 'subport_ids': '[]'}),)) {{(pid=1018) do_commit /usr/local/lib/python3.8/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:89}}
2024-06-20T08:14:51.121507+0000 devstack1 neutron-openvswitch-agent[1018]: DEBUG neutron.agent.common.async_process [-] Output received...

Read more...

Changed in neutron:
status: Triaged → Incomplete
Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Unfortunately it's a 100% reproduce on that issue when spawning an instance using the trunk, that the tap interface is wrong.

I also saw that my hack isn't working 100% of the time either because it races with the tap existing.

We use openvswithch firewall driver and the only special config opt we set should be below in nova.conf and neutron.conf
[os_vif_ovs]
isolate_vif=True

for nova.conf we also set
[workarounds]
wait_for_vif_plugged_event_during_hard_reboot=normal

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

I guess isolate_vif=True is the culprit here https://review.opendev.org/c/openstack/os-vif/+/612534 as the trunk code never fixes/touches the tap port.

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Confirmed removing this config option fixes the tap

Revision history for this message
Bence Romsics (bence-romsics) wrote :

Thanks Tobias for the config info!

This way I could reproduce this bug on current master:

devstack e2aeab1b
neutron cdb644574a

$ egrep ISOLATE_VIF local.conf
OS_VIF_OVS_ISOLATE_VIF=True

$ ack isolate_vif /etc/{nova,neutron}
/etc/nova/nova.conf
140:isolate_vif = True

/etc/nova/nova-cpu.conf
139:isolate_vif = True

/etc/neutron/neutron.conf
2635:isolate_vif = True

Out of which I believe only nova-cpu.conf matters. Setting the same in nova.conf and neutron.conf is ignored, but does not hurt either.

openstack network create net0
openstack subnet create --network net0 --subnet-range 10.0.100.0/24 subnet0
openstack port create --network net0 --fixed-ip ip-address=10.0.100.10 port0
openstack network trunk create --parent-port port0 trunk0
openstack server create --flavor cirros256 --image cirros-0.6.2-x86_64-disk --nic port-id=port0 vm0 --wait

$ sudo ovs-vsctl find Port name=tap$( openstack port list --server vm0 -f value -c ID | cut -b 1-11 ) | egrep '(tag|trunks|vlan_mode)'
tag : 4095
trunks : []
vlan_mode : []

My first guess for a root cause is that in neutron we had these changes:

$ git log --oneline --no-merges --grep='dead vlan' -2
0ddca28454 Make sure "dead vlan" ports cannot transmit packets
7aae31c9f9 Make the dead vlan actually dead

However we missed changing os_vif's isolate_vif related behavior accordingly.

I hope to propose a fix soon.

Changed in neutron:
status: Incomplete → Triaged
importance: Medium → High
Revision history for this message
Bence Romsics (bence-romsics) wrote :

I think I understand the root cause now, which is twofold:

* The first part is what I already mentioned in comment #8. We have changed how we make a port dead in neutron, but forgot to do the same in os-vif. Instead of tag=4095 the port should have all these set:

tag : 4095
trunks : [4095]
vlan_mode : trunk

* The second part is, that for trunk ports we have the following datapath wiring:

|I|tpi--tpt|T|
|N|spi--spt|B|tap
|T|spi--spt|R|

os-vif puts the tap in the dead vlan, however ovs-agent assigns the vlan overriding the dead vlan on the tpi/spi ports. os-vif does not and should not know about anything but the tap and the tbr. ovs-agent wants to handle the tpi/spi ports as any other non-trunk vif. The non-trunk-specific code inside ovs-agent will never touch the tap. Something similar to the workaround in comment #3 could be the proper solution to this part.

With isolate_vif=False we did not notice that we do not remove the tap from the dead vlan, because we never put it in the dead vlan in the first place.

Hope to propose a fix tomorrow.

Changed in os-vif:
status: New → Confirmed
assignee: nobody → Bence Romsics (bence-romsics)
Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Thanks for analysing the root-cause, interesting to follow! It's working great for us just setting isolate_vif=False for now as a workaround.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/923035

Changed in neutron:
status: Triaged → In Progress
Changed in os-vif:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-vif (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/os-vif/+/923036

Revision history for this message
Tobias Urdin (tobias-urdin) wrote :

Thanks for looking into it, your patch in trunkmanager looks very similar to the workaround we used. We had issues with race conditions occuring when using that patch.

Getting error in OVS agent:
2024-06-20 15:36:29.221 1697523 ERROR ovsdbapp.backend.ovs_idl.transaction [-] txn n=1 command(idx=5): DbSetCommand(table=Port, record=tape4c861c7-da, col_values=(('tag', []),)) aborted due to error: Cannot find Port with name=tape4c861c7-da: ovsdbapp.backend.ovs_idl.idlutils.RowNotFound: Cannot find Port with name=tape4c861c7-da

which made the instance creation to go into error because neutron failed the binding

so i think there is a race condition at play here were we cannot assume the tap interface exists already, i don't think we can check "if tap exists" either because the trunkmanager code path would not be called again?

Revision history for this message
sean mooney (sean-k-mooney) wrote :

my intiall read on this is this is a neutorn bug not an os-vif one and I'm reluctant to change the code in os-vif

Changed in os-vif:
status: In Progress → Opinion
Changed in os-vif:
status: Opinion → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Bence Romsics <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/923035

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.