Race condition with DPDK + trunk ports when instance port is deleted then quickly recreated

Bug #1807239 reported by Nate Johnston
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Critical
Assigned to: Nate Johnston

Bug Description

Deployment is Queens with ML2/OVS and DPDK. Instance ports are neutron trunk ports with DPDK vhu in vhostuserclient mode. When an instance is rebooted, nova/os_vif deletes the OVS port connected to the trunk bridge and then recreates it when the instance comes back online. This causes a race condition in the trunk code that handles the tbr trunk bridge: Neutron, seeing the deletion, queues a delete of the tbr bridge, and the subsequent re-add of the trunk port is then ignored because the delete is still enqueued.

Here are annotated logs for an instance of the issue from the logfiles of nova-compute and neutron-openvswitch-agent:

# (1) nova-compute deletes the instance's ovs port
2018-11-30 21:03:17.576 5758 DEBUG oslo.privsep.daemon [req-0387f343-92e4-4b0c-b43b-a58d26fafef7 cf1b5aad77aa40a6933003b69f4cba43 a830497f545440efbc007a9607f4fd89 - default default] privsep: request[140038488344528]: (3, 'vif_plug_ovs.linux_net.delete_ovs_vif_port', (u'tbr-f4554113-b', u'vhu44d242e1-30'), {'timeout': 120}) loop /usr/lib/python2.7/site-packages/oslo_privsep/daemon.py:443

# (2) ovsdb-monitor sees the deletion of the ovs port in (1)
2018-11-30 21:03:17.587 61270 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["4d9cbba1-fe43-4703-9735-1eb61e763b29","delete","vhu44d242e1-30",1,["map",[["attached-mac","fa:16:3e:3e:4a:b4"],["bridge_name","tbr-f4554113-b"],["iface-id","44d242e1-30aa-4da6-b00c-bad9d64560af"],["iface-status","active"],["subport_ids","[\"605ebd34-0ea4-44a6-aa5b-ce80878b1847\", \"7df8776f-4e15-4c74-bc58-0eff1bffd868\", \"e7476312-fcca-4ad8-886e-2093de1669f5\"]"],["trunk_id","f4554113-b695-4e7d-b550-6927765c6679"],["vm-uuid","1f23fafb-e370-4e53-9bf1-8d5b1d8f7387"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:239

# (3) nova-compute creates a fresh ovs port for the instance
2018-11-30 21:03:18.535 5758 DEBUG oslo.privsep.daemon [req-0387f343-92e4-4b0c-b43b-a58d26fafef7 cf1b5aad77aa40a6933003b69f4cba43 a830497f545440efbc007a9607f4fd89 - default default] privsep: request[140038488344528]: (3, 'vif_plug_ovs.linux_net.create_ovs_vif_port', (u'tbr-f4554113-b', u'vhu44d242e1-30', u'44d242e1-30aa-4da6-b00c-bad9d64560af', u'fa:16:3e:3e:4a:b4', u'1f23fafb-e370-4e53-9bf1-8d5b1d8f7387', 9000), {'interface_type': 'dpdkvhostuserclient', 'vhost_server_path': u'/var/lib/vhost_sockets/vhu44d242e1-30', 'timeout': 120}) loop /usr/lib/python2.7/site-packages/oslo_privsep/daemon.py:443

# (4) ovsdb-monitor sees the creation of the ovs port in (3)
2018-11-30 21:03:19.054 61270 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["fddedd5b-73e8-4e54-ba2a-97364f27950d","old",null,null,["map",[["attached-mac","fa:16:3e:3e:4a:b4"],["iface-id","44d242e1-30aa-4da6-b00c-bad9d64560af"],["iface-status","active"],["vm-uuid","1f23fafb-e370-4e53-9bf1-8d5b1d8f7387"]]]],["","new","vhu44d242e1-30",6,["map",[["attached-mac","fa:16:3e:3e:4a:b4"],["bridge_name","tbr-f4554113-b"],["iface-id","44d242e1-30aa-4da6-b00c-bad9d64560af"],["iface-status","active"],["subport_ids","[]"],["trunk_id","f4554113-b695-4e7d-b550-6927765c6679"],["vm-uuid","1f23fafb-e370-4e53-9bf1-8d5b1d8f7387"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:239

# (5) neutron-openvswitch-agent reacts to the deletion in (2) by deleting the trunk bridge
2018-11-30 21:03:19.093 61270 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=0): DelBridgeCommand(if_exists=True, name=tbr-f4554113-b) do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84

# (6) ovsdb-monitor reports the deletion of the trunk bridge from (5)
2018-11-30 21:03:19.097 61270 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: {"data":[["8370c94d-6920-47f7-811a-51702eee4027","delete","tbr-f4554113-b",65534,["map",[]]],["1f9b6a11-32e5-44bf-a6d3-e6750e087c33","delete","tpt-44d242e1-30",2,["map",[["attached-mac","fa:16:3e:3e:4a:b4"],["iface-id","44d242e1-30aa-4da6-b00c-bad9d64560af"]]]],["9951c13a-6bfb-45ba-8499-68fa399d1e59","delete","tpi-44d242e1-30",39,["map",[["attached-mac","fa:16:3e:3e:4a:b4"],["iface-id","44d242e1-30aa-4da6-b00c-bad9d64560af"]]]],["fddedd5b-73e8-4e54-ba2a-97364f27950d","delete","vhu44d242e1-30",6,["map",[["attached-mac","fa:16:3e:3e:4a:b4"],["bridge_name","tbr-f4554113-b"],["iface-id","44d242e1-30aa-4da6-b00c-bad9d64560af"],["iface-status","active"],["subport_ids","[]"],["trunk_id","f4554113-b695-4e7d-b550-6927765c6679"],["vm-uuid","1f23fafb-e370-4e53-9bf1-8d5b1d8f7387"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:239

The creation event detected in step (4) is lost. When the race condition does not happen, the neutron-openvswitch-agent logs will include a "Processing trunk bridge" log line; this does not occur when the race condition is triggered.
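The lost-update ordering above can be sketched as follows. This is an illustration only, not the actual neutron agent code: the event tuples, the `bridges` set, and the `lost_adds` list are stand-ins for the agent's ovsdb event handling.

```python
# Sketch of the race: the delete from step (2) is already queued when the
# add from step (4) arrives, so the add is processed against a bridge that
# the agent has meanwhile torn down.
from collections import deque

BRIDGE = "tbr-f4554113-b"
bridges = {BRIDGE}
lost_adds = []

# Events in the order the ovsdb monitor reported them.
events = deque([("delete", "vhu44d242e1-30"), ("add", "vhu44d242e1-30")])

while events:
    action, port = events.popleft()
    if action == "delete":
        # The agent reacts to the port deletion by deleting the whole
        # trunk bridge, as in step (5).
        bridges.discard(BRIDGE)
    elif action == "add":
        if BRIDGE not in bridges:
            # The re-add runs after the bridge is gone, so the
            # "Processing trunk bridge" path never fires for it.
            lost_adds.append(port)
```

After the loop, the bridge is gone and the add event has been dropped, which matches the missing "Processing trunk bridge" log line described above.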

Another indication of the issue is tracebacks when the agent tries to store trunk metadata on the already-deleted bridge:

2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command [-] Error executing command: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 35, in execute
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command txn.add(self)
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command self.gen.next()
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 94, in transaction
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command self._nested_txn = None
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command File "/usr/lib/python2.7/site-packages/ovsdbapp/api.py", line 54, in __exit__
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command self.result = self.commit()
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 62, in commit
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command raise result.ex
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command RowNotFound: Cannot find Bridge with name=tbr-f4554113-b
2018-11-30 21:00:22.444 61270 ERROR ovsdbapp.backend.ovs_idl.command
2018-11-30 21:00:22.444 61270 ERROR neutron.services.trunk.drivers.openvswitch.agent.ovsdb_handler [-] Failed to store metadata for trunk f4554113-b695-4e7d-b550-6927765c6679: Cannot find Bridge with name=tbr-f4554113-b: RowNotFound: Cannot find Bridge with name=tbr-f4554113-b

Revision history for this message
Bence Romsics (bence-romsics) wrote :

Nate: Do you think DPDK contributes to this error condition? I'm trying to reproduce it in a non-DPDK environment (devstack, VM with a trunk port rebooted in an infinite loop). If I manage to catch the same error messages, that will remove DPDK from the moving parts. But if I don't see those error messages, my test unfortunately proves nothing.

Revision history for this message
Nate Johnston (nate-johnston) wrote :

I think this issue might be specific to the combination of DPDK vhostuserclient mode and trunk ports. I say that because it is DPDK vhostuserclient mode that causes the OVS port to be deleted and then recreated when the instance is rebooted. I think the time window for this race condition is small; it triggers 20% of the time or less in a customer environment running with this configuration. You can see in the timeline above that the gap between when nova/os_vif deletes the port and when it recreates it is less than one second; I think you need that kind of subsecond granularity to trigger this, and DPDK vhostuserclient mode is the only scenario I can see where that applies.

Revision history for this message
Bence Romsics (bence-romsics) wrote :

I left a non-DPDK OVS environment constantly rebooting a VM overnight and this error never occurred - but that of course proves very little, if anything. I don't have a DPDK environment at hand to reproduce this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/623275
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bd2a1bc6c3d68ddd27fc0b77ec767b11ec0fcb54
Submitter: Zuul
Branch: master

commit bd2a1bc6c3d68ddd27fc0b77ec767b11ec0fcb54
Author: Nate Johnston <email address hidden>
Date: Thu Dec 6 12:39:49 2018 -0500

    Do not delete trunk bridges if service port attached

    When a deployment has instance ports that are neutron trunk ports with
    DPDK vhu in vhostuserclient mode, when the instance reboots nova will
    delete the ovs port and then recreate when the host comes back from
    reboot. This quick transition change can trigger a race condition that
    causes the tbr trunk bridge to be deleted after the port has been
    recreated. See the bug for more details.

    This change mitigates the race condition by adding a check for active
    service ports within the trunk port deletion function.

    Change-Id: I70b9c26990e6902f8888449bfd7483c25e5bff46
    Closes-Bug: #1807239
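The idea behind the mitigation can be sketched as below. The function and prefix names here are illustrative, not neutron's actual API: the point is simply that the bridge deletion first checks for attached service (instance) ports and skips the delete if one reappeared.

```python
# Hedged sketch of the fix's approach: do not delete a trunk bridge that
# still has an instance port attached.
def has_service_port(port_names):
    # In the logs above, DPDK vhostuser instance ports carry a "vhu"
    # prefix, while trunk patch ports use "tpt"/"tpi". The prefix check
    # here is an assumption for illustration.
    return any(name.startswith("vhu") for name in port_names)

def maybe_delete_trunk_bridge(bridge, list_ports, delete_bridge):
    """Delete the trunk bridge only if no instance port reappeared."""
    if has_service_port(list_ports(bridge)):
        # nova recreated the port between the delete event and this
        # handler running: keep the bridge to avoid the race.
        return False
    delete_bridge(bridge)
    return True
```

With the recreated vhu port present, the enqueued bridge delete becomes a no-op, so the trunk wiring survives the quick delete/recreate cycle.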

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/625577

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/625578

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/625579

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/625577
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=25ab89f7d36d21530063c2a93af984be732c9137
Submitter: Zuul
Branch: stable/rocky

commit 25ab89f7d36d21530063c2a93af984be732c9137
Author: Nate Johnston <email address hidden>
Date: Thu Dec 6 12:39:49 2018 -0500

    Do not delete trunk bridges if service port attached

    When a deployment has instance ports that are neutron trunk ports with
    DPDK vhu in vhostuserclient mode, when the instance reboots nova will
    delete the ovs port and then recreate when the host comes back from
    reboot. This quick transition change can trigger a race condition that
    causes the tbr trunk bridge to be deleted after the port has been
    recreated. See the bug for more details.

    This change mitigates the race condition by adding a check for active
    service ports within the trunk port deletion function.

    Change-Id: I70b9c26990e6902f8888449bfd7483c25e5bff46
    Closes-Bug: #1807239
    (cherry picked from commit bd2a1bc6c3d68ddd27fc0b77ec767b11ec0fcb54)

tags: added: in-stable-rocky
tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/625578
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e459b20fb50c5791fb30de0b35b5b5099b04cf2a
Submitter: Zuul
Branch: stable/queens

commit e459b20fb50c5791fb30de0b35b5b5099b04cf2a
Author: Nate Johnston <email address hidden>
Date: Thu Dec 6 12:39:49 2018 -0500

    Do not delete trunk bridges if service port attached

    When a deployment has instance ports that are neutron trunk ports with
    DPDK vhu in vhostuserclient mode, when the instance reboots nova will
    delete the ovs port and then recreate when the host comes back from
    reboot. This quick transition change can trigger a race condition that
    causes the tbr trunk bridge to be deleted after the port has been
    recreated. See the bug for more details.

    This change mitigates the race condition by adding a check for active
    service ports within the trunk port deletion function.

    Change-Id: I70b9c26990e6902f8888449bfd7483c25e5bff46
    Closes-Bug: #1807239
    (cherry picked from commit bd2a1bc6c3d68ddd27fc0b77ec767b11ec0fcb54)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/625579
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d36cb198134249f5fc3955d3cefa90ccb87d6d49
Submitter: Zuul
Branch: stable/pike

commit d36cb198134249f5fc3955d3cefa90ccb87d6d49
Author: Nate Johnston <email address hidden>
Date: Thu Dec 6 12:39:49 2018 -0500

    Do not delete trunk bridges if service port attached

    When a deployment has instance ports that are neutron trunk ports with
    DPDK vhu in vhostuserclient mode, when the instance reboots nova will
    delete the ovs port and then recreate when the host comes back from
    reboot. This quick transition change can trigger a race condition that
    causes the tbr trunk bridge to be deleted after the port has been
    recreated. See the bug for more details.

    This change mitigates the race condition by adding a check for active
    service ports within the trunk port deletion function.

    Change-Id: I70b9c26990e6902f8888449bfd7483c25e5bff46
    Closes-Bug: #1807239
    (cherry picked from commit bd2a1bc6c3d68ddd27fc0b77ec767b11ec0fcb54)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0b1

This issue was fixed in the openstack/neutron 14.0.0.0b1 development milestone.

tags: added: neutron-proactive-backport-potential
Revision history for this message
sean mooney (sean-k-mooney) wrote :

When we were discussing this downstream I was pretty convinced this applied to kernel OVS as well and had no dependency on DPDK, for what it's worth.

tags: removed: neutron-proactive-backport-potential queens-backport-potential rocky-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.7

This issue was fixed in the openstack/neutron 11.0.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.3

This issue was fixed in the openstack/neutron 13.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.6

This issue was fixed in the openstack/neutron 12.0.6 release.
