dhcp_ready_on_ports causing race with Neutron OVS agent boot

Bug #1651672 reported by Anton Aksola
This bug affects 2 people
Affects   Status   Importance   Assigned to   Milestone
neutron   New      Undecided    Unassigned

Bug Description

When neutron-openvswitch-agent starts, it indirectly causes all of its ports to transition into the BUILD state. If DHCP is enabled, a DHCP agent receives a port.update.end notification and refreshes its configuration, after which it makes a dhcp_ready_on_ports RPC call.

At this stage there are no provisioning blocks: no new ports have actually been created, so none were registered. However, a PROVISIONING_COMPLETE event is still emitted, which causes the ports to transition into the ACTIVE state. If l2pop is enabled, fdb entries are sent at this point.

The problem: with a large number of ports, the OVS agent is most likely still processing ports and allocating local vlans. As a result, some (or all) of the fdb entries are discarded because the local vlans do not exist yet. When the OVS agent later reaches the point where it uses the update_device_list RPC call to transition ports into ACTIVE, they are already in that state and no fdb entries are emitted.
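
The ordering problem described above can be illustrated with a small toy model. This is only a sketch of the sequence of events, not Neutron code: the `Server` and `OvsAgent` classes, their methods, and the port and network names are all hypothetical, and it assumes that fdb entries are emitted only on a real transition into ACTIVE and are dropped by an agent that has no local vlan for the network yet.

```python
from dataclasses import dataclass, field


@dataclass
class OvsAgent:
    """Hypothetical stand-in for an OVS agent that is still booting."""
    local_vlans: set = field(default_factory=set)  # networks with a local vlan
    applied: list = field(default_factory=list)    # fdb entries installed
    dropped: list = field(default_factory=list)    # fdb entries discarded

    def fdb_add(self, port, network):
        # An fdb entry can only be installed once the network has a local vlan.
        if network in self.local_vlans:
            self.applied.append(port)
        else:
            self.dropped.append(port)


@dataclass
class Server:
    """Hypothetical stand-in for the server-side port status handling."""
    status: dict = field(default_factory=dict)  # port -> status

    def set_active(self, port, network, agent):
        # fdb entries are emitted only on a real transition into ACTIVE.
        if self.status.get(port) == "ACTIVE":
            return
        self.status[port] = "ACTIVE"
        agent.fdb_add(port, network)


def simulate(ports, dhcp_first):
    """ports maps port id -> network id; returns the agent after boot."""
    server = Server({p: "BUILD" for p in ports})
    agent = OvsAgent()
    if dhcp_first:
        # dhcp_ready_on_ports arrives while the agent is still allocating vlans.
        for port, network in ports.items():
            server.set_active(port, network, agent)
    # The agent finishes allocating local vlans, then reports its devices
    # (the update_device_list step).
    agent.local_vlans.update(ports.values())
    for port, network in ports.items():
        server.set_active(port, network, agent)
    return agent


ports = {"port-1": "net-a", "port-2": "net-a", "port-3": "net-b"}
racy = simulate(ports, dhcp_first=True)
clean = simulate(ports, dhcp_first=False)
print("racy boot:  applied", racy.applied, "dropped", racy.dropped)
print("clean boot: applied", clean.applied, "dropped", clean.dropped)
```

In the racy run every fdb entry is dropped and never re-sent, because the later device report finds the ports already ACTIVE. In the clean run the same entries are applied.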

Version: observed in Newton (neutron 9.0.0)

Pre-conditions:
  - standalone network node with l3-agent in legacy mode
  - dhcp agent running on another node
  - ovsdb_interface in vsctl mode (due to performance issues with IDL)

To reproduce:
  - have an L3 node with a large number of ports (we had about 1000)
  - have a DHCP agent running on some other node
  - issue a cold boot on the L3 node (no ports in br-int, no existing flows in br-tun), then start the OVS agent and the L3 agent at the same time
  - observe incoming fdb entries before the ports are actually provisioned

Expected behaviour: the DHCP agent should not cause these ports to transition into ACTIVE. fdb entries should be emitted only when the OVS agent issues the update_device_list call.
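
One way to picture this expected behaviour is a tracker in which a completion report activates a port only if that entity actually had an outstanding block. The sketch below is purely illustrative: `ProvisioningTracker`, its methods, and the entity names "L2" and "DHCP" are hypothetical and are not Neutron's provisioning-blocks API, nor the change proposed for this bug.

```python
class ProvisioningTracker:
    """Hypothetical per-port tracker of entities that must finish provisioning."""

    def __init__(self):
        self._pending = {}  # port id -> set of entities still outstanding

    def add_block(self, port_id, entity):
        self._pending.setdefault(port_id, set()).add(entity)

    def complete(self, port_id, entity):
        """Return True only if this call clears the last outstanding block."""
        pending = self._pending.get(port_id)
        if not pending or entity not in pending:
            # Nothing was waiting on this entity, so its report must not
            # trigger a transition to ACTIVE on its own.
            return False
        pending.discard(entity)
        if pending:
            return False
        del self._pending[port_id]
        return True


tracker = ProvisioningTracker()
tracker.add_block("port-1", "L2")            # the OVS agent is still wiring the port
print(tracker.complete("port-1", "DHCP"))    # DHCP report alone does not activate
print(tracker.complete("port-1", "L2"))      # the L2 report clears the last block
```

Under this model the DHCP agent's report is a no-op for ports it never blocked, and the transition happens only once the L2 side reports the port as wired.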

Impact: if a network node is rebooted (due to hardware failure or for some other reason), the node is left in an inconsistent state after the reboot. A random number of fdb entries are missing, which disrupts user traffic.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/413500

Changed in neutron:
assignee: nobody → Anton Aksola (aakso)
status: New → In Progress
tags: added: l3-ipam-dhcp ovs

Kevin Benton (kevinbenton) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Anton Aksola (aakso) → nobody
status: In Progress → New
tags: added: timeout-abandon

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/413500
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
