Race condition between destroy instance and ovs_neutron_agent

Bug #1390620 reported by Paul Ward
Affects: neutron
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

There's a race condition between the time compute deletes a neutron port during destroy-instance processing and the time the ovs_neutron_agent rpc_loop notices the OVS port has been removed and tries to update the associated neutron port's status to DOWN.

In our scenario, the controller node is separate from the compute host, so these calls go over REST or RPC.

It appears that normally the ovs_neutron_agent wins and the RPC call to update the neutron port happens before compute makes the REST API delete call. However, once in a while, compute's delete gets in first and removes the port before the ovs agent tries to update it. In this case, the update fails, the failure is reported back via RPC, and the ovs agent then does a full resync on the next iteration of rpc_loop.

In a large scale environment, this is a problem because that resync can take a very long time due to the very large number of ports to reprocess. And while that single iteration is running (I have seen it take 10 minutes), new deploys start failing because the vif plug event times out (since the agent will not process the port created for the plug until the next iteration, which could be 10 minutes away, at which point the deploy has already failed and cleaned up the port).
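
For context, the resync path in the agent's rpc_loop behaves roughly as sketched below. This is an approximation of the logic only; the method and attribute names are simplified and not the exact neutron source. Once a failure sets the sync flag, the next iteration rescans and reprocesses every OVS port before any newly plugged port is handled.

    # Approximate sketch (simplified) of the OVS agent's main loop: a single
    # failed device update schedules a full resync, and that resync blocks
    # processing of newly plugged ports for the whole iteration.
    def rpc_loop(self):
        sync = True
        ports = set()
        while self.run_daemon_loop:
            if sync:
                # Forget the known ports so the next scan reports everything
                # as added and every port gets reprocessed from scratch.
                ports.clear()
                sync = False
            port_info = self.scan_ports(ports)
            # process_network_ports calls treat_devices_added/removed and
            # returns True if any device failed, scheduling another resync.
            sync = self.process_network_ports(port_info)
            ports = port_info['current']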

I think the fix for this, which I will create a patch for, is to not have treat_devices_removed be part of the decision to resync. If we're removing a port, why do we care that the neutron port could not be found? I'm not sure if there are other considerations to think about...
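
A minimal sketch of that idea, assuming treat_devices_removed currently sets the resync flag whenever the update_device_down RPC fails; the names and signatures below are approximate, not the exact neutron source.

    # Sketch of the proposed agent-side change (method on the OVS agent
    # class; names are illustrative). The port is being removed anyway, so
    # failing to mark it DOWN on the server should not force a full resync.
    def treat_devices_removed(self, devices):
        resync = False
        for device in devices:
            try:
                self.plugin_rpc.update_device_down(
                    self.context, device, self.agent_id, self.host)
            except Exception:
                # Most likely the compute node already deleted the neutron
                # port; skip setting resync = True and just clean up locally.
                pass
            self.port_unbound(device)
        return resync

Whether any other failure mode in this path still needs to trigger a resync is the open question raised above.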

Tags: ovs
Revision history for this message
Sudipta Biswas (sbiswas7) wrote :

One intent that I can think of is with respect to invalidating the cache. But doing a complete re-sync for such conditions is undesirable.
Even if we refresh the cache, there should probably be a way to diff the sets rather than re-create all the flows all over again.
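
(A tiny illustration of the "diff the sets" idea, with hypothetical example data: only ports that actually changed would have their flows touched.)

    # Hypothetical example: diff the previously known ports against the
    # current scan instead of rebuilding every flow.
    previous_ports = {"port-a", "port-b", "port-c"}
    current_ports = {"port-b", "port-c", "port-d"}

    added = current_ports - previous_ports      # {"port-d"}  -> set up flows
    removed = previous_ports - current_ports    # {"port-a"}  -> tear down flows
    unchanged = current_ports & previous_ports  # leave these flows alone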

Revision history for this message
yong sheng gong (gongysh) wrote :

I think the neutron server should ignore the port status update failure if the port does not exist at all.
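
A rough sketch of that server-side alternative, assuming the RPC callback that handles update_device_down can simply treat a missing port as already gone; the helper name _set_port_status is hypothetical.

    from neutron.common import exceptions as n_exc

    # Sketch of a server-side RPC callback (method on the plugin RPC
    # callbacks class; _set_port_status is a hypothetical helper).
    def update_device_down(self, rpc_context, **kwargs):
        device = kwargs["device"]
        try:
            self._set_port_status(rpc_context, device, status="DOWN")
        except n_exc.PortNotFound:
            # The port was already deleted by the compute destroy path;
            # report it as gone instead of returning an error that makes
            # the agent schedule a full resync.
            return {"device": device, "exists": False}
        return {"device": device, "exists": True}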

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Yong, I think that the problem is on the agent side.

The question is why the full sync is taking ~10 minutes.
Can you provide details about the cloud configuration with respect to neutron objects?
Are there many security groups with lots of rules?

Changed in neutron:
importance: Undecided → Medium
tags: added: ovs
Changed in neutron:
status: New → Incomplete
Revision history for this message
Paul Ward (wpward) wrote :

No security groups. I think the 10 minutes is simply because of the sheer number of OVS ports to recreate.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired