A race condition may occur when concurrent agent scheduling happens

Bug #1780357 reported by Kailun Qin
This bug affects 1 person
Affects    Status     Importance    Assigned to    Milestone
neutron    Invalid    Medium        Kailun Qin

Bug Description

Agent scheduling operations for DHCP servers and routers can be initiated from many sources, such as API and RPC workers. Neutron-server may use multiple child processes to handle these requests simultaneously.

As a result, when agent scheduling operations happen concurrently, a scheduling decision that is still being made can race with a scheduling operation that is already in progress. This may lead to a resource being scheduled multiple times if two operations run at the same time.
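The race described above is a check-then-act problem, and it can be illustrated with a minimal sketch. This is not neutron code: `BindingStore`, `schedule_racy`, `schedule_locked` and `run` are hypothetical names, the binding table is an in-memory list, and threads stand in for neutron-server child processes. A barrier forces both workers to finish their check before either one writes, so the unlucky interleaving is reproduced every time.

```python
import threading


class BindingStore:
    """Hypothetical in-memory stand-in for the network-to-agent binding table."""

    def __init__(self):
        self.bindings = []            # list of (network_id, agent_id) pairs
        self.lock = threading.Lock()

    def agents_for(self, network_id):
        return [agent for net, agent in self.bindings if net == network_id]


def schedule_racy(store, network_id, agent_id, barrier, agents_per_network=1):
    # Check: does the network still need an agent?
    needs_agent = len(store.agents_for(network_id)) < agents_per_network
    # Both workers finish the check before either one acts on it.
    barrier.wait()
    # Act: the decision is stale by now, so both workers add a binding.
    if needs_agent:
        store.bindings.append((network_id, agent_id))


def schedule_locked(store, network_id, agent_id, barrier, agents_per_network=1):
    barrier.wait()
    # Check and act under one lock: the second worker sees the first binding.
    with store.lock:
        if len(store.agents_for(network_id)) < agents_per_network:
            store.bindings.append((network_id, agent_id))


def run(schedule):
    """Run two concurrent scheduling attempts for the same network."""
    store = BindingStore()
    barrier = threading.Barrier(2)
    workers = [
        threading.Thread(target=schedule, args=(store, "net-1", agent, barrier))
        for agent in ("agent-a", "agent-b")
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return store.agents_for("net-1")


if __name__ == "__main__":
    print("racy:  ", run(schedule_racy))    # two bindings for one network
    print("locked:", run(schedule_locked))  # a single binding
```

A process-local lock like this one would not protect separate neutron-server child processes; it only shows why the check and the write need to be atomic. Any real serialization point across processes would have its own cost under load.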

Kailun Qin (kailun.qin)
Changed in neutron:
assignee: nobody → Kailun Qin (kailun.qin)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/580543

Changed in neutron:
status: New → In Progress
zhaobo (zhaobo6)
Changed in neutron:
importance: Undecided → Medium
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Kailun Qin (kailun.qin) → nobody
status: In Progress → New
tags: added: timeout-abandon
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/580543
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Kailun Qin (kailun.qin)
Changed in neutron:
assignee: nobody → Kailun Qin (kailun.qin)
Kailun Qin (kailun.qin)
Changed in neutron:
status: New → In Progress
Kailun Qin (kailun.qin)
tags: removed: timeout-abandon
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Kailun Qin (<email address hidden>) on branch: master
Review: https://review.openstack.org/580543
Reason: Brian/garyk/Miguel,
Thanks for all your comments and guidance on the resolution. Personally I learnt quite a lot of things from this patch.

After checking with the production environment and doing further investigation, we found that:

1) For the L3 part, the issue reported can be mostly addressed by https://review.openstack.org/364278 and it is enough to cover our current scenarios.
2) For the DHCP agent scheduling part, we do not see further negative impact when multiple DHCP server instances are scheduled to different agents.
3) The fix we were trying to propose may cause a significant bottleneck in large deployments when a large number of agents request networks to be scheduled. This was observed during full system recovery scenarios, when a large number of compute hosts were rebooting simultaneously. Thus it is definitely NOT a good idea to have this in neutron, as most of you have already pointed out.

Considering the cost versus the benefit of fixing this to support the case indicated by garyk, and based on the findings and tests cited above, I'd like to abandon this patch.

Let me know if you have any further concerns. Thanks again.

Kailun Qin (kailun.qin)
Changed in neutron:
status: In Progress → Invalid