race conditions with server group scheduler policies

Bug #1423648 reported by Chris Friesen
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

In git commit a79ecbe Russell Bryant submitted a partial fix for a race condition when booting an instance as part of a server group with an "anti-affinity" scheduler policy.

That fix only solves part of the problem, however. There are a number of issues remaining:

1) It's possible to hit a similar race condition for server groups with the "affinity" policy. Suppose we create a new group and then create two instances simultaneously. The scheduler sees an empty group for each, assigns them to different compute nodes, and the policy is violated. We should add a check in _validate_instance_group_policy() to cover the "affinity" case.

2) It's possible to create two instances simultaneously and have them scheduled to conflicting hosts. Both detect the problem in _validate_instance_group_policy(), both are sent back for rescheduling, and both are assigned to conflicting hosts *again*, resulting in an error. To fix this, I propose that instead of checking against all other instances in the group, we only check against instances that were created before the current instance.
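
The proposal in item 2 can be illustrated with a minimal sketch. This is not nova code: the `Member` class, the `created_seq` ordering field, and the `violates_policy()` helper are hypothetical names invented for illustration, and the sketch assumes each earlier member's host is already recorded when the check runs. The point is only that comparing against earlier-created members sends at most one of two racing instances back for rescheduling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Member:
    """Hypothetical view of a server group member (not a nova object)."""
    uuid: str
    created_seq: int       # assumed total ordering by creation time
    host: Optional[str]    # None until the instance has been placed


def earlier_members(current: Member, members: List[Member]) -> List[Member]:
    """Return only the members created before the current instance."""
    return [m for m in members if m.created_seq < current.created_seq]


def violates_policy(policy: str, current: Member, members: List[Member]) -> bool:
    """Check the current instance against earlier-created members only."""
    hosts = {m.host for m in earlier_members(current, members)
             if m.host is not None}
    if policy == "anti-affinity":
        # Violation if an earlier member already occupies our host.
        return current.host in hosts
    if policy == "affinity":
        # Violation if earlier members exist and none shares our host.
        return bool(hosts) and current.host not in hosts
    raise ValueError("unknown policy: %s" % policy)


# Two instances raced onto the same host under "anti-affinity".
first = Member("inst-1", created_seq=1, host="node-a")
second = Member("inst-2", created_seq=2, host="node-a")
group = [first, second]

first_conflict = violates_policy("anti-affinity", first, group)
second_conflict = violates_policy("anti-affinity", second, group)
```

Here only the later instance sees a conflict, so the earlier one keeps its host and the later one is rescheduled against a settled placement.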

Tags: compute
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Changed in nova:
assignee: nobody → Pawel Koniszewski (pawel-koniszewski)
Chris Friesen (cbf123)
Changed in nova:
assignee: Pawel Koniszewski (pawel-koniszewski) → Chris Friesen (cbf123)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/162746

Changed in nova:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/164762

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/169489

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/162746
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=36a703516251c7268ebceb414ed71e4cab4794b0
Submitter: Jenkins
Branch: master

commit 36a703516251c7268ebceb414ed71e4cab4794b0
Author: Chris Friesen <email address hidden>
Date: Mon Mar 16 09:35:16 2015 -0600

    Validate server group affinity policy

    In git commit a79ecbe Russell Bryant submitted a partial fix for a race
    condition when booting an instance as part of a server group with an
    "anti-affinity" scheduler policy.

    It's possible to hit a similar race condition for server groups with
    the "affinity" policy. Suppose we create a new group and then create two
    instances simultaneously. The scheduler sees an empty group for each,
    assigns them to different compute nodes, and the policy is violated.

    To guard against this, we extend _validate_instance_group_policy()
    to cover the "affinity" case as well as "anti-affinity".

    Partial-Bug: #1423648
    Change-Id: Icf95390a128e2062293e1f5b7b78fe79747f5f27
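
As an illustration of the check this commit message describes, the following is a rough sketch and not the merged nova change. `PolicyViolation`, `validate_on_compute()`, and the `placement` dictionary are hypothetical stand-ins; the real logic lives in _validate_instance_group_policy(). It models two instances that the scheduler placed on different hosts because the group looked empty to both requests.

```python
from typing import Dict, Iterable


class PolicyViolation(Exception):
    """Hypothetical stand-in for the error that triggers a reschedule."""


def validate_on_compute(policy: str, my_host: str,
                        other_member_hosts: Iterable[str]) -> None:
    """Late check run on the compute host, covering both policies."""
    others = set(other_member_hosts)
    if policy == "anti-affinity" and my_host in others:
        raise PolicyViolation("another group member is already on " + my_host)
    if policy == "affinity" and others and others != {my_host}:
        raise PolicyViolation("group members are on other hosts")


# Both requests saw an empty group, so the scheduler picked freely.
placement: Dict[str, str] = {"inst-1": "node-a", "inst-2": "node-b"}


def check(instance: str, policy: str = "affinity") -> str:
    """Validate one instance against every other member of the group."""
    others = [host for name, host in placement.items() if name != instance]
    try:
        validate_on_compute(policy, placement[instance], others)
        return "ok"
    except PolicyViolation:
        return "reschedule"


results = {name: check(name) for name in placement}
```

With the "affinity" policy both entries in `results` are "reschedule": the violation is detected, but because each instance is compared against all other members, both are sent back, which is the remaining problem described in item 2 of the bug description.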

Changed in nova:
assignee: Chris Friesen (cbf123) → Jay Pipes (jaypipes)
Jay Pipes (jaypipes)
Changed in nova:
assignee: Jay Pipes (jaypipes) → Chris Friesen (cbf123)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/169489
Reason: This patch has been stalled for a long time, so I am abandoning it. Please feel free to restore it when the code is ready for review.

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/164762
Reason: This patch has been stalled for a long time, so I am abandoning it. Please feel free to restore it when the code is ready for review.

Changed in nova:
assignee: Chris Friesen (cbf123) → nobody
status: In Progress → Confirmed
Charlotte Han (hanrong)
Changed in nova:
assignee: nobody → Charlotte Han (hanrong)
Miguel Alejandro Cantu (miguel-cantu) wrote :

Hi Charlotte,

Any updates on this change? I would be more than willing to help out with testing if need be.

-Alex

Charlotte Han (hanrong)
Changed in nova:
assignee: Charlotte Han (hanrong) → nobody
Miguel Alejandro Cantu (miguel-cantu) wrote :

I'm not too familiar with the nova codebase, but I can learn ^.^.

I'll work off of the suggestions made here:
https://review.openstack.org/#/c/164762/9

If anyone could point me in the right direction to some more useful information, that would be great.

Changed in nova:
assignee: nobody → Miguel Alejandro Cantu (miguel-cantu)
Changed in nova:
status: Confirmed → In Progress
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: Miguel Alejandro Cantu (miguel-cantu) → nobody