Event execution failures for back to back leases

Bug #1785841 reported by Pierre Riteau
This bug affects 1 person
Affects: Blazar
Status: Fix Released
Importance: High
Assigned to: Pierre Riteau

Bug Description

If two leases have at least one compute host in common, and the second lease starts when the first lease ends, there is the possibility of a race. The Blazar manager can first run the start_lease event of the second lease. This would fail since the end_lease event of the first lease wouldn't have been run yet: the compute host(s) in common would still be in aggregate(s) associated with the first lease, instead of being in the freepool.
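The race can be reproduced with a toy model. This is a minimal sketch, not Blazar code: the `Pools` class, the `HostNotInFreepool` exception, and the `run` helper are hypothetical names used only for illustration. It assumes a host is either in the freepool or in exactly one lease's aggregate.

```python
class HostNotInFreepool(Exception):
    """Raised when a host is still held by another lease's aggregate."""


class Pools:
    """Toy model: each host is in the freepool or in one lease's aggregate."""

    def __init__(self, hosts):
        self.location = {host: "freepool" for host in hosts}

    def start_lease(self, lease, hosts):
        # A lease can only claim hosts that are currently in the freepool.
        for host in hosts:
            if self.location[host] != "freepool":
                raise HostNotInFreepool(
                    f"{host} is still in {self.location[host]}")
        for host in hosts:
            self.location[host] = lease

    def end_lease(self, lease):
        # Return every host held by this lease to the freepool.
        for host, where in self.location.items():
            if where == lease:
                self.location[host] = "freepool"


def run(events, pools):
    """Run (event_type, lease, hosts) tuples in order; return failed events."""
    failed = []
    for event_type, lease, hosts in events:
        try:
            if event_type == "start_lease":
                pools.start_lease(lease, hosts)
            else:
                pools.end_lease(lease)
        except HostNotInFreepool:
            failed.append((event_type, lease))
    return failed
```

With `host1` held by a first lease, running the second lease's `start_lease` before the first lease's `end_lease` fails, while the opposite order succeeds.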

Pierre Riteau (priteau)
Changed in blazar:
assignee: nobody → Pierre Riteau (priteau)
importance: Undecided → High
Revision history for this message
Masahito Muroi (muroi-masahito) wrote :

I believe one purpose of the cleaning time blueprint is to resolve this issue. The feature was first motivated by an Ironic use case, but it should fix this issue as well.

https://blueprints.launchpad.net/blazar/+spec/cleaning-time-allowance

Revision history for this message
Masahito Muroi (muroi-masahito) wrote :

Oops, I mean we've already fixed the bug and don't need to worry about it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to blazar (master)

Fix proposed to branch: master
Review: https://review.openstack.org/589899

Changed in blazar:
status: New → In Progress
Revision history for this message
Pierre Riteau (priteau) wrote :

Cleaning time would help, but it's not enabled by default. Please check my patch instead: it fixes other issues, such as running before_end_lease after start_lease has completed.

Revision history for this message
Pierre Riteau (priteau) wrote :

I will push an updated patch.

Pierre Riteau (priteau)
Changed in blazar:
milestone: none → stein-1
Pierre Riteau (priteau)
Changed in blazar:
milestone: stein-1 → stein-2
Pierre Riteau (priteau)
Changed in blazar:
milestone: stein-2 → stein-3
Pierre Riteau (priteau)
Changed in blazar:
milestone: stein-3 → train-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to blazar (master)

Reviewed: https://review.opendev.org/c/openstack/blazar/+/589899
Committed: https://opendev.org/openstack/blazar/commit/c92edb8a177de51862ad2a4f9cbac2c50d31ef84
Submitter: "Zuul (22348)"
Branch: master

commit c92edb8a177de51862ad2a4f9cbac2c50d31ef84
Author: Pierre Riteau <email address hidden>
Date: Wed Aug 8 12:46:28 2018 +0200

    Prevent conflicting events from running concurrently

    If two leases have compute hosts in common, and the second lease starts
    exactly when the first lease ends, there is the possibility of a race.
    The Blazar manager can first run the start_lease event of the second
    lease. This event would fail since the end_lease event of the first
    lease would still be UNDONE, and the compute hosts in common would still
    be in the aggregate associated with the first lease, instead of being in
    the freepool.

    This patch changes event execution code so that events are executed
    concurrently if possible, with the following constraints:

    - events are executed strictly in order, i.e. events are started only
      after all previous events have completed
    - when events are at the same time, we first execute before_end_lease
      events (unless there is a start_lease at the same time), then
      end_lease events, followed by start_lease events, ensuring the bug
      described above does not happen. Finally, we run any before_end_lease
      which had a corresponding start_lease event at the same time.

    It also has the side effect of providing better stack traces for event
    execution failures, since we call wait() on all GreenThread objects.

    Co-Authored-By: Jason Anderson <email address hidden>
    Change-Id: Ie2339db18e8baee379fbea082f1238ec44fca6b1
    Closes-Bug: #1785841
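
The ordering constraints in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the merged patch: the function names `batches_for_timestamp` and `execute` are hypothetical, events are modelled as plain tuples, "corresponding start_lease" is read as "a start_lease for the same lease", and a standard-library thread pool stands in for the GreenThread objects the commit message mentions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def batches_for_timestamp(events):
    """Split (event_type, lease_id) events sharing one timestamp into batches.

    Order: before_end_lease (no start_lease for that lease at this time),
    then end_lease, then start_lease, then the remaining before_end_lease.
    """
    starting = {lease for etype, lease in events if etype == "start_lease"}
    early = [e for e in events
             if e[0] == "before_end_lease" and e[1] not in starting]
    ends = [e for e in events if e[0] == "end_lease"]
    starts = [e for e in events if e[0] == "start_lease"]
    late = [e for e in events
            if e[0] == "before_end_lease" and e[1] in starting]
    return [batch for batch in (early, ends, starts, late) if batch]


def execute(timed_events, handler, max_workers=4):
    """Run (time, event_type, lease_id) events in time order.

    Events within a batch run concurrently; the next batch starts only
    after every event of the previous batch has completed.
    """
    by_time = defaultdict(list)
    for when, etype, lease in timed_events:
        by_time[when].append((etype, lease))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for when in sorted(by_time):
            for batch in batches_for_timestamp(by_time[when]):
                futures = [pool.submit(handler, etype, lease)
                           for etype, lease in batch]
                for future in futures:
                    # result() blocks until the event is done and re-raises
                    # any exception from the handler with its traceback.
                    future.result()
```

Waiting on every submitted task is what surfaces handler exceptions to the caller, which mirrors the commit message's remark about better stack traces from calling wait().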

Changed in blazar:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to blazar (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/blazar/+/831509

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/blazar 9.0.0.0rc1

This issue was fixed in the openstack/blazar 9.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to blazar (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/blazar/+/831509
Committed: https://opendev.org/openstack/blazar/commit/c3b851937ebf110184ef46d7bc2ad42e163f92b1
Submitter: "Zuul (22348)"
Branch: stable/xena

commit c3b851937ebf110184ef46d7bc2ad42e163f92b1
Author: Pierre Riteau <email address hidden>
Date: Wed Aug 8 12:46:28 2018 +0200

    Prevent conflicting events from running concurrently

    Co-Authored-By: Jason Anderson <email address hidden>
    Change-Id: Ie2339db18e8baee379fbea082f1238ec44fca6b1
    Closes-Bug: #1785841
    (cherry picked from commit c92edb8a177de51862ad2a4f9cbac2c50d31ef84)

tags: added: in-stable-xena
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/blazar 8.0.1

This issue was fixed in the openstack/blazar 8.0.1 release.
