Ironic builds fail when landing on a cleaning node, it doesn't try to reschedule

Bug #1974070 reported by John Garbutt
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Low
Assigned to: John Garbutt

Bug Description

In a happy world, the reserved value in placement gets updated when a node is no longer available, so the scheduler doesn't pick that node and everyone is happy.

However, as is fairly well known, it takes a while for Nova to notice that a node has been marked as in maintenance, or that it has started cleaning because its instance has been deleted, so a build can still land on a node in a bad state.

This actually fails hard when setting the instance uuid, as expected here:
https://github.com/openstack/nova/blob/4939318649650b60dd07d161b80909e70d0e093e/nova/virt/ironic/driver.py#L378

You get a conflict error, as the ironic node is in a transitioning state (i.e. it's not actually available any more).

When people are busy rebuilding large numbers of nodes, they tend to hit this problem. Even when you only build when you know there are available nodes, you sometimes pick the ones you just deleted.

In an ideal world this would trigger a re-schedule, a bit like when you hit errors in the resource tracker such as ComputeResourcesUnavailable.
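To make the desired behaviour concrete, here is a minimal sketch, not nova code. It assumes a node is a plain dict, and NodeConflict, RescheduleBuild, reserve_node and spawn are hypothetical names invented for this illustration. The idea is only that a conflict while setting the instance uuid is translated into an error the caller can treat as "try another host".

```python
class NodeConflict(Exception):
    """Hypothetical stand-in for the conflict returned for a node in transition."""


class RescheduleBuild(Exception):
    """Hypothetical stand-in for an error that asks the caller to try another host."""


def reserve_node(node, instance_uuid):
    """Set the instance uuid on a node, refusing nodes that are not available."""
    if node["provision_state"] != "available" or node.get("instance_uuid"):
        raise NodeConflict(node["uuid"])
    node["instance_uuid"] = instance_uuid


def spawn(node, instance_uuid):
    """Turn a reservation conflict into a reschedule request, not a hard failure."""
    try:
        reserve_node(node, instance_uuid)
    except NodeConflict as exc:
        raise RescheduleBuild(f"node {exc} is not available") from exc
```

With this shape, a node that is still cleaning raises RescheduleBuild instead of failing the build outright, while an available node is reserved as before.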

Changed in nova:
status: New → In Progress
assignee: nobody → John Garbutt (johngarbutt)
importance: Undecided → Low

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/842478

tags: added: ironic
description: updated

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/864773

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/864773
Committed: https://opendev.org/openstack/nova/commit/3c022e968375c1b2eadf3c2dd7190b9434c6d4c1
Submitter: "Zuul (22348)"
Branch: master

commit 3c022e968375c1b2eadf3c2dd7190b9434c6d4c1
Author: John Garbutt <email address hidden>
Date: Wed Nov 16 17:12:40 2022 +0000

    Ironic nodes with instance reserved in placement

    Currently, when you delete an ironic instance, we trigger
    an undeploy in ironic and we release our allocation in placement.
    We do this well before the ironic node is actually available.

    We have attempted to fix this by marking unavailable nodes
    as reserved in placement. This works great until you try
    and re-image lots of nodes.

    It turns out that ironic nodes that are waiting for their automatic
    clean to finish are returned as valid allocation candidates
    for quite some time. Eventually we mark them as reserved.

    This patch takes a strange approach: if we mark all nodes as
    reserved as soon as the instance lands, we close the race.
    That is, when the allocation is removed the node is still
    unavailable until the next update of placement is done and
    notices that the node has become available. That may or may
    not have been after automatic cleaning. The trade off is
    that when you don't have automatic cleaning, we wait a bit
    longer to notice the node is available again.

    Note, this is also useful when a broken Ironic node is
    marked as in maintenance while it is in use by a nova
    instance. In a similar way, we mark the node as reserved
    immediately, rather than first waiting for the instance to be
    deleted before reserving the resources in Placement.

    Closes-Bug: #1974070
    Change-Id: Iab92124b5776a799c7f90d07281d28fcf191c8fe
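As an illustration of the approach described in this commit message, the following is a rough sketch under stated assumptions, not the merged change itself. The function reserved_units is hypothetical, a node is modelled as a plain dict, and the dict keys only mirror Ironic node fields. It shows the rule "reserved if the node is unavailable or already has an instance", which keeps the node out of allocation candidates until a later inventory update sees it available again.

```python
def reserved_units(node, total=1):
    """Return how many units of the node's inventory to report as reserved.

    Hypothetical helper: `node` is a plain dict, not an Ironic API object.
    """
    # A node holding an instance is reserved from the moment the instance lands.
    in_use = node.get("instance_uuid") is not None
    # A node in maintenance, or in any state other than "available", is unusable.
    unavailable = (
        node.get("maintenance", False)
        or node.get("provision_state") != "available"
    )
    return total if (in_use or unavailable) else 0
```

The trade-off mentioned above is visible here: once the instance is deleted, the node stays reserved until the next periodic update recomputes this value, even when no automatic cleaning happens.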

Changed in nova:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/nova/+/867642

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/842478
Committed: https://opendev.org/openstack/nova/commit/8a476061c5e034016668cd9e5a20c4430ef6b68d
Submitter: "Zuul (22348)"
Branch: master

commit 8a476061c5e034016668cd9e5a20c4430ef6b68d
Author: John Garbutt <email address hidden>
Date: Wed May 18 19:06:36 2022 +0100

    Ironic: retry when node not available

    After a baremetal instance is deleted, and its allocation is removed
    in placement, the ironic node might start cleaning. Eventually nova
    will notice and update the inventory to be reserved.
    During this window, a new instance may have already picked this
    ironic node.

    When that race happens today the build fails with an error:
    "Failed to reserve node ..."

    This change tries to ensure the remaining alternative hosts are
    attempted before aborting the build.
    Clearly the race is still there, but this makes it less painful.

    Related-Bug: #1974070
    Change-Id: Ie5cdc17219c86927ab3769605808cb9d9fa9fa4d
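The retry behaviour described in this commit message can be pictured roughly as below. This is only a sketch under assumptions: NodeUnavailable, build_with_alternates and the try_build callback are hypothetical names, and real alternate-host handling in nova involves more than a simple loop.

```python
class NodeUnavailable(Exception):
    """Hypothetical error raised when a node cannot be reserved."""


def build_with_alternates(candidates, try_build):
    """Try each candidate host in order; give up only when all have failed."""
    failures = []
    for host in candidates:
        try:
            return try_build(host)
        except NodeUnavailable as exc:
            # Remember why this host was skipped, then move to the next one.
            failures.append((host, str(exc)))
    raise RuntimeError(f"no candidate host could be used: {failures}")
```

The race is not removed by this: a node that is cleaning can still be selected first, but the build now falls through to the next candidate instead of aborting.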

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/867912

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/867338

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/xena)

Change abandoned by "Ruby Loo <email address hidden>" on branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/867338
Reason: not sure why, but my manual cherry pick is not quite the same as doing it via the UI...

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/867913

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/867914

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/zed)

Related fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/nova/+/867924

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/yoga)

Related fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/868010

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/868011

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/868012

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/nova/+/867642
Committed: https://opendev.org/openstack/nova/commit/c9de185ea1ac1e8d4435c5863b2ad7cefdb28c76
Submitter: "Zuul (22348)"
Branch: stable/zed

commit c9de185ea1ac1e8d4435c5863b2ad7cefdb28c76
Author: John Garbutt <email address hidden>
Date: Wed Nov 16 17:12:40 2022 +0000

    Ironic nodes with instance reserved in placement

    Currently, when you delete an ironic instance, we trigger
    an undeploy in ironic and we release our allocation in placement.
    We do this well before the ironic node is actually available.

    We have attempted to fix this by marking unavailable nodes
    as reserved in placement. This works great until you try
    and re-image lots of nodes.

    It turns out that ironic nodes that are waiting for their automatic
    clean to finish are returned as valid allocation candidates
    for quite some time. Eventually we mark them as reserved.

    This patch takes a strange approach: if we mark all nodes as
    reserved as soon as the instance lands, we close the race.
    That is, when the allocation is removed the node is still
    unavailable until the next update of placement is done and
    notices that the node has become available. That may or may
    not have been after automatic cleaning. The trade off is
    that when you don't have automatic cleaning, we wait a bit
    longer to notice the node is available again.

    Note, this is also useful when a broken Ironic node is
    marked as in maintenance while it is in use by a nova
    instance. In a similar way, we mark the node as reserved
    immediately, rather than first waiting for the instance to be
    deleted before reserving the resources in Placement.

    Closes-Bug: #1974070
    Change-Id: Iab92124b5776a799c7f90d07281d28fcf191c8fe
    (cherry picked from commit 3c022e968375c1b2eadf3c2dd7190b9434c6d4c1)

tags: added: in-stable-zed

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 26.1.0

This issue was fixed in the openstack/nova 26.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 27.0.0.0rc1

This issue was fixed in the openstack/nova 27.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/zed)

Reviewed: https://review.opendev.org/c/openstack/nova/+/867924
Committed: https://opendev.org/openstack/nova/commit/d71e9f6ec4933f9430db55537a36678b16ce895a
Submitter: "Zuul (22348)"
Branch: stable/zed

commit d71e9f6ec4933f9430db55537a36678b16ce895a
Author: John Garbutt <email address hidden>
Date: Wed May 18 19:06:36 2022 +0100

    Ironic: retry when node not available

    After a baremetal instance is deleted, and its allocation is removed
    in placement, the ironic node might start cleaning. Eventually nova
    will notice and update the inventory to be reserved.
    During this window, a new instance may have already picked this
    ironic node.

    When that race happens today the build fails with an error:
    "Failed to reserve node ..."

    This change tries to ensure the remaining alternative hosts are
    attempted before aborting the build.
    Clearly the race is still there, but this makes it less painful.

    Related-Bug: #1974070
    Change-Id: Ie5cdc17219c86927ab3769605808cb9d9fa9fa4d
    (cherry picked from commit 8a476061c5e034016668cd9e5a20c4430ef6b68d)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/nova/+/868010
Committed: https://opendev.org/openstack/nova/commit/b881dd25b4abb3c54934d8ebbccb2ac602c83177
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit b881dd25b4abb3c54934d8ebbccb2ac602c83177
Author: John Garbutt <email address hidden>
Date: Wed May 18 19:06:36 2022 +0100

    Ironic: retry when node not available

    After a baremetal instance is deleted, and its allocation is removed
    in placement, the ironic node might start cleaning. Eventually nova
    will notice and update the inventory to be reserved.
    During this window, a new instance may have already picked this
    ironic node.

    When that race happens today the build fails with an error:
    "Failed to reserve node ..."

    This change tries to ensure the remaining alternative hosts are
    attempted before aborting the build.
    Clearly the race is still there, but this makes it less painful.

    Related-Bug: #1974070
    Change-Id: Ie5cdc17219c86927ab3769605808cb9d9fa9fa4d
    (cherry picked from commit 8a476061c5e034016668cd9e5a20c4430ef6b68d)
    (cherry picked from commit d71e9f6ec4933f9430db55537a36678b16ce895a)

tags: added: in-stable-yoga

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/yoga)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/867912
Reason: stable/yoga branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/yoga if you want to further work on this patch.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/wallaby)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/868012
Reason: stable/wallaby branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/wallaby if you want to further work on this patch.

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/867914
Reason: stable/wallaby branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/wallaby if you want to further work on this patch.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/xena)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/868011
Reason: stable/xena branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/xena if you want to further work on this patch.

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/867913
Reason: stable/xena branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/xena if you want to further work on this patch.
