Nodes stuck on power state transitions

Bug #1588901 reported by Lucas Alvares Gomes
This bug affects 1 person
Affects: Ironic
Status: Fix Released
Importance: High
Assigned to: Dmitry Tantsur

Bug Description

Reported internally at https://bugzilla.redhat.com/show_bug.cgi?id=1342581 (see comments).

If the conductor managing a node dies in the middle of a power state transition, the node keeps its "reservation" and "target_power_state" fields set indefinitely, because Ironic has no mechanism (such as a periodic task) to time out nodes that are stuck in a power state transition.

Workaround(s)
=============

* While not ideal, operators can (re)start a conductor service with the same hostname as the conductor that was managing the node, and it will clean up the locks.

* Manually clearing the stuck fields ("reservation" and "target_power_state") in the database.

Proposed solution
=================

Just like we do for certain provision states (*WAIT), we should have a periodic task that would check for a timeout on power state.

In order to implement that we would need:

1. A "power_updated_at" field on nodes (analogous to the existing "provision_updated_at") that holds the time of the last power state change.

2. A periodic task that queries nodes that are reserved by a conductor which is not currently online and that have the "target_power_state" field set. Based on the value of the "power_updated_at" field we will know whether the transition has timed out or not.

The timeout (in seconds or minutes) should be exposed as a config option.

Warning: a possible problem here is how one conductor cleans up the reservation held by another conductor. We may need *something* here.
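
The selection step of the proposed periodic task could look roughly like the sketch below. This is only an illustration of the idea, not Ironic code: the Node class, the find_timed_out_nodes helper, and the way online conductors and the timeout are passed in are all assumptions made for the example, and "power_updated_at" is the field proposed above, which does not exist yet.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Set


@dataclass
class Node:
    """Hypothetical stand-in for a node record; not the Ironic model."""
    uuid: str
    reservation: Optional[str]            # hostname of the conductor holding the lock
    target_power_state: Optional[str]     # set while a power transition is in flight
    power_updated_at: Optional[datetime]  # proposed field, does not exist yet


def find_timed_out_nodes(nodes: Iterable[Node],
                         online_conductors: Set[str],
                         timeout: timedelta,
                         now: datetime) -> List[Node]:
    """Return nodes whose power transition looks abandoned.

    A node qualifies when it is reserved by a conductor that is not
    online, has a target power state set, and its last power update
    is older than the configured timeout.
    """
    stuck = []
    for node in nodes:
        if node.reservation is None or node.target_power_state is None:
            continue  # no lock held or no transition in progress
        if node.reservation in online_conductors:
            continue  # the owning conductor is alive and may still finish
        if node.power_updated_at is None:
            continue  # no timestamp to compare against
        if now - node.power_updated_at > timeout:
            stuck.append(node)
    return stuck
```

What to do with the returned nodes (in particular, which conductor is allowed to clear another conductor's reservation) is exactly the open question raised in the warning above, so the sketch stops at selection.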

Outputs showing the error
=========================

http://paste.openstack.org/show/507728/

Changed in ironic:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
description: updated
Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

Marking as high because the workarounds are not ideal; plus, we should get better at avoiding deadlocks such as this.

description: updated
Changed in ironic:
importance: Undecided → High
status: New → Confirmed
description: updated
Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

A note here: we shouldn't call it "power_updated_at", because a field with that name should only be filled in when the power transition is done, not when it starts. "provision_updated_at" works for provision states because we have intermediate states such as DEPLOYING or DEPLOYWAIT, whereas for power we don't have anything like POWERING_ON or REBOOTING.

I would suggest calling the new field "power_transition_started_at".

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/327642

Changed in ironic:
status: Confirmed → In Progress
Revision history for this message
Matthias Runge (mrunge) wrote :

Hit by this via oooq -ovb

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (master)

Change abandoned by Lucas Alvares Gomes (<email address hidden>) on branch: master
Review: https://review.openstack.org/327642

Ruby Loo (rloo)
Changed in ironic:
status: In Progress → Triaged
Dmitry Tantsur (divius)
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Dmitry Tantsur (divius)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/546273

Changed in ironic:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/546656

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/546273
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=5694b98fc81032933b112528bf08cb6688fa7c1a
Submitter: Zuul
Branch: master

commit 5694b98fc81032933b112528bf08cb6688fa7c1a
Author: Dmitry Tantsur <email address hidden>
Date: Tue Feb 20 19:47:52 2018 +0100

    Rework logic handling reserved orphaned nodes in the conductor

    If a conductor dies while holding a reservation, the node can get
    stuck in its current state. Currently the conductor that takes
    over the node only cleans it up if it's in the DEPLOYING state.

    This change applies the same logic for all nodes:

    1. Reservation is cleared by the conductor that took over the node
       no matter what provision state.

    2. CLEANING is also aborted, nodes are moved to CLEAN FAIL with
       maintenance on.

    3. Target power state is cleared as well.

    The reservation is cleared even for nodes in maintenance mode,
    otherwise it's impossible to move them out of maintenance.

    Change-Id: I379c1335692046ca9423fda5ea68d2f10c065cb5
    Closes-Bug: #1588901
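
The three rules listed in the commit message above can be sketched as follows. This is a simplified illustration under assumptions, not the code from the merged change: the Node class, the release_orphaned_node helper, and the state strings are hypothetical stand-ins, and the pre-existing DEPLOYING handling is left out.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative state names only; the real constants live in Ironic.
CLEANING = "cleaning"
CLEAN_FAIL = "clean failed"


@dataclass
class Node:
    """Hypothetical stand-in for a node record; not the Ironic model."""
    provision_state: str
    reservation: Optional[str] = None
    target_power_state: Optional[str] = None
    maintenance: bool = False


def release_orphaned_node(node: Node, dead_conductor: str) -> bool:
    """Clean up a node whose reservation is held by a dead conductor.

    Returns True if the node was touched, False otherwise.
    """
    if node.reservation != dead_conductor:
        return False  # not held by the dead conductor, leave it alone

    # Rule 1: clear the reservation regardless of provision state
    # (and regardless of maintenance mode).
    node.reservation = None

    # Rule 2: abort cleaning and move to CLEAN FAIL with maintenance on.
    if node.provision_state == CLEANING:
        node.provision_state = CLEAN_FAIL
        node.maintenance = True

    # Rule 3: clear the target power state.
    node.target_power_state = None
    return True
```

Because the reservation is cleared before any state check, a node that was already in maintenance mode is released as well, which matches the last paragraph of the commit message.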

Changed in ironic:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/554202

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic (master)

Reviewed: https://review.openstack.org/546656
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=b93e5b05c43bd1ce23c7ffa85ee0ef1e8aa582ea
Submitter: Zuul
Branch: master

commit b93e5b05c43bd1ce23c7ffa85ee0ef1e8aa582ea
Author: Dmitry Tantsur <email address hidden>
Date: Wed Feb 21 15:58:05 2018 +0100

    Prevent overwriting of last_error on cleaning failures

    This change moves the call to tear_down_cleaning to before we set
    the last_error and maintenance_reason fields. Thus we avoid
    overwriting last_error by e.g. power actions.

    Related-Bug: #1588901
    Change-Id: Ia448431a2922ea6f7adc27065dbcab1ba8358daa

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/556836

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic (stable/queens)

Reviewed: https://review.openstack.org/556836
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=26f1e784daa5210f0ca13b2aa49c8462a25bdf32
Submitter: Zuul
Branch: stable/queens

commit 26f1e784daa5210f0ca13b2aa49c8462a25bdf32
Author: Dmitry Tantsur <email address hidden>
Date: Wed Feb 21 15:58:05 2018 +0100

    Prevent overwriting of last_error on cleaning failures

    This change moves the call to tear_down_cleaning to before we set
    the last_error and maintenance_reason fields. Thus we avoid
    overwriting last_error by e.g. power actions.

    Related-Bug: #1588901
    Change-Id: Ia448431a2922ea6f7adc27065dbcab1ba8358daa
    (cherry picked from commit b93e5b05c43bd1ce23c7ffa85ee0ef1e8aa582ea)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (stable/queens)

Reviewed: https://review.openstack.org/554202
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=a90b999a2d8afec0476dbd82d4eb2578ece313d0
Submitter: Zuul
Branch: stable/queens

commit a90b999a2d8afec0476dbd82d4eb2578ece313d0
Author: Dmitry Tantsur <email address hidden>
Date: Tue Feb 20 19:47:52 2018 +0100

    Rework logic handling reserved orphaned nodes in the conductor

    If a conductor dies while holding a reservation, the node can get
    stuck in its current state. Currently the conductor that takes
    over the node only cleans it up if it's in the DEPLOYING state.

    This change applies the same logic for all nodes:

    1. Reservation is cleared by the conductor that took over the node
       no matter what provision state.

    2. CLEANING is also aborted, nodes are moved to CLEAN FAIL with
       maintenance on.

    3. Target power state is cleared as well.

    The reservation is cleared even for nodes in maintenance mode,
    otherwise it's impossible to move them out of maintenance.

    Change-Id: I379c1335692046ca9423fda5ea68d2f10c065cb5
    Closes-Bug: #1588901
    (cherry picked from commit 5694b98fc81032933b112528bf08cb6688fa7c1a)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic 10.1.2

This issue was fixed in the openstack/ironic 10.1.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/ironic 11.0.0

This issue was fixed in the openstack/ironic 11.0.0 release.
