Ironic

Nodes stuck on power state transitions

Bug #1588901 reported by Lucas Alvares Gomes on 2016-06-03


Affects	Status	Importance	Assigned to	Milestone
Ironic	Fix Released	High	Dmitry Tantsur

Bug Description

Reported internally at https://bugzilla.redhat.com/show_bug.cgi?id=1342581 (see comments).

If the conductor managing a node dies mid power state transition, that node will have the "reservation" and "target_power_state" fields set indefinitely because Ironic does not have a mechanism (periodic task) to time out nodes based on power state transitions.

Workaround(s)
=============

* While not ideal, operators can (re)start a conductor service with the same hostname that was managing that node and it will clean up the locks.

* Changing the database manually

Proposed solution
=================

Just like we do for certain provision states (*WAIT), we should have a periodic task that would check for a timeout on power state.

In order to implement that we would need:

1. A "power_updated_at" field in the nodes (we do have a "provision_updated_at") which will have the time of the last power state change.

2. A periodic task that queries nodes that are reserved by a conductor which is not currently online and that have the target_power_state field set. The value of the "power_updated_at" field then tells us whether the transition has timed out.

The number of seconds/minutes to wait before timing out should be a config option.

Warning: one possible problem is how one conductor clears a reservation held by another conductor. We may need *something* here.
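
The check that such a periodic task would perform could look roughly like the sketch below. It is only an illustration of the proposal, not Ironic code: the `Node` dataclass, the `find_timed_out_nodes` helper, and the `POWER_TRANSITION_TIMEOUT` default are hypothetical names, and only the "reservation", "target_power_state" and "power_updated_at" field names come from the description above. It deliberately leaves open the question of how another conductor would then release the lock.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Set

# Hypothetical default; the proposal only says the value should be configurable.
POWER_TRANSITION_TIMEOUT = timedelta(seconds=60)


@dataclass
class Node:
    """Simplified stand-in for a node record (not the real Ironic object)."""
    uuid: str
    reservation: Optional[str]            # hostname of the conductor holding the lock
    target_power_state: Optional[str]     # set while a power transition is pending
    power_updated_at: Optional[datetime]  # field proposed in the description


def find_timed_out_nodes(
    nodes: Iterable[Node],
    online_conductors: Set[str],
    now: datetime,
    timeout: timedelta = POWER_TRANSITION_TIMEOUT,
) -> List[Node]:
    """Return nodes whose power transition is stuck on an offline conductor."""
    stuck = []
    for node in nodes:
        # Only nodes that are locked and in the middle of a power transition.
        if node.reservation is None or node.target_power_state is None:
            continue
        # A conductor that is still online is expected to finish the job itself.
        if node.reservation in online_conductors:
            continue
        # Without a timestamp we cannot tell how long the transition has lasted.
        if node.power_updated_at is None:
            continue
        if now - node.power_updated_at > timeout:
            stuck.append(node)
    return stuck
```

A periodic task would call this helper with the current list of online conductors and then decide what to do with the returned nodes.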

Outputs showing the error
=========================

http://paste.openstack.org/show/507728/


Lucas Alvares Gomes (lucasagomes) on 2016-06-03

Changed in ironic:
assignee:	nobody → Lucas Alvares Gomes (lucasagomes)
description:	updated

Lucas Alvares Gomes (lucasagomes) wrote on 2016-06-03:

Marking as high because the workarounds are not ideal; also, we should get better at avoiding deadlocks such as this.

description:	updated
Changed in ironic:
importance:	Undecided → High
status:	New → Confirmed

Lucas Alvares Gomes (lucasagomes) on 2016-06-07

description:	updated

Lucas Alvares Gomes (lucasagomes) wrote on 2016-06-09:

A note here: we shouldn't call it "power_updated_at", because a field with that name should only be filled in when the power transition is done, not when it starts. The "provision_updated_at" field works for provision states because we have intermediate states such as DEPLOYING or DEPLOYWAIT, whereas for power we don't have anything like POWERING_ON or REBOOTING.

I would suggest the new field to be called "power_transition_started_at".


OpenStack Infra (hudson-openstack) wrote on 2016-06-09: Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/327642

Changed in ironic:
status:	Confirmed → In Progress


Matthias Runge (mrunge) wrote on 2017-06-15:

Hit by this via oooq -ovb


OpenStack Infra (hudson-openstack) wrote on 2017-07-12: Change abandoned on ironic (master)

Change abandoned by Lucas Alvares Gomes (<email address hidden>) on branch: master
Review: https://review.openstack.org/327642

Ruby Loo (rloo) on 2017-10-18

Changed in ironic:
status:	In Progress → Triaged

Dmitry Tantsur (divius) on 2018-02-19

Changed in ironic:
assignee:	Lucas Alvares Gomes (lucasagomes) → Dmitry Tantsur (divius)


OpenStack Infra (hudson-openstack) wrote on 2018-02-20: Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/546273

Changed in ironic:
status:	Triaged → In Progress


OpenStack Infra (hudson-openstack) wrote on 2018-02-21: Related fix proposed to ironic (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/546656


OpenStack Infra (hudson-openstack) wrote on 2018-03-16: Fix merged to ironic (master)

Reviewed: https://review.openstack.org/546273
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=5694b98fc81032933b112528bf08cb6688fa7c1a
Submitter: Zuul
Branch: master

commit 5694b98fc81032933b112528bf08cb6688fa7c1a
Author: Dmitry Tantsur <email address hidden>
Date: Tue Feb 20 19:47:52 2018 +0100

Rework logic handling reserved orphaned nodes in the conductor

    If a conductor dies while holding a reservation, the node can get
    stuck in its current state. Currently the conductor that takes
    over the node only cleans it up if it's in the DEPLOYING state.

    This change applies the same logic for all nodes:

    1. Reservation is cleared by the conductor that took over the node
       no matter what provision state.

    2. CLEANING is also aborted, nodes are moved to CLEAN FAIL with
       maintenance on.

    3. Target power state is cleared as well.

    The reservation is cleared even for nodes in maintenance mode,
    otherwise it's impossible to move them out of maintenance.

    Change-Id: I379c1335692046ca9423fda5ea68d2f10c065cb5
    Closes-Bug: #1588901
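
As a rough illustration of the three rules in this commit message, the takeover step might be sketched as follows. This is not the code from the merged change: the `Node` dataclass and `release_orphaned_node` are hypothetical, the state strings are used here only as labels for CLEANING and CLEAN FAIL, and the pre-existing handling of DEPLOYING is left out because the message does not spell it out.

```python
from dataclasses import dataclass
from typing import Optional

# Labels for the provision states named in the commit message (illustrative).
CLEANING = "cleaning"
CLEANFAIL = "clean failed"


@dataclass
class Node:
    """Simplified stand-in for a node record (not the real Ironic object)."""
    uuid: str
    provision_state: str
    reservation: Optional[str] = None
    target_power_state: Optional[str] = None
    maintenance: bool = False


def release_orphaned_node(node: Node, dead_conductor: str) -> bool:
    """Clean up a node whose lock is held by a conductor that has died.

    Returns True if the node was touched, False if it was not orphaned.
    """
    if node.reservation != dead_conductor:
        return False

    # 1. Clear the reservation regardless of provision state, and regardless
    #    of maintenance mode (otherwise the node could never leave maintenance).
    node.reservation = None

    # 2. Abort CLEANING: move the node to CLEAN FAIL with maintenance on.
    if node.provision_state == CLEANING:
        node.provision_state = CLEANFAIL
        node.maintenance = True

    # 3. Clear the pending target power state as well.
    node.target_power_state = None
    return True
```

Note that the function never checks `node.maintenance` before clearing the reservation, which mirrors the last paragraph of the commit message.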

Changed in ironic:
status:	In Progress → Fix Released


OpenStack Infra (hudson-openstack) wrote on 2018-03-19: Fix proposed to ironic (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/554202


OpenStack Infra (hudson-openstack) wrote on 2018-03-21: Related fix merged to ironic (master)

#10

Reviewed: https://review.openstack.org/546656
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=b93e5b05c43bd1ce23c7ffa85ee0ef1e8aa582ea
Submitter: Zuul
Branch: master

commit b93e5b05c43bd1ce23c7ffa85ee0ef1e8aa582ea
Author: Dmitry Tantsur <email address hidden>
Date: Wed Feb 21 15:58:05 2018 +0100

Prevent overwriting of last_error on cleaning failures

    This change moves the call to tear_down_cleaning to before we set
    the last_error and maintenance_reason fields. Thus we avoid
    overwriting last_error by e.g. power actions.

    Related-Bug: #1588901
    Change-Id: Ia448431a2922ea6f7adc27065dbcab1ba8358daa
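
The ordering issue described in this commit message can be shown with a small toy model. Everything here is hypothetical and simplified: `tear_down_cleaning` is a stand-in that just simulates a power action overwriting last_error, and the two `fail_cleaning_*` helpers exist only to contrast the old and new call order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Simplified stand-in for a node record (not the real Ironic object)."""
    last_error: Optional[str] = None
    maintenance: bool = False
    maintenance_reason: Optional[str] = None


def tear_down_cleaning(node: Node) -> None:
    """Stand-in teardown: simulates a power action that records its own error."""
    node.last_error = "power action failed"


def fail_cleaning_old_order(node: Node, msg: str) -> None:
    """Old order: set the error fields first, then tear down."""
    node.last_error = msg
    node.maintenance = True
    node.maintenance_reason = msg
    tear_down_cleaning(node)  # overwrites last_error set above


def fail_cleaning_new_order(node: Node, msg: str) -> None:
    """New order: tear down first, then set the error fields."""
    tear_down_cleaning(node)
    node.last_error = msg     # the cleaning error is what the operator sees
    node.maintenance = True
    node.maintenance_reason = msg
```

With the old order the cleaning failure message is lost from last_error; with the new order it is the last value written.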


OpenStack Infra (hudson-openstack) wrote on 2018-03-27: Related fix proposed to ironic (stable/queens)

#11

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/556836


OpenStack Infra (hudson-openstack) wrote on 2018-03-30: Related fix merged to ironic (stable/queens)

#12

Reviewed: https://review.openstack.org/556836
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=26f1e784daa5210f0ca13b2aa49c8462a25bdf32
Submitter: Zuul
Branch: stable/queens

commit 26f1e784daa5210f0ca13b2aa49c8462a25bdf32
Author: Dmitry Tantsur <email address hidden>
Date: Wed Feb 21 15:58:05 2018 +0100

Prevent overwriting of last_error on cleaning failures

    This change moves the call to tear_down_cleaning to before we set
    the last_error and maintenance_reason fields. Thus we avoid
    overwriting last_error by e.g. power actions.

    Related-Bug: #1588901
    Change-Id: Ia448431a2922ea6f7adc27065dbcab1ba8358daa
    (cherry picked from commit b93e5b05c43bd1ce23c7ffa85ee0ef1e8aa582ea)

tags:

added: in-stable-queens


OpenStack Infra (hudson-openstack) wrote on 2018-03-30: Fix merged to ironic (stable/queens)

#13

Reviewed: https://review.openstack.org/554202
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=a90b999a2d8afec0476dbd82d4eb2578ece313d0
Submitter: Zuul
Branch: stable/queens

commit a90b999a2d8afec0476dbd82d4eb2578ece313d0
Author: Dmitry Tantsur <email address hidden>
Date: Tue Feb 20 19:47:52 2018 +0100

Rework logic handling reserved orphaned nodes in the conductor

    This change applies the same logic for all nodes:

    1. Reservation is cleared by the conductor that took over the node
       no matter what provision state.

    2. CLEANING is also aborted, nodes are moved to CLEAN FAIL with
       maintenance on.

    3. Target power state is cleared as well.

    The reservation is cleared even for nodes in maintenance mode,
    otherwise it's impossible to move them out of maintenance.

    Change-Id: I379c1335692046ca9423fda5ea68d2f10c065cb5
    Closes-Bug: #1588901
    (cherry picked from commit 5694b98fc81032933b112528bf08cb6688fa7c1a)