Node power while in "cleaning" is unmanageable, leaves no way to retry cleaning

Bug #1455825 reported by aeva black
This bug affects 2 people
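+---------+--------------+------------+---------------------+-----------+
| Affects | Status       | Importance | Assigned to         | Milestone |
+---------+--------------+------------+---------------------+-----------+
| Ironic  | Fix Released | High       | Lucas Alvares Gomes |           |
+---------+--------------+------------+---------------------+-----------+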

Bug Description

While testing my NUC, the node got into a state where Ironic believed cleaning was in progress, but the node was actually booted into the instance (from local disk). While in this situation, I could not manage the node via Ironic; the log of my attempts is below. The only way out was a manual reboot of the hardware, outside of Ironic. I did not see any errors in ironic-conductor.log during this whole time.

Perhaps we should allow power state changes while a node is in cleaning state?

 ironic node-list
+--------------------------------------+------+---------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+------+---------------+-------------+--------------------+-------------+
| a8cb6624-0d9f-c882-affc-046ebb96ec01 | None | None | power on | cleaning | False |
+--------------------------------------+------+---------------+-------------+--------------------+-------------+

 ironic node-set-provision-state a8cb6624-0d9f-c882-affc-046ebb96ec01 deleted
The requested action "deleted" can not be performed on node "a8cb6624-0d9f-c882-affc-046ebb96ec01" while it is in state "cleaning". (HTTP 400)

 ironic node-set-power-state a8cb6624-0d9f-c882-affc-046ebb96ec01 off
The requested action "power off" can not be performed on node "a8cb6624-0d9f-c882-affc-046ebb96ec01" while it is in state "cleaning". (HTTP 400)

 ironic node-set-provision-state a8cb6624-0d9f-c882-affc-046ebb96ec01 manage
The requested action "manage" can not be performed on node "a8cb6624-0d9f-c882-affc-046ebb96ec01" while it is in state "cleaning". (HTTP 400)

 ironic node-set-power-state a8cb6624-0d9f-c882-affc-046ebb96ec01 reboot
The requested action "rebooting" can not be performed on node "a8cb6624-0d9f-c882-affc-046ebb96ec01" while it is in state "cleaning". (HTTP 400)

aeva black (tenbrae)
Changed in ironic:
importance: Undecided → Medium
assignee: nobody → Jim Rollenhagen (jim-rollenhagen)
importance: Medium → High
Revision history for this message
Dmitry Tantsur (divius) wrote :

Allowing changing power state in cleaning is a good way to break it :) also, IIRC we don't allow changing power state e.g. in deploy, do we?

maybe instead we should make cleaning retry itself?

Changed in ironic:
status: New → Incomplete
Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :

Why am I assigned to this?

I'm thinking clean steps need a timeout that fails cleaning. However some steps may be variable, like shredding spinning rust. Hm.

Alternatively, we could have an endpoint with a really scary name that can do some of these things.

Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

I believe that the solution for this is to make cleaning just like the deploy. The cleaning should be composed of CLEANWAIT and CLEANING states.

When a node is in CLEANWAIT, it means that the ramdisk running on the node is doing some work, and it could be abortable via Ironic's API. When the conductor is doing some work, the node should be in the CLEANING state, which should be checked periodically to see whether the conductor doing the work is still up and running.
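
A minimal sketch of the split proposed above, in plain Python. The enum, its string values, and the `may_accept_abort` helper are illustrative names chosen for this example, not Ironic's actual code; the only point is that an abort request would be considered in the waiting state and not while the conductor is working synchronously.

```python
from enum import Enum


class ProvisionState(Enum):
    """Illustrative subset of provision states (names assumed for this sketch)."""
    CLEANING = "cleaning"       # conductor is executing a clean step (sync)
    CLEANWAIT = "clean wait"    # ramdisk is executing a clean step (async)
    CLEANFAIL = "clean failed"  # cleaning failed or was aborted


def may_accept_abort(state: ProvisionState) -> bool:
    """Return True only for the state in which the ramdisk is doing the work."""
    return state is ProvisionState.CLEANWAIT
```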

Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :

Lucas, that's probably the right way to do things, BUT! I think we'll need some sort of "abortable" flag on clean steps. For instance, you don't want to abort cleaning while a BIOS upgrade is happening :)

Changed in ironic:
assignee: Jim Rollenhagen (jim-rollenhagen) → nobody
Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

Jim, +1

I would say we need the clean_step decorator to have a parameter such as "abortable=True/False". Then, when the request to abort comes in, we can get the current clean step from Node.clean_step and check whether or not we can abort the operation.
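
The decorator idea could be sketched roughly as follows. This is a standalone stand-in written for illustration, not the decorator that ships in Ironic; the attribute names (`_clean_step_abortable` and friends), the `ExampleDriver` class, and the `can_abort` helper are all hypothetical.

```python
import functools


def clean_step(priority, abortable=False):
    """Mark a method as a clean step (hypothetical stand-in, not Ironic's code)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        # Record the metadata on the function so it can be inspected later.
        wrapper._is_clean_step = True
        wrapper._clean_step_priority = priority
        wrapper._clean_step_abortable = abortable
        return wrapper
    return decorator


class ExampleDriver:
    @clean_step(priority=10, abortable=True)
    def erase_devices(self):
        return "erasing"

    @clean_step(priority=20)  # not abortable unless stated explicitly
    def upgrade_bios(self):
        return "flashing"


def can_abort(driver, step_name):
    """Look up the running step by name and report whether it may be interrupted."""
    method = getattr(driver, step_name)
    return getattr(method, "_clean_step_abortable", False)
```

Defaulting `abortable` to False keeps the dangerous case (for example a firmware flash) safe unless a step author opts in.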

Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

I want to investigate what is needed... curiosity!

Changed in ironic:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/200152

Changed in ironic:
status: Incomplete → In Progress
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Jim Rollenhagen (jim-rollenhagen)
Changed in ironic:
assignee: Jim Rollenhagen (jim-rollenhagen) → Lucas Alvares Gomes (lucasagomes)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/201552

Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → John L. Villalovos (happycamp)
Changed in ironic:
assignee: John L. Villalovos (happycamp) → Lucas Alvares Gomes (lucasagomes)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/200152
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=b6ed09e297e84f6cc1ded13e9d9869d2effb3833
Submitter: Jenkins
Branch: master

commit b6ed09e297e84f6cc1ded13e9d9869d2effb3833
Author: Lucas Alvares Gomes <email address hidden>
Date: Thu Jul 9 16:29:12 2015 +0100

    Add CLEANWAIT state

    This patch adds the CLEANWAIT state. When a node is in CLEANWAIT it means
    that the ramdisk is executing a clean step (async). When the node is
    in CLEANING state it means that the conductor is executing a clean step
    (sync).

    This is the first patch of a series that aims to make nodes in CLEANWAIT
    abortable. We still need some way to tell if a step is
    abortable; aborting steps could have negative effects such as bricking
    things.

    Depends-On: I195ecd90e7e4165504da5ac330cee3fc7c3039c2
    Co-Authored-By: Jim Rollenhagen <email address hidden>
    Partial-Bug: #1455825
    Change-Id: Ic2bc4f147f68947f53d341fda5e0c8d7b594a553

Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :

What's left to do here?

Revision history for this message
Lucas Alvares Gomes (lucasagomes) wrote :

We are still not able to abort nodes in CLEANWAIT, though now at least they have a timeout.

Is it something we want to do as part of this bug? Or maybe it will require a spec? (We probably will need to introduce a new verb in the API)

Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Jay Faulkner (jason-oldos)
Changed in ironic:
assignee: Jay Faulkner (jason-oldos) → Lucas Alvares Gomes (lucasagomes)
Changed in ironic:
milestone: none → 4.2.0
Revision history for this message
Ruby Loo (rloo) wrote :

Patch to allow abort for nodes in CLEANWAIT provision state: https://review.openstack.org/#/c/201552/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/201552
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=795b5e37caddf375f92248cbae084b8a20dccd52
Submitter: Jenkins
Branch: master

commit 795b5e37caddf375f92248cbae084b8a20dccd52
Author: Lucas Alvares Gomes <email address hidden>
Date: Mon Sep 7 10:04:18 2015 +0100

    Allow abort for CLEANWAIT states

    This patch allows a node in CLEANWAIT to be aborted.

    The @clean_step decorator has been extended to accept a new parameter
    "abortable". This parameter defaults to False, since a clean step
    could potentially brick a machine, so we had better make it explicit.

    If the clean step is abortable, the process of aborting will happen
    immediately; if the clean step is not abortable the abortion will happen
    as soon as the clean step is done. If a clean step is marked to have
    the abortion done after its completion but it is the final clean step in
    the cleaning operation the cleaning process will just finish successfully.

    A new verb 'abort' is being added to the API and the microversion is
    being bumped to 1.13.

    Closes-Bug: #1455825
    Change-Id: Ia6846c048b3dab44a8280366a7305aca1d3eb783
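
The three outcomes described in this commit message can be sketched as a small decision function. The `Step` dataclass, the `handle_abort` name, and the returned strings are assumptions made for illustration; the real conductor logic is more involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str
    abortable: bool = False  # mirrors the decorator default described above


def handle_abort(steps: List[Step], current_index: int) -> str:
    """Decide what an abort request does while steps[current_index] is running."""
    step = steps[current_index]
    if step.abortable:
        # Abortable step: the abort happens immediately.
        return "abort now"
    if current_index == len(steps) - 1:
        # Non-abortable final step: cleaning simply finishes successfully.
        return "finish cleaning"
    # Non-abortable step with more steps queued: abort once it completes.
    return "abort after step"
```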

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
status: Fix Committed → Fix Released
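
For readers who want to try the new verb, the sketch below builds (but does not send) an abort request. Only the 'abort' verb and microversion 1.13 come from this bug; the endpoint path, the `X-OpenStack-Ironic-API-Version` header, the `X-Auth-Token` header, and the example URL are assumptions based on how provision-state changes are normally requested, so check them against the API reference for your release.

```python
import json
import urllib.request


def build_abort_request(api_url, node_uuid, token=None):
    """Build a PUT request asking Ironic to abort cleaning on a node."""
    body = json.dumps({"target": "abort"}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # The 'abort' verb needs at least microversion 1.13.
        "X-OpenStack-Ironic-API-Version": "1.13",
    }
    if token:
        headers["X-Auth-Token"] = token  # assumed Keystone token header
    return urllib.request.Request(
        f"{api_url}/v1/nodes/{node_uuid}/states/provision",
        data=body,
        headers=headers,
        method="PUT",
    )
```

Passing the result to `urllib.request.urlopen` would send it; the node has to be in CLEANWAIT for the request to be accepted.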
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic-python-agent (master)

Reviewed: https://review.openstack.org/202137
Committed: https://git.openstack.org/cgit/openstack/ironic-python-agent/commit/?id=cd70f514d6615d954dcc7e80c6438078a945b46e
Submitter: Jenkins
Branch: master

commit cd70f514d6615d954dcc7e80c6438078a945b46e
Author: Lucas Alvares Gomes <email address hidden>
Date: Wed Jul 15 15:23:45 2015 +0100

    Make the erase_devices clean step abortable

    This patch updates the get_clean_steps() method to make the
    erase_devices step abortable. Erasing devices is something that can be
    cancelled without damaging the machine.

    When a clean step is aborted the provision state of the Ironic node
    will go to CLEANFAIL state. The operator can then do what is needed to
    fix the problem (e.g. network booting issues) and restart the cleaning
    later on.

    Partial-Bug: #1455825
    Change-Id: Ic181ac3712810c6f6925e8b627ee79e77ecf4d83
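
As a rough illustration of what this change amounts to on the agent side, a hardware manager could advertise the flag in the step list it returns. The dictionary keys other than "step" and "abortable", and the priority value, are assumptions for this sketch and not taken from the patch itself.

```python
def get_clean_steps():
    """Return the clean steps this (hypothetical) hardware manager offers."""
    return [
        {
            "step": "erase_devices",
            "priority": 10,             # assumed value for illustration
            "interface": "deploy",      # assumed key
            "reboot_requested": False,  # assumed key
            "abortable": True,          # erasing disks can be cancelled safely
        },
    ]


def abortable_step_names(steps):
    """List the names of steps that may be interrupted mid-run."""
    return [s["step"] for s in steps if s.get("abortable", False)]
```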
