a failed power_state change wedges the power_state

Bug #1237688 reported by aeva black
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ironic
Fix Released
Critical
Ruby Loo

Bug Description

PUT'ing an invalid power state string to the API will wedge the node's power state. I suspect that any exception raised by a power_driver during set_node_power_state will wedge it right now, because it will leave target_power_state non-null, preventing a subsequent attempt to change the power_state.

There may be two separate issues at play here:
1) power_state and target_power_state being changed inappropriately for an invalid input
2) no way currently exists via the API to override a non-null target_power_state, even when the current power_state is ERROR.

Logs:

########### POST a bad string. POWEROFF should be "power off".

$ curl -X POST http://192.0.2.2:6385/v1/nodes/fbe3835c-0dfc-4ec0-a693-44a7b7dd4439/state/power?target=POWEROFF |json_pp
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 186 100 186 0 0 1426 0 --:--:-- --:--:-- --:--:-- 1441
{
   "error_message" : "<html>\n <head>\n <title>404 Not Found</title>\n </head>\n <body>\n <h1>404 Not Found</h1>\n The resource could not be found.<br /><br />\n\n\n\n </body>\n</html>
"
}

########### ironic-conductor log file

2013-10-09 21:55:30,135.135 17888 ERROR ironic.openstack.common.rpc.amqp [req-b4926ba4-9392-4ca9-b6e0-17b26abe79c0 None None] Exception during message handling
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp Traceback (most recent call last):
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/venvs/ironic/local/lib/python2.7/site-packages/ironic/openstack/common/rpc/amqp.py", line 434, in _process_data
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp **args)
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/venvs/ironic/local/lib/python2.7/site-packages/ironic/openstack/common/rpc/dispatcher.py", line
 172, in dispatch
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp result = getattr(proxyobj, method)(ctxt, **kwargs)
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/venvs/ironic/local/lib/python2.7/site-packages/ironic/conductor/manager.py", line 169, in chang
e_node_power_state
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp task.driver.power.set_power_state(task, node_obj, new_state)
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/venvs/ironic/local/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 93, in w
rapper
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp return f(*args, **kwargs)
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/venvs/ironic/local/lib/python2.7/site-packages/ironic/drivers/modules/ssh.py", line 324, in set
_power_state
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp "with invalid power state %s.") % pstate)
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp InvalidParameterValue: set_power_state called with invalid power state OFF.

############### relevant node record at this point

mysql> select * from nodes\G
*************************** 1. row ***************************
                  uuid: fbe3835c-0dfc-4ec0-a693-44a7b7dd4439
           power_state: error
    target_power_state: POWEROFF

############# error if I try to change the state again

$ curl -X PUT -d '' http://192.0.2.2:6385/v1/nodes/fbe3835c-0dfc-4ec0-a693-44a7b7dd4439/state/power?target=power%20off |json_pp
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 133 100 133 0 0 1506 0 --:--:-- --:--:-- --:--:-- 1528
{
   "error_message" : "{\"debuginfo\": null, \"faultcode\": \"Client\", \"faultstring\": \"One power operation is already in process\"}"
}

aeva black (tenbrae)
Changed in ironic:
status: New → Triaged
importance: Undecided → High
aeva black (tenbrae)
Changed in ironic:
assignee: nobody → Devananda van der Veen (devananda)
Revision history for this message
aeva black (tenbrae) wrote :

We discussed this in IRC and agreed, there are two distinct problems here. I'm opening a separate bug for the "API doesn't validate requested state" issue.

To clarify this bug, the issue can be seen from two perspectives.
* power_state should represent the actual power state, so if a machine is OFF, it should be OFF, not ERROR when the last state change request failed
* there's no way at present to recover a node when target_power_state != NULL, but setting this to NULL when catching an error will eat the user's data, and they'll never know that their request failed.

It was suggested to redo the state representation in a way that allows us more flexibility. Something akin to this structure:

power_state: {
  'current':
  'updated_at':
  'requested':
  'requested_at':
  'error':
  'error_at':
}

aeva black (tenbrae)
Changed in ironic:
assignee: Devananda van der Veen (devananda) → nobody
Revision history for this message
aeva black (tenbrae) wrote :

I started working on this, but am unassigning myself as I dont have time to continue working on this right now.

My initial attempt at solving this used a JsonEncodedDict to store the power state. It would work for storage/retrieval, but would be very inefficient to search (eg, find nodes with a current state of POWEROFF) since this would force the database to parse JSON, so i've abandoned that approach.

This leaves two options:
- use six columns on the nodes table to store power state info (Eg., ps_current, ps_updated_at, ps_requested, ps_requested_at, ps_last_error, ps_error_at).
- use a separate table (E.g., node_power_state, with PRIMARY KEY 'node_id', FOREIGN KEY 'node_id' REFERENCES nodes.id) and either use an RPC-style object for it, or map it onto the node RPC object.

A third option is just to add a "reset-power-state" method to the API, and not change the data representation at all.

Ruby Loo (rloo)
Changed in ironic:
assignee: nobody → Ruby Loo (rloo)
Revision history for this message
Ruby Loo (rloo) wrote :

I've got some questions, musings. I want to make sure I'm not missing anything.

Wrt power_state: the code sets it to:
- NOSTATE initially
- POWER_ON if it is on
- POWER_OFF if it is off
- ERROR if unknown (eg timed out trying to change the power state; if the driver returns something for the state that we don't recognize) or some error happened trying to change the power state.

So, if it isn't possible to transition to a different power state, we should leave the power state as it had been, unless the transition may have caused the state to be in some indeterminate state. (In this particular case of an invalid target power state, the power state should remain what it had been set to, since it didn't change.)

wrt target_power_state: if the transaction finished (either successfully or not), it should be reset to NULL. But what happens in the case where the target_power_state is power_on, and there is a timeout waiting for the power to be turned on? Do we reset it, or leave it as power_on?

I'm not quite sure what the new 6 fields of information are, and why they are needed. Is it to figure out if/what error occurred? If the logs contain that information, would that be sufficient?

Deva's first two options seem to address something different from the third option.

I think it would be good to do the third option: have a reset-power-state (to be clear, is that reset power_state, reset target_power_state, or reset both, and reset them to what, NOSTATE?)

But that doesn't preclude doing option 1 or 2 if we want that additional data.

Revision history for this message
aeva black (tenbrae) wrote :
Download full text (3.2 KiB)

Ruby,

Good questions. I'll add some of my current thoughts --

> if it isn't possible to transition to a different power state, we should leave the power state as it had been,

Do we set the power state back to what it was? If so, how do we know what it was?
Or, do we set the power state to what ever it actually is when the attempted transition is over?

> unless the transition may have caused the state to be in some indeterminate state.

Hardware is flaky; assume we may get timeouts and unexpected responses from get_power_state()

> In this particular case of an invalid target power state, the power state should remain what it had been set to, since it didn't change.

Yes, but then how do we inform the user of the failed input? Remember, the RPC call for set_power_state is cast -- so it is asynchronous, so the API will always return "202 in progress". If, upon processing the invalid target power state, the conductor sets the target power state back to NOSTATE, and sets the power state to what it is (or was), then the user will never know there was an invalid request.

> wrt target_power_state: if the transaction finished (either successfully or not), it should be reset to NULL.

If the transition failed (finished unsuccessfully) how do we inform the user of this?

> But what happens in the case where the target_power_state is power_on, and there is a timeout waiting for the power to be turned on? Do we reset it, or leave it as power_on?

Another good question!

So, perhaps we just need to add a "last_error" field to the node table, and change how we handle ERROR state slightly? Eg:

* power_state -- always represents current power state. Any power operation sets this back to "actual" when done (whether successful or not). Also, we should add a periodic_task to sync this. Set to ERROR only when unable to get the power state from a node.
* target_power_state -- represents the requested destination of a state transition. Cleared when the transition window is over (whether successful or not).
* last_error -- general string field used to store the last error from any requested operation (eg, whether that was to change power state, or deploy a node, or anything else). Cleared when any new operation is started.

I'll step through a few example cases:

* request to change to valid state: 1. target_power_state set; 2. driver called and performs change; 3. driver returns new state; 4. manager saves it in db.power_state and clears target_power_state.

* request to change to valid state, but fails: 1. same; 2. same; 3. driver raises exception; 4. manager saves it in last_error, calls driver.get_power_state and sets power_state, then clears target_power_state

* request to change to invalid state: 1. target power state set to invalid state; 2. driver called, but raises error; 3. manager copies error to last_error field, clears target_power_state, and calls driver.get_power_state to reset actual power_state field.

* request to change when current state is ERROR: 1. manager calls driver.get_power_state; 2. if successful, driver sets power_state and proceeds as per any other request; if unsuccessful, manager sets last_error, eg, to "unable to change...

Read more...

Revision history for this message
Robert Collins (lifeless) wrote :

A few key things stand out to me - already said, I know.

- assume the environment will be noisy. -> ensure that Ironic can recover on its own || the admin can tell Ironic to recover from any state.

- assume crap will GET/PUT and POSTed to every API -> Use a schema validator and reject invalid inputs. Also allow drivers to reject invalid inputs. Don't log such user problems as ERROR (use INFO probably).

- hardware information should be preferred manually entered info as a matter of principle, but if the admin has cause to tell us they know better, we should listen. -> Allow overrides by admins, but probably make it stand out in some exceptional way to avoid mistakes. Also signal that it has been overridden to permit diagnostics.

Revision history for this message
aeva black (tenbrae) wrote :

changing priority to Critical as this is too easy to hit, and the only workaround is directly manipulating the database record.

Changed in ironic:
importance: High → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/54466

Changed in ironic:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/54466
Committed: http://github.com/openstack/ironic/commit/e935b2c4acc3cbc3258b8d64d560f5651baace1e
Submitter: Jenkins
Branch: master

commit e935b2c4acc3cbc3258b8d64d560f5651baace1e
Author: Ruby Loo <email address hidden>
Date: Wed Oct 30 02:47:38 2013 +0000

    Changes power_state and adds last_error field

    Made these changes:
    * power_state -- always represents current power state.
    Any power operation sets this back to "actual" when done
    (whether successful or not). It is set to ERROR only when
    unable to get the power state from a node.

    * target_power_state -- represents the requested destination
    of a state transition. Cleared when the transition window is
    over (whether successful or not).

    * last_error -- string field used to store the last error from
    any requested asynchronous operation (eg, whether that was to
    change power state, or deploy a node, or anything else) that
    started but failed to finish. Cleared when any new asynchronous
    operation is started.

    Closes-Bug: #1237688

    Change-Id: I7e079627b87b2cb1606e677e287dd08dcc87263a

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
milestone: none → icehouse-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: icehouse-1 → 2014.1
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.