Ironic

a failed power_state change wedges the power_state

Bug #1237688 reported by aeva black on 2013-10-09

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Ironic	Fix Released	Critical	Ruby Loo	Ironic 2014.1 "icehouse"

Bug Description

PUT'ing an invalid power state string to the API will wedge the node's power state. I suspect that any exception raised by a power_driver during set_node_power_state will wedge it right now, because it will leave target_power_state non-null, preventing a subsequent attempt to change the power_state.

There may be two separate issues at play here:
1) power_state and target_power_state being changed inappropriately for an invalid input
2) no way currently exists via the API to override a non-null target_power_state, even when the current power_state is ERROR.

Logs:

########### POST a bad string. POWEROFF should be "power off".

$ curl -X POST http://192.0.2.2:6385/v1/nodes/fbe3835c-0dfc-4ec0-a693-44a7b7dd4439/state/power?target=POWEROFF |json_pp
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 186 100 186 0 0 1426 0 --:--:-- --:--:-- --:--:-- 1441
{
   "error_message" : "<html>\n <head>\n <title>404 Not Found</title>\n </head>\n <body>\n <h1>404 Not Found</h1>\n The resource could not be found.<br /><br />\n\n\n\n </body>\n</html>
"
}

########### ironic-conductor log file

2013-10-09 21:55:30,135.135 17888 ERROR ironic.openstack.common.rpc.amqp [req-b4926ba4-9392-4ca9-b6e0-17b26abe79c0 None None] Exception during message handling
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp Traceback (most recent call last):
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/venvs/ironic/local/lib/python2.7/site-packages/ironic/openstack/common/rpc/amqp.py", line 434, in _process_data
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp **args)
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/venvs/ironic/local/lib/python2.7/site-packages/ironic/openstack/common/rpc/dispatcher.py", line
172, in dispatch
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp result = getattr(proxyobj, method)(ctxt, **kwargs)
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/venvs/ironic/local/lib/python2.7/site-packages/ironic/conductor/manager.py", line 169, in chang
e_node_power_state
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp task.driver.power.set_power_state(task, node_obj, new_state)
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/venvs/ironic/local/lib/python2.7/site-packages/ironic/conductor/task_manager.py", line 93, in w
rapper
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp return f(*args, **kwargs)
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp File "/opt/stack/venvs/ironic/local/lib/python2.7/site-packages/ironic/drivers/modules/ssh.py", line 324, in set
_power_state
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp "with invalid power state %s.") % pstate)
2013-10-09 21:55:30,135.135 17888 TRACE ironic.openstack.common.rpc.amqp InvalidParameterValue: set_power_state called with invalid power state OFF.

############### relevant node record at this point

mysql> select * from nodes\G
*************************** 1. row ***************************
                  uuid: fbe3835c-0dfc-4ec0-a693-44a7b7dd4439
           power_state: error
    target_power_state: POWEROFF

############# error if I try to change the state again

$ curl -X PUT -d '' http://192.0.2.2:6385/v1/nodes/fbe3835c-0dfc-4ec0-a693-44a7b7dd4439/state/power?target=power%20off |json_pp
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 133 100 133 0 0 1506 0 --:--:-- --:--:-- --:--:-- 1528
{
   "error_message" : "{\"debuginfo\": null, \"faultcode\": \"Client\", \"faultstring\": \"One power operation is already in process\"}"
}

aeva black (tenbrae) on 2013-10-09

Changed in ironic:
status:	New → Triaged
importance:	Undecided → High

aeva black (tenbrae) on 2013-10-10

Changed in ironic:
assignee:	nobody → Devananda van der Veen (devananda)

Revision history for this message

aeva black (tenbrae) wrote on 2013-10-10:

We discussed this in IRC and agreed, there are two distinct problems here. I'm opening a separate bug for the "API doesn't validate requested state" issue.

To clarify this bug, the issue can be seen from two perspectives.
* power_state should represent the actual power state, so if a machine is OFF, it should be OFF, not ERROR when the last state change request failed
* there's no way at present to recover a node when target_power_state != NULL, but setting this to NULL when catching an error will eat the user's data, and they'll never know that their request failed.

It was suggested to redo the state representation in a way that allows us more flexibility. Something akin to this structure:

power_state: {
  'current':
  'updated_at':
  'requested':
  'requested_at':
  'error':
  'error_at':
}

aeva black (tenbrae) on 2013-10-22

Changed in ironic:
assignee:	Devananda van der Veen (devananda) → nobody

Revision history for this message

aeva black (tenbrae) wrote on 2013-10-22:

I started working on this, but am unassigning myself as I dont have time to continue working on this right now.

My initial attempt at solving this used a JsonEncodedDict to store the power state. It would work for storage/retrieval, but would be very inefficient to search (eg, find nodes with a current state of POWEROFF) since this would force the database to parse JSON, so i've abandoned that approach.

This leaves two options:
- use six columns on the nodes table to store power state info (Eg., ps_current, ps_updated_at, ps_requested, ps_requested_at, ps_last_error, ps_error_at).
- use a separate table (E.g., node_power_state, with PRIMARY KEY 'node_id', FOREIGN KEY 'node_id' REFERENCES nodes.id) and either use an RPC-style object for it, or map it onto the node RPC object.

A third option is just to add a "reset-power-state" method to the API, and not change the data representation at all.

Ruby Loo (rloo) on 2013-10-22

Changed in ironic:
assignee:	nobody → Ruby Loo (rloo)

Revision history for this message

Ruby Loo (rloo) wrote on 2013-10-24:

I've got some questions, musings. I want to make sure I'm not missing anything.

Wrt power_state: the code sets it to:
- NOSTATE initially
- POWER_ON if it is on
- POWER_OFF if it is off
- ERROR if unknown (eg timed out trying to change the power state; if the driver returns something for the state that we don't recognize) or some error happened trying to change the power state.

So, if it isn't possible to transition to a different power state, we should leave the power state as it had been, unless the transition may have caused the state to be in some indeterminate state. (In this particular case of an invalid target power state, the power state should remain what it had been set to, since it didn't change.)

wrt target_power_state: if the transaction finished (either successfully or not), it should be reset to NULL. But what happens in the case where the target_power_state is power_on, and there is a timeout waiting for the power to be turned on? Do we reset it, or leave it as power_on?

I'm not quite sure what the new 6 fields of information are, and why they are needed. Is it to figure out if/what error occurred? If the logs contain that information, would that be sufficient?

Deva's first two options seem to address something different from the third option.

I think it would be good to do the third option: have a reset-power-state (to be clear, is that reset power_state, reset target_power_state, or reset both, and reset them to what, NOSTATE?)

But that doesn't preclude doing option 1 or 2 if we want that additional data.

Revision history for this message

aeva black (tenbrae) wrote on 2013-10-24:

Download full text (3.2 KiB)

Ruby,

Good questions. I'll add some of my current thoughts --

> if it isn't possible to transition to a different power state, we should leave the power state as it had been,

Do we set the power state back to what it was? If so, how do we know what it was?
Or, do we set the power state to what ever it actually is when the attempted transition is over?

> unless the transition may have caused the state to be in some indeterminate state.

Hardware is flaky; assume we may get timeouts and unexpected responses from get_power_state()

> In this particular case of an invalid target power state, the power state should remain what it had been set to, since it didn't change.

Yes, but then how do we inform the user of the failed input? Remember, the RPC call for set_power_state is cast -- so it is asynchronous, so the API will always return "202 in progress". If, upon processing the invalid target power state, the conductor sets the target power state back to NOSTATE, and sets the power state to what it is (or was), then the user will never know there was an invalid request.

> wrt target_power_state: if the transaction finished (either successfully or not), it should be reset to NULL.

If the transition failed (finished unsuccessfully) how do we inform the user of this?

> But what happens in the case where the target_power_state is power_on, and there is a timeout waiting for the power to be turned on? Do we reset it, or leave it as power_on?

Another good question!

So, perhaps we just need to add a "last_error" field to the node table, and change how we handle ERROR state slightly? Eg:

* power_state -- always represents current power state. Any power operation sets this back to "actual" when done (whether successful or not). Also, we should add a periodic_task to sync this. Set to ERROR only when unable to get the power state from a node.
* target_power_state -- represents the requested destination of a state transition. Cleared when the transition window is over (whether successful or not).
* last_error -- general string field used to store the last error from any requested operation (eg, whether that was to change power state, or deploy a node, or anything else). Cleared when any new operation is started.

I'll step through a few example cases:

* request to change to valid state: 1. target_power_state set; 2. driver called and performs change; 3. driver returns new state; 4. manager saves it in db.power_state and clears target_power_state.

* request to change to valid state, but fails: 1. same; 2. same; 3. driver raises exception; 4. manager saves it in last_error, calls driver.get_power_state and sets power_state, then clears target_power_state

* request to change to invalid state: 1. target power state set to invalid state; 2. driver called, but raises error; 3. manager copies error to last_error field, clears target_power_state, and calls driver.get_power_state to reset actual power_state field.

* request to change when current state is ERROR: 1. manager calls driver.get_power_state; 2. if successful, driver sets power_state and proceeds as per any other request; if unsuccessful, manager sets last_error, eg, to "unable to change...