sync_power_states does not handle missing driver info well

Bug #1262912 reported by aeva black
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ironic
Fix Released
High
Ruby Loo

Bug Description

The ConductorManager._sync_power_states() periodic task will stop checking power states if it attempts to check the state of a node which is not properly configured, ie. for which node.driver.power.validate() raises an exception.

sync_power_state should catch the exceptions, log them, and continue checking other nodes.

aeva black (tenbrae)
Changed in ironic:
status: New → Triaged
importance: Undecided → High
milestone: none → icehouse-2
tags: added: low-hanging-fruit
Revision history for this message
Haomeng,Wang (whaom) wrote :

@Devananda, I checked code - https://github.com/openstack/ironic/blob/master/ironic/conductor/manager.py#L331, ConductorManager._sync_power_states() call driver.power.get_power_state() to get current power state, sync the state and save the state to database, and handle exception.NodeLocked, exception.NodeNotFound exceptions, will not call node.driver.power.validate().
And I try to recreate this issue with my local env, just set wrong driver_info, and our ConductorManager._sync_power_states() periodic task will continue checking power states if one node has wrong driver configurations. So just has concern here how to reproduce this issue, and try to fix it.

Ruby Loo (rloo)
Changed in ironic:
assignee: nobody → Ruby Loo (rloo)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/65297

Changed in ironic:
status: Triaged → In Progress
Revision history for this message
Ruby Loo (rloo) wrote :

@Haomeng, if the node has wrong driver info, the call to task.driver.power.get_power_state() should return an exception (diff from NodeLocked, NodeNotFound exceptions). This exception needs to be caught; otherwise sync_power_states stops processing any more nodes.

Revision history for this message
Haomeng,Wang (whaom) wrote :

@Ruby, yes, agree with you, we can catch more the exceptions raised from get_power_state. But I still can not recreate it, after we get exception with get_power_state() call, the periodic task will run in next round as well, I think the exception is caught by high level code[1], so it can run next round again with this round exception without breaking the task, my log can be found here[2]:

[1] https://github.com/openstack/ironic/blob/master/ironic/openstack/common/periodic_task.py#L182

[2]
2014-01-07 21:15:02.342 3753 ERROR ironic.openstack.common.periodic_task [-] Error during ConductorManager._sync_power_states: SSHPowerDriver requires virt_type be set.
2014-01-07 21:15:02.342 3753 TRACE ironic.openstack.common.periodic_task Traceback (most recent call last):
2014-01-07 21:15:02.342 3753 TRACE ironic.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/ironic/openstack/common/periodic_task.py", line 182, in run_periodic_tasks
2014-01-07 21:15:02.342 3753 TRACE ironic.openstack.common.periodic_task task(self, context)
2014-01-07 21:15:02.342 3753 TRACE ironic.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/ironic/conductor/manager.py", line 345, in _sync_power_states
2014-01-07 21:15:02.342 3753 TRACE ironic.openstack.common.periodic_task power_state = task.driver.power.get_power_state(task, node)
2014-01-07 21:15:02.342 3753 TRACE ironic.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/ironic/drivers/modules/ssh.py", line 286, in get_power_state
2014-01-07 21:15:02.342 3753 TRACE ironic.openstack.common.periodic_task driver_info = _parse_driver_info(node)
2014-01-07 21:15:02.342 3753 TRACE ironic.openstack.common.periodic_task File "/usr/lib/python2.6/site-packages/ironic/drivers/modules/ssh.py", line 103, in _parse_driver_info
2014-01-07 21:15:02.342 3753 TRACE ironic.openstack.common.periodic_task "SSHPowerDriver requires virt_type be set."))
2014-01-07 21:15:02.342 3753 TRACE ironic.openstack.common.periodic_task InvalidParameterValue: SSHPowerDriver requires virt_type be set.
2014-01-07 21:15:02.342 3753 TRACE ironic.openstack.common.periodic_task

Revision history for this message
Ruby Loo (rloo) wrote :

@Haomeng, I suggest you try out the patch I added. Or just run the unit test I added w/o the code change. You'll see that it fails. Ie, after the 2nd node's power-state call raises an exception, the sync_power_states code doesn't process node 3. Yes, since this is a periodic task, the sync_power_states call will be invoked again, but if a particular node always causes an exception to be raised, the other nodes will never get processed.

Maybe this answers your question. in _sync_power_states(), the "for node_id, in node_list" loop doesn't go through all the node_ids if an uncaught exception is raised, and we want it to do so.

Revision history for this message
Haomeng,Wang (whaom) wrote :

@Ruby, understand, my env just have one node, so can not recreate, thanks for your quickly response.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/65297
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=ec7ff7d4e232515b20026a002748c91d6af7dd7d
Submitter: Jenkins
Branch: master

commit ec7ff7d4e232515b20026a002748c91d6af7dd7d
Author: Ruby Loo <email address hidden>
Date: Tue Jan 7 15:34:21 2014 +0000

    sync_power_states handles missing driver info

    The ConductorManager._sync_power_states() periodic task will stop
    checking power states if getting the power state of a node raises an
    exception. Eg, this could happen if a node is not properly configured
    (for which node.driver.power.validate() raises an exception).

    With this change, sync_power_state catches the exceptions, logs them,
    and continues checking other nodes.

    Change-Id: I343f722de8f79a50fbaa9fcdd180decdc5a43f95
    Closes-Bug: #1262912

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: icehouse-2 → 2014.1
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.