node_power_action should sync DB state with real state even if real state = requested state

Bug #1403106 reported by Dmitry Tantsur
Affects: Ironic | Status: Fix Released | Importance: High | Assigned to: Dmitry Tantsur

Bug Description

See the interesting part of the logs: http://paste.openstack.org/show/151894/

The scenario is:
1. Set maintenance=True
2. Power on the machine using the API
3. ... do something with it ...
4. Power off the machine using the API
5. Disable maintenance mode

One would expect the node to remain powered off. However, quite often the power state sync loop seems to kick in and power the machine back on.

UPDATE:
Setting force_power_state_during_sync=False in ironic.conf helps: http://paste.openstack.org/show/151899/
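
For reference, the workaround might look like this in ironic.conf. The report only names the option; placing it in the [conductor] section is an assumption here.

```ini
[conductor]
# Assumed section name. With False, the periodic sync updates the database
# from the real power state instead of forcing the hardware to match the
# recorded state.
force_power_state_during_sync = False
```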

UPDATE2:
The problem manifested itself with a VERY slow power driver (SSH'ing took seconds). Maybe some kind of race condition...

UPDATE3:
So that's how I see it:

1. Maintenance mode is set, power_state=off, real power state = off
2. Node is powered on, power_state = on, real power state = on
3. The discoverd ramdisk does something and reports back to discoverd
4. Discoverd instructs Ironic to power off the machine.
5. While it takes loooong (remember: very slow SSH) for Ironic to get the power state, the ramdisk shuts itself down
6. Now: get_power_state returns real power state = off and manager.utils.node_power_action exits
7. But! Power state in database was not updated and is still = on!
8. So after maintenance mode is disabled, Ironic powers on the machine.
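
The sequence above can be replayed with a small toy model. This is only a sketch: Node, SlowFakeDriver, node_power_action_buggy and sync_power_state are hypothetical stand-ins written for illustration, not the actual Ironic code, and the sync loop is reduced to the behaviour described in this report.

```python
from dataclasses import dataclass

POWER_ON = "power on"
POWER_OFF = "power off"


@dataclass
class Node:
    """Hypothetical stand-in for a node record in the database."""
    power_state: str
    maintenance: bool = False


class SlowFakeDriver:
    """Hypothetical power driver; real_state plays the role of the hardware."""

    def __init__(self, real_state):
        self.real_state = real_state

    def get_power_state(self):
        return self.real_state

    def set_power_state(self, state):
        self.real_state = state


def node_power_action_buggy(node, driver, new_state):
    # Early exit when the hardware already matches the request,
    # leaving the database record untouched (steps 6-7).
    if driver.get_power_state() == new_state:
        return
    driver.set_power_state(new_state)
    node.power_state = new_state


def sync_power_state(node, driver, force=True):
    # Simplified periodic sync: skips nodes in maintenance; with force=True
    # it pushes the recorded state onto the hardware, otherwise it adopts
    # the real state.
    if node.maintenance:
        return
    if driver.get_power_state() != node.power_state:
        if force:
            driver.set_power_state(node.power_state)
        else:
            node.power_state = driver.get_power_state()


node = Node(power_state=POWER_OFF, maintenance=True)  # step 1
driver = SlowFakeDriver(POWER_OFF)
node_power_action_buggy(node, driver, POWER_ON)       # step 2
driver.real_state = POWER_OFF                         # step 5: ramdisk shuts down
node_power_action_buggy(node, driver, POWER_OFF)      # steps 6-7: DB stays "on"
node.maintenance = False                              # step 8
sync_power_state(node, driver, force=True)
print(node.power_state, driver.real_state)            # power on power on
```

In this model, passing force=False to sync_power_state leaves the machine off, which matches the workaround from the first update.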

SUGGESTION
During node_power_action, _always_ make sure the database is in sync with both what we expect and what we actually have (independent of the force_power_state_during_sync value).
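
A minimal sketch of that suggestion, assuming a simplified power action that receives callables for reading and setting the hardware state. The function name node_power_action_synced and its parameters are hypothetical and do not reflect the real signature in Ironic.

```python
from types import SimpleNamespace

POWER_ON, POWER_OFF = "power on", "power off"


def node_power_action_synced(node, get_real_state, set_real_state, new_state):
    """Hypothetical variant that always records the resulting state."""
    if get_real_state() != new_state:
        set_real_state(new_state)
    # Unconditional write: the database record ends up matching the request
    # whether or not the hardware had to be touched.
    node.power_state = new_state


hardware = {"state": POWER_OFF}                # ramdisk already shut the machine down
node = SimpleNamespace(power_state=POWER_ON)   # stale database record

node_power_action_synced(
    node,
    get_real_state=lambda: hardware["state"],
    set_real_state=lambda s: hardware.update(state=s),
    new_state=POWER_OFF,
)
print(node.power_state, hardware["state"])     # power off power off
```

With the record corrected at this point, a later sync has no stale "on" state to enforce once maintenance mode is disabled.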

Dmitry Tantsur (divius)
description: updated
Changed in ironic:
assignee: nobody → Devananda van der Veen (devananda)
status: New → In Progress
aeva black (tenbrae) wrote : Re: force_power_state_during_sync can force an incorrect power state after leaving maintenance mode

The previous message was the result of an incorrect bug number tag. I am reverting the status of this bug.

Changed in ironic:
status: In Progress → New
assignee: Devananda van der Veen (devananda) → nobody
Dmitry Tantsur (divius)
description: updated
Dmitry Tantsur (divius)
description: updated
summary: - force_power_state_during_sync can force an incorrect power state after
- leaving maintenance mode
+ node_power_action should sync real state with DB state even if real
+ state = requested state
summary: - node_power_action should sync real state with DB state even if real
+ node_power_action should sync DB state with real state even if real
state = requested state
aeva black (tenbrae)
Changed in ironic:
status: New → Confirmed
importance: Undecided → High
tags: added: low-hanging-fruit
Lucas Alvares Gomes (lucasagomes) wrote :

I tried to reproduce it but couldn't: http://paste.openstack.org/show/157270/

I will take a closer look at the code to see if I can find something.

Dmitry Tantsur (divius) wrote :

You need a slow power driver to reproduce it. Actually, I think adding node.power_state = new_state after https://github.com/openstack/ironic/blob/master/ironic/conductor/utils.py#L88 should solve the problem.

In discoverd I do need the power off command to be reliable.

Dmitry Tantsur (divius)
Changed in ironic:
status: Confirmed → In Progress
assignee: nobody → Dmitry "Divius" Tantsur (divius)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/147494

OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/147494
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=d6787b2073eea16e29444581fa7b7791789be787
Submitter: Jenkins
Branch: master

commit d6787b2073eea16e29444581fa7b7791789be787
Author: Dmitry Tantsur <email address hidden>
Date: Thu Jan 15 13:53:44 2015 +0100

    Ensure we don't have stale power state in database after power action

    If the machine is shut down not via Ironic API (e.g. during maintenance)
    and then power off request is made, this request will succeed, but
    due to stale information in database the machine will be powered on
    as soon as power sync loop reaches it.

    Change-Id: Ia3ccaf1b3ab6399515c6eb6a98a0a85e3c2e6ddf
    Closes-Bug: #1403106

Changed in ironic:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in ironic:
milestone: none → kilo-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
milestone: kilo-2 → 2015.1.0