Stopping an instance via the Nova API when using the Nova Ironic driver incorrectly reports power state

Bug #1310135 reported by Daryl Walleck
This bug affects 4 people
Affects                     Status         Importance   Assigned to   Milestone
Ironic                      Fix Released   High         aeva black    2014.2
OpenStack Compute (nova)    Fix Released   High         aeva black    2014.2

Bug Description

When using the Ironic Nova driver, a stopped server's power state is still presented as Running, even though checking via the Ironic API correctly shows the node as powered off:

stack@ironic:~/logs/screen$ nova list
+--------------------------------------+---------+--------+------------+-------------+-------------------+
| ID                                   | Name    | Status | Task State | Power State | Networks          |
+--------------------------------------+---------+--------+------------+-------------+-------------------+
| 5b43d631-91e1-4384-9b87-93283b3ae958 | testing | ACTIVE | -          | Running     | private=10.1.0.10 |
+--------------------------------------+---------+--------+------------+-------------+-------------------+
stack@ironic:~/logs/screen$ nova stop 5b43d631-91e1-4384-9b87-93283b3ae958
stack@ironic:~/logs/screen$ nova list
+--------------------------------------+---------+---------+------------+-------------+-------------------+
| ID                                   | Name    | Status  | Task State | Power State | Networks          |
+--------------------------------------+---------+---------+------------+-------------+-------------------+
| 5b43d631-91e1-4384-9b87-93283b3ae958 | testing | SHUTOFF | -          | Running     | private=10.1.0.10 |
+--------------------------------------+---------+---------+------------+-------------+-------------------+
stack@ironic:~/logs/screen$ ping 10.1.0.10
PING 10.1.0.10 (10.1.0.10) 56(84) bytes of data.
From 172.24.4.2 icmp_seq=1 Destination Host Unreachable
From 172.24.4.2 icmp_seq=5 Destination Host Unreachable
From 172.24.4.2 icmp_seq=6 Destination Host Unreachable
From 172.24.4.2 icmp_seq=7 Destination Host Unreachable
From 172.24.4.2 icmp_seq=8 Destination Host Unreachable
--- 10.1.0.10 ping statistics ---
9 packets transmitted, 0 received, +5 errors, 100% packet loss, time 8000ms
stack@ironic:~/logs/screen$ ironic node-list
+--------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------------------------------+-------------+--------------------+-------------+
| 91e81c38-4dce-412b-8a1b-a914d28943e4 | 5b43d631-91e1-4384-9b87-93283b3ae958 | power off   | active             | False       |
+--------------------------------------+--------------------------------------+-------------+--------------------+-------------+

Changed in ironic:
importance: Undecided → Medium
status: New → Triaged
Rakesh H S (rh-s)
Changed in ironic:
assignee: nobody → Rakesh H S (rh-s)
Rakesh H S (rh-s)
Changed in ironic:
status: Triaged → Confirmed
Revision history for this message
Rakesh H S (rh-s) wrote :

I was able to reproduce this issue and did some debugging.

This issue occurs because when the Ironic driver makes an 'ironic client' call to perform a power action on a node, Ironic spawns a thread to carry out the power action and the client call returns immediately.

Since the client call has already returned, Nova queries the node, sees that it is still powered on, and sets the POWER STATE to Running.
(We actually need to wait for the state transition to happen.)

NOTE:
The POWER STATE is automatically corrected by the next periodic 'sync power state' task.
But we still need to fix this issue, since the user may not wait for that periodic task to run.
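
To illustrate the ordering, here is a simplified sketch with illustrative names; this is not the actual ComputeManager or Ironic driver code:

# Driver side: Ironic only *accepts* the request; a conductor thread does the
# real work after this call has already returned.
def power_off(driver, instance):
  node = driver._get_node(instance)   # illustrative helper
  driver.ironicclient.call("node.set_power_state", node.uuid, 'off')
  # returns immediately; the node is most likely still powered on here

# Compute manager side, roughly: the power state is re-read right after the
# driver returns, so the instance stays marked as Running until the periodic
# sync_power_state task corrects it.
def stop_instance(manager, instance):
  manager.driver.power_off(instance)
  instance.power_state = manager._get_power_state(instance)   # still 'power on'
  instance.save()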

Revision history for this message
aeva black (tenbrae) wrote :

If the request is received by Ironic (i.e., the API call does not return an error), should we assume that it will succeed and set the power state in Nova to POWEROFF? If, for some reason, Ironic failed to complete the requested action and the node is still ON, then Nova will detect *that* on the next sync_power_state cycle.

I think this is better than the current behavior, but still not ideal.

Ideally, we should have a notification bus between the services, e.g. so Ironic can actively inform Nova when the requested action is complete.

Changed in ironic:
milestone: none → juno-rc1
tags: added: nova-driver
aeva black (tenbrae)
tags: added: ironic
Revision history for this message
Sean Dague (sdague) wrote :

Eventual consistency of states is normal for Nova. Is there a reason this is believed to be any worse than nova delete's normal delays? It seems like the real issue is that there should be an Ironic callback for when things are done.

Changed in nova:
status: New → Incomplete
importance: Undecided → Low
Revision history for this message
aeva black (tenbrae) wrote :

I agree with Sean - the real solution here is just to have a callback mechanism between Ironic and Nova. Waiting for a periodic task to sync the power state is not ideal, but it is "normal" for Nova today.

Changed in ironic:
milestone: juno-rc1 → none
status: Confirmed → Won't Fix
status: Won't Fix → Opinion
Revision history for this message
aeva black (tenbrae) wrote :

So I did some more investigation after Adam reported that the parallel tempest tests fail intermittently on reboot() for a similar reason. For reference, that bug is here:

  https://bugs.launchpad.net/ironic/+bug/1373650

Summary
=======

Ironic's API for requesting power state changes (on, off, reboot) is asynchronous - it returns 202 when the request is received and actionable, not when it is complete.

Nova's ComputeManager seems to require that virt driver power_on/power_off/reboot methods return when the action is complete, and polls the driver for status immediately afterwards. It then updates the instance state based on this, and marks the action as complete (regardless of the return).

The mismatch should be obvious now -- Ironic's API is async, but nova.compute.manager expects the virt driver to return only when the action is complete. So, today, the ironic virt driver is not giving the Ironic service any time to complete the request, and ComputeManager thinks the operation is complete.

Nova self-corrects for power on/off requests in the periodic sync_power_state method, but there is still a problem here. In a busy system, the action appears complete before it even starts, which is causing the tempest failures.

Proposal
========

Ironic represents the pending action in the "target_power_state" field, which is set when the request is received. When the requested action is complete, this field is reset to None, and the "power_state" field is updated.

I propose that the Ironic virt driver have a new method, _wait_for_power_state(client, node, new_state), which polls Ironic in a loop and returns when node.target_power_state != new_state.

Then, the driver methods would become:

def reboot(...):
  new_state = 'reboot'
  ironicclient = ...
  node = ...
  ironicclient.call("node.set_power_state", node.uuid, new_state)
  _wait_for_power_state(ironicclient, node, new_state)
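
For illustration, a minimal sketch of such a helper, assuming the node can be re-fetched through the same client wrapper (the node.get call, polling interval, and timeout are assumptions, not part of this proposal):

import time

def _wait_for_power_state(ironicclient, node, new_state, interval=2, timeout=120):
  """Poll Ironic until the requested power action is no longer pending."""
  deadline = time.time() + timeout
  while time.time() < deadline:
    node = ironicclient.call("node.get", node.uuid)
    if node.target_power_state != new_state:
      # Ironic has cleared the pending action; node.power_state is now current.
      return node
    time.sleep(interval)
  # Hypothetical error handling; a real driver would raise a Nova exception.
  raise RuntimeError("Timed out waiting for %s on node %s" % (new_state, node.uuid))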

Changed in nova:
status: Incomplete → Confirmed
aeva black (tenbrae)
Changed in nova:
status: Confirmed → Triaged
importance: Low → High
Changed in ironic:
status: Opinion → Won't Fix
status: Won't Fix → Invalid
Revision history for this message
Rakesh H S (rh-s) wrote :

Hi Devananda, I see that you have moved the bug to 'Triaged' in Nova, so if we are fixing this issue, will the fix go to the 'ironic driver' in the Nova repo?

Changed in nova:
assignee: nobody → Rakesh H S (rh-s)
Revision history for this message
Dmitry Tantsur (divius) wrote :

Right, it should be fixed in the Nova tree, as the driver lives there.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/124162

Changed in nova:
assignee: Rakesh H S (rh-s) → Devananda van der Veen (devananda)
status: Triaged → In Progress
Revision history for this message
aeva black (tenbrae) wrote :

Digging further after proposing a fix to the Nova driver, I found that there is *also* a race inside ironic/conductor/manager.py and ironic/conductor/utils.py -- I am posting a fix for those now.

Changed in ironic:
status: Invalid → In Progress
assignee: Rakesh H S (rh-s) → Devananda van der Veen (devananda)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ironic (master)

Fix proposed to branch: master
Review: https://review.openstack.org/124225

aeva black (tenbrae)
Changed in ironic:
importance: Medium → High
Revision history for this message
Rakesh H S (rh-s) wrote :

I was already working on the fix :(
But no problem, thanks for fixing it so quickly.

Changed in ironic:
assignee: Devananda van der Veen (devananda) → Lucas Alvares Gomes (lucasagomes)
Changed in ironic:
assignee: Lucas Alvares Gomes (lucasagomes) → Devananda van der Veen (devananda)
aeva black (tenbrae)
Changed in ironic:
milestone: none → juno-rc1
Joe Gordon (jogo)
Changed in nova:
milestone: none → juno-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ironic (master)

Reviewed: https://review.openstack.org/124225
Committed: https://git.openstack.org/cgit/openstack/ironic/commit/?id=f33a26fd81b1bfd57f6b3c437f10c83e5059f878
Submitter: Jenkins
Branch: master

commit f33a26fd81b1bfd57f6b3c437f10c83e5059f878
Author: Devananda van der Veen <email address hidden>
Date: Thu Sep 25 16:25:57 2014 -0700

    Conductor changes target_power_state before starting work

    A race was discovered between when Nova requests a power state change
    and when that request is visible in Ironic's API, which could lead to
    Nova believing the request was completely satisfied before it
    actually started.

    A partial fix has been proposed to Nova:
        https://review.openstack.org/124162

    This also requires that Ironic change the node's target_power_state from
    within ConductorManager *prior* to spawning the background thread which
    performs the actual work.

    Looking into this exposed another potential race in Ironic whereby
    a power state change request could be answered with 202 Accepted,
    but the background thread might fail to start, and no error would
    be reported. The new behavior is to set node.last_error, mirroring
    how provision state changes that fail in this way are handled.

    Change-Id: I4d85c9230bcd9b6e36ffa9a326fda9908fee7e76
    Closes-bug: #1310135
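
For illustration, a simplified sketch of the ordering this commit describes; apart from target_power_state and last_error, the names here are hypothetical and not the actual ConductorManager code:

def change_node_power_state(task, new_state):
  # Record the pending action *before* handing off to a worker, so the API
  # reflects an in-flight state change as soon as the 202 is returned.
  task.node.target_power_state = new_state
  task.node.save()
  try:
    spawn_worker(do_power_action, task, new_state)   # hypothetical spawn helper
  except Exception as e:
    # If the background thread could not be started, surface the failure
    # rather than leaving the accepted request silently unfulfilled.
    task.node.target_power_state = None
    task.node.last_error = str(e)
    task.node.save()
    raise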

Changed in ironic:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/124162
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6e1fd7b9b51498af5f688a41d6636ff90b9a56f5
Submitter: Jenkins
Branch: master

commit 6e1fd7b9b51498af5f688a41d6636ff90b9a56f5
Author: Devananda van der Veen <email address hidden>
Date: Thu Sep 25 09:07:23 2014 -0700

    Ironic driver must wait for power state changes

    The Ironic virt driver was previously returning from power_on,
    power_off, and reboot calls immediately after making the request to
    Ironic -- but before Ironic had necessarily acted upon it. This could
    lead to Nova acting as though the state change was complete before it
    actually was.

    This patch introduces a new private method which polls Ironic until it
    signals that the power state change is complete. With this, the
    power_on, power_off, and reboot methods wait for the action to complete
    before returning to ComputeManager.

    Closes-bug: #1310135

    Change-Id: I0bbdd8bc8807a0df43a420825f497842b387c4d5
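
Schematically, and only as a rough sketch of the shape of the change (simplified signatures and illustrative helper names, not the merged patch), the three driver entry points now share the same wait:

def power_on(self, instance):
  self._do_power_action(instance, 'on')

def power_off(self, instance):
  self._do_power_action(instance, 'off')

def reboot(self, instance):
  self._do_power_action(instance, 'reboot')

def _do_power_action(self, instance, new_state):
  node = self._get_node(instance)
  self.ironicclient.call("node.set_power_state", node.uuid, new_state)
  _wait_for_power_state(self.ironicclient, node, new_state)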

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ironic:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-rc1 → 2014.2
Thierry Carrez (ttx)
Changed in ironic:
milestone: juno-rc1 → 2014.2