Race condition causes Nova to shut off a successfully deployed baremetal server

Bug #1815791 reported by Lars Kellogg-Stedman
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Jim Rollenhagen
Rocky
Fix Committed
Medium
Jim Rollenhagen

Bug Description

When booting a baremetal server with Nova, we see Ironic report a successful power on:

  ironic-conductor.log:2019-02-13 10:52:15.901 7 INFO ironic.conductor.utils [req-774350ce-9392-4096-b66c-20ad3d794e4e 7a9b1ac45e084e7cbeb9cb740ffe8d08 41ea8af8d00e46438c7be3b182bbb53f - default default] Successfully set node a00696d5-32ba-475e-9528-59bf11cffea6 power state to power on by power on.

But seconds later, Nova (a) triggers a power state sync and then (b) decided the node is in state "power off" and shuts it down:

 nova-compute.log:2019-02-13 10:52:17.289 7 DEBUG nova.compute.manager [req-9bcae7d4-4201-40ea-a66c-c5954117f0e4 - - - - -] Triggering sync for uuid dcb4f055-cda4-4d61-ba8f-976645c4e92a _sync_power_states /usr/lib/python2.7/site-packages/nova/compute/manager.py:7516
 nova-compute.log:2019-02-13 10:52:17.295 7 DEBUG oslo_concurrency.lockutils [-] Lock "dcb4f055-cda4-4d61-ba8f-976645c4e92a" acquired by "nova.compute.manager.query_driver_power_state_and_sync" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:327
 nova-compute.log:2019-02-13 10:52:17.344 7 WARNING nova.compute.manager [-] [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 4, current VM power_state: 4
 nova-compute.log:2019-02-13 10:52:17.345 7 DEBUG nova.compute.api [-] [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] Going to try to stop instance force_stop /usr/lib/python2.7/site-packages/nova/compute/api.py:2291

It looks like Nova is using stale cache data to make this decision.

jroll on irc suggests a solution may look like https://review.openstack.org/#/c/636699/ (bypass cache data to verify power state before shutting down the server).

This is with nova @ ad842aa and ironic @ 4404292.

Tags: compute ironic
Changed in nova:
assignee: nobody → Jim Rollenhagen (jim-rollenhagen)
status: New → In Progress
Changed in nova:
assignee: Jim Rollenhagen (jim-rollenhagen) → Lars Kellogg-Stedman (larsks)
melanie witt (melwitt)
Changed in nova:
importance: Undecided → Medium
tags: added: compute ironic
Changed in nova:
assignee: Lars Kellogg-Stedman (larsks) → Jim Rollenhagen (jim-rollenhagen)
Revision history for this message
Jim Rollenhagen (jim-rollenhagen) wrote :

Patch is here: https://review.openstack.org/#/c/636699/

This should be backported to Rocky as well.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/636699
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=19cb8280232fd3b0ba0000a475d061ea9fb10e1a
Submitter: Zuul
Branch: master

commit 19cb8280232fd3b0ba0000a475d061ea9fb10e1a
Author: Jim Rollenhagen <email address hidden>
Date: Wed Feb 13 12:59:53 2019 -0500

    ironic: check fresh data when sync_power_state doesn't line up

    We return cached data to sync_power_state to avoid pummeling the ironic
    API. However, this can lead to a race condition where an instance is
    powered on, but nova thinks it should be off and calls stop(). Check
    again without the cache when this happens to make sure we don't
    unnecessarily kill an instance.

    Closes-Bug: #1815791
    Change-Id: I907b69eb689cf6c169a4869cfc7889308ca419d5

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/640772

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/640772
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=81b0e5a0680883d68d12253af64f8e10200df746
Submitter: Zuul
Branch: stable/rocky

commit 81b0e5a0680883d68d12253af64f8e10200df746
Author: Jim Rollenhagen <email address hidden>
Date: Wed Feb 13 12:59:53 2019 -0500

    ironic: check fresh data when sync_power_state doesn't line up

    We return cached data to sync_power_state to avoid pummeling the ironic
    API. However, this can lead to a race condition where an instance is
    powered on, but nova thinks it should be off and calls stop(). Check
    again without the cache when this happens to make sure we don't
    unnecessarily kill an instance.

    Closes-Bug: #1815791
    Change-Id: I907b69eb689cf6c169a4869cfc7889308ca419d5
    (cherry picked from commit 19cb8280232fd3b0ba0000a475d061ea9fb10e1a)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.2.0

This issue was fixed in the openstack/nova 18.2.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.