Inconsistent power and vm states for physical instances when doing nova start/stop

Bug #1832720 reported by Surya Seetharaman on 2019-06-13
This bug affects 1 person
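+--------------------------+--------+------------+-------------------+-----------+
| Affects                  | Status | Importance | Assigned to       | Milestone |
+--------------------------+--------+------------+-------------------+-----------+
| OpenStack Compute (nova) |        | Medium     | Surya Seetharaman |           |
| Rocky                    |        | Undecided  | Surya Seetharaman |           |
| Stein                    |        | Undecided  | Surya Seetharaman |           |
+--------------------------+--------+------------+-------------------+-----------+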

Bug Description

To improve performance, a cache was added to the ironic driver to store nodes (and hence their power states) in https://github.com/openstack/nova/commit/9d5fb1b58e908ccacbbbf29341918d0b0588a36f#diff-1e4547e2c3b36b8f836d8f851f85fde7. Later, https://github.com/openstack/nova/commit/19cb8280232fd3b0ba0000a475d061ea9fb10e1a#diff-1e4547e2c3b36b8f836d8f851f85fde7 added a "use_cache" option to remove the inconsistencies this cache caused during the power_sync periodic task. However, when we do nova start/stop on an instance, the power_state of the instance is still obtained from the cache (https://github.com/openstack/nova/blob/f298973520420710a617e4d79e853f2416b29786/nova/compute/manager.py#L1284). As a result, the CLI list/show output reports mismatched vm_states and power_states for a considerable amount of time (presumably until the next periodic power sync between nova and ironic, which depends on the sync_power_state_interval config option) before the cache is refreshed to reflect the correct states:

+--------------------------------------+---------+---------+------------+-------------+-------------------------------------------------------+
| ID                                   | Name    | Status  | Task State | Power State | Networks                                              |
+--------------------------------------+---------+---------+------------+-------------+-------------------------------------------------------+
| cd38b5c1-80dc-425d-8b8e-f523dc60e6ba | test000 | ACTIVE  | -          | Shutdown    | private=fde8:a67c:e94e:0:5054:ff:fe28:5da1, 10.0.0.31 |
| cd38b5c1-80dc-425d-8b8e-f523dc60e6ba | test000 | SHUTOFF | -          | Running     | private=fde8:a67c:e94e:0:5054:ff:fe28:5da1, 10.0.0.31 |
+--------------------------------------+---------+---------+------------+-------------+-------------------------------------------------------+

The code comment specifies that the cache should be refreshed during every resource tracker (RT) periodic update, which should happen every 60 seconds (https://github.com/openstack/nova/blob/61558f274842b149044a14bbe7537b9f278035fd/nova/virt/ironic/driver.py#L989), but the inconsistency seems to last for more than a minute, which is confusing for the user. "use_cache" should be set to False for these actions to avoid conflicting vm and power states.
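To make the cached-versus-fresh distinction concrete, here is a minimal toy sketch. ToyIronicAPI, ToyDriver and their methods are hypothetical stand-ins invented for illustration; they are not nova or ironic classes, and the real driver code is more involved.

```python
# Toy model only: ToyIronicAPI and ToyDriver are hypothetical stand-ins,
# not nova or ironic classes.


class ToyIronicAPI:
    """Holds the real power state of each node."""

    def __init__(self):
        self.power = {}

    def get_power_state(self, node_id):
        return self.power[node_id]


class ToyDriver:
    """Caches node power states and refreshes them only on demand."""

    def __init__(self, api):
        self.api = api
        self.node_cache = {}

    def refresh_cache(self):
        # Stands in for the periodic refresh of the node cache.
        self.node_cache = dict(self.api.power)

    def get_power_state(self, node_id, use_cache=True):
        if use_cache and node_id in self.node_cache:
            return self.node_cache[node_id]
        # Bypass the cache and ask the backing service directly.
        return self.api.get_power_state(node_id)


api = ToyIronicAPI()
api.power["node-1"] = "Running"
driver = ToyDriver(api)
driver.refresh_cache()

# The node is powered off after the cache was filled.
api.power["node-1"] = "Shutdown"

cached = driver.get_power_state("node-1")                    # stale value
fresh = driver.get_power_state("node-1", use_cache=False)    # current value
print(cached, fresh)
```

In this toy model the cached read still reports the old state until refresh_cache() runs again, while the read with use_cache=False reflects the power-off immediately.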

Surya Seetharaman (tssurya) wrote :

Okay, so I think the bug occurs because RT no longer calls the "get_available_nodes" function (https://github.com/openstack/nova/blob/9f28727eb75e05e07bad51b6eecce667d09dfb65/nova/virt/ironic/driver.py#L725). I think the cache should either be refreshed in "get_available_resources", or the refresh should be requested separately per function call during start/stop, etc.

Surya Seetharaman (tssurya) wrote :

Ignore the previous comment: the cache refresh does happen during every periodic task, as stated in the code comment, through the "get_available_nodes" function (https://github.com/openstack/nova/blob/a628d2f09a42a0faa5fcb36793e2304de634638e/nova/compute/manager.py#L8123). The reason for the inconsistency is that during nova start/stop the power_state is read from the cache at a time when the cache is not up to date, and that value is then saved to the database. So even once the cache is refreshed, the database already holds the wrong information, and it stays wrong until the next power sync (10 minutes by default) corrects it. So we should grab the latest info during the start/stop actions.
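The timing argument can be checked with a small simulation. This is a hypothetical sketch, not nova code: it assumes a 60-second cache refresh and a 600-second power sync (the values mentioned in this report), assumes both fire on a fixed schedule starting at t=0, and assumes the power sync always writes the real state to the database.

```python
# Hypothetical simulation; names and structure are illustrative, not nova code.
CACHE_REFRESH = 60   # seconds between node cache refreshes
POWER_SYNC = 600     # seconds between power syncs


def stale_window(action_time, use_cache):
    """Seconds the database shows the wrong power state after a stop.

    The stop is issued at action_time; the node was running before that.
    """
    real = cache = db = "Running"
    for t in range(2 * POWER_SYNC + 1):
        if t > 0 and t % CACHE_REFRESH == 0:
            cache = real          # periodic cache refresh
        if t > 0 and t % POWER_SYNC == 0:
            db = real             # periodic power sync corrects the database
        if t == action_time:
            real = "Shutdown"     # the stop action takes effect
            # The action saves whatever power state it reads.
            db = cache if use_cache else real
        if t >= action_time and db == real:
            return t - action_time
    return None


print(stale_window(10, use_cache=True))
print(stale_window(10, use_cache=False))
```

Under these assumptions, a stop issued at t=10 leaves the database wrong for 590 seconds when the cached value is saved, even though the cache itself is corrected at t=60. Reading the fresh state at action time gives a window of 0.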

Fix proposed to branch: master
Review: https://review.opendev.org/665975

Changed in nova:
status: New → In Progress

Reviewed: https://review.opendev.org/665975
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5b1c9dd05ccc0488e4d47bfc2442e46adb82b765
Submitter: Zuul
Branch: master

commit 5b1c9dd05ccc0488e4d47bfc2442e46adb82b765
Author: Surya Seetharaman <email address hidden>
Date: Tue Jun 18 13:59:47 2019 +0200

    Grab fresh power state info from the driver

    In drivers that use a cache to store the node info (presently only
    ironic since [1]), the "_get_power_state" function called during
    instance actions like start or stop grabs the information from the node
    cache and saves it in the nova database instead of getting fresh
    information from the driver. This leads to inconsistency between
    the vm_state and power_state for an instance in the nova database
    (which remains until a power_sync happens between nova and ironic).
    This can be confusing for a user when doing "nova list" where the
    power_state might still be shutdown when the vm_state has already
    become active. On a default environment this inconsistency lasts
    for about ten minutes which is the default value for the
    sync_power_state_interval interval.

    This patch changes the "use_cache" to False in the compute manager
    when triggering an action on an instance like start/stop/reboot.

    [1] I907b69eb689cf6c169a4869cfc7889308ca419d5

    Change-Id: I8bca5d84c37d02331d2f9968a674f3398c1a8f5b
    Closes-Bug: #1832720
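As a rough illustration of the caller-side change described in the commit message, the sketch below uses unittest.mock to check that a helper always asks the driver for uncached information. The helper get_power_state and the get_info(instance, use_cache=...) call shape are assumptions made for this example, not a copy of the actual patch.

```python
from unittest import mock


def get_power_state(driver, instance):
    """Hypothetical helper: always bypass the driver's node cache."""
    return driver.get_info(instance, use_cache=False).state


# A mock driver whose get_info() returns an object with a .state attribute.
driver = mock.Mock()
driver.get_info.return_value = mock.Mock(state="Shutdown")

state = get_power_state(driver, "instance-uuid")

# Raises AssertionError if the cache was not bypassed explicitly.
driver.get_info.assert_called_once_with("instance-uuid", use_cache=False)
print(state)
```

A check like this only verifies how the driver is called; whether the driver then really skips its cache has to be tested on the driver side.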

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/667948
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=16449cbfe97eef1500f5a73a9d30e7abf4249b99
Submitter: Zuul
Branch: stable/stein

commit 16449cbfe97eef1500f5a73a9d30e7abf4249b99
Author: Surya Seetharaman <email address hidden>
Date: Tue Jun 18 13:59:47 2019 +0200

    Grab fresh power state info from the driver

    In drivers that use a cache to store the node info (presently only
    ironic since [1]), the "_get_power_state" function called during
    instance actions like start or stop grabs the information from the node
    cache and saves it in the nova database instead of getting fresh
    information from the driver. This leads to inconsistency between
    the vm_state and power_state for an instance in the nova database
    (which remains until a power_sync happens between nova and ironic).
    This can be confusing for a user when doing "nova list" where the
    power_state might still be shutdown when the vm_state has already
    become active. On a default environment this inconsistency lasts
    for about ten minutes which is the default value for the
    sync_power_state_interval interval.

    This patch changes the "use_cache" to False in the compute manager
    when triggering an action on an instance like start/stop/reboot.

    [1] I0069cbc327d952d42dbb8fe54949faab89995a7e

    Change-Id: I8bca5d84c37d02331d2f9968a674f3398c1a8f5b
    Closes-Bug: #1832720
    (cherry picked from commit 5b1c9dd05ccc0488e4d47bfc2442e46adb82b765)

Reviewed: https://review.opendev.org/667955
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0ac4a97204b129a10e9cc7fa1154dc24711281bd
Submitter: Zuul
Branch: stable/rocky

commit 0ac4a97204b129a10e9cc7fa1154dc24711281bd
Author: Surya Seetharaman <email address hidden>
Date: Tue Jun 18 13:59:47 2019 +0200

    Grab fresh power state info from the driver

    In drivers that use a cache to store the node info (presently only
    ironic since [1]), the "_get_power_state" function called during
    instance actions like start or stop grabs the information from the node
    cache and saves it in the nova database instead of getting fresh
    information from the driver. This leads to inconsistency between
    the vm_state and power_state for an instance in the nova database
    (which remains until a power_sync happens between nova and ironic).
    This can be confusing for a user when doing "nova list" where the
    power_state might still be shutdown when the vm_state has already
    become active. On a default environment this inconsistency lasts
    for about ten minutes which is the default value for the
    sync_power_state_interval interval.

    This patch changes the "use_cache" to False in the compute manager
    when triggering an action on an instance like start/stop/reboot.

    [1] I0069cbc327d952d42dbb8fe54949faab89995a7e

    Change-Id: I8bca5d84c37d02331d2f9968a674f3398c1a8f5b
    Closes-Bug: #1832720
    (cherry picked from commit 5b1c9dd05ccc0488e4d47bfc2442e46adb82b765)
    (cherry picked from commit 16449cbfe97eef1500f5a73a9d30e7abf4249b99)

This issue was fixed in the openstack/nova 19.0.2 release.

This issue was fixed in the openstack/nova 18.2.2 release.

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.
