Inconsistent value for vcpu_used
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | Fix Released | High | Maciej Jozefczyk | |
| Pike | Fix Released | High | Balazs Gibizer | |
| Queens | Fix Released | High | Radoslav Gerganov | |
| Rocky | Fix Released | High | Radoslav Gerganov | |
Bug Description
Description
===========
Nova updates hypervisor resources using the resource tracker function _update_available_resource().
In the case of *stopped* instances this can lead to inconsistent values for resources such as vcpu_used.
The resources are taken from self.driver.get_available_resource():
https:/
https:/
This function calculates the allocated vCPUs based on _get_vcpu_total():
https:/
As we can see, _get_vcpu_total() calls *self._host.list_guests()*, which by default does not list stopped instances.
At the end of the resource update process, _update_available_resource() runs:
> /opt/stack/nova/nova/compute/resource_tracker.py
677 @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
678 def _update_available_resource(self, context, resources):
679
681     # initialize the compute node object, creating it
682     # if it does not already exist.
683     self._init_compute_node(context, resources)
This initializes the compute node object with resources that were calculated without the stopped instances. If the compute node object already exists it *UPDATES* its fields, so *for a while nova-api reports resource values other than the real ones.*
731     # update the compute_node
732     self._update(context, cn)
The inconsistency is fixed automatically later in the same code path:
https:/
But on heavily loaded hypervisors (e.g. 100 active instances and 30 stopped ones) the wrong information stays in the nova database for about 4-5 seconds (in my use case). This can cause other issues, such as spawning on an already full hypervisor, because the scheduler has wrong information about hypervisor usage.
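To make the mechanism concrete, here is a minimal, self-contained sketch (hypothetical data and helper names, not nova's actual code) of why counting vCPUs over only the running guests under-reports usage:

```python
# Toy model of the inconsistency (illustrative names, not nova code).
# Stopped guests are excluded from the hypervisor's guest list, so a vCPU
# count taken from that list under-reports the real allocation.

guests = [
    {"name": "vm1", "vcpus": 4, "state": "running"},
    {"name": "vm2", "vcpus": 2, "state": "running"},
    {"name": "vm3", "vcpus": 2, "state": "shutoff"},  # stopped instance
]

def list_guests(only_running=True):
    """Mimics a hypervisor API that, by default, lists running domains only."""
    return [g for g in guests if g["state"] == "running" or not only_running]

def vcpus_used(only_running=True):
    return sum(g["vcpus"] for g in list_guests(only_running))

print(vcpus_used())                    # -> 6, the value written to the DB first
print(vcpus_used(only_running=False))  # -> 8, the value after the later correction
```

The gap between the two numbers is what briefly lands in the database: the first write counts only running guests, and the correction that accounts for stopped instances arrives a few seconds later.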
Steps to reproduce
==================
1) Start devstack
2) Create 120 instances
3) Stop some instances
4) Watch the vcpu_used value fluctuate in nova hypervisor-show
nova hypervisor-show e6dfc16b-
Expected result
===============
The returned values should stay constant for the duration of the test.
Actual result
=============
while true; do echo -n "$(date) "; echo "select hypervisor_
Thu Nov 2 14:50:09 UTC 2017 example.
[... 32 more rows of the same polling output, several per second up to 14:50:19 UTC; the hostname and the queried values are truncated in this copy of the report ...]
Bad values were stored in the nova DB for about 5 seconds. During this time nova-scheduler could place new instances on this host.
Environment
===========
Devstack master (f974e3c3566f37
Releases at least as far back as Newton are certainly affected as well.
description: updated
tags: added: resource-tracker
Changed in nova:
assignee: nobody → Maciej Jozefczyk (maciej.jozefczyk)
status: New → In Progress
Changed in nova:
assignee: Maciej Jozefczyk (maciej.jozefczyk) → Minho Ban (mhban)
Changed in nova:
assignee: Maciej Jozefczyk (maciej.jozefczyk) → Eric Fried (efried)
Changed in nova:
assignee: Eric Fried (efried) → Maciej Jozefczyk (maciej.jozefczyk)
no longer affects: nova/ocata
I see the following possible solutions:
1. Change _init_compute_node() in _update_available_resource() so that it does not call self._update(), perhaps by introducing a new boolean parameter in the _init_compute_node() arguments that suppresses the self._update() call.
2. Add some kind of DB transaction (I don't think this is a good idea).
3. Modify the calls to self._host.list_guests() to list all instances, the stopped ones too, but this will almost certainly break other things.
4. Re-organize the code (?)
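Option 1 could be sketched roughly as follows. This is a toy model, not the real resource tracker: the update_db flag and the simplified method bodies are illustrative assumptions, and the hard-coded resource values only stand in for the "wrong first, corrected later" sequence described above.

```python
# Sketch of option 1: let the caller suppress the premature DB write.
# All names and values here are illustrative; the real change would live
# in nova/compute/resource_tracker.py.

class ResourceTracker:
    def __init__(self):
        self.compute_node = None
        self.db_writes = []

    def _update(self, node):
        # Persists the compute node record (here: just record the write).
        self.db_writes.append(dict(node))

    def _init_compute_node(self, resources, update_db=True):
        if self.compute_node is None:
            self.compute_node = dict(resources)
        else:
            self.compute_node.update(resources)
        if update_db:  # new flag: skip the early, possibly inconsistent write
            self._update(self.compute_node)

    def _update_available_resource(self, resources):
        # Defer persisting until the final, corrected values are known.
        self._init_compute_node(resources, update_db=False)
        resources["vcpus_used"] = 8  # corrected after counting stopped guests
        self.compute_node.update(resources)
        self._update(self.compute_node)

rt = ResourceTracker()
rt._update_available_resource({"vcpus_used": 6})
print(len(rt.db_writes))  # -> 1: only the final, consistent value reaches the DB
```

With the flag in place, observers polling the database never see the intermediate value that excluded the stopped instances, which is exactly the window the bug report measures at 4-5 seconds.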