empty usage information in numa_topology of compute_node table after 2 min (approximately)

Bug #1739349 reported by Minho Ban
This bug report is a duplicate of: Bug #1729621: Inconsistent value for vcpu_used.
This bug affects 1 person
Affects                     Status        Importance   Assigned to   Milestone
OpenStack Compute (nova)    In Progress   Undecided    Minho Ban
  Pike                      New           Undecided    Unassigned

Bug Description

Description
===========
Since Ocata, the usage information in numa_topology of compute_nodes in the DB disappears around 2 minutes after a VM is spawned.

Steps to reproduce
==================
* Enable NUMATopologyFilter so that CPU pinning can be used
* Launch a VM with a flavor that has a NUMA-related extra spec such as hw:cpu_policy=dedicated or hw:mem_page_size=large (example commands after this list)
* Check numa_topology in the compute_nodes table of the nova DB to confirm that NUMA usage has been applied
* Wait for about 2 minutes
* Check numa_topology in the compute_nodes table again and note that the NUMA usage has been reset
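
For reference, a flavor that triggers the NUMA path can be set up roughly like this (flavor/server names, sizes, and the <image>/<net-uuid> placeholders are only illustrative):

$ openstack flavor create --vcpus 4 --ram 1024 --disk 10 numa.test
$ openstack flavor set numa.test --property hw:cpu_policy=dedicated
$ # or: openstack flavor set numa.test --property hw:mem_page_size=large
$ openstack server create --flavor numa.test --image <image> --nic net-id=<net-uuid> numa-vm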

Expected result
===============

There should be no change in the DB; the NUMA usage information should persist.

Actual result
=============

numa_topology of compute_nodes has been reset (the usage information is gone).

Environment
===========
1. RDO Ocata

2. CentOS

Logs & Configs
==============

NUMA usage information is present right after a VM is spawned (note pinned_cpus and memory_usage in the first cell):

$ mysql -s nova -e "select numa_topology from compute_nodes where host='ocata1';"
numa_topology
{"nova_object.version": "1.2", "nova_object.changes": ["cells"], "nova_object.name": "NUMATopology", "nova_object.data": {"cells": [{"nova_object.version": "1.2", "nova_object.changes": ["cpu_usage", "memory_usage", "cpuset", "pinned_cpus", "siblings", "memory", "mempages", "id"], "nova_object.name": "NUMACell", "nova_object.data": {"cpu_usage": 4, "memory_usage": 1024, "cpuset": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], "pinned_cpus": [16, 17, 10, 11], "siblings": [[16, 17], [10, 11], [4, 5], [8, 9], [12, 13], [2, 3], [14, 15], [6, 7], [18, 19]], "memory": 20479, "mempages": [{"nova_object.version": "1.1", "nova_object.changes": ["used", "total", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 4456317, "reserved": 0, "size_kb": 4}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.1", "nova_object.changes": ["total", "used", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 1, "total": 3, "reserved": 0, "size_kb": 1048576}, "nova_object.namespace": "nova"}], "id": 0}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.2", "nova_object.changes": ["cpu_usage", "memory_usage", "cpuset", "pinned_cpus", "siblings", "memory", "mempages", "id"], "nova_object.name": "NUMACell", "nova_object.data": {"cpu_usage": 0, "memory_usage": 0, "cpuset": [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], "pinned_cpus": [], "siblings": [[32, 33], [36, 37], [22, 23], [24, 25], [28, 29], [30, 31], [38, 39], [26, 27], [34, 35]], "memory": 20480, "mempages": [{"nova_object.version": "1.1", "nova_object.changes": ["used", "total", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 4718592, "reserved": 0, "size_kb": 4}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.1", "nova_object.changes": ["used", "total", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 2, "reserved": 0, "size_kb": 1048576}, "nova_object.namespace": "nova"}], "id": 1}, "nova_object.namespace": "nova"}]}, "nova_object.namespace": "nova"}

But after roughly 2 minutes, the usage information in numa_topology is gone:

# mysql -s nova -e "select numa_topology from compute_nodes where host='ocata1';"
numa_topology
{"nova_object.version": "1.2", "nova_object.changes": ["cells"], "nova_object.name": "NUMATopology", "nova_object.data": {"cells": [{"nova_object.version": "1.2", "nova_object.changes": ["cpu_usage", "memory_usage", "cpuset", "mempages", "pinned_cpus", "memory", "siblings", "id"], "nova_object.name": "NUMACell", "nova_object.data": {"cpu_usage": 0, "memory_usage": 0, "cpuset": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], "pinned_cpus": [], "siblings": [[16, 17], [10, 11], [4, 5], [8, 9], [12, 13], [2, 3], [14, 15], [6, 7], [18, 19]], "memory": 20479, "mempages": [{"nova_object.version": "1.1", "nova_object.changes": ["total", "used", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 4456317, "reserved": 0, "size_kb": 4}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.1", "nova_object.changes": ["total", "used", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 3, "reserved": 0, "size_kb": 1048576}, "nova_object.namespace": "nova"}], "id": 0}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.2", "nova_object.changes": ["cpu_usage", "memory_usage", "cpuset", "mempages", "pinned_cpus", "memory", "siblings", "id"], "nova_object.name": "NUMACell", "nova_object.data": {"cpu_usage": 0, "memory_usage": 0, "cpuset": [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], "pinned_cpus": [], "siblings": [[32, 33], [36, 37], [22, 23], [24, 25], [28, 29], [30, 31], [38, 39], [26, 27], [34, 35]], "memory": 20480, "mempages": [{"nova_object.version": "1.1", "nova_object.changes": ["total", "used", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 4718592, "reserved": 0, "size_kb": 4}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.1", "nova_object.changes": ["total", "used", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 2, "reserved": 0, "size_kb": 1048576}, "nova_object.namespace": "nova"}], "id": 1}, "nova_object.namespace": "nova"}]}, "nova_object.namespace": "nova"}

Revision history for this message
Minho Ban (mhban) wrote :

On the first round of the RT (resource tracker) periodic task everything is OK, since the usage information is recovered and updated by _update_usage_from_instances() followed by _update(). But on the 2nd round the information is clobbered by _copy_resources() overwriting the compute node with resources that contain no NUMA usage information ( https://github.com/openstack/nova/blob/master/nova/compute/resource_tracker.py#L561 ).

That's why this issue shows up roughly 2 minutes later.

Revision history for this message
Minho Ban (mhban) wrote :

After the 2nd round of the RT the information is never recovered (updated), since from the tracker's point of view nothing has changed.
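
Roughly, each pass of the RT periodic task does the following (paraphrased sketch of the Ocata-era flow, not the actual code):

    1. resources = driver.get_available_resource(nodename)
         -> usage-free view straight from the hypervisor
    2. _init_compute_node(context, resources)
         -> _copy_resources(resources) overwrites compute_node.numa_topology
            with that usage-free view; since the commit quoted in the next
            comment, update_resource_stats(compute_node) is called right here
            and persists the usage-free topology to the DB
    3. _update_usage_from_instances(context, instances)
         -> re-applies pinned_cpus / memory_usage, but only on the in-memory object
    4. _update(context, compute_node)
         -> saves only when a change is detected; on the 2nd and later passes the
            re-applied values match the cached copy, so nothing is written back
            and the DB keeps the usage-free topology from step 2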

summary: - empty usage information in numa_topology of compute_node table after
- restart nova-compute
+ empty usage information in numa_topology of compute_node table after 2
+ min
Revision history for this message
Minho Ban (mhban) wrote :

This bug has been present since the commit below.

commit b6edbce9b3b092581eeda6be15059a1ba2bfcfdf
Author: He Jie Xu <email address hidden>
Date: Thu Sep 8 19:46:24 2016 +0800

    Ensure ResourceProvider/Inventory created before add Allocations record

diff --git a/nova/compute/resource_tracker.py b/nova/compute/resource_tracker.py
index 972b084..8857862 100644
--- a/nova/compute/resource_tracker.py
+++ b/nova/compute/resource_tracker.py
@@ -419,6 +419,7 @@ class ResourceTracker(object):
         if self.compute_node:
             self._copy_resources(resources)
             self._setup_pci_tracker(context, resources)
+            self.scheduler_client.update_resource_stats(self.compute_node)
             return

         # now try to get the compute node record from the
@@ -427,6 +428,7 @@ class ResourceTracker(object):
         if self.compute_node:
             self._copy_resources(resources)
             self._setup_pci_tracker(context, resources)
+            self.scheduler_client.update_resource_stats(self.compute_node)
             return

         # there was no local copy and none in the database
@@ -441,6 +443,7 @@ class ResourceTracker(object):
                  {'host': self.host, 'node': self.nodename})

         self._setup_pci_tracker(context, resources)
+        self.scheduler_client.update_resource_stats(self.compute_node)

I'm not quite sure why such a DB update is required here, but because of it the NUMA usage information goes away. If the patch only aimed to ensure the resource provider exists, I think it should have been implemented a different way.
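
For context, if I read the Ocata scheduler report client correctly, the added call boils down to roughly the following (simplified paraphrase, not the exact code):

    def update_resource_stats(self, compute_node):
        compute_node.save()   # persists whatever is on the object right now
        self._ensure_resource_provider(compute_node.uuid,
                                       compute_node.hypervisor_hostname)
        # ... followed by the placement inventory update

So calling it right after _copy_resources() persists the usage-free numa_topology before _update_usage_from_instances() has had a chance to run.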

For now, the error described in bug #1621437 no longer seems to occur, even without the patch.

summary: empty usage information in numa_topology of compute_node table after 2
- min
+ min (approximately)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/529236

Changed in nova:
assignee: nobody → Minho Ban (mhban)
status: New → In Progress
Revision history for this message
Minho Ban (mhban) wrote :

This looks related to (or a duplicate of) https://bugs.launchpad.net/nova/+bug/1729621, but that one takes a different approach.

Matt Riedemann (mriedem)
tags: added: compute resour
tags: added: resource-tracker
removed: resour
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/529236
Reason: This was resolved in https://review.opendev.org/#/c/520024/
