empty usage information in numa_topology of compute_node table after 2 min (approximately)

Bug #1739349 reported by Minho Ban on 2017-12-20
This bug report is a duplicate of: Bug #1729621: Inconsistent value for vcpu_used.
This bug affects 1 person
Affects                   Status  Importance  Assigned to  Milestone
OpenStack Compute (nova)  -       Undecided   Minho Ban    -
Pike                      New     Undecided   Unassigned   -

Bug Description

Description
===========
Since Ocata, the usage information in numa_topology of compute_nodes in the DB disappears about 2 minutes after a VM is spawned.

Steps to reproduce
==================
* Enable NUMATopologyFilter to use vCPU pinning
* Launch a VM with a flavor that has NUMA-related extra specs such as hw:cpu_policy=dedicated or hw:mem_page_size=large
* Check numa_topology of compute_nodes in the nova DB to confirm that the NUMA usage has been applied
* Wait for about 2 minutes
* Check numa_topology of compute_nodes in the nova DB again to see whether the NUMA usage has been reset

Expected result
===============

There should be no changes in the DB.

Actual result
=============

numa_topology of compute_nodes has been reset (the usage information is gone).

Environment
===========
1. RDO Ocata

2. CentOS

Logs & Configs
==============

NUMA usage information is present right after a VM is spawned (note pinned_cpus and memory_usage).

$ mysql -s nova -e "select numa_topology from compute_nodes where host='ocata1';"
numa_topology
{"nova_object.version": "1.2", "nova_object.changes": ["cells"], "nova_object.name": "NUMATopology", "nova_object.data": {"cells": [{"nova_object.version": "1.2", "nova_object.changes": ["cpu_usage", "memory_usage", "cpuset", "pinned_cpus", "siblings", "memory", "mempages", "id"], "nova_object.name": "NUMACell", "nova_object.data": {"cpu_usage": 4, "memory_usage": 1024, "cpuset": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], "pinned_cpus": [16, 17, 10, 11], "siblings": [[16, 17], [10, 11], [4, 5], [8, 9], [12, 13], [2, 3], [14, 15], [6, 7], [18, 19]], "memory": 20479, "mempages": [{"nova_object.version": "1.1", "nova_object.changes": ["used", "total", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 4456317, "reserved": 0, "size_kb": 4}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.1", "nova_object.changes": ["total", "used", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 1, "total": 3, "reserved": 0, "size_kb": 1048576}, "nova_object.namespace": "nova"}], "id": 0}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.2", "nova_object.changes": ["cpu_usage", "memory_usage", "cpuset", "pinned_cpus", "siblings", "memory", "mempages", "id"], "nova_object.name": "NUMACell", "nova_object.data": {"cpu_usage": 0, "memory_usage": 0, "cpuset": [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], "pinned_cpus": [], "siblings": [[32, 33], [36, 37], [22, 23], [24, 25], [28, 29], [30, 31], [38, 39], [26, 27], [34, 35]], "memory": 20480, "mempages": [{"nova_object.version": "1.1", "nova_object.changes": ["used", "total", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 4718592, "reserved": 0, "size_kb": 4}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.1", "nova_object.changes": ["used", "total", "reserved", "size_kb"], "nova_object.name": 
"NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 2, "reserved": 0, "size_kb": 1048576}, "nova_object.namespace": "nova"}], "id": 1}, "nova_object.namespace": "nova"}]}, "nova_object.namespace": "nova"}

But after 2 minutes (approximately), the usage information of numa_topology was missing.

# mysql -s nova -e "select numa_topology from compute_nodes where host='ocata1';"
numa_topology
{"nova_object.version": "1.2", "nova_object.changes": ["cells"], "nova_object.name": "NUMATopology", "nova_object.data": {"cells": [{"nova_object.version": "1.2", "nova_object.changes": ["cpu_usage", "memory_usage", "cpuset", "mempages", "pinned_cpus", "memory", "siblings", "id"], "nova_object.name": "NUMACell", "nova_object.data": {"cpu_usage": 0, "memory_usage": 0, "cpuset": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], "pinned_cpus": [], "siblings": [[16, 17], [10, 11], [4, 5], [8, 9], [12, 13], [2, 3], [14, 15], [6, 7], [18, 19]], "memory": 20479, "mempages": [{"nova_object.version": "1.1", "nova_object.changes": ["total", "used", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 4456317, "reserved": 0, "size_kb": 4}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.1", "nova_object.changes": ["total", "used", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 3, "reserved": 0, "size_kb": 1048576}, "nova_object.namespace": "nova"}], "id": 0}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.2", "nova_object.changes": ["cpu_usage", "memory_usage", "cpuset", "mempages", "pinned_cpus", "memory", "siblings", "id"], "nova_object.name": "NUMACell", "nova_object.data": {"cpu_usage": 0, "memory_usage": 0, "cpuset": [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], "pinned_cpus": [], "siblings": [[32, 33], [36, 37], [22, 23], [24, 25], [28, 29], [30, 31], [38, 39], [26, 27], [34, 35]], "memory": 20480, "mempages": [{"nova_object.version": "1.1", "nova_object.changes": ["total", "used", "reserved", "size_kb"], "nova_object.name": "NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 4718592, "reserved": 0, "size_kb": 4}, "nova_object.namespace": "nova"}, {"nova_object.version": "1.1", "nova_object.changes": ["total", "used", "reserved", "size_kb"], "nova_object.name": 
"NUMAPagesTopology", "nova_object.data": {"used": 0, "total": 2, "reserved": 0, "size_kb": 1048576}, "nova_object.namespace": "nova"}], "id": 1}, "nova_object.namespace": "nova"}]}, "nova_object.namespace": "nova"}
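
The two dumps above are easier to compare when the usage fields are pulled out per cell. The following is a minimal sketch, assuming the column value is the serialized NUMATopology JSON shown above. The helper names summarize_numa_usage and has_usage are hypothetical and are not part of nova.

```python
import json


def summarize_numa_usage(raw):
    """Return per-cell usage from a serialized NUMATopology JSON string.

    Hypothetical inspection helper: it reads only the keys that appear
    in the dumps above and is not part of nova.
    """
    topology = json.loads(raw)
    summary = []
    for cell in topology["nova_object.data"]["cells"]:
        data = cell["nova_object.data"]
        summary.append({
            "id": data["id"],
            "cpu_usage": data["cpu_usage"],
            "memory_usage": data["memory_usage"],
            "pinned_cpus": sorted(data["pinned_cpus"]),
            # Map page size in KiB to the number of pages in use.
            "pages_used": {
                page["nova_object.data"]["size_kb"]:
                    page["nova_object.data"]["used"]
                for page in data["mempages"]
            },
        })
    return summary


def has_usage(summary):
    """True if any cell reports CPU, memory, pinning, or page usage."""
    return any(
        cell["cpu_usage"]
        or cell["memory_usage"]
        or cell["pinned_cpus"]
        or any(cell["pages_used"].values())
        for cell in summary
    )
```

Feeding the output of the mysql query into summarize_numa_usage() before and after the 2 minute wait should make the reset visible at a glance: has_usage() would be expected to flip from True to False for the dumps in this report.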

Minho Ban (mhban) wrote :

In the first round of the resource tracker (RT) periodic update everything is fine, because the usage information is recovered and written by _update_usage_from_instances() followed by _update(). In the second round, however, the information is clobbered by _copy_resources(), which overwrites the compute node (CN) with resources that carry empty NUMA usage information (https://github.com/openstack/nova/blob/master/nova/compute/resource_tracker.py#L561).

That is why the issue appears about 2 minutes later.

Minho Ban (mhban) wrote :

After the second round of the RT, the information is never recovered (updated), since nothing has actually changed.

summary: - empty usage information in numa_topology of compute_node table after
- restart nova-compute
+ empty usage information in numa_topology of compute_node table after 2
+ min
Minho Ban (mhban) wrote :

This bug has appeared since the commit below.

commit b6edbce9b3b092581eeda6be15059a1ba2bfcfdf
Author: He Jie Xu <email address hidden>
Date: Thu Sep 8 19:46:24 2016 +0800

    Ensure ResourceProvider/Inventory created before add Allocations record

diff --git a/nova/compute/resource_tracker.py b/nova/compute/resource_tracker.py
index 972b084..8857862 100644
--- a/nova/compute/resource_tracker.py
+++ b/nova/compute/resource_tracker.py
@@ -419,6 +419,7 @@ class ResourceTracker(object):
         if self.compute_node:
             self._copy_resources(resources)
             self._setup_pci_tracker(context, resources)
+            self.scheduler_client.update_resource_stats(self.compute_node)
             return

         # now try to get the compute node record from the
@@ -427,6 +428,7 @@ class ResourceTracker(object):
         if self.compute_node:
             self._copy_resources(resources)
             self._setup_pci_tracker(context, resources)
+            self.scheduler_client.update_resource_stats(self.compute_node)
             return

         # there was no local copy and none in the database
@@ -441,6 +443,7 @@ class ResourceTracker(object):
                  {'host': self.host, 'node': self.nodename})

         self._setup_pci_tracker(context, resources)
+        self.scheduler_client.update_resource_stats(self.compute_node)

I'm not quite sure why this DB update is required here, but because of it the NUMA usage information goes away. I think that if the patch only aimed to provide the resource provider (RP), it should have been implemented another way.

For now, though, errors like bug #1621437 no longer seem to occur even without the patch.
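
The sequence described in the comments above can be reproduced with a toy model. This is only a sketch of my reading of those comments, not nova code: the class ToyTracker, its attributes, and the dict-based record are all hypothetical, and the "skip the final save when nothing changed" check is an assumption about how the "no changes" case is decided.

```python
import copy


class ToyTracker:
    """Toy model of the periodic update described above (not nova code)."""

    def __init__(self):
        self.db = None            # stands in for the compute_nodes row
        self.compute_node = None  # in-memory copy of the record
        self.last_saved = None    # what the final save last wrote

    def periodic_round(self, host_resources, instance_usage):
        # Step 1: copy fresh host resources, which carry no usage.
        self.compute_node = copy.deepcopy(host_resources)
        # Step 2: the early save added by the commit writes the
        # usage-free record to the database.
        self.db = copy.deepcopy(self.compute_node)
        # Step 3: usage is recomputed from the running instances.
        self.compute_node["pinned_cpus"] = list(instance_usage)
        # Step 4 (assumed): the final save is skipped when the record
        # equals what the previous final save wrote.
        if self.compute_node != self.last_saved:
            self.db = copy.deepcopy(self.compute_node)
            self.last_saved = copy.deepcopy(self.compute_node)


tracker = ToyTracker()
host = {"pinned_cpus": []}
pinned = [10, 11, 16, 17]

tracker.periodic_round(host, pinned)
round_one = list(tracker.db["pinned_cpus"])

tracker.periodic_round(host, pinned)
round_two = list(tracker.db["pinned_cpus"])

print(round_one, round_two)
```

In this model the first round ends with the pinned CPUs stored, while the second round leaves the stored record empty: the early save wipes the usage, and the final save is skipped because the in-memory record matches what was written in round one.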

summary: empty usage information in numa_topology of compute_node table after 2
- min
+ min (approximately)

Fix proposed to branch: master
Review: https://review.openstack.org/529236

Changed in nova:
assignee: nobody → Minho Ban (mhban)
status: New → In Progress
Minho Ban (mhban) wrote :

This looks related to (or a duplicate of) https://bugs.launchpad.net/nova/+bug/1729621, but it took a different approach.

Matt Riedemann (mriedem) on 2018-01-29
tags: added: compute resour
tags: added: resource-tracker
removed: resour

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/529236
Reason: This was resolved in https://review.opendev.org/#/c/520024/
