Resource Tracker fails to update usage when a NUMA topology conflict happens

Bug #1829349 reported by leehom
Affects: OpenStack Compute (nova) | Status: In Progress | Importance: Undecided | Assigned to: leehom

Bug Description

Let me first describe when this bug happens.

Assume there are 2 running VMs that were booted with a flavor containing the metadata 'hw:cpu_policy=dedicated'.

Also assume that some of these 2 VMs' vCPUs are pinned to the same physical CPUs.
Let's say VM1 is pinned to {"0": 50, "1": 22, "2": 49, "3": 21}
and VM2 is pinned to {"0": 27, "1": 55, "2": 50, "3": 22}.
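
To make the overlap concrete, here is a minimal sketch (plain Python, not nova code) that computes which host CPUs the two guest-to-host pin mappings above have in common; the overlap only becomes a real conflict once both VMs land on the same destination host, as described below:

    # Minimal illustration (not nova code): find host pCPUs claimed by both mappings.
    vm1_pinning = {"0": 50, "1": 22, "2": 49, "3": 21}   # guest vCPU -> host pCPU
    vm2_pinning = {"0": 27, "1": 55, "2": 50, "3": 22}

    conflict = set(vm1_pinning.values()) & set(vm2_pinning.values())
    print(sorted(conflict))   # [22, 50]: both VMs claim host CPUs 22 and 50

Host CPUs 22 and 50 are the ones the rest of this report refers to as conflicting.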

Refer to the patch https://opendev.org/openstack/nova/commit/52b89734426253f64b6d4797ba4d849c3020fb52 merged in the Rocky release.
By default, live migration is disabled if the instance has a NUMA topology, but it can still be enabled via CONF.workarounds.enable_numa_live_migration, and live migration is really important for us in daily operation work.

So what will happen when I live migrate these 2 VMs at the same time?

In my case, I encounter 3 problems.
#1. Because NUMA-related information is not reported to placement, the placement API will return the same candidates, and since the scheduling action is asynchronous there is a probability that VM1 and VM2 will pick the same destination host. In my case, both VMs passed the scheduler and picked the same host.

#2. Then, as BP numa-aware-live-migration [https://review.opendev.org/#/q/topic:bp/numa-aware-live-migration] is not completely implemented yet, VM1 and VM2 will keep the NUMA topology from their source hosts. As a result, after VM1 and VM2 start up, a conflict happens on host CPUs 50 and 22. A related numa-aware-live-migration bug can be found at https://bugs.launchpad.net/nova/+bug/1289064

#3. And since VM1 and VM2 now have a NUMA topology conflict, we hit the third problem, which is what the title says: the resource tracker fails to update usage. That is because when _update_usage is called in the RT, it eventually calls numa_usage_from_instances:

nova.compute.resource_tracker:_update_usage
` nova.virt.hardware:get_host_numa_usage_from_instance
  ` nova.virt.hardware:numa_usage_from_instances

And for each instance NUMA cell, numa_usage_from_instances will do the following:

                    if free:
                        if (instancecell.cpu_thread_policy ==
                                fields.CPUThreadAllocationPolicy.ISOLATE):
                            newcell.unpin_cpus_with_siblings(pinned_cpus)
                        else:
                            newcell.unpin_cpus(pinned_cpus)
                    else:
                        if (instancecell.cpu_thread_policy ==
                                fields.CPUThreadAllocationPolicy.ISOLATE):
                            newcell.pin_cpus_with_siblings(pinned_cpus)
                        else:
                            newcell.pin_cpus(pinned_cpus)

And in pin_cpus, pin_cpus_with_siblings, unpin_cpus and unpin_cpus_with_siblings, if there is a NUMA topology conflict, an exception is raised. The result is that the RT fails to update usage to the scheduler, which eventually causes the scheduler to always think this host has enough resources to boot new VMs. So the result is a disaster.
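
To show why the periodic task blows up, here is a simplified sketch of the pinning bookkeeping (assumed behaviour for illustration, not the real nova.objects.NUMACell code): pinning a host CPU that is already pinned raises instead of being tolerated.

    # Simplified sketch (not the real NUMACell): pinning an already-pinned
    # host CPU raises, which is what kills the RT's periodic usage update.
    class FakeCell(object):
        def __init__(self, cpuset):
            self.cpuset = set(cpuset)      # host CPUs in this NUMA cell
            self.pinned_cpus = set()       # host CPUs already pinned by instances

        def pin_cpus(self, cpus):
            conflict = set(cpus) & self.pinned_cpus
            if conflict:
                # In nova this is roughly CPUPinningInvalid.
                raise ValueError("CPUs %s are already pinned" % sorted(conflict))
            self.pinned_cpus |= set(cpus)

    cell = FakeCell(range(0, 56))
    cell.pin_cpus({50, 22, 49, 21})   # VM1's host CPUs: fine
    cell.pin_cpus({27, 55, 50, 22})   # VM2 overlaps on 50 and 22 -> raises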

So I think, to completely solve the problem for VMs that have a NUMA topology:

For problem #1, we need to report the NUMA topology to the placement API as well, and take the NUMA topology into account when getting candidates from placement.

For problem #2, we need to continue completing BP numa-aware-live-migration.

For problem #3, numa_usage_from_instances is used in both the RT and the scheduler. In the scheduler, numa_usage_from_instances will not hit this problem because it is called right after virt.hardware.numa_fit_instance_to_host. So I think raising an exception there has no meaning; we can just change the exception into an error log instead, as sketched below.
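
Roughly, the idea for problem #3 is the following (a sketch only, reusing the names from the excerpt above; it is not the actual proposed patch): catch the pinning conflict in the RT path and log it instead of letting it kill update_available_resource.

    # Sketch of the idea for problem #3 (illustrative only, not the actual patch).
    # Names (instancecell, newcell, fields, pinned_cpus) come from the excerpt
    # above; the exception type is assumed to be CPUPinningInvalid.
    try:
        if (instancecell.cpu_thread_policy ==
                fields.CPUThreadAllocationPolicy.ISOLATE):
            newcell.pin_cpus_with_siblings(pinned_cpus)
        else:
            newcell.pin_cpus(pinned_cpus)
    except exception.CPUPinningInvalid as e:
        # Degrade to an error log in the RT path; the scheduler path runs
        # right after numa_fit_instance_to_host and should never get here.
        LOG.error("NUMA pinning conflict while updating usage: %s", e)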

The above is a summary of the live migration issues as I see them.
This bug is focused on solving problem #3.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Just to make sure I'm following, since we don't claim resources in the resource tracker (yet) for NUMA during live migration on the dest host, both VMs landed there and are running there but with conflicting pinned CPUs and that's what causes the update_available_resource periodic task on the destination compute host to fail, until one of those VMs is deleted or migrated elsewhere, is that correct?

tags: added: live-migration numa resource-tracker
Revision history for this message
Matt Riedemann (mriedem) wrote :

Regarding this comment:

"The result is RT failed to update usage to Scheduler. And Eventually cause scheduler always think this host has enough resource to boot new VMs. So the result is disaster."

There is the BuildFailureWeigher:

https://docs.openstack.org/nova/latest/user/filter-scheduler.html#weights

https://docs.openstack.org/nova/latest/configuration/config.html#filter_scheduler.build_failure_weight_multiplier

Are the repeated build failure attempts on that host not eventually taking the host out of scheduling decisions? Because they should.
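
For anyone following along, that knob lives in nova.conf on the scheduler; a minimal example (the value shown is illustrative, see the config reference linked above for the actual default):

    # nova.conf (illustrative value, see the config reference linked above)
    [filter_scheduler]
    # Hosts with recent build failures are weighed down by this multiplier;
    # setting it to 0 disables weighing hosts by recent build failures.
    build_failure_weight_multiplier = 1000000.0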

Revision history for this message
leehom (feli5) wrote :

For #1
Yes

For #2
In my environment, for some reason, placement is not enabled.
So the scheduler can only depend on the information the RT provides to make decisions, and because of the NUMA conflict the RT fails to report usage.

As a result, the scheduler will always think this host has enough resources to boot new VMs.
In this situation, resize or create will fail their claims when there really are not enough resources available or there is a NUMA conflict, but more VMs than expected will still be booted.

Live migration, however, does not claim, so it will always succeed; in my case, this is the problem.

If placement is enabled, it seems that
"The result is that the RT fails to update usage to the scheduler, which eventually causes the scheduler to always think this host has enough resources to boot new VMs. So the result is a disaster."
can be prevented by placement.

So to summarize,
the problem is that when there is a NUMA conflict, the RT is not able to update usage to the scheduler, which causes the hypervisor resource usage info to be incorrect.

Revision history for this message
leehom (feli5) wrote :

One thing that confuses me is that we have
instance_claim for new builds,
rebuild_claim for rebuilds,
and resize_claim for migrate and resize.

Why don't we create a live_migrate_claim for live migration as well?
Is this because we want to implement this part in the placement API?
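
For context, the existing claims in the RT follow a claim-then-rollback pattern; below is a heavily simplified, hypothetical sketch of what a live_migrate_claim could look like (illustrative only: nova has no such method today, and this is not the real ResourceTracker API).

    # Hypothetical sketch only: illustrates the claim pattern referred to above;
    # nova has no live_migrate_claim, and ToyTracker is not the real RT.
    import contextlib

    class ToyTracker(object):
        def __init__(self, free_pcpus):
            self.free_pcpus = set(free_pcpus)

        @contextlib.contextmanager
        def live_migrate_claim(self, wanted_pcpus):
            wanted = set(wanted_pcpus)
            if not wanted <= self.free_pcpus:
                raise RuntimeError("pCPUs %s not free on dest"
                                   % sorted(wanted - self.free_pcpus))
            self.free_pcpus -= wanted          # claim pCPUs on the destination
            try:
                yield
            except Exception:
                self.free_pcpus |= wanted      # roll back if the migration fails
                raise

    rt = ToyTracker(range(0, 56))
    with rt.live_migrate_claim({50, 22, 49, 21}):
        pass  # the live migration of VM1 would happen here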

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/661208

Changed in nova:
assignee: nobody → leehom (feli5)
status: New → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

As of the Train release we support NUMA-aware live migration for the libvirt driver:

https://review.opendev.org/#/c/634606/

However, someone reported a race with resource claims on the mailing list while testing that series:

http://lists.openstack.org/pipermail/openstack-discuss/2019-September/009447.html

I wonder if the race described in this bug is the same thing?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Stephen Finucane (<email address hidden>) on branch: master
Review: https://review.opendev.org/661208
Reason: I think this is a duplicate of #1879878. Closing as such.
