Nova sync power state on large clusters causes poor performance

Bug #1835958 reported by tom king
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Gary Kotton
Milestone: none

Bug Description

In a vSphere-backed OpenStack, a fair number of instance actions (for example, creating an instance, live-migrating an instance, and detaching interfaces) cause a Nova sync power process to be kicked off.

Large compute clusters (thousands of VMs) suffer poor performance when an instance action kicks off multiple syncs. For example, instance creation can take anywhere from 5 to 20 minutes if many of these syncs are in flight.

Some of these syncs shouldn't process the entire cluster if they are specific to the instance, unless there are some other hidden/unknown considerations. A sync of a single instance that is having its interface detached shouldn't require a sync of the entire cluster.
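
For illustration, a hypothetical sketch of the difference being asked for (the names here are illustrative, not nova's or the VMware driver's actual API):

    def sync_single_instance(driver, db_api, instance):
        """Reconcile one instance's power state after an action on it."""
        vm_state = driver.get_power_state(instance)    # one hypervisor call
        db_api.update_power_state(instance, vm_state)  # one DB update

    def sync_whole_cluster(driver, db_api, instances):
        """What a full-cluster sync costs on a vSphere cluster with
        thousands of VMs: one round trip per instance, every time."""
        for instance in instances:
            sync_single_instance(driver, db_api, instance)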

Tags: compute vmware
Revision history for this message
Chris Dent (cdent) wrote :

There's been some work in this area to try to resolve some of those issues, but it may not have made it into any release yet. I'm looking around to find more information on which patches were made when.

In the meantime can you report on which release you are using and how you deploy your OpenStack?

Thanks.

Revision history for this message
Matt Riedemann (mriedem) wrote :

> cause a Nova sync power process to be kicked off.

I'm not sure if you're referring to the _sync_power_states periodic task:

https://github.com/openstack/nova/blob/ff0feed25d56c8ccd2298d5b5b82e636880fa986/nova/compute/manager.py#L7907

If you are, that's a mischaracterization of what happens when you create or live migrate a server. Since it's a periodic task, it runs like a cron.
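
For reference, a simplified sketch (not the exact nova code) of how such a periodic task is declared with oslo.service; the timer runs inside nova-compute on its own schedule, independent of any API request:

    from oslo_config import cfg
    from oslo_service import periodic_task

    CONF = cfg.CONF

    class ComputeManagerSketch(periodic_task.PeriodicTasks):
        def __init__(self):
            super(ComputeManagerSketch, self).__init__(CONF)

        # spacing is in seconds; nova uses CONF.sync_power_state_interval
        # (default 600) for _sync_power_states.
        @periodic_task.periodic_task(spacing=600, run_immediately=True)
        def _sync_power_states(self, context):
            # Walk every instance on this host and reconcile the power
            # state recorded in the nova DB with the hypervisor's view.
            pass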

Maybe you mean "claims" performed in the resource tracker, like during instance create?

https://github.com/openstack/nova/blob/ff0feed25d56c8ccd2298d5b5b82e636880fa986/nova/compute/manager.py#L2223

But again, those claims aren't (yet) made for live migration and interface attach/detach operations. So I'm a bit confused on what you're saying is taking a long time.

Regarding the _sync_power_states periodic task I've seen a few vcenter-specific patches to deal with a large number of instances when nova-compute is managing a large vcenter cluster, e.g.:

https://review.opendev.org/#/c/575034/

Also see bug 1513735.

tags: added: compute vmware
Revision history for this message
tom king (kingttx) wrote :

> I'm not sure if you're referring to the _sync_power_states periodic task...

Not in this bug, no. I've watched the nova-compute.log file during some of those example operations and the req-#### for the operation is added to the sync. It isn't part of the periodic sync, it gets kicked off as part of those example operations.

> Maybe you mean "claims" performed in the resource tracker, like during instance create?

No, it literally takes 5-20 minutes to create an instance, especially if we create more than one instance at a time (which happens a LOT in our environment).

Recently, I deployed three test instances:
- 1st instance took 4 minutes
- 2nd instance took 10 minutes
- 3rd instance took 11 minutes

> In the meantime can you report on which release you are using and how you deploy your OpenStack?

Queens via VMware's VIO 5.x installer.

Thanks!
Tom

Revision history for this message
Richard Boswell (richard-boswell) wrote :

Tom,

There are two things that _should_ help:

- Disable power state checks in nova.conf.
- Break up your ESXi clusters into smaller ones.

The problem is the distribution of the power state checks themselves: if an ESXi cluster is large (greater than, say, 500 VMs), the nova-compute instances can, in some cases, experience service degradation due to the sheer number of power state checks. I wrote about this tangentially at http://www.revolutionlabs.net/vio_resource_schedulers_part2/.

Revision history for this message
tom king (kingttx) wrote :

>There are two things that _should_ help:
>
>- Disable power state checks in nova.conf.
>- Break up your ESXi clusters into smaller ones.

With as much churn as our environment has, disabling the power state checks wouldn't be good. We use this cloud for dev/test, and the users do a lot of work inside the instance operating systems. We are, however, changing the periodic check to 1 hour rather than the default 10 minutes.
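
For anyone else tuning this, the relevant nova.conf option (assuming the stock option name; the value is in seconds, and a negative value disables the sync entirely):

    [DEFAULT]
    # run the periodic power-state sync hourly instead of the 600s default
    sync_power_state_interval = 3600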

Breaking up the large clusters into smaller ones will likely help a ton, but it will really just mask the underlying issue: actions taken on a specific instance cause a _full_ power sync to be kicked off for every single instance.

I know this would mean a pretty involved rewrite, but an action taken on a particular instance should only do a power sync for that particular instance rather than every single instance. Also, I'm not understanding the need/benefit of doing a whole power sync when creating instances. Scaling up a cluster would perform better if that isn't necessary.

Thanks!
Tom

Revision history for this message
Chris Dent (cdent) wrote :

Tom,

> Breaking up the large clusters into smaller ones will likely help a ton, but it will really just mask the underlying issue: actions taken on a specific instance cause a _full_ power sync to be kicked off for every single instance.

It's not clear how this could be happening. Syncing power states is supposed to happen only during an independent periodic job, without regard to incoming instances.

If your evidence of this is solely the logs and the request id that is involved, it could be that this patch <https://review.opendev.org/#/c/575034/>, which is merged in the VIO release (plus several other power-sync-related changes) but not upstream, is disrupting the context variable used to provide the request-id shown in the logs. That is, there could be a flow like this (a toy sketch follows the list):

* power sync starts
* it sleeps, relinquishing the thread
* an instance is spawned, context is updated
* power sync starts back up with the new context
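
An illustrative-only toy (not nova code; the variable names are made up) of how a cooperative yield plus a shared "current request id" can make a long-running task log under a newer request's id after it resumes:

    import eventlet

    current_request_id = None  # stand-in for the global/thread context store

    def power_sync_task():
        global current_request_id
        current_request_id = 'req-periodic'
        for batch in range(3):
            # ... sync a batch of instances ...
            eventlet.sleep(0)  # yield; another greenthread may run now and
                               # overwrite the "current" request id
            print('power sync batch %d logged as %s'
                  % (batch, current_request_id))

    def spawn_instance():
        global current_request_id
        current_request_id = 'req-api-spawn'  # API request installs its own id

    gt = eventlet.spawn(power_sync_task)
    eventlet.spawn(spawn_instance)
    gt.wait()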

However: the VIO code for _sync_power_states in 5.0 is quite a lot different from the upstream code, so it is hard to draw any satisfying conclusions without much more information. I think you've already got a support issue open with the VIO support people. If you can provide detailed logs _there_ (if you haven't already), inspecting those logs is probably going to be the most useful thing; they won't be usable by non-VMware people upstream, as they do not have access to the customised code.

If I had to guess (and that's all it is without more data) at what's going on with the performance slowdown, I'd say there's been an unintended consequence of the optimisation present in the VIO code: some situation where what normally helps ends up gumming up the works.

But finally, to support what Richard said: there are a lot of loops in nova-compute over the number of active instances. These are developed with an eye towards what many of the developers think of as the "normal" environment, libvirt+KVM on a host where the number of instances is in the tens or hundreds, not thousands. This can sometimes lead to an unfortunate but real impedance mismatch between the way nova-compute wants to behave and the way vSphere would like it to.

You're completely correct that an individual instance action shouldn't cause an update of all the instances. If that is indeed what is happening, tracking it down and fixing it would be a great improvement for any virt driver. Which is a long way of saying: thanks for working to get this fixed and providing the information needed to make it happen.

Revision history for this message
Matt Riedemann (mriedem) wrote :

The request ID from the API request getting re-used when the periodic task runs is a separate and yes very confusing serviceability issue. I thought I had reported a bug for this but can't find it. This change is related though:

https://review.opendev.org/#/c/524306/

Anyway, I think that's a red herring and has to do with the request ID being stored in a global thread context:

https://opendev.org/openstack/oslo.context/src/tag/2.22.1/oslo_context/context.py#L40
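
A minimal sketch of the oslo.context behaviour being referred to: constructing a RequestContext (overwrite=True is the default) stores it in a thread-local slot, so anything that later asks for "the current context", e.g. for the request id in log lines, sees the newest one:

    from oslo_context import context

    ctx_a = context.RequestContext(request_id='req-aaaa')  # becomes "current"
    ctx_b = context.RequestContext(request_id='req-bbbb')  # overwrites it

    print(context.get_current().request_id)  # prints req-bbbb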

Revision history for this message
tom king (kingttx) wrote :

Which process is doing this? I'm attaching the compute node's full nova-compute.log file as well.

viouser@compute01:/var/log/nova$ rgrep "Instance 004667e4-a4b4-46e9-beb0-cefd7d5f7ee2 actively managed"
nova-compute.log.1:2019-07-23 18:29:03.397 13699 DEBUG nova.compute.resource_tracker [req-ad1ccb2c-62d1-49de-a67b-2e910421e64b - - - - -] Instance 004667e4-a4b4-46e9-beb0-cefd7d5f7ee2 actively managed on this compute host and has allocations in placement: {u'resources': {u'VCPU': 8, u'MEMORY_MB': 8192, u'DISK_GB': 240}}. _remove_deleted_instances_allocations /usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py:1286
nova-compute.log.1:2019-07-23 18:31:07.005 13699 DEBUG nova.compute.resource_tracker [req-ad1ccb2c-62d1-49de-a67b-2e910421e64b - - - - -] Instance 004667e4-a4b4-46e9-beb0-cefd7d5f7ee2 actively managed on this compute host and has allocations in placement: {u'resources': {u'VCPU': 8, u'MEMORY_MB': 8192, u'DISK_GB': 240}}. _remove_deleted_instances_allocations /usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py:1286
nova-compute.log.1:2019-07-23 18:31:34.189 13699 DEBUG nova.compute.resource_tracker [req-ad1ccb2c-62d1-49de-a67b-2e910421e64b - - - - -] Instance 004667e4-a4b4-46e9-beb0-cefd7d5f7ee2 actively managed on this compute host and has allocations in placement: {u'resources': {u'VCPU': 8, u'MEMORY_MB': 8192, u'DISK_GB': 240}}. _remove_deleted_instances_allocations /usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py:1286
nova-compute.log.1:2019-07-23 18:41:46.361 13699 DEBUG nova.compute.resource_tracker [req-ad1ccb2c-62d1-49de-a67b-2e910421e64b - - - - -] Instance 004667e4-a4b4-46e9-beb0-cefd7d5f7ee2 actively managed on this compute host and has allocations in placement: {u'resources': {u'VCPU': 8, u'MEMORY_MB': 8192, u'DISK_GB': 240}}. _remove_deleted_instances_allocations /usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py:1286
nova-compute.log.1:2019-07-23 18:43:11.220 13699 DEBUG nova.compute.resource_tracker [req-ad1ccb2c-62d1-49de-a67b-2e910421e64b - - - - -] Instance 004667e4-a4b4-46e9-beb0-cefd7d5f7ee2 actively managed on this compute host and has allocations in placement: {u'resources': {u'VCPU': 8, u'MEMORY_MB': 8192, u'DISK_GB': 240}}. _remove_deleted_instances_allocations /usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py:1286
nova-compute.log.1:2019-07-23 18:44:09.366 13699 DEBUG nova.compute.resource_tracker [req-ad1ccb2c-62d1-49de-a67b-2e910421e64b - - - - -] Instance 004667e4-a4b4-46e9-beb0-cefd7d5f7ee2 actively managed on this compute host and has allocations in placement: {u'resources': {u'VCPU': 8, u'MEMORY_MB': 8192, u'DISK_GB': 240}}. _remove_deleted_instances_allocations /usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py:1286
nova-compute.log.1:2019-07-23 18:45:11.148 13699 DEBUG nova.compute.resource_tracker [req-ad1ccb2c-62d1-49de-a67b-2e910421e64b - - - - -] Instance 004667e4-a4b4-46e9-beb0-cefd7d5f7ee2 actively managed on this compute host and has allocations in placement: {u'resources': {u'VCPU': 8, u'MEMORY_MB': 8192, u'DISK_GB': 240}}. _remove_deleted_instances_allocations /usr/lib/python2.7/dist-packages/nova/comput...

Revision history for this message
tom king (kingttx) wrote :

Adding log file.

Revision history for this message
melanie witt (melwitt) wrote :

> The request ID from the API request getting re-used when the periodic task runs is a separate and yes very confusing serviceability issue. I thought I had reported a bug for this but can't find it. This change is related though:

> https://review.opendev.org/#/c/524306/

> Anyway, I think that's a red herring and has to do with the request ID being stored in a global thread context:

> https://opendev.org/openstack/oslo.context/src/tag/2.22.1/oslo_context/context.py#L40

I just wanted to note that, AFAICT, the request ID from the API request shouldn't be able to be re-used by a periodic task unless the deployed version of Queens is older than 17.0.6 (released on 2018-09-24), when the following bug was fixed:

https://bugs.launchpad.net/nova/+bug/1773102

Revision history for this message
tom king (kingttx) wrote :

@melanie, it does indeed look that way.

While this is still a heavy process, it's looking like the req-#### is a red herring and this ticket is no longer necessary. Feel free to change the ticket to whatever fits this situation.

If anyone is interested, we ultimately found a different issue, the fix for which should be in the next patch. I may have some details wrong here: when an instance is created or its interfaces are modified, a process goes through the _entire_ set of port groups to find the proper one(s) to drop the interface(s) into. This happens every time.

With us having hundreds (possibly thousands) of networks because of the kind of work done on our VIO, this takes a LONG time for every single one.
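
If it helps picture the cost, a hypothetical illustration (not the actual vmwareapi driver code) of why scanning every port group hurts at this scale: each boot or interface change pays for a walk over all port groups.

    def find_port_group(all_port_groups, network_name):
        """Naive lookup: O(number of port groups) per attach/boot."""
        for pg in all_port_groups:
            if pg['name'] == network_name:
                return pg
        return None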

Thank you, folks!
Tom

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/682242

Changed in nova:
assignee: nobody → Gary Kotton (garyk)
status: New → In Progress
Revision history for this message
Gary Kotton (garyk) wrote :

The root cause of the problem is the background method update_available_resources. It takes a global lock, and during that time all instances being booted wait for the lock to be released.
Patch for review - https://review.opendev.org/682242
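
For context, a hedged sketch of the locking pattern described (names illustrative; nova's resource tracker uses a similar synchronized decorator around a shared semaphore): the periodic resource update and the boot-time claim serialise on the same lock, so claims queue behind a long update pass.

    from oslo_concurrency import lockutils

    class ResourceTrackerSketch(object):

        @lockutils.synchronized('compute_resources')
        def update_available_resource(self):
            # long pass over every instance/allocation on a large cluster
            pass

        @lockutils.synchronized('compute_resources')
        def instance_claim(self, instance):
            # boot-time claim; blocks until the update pass above releases
            # the lock
            pass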
