Comment 6 for bug 1835958

Chris Dent (cdent) wrote :

Tom,

> Breaking up the large clusters into smaller ones will likely help a ton but will really mask the underlying issue: actions taken on a specific instance causes a _full_ power sync to be kicked off for every single instance.

It's not clear how this could be happening. Syncing power states is supposed to happen only during an independent periodic task, without regard to incoming instance actions.

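For context, in upstream nova the power sync is wired up roughly as a periodic task. The following is a simplified sketch using oslo.service's periodic task machinery; the spacing value and the two helper methods are illustrative placeholders, not the actual nova code:

    from oslo_service import periodic_task

    SYNC_INTERVAL = 600  # seconds between runs; illustrative value


    class ComputeManager(periodic_task.PeriodicTasks):
        """Simplified sketch: power sync runs on its own timer."""

        @periodic_task.periodic_task(spacing=SYNC_INTERVAL)
        def _sync_power_states(self, context):
            # Driven by the service's periodic task loop, not by any
            # individual instance action arriving at the compute node.
            for instance in self._get_instances_on_this_host(context):
                self._query_driver_power_state_and_sync(context, instance)

        def _get_instances_on_this_host(self, context):
            # Placeholder: upstream fetches this host's instances from the DB.
            return []

        def _query_driver_power_state_and_sync(self, context, instance):
            # Placeholder: upstream asks the virt driver for the power state
            # and reconciles it with the DB record.
            pass
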
If your evidence for this is solely the logs and the request id involved, it could be that this patch <https://review.opendev.org/#/c/575034/>, which is merged in the VIO release (along with several other power-sync-related changes) but not upstream, is disrupting the context variable used to provide the request-id shown in the logs. That is, there could be a flow like this (sketched in code after the list):

* power sync starts
* it sleeps, relinquishing the thread
* an instance is spawned, context is updated
* power sync starts back up with the new context

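A minimal, purely illustrative sketch of how that kind of context bleed could look under eventlet; the shared dict and the request-id strings here are hypothetical stand-ins, not the actual VIO or nova code:

    import eventlet

    # Hypothetical shared mutable context, standing in for whatever the
    # patched code uses to carry the request-id into log messages.
    current_context = {"request_id": "req-periodic-sync"}


    def power_sync():
        ctx = current_context  # grabs a reference, not a copy
        print("sync start as", ctx["request_id"])
        eventlet.sleep(1)  # relinquishes the thread to other greenthreads
        # By the time we wake up, another greenthread may have mutated the
        # shared context, so our log lines now carry the spawn's request-id.
        print("sync resumed as", ctx["request_id"])


    def spawn_instance():
        current_context["request_id"] = "req-instance-spawn"
        print("spawning instance as", current_context["request_id"])


    sync = eventlet.spawn(power_sync)
    eventlet.spawn_after(0.5, spawn_instance)
    sync.wait()

Run as written, the sync greenthread starts logging under its own request-id but resumes logging under the spawn's, which would produce exactly the misleading log pattern described above.
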
However: the VIO code for _sync_power_states in 5.0 is quite a lot different from the upstream code, so it is hard to draw any satisfying conclusions without much more information. I think you've already got a support issue open with the VIO support people. If you can provide detailed logs _there_ (if you haven't already), inspecting those logs is probably going to be the most useful thing; those logs won't be usable by non-VMware people upstream, as they do not have access to the customised code.

If I had to guess (and that's all it is without more data) what's going on with the performance slowdown, I'd say there's been an unintended consequence of the optimisation present in the VIO code: some situation where a change that normally helps ends up gumming up the works.

But finally, to support what Richard said: there are a lot of loops in nova-compute over the number of active instances. These are developed with an eye towards what many of the developers think of as the "normal" environment, libvirt+KVM on a host where the number of instances is in the tens or hundreds, not thousands. This can sometimes lead to an unfortunate but real impedance mismatch between the way nova-compute wants to behave and the way vSphere would like it to behave. (A rough sketch of the arithmetic is below.)

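To make the scale point concrete, here is a back-of-the-envelope sketch; the per-instance cost is a made-up number, purely to show how linear per-instance loops behave at different scales:

    # Illustrative arithmetic only: a periodic task that spends a fixed cost
    # per instance scales linearly with the number of instances on the "host".
    PER_INSTANCE_COST_S = 0.05  # hypothetical per-instance work (API call, DB read)

    for n_instances in (50, 500, 5000):
        total = n_instances * PER_INSTANCE_COST_S
        print(f"{n_instances:>5} instances -> ~{total:>6.1f}s per periodic pass")
    # ~2.5s at typical libvirt/KVM scale, but ~250s (over 4 minutes) when a
    # vSphere cluster presents thousands of instances behind one nova-compute.
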
You're completely correct that an individual instance action shouldn't cause an update of all the instances. If that is indeed what is happening, tracking it down and fixing it would be a great improvement for any virt driver. Which is a long way of saying: thanks for working to get this fixed and for providing the information needed to make it happen.