Reduce overhead for redundant PartitionState events

Bug #1694784 reported by Jeremy Arnold
Affects: nova-powervm
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Changes were made to pypowervm in https://github.com/powervm/pypowervm/commit/99b3c2de22281292a1691beeb429276ea5bd3f84 to eliminate the 15 second polling interval for NovaLink REST events. This results in more immediate notification of events. Unfortunately, it also increases overhead at scale because we now receive multiple related events that could previously be merged together but now must be handled individually.

One particular example of this is the handling of PartitionState events in nova_powervm/virt/powervm/event.py. During deploy, we expect to see as many as a dozen PartitionState events during the VM's initial boot. Most of these are low-level transient state changes that make no difference to OpenStack's view of the partition's state.

The PowerVMLifecycleEventHandler already has some special handling to delay notification of certain events in order to avoid unexpected transitions. This support could perhaps be extended to avoid additional redundant transitions.

I would recommend the following changes:

1) Eliminate the temporary instance caching in nova_powervm/virt/powervm/event.py. This was a significant benefit when we had a 15 second event polling interval, because we would often receive multiple events for the same instance. With the polling interval eliminated, we rarely receive more than one event at a time, so the instance cache provides little benefit.

2) Don't retrieve the instance from nova until we are actually going to issue a notification. For NVRAM events, this would be just before calling the nvram_mgr. For PartitionState events, the instance shouldn't be retrieved until just before we call _driver.emit_event from the PowerVMLifecycleEventHandler, which avoids the get_instance calls entirely when a queued event is canceled. Eliminating these extra get_instance calls reduces load on the controller, which can be substantial in a busy environment where we receive multiple events for the same instance over a short period of time.

3) Either treat all PartitionState events as "delayed" events (allowing several seconds for new events to replace old ones before actually emitting an event), or maintain a cache of the last observed PartitionState for each LPAR and only retrieve the instance and emit the event if the new state is different. Since we will typically only see a sequence of PartitionState changes for a small percentage of LPARs, this cache could be either fixed-size or use a time-based eviction policy, limiting memory consumption without having to monitor partition deletions.

I'm sure there are other implementation alternatives. The key is to reduce the number of calls to vm.get_instance so that the controller doesn't get bogged down with large numbers of events.
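Recommendation 2 (deferring the expensive lookup) might look roughly like the following sketch, where get_instance stands in for vm.get_instance and every other name is hypothetical:

```python
class LazyEmitter:
    """Queue events by LPAR UUID without touching nova; the expensive
    instance lookup happens only at emission time, so canceled events
    never cost a lookup.  Names are illustrative, not nova-powervm's."""

    def __init__(self, get_instance, emit_event):
        self._get_instance = get_instance  # stand-in for vm.get_instance
        self._emit_event = emit_event      # stand-in for _driver.emit_event
        self._pending = {}                 # lpar_uuid -> latest state

    def queue(self, lpar_uuid, state):
        # Cheap: remember only the UUID and state, no instance lookup yet.
        self._pending[lpar_uuid] = state

    def cancel(self, lpar_uuid):
        # A canceled event is dropped before any lookup is performed.
        self._pending.pop(lpar_uuid, None)

    def flush(self, lpar_uuid):
        state = self._pending.pop(lpar_uuid, None)
        if state is None:
            return
        # The lookup happens only here, just before emission.
        instance = self._get_instance(lpar_uuid)
        self._emit_event(instance, state)
```

The design point is that the lookup count tracks emitted events rather than received events, which is exactly the reduction the bug asks for.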
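The per-LPAR state cache from recommendation 3 could be sketched with a time-based eviction policy along these lines (again, all names and the TTL value are hypothetical):

```python
import time


class StateCache:
    """Remember the last observed PartitionState per LPAR so redundant
    transitions can be dropped without a lookup.  Time-based eviction
    bounds memory so deleted partitions need no explicit cleanup."""

    def __init__(self, ttl=600.0, clock=time.monotonic):
        self._ttl = ttl          # seconds before an entry is evicted
        self._clock = clock      # injectable for testing
        self._states = {}        # lpar_uuid -> (state, last_seen)

    def is_new_state(self, lpar_uuid, state):
        """Record the state; return True only if it differs from the
        last one cached (or the entry was absent/expired)."""
        now = self._clock()
        # Evict stale entries so the cache cannot grow without bound.
        for uuid in [u for u, (_, ts) in self._states.items()
                     if now - ts > self._ttl]:
            del self._states[uuid]
        prev = self._states.get(lpar_uuid)
        self._states[lpar_uuid] = (state, now)
        return prev is None or prev[0] != state
```

The event handler would then call something like is_new_state() first and skip both the instance lookup and the notification when it returns False.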

OpenStack Infra (hudson-openstack) wrote: Fix merged to nova-powervm (master)

Reviewed: https://review.openstack.org/469982
Committed: https://git.openstack.org/cgit/openstack/nova-powervm/commit/?id=db759ce5158446a918e76c049d0efc753e0bbc72
Submitter: Jenkins
Branch: master

commit db759ce5158446a918e76c049d0efc753e0bbc72
Author: Eric Fried <email address hidden>
Date: Thu Jun 1 14:10:48 2017 -0500

    Performance improvements for Lifecycle events

    Implement various performance improvements in the event handler.

    - Since get_instance is expensive, delay it as long as possible (see #2
      in the bug report). Only retrieve the instance right before we're
      going to use it.

    - Delay all PartitionState events (see #3 in the bug report).

    - Skip PartitionState-driven events entirely if nova is in the middle of
      an operation, since nova is already aware of the appropriate state
      changes.

    - Only retrieve the admin context once, and cache it.

    We keep the instance cache (see #1 in the bug report) since scale
    testing showed it was indeed being used a nontrivial amount of the time.

    Change-Id: I1f1634215b4c269842584c59f2c14c119c282b7e
    Closes-Bug: #1694784

Changed in nova-powervm:
status: New → Fix Released