Major performance issue with libvirt related to number of VMs on a host

Bug #928910 reported by Phil Day
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Critical
Assigned to: Jay Pipes
Milestone: 2012.1

Bug Description

Hi All,

We hit a performance issue with KVM on our Diablo-based build which, as far as I can see, still applies to Essex.

Whilst I have a fix for our baseline, I'm unlikely to produce an Essex-based fix in the time frame that the community would want. As it took me quite a while to work out what was happening, I'm hoping that by posting the details of the problem and how we've addressed it, someone else can quickly produce an Essex equivalent. Happy to answer any questions or help out in any other way.

The symptom is that we saw a degradation in VM start-up time on KVM that is directly proportional to the number of VMs already running on the host, hitting a critical tipping point when that number reaches around 30.

What we found is that there are two places in the compute manager code that make calls into libvirt which take around 2 seconds per VM, and these calls are made for every VM on the host. These loops do not get pre-empted.

1) During creation of a VM the compute manager checks if a VM with the same name already exists (in Diablo this was in _run_instance; in Essex it has been refactored as _check_instance_not_already_created)

2) The _sync_power_states periodic task which updates the power_state value in the DB.

Note that neither of these gets pre-empted, so the compute manager will block all other processing for ~2 seconds x number_of_VMs. Once the number of VMs reaches 30, the compute manager can block for over 60 seconds, which in turn means that the service update thread will not run in time and the compute service will become unavailable (in itself this isn't a bad thing, as it stops the scheduler from sending any more requests to that host).

The periodic task has the biggest impact, as it runs with an interval of 60 seconds, so at minute intervals the compute manager is blocked for over a minute. Also, requests for new VMs now take over a minute to get through _check_instance_not_already_created, so by the time they do reach a pre-emption point (such as the call to the network_manager) the periodic task will be ready to run again and block for a further minute. This explains the dramatic degradation we saw in start-up time once the number of running instances got past 30.
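To make the scheduling behaviour concrete: eventlet green threads only switch at explicit yield points, so a loop that never yields starves every other green thread in the process. A minimal standalone illustration (not nova code; all names here are made up):

    # Standalone illustration: a green thread that never yields starves
    # the others. Without monkey-patching, time.sleep() blocks the whole
    # OS thread, much like a slow native libvirt call does.
    import time
    import eventlet

    start = time.time()

    def heartbeat():
        # Stands in for the service-update thread: should tick every second.
        while True:
            print("heartbeat at %.1fs" % (time.time() - start))
            eventlet.sleep(1)

    def busy_loop():
        # Stands in for _sync_power_states: ~2s of blocking work per VM,
        # with no yield point between VMs.
        for _ in range(5):
            time.sleep(2)        # blocks; heartbeat cannot run
            # eventlet.sleep(0)  # uncommenting this single line fixes it

    eventlet.spawn(heartbeat)
    eventlet.spawn(busy_loop).wait()

With the loop as written, the heartbeat fires only once; with the sleep(0) line uncommented, it gets a turn after every simulated VM.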

It's possible that a more efficient way of getting the power state from libvirt can be found (it seems that, like me, libvirt is built for comfort rather than speed), but we haven't found one yet, so instead we took the following approach:

- Refactor _check_instance_not_already_created into a direct virt driver method (rather than iterating through the list) so that different implementations can provide a direct check for a specific instance. For example, it may be possible to just look directly at the file system rather than rely on libvirt. We've also added a flag to disable this check, which is what we're doing for now on the basis that the low risk of a failure later in creation is better than the cost of this check as it stands.

- Modify list_instances_detail in the libvirt driver so that it has a sleep(0) within the loop. This prevents it from blocking all other eventlets.

- Create a separate eventlet to run _sync_power_states. Even with the sleep(0) in place, the task would otherwise still block other periodic tasks, some of which (such as checks on resource allocation and billing data generation) we want to run every minute even if _sync_power_states is going to take longer than that to run. This new eventlet now runs less frequently, say every 10 minutes. Note that just changing the interval for _sync_power_states isn't enough; it needs to be in a separate eventlet to avoid blocking other periodic tasks. Maybe a more general fix would be to create separate eventlets for every periodic task, rather than just have them run at different intervals in the same eventlet. A sketch of this approach follows below.
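The shape of what we did, as a minimal sketch (assuming eventlet; the function and helper names here are illustrative, not our actual code):

    # Minimal sketch of the separate-eventlet approach; function and
    # helper names are illustrative, not actual nova code.
    import eventlet

    SYNC_POWER_STATES_INTERVAL = 600  # seconds; the old interval was 60

    def sync_power_states_loop(manager):
        # Run the expensive power-state sync in its own green thread so
        # it can never delay the other periodic tasks.
        while True:
            for instance in manager.get_db_instances():
                # Yield before each slow libvirt call so other eventlets
                # (service updates, other periodic tasks) can run.
                eventlet.sleep(0)
                manager.sync_one_power_state(instance)
            eventlet.sleep(SYNC_POWER_STATES_INTERVAL)

    # Spawned once at service start-up, instead of registering the sync
    # as an ordinary periodic task:
    # eventlet.spawn_n(sync_power_states_loop, manager)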

It seems that in Essex the EC2 API no longer returns power_state as it did in Diablo, so the user experience isn't affected by the increased refresh interval. However, there are some cases in the code that do rely on the value in the DB. In all but one case these can be changed to just refresh the value from libvirt first (a sketch of the pattern follows the list):

- compute/manager.py:init_host(): to see which VMs need to be restarted. Can't change this, as it's a direct comparison of DB state with virt state, so the risk that some VMs fail to be started increases with a longer interval between _sync_power_states runs.

- compute/manager.py:_shutdown_instance(): to raise an exception if the instance is already shut down. Can add an explicit call to self._get_power_state().

- compute/manager.py:set_admin_password(): to check if the state is running. Can add an explicit call to self._get_power_state().

- compute/manager.py:inject_file(): to check if the state is running. Can add an explicit call to self._get_power_state().

- compute/manager.py:update_agent(): to check if the state is running. Can add an explicit call to self._get_power_state().

- compute/manager.py:get_diagnostics(): to check if the state is running. Can add an explicit call to self._get_power_state().
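The change at each of those call sites is roughly the same; for example, in set_admin_password() (a sketch only: _get_power_state() is the compute manager's existing helper, but the exact exception raised is indicative):

    # Rough shape of the per-call-site change (shown for
    # set_admin_password, inside a compute manager method);
    # _get_power_state() is the compute manager's existing helper, and
    # the exception raised is indicative only.
    from nova import exception
    from nova.compute import power_state

    current_power_state = self._get_power_state(context, instance)
    if current_power_state != power_state.RUNNING:
        # Previously this compared against the possibly stale DB value.
        raise exception.InstanceNotRunning(instance_id=instance['id'])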

Hope this helps and is clear. As I said, happy to answer any questions (<email address hidden>)

Changed in nova:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Jay Pipes (jaypipes) wrote :

The existing method in _check_instance_not_already_created() of looking for a supplied instance name by looping over the return from driver.list_instances() is horribly inefficient.

Run the attached script to show how inefficient it is...

On a host with just 10 tiny instances running, doing a simple libvirt.connection.lookupByName(XXX) is an order of magnitude faster than looping through the results of listDomainsID() and looking up the name of the instance using lookupByID():

jpipes@librebox:~/repos/junk$ python test_check_instance_name.py
Num running domains: 10

name in [conn.lookupByID(i).name()
for i in conn.listDomainsID()]
Found: False
 took 0.00712 seconds

try: conn.lookupByName(x)
    found = True
except libvirt.libvirtError:
    found = False

libvir: QEMU error : Domain not found: no domain with matching name 'instance-1234567'
Found: False
 took 0.00083 seconds
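
The attached script isn't reproduced inline, but a rough reconstruction of what it times, using the standard libvirt Python bindings, would be:

    # Rough reconstruction of the attached benchmark (the attachment
    # itself is not shown here); uses the standard libvirt Python
    # bindings against the local qemu/KVM hypervisor.
    import time
    import libvirt

    conn = libvirt.open('qemu:///system')
    name = 'instance-1234567'  # deliberately non-existent

    print('Num running domains: %d' % conn.numOfDomains())

    # Approach 1: loop over every running domain and compare names.
    start = time.time()
    found = name in [conn.lookupByID(i).name()
                     for i in conn.listDomainsID()]
    print('Found: %s\n took %.5f seconds' % (found, time.time() - start))

    # Approach 2: ask libvirt for the domain directly by name.
    start = time.time()
    try:
        conn.lookupByName(name)
        found = True
    except libvirt.libvirtError:
        found = False
    print('Found: %s\n took %.5f seconds' % (found, time.time() - start))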

Revision history for this message
Jay Pipes (jaypipes) wrote :

Based on the difference in efficiency shown by the attached patch, I think an easy win would be to simply change the implementation of _check_instance_not_already_created() to check whether the driver supports a lookup by name and, if so, use that.
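
Concretely, something like this (a sketch only, not the final patch; class names are simplified relative to the actual tree):

    # Sketch of a driver-overrideable existence check; class names are
    # simplified, and this is not the final patch.
    import libvirt

    class ComputeDriver(object):
        def instance_exists(self, instance_id):
            # Default kept for drivers without a direct lookup: loop
            # over every instance on the host.
            return instance_id in self.list_instances()

    class LibvirtDriver(ComputeDriver):
        def instance_exists(self, instance_id):
            # libvirt can answer directly, an order of magnitude faster
            # than the loop above on a busy host.
            try:
                self._conn.lookupByName(instance_id)
                return True
            except libvirt.libvirtError:
                return False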

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/3919

Changed in nova:
assignee: nobody → Jay Pipes (jaypipes)
status: Triaged → In Progress
Revision history for this message
Phil Day (philip-day) wrote :

Jay's change to _check_instance_not_already_created() is a good one (thanks Jay), but I wanted to be sure that everyone understood that it's just one part of the problem.

The time spent in _sync_power_states() is at least as big an issue, if not bigger, and the refactor to make this a separate thread with an explicit yield between each instance is also needed.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/3919
Committed: http://github.com/openstack/nova/commit/27c11c4bb4e5c54282caf49cba666f45cfc590c2
Submitter: Jenkins
Branch: master

commit 27c11c4bb4e5c54282caf49cba666f45cfc590c2
Author: Jay Pipes <email address hidden>
Date: Wed Feb 8 15:51:59 2012 -0500

    Remedies LP Bug #928910 - Use libvirt lookupByName() to check existence

    Make determining if an instance exists on a host
    more efficient by adding an instance_exists() method to the
    base virt driver that can be overridden by drivers that
    have a more efficient mechanism of looking up an instance
    by its ID / name. Modifies the _check_instance_already_created
    method of the compute manager to use this new instance_exists() method.

    Someone from Citrix should look into how to make the instance_exists()
    method in the Xen and VMWare virt drivers more efficient than the
    base "loop over all domains and see if the instance ID exists" method
    now in the base driver class.

    Change-Id: Ibf219788f9c104698057367da89300a060945778

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Jay Pipes (jaypipes) wrote :

Setting back to In Progress while I address part 2 of the bug.

Changed in nova:
status: Fix Committed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/4061

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/4061
Committed: http://github.com/openstack/nova/commit/1c8ad4553b4b8d404f941c5297e3f6e42c9f7e6a
Submitter: Jenkins
Branch: master

commit 1c8ad4553b4b8d404f941c5297e3f6e42c9f7e6a
Author: Jay Pipes <email address hidden>
Date: Sun Feb 12 13:34:14 2012 -0500

    Completes fix for LP #928910 - libvirt performance

    This patch adds the remainder of the recommended fixes
    from the original bug report:

    * Modifies methods in the compute manager that relied on
      the DB power state to be in sync with the virt driver to
      instead just query the power state of the instance from the
      virt driver. This enables us to set the periodic tick to 10
      for the problematic compute.manager.Manager._sync_power_states()
      method.
    * Modifies the _sync_power_states method in the following ways:
     ** Replace the call to driver.list_instances_detail() to a new,
        driver-overrideable get_num_instances() call
     ** For each instance known by the database, call driver.get_info()
        separately inside the loop instead of calling the expensive
        list_instances_detail() method that can take a very long time
        to complete on hosts with lots of instances
     ** Call greenthread.sleep(0) before each call to update the
        database power state, enabling other periodic tasks to do work

    Once again, I left an inefficient default implementation of the
    new driver.get_num_instances() method in the base driver class. I
    need help from folks who understand the Xen/VMWare drivers to do
    an override for get_num_instances() in those drivers that calls
    the underlying XenAPI or VMWare API.

    Change-Id: I88002689cdda32124423da320f8c542e286be51b
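
Paraphrased as a sketch, the change to _sync_power_states looks like this (see the commit link above for the real diff; names and surrounding detail are simplified):

    # Sketch paraphrasing the merged _sync_power_states change (see the
    # commit link above for the real diff); details are simplified.
    from eventlet import greenthread

    from nova import exception
    from nova.compute import power_state

    def _sync_power_states(self, context):
        db_instances = self.db.instance_get_all_by_host(context, self.host)
        for db_instance in db_instances:
            # One cheap per-instance query instead of one expensive
            # list_instances_detail() call for the whole host.
            try:
                vm_power_state = self.driver.get_info(
                    db_instance['name'])['state']
            except exception.NotFound:
                vm_power_state = power_state.NOSTATE
            # Yield before each DB update so other green threads
            # (service updates, other periodic tasks) can do work.
            greenthread.sleep(0)
            if vm_power_state != db_instance['power_state']:
                self._instance_update(context, db_instance['id'],
                                      power_state=vm_power_state)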

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → essex-4
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: essex-4 → 2012.1