libvirt driver connection validation causes unnecessary process execution with libvirt/qemu

Bug #1100446 reported by Attila Fazekas
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
OpenStack Compute (nova) | Fix Released | High | Vish Ishaya |
Folsom | Fix Released | High | Daniel Berrange |

Bug Description

A VM's transition from BUILD to ACTIVE status can take 26 seconds with libvirt/qemu.

This transition is also critical to the gate system's performance.

https://github.com/openstack/nova/blob/c215b5ec79516111456dfc2a63fa0facf5946ab0/nova/virt/libvirt/driver.py#L365
This call should be replaced with something cheaper, such as a libvirt version query or a hostname query,
or with an even cheaper alternative.
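A minimal sketch of what a cheaper connection check could look like. It assumes the connection object exposes getLibVersion() and getCapabilities() as the libvirt Python bindings do; FakeConnection and connection_is_alive are hypothetical names used only for illustration, and this is not the code in nova's driver.py.

```python
class FakeConnection:
    """Hypothetical stand-in for a libvirt connection; records the calls made."""

    def __init__(self, alive=True):
        self.alive = alive
        self.calls = []

    def getLibVersion(self):
        self.calls.append("getLibVersion")
        if not self.alive:
            raise RuntimeError("connection lost")
        # libvirt encodes versions as major*1,000,000 + minor*1,000 + release,
        # so 0.9.8 would be reported as 9008.
        return 9008

    def getCapabilities(self):
        self.calls.append("getCapabilities")
        return "<capabilities/>"


def connection_is_alive(conn):
    """Return True if a cheap round trip to libvirtd succeeds.

    Only getLibVersion() is used, so the check never asks libvirtd to
    probe the installed qemu binaries.
    """
    try:
        conn.getLibVersion()
        return True
    except Exception:  # real code would catch libvirt.libvirtError
        return False


if __name__ == "__main__":
    conn = FakeConnection()
    print(connection_is_alive(conn), conn.calls)
```

With a real connection the same function would be called with the object returned by libvirt.open(); the fake only makes it visible that getCapabilities is never invoked by the check.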

Note:
The one-minute periodic status update also triggers this expensive call. I do not think the architecture changes frequently.
Consider querying it only at service start-up.

If you call getCapabilities only at startup, you can reduce the ~26 seconds to ~13 seconds!

If your qemu supports multiple architectures, it is much slower, and fixing this issue yields an even greater performance gain.
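A rough sketch of querying the capabilities only once, assuming a connection object with a getCapabilities() method as in the libvirt Python bindings. CountingConnection and CapabilitiesCache are hypothetical names for illustration, not nova classes.

```python
class CountingConnection:
    """Hypothetical stand-in for a libvirt connection that counts expensive calls."""

    def __init__(self):
        self.capabilities_calls = 0

    def getCapabilities(self):
        self.capabilities_calls += 1
        return "<capabilities><host/></capabilities>"


class CapabilitiesCache:
    """Fetch the capabilities XML on first use and reuse it afterwards."""

    def __init__(self, conn):
        self._conn = conn
        self._caps_xml = None

    def get(self):
        # Only the first call reaches libvirtd; later calls reuse the result.
        if self._caps_xml is None:
            self._caps_xml = self._conn.getCapabilities()
        return self._caps_xml


if __name__ == "__main__":
    conn = CountingConnection()
    cache = CapabilitiesCache(conn)
    for _ in range(5):  # e.g. five periodic status updates
        cache.get()
    print(conn.capabilities_calls)
```

The trade-off of this approach is that a change in the host's capabilities is not noticed until the service holding the cache is restarted.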

You can see the processes executed by libvirtd with:
strace -Ff -p <libvirtd_pid> -e execve

You will see several hundred, or thousands (multi-arch), of similar execve lines:

29010 +++ exited with 0 +++
5382 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=29010, si_status=0, si_utime=1, si_stime=0} ---
29011 execve("/usr/bin/qemu-system-x86_64", ["/usr/bin/qemu-system-x86_64", "-device", "?", "-device", "pci-assign,?", "-device", "virtio-blk-pci,?", "-device", "virtio-net-pci,?", "-device", "scsi-disk,?"], [/* 2 vars */]) = 0
29011 +++ exited with 0 +++
5382 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=29011, si_status=0, si_utime=1, si_stime=0} ---
29012 execve("/usr/bin/qemu-system-x86_64", ["/usr/bin/qemu-system-x86_64", "-cpu", "?"], [/* 2 vars */]) = 0
29012 +++ exited with 0 +++
5382 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=29012, si_status=0, si_utime=1, si_stime=0} ---
29013 execve("/usr/bin/qemu-system-x86_64", ["/usr/bin/qemu-system-x86_64", "-help"], [/* 2 vars */]) = 0
29013 +++ exited with 0 +++
5382 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=29013, si_status=0, si_utime=1, si_stime=0} ---
29014 execve("/usr/bin/qemu-system-x86_64", ["/usr/bin/qemu-system-x86_64", "-device", "?", "-device", "pci-assign,?", "-device", "virtio-blk-pci,?", "-device", "virtio-net-pci,?", "-device", "scsi-disk,?"], [/* 2 vars */]) = 0
29014 +++ exited with 0 +++

(You can add the -ttt argument for time measurement.)
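To count the executions instead of reading them by eye, the strace output can be saved to a file and summarized. The sketch below assumes the line layout shown above ("<pid> execve(...)", optionally with a -ttt timestamp after the pid); count_execs is a hypothetical helper name.

```python
import re
import sys
from collections import Counter

# Matches '<pid> [<timestamp>] execve("<path>", [<argv>], ...'
EXECVE_RE = re.compile(r'^\d+\s+(?:\d+\.\d+\s+)?execve\("([^"]+)",\s*\[(.*?)\],')


def count_execs(lines):
    """Count execve calls, keyed by (binary, first command-line option)."""
    counts = Counter()
    for line in lines:
        match = EXECVE_RE.match(line)
        if not match:
            continue  # exit notices, SIGCHLD lines, etc.
        binary = match.group(1)
        argv = re.findall(r'"([^"]*)"', match.group(2))
        # argv[0] is the program name; use the first real option as the key.
        first_option = argv[1] if len(argv) > 1 else ""
        counts[(binary, first_option)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_execs.py strace.log
    with open(sys.argv[1]) as log:
        for (binary, option), n in count_execs(log).most_common():
            print(n, binary, option)
```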

Revision history for this message
Vish Ishaya (vishvananda) wrote :

This seems like a huge performance win.

tags: added: folsom-backport-potential
Changed in nova:
status: New → In Progress
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/19880

Changed in nova:
assignee: nobody → Vish Ishaya (vishvananda)
summary: - libvirt driver connection validation causes planty of unnecessary
- process execution with libvirt/qemu
+ libvirt driver connection validation causes unnecessary process
+ execution with libvirt/qemu
Revision history for this message
Daniel Berrange (berrange) wrote :

What version of libvirt are you using that exhibits this behaviour? Libvirt caches the results of querying QEMU, so repeated calls to getCapabilities should not cause any problems unless you have a fairly old libvirt.

Revision history for this message
Attila Fazekas (afazekas) wrote :

Known affected libvirt versions are: 0.9.8 and 0.9.11.8.

0.10.2 is probably not affected (it was tested with kvm, not with software emulation).

getCapabilities transfers more data than the LibVersion query in any case, so the validation should be changed regardless.

getCapabilities probably does not have a significant or measurable performance impact in the periodic status updates.

Revision history for this message
Daniel Berrange (berrange) wrote :

Also, which versions of QEMU / KVM are involved, and which emulators are installed?

Revision history for this message
Daniel Berrange (berrange) wrote :

Using systemtap I counted & timed the libvirt API calls that Nova is making. On Fedora 18 with qemu 1.2.0 and libvirt 1.0.0, I get the following (nb times are cumulative execution time for all counted API calls, in milliseconds)

Current code, during startup

auth_list:66 count=2 time=2
auth_polkit:70 count=2 time=33
open:1 count=2 time=14
get_lib_version:157 count=1 time=1
get_capabilities:7 count=57 time=393
num_of_domains:51 count=7 time=8
num_of_defined_domains:25 count=2 time=2
domain_lookup_by_name:23 count=1 time=0
node_get_info:6 count=15 time=25
get_type:3 count=2 time=2
get_version:4 count=5 time=5
get_hostname:59 count=5 time=5

Clearly there are far too many calls to getCapabilities here, but it is still only 400ms total time on my machine.

Just changing getCapabilities to getLibVersion in the test connection code changes the results to look like

auth_list:66 count=2 time=2
auth_polkit:70 count=2 time=29
open:1 count=2 time=13
get_lib_version:157 count=49 time=59
num_of_domains:51 count=7 time=8
num_of_defined_domains:25 count=2 time=2
domain_lookup_by_name:23 count=1 time=1
node_get_info:6 count=15 time=26
get_capabilities:7 count=9 time=64
get_type:3 count=2 time=2
get_version:4 count=5 time=6
get_hostname:59 count=5 time=6

So, as expected, there are far fewer calls to getCapabilities now, and a correspondingly large number of calls to getLibVersion(). Approx 300ms ha...
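For reference, the two listings can be totalled mechanically. The sketch below copies the figures from this comment verbatim; parse_report and total_ms are hypothetical helper names, and the line format is assumed to be "name:id count=N time=T".

```python
import re

LINE_RE = re.compile(r'^(\w+):\d+ count=(\d+) time=(\d+)$')

BEFORE = """
auth_list:66 count=2 time=2
auth_polkit:70 count=2 time=33
open:1 count=2 time=14
get_lib_version:157 count=1 time=1
get_capabilities:7 count=57 time=393
num_of_domains:51 count=7 time=8
num_of_defined_domains:25 count=2 time=2
domain_lookup_by_name:23 count=1 time=0
node_get_info:6 count=15 time=25
get_type:3 count=2 time=2
get_version:4 count=5 time=5
get_hostname:59 count=5 time=5
"""

AFTER = """
auth_list:66 count=2 time=2
auth_polkit:70 count=2 time=29
open:1 count=2 time=13
get_lib_version:157 count=49 time=59
num_of_domains:51 count=7 time=8
num_of_defined_domains:25 count=2 time=2
domain_lookup_by_name:23 count=1 time=1
node_get_info:6 count=15 time=26
get_capabilities:7 count=9 time=64
get_type:3 count=2 time=2
get_version:4 count=5 time=6
get_hostname:59 count=5 time=6
"""


def parse_report(report):
    """Map each API call name to a (count, cumulative_ms) tuple."""
    calls = {}
    for line in report.strip().splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            calls[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return calls


def total_ms(calls):
    """Sum the cumulative time over all API calls."""
    return sum(ms for _count, ms in calls.values())


if __name__ == "__main__":
    before, after = parse_report(BEFORE), parse_report(AFTER)
    print(total_ms(before), total_ms(after))   # 490 218
    print(total_ms(before) - total_ms(after))  # 272
```

On these numbers the total drops from 490 ms to 218 ms, a saving of 272 ms, almost all of it from the reduced number of getCapabilities calls (57 down to 9).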


Revision history for this message
Sean Dague (sdague) wrote :

I'm bumping up the priority of this one. The devstack on qemu in our CI environment would get dramatically faster with this change. That's Ubuntu 12.04 using qemu (no kvm acceleration, as it's running in kvm or xen guests).

It was a very good find by Attila, and something I'd like to see us get in soon.

Changed in nova:
importance: Medium → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/19880
Committed: http://github.com/openstack/nova/commit/ec3d7e4cb882eff42fa8f1e5f8f52723fb909b0e
Submitter: Jenkins
Branch: master

commit ec3d7e4cb882eff42fa8f1e5f8f52723fb909b0e
Author: Vishvananda Ishaya <email address hidden>
Date: Wed Jan 16 16:50:47 2013 -0800

    libvirt: Optimize test_connection and capabilities

    The getCapabilities call can be very slow so it is not a good choice
    for testing the libvirt connection. This patch switches to
    getLibVersion and also caches the result of getCapabilities so it
    doesn't need to be requested every time. Note that this means that
    nova-compute will need to be restarted if the capabilities of the
    host change. This is an acceptable risk because capabilities
    changes should be very rare and nova-compute should be restarted
    if libvirt is restarted or reinstalled.

    This simple change lowers boot time in my devstack install from
    22 seconds down to 8 seconds!

    Fixes bug 1100446

    Change-Id: I1b5072a906b19c6130957cf255e8d35b20990828

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → grizzly-3
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/folsom)

Fix proposed to branch: stable/folsom
Review: https://review.openstack.org/23304

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/folsom)

Reviewed: https://review.openstack.org/23304
Committed: http://github.com/openstack/nova/commit/f8c5492bef9e6057f4bd3cdc0b94e5bff4d7e5d8
Submitter: Jenkins
Branch: stable/folsom

commit f8c5492bef9e6057f4bd3cdc0b94e5bff4d7e5d8
Author: Vishvananda Ishaya <email address hidden>
Date: Wed Jan 16 16:50:47 2013 -0800

    libvirt: Optimize test_connection and capabilities

    The getCapabilities call can be very slow so it is not a good choice
    for testing the libvirt connection. This patch switches to
    getLibVersion and also caches the result of getCapabilities so it
    doesn't need to be requested every time. Note that this means that
    nova-compute will need to be restarted if the capabilities of the
    host change. This is an acceptable risk because capabilities
    changes should be very rare and nova-compute should be restarted
    if libvirt is restarted or reinstalled.

    This simple change lowers boot time in my devstack install from
    22 seconds down to 8 seconds!

    Fixes bug 1100446
    (cherry picked from commit ec3d7e4cb882eff42fa8f1e5f8f52723fb909b0e)

    Conflicts:
     nova/virt/libvirt/driver.py

    Change-Id: I3adbb48a2859a54ed93503f26de684acbd157841

Thierry Carrez (ttx)
Changed in nova:
milestone: grizzly-3 → 2013.1