If libguestfs hangs when provisioning an instance, nova will wait forever.

Bug #1286256 reported by Lars Kellogg-Stedman
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Daniel Berrange

Bug Description

In some situations (using nested KVM in an environment where nested KVM support is buggy), the appliance started by libguestfs will hang and the libguestfs-launched qemu process will never exit. This causes the launched instance to get stuck in state=spawning forever (or until someone explicitly kills the libguestfs appliance).

We should wrap the call to guestfs.Guestfs.launch with some sort of timeout to detect this situation and at least report an error.
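A minimal sketch of the kind of timeout wrapper suggested above, assuming a plain threading environment. The names `call_with_timeout` and `LaunchTimeout` are hypothetical and are not part of nova or libguestfs; nova itself runs under eventlet, so a real fix would likely look different.

```python
import threading


class LaunchTimeout(Exception):
    """Raised when the wrapped call does not finish in time (hypothetical name)."""


def call_with_timeout(func, timeout, *args, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread and wait at most `timeout` seconds."""
    result = {}

    def worker():
        try:
            result["value"] = func(*args, **kwargs)
        except BaseException as exc:  # hand any failure back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The call is still blocked. We can only report it; the thread
        # (and any process it is waiting on) keeps running.
        raise LaunchTimeout("call did not finish within %s seconds" % timeout)
    if "error" in result:
        raise result["error"]
    return result.get("value")
```

With a real handle this would be used as `call_with_timeout(g.launch, 120)`. Note that the sketch only detects the hang and reports an error; it does not kill the stuck qemu process, which would still need to be cleaned up separately.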

Revision history for this message
Richard Jones (rjones-redhat) wrote :

To summarise what I discussed with Lars yesterday about this bug:

(1) If libguestfs hangs because of nested KVM, then it likely indicates your guest is going to hang too, so libguestfs is just being the canary in the mine here. However:

(2) Some users select libvirt_type=qemu to use software emulation when they know nested KVM is broken.

Libguestfs doesn't honor setting (2), but it certainly should, and it's easy to achieve that. After creating the handle but before calling g.launch(), you need to add the following bit of code:

    # some test here that libvirt_type=qemu; "libvirt_type" is a placeholder
    # for however that setting is looked up
    if libvirt_type == "qemu":
        try:
            g.set_backend_settings("force_tcg")
        except AttributeError:
            # g.set_backend_settings method doesn't exist, ignore
            pass

----

Note for RHOS: If you want g.set_backend_settings to be backported, you need to open a bug in https://bugzilla.redhat.com

Revision history for this message
Lars Kellogg-Stedman (larsks) wrote :

Note that bug https://bugs.launchpad.net/nova/+bug/1286257 is open on the issue of using force_tcg when libvirt_type=qemu. This bug is explicitly about waiting forever for the libguestfs process. Since force_tcg isn't available in any shipping version of libguestfs in Ubuntu, RHEL, or Fedora, I think that timing out this operation gracefully is going to be an important fix.

Tracy Jones (tjones-i)
tags: added: libvirt
Solly Ross (sross-7)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :
tags: added: libguestfs
Revision history for this message
Qin Zhao (zhaoqin) wrote :

I do not know which version of the OpenStack code was running in https://bugzilla.redhat.com/show_bug.cgi?id=1087127. If it uses Havana code, https://bugs.launchpad.net/nova/+bug/1270304 is a possible root cause.

In my bug https://bugs.launchpad.net/nova/+bug/1313477, Icehouse code is running. I also used to encounter https://bugs.launchpad.net/nova/+bug/1270304 with Icehouse code in February this year. After the patch was applied, a stuck guestfs became less likely. However, it still occurred in my bug https://bugs.launchpad.net/nova/+bug/1313477.

Revision history for this message
Qin Zhao (zhaoqin) wrote :

I think I have found the root cause of https://bugs.launchpad.net/nova/+bug/1313477. It is a deadlock issue. Is this one a similar deadlock?

Revision history for this message
Lars Kellogg-Stedman (larsks) wrote : Re: [Bug 1286256] Re: If libguestfs hangs when provisioning an instance, nova will wait forever.

On Thu, May 29, 2014 at 03:35:15PM -0000, Qin Zhao wrote:
> I think I have found the root cause of
> https://bugs.launchpad.net/nova/+bug/1313477. It is a deadlock
> issue. Is this one a similar deadlock?

I don't think this is the same issue. This is clearly a problem with
the emulator hanging -- one can replicate the problem by simply
running "libguestfs-test-tool". It only happens when attempting to
use nested KVM in an environment in which nested KVM does not work
correctly.

--
Lars Kellogg-Stedman <email address hidden> | larsks @ irc
Cloud Engineering / OpenStack | " " @ twitter
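The reproduction described in this reply can be scripted so that a hang is reported instead of blocking the shell. This is only a sketch: the `probe` helper and its return strings are made up for illustration, and the timeout value is an arbitrary choice.

```python
import subprocess


def probe(cmd, timeout):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'hung'."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child when the timeout expires.
        return "hung"
    return "ok" if proc.returncode == 0 else "failed"
```

On an affected host, `probe(["libguestfs-test-tool"], 300)` would be expected to come back as "hung", while a host with working (or no) nested KVM should report "ok". If the tool is not installed, `subprocess.run` raises `FileNotFoundError`, which the sketch deliberately leaves unhandled.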

Revision history for this message
Jason Brooks (jasonbrooks) wrote :

I'm experiencing this issue now. Nested KVM, libguestfs hangs and nova waits forever.

Interestingly, if I kill the first qemu-kvm process, the hanging libguestfs one, then nova proceeds to start up my instance without issue.

With a qemu wrapper that swaps in the Westmere cpu type, I can get the guestfs test tool to complete, but I can't figure out how to make whichever OpenStack component is responsible for kicking this off use the wrapper.

Revision history for this message
Deepak C Shetty (dpkshetty) wrote :

@Jason,
   What is your virt_type in nova.conf set to: qemu or kvm? Just curious whether it works for you because it's set to qemu.
I couldn't get it to work with kvm.

Revision history for this message
Deepak C Shetty (dpkshetty) wrote :

@Lars,

>I don't think this is the same issue. This is clearly a problem with
>the emulator hanging -- one can replicate the problem by simply
>running "libguestfs-test-tool". It only happens when attempting to
>use nested KVM in an environment in which nested KVM does not work
>correctly.

What do you mean by "use nested KVM in an environment in which nested KVM does not work correctly"? How does one know that?

Using qemu/libvirt tools, nested KVM works just fine, and I have been using nested KVM for a while now, until I hit the libguestfs issue, which only shows up in the OpenStack Nova use case.

Revision history for this message
Deepak C Shetty (dpkshetty) wrote :

Some more observations regarding this issue that I am seeing in my devstack-on-F20 setup:

1) The libguestfs qemu process/instance created has -machine accel=kvm in spite of my nova.conf having virt_type = qemu. Is this correct?

2) Killing the libguestfs process causes Nova to go ahead and spawn the instance successfully, and the qemu instance process _does_not_ have accel=kvm (which is expected and right), as I have virt_type=qemu. The instance is spawned fine with ACTIVE/Running state in nova list.

3) Connecting to the VNC console of the instance, the OS has booted fine and is at the login prompt.

Then I tried to nova boot the exact same thing as before but with virt_type = kvm, and ...

1) I see the same Nova instance hung in the 'spawning' state.

2) The libguestfs qemu process has accel=kvm.

3) Killing the libguestfs qemu process causes Nova to proceed fine, and the instance is in ACTIVE/Running state.

4) BUT... looking at the VNC console, the OS isn't booted... it's stuck at "Starting up...." forever!
Is this issue of the OS not booting when nested KVM is enabled related to this bug or not?

thanx,
deepak

Revision history for this message
Richard Jones (rjones-redhat) wrote :

> 1) The libguestfs qemu process/instance created has -machine accel=kvm
> in spite of my nova.conf having virt_type = qemu. Is this correct?

No, it's not correct and needs to be fixed. See comment 1.

Changed in nova:
assignee: nobody → Daniel Berrange (berrange)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/104262

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/104262
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8b6ea606d9dd883857c13ae43baf1e80aa0e8c58
Submitter: Jenkins
Branch: master

commit 8b6ea606d9dd883857c13ae43baf1e80aa0e8c58
Author: Daniel P. Berrange <email address hidden>
Date: Wed Jul 2 18:06:30 2014 +0100

    virt: force TCG with libguestfs unless KVM is enabled in libvirt

    If the libvirt driver has not been configured to use KVM, then
    the libguestfs module should be forced to use TCG. This is
    particularly important when running Nova inside a VM, which
    might claim to have VMX/SVM support when it is in fact broken.
    This will avoid libguestfs hanging in such scenarios.

    Resolves-bug: #1286256
    Change-Id: I9316dcedd65244c60d468b270311f032b45b051f
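
The rule described in this commit message can be sketched roughly as follows. This is not the merged nova patch: `configure_accel` and `FakeHandle` are hypothetical names, and the `set_backend_settings("force_tcg")` call simply mirrors the form shown in comment 1, so the argument type expected by the installed Python bindings should be checked before relying on it.

```python
def configure_accel(handle, virt_type):
    """Ask libguestfs for TCG unless the libvirt driver is configured for KVM.

    Returns True if TCG was requested, False otherwise.
    """
    if virt_type == "kvm":
        # KVM is explicitly enabled, so leave libguestfs on its defaults.
        return False
    try:
        handle.set_backend_settings("force_tcg")
    except AttributeError:
        # Older bindings have no set_backend_settings; nothing more to do.
        return False
    return True


class FakeHandle:
    """Stand-in for a guestfs handle so the sketch runs without libguestfs."""

    def __init__(self):
        self.settings = []

    def set_backend_settings(self, value):
        self.settings.append(value)
```

For example, `configure_accel(FakeHandle(), "qemu")` requests TCG, while passing "kvm" leaves the handle untouched. With a real handle, the function would be called after creating the handle and before `launch()`.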

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → juno-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-3 → 2014.2
Revision history for this message
Jian Wen (wenjianhn) wrote :

g.set_backend_settings has not been backported to libguestfs-1.20.

A workaround is to disable the kvm module by adding the following lines to /etc/modprobe.d/blacklist.conf.

blacklist kvm
blacklist kvm-intel
install kvm /bin/true
install kvm-intel /bin/true
