LibVirtD crashing after many hours (100+)

Bug #953656 reported by justinsb on 2012-03-13
This bug affects 1 person
Affects                   Status   Importance  Assigned to  Milestone
OpenStack Compute (nova)  Invalid  Undecided   Unassigned
libvirt (Ubuntu)          Expired  High        Unassigned

Bug Description

Running devstack, after many hours (100+?) of uptime, libvirt appears to crash (this is not under heavy usage, just long uptime). When I launch instances, nothing is printed by nova-cpu, but the instance state is ERROR.

Nova scheduler prints this:

2012-03-12 19:08:10 WARNING nova.scheduler.manager [-] Failed to schedule_run_instance: No valid host was found.
2012-03-12 19:08:10 WARNING nova.scheduler.manager [-] Setting instance 0c96bc97-aa90-4501-98bc-b615c52899ef to ERROR state.

It looks like n-cpu is hanging here:

2012-03-12 14:48:22 DEBUG nova.manager [-] Running periodic task ComputeManager._poll_bandwidth_usage from (pid=16597) periodic_tasks /opt/stack/nova/nova/manager.py:152
2012-03-12 14:48:22 DEBUG nova.manager [-] Running periodic task ComputeManager.update_available_resource from (pid=16597) periodic_tasks /opt/stack/nova/nova/manager.py:152

If I Control-C nova-cpu, the following is printed:

^Clibvir: Remote error : poll on socket failed: Interrupted system call

An attempted restart of nova-cpu quickly hangs here:

2012-03-12 19:11:01 DEBUG nova.compute.manager [req-e4f296a6-27e0-4441-b1e2-8366fbc79a75 None None] [instance: 2c3e8fc4-5938-4c3a-856b-83ea338c16cc] Checking state from (pid=19695) _get_power_state /opt/stack/nova/nova/compute/manager.py:260
2012-03-12 19:11:01 DEBUG nova.virt.libvirt.connection [req-e4f296a6-27e0-4441-b1e2-8366fbc79a75 None None] Connecting to libvirt: qemu:///system from (pid=19695) _get_connection /opt/stack/nova/nova/virt/libvirt/connection.py:253

Again, Control-C gives the same error:

^Clibvir: Remote error : poll on socket failed: Interrupted system call

"virsh list" also fails; it also hangs.

This is with the 'recommended' configuration: DevStack with KVM on Ubuntu Oneiric.

There are two libvirtd processes running (is this normal?).

If I attach gdb to each libvirtd, I can get stack traces:

(gdb) thread apply all bt

Thread 1 (Thread 0x7f579dafe700 (LWP 22512)):
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1 0x00007f57a3afa404 in _L_lock_2263 () at tzset.c:616
#2 0x00007f57a3afa217 in __tz_convert (timer=0x7f57a3df5e28, use_localtime=1, tp=0x7f579dafc820) at tzset.c:619
#3 0x00007f57a54cef47 in virLogMessage (category=0x7f57a55b766a "file.util/command.c", priority=1, funcname=0x7f57a55b82c3 "virCommandHook", linenr=1842, flags=0,
    fmt=<optimized out>) at util/logging.c:727
#4 0x00007f57a54bdf0e in virCommandHook (data=0x7f5798109f00) at util/command.c:1842
#5 0x00007f57a54bec4e in virExecWithHook (argv=0x7f579800f7e0, envp=0x7f57980b69d0, keepfd=<optimized out>, retpid=0x0, infd=-1, outfd=<optimized out>,
    errfd=0x7f579810a06c, flags=4, hook=0x7f57a54bdc30 <virCommandHook>, data=0x7f5798109f00, pidfile=0x0) at util/command.c:531
#6 0x00007f57a54c0bea in virCommandRunAsync (cmd=0x7f5798109f00, pid=0x0) at util/command.c:1911
#7 0x00007f57a54c12e3 in virCommandRun (cmd=0x7f5798109f00, exitstatus=0x0) at util/command.c:1721
#8 0x0000000000452d62 in qemuCapsExtractVersionInfo (qemu=0x7f57980e6fa0 "/usr/bin/qemu", arch=0x4cfc27 "i686", retversion=0x0, retflags=0x7f579dafd338)
    at qemu/qemu_capabilities.c:1238
#9 0x00000000004535f5 in qemuCapsInitGuest (caps=0x7f5798109bd0, old_caps=0x7f5798109ac0, hostmachine=<optimized out>, info=0x6ff7e0, hvm=1) at qemu/qemu_capabilities.c:572
#10 0x0000000000453dc9 in qemuCapsInit (old_caps=0x7f5798109ac0) at qemu/qemu_capabilities.c:826
#11 0x000000000043e9a0 in qemuCreateCapabilities (oldcaps=<optimized out>, driver=0xc39fb0) at qemu/qemu_driver.c:234
#12 0x000000000044ac36 in qemudGetCapabilities (conn=<optimized out>) at qemu/qemu_driver.c:967
#13 0x00007f57a553e77c in virConnectGetCapabilities (conn=0xbbc4d0) at libvirt.c:5327
#14 0x000000000042ebf3 in remoteDispatchGetCapabilities (server=<optimized out>, client=<optimized out>, conn=<optimized out>, hdr=<optimized out>, rerr=0x7f579dafdc10,
    args=<optimized out>, ret=0x7f579dafdd00) at remote_dispatch_bodies.h:2658
#15 0x000000000043299b in remoteDispatchClientCall (qemu_protocol=<optimized out>, msg=0x7f5790104ff0, client=0x7f5798001a60, server=0xba42a0) at dispatch.c:516
#16 remoteDispatchClientRequest (server=0xba42a0, client=0x7f5798001a60, msg=0x7f5790104ff0) at dispatch.c:394
#17 0x000000000041f0bf in qemudWorker (data=0xbacbf0) at libvirtd.c:1619
#18 0x00007f57a3e01efc in start_thread (arg=0x7f579dafe700) at pthread_create.c:304
#19 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#20 0x0000000000000000 in ?? ()

The other libvirtd process shows this:

(gdb) thread apply all bt

Thread 7 (Thread 0x7f57a0303700 (LWP 15496)):
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007f57a3e041e5 in _L_lock_883 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f57a3e0403a in __pthread_mutex_lock (mutex=0xc39fb0) at pthread_mutex_lock.c:61
#3 0x000000000044ad6d in qemudClose (conn=0xc73f50) at qemu/qemu_driver.c:863
#4 0x00007f57a552dd8b in virReleaseConnect (conn=0xc73f50) at datatypes.c:114
#5 0x00007f57a552e0e8 in virUnrefConnect (conn=0xc73f50) at datatypes.c:149
#6 0x00007f57a5535818 in virConnectClose (conn=0xc73f50) at libvirt.c:1363
#7 0x000000000041cee2 in qemudFreeClient (client=0x7f5798043c20) at libvirtd.c:2327
#8 0x000000000041d294 in qemudRunLoop (opaque=0xba42a0) at libvirtd.c:2402
#9 0x00007f57a3e01efc in start_thread (arg=0x7f57a0303700) at pthread_create.c:304
#10 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#11 0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7f579fb02700 (LWP 15497)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f57a54dc82a in virCondWait (c=<optimized out>, m=<optimized out>) at util/threads-pthread.c:117
#2 0x000000000041f02d in qemudWorker (data=0xbacb90) at libvirtd.c:1598
#3 0x00007f57a3e01efc in start_thread (arg=0x7f579fb02700) at pthread_create.c:304
#4 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f579f301700 (LWP 15498)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f57a54dc82a in virCondWait (c=<optimized out>, m=<optimized out>) at util/threads-pthread.c:117
#2 0x000000000041f02d in qemudWorker (data=0xbacba8) at libvirtd.c:1598
#3 0x00007f57a3e01efc in start_thread (arg=0x7f579f301700) at pthread_create.c:304
#4 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f579eb00700 (LWP 15499)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f57a54dc82a in virCondWait (c=<optimized out>, m=<optimized out>) at util/threads-pthread.c:117
#2 0x000000000041f02d in qemudWorker (data=0xbacbc0) at libvirtd.c:1598
#3 0x00007f57a3e01efc in start_thread (arg=0x7f579eb00700) at pthread_create.c:304
#4 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f579e2ff700 (LWP 15500)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f57a54dc82a in virCondWait (c=<optimized out>, m=<optimized out>) at util/threads-pthread.c:117
#2 0x000000000041f02d in qemudWorker (data=0xbacbd8) at libvirtd.c:1598
#3 0x00007f57a3e01efc in start_thread (arg=0x7f579e2ff700) at pthread_create.c:304
#4 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f579dafe700 (LWP 15501)):
#0 0x00007f57a3b30773 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x00007f57a54c163b in virCommandProcessIO (cmd=0x7f5798109f00) at util/command.c:1513
#2 virCommandRun (cmd=0x7f5798109f00, exitstatus=0x0) at util/command.c:1735
#3 0x0000000000452d62 in qemuCapsExtractVersionInfo (qemu=0x7f57980e6fa0 "/usr/bin/qemu", arch=0x4cfc27 "i686", retversion=0x0, retflags=0x7f579dafd338)
    at qemu/qemu_capabilities.c:1238
#4 0x00000000004535f5 in qemuCapsInitGuest (caps=0x7f5798109bd0, old_caps=0x7f5798109ac0, hostmachine=<optimized out>, info=0x6ff7e0, hvm=1) at qemu/qemu_capabilities.c:572
#5 0x0000000000453dc9 in qemuCapsInit (old_caps=0x7f5798109ac0) at qemu/qemu_capabilities.c:826
#6 0x000000000043e9a0 in qemuCreateCapabilities (oldcaps=<optimized out>, driver=0xc39fb0) at qemu/qemu_driver.c:234
#7 0x000000000044ac36 in qemudGetCapabilities (conn=<optimized out>) at qemu/qemu_driver.c:967
#8 0x00007f57a553e77c in virConnectGetCapabilities (conn=0xbbc4d0) at libvirt.c:5327
#9 0x000000000042ebf3 in remoteDispatchGetCapabilities (server=<optimized out>, client=<optimized out>, conn=<optimized out>, hdr=<optimized out>, rerr=0x7f579dafdc10,
    args=<optimized out>, ret=0x7f579dafdd00) at remote_dispatch_bodies.h:2658
#10 0x000000000043299b in remoteDispatchClientCall (qemu_protocol=<optimized out>, msg=0x7f5790104ff0, client=0x7f5798001a60, server=0xba42a0) at dispatch.c:516
#11 remoteDispatchClientRequest (server=0xba42a0, client=0x7f5798001a60, msg=0x7f5790104ff0) at dispatch.c:394
#12 0x000000000041f0bf in qemudWorker (data=0xbacbf0) at libvirtd.c:1619
#13 0x00007f57a3e01efc in start_thread (arg=0x7f579dafe700) at pthread_create.c:304
#14 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f57a5c23800 (LWP 15495)):
#0 0x00007f57a3e031f8 in pthread_join (threadid=140014326396672, thread_return=0x0) at pthread_join.c:89
#1 0x000000000041bdb3 in main (argc=<optimized out>, argv=<optimized out>) at libvirtd.c:3418

If I kill the first (single-threaded) libvirtd process, libvirt starts responding again.

This looks similar to this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=757382

As libvirt seems to have threading issues, perhaps we should scope every libvirt call in a global lock?
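
For illustration, a minimal sketch of what such a global lock could look like on the nova side, using the libvirt Python bindings (the decorator and the wrapped function are hypothetical, not actual nova code):

import threading

import libvirt  # python-libvirt bindings

_LIBVIRT_LOCK = threading.Lock()  # hypothetical process-wide lock

def synchronized(fn):
    # Serialize every wrapped call through the one global lock.
    def wrapper(*args, **kwargs):
        with _LIBVIRT_LOCK:
            return fn(*args, **kwargs)
    return wrapper

@synchronized
def get_capabilities(uri='qemu:///system'):
    # A single, fully serialized round trip to libvirtd.
    conn = libvirt.open(uri)
    try:
        return conn.getCapabilities()
    finally:
        conn.close()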

Thierry Carrez (ttx) wrote:

Looks like a libvirt issue -- keeping the task open since it may affect minimum dependency versions.

Changed in nova:
status: New → Incomplete
Thierry Carrez (ttx) wrote:

@Dan: since you are the author of the upstream patch, do you think it applies to us? Which version of libvirt introduces the fix, if any?

Daniel Berrange (berrange) wrote:

> As libvirt seems to have threading issues, perhaps we should scope every libvirt call in a global lock?

This won't solve the problem. The flaw seen here is a deadlock between a thread in libvirtd and a child process it forks during VM startup, so it can occur even if you serialize all libvirt API calls.
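
To make the class of bug concrete, here is a minimal standalone sketch (in Python rather than C, with an explicit lock standing in for glibc's internal timezone mutex): after fork() only the forking thread exists in the child, so a lock held by any other thread at fork time can never be released there.

import os
import sys
import threading
import time

lock = threading.Lock()

def holder():
    # Stands in for the libvirtd thread holding the tzset/localtime
    # mutex at the moment of the fork.
    with lock:
        time.sleep(10)

threading.Thread(target=holder, daemon=True).start()
time.sleep(0.5)  # make sure the lock is held before forking

pid = os.fork()
if pid == 0:
    # Child: the lock was inherited in its locked state, but the holder
    # thread does not exist here, so this acquire blocks forever -- the
    # same shape as the child hanging under __tz_convert in the trace above.
    lock.acquire()
    sys.exit(0)  # never reached
os.waitpid(pid, 0)  # the parent now waits on a child that cannot finish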

> @Dan: since you are the author of the upstream patch, do you think it applies to us? Which version of libvirt introduces the fix, if any?

The fixes are present in libvirt >= 0.9.8.

The fixes can be applied to older libvirt releases (e.g. 0.9.6/0.9.7) without too much trouble. For reference, the fixes that the Ubuntu libvirt maintainers would need to backport are:

commit 3ec128989606278635a7c5dfbeee959692d12e15
Author: Daniel P. Berrange <email address hidden>
Date: Tue Nov 29 12:11:01 2011 +0000

    Add internal APIs for dealing with time

commit 32d3ec7466a554f7a4a3e9d4017a24aa540ecf18
Author: Daniel P. Berrange <email address hidden>
Date: Tue Nov 29 12:32:31 2011 +0000

    Make logging async signal safe wrt time stamp generation

commit a8bb75a3e65f0ae866f3b3fd60c57b2aa2050017
Author: Daniel P. Berrange <email address hidden>
Date: Tue Nov 29 12:33:23 2011 +0000

    Remove time APIs from src/util/util.h
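
The gist of those commits, per their subjects, is to stop calling localtime()/tzset() on the logging path and derive timestamps with async-signal-safe arithmetic instead, so that code running between fork() and exec() never touches glibc's timezone lock. A rough sketch of that principle (in Python, with the offset caching and helper name invented for illustration):

import time

# Captured once at startup, while the process is still single-threaded.
# time.timezone is the non-DST offset west of UTC, so this ignores DST;
# an acceptable trade-off for log timestamps.
_UTC_OFFSET = -time.timezone

def log_timestamp(now=None):
    # gmtime() needs no timezone lookup, and the formatting below is
    # plain arithmetic, so no process-wide lock is ever taken.
    secs = (time.time() if now is None else now) + _UTC_OFFSET
    t = time.gmtime(secs)
    return '%04d-%02d-%02d %02d:%02d:%02d' % t[:6]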

Thierry Carrez (ttx) wrote:

Not a Nova bug: libvirt 0.9.8 has been the minimum required version since Essex.

Changed in nova:
status: Incomplete → Invalid
Serge Hallyn (serge-hallyn) wrote:

Can the bug submitter confirm which Ubuntu release and which libvirt version he is running?

Serge Hallyn (serge-hallyn) wrote:

Setting to Incomplete while awaiting feedback.

Changed in libvirt (Ubuntu):
importance: Undecided → High
status: New → Incomplete
Launchpad Janitor (janitor) wrote:

[Expired for libvirt (Ubuntu) because there has been no activity for 60 days.]

Changed in libvirt (Ubuntu):
status: Incomplete → Expired