LibVirtD crashing after many hours (100+)

Bug #953656 reported by justinsb
Affects                    Status    Importance  Assigned to  Milestone
OpenStack Compute (nova)   Invalid   Undecided   Unassigned
libvirt (Ubuntu)           Expired   High        Unassigned

Bug Description

Running devstack, libvirt appears to crash after many hours (100+?) of uptime. This is not under heavy usage, just long uptime. When I launch instances, nothing is printed in the nova-cpu log, but the instance state is ERROR.

Nova scheduler prints this:

2012-03-12 19:08:10 WARNING nova.scheduler.manager [-] Failed to schedule_run_instance: No valid host was found.
2012-03-12 19:08:10 WARNING nova.scheduler.manager [-] Setting instance 0c96bc97-aa90-4501-98bc-b615c52899ef to ERROR state.

It looks like n-cpu is hanging here:

2012-03-12 14:48:22 DEBUG nova.manager [-] Running periodic task ComputeManager._poll_bandwidth_usage from (pid=16597) periodic_tasks /opt/stack/nova/nova/manager.py:152
2012-03-12 14:48:22 DEBUG nova.manager [-] Running periodic task ComputeManager.update_available_resource from (pid=16597) periodic_tasks /opt/stack/nova/nova/manager.py:152

If I Control-C nova-cpu, the following is printed:

^Clibvir: Remote error : poll on socket failed: Interrupted system call

An attempted restart quickly hangs here:

2012-03-12 19:11:01 DEBUG nova.compute.manager [req-e4f296a6-27e0-4441-b1e2-8366fbc79a75 None None] [instance: 2c3e8fc4-5938-4c3a-856b-83ea338c16cc] Checking state from (pid=19695) _get_power_state /opt/stack/nova/nova/compute/manager.py:260
2012-03-12 19:11:01 DEBUG nova.virt.libvirt.connection [req-e4f296a6-27e0-4441-b1e2-8366fbc79a75 None None] Connecting to libvirt: qemu:///system from (pid=19695) _get_connection /opt/stack/nova/nova/virt/libvirt/connection.py:253

Again, Control-C gives the same error:

^Clibvir: Remote error : poll on socket failed: Interrupted system call

"virsh list" also fails: it hangs too.

This is with the 'recommended' configuration: DevStack with KVM on Ubuntu Oneiric.

There are two libvirtd processes running (is this normal?)

If I attach gdb to each libvirtd process, I can get stack traces:

(gdb) thread apply all bt

Thread 1 (Thread 0x7f579dafe700 (LWP 22512)):
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1 0x00007f57a3afa404 in _L_lock_2263 () at tzset.c:616
#2 0x00007f57a3afa217 in __tz_convert (timer=0x7f57a3df5e28, use_localtime=1, tp=0x7f579dafc820) at tzset.c:619
#3 0x00007f57a54cef47 in virLogMessage (category=0x7f57a55b766a "file.util/command.c", priority=1, funcname=0x7f57a55b82c3 "virCommandHook", linenr=1842, flags=0,
    fmt=<optimized out>) at util/logging.c:727
#4 0x00007f57a54bdf0e in virCommandHook (data=0x7f5798109f00) at util/command.c:1842
#5 0x00007f57a54bec4e in virExecWithHook (argv=0x7f579800f7e0, envp=0x7f57980b69d0, keepfd=<optimized out>, retpid=0x0, infd=-1, outfd=<optimized out>,
    errfd=0x7f579810a06c, flags=4, hook=0x7f57a54bdc30 <virCommandHook>, data=0x7f5798109f00, pidfile=0x0) at util/command.c:531
#6 0x00007f57a54c0bea in virCommandRunAsync (cmd=0x7f5798109f00, pid=0x0) at util/command.c:1911
#7 0x00007f57a54c12e3 in virCommandRun (cmd=0x7f5798109f00, exitstatus=0x0) at util/command.c:1721
#8 0x0000000000452d62 in qemuCapsExtractVersionInfo (qemu=0x7f57980e6fa0 "/usr/bin/qemu", arch=0x4cfc27 "i686", retversion=0x0, retflags=0x7f579dafd338)
    at qemu/qemu_capabilities.c:1238
#9 0x00000000004535f5 in qemuCapsInitGuest (caps=0x7f5798109bd0, old_caps=0x7f5798109ac0, hostmachine=<optimized out>, info=0x6ff7e0, hvm=1) at qemu/qemu_capabilities.c:572
#10 0x0000000000453dc9 in qemuCapsInit (old_caps=0x7f5798109ac0) at qemu/qemu_capabilities.c:826
#11 0x000000000043e9a0 in qemuCreateCapabilities (oldcaps=<optimized out>, driver=0xc39fb0) at qemu/qemu_driver.c:234
#12 0x000000000044ac36 in qemudGetCapabilities (conn=<optimized out>) at qemu/qemu_driver.c:967
#13 0x00007f57a553e77c in virConnectGetCapabilities (conn=0xbbc4d0) at libvirt.c:5327
#14 0x000000000042ebf3 in remoteDispatchGetCapabilities (server=<optimized out>, client=<optimized out>, conn=<optimized out>, hdr=<optimized out>, rerr=0x7f579dafdc10,
    args=<optimized out>, ret=0x7f579dafdd00) at remote_dispatch_bodies.h:2658
#15 0x000000000043299b in remoteDispatchClientCall (qemu_protocol=<optimized out>, msg=0x7f5790104ff0, client=0x7f5798001a60, server=0xba42a0) at dispatch.c:516
#16 remoteDispatchClientRequest (server=0xba42a0, client=0x7f5798001a60, msg=0x7f5790104ff0) at dispatch.c:394
#17 0x000000000041f0bf in qemudWorker (data=0xbacbf0) at libvirtd.c:1619
#18 0x00007f57a3e01efc in start_thread (arg=0x7f579dafe700) at pthread_create.c:304
#19 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#20 0x0000000000000000 in ?? ()

The other libvirt process shows this:

(gdb) thread apply all bt

Thread 7 (Thread 0x7f57a0303700 (LWP 15496)):
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007f57a3e041e5 in _L_lock_883 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f57a3e0403a in __pthread_mutex_lock (mutex=0xc39fb0) at pthread_mutex_lock.c:61
#3 0x000000000044ad6d in qemudClose (conn=0xc73f50) at qemu/qemu_driver.c:863
#4 0x00007f57a552dd8b in virReleaseConnect (conn=0xc73f50) at datatypes.c:114
#5 0x00007f57a552e0e8 in virUnrefConnect (conn=0xc73f50) at datatypes.c:149
#6 0x00007f57a5535818 in virConnectClose (conn=0xc73f50) at libvirt.c:1363
#7 0x000000000041cee2 in qemudFreeClient (client=0x7f5798043c20) at libvirtd.c:2327
#8 0x000000000041d294 in qemudRunLoop (opaque=0xba42a0) at libvirtd.c:2402
#9 0x00007f57a3e01efc in start_thread (arg=0x7f57a0303700) at pthread_create.c:304
#10 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#11 0x0000000000000000 in ?? ()

Thread 6 (Thread 0x7f579fb02700 (LWP 15497)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f57a54dc82a in virCondWait (c=<optimized out>, m=<optimized out>) at util/threads-pthread.c:117
#2 0x000000000041f02d in qemudWorker (data=0xbacb90) at libvirtd.c:1598
#3 0x00007f57a3e01efc in start_thread (arg=0x7f579fb02700) at pthread_create.c:304
#4 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7f579f301700 (LWP 15498)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f57a54dc82a in virCondWait (c=<optimized out>, m=<optimized out>) at util/threads-pthread.c:117
#2 0x000000000041f02d in qemudWorker (data=0xbacba8) at libvirtd.c:1598
#3 0x00007f57a3e01efc in start_thread (arg=0x7f579f301700) at pthread_create.c:304
#4 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7f579eb00700 (LWP 15499)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f57a54dc82a in virCondWait (c=<optimized out>, m=<optimized out>) at util/threads-pthread.c:117
#2 0x000000000041f02d in qemudWorker (data=0xbacbc0) at libvirtd.c:1598
#3 0x00007f57a3e01efc in start_thread (arg=0x7f579eb00700) at pthread_create.c:304
#4 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7f579e2ff700 (LWP 15500)):
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1 0x00007f57a54dc82a in virCondWait (c=<optimized out>, m=<optimized out>) at util/threads-pthread.c:117
#2 0x000000000041f02d in qemudWorker (data=0xbacbd8) at libvirtd.c:1598
#3 0x00007f57a3e01efc in start_thread (arg=0x7f579e2ff700) at pthread_create.c:304
#4 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5 0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f579dafe700 (LWP 15501)):
#0 0x00007f57a3b30773 in __GI___poll (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
#1 0x00007f57a54c163b in virCommandProcessIO (cmd=0x7f5798109f00) at util/command.c:1513
#2 virCommandRun (cmd=0x7f5798109f00, exitstatus=0x0) at util/command.c:1735
#3 0x0000000000452d62 in qemuCapsExtractVersionInfo (qemu=0x7f57980e6fa0 "/usr/bin/qemu", arch=0x4cfc27 "i686", retversion=0x0, retflags=0x7f579dafd338)
    at qemu/qemu_capabilities.c:1238
#4 0x00000000004535f5 in qemuCapsInitGuest (caps=0x7f5798109bd0, old_caps=0x7f5798109ac0, hostmachine=<optimized out>, info=0x6ff7e0, hvm=1) at qemu/qemu_capabilities.c:572
#5 0x0000000000453dc9 in qemuCapsInit (old_caps=0x7f5798109ac0) at qemu/qemu_capabilities.c:826
#6 0x000000000043e9a0 in qemuCreateCapabilities (oldcaps=<optimized out>, driver=0xc39fb0) at qemu/qemu_driver.c:234
#7 0x000000000044ac36 in qemudGetCapabilities (conn=<optimized out>) at qemu/qemu_driver.c:967
#8 0x00007f57a553e77c in virConnectGetCapabilities (conn=0xbbc4d0) at libvirt.c:5327
#9 0x000000000042ebf3 in remoteDispatchGetCapabilities (server=<optimized out>, client=<optimized out>, conn=<optimized out>, hdr=<optimized out>, rerr=0x7f579dafdc10,
    args=<optimized out>, ret=0x7f579dafdd00) at remote_dispatch_bodies.h:2658
#10 0x000000000043299b in remoteDispatchClientCall (qemu_protocol=<optimized out>, msg=0x7f5790104ff0, client=0x7f5798001a60, server=0xba42a0) at dispatch.c:516
#11 remoteDispatchClientRequest (server=0xba42a0, client=0x7f5798001a60, msg=0x7f5790104ff0) at dispatch.c:394
#12 0x000000000041f0bf in qemudWorker (data=0xbacbf0) at libvirtd.c:1619
#13 0x00007f57a3e01efc in start_thread (arg=0x7f579dafe700) at pthread_create.c:304
#14 0x00007f57a3b3c89d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f57a5c23800 (LWP 15495)):
#0 0x00007f57a3e031f8 in pthread_join (threadid=140014326396672, thread_return=0x0) at pthread_join.c:89
#1 0x000000000041bdb3 in main (argc=<optimized out>, argv=<optimized out>) at libvirtd.c:3418

If I kill the first (single-threaded) process, libvirt starts responding again.

This looks similar to this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=757382

As libvirt seems to have threading issues, perhaps we should scope every libvirt call in a global lock?

Revision history for this message
Thierry Carrez (ttx) wrote:

Looks like a libvirt issue -- keeping the task open since it may affect minimum dependency versions.

Changed in nova:
status: New → Incomplete
Thierry Carrez (ttx) wrote:

@Dan: since you are the author of the upstream patch, do you think it applies to us? Which version of libvirt introduces the fix, if any?

Daniel Berrange (berrange) wrote:

> As libvirt seems to have threading issues, perhaps we should scope every libvirt call in a global lock?

This won't solve the problem. The flaw seen here is a deadlock between one thread in libvirtd and a child process it forks during VM startup, so it can occur even if you serialize all libvirt API calls.
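The fork-vs-mutex hazard described here can be shown in miniature: a lock held by any thread at fork() time stays locked forever in the child, because the owning thread is not duplicated. A small self-contained sketch (plain threading and os, no libvirt involved):

```python
import os
import threading
import time

lock = threading.Lock()
held = threading.Event()

def holder():
    # This thread holds the lock across the fork, playing the role of
    # the libvirtd thread holding glibc's timezone mutex in the traces.
    with lock:
        held.set()
        time.sleep(1.0)

t = threading.Thread(target=holder)
t.start()
held.wait()  # make sure the lock is held before we fork

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: memory (including the locked lock) was copied, but the
    # thread that would release it was not. A blocking acquire() here
    # would hang forever -- the exact libvirtd symptom.
    os.write(w, b"free" if lock.acquire(blocking=False) else b"stuck")
    os._exit(0)

os.close(w)
child_saw = os.read(r, 16).decode()
os.waitpid(pid, 0)
t.join()
print("lock state seen by forked child:", child_saw)  # "stuck"
```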

> @Dan: since you are the author of the upstream patch, do you think it applies to us? Which version of libvirt introduces the fix, if any?

The fixes are present in libvirt >= 0.9.8.

The fixes can be applied to older libvirt releases (e.g. 0.9.6/0.9.7) without too much trouble. For reference, the fixes that the Ubuntu libvirt maintainers would need to backport are:

commit 3ec128989606278635a7c5dfbeee959692d12e15
Author: Daniel P. Berrange <email address hidden>
Date: Tue Nov 29 12:11:01 2011 +0000

    Add internal APIs for dealing with time

commit 32d3ec7466a554f7a4a3e9d4017a24aa540ecf18
Author: Daniel P. Berrange <email address hidden>
Date: Tue Nov 29 12:32:31 2011 +0000

    Make logging async signal safe wrt time stamp generation

commit a8bb75a3e65f0ae866f3b3fd60c57b2aa2050017
Author: Daniel P. Berrange <email address hidden>
Date: Tue Nov 29 12:33:23 2011 +0000

    Remove time APIs from src/util/util.h
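In rough outline, the idea behind these fixes: do any work that may take a libc lock (such as the localtime() call inside timestamp formatting) before fork(), so that between fork and exec the child only issues async-signal-safe calls like write(). A hypothetical sketch of that pattern:

```python
import os
import time

# Format the timestamp in the parent, before fork(): strftime() can
# take glibc's timezone lock, which must not be touched in the child.
payload = (time.strftime("%Y-%m-%d %H:%M:%S") + " about to exec qemu\n").encode()

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Between fork and exec, only async-signal-safe calls: write()
    # qualifies; printf(), strftime() and malloc() do not.
    os.write(w, payload)
    os._exit(0)

os.close(w)
logged = os.read(r, 256).decode()
os.waitpid(pid, 0)
```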

Thierry Carrez (ttx) wrote:

Not a Nova bug; 0.9.8 has been the minimum version since Essex.

Changed in nova:
status: Incomplete → Invalid
Serge Hallyn (serge-hallyn) wrote:

Can the bug submitter confirm which Ubuntu release and which libvirt version he is running?

Serge Hallyn (serge-hallyn) wrote:

Setting to Incomplete while awaiting feedback.

Changed in libvirt (Ubuntu):
importance: Undecided → High
status: New → Incomplete
Launchpad Janitor (janitor) wrote:

[Expired for libvirt (Ubuntu) because there has been no activity for 60 days.]

Changed in libvirt (Ubuntu):
status: Incomplete → Expired