nova-compute stops processing compute.$HOSTNAME occasionally on libvirt

Bug #985489 reported by Nick Moffitt
This bug affects 10 people
Affects                    Status     Importance  Assigned to  Milestone
OpenStack Compute (nova)   Invalid    Medium      Unassigned
nova (Ubuntu)              Confirmed  Medium      Unassigned

Bug Description

 root@novamanager:~# /usr/sbin/rabbitmqctl list_queues | awk '$1 ~ /^compute/ && $2 != 0 { print }'
 compute.nodexyzzy 12

Occasionally on canonistack, we find that a compute node simply stops processing its rabbit queues. A check of the logs will show no nova-compute.log activity for hours, but a restart of nova-compute will cause it to check all the instances and then process all the requests in rabbit (usually lots of duplicates from frustrated users trying to re-send delete requests and get their quota back for another deployment).
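
For reference, the restart in question is just the usual service bounce on these Ubuntu nodes (assuming the packaged service name):

 $ sudo service nova-compute restart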

In fact, while I was typing this (having restarted nova-compute on nodexyzzy before starting), I re-ran the above command to find it now silent.

This happens often enough (at least once every couple of days), but we're not sure how to debug it. Is there any information we can get you about a nova-compute process that is in this unhappy state?

For the record, here is the last entry in the example node's nova-compute.log when I bounced things around 09:00Z:

 2012-04-19 06:35:35 DEBUG nova.virt.libvirt.connection [-] Updating host stats from (pid=3428) update_status /usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py:2467

Revision history for this message
Andrew Glen-Young (aglenyoung) wrote :

The symptoms are similar to what we experienced in LP#903212; however, I can confirm that libvirtd seems to be responding correctly on Precise.

Is there further information that we can provide?

$ dpkg-query --show nova-*
nova-api 2012.1~e4~20120210.12574-0ubuntu1
nova-common 2012.1-0ubuntu2
nova-compute 2012.1-0ubuntu2
nova-compute-hypervisor
nova-compute-kvm 2012.1-0ubuntu2

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu precise (development branch)"

Revision history for this message
Andrew Glen-Young (aglenyoung) wrote :

This has happened again.

Process listing:

$ ps auxwwwf | grep [n]ova-compute

nova 25735 0.0 0.0 48040 4 ? Ss Apr16 0:00 su -s /bin/sh -c exec nova-compute --flagfile=/etc/nova/nova.conf --flagfile=/etc/nova/nova-compute.conf nova
nova 25746 1.2 0.1 1725088 32604 ? Sl Apr16 53:22 \_ /usr/bin/python /usr/bin/nova-compute --flagfile=/etc/nova/nova.conf --flagfile=/etc/nova/nova-compute.conf

Strace'ing the parent process:

$ sudo strace -p 25735
Process 25735 attached - interrupt to quit
wait4(-1,
^C

Strace'ing the child process:

$ sudo strace -p 25746
Process 25746 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>
^C

Checking libvirtd:

$ time sudo virsh list | wc -l
33

real 0m0.170s
user 0m0.020s
sys 0m0.012s

Last few lines from /var/log/nova/nova-compute.log:

2012-04-19 07:04:28 DEBUG nova.manager [-] Running periodic task ComputeManager._poll_bandwidth_usage from (pid=25746) periodic_tasks /usr/lib/python2.7/dist-packages/nova/manager.py:152
2012-04-19 07:04:28 INFO nova.compute.manager [-] Updating bandwidth usage cache
2012-04-19 07:04:28 DEBUG nova.manager [-] Running periodic task ComputeManager.update_available_resource from (pid=25746) periodic_tasks /usr/lib/python2.7/dist-packages/nova/manager.py:152
--- restart happens now ---
2012-04-19 13:54:25 DEBUG nova.service [-] Full set of FLAGS: from (pid=1012) wait /usr/lib/python2.7/dist-packages/nova/service.py:402
2012-04-19 13:54:25 DEBUG nova.service [-] default_floating_pool : nova from (pid=1012) wait /usr/lib/python2.7/dist-packages/nova/service.py:411

Conclusion: libvirtd is responding, but nova-compute is not.

James Page (james-page)
Changed in nova (Ubuntu):
importance: Undecided → Medium
Revision history for this message
Russell Bryant (russellb) wrote :

This is certainly concerning. It would be nice to figure out where nova-compute is stuck. Could you try to attach to the process with gdb and get a Python stack trace?

http://wiki.python.org/moin/DebuggingWithGdb
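
Something along these lines should do it (assuming the python2.7-dbg package is installed so gdb has the Python helpers, and using the nova-compute PID from the ps listing above, 25746 in that example):

 $ sudo apt-get install gdb python2.7-dbg
 $ sudo gdb /usr/bin/python 25746
 (gdb) thread apply all bt    # C-level backtrace for every thread
 (gdb) py-bt                  # Python-level backtrace, if the gdb Python helpers loaded

Even the plain C-level backtraces should show roughly where the process is blocked.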

Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
Revision history for this message
Andrew Glen-Young (aglenyoung) wrote :

Fortunately (or not) this has just recently occurred. We do not have debug symbols installed.

It looks to me as though it is stuck in virDomainGetInfo(). Interestingly, libvirtd seems to be responding when I query it. Perhaps there is a missing timeout somewhere?

Backtrace from python:

(gdb) bt
#0 0x00007f57e2424b03 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f57df7e757c in ?? () from /usr/lib/libvirt.so.0
#2 0x00007f57df7e81d1 in ?? () from /usr/lib/libvirt.so.0
#3 0x00007f57df7e9400 in ?? () from /usr/lib/libvirt.so.0
#4 0x00007f57df7e9af7 in ?? () from /usr/lib/libvirt.so.0
#5 0x00007f57df7cdaf0 in ?? () from /usr/lib/libvirt.so.0
#6 0x00007f57df7cdc44 in ?? () from /usr/lib/libvirt.so.0
#7 0x00007f57df7d47f3 in ?? () from /usr/lib/libvirt.so.0
#8 0x00007f57df7a4b7a in virDomainGetInfo () from /usr/lib/libvirt.so.0
#9 0x00007f57dfabb355 in ?? () from /usr/lib/python2.7/dist-packages/libvirtmod.so
#10 0x0000000000566df4 in ?? ()
#11 0x00007fff7952f7c0 in ?? ()
#12 0x00007fff7952f6f0 in ?? ()
#13 0x0000000002daa3c0 in ?? ()
#14 0x00000000021610a0 in ?? ()
#15 0x0000000004f342d8 in ?? ()
#16 0x7f8579e6e212c67e in ?? ()
#17 0x00007f57e3a29e10 in ?? ()
#18 0x00000000039c1ab8 in ?? ()
#19 0x00000000021610a0 in ?? ()
#20 0x0000000002da89f0 in ?? ()
#21 0x0000000000000000 in ?? ()

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Andrew,

I'd like to try to reproduce this on a local cluster. Can you give me as much information as possible about what may have happened when this was triggered?

Is this on a cluster installed on Precise? I'll try with two nodes (one separate compute node).

Revision history for this message
Andrew Glen-Young (aglenyoung) wrote :

@Serge,

As mentioned in #1, the cluster is running Precise + Essex. Unfortunately I cannot reproduce the problem consistently; however, it has been recurring approximately once per month.

As mentioned in #5, I think that nova-compute is attempting to query libvirtd while libvirtd is not responding. Since the query doesn't seem to time out, nova-compute "hangs" waiting for a response that will never come.

I can only query the system after the problem occurs, and I only have the current state and log files from which to try to infer the problem.

Revision history for this message
Rafi Khardalian (rkhardalian) wrote :

We ran into this issue as well, and as you have correctly stated, it's a result of libvirtd becoming unresponsive. Thus far we have not been able to resolve the issue at its core, but I can offer you a workaround.

It's a simple script which calls virsh help (which will block indefinitely when the libvirtd process is unresponsive), waits 30 seconds for a response, and restarts libvirtd after the timeout has elapsed. Run it out of cron every 5 minutes, or whatever interval you're comfortable with.

https://github.com/metacloud/openstack-tools/blob/master/libvirt/check_fix_libvirtd
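
The core of it is roughly this (a minimal sketch of the same idea, not the linked script itself; it assumes coreutils timeout and Ubuntu's libvirt-bin service name, and probes with virsh list, which has to talk to libvirtd as shown earlier in this bug):

 #!/bin/sh
 # If libvirtd cannot answer a simple query within 30 seconds,
 # assume it is wedged and restart it.
 if ! timeout 30 virsh list > /dev/null 2>&1; then
     service libvirt-bin restart
 fi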

By no means is this a fix, but it will mitigate the issue for you operationally for now. In every situation where the script has triggered, nova-compute recovers as you would expect, without needing a restart.

Hope that helps.

Revision history for this message
Andrew Glen-Young (aglenyoung) wrote :

@Rafi,

Thanks for the suggestion. What you describe used to happen to me when we were using Oneiric's libvirtd (LP#903212). Since upgrading to Precise I haven't experienced the exact problem you're having.

My issue is that libvirtd seems to temporarily stop responding, enough to block nova-compute, but it then responds to external commands later when I probe it.

Looking through my monitoring logs, I can see that libvirtd occasionally stops responding, but not for long enough to generate an alert. I speculate that this is a timing issue: libvirtd stops responding for a very limited period, and during that window nova-compute happens to poll it.

summary: - nova-compute stops processing compute.$HOSTNAME occasionally
+ nova-compute stops processing compute.$HOSTNAME occasionally on libvirt
Revision history for this message
wangpan (hzwangpan) wrote :

I see this issue on my stable Folsom deployment, but unfortunately it occurs randomly. My libvirt version is 0.9.12-5 (Debian wheezy).
Log from nova-compute.log:
2013-02-26 02:48:45 DEBUG nova.manager [-] Running periodic task ComputeManager.update_available_resource from (pid=35879) periodic_tasks /usr/local/lib/python2.7/dist-packages/nova/manager.py:200
2013-02-26 02:48:45 DEBUG nova.utils [-] Got semaphore "compute_resources" for method "update_available_resource"... from (pid=35879) inner /usr/local/lib/python2.7/dist-packages/nova/utils.py:803
==============and then the periodic task stopped until============
2013-02-26 09:37:48 ERROR nova.manager [-] Error during ComputeManager.update_available_resource: Domain not found: no domain with matching id 1519
......
2013-02-26 09:37:48 DEBUG nova.manager [-] Running periodic task ComputeManager._report_driver_status from (pid=35879) periodic_tasks /usr/local/lib/python2.7/dist-packages/nova/manager.py:200
2013-02-26 09:37:48 INFO nova.compute.manager [-] Updating host status
2013-02-26 09:37:48 DEBUG nova.virt.libvirt.driver [-] Updating host stats from (pid=35879) update_status /usr/local/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py:3602
==============and then the periodic task stopped again until I restart nova-compute service============

The second time this issue occurred, I restarted libvirtd without restarting nova-compute, and nova-compute continued running after the libvirtd restart! So I believe this issue is caused by libvirtd.
We may need to add a timeout when nova-compute calls the libvirt APIs.

Changed in nova:
assignee: nobody → Michael H Wilson (geekinutah)
Changed in nova:
assignee: Michael H Wilson (geekinutah) → wangpan (hzwangpan)
status: Confirmed → In Progress
wangpan (hzwangpan)
Changed in nova:
assignee: wangpan (hzwangpan) → nobody
Revision history for this message
wangpan (hzwangpan) wrote :

Sorry, Michael H Wilson: I committed a patch related to this bug, so OpenStack Hudson automatically assigned it to me, but in the end the patch couldn't resolve this bug, and I can't reassign it to you now, so you may need to take it again.

Changed in nova:
status: In Progress → Confirmed
Changed in nova:
assignee: nobody → wangpan (hzwangpan)
status: Confirmed → In Progress
wangpan (hzwangpan)
Changed in nova:
assignee: wangpan (hzwangpan) → nobody
status: In Progress → Confirmed
Revision history for this message
You, Ji (jiyou09) wrote :

Description of this problem

I have run into this problem before. This bug is mainly caused by libvirt and the OpenStack compute driver for libvirt.

When you hit this problem, try this command:

$ virsh list --all

If this command gets stuck, you are probably facing this problem. Another interesting symptom is that if you run

$ ps aux | grep libvirtd

you may find two libvirtd processes running, such as:

...... /usr/sbin/libvirtd -d
...... /usr/sbin/libvirtd -d

If you kill one of the processes, libvirt will be OK again, but some time later the bug will come back.

Reasons for this problem

Why does this bug come out?
1. It is mainly caused by libvirt. Older versions of libvirt have multi-process/multi-thread locking bugs. Sometimes, when multiple processes access the libvirt/libvirtd service through the libvirt API, this error occurs; in particular, accessing libvirt from different processes or threads can trigger it, and then all accesses to libvirt get stuck.

2. The libvirt driver in OpenStack accesses the libvirt/libvirtd service very frequently, especially when you have a lot of VMs running on libvirt.

Solutions for this bug:

1. You have to update your libvirt/libvirtd packages; my advice is that your libvirt version should be > 0.9.13.
2. We may have some ways to lower the frequency of these accesses.

I think I can fix this bug. ^_^ I have fixed it in my OpenStack environment.

Revision history for this message
wangpan (hzwangpan) wrote :

Hi Ji You,
two things:
1. Can you give me more information about the multi-process/multi-thread locking bugs in libvirt, such as a bug report URL or something similar?
2. In our environment, the 'virsh' command runs fine and there is only one libvirtd process when this bug occurs; this may be a difference from your case.

Revision history for this message
You, Ji (jiyou09) wrote :

@wangpan (hzwangpan)

two things.

1. I've forgotten the URL for the libvirt multi-thread/process locking bug (I ran into this problem on Essex, more than a year ago). My solution was to compile libvirt from source, using a 1.XXX version.

2. Yours does seem different from mine. I'll try my best to submit a patch, and I hope it can solve your problem. The patch may be a little long; I fixed this on Essex, so I have to port it to Grizzly.

Revision history for this message
wangpan (hzwangpan) wrote :

Hi all, after we updated our qemu to "qemu-kvm 1.1.2+dfsg-6" under Debian wheezy 7.0 (Linux kernel 3.2.39), this problem has not reappeared for more than three months. Others who encounter this issue can try the same approach.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

We just saw this with havana on precise.

ii libvirt0 1.1.1-0ubuntu8.5~cloud0 library for interfacing with different virtualization systems
ii nova-compute-kvm 1:2013.2.2-0ubuntu1~cloud0 OpenStack Compute - compute node (KVM)

As you can see we're using the Ubuntu Cloud Archive for these havana packages.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

In addition, this seemed to hit three compute nodes (the full set in a test deployment) in relatively quick succession after a few days of normal operation.

Revision history for this message
Michael H Wilson (geekinutah) wrote :

@wangpan

I'm revisiting this bug. Do you have a link to the change that you proposed? Even if you didn't think it was adequate, I'd like to see your approach.

-Mike

Revision history for this message
Sean Dague (sdague) wrote :

Moving to Incomplete; I feel like we're waiting on the patch from @wangpan, and it's not clear this bug is moving forward without it.

tags: removed: canonistack
Changed in nova:
status: Confirmed → Incomplete
importance: High → Medium
tags: added: compute messaging
Joe Gordon (jogo)
Changed in nova:
status: Incomplete → Invalid