libvirt-bin sometimes hangs

Bug #931540 reported by David Kranz
This bug report is a duplicate of: Bug #903212: libvirtd stops responding in oneiric.
This bug affects 2 people
Affects                   Status  Importance  Assigned to  Milestone
OpenStack Compute (nova)  New     Undecided   Unassigned

Bug Description

I have a diablo-stable cluster using KVM that has been running for a long time with multiple users. It works well, but I just saw, for the second time, libvirt-bin hang on a compute node when trying to reboot a VM. There is a traceback in the compute log, which I assume is a failed attempt to connect to libvirt. The VM gets stuck in the REBOOT state. When I restart libvirt-bin, the VM continues to ACTIVE and all seems good. Here are nova.conf and a log excerpt:

--flagfile=/etc/nova/nova-compute.conf
--use_deprecated_auth
--dhcpbridge_flagfile=/etc/nova/nova.conf
--dhcpbridge=/usr/bin/nova-dhcpbridge
--sql_connection=mysql://nova:notnova@172.18.0.131/nova
--s3_host=172.18.0.131
--rabbit_host=172.18.0.131
--glance_api_servers=172.18.0.131:9292
--logdir=/var/log/nova
--state_path=/var/lib/nova
--lock_path=/var/lock/nova
--verbose
--ec2_url=http://172.18.0.131:8773/services/Cloud
--fixed_range=10.0.0.0/24
--network_size=256
--image_service=nova.image.glance.GlanceImageService
--bridge_interface=eth1
--flat_network_bridge=br100
--flat_interface=eth1
--network_manager=nova.network.manager.FlatDHCPManager
--force_dhcp_release
--public_interface=eth0
--multi_host=1
--osapi_host=172.18.0.131
--quota_instances=1000000
--quota_ram=1000000
--quota_cores=1000000
--iscsi_ip_prefix=172.18.0

2012-02-13 11:59:34,081 INFO nova.compute.manager [97342dd6-0194-4a7f-b67d-c99b60c5b531 tester testproject] check_instance_lock: admin: |True|
2012-02-13 11:59:34,081 INFO nova.compute.manager [97342dd6-0194-4a7f-b67d-c99b60c5b531 tester testproject] check_instance_lock: executing: |<function reboot_instance at 0x2c6ba28>|
2012-02-13 11:59:34,081 AUDIT nova.compute.manager [97342dd6-0194-4a7f-b67d-c99b60c5b531 tester testproject] Rebooting instance 174
2012-02-13 11:59:34,127 DEBUG nova.compute.manager [97342dd6-0194-4a7f-b67d-c99b60c5b531 tester testproject] Checking state of instance-000000ae from (pid=1124) _get_power_state /usr/lib/python2.7/dist-packages/nova/compute/manager.py:189
2012-02-13 11:59:34,855 DEBUG nova.rpc [97342dd6-0194-4a7f-b67d-c99b60c5b531 tester testproject] Making asynchronous call on network ... from (pid=1124) multicall /usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py:730
2012-02-13 11:59:34,855 DEBUG nova.rpc [97342dd6-0194-4a7f-b67d-c99b60c5b531 tester testproject] MSG_ID is b85e173a817f492e9eeb2313dfb076ce from (pid=1124) multicall /usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py:733
2012-02-13 11:59:37,602 DEBUG nova.utils [97342dd6-0194-4a7f-b67d-c99b60c5b531 tester testproject] Attempting to grab semaphore "iptables" for method "apply"... from (pid=1124) inner /usr/lib/python2.7/dist-packages/nova/utils.py:674
2012-02-13 11:59:37,602 DEBUG nova.utils [97342dd6-0194-4a7f-b67d-c99b60c5b531 tester testproject] Attempting to grab file lock "iptables" for method "apply"... from (pid=1124) inner /usr/lib/python2.7/dist-packages/nova/utils.py:679
2012-02-13 11:59:37,603 DEBUG nova.utils [97342dd6-0194-4a7f-b67d-c99b60c5b531 tester testproject] Running cmd (subprocess): sudo iptables-save -t filter from (pid=1124) execute /usr/lib/python2.7/dist-packages/nova/utils.py:167
2012-02-13 11:59:38,459 INFO nova.virt.libvirt_conn [-] Instance instance-000000ae destroyed successfully.
2012-02-13 11:59:38,462 DEBUG nova.utils [97342dd6-0194-4a7f-b67d-c99b60c5b531 tester testproject] Running cmd (subprocess): sudo iptables-restore from (pid=1124) execute /usr/lib/python2.7/dist-packages/nova/utils.py:167
2012-02-13 11:59:39,331 ERROR nova.rpc [5524d009-f0f5-483b-a665-d296dc9fd653 tester testproject] Exception during message handling
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 620, in _process_data
(nova.rpc): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 100, in wrapped
(nova.rpc): TRACE: return f(*args, **kw)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 118, in decorated_function
(nova.rpc): TRACE: function(self, context, instance_id, *args, **kwargs)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 645, in reboot_instance
(nova.rpc): TRACE: self.driver.reboot(instance_ref, network_info)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 100, in wrapped
(nova.rpc): TRACE: return f(*args, **kw)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 501, in reboot
(nova.rpc): TRACE: virt_dom = self._conn.lookupByName(instance['name'])
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1870, in lookupByName
(nova.rpc): TRACE: if ret is None:raise libvirtError('virDomainLookupByName() failed', conn=self)
(nova.rpc): TRACE: libvirtError: Domain not found: no domain with matching name 'instance-000000ae'
(nova.rpc): TRACE:

Alvaro Lopez (aloga) wrote :

Have you tried to check if libvirt is working properly for the nova user?

What is the output of "virsh uri" if you execute it as the nova user? And the output of "virsh list"?
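One way to run those checks without tying up the terminal if libvirtd is unresponsive is to put a deadline on each command. This is only a sketch: run_with_deadline is a hypothetical helper name, it assumes GNU coreutils timeout is installed, and the 10-second limit in the usage line is an arbitrary choice.

```shell
# run_with_deadline SECONDS CMD [ARGS...]
# Runs CMD but gives up after SECONDS, so a hung libvirtd cannot block the shell.
# Assumes GNU coreutils `timeout`, which exits with status 124 on timeout.
run_with_deadline() {
    secs="$1"
    shift
    rc=0
    timeout "$secs" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "TIMED OUT after ${secs}s: $*"
    elif [ "$rc" -ne 0 ]; then
        echo "FAILED (exit $rc): $*"
    fi
    return "$rc"
}
```

For example: run_with_deadline 10 sudo -u nova virsh uri, followed by run_with_deadline 10 sudo -u nova virsh list. A timeout points at a hang, while an ordinary failure points at a permissions or connection-URI problem for the nova user.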

David Kranz (david-kranz) wrote :

I will try what you suggest next time I see it happen. Running virsh list hung when run as root.

Daniel Berrange (berrange) wrote :

If libvirtd itself appears to hang, gather a stack trace of all threads. Make sure the debug symbols are available, then connect with GDB and run 'thread apply all bt' to capture a stack trace across all threads.
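The capture can be done non-interactively, which is handy when the hang happens on a production node and the daemon needs restarting quickly. The following is a rough sketch rather than a vetted procedure: dump_threads is a hypothetical helper name, the output location under /tmp is an arbitrary choice, and it assumes gdb and pidof are installed and that the function is run as root.

```shell
# dump_threads PROCESS_NAME
# Attaches gdb to the named process in batch mode and saves a backtrace of
# every thread to a timestamped file, whose path is printed on success.
# Must run as root to attach; assumes gdb and pidof are installed.
dump_threads() {
    name="$1"
    pid="$(pidof -s "$name")" || {
        echo "no running process named $name" >&2
        return 1
    }
    out="/tmp/${name}-threads-$(date +%Y%m%d-%H%M%S).txt"
    gdb -p "$pid" -batch -ex 'thread apply all bt' > "$out" 2>&1 || return 1
    echo "$out"
}
```

Calling dump_threads libvirtd before restarting the service leaves a file that can be attached to the bug. Without debug symbols the backtraces will mostly show raw addresses, so install the symbols first.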

Razique Mahroua (razique) wrote :

Hello David, I've encountered that issue once. Nova removes inactive instances (e.g. deletes a domain).
Does the file /var/lib/nova/instances/instance-000000ae/libvirt.xml exist?
If so, running
# virsh define /var/lib/nova/instances/instance-000000ae/libvirt.xml
on the compute node, then a nova reboot $instance from the controller, should fix it.
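The file check and the define step can be combined so that nothing is attempted when the XML is missing. This is a sketch only: redefine_instance and the NOVA_INSTANCES_DIR override are hypothetical names introduced here, and the default directory is simply the path quoted above.

```shell
# redefine_instance INSTANCE_NAME
# Re-registers a domain with libvirt from the libvirt.xml that nova keeps in
# the instance directory. NOVA_INSTANCES_DIR is a hypothetical override for
# this sketch; the default matches the path quoted above.
redefine_instance() {
    name="$1"
    xml="${NOVA_INSTANCES_DIR:-/var/lib/nova/instances}/$name/libvirt.xml"
    if [ ! -f "$xml" ]; then
        echo "no libvirt.xml for $name at $xml" >&2
        return 1
    fi
    virsh define "$xml"
}
```

On the compute node this would be called as redefine_instance instance-000000ae, followed by nova reboot $instance from the controller.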

David Kranz (david-kranz) wrote :

In both cases I was ultimately able to recover the VMs, either by restarting libvirt or rebooting the compute node. The problem is that this is a serious bug that has not been reproduced reliably but that a number of deployers have experienced. I will try to debug more if my users report it again but, as I said, it has only happened twice since mid-January. I posted on the openstack list to see if there were any deployers of real long-lived clusters with kvm that have *not* ever seen this happen. I realize this bug ticket does not have enough information to understand the problem.

Valeriy Belavin (vvbelavin) wrote :

The same problem here. Can be fixed by:

sudo service libvirt-bin stop && sudo service libvirt-bin start && sudo service nova-compute restart

Waiting for bug fix.

Some info:

Distributor ID: Ubuntu
Description: Ubuntu 11.10
Release: 11.10
Codename: oneiric

Package: libvirt-bin
Priority: optional
Section: devel
Installed-Size: 3680
Maintainer: Ubuntu Developers <email address hidden>
Original-Maintainer: Debian Libvirt Maintainers <email address hidden>
Architecture: amd64
Source: libvirt
Version: 0.9.2-4ubuntu15

Package: nova-compute
Priority: extra
Section: net
Installed-Size: 96
Maintainer: Ubuntu Developers <email address hidden>
Original-Maintainer: Openstack Maintainers <email address hidden>
Architecture: all
Source: nova
Version: 2011.3-0ubuntu6

Can give more info. Logs, configs, ...

livemoon (mwjpiero) wrote :

Yes, I regularly find libvirt-bin hung, maybe once an hour. I wrote a cron job that monitors the nova-compute service and, if nova-compute has hung because libvirt-bin hung, restarts libvirt-bin and then nova-compute.
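A watchdog along those lines might look like the sketch below. It is not the script described above. As a simplifying assumption it probes libvirt directly with virsh list instead of watching nova-compute, and it treats any probe failure or timeout as a hang. PROBE, DEADLINE and the function names are hypothetical, and GNU coreutils timeout is assumed.

```shell
# Watchdog sketch: if a simple libvirt query does not finish in time, restart
# libvirt-bin and then nova-compute. Run as root.
# PROBE and DEADLINE are hypothetical knobs for this sketch, not nova or
# libvirt settings.
PROBE="${PROBE:-virsh list}"     # command used to test libvirt
DEADLINE="${DEADLINE:-30}"       # seconds to wait before declaring a hang

# Succeeds only if the probe finishes successfully within the deadline.
libvirt_responsive() {
    # $PROBE is intentionally unquoted so "virsh list" splits into words.
    timeout "$DEADLINE" $PROBE >/dev/null 2>&1
}

# Restart sequence quoted earlier in this report.
restart_stack() {
    service libvirt-bin stop && service libvirt-bin start && service nova-compute restart
}

# One watchdog pass: do nothing when libvirt answers, otherwise restart.
watchdog() {
    if libvirt_responsive; then
        return 0
    fi
    echo "libvirt probe failed or timed out after ${DEADLINE}s; restarting" >&2
    restart_stack
}
```

To use it from cron, save the functions in a script that ends with a watchdog call and schedule it at whatever interval suits the cluster. Restarting on every failed probe is blunt and hides the underlying bug, so it is a stopgap, not a fix.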
