nova-compute doesn't reconnect to libvirtd

Bug #1411278 reported by Peter Sabaini
This bug affects 5 people
Affects                                Status     Importance  Assigned to  Milestone
nova (Ubuntu)                          Confirmed  Medium      Unassigned
nova-compute (Juju Charms Collection)  Invalid    Undecided   Unassigned

Bug Description

I found nova-compute disabled on a production system, with this in the nova-compute log:

2015-01-09 20:14:53.906 26500 WARNING nova.virt.libvirt.driver [-] Connection to libvirt lost: 1

And this in libvirt.log:

2015-01-09 20:14:53.647+0000: 6646: error : netcfStateCleanup:109 : internal error: Attempt to close netcf state driver with open connections

However, libvirtd seems to be operating normally now. After restarting nova-compute, it connected to libvirtd successfully.

Shouldn't nova-compute try to reconnect automatically?
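
A minimal sketch of what automatic reconnection could look like, using libvirt-python's close callback; this is an illustration under assumptions (libvirt-python >= 1.0, a local qemu:///system URI, invented helper names), not nova's actual code. For reference, close reason 1 in the log above should correspond to VIR_CONNECT_CLOSE_REASON_EOF:

import threading
import time
import libvirt

# The close callback is delivered through libvirt's client event loop,
# so register and run one before opening the connection.
libvirt.virEventRegisterDefaultImpl()

def _run_event_loop():
    while True:
        libvirt.virEventRunDefaultImpl()

t = threading.Thread(target=_run_event_loop)
t.daemon = True
t.start()

def connect(uri='qemu:///system', retries=10, delay=5):
    # Retry with a fixed delay; give up after `retries` attempts.
    for _ in range(retries):
        try:
            return libvirt.open(uri)
        except libvirt.libvirtError:
            time.sleep(delay)
    raise RuntimeError('giving up: libvirtd unreachable')

def _on_close(conn, reason, state):
    # reason 1 is VIR_CONNECT_CLOSE_REASON_EOF, matching the
    # "Connection to libvirt lost: 1" message above.
    state['conn'] = connect()
    state['conn'].registerCloseCallback(_on_close, state)

state = {'conn': connect()}
state['conn'].registerCloseCallback(_on_close, state)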

Revision history for this message
Ryan Beisner (1chb1n) wrote :

On Trusty with the 3.16 LTS-U kernel and Icehouse 2014.1.3, after some time (< 1 day), instances become unreachable and cannot be deleted, stopped, or started. Restarting libvirt-bin and nova-compute on the affected compute node appears to restore the ability to perform operations on the instances. I have 6 identical compute nodes doing the same thing, fully updated as of this date, 2015 Apr 1. [Nope, not an April Fools bug.]

# Tried to nova delete an instance, here is the
# instance's nova show output, while libvirt on the
# compute node is logging "netcfStateCleanup" errors:
http://paste.ubuntu.com/10718343/

# logs at/near the time of the crime
libvirt: http://paste.ubuntu.com/10718361/
nova compute: http://paste.ubuntu.com/10718443/
syslog: http://paste.ubuntu.com/10718490/

# workaround-ish steps
sudo service libvirt-bin stop
sudo service nova-compute stop
sudo service libvirt-bin start
sudo service nova-compute start

I am then able to delete, stop, and start instances on the affected compute node for a while, but the issue reappears within hours, and even more quickly if I create and destroy ~20 new instances. No crash dumps around.
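
As a stopgap, a hypothetical liveness probe (not part of nova or the charm; libvirtd_healthy is an invented name) could detect the wedged libvirtd by issuing a trivial RPC, so the restart workaround above can be applied before instance operations start failing:

import libvirt

def libvirtd_healthy(uri='qemu:///system'):
    # Return True if libvirtd answers a cheap round-trip request.
    try:
        conn = libvirt.open(uri)
        conn.getVersion()
        conn.close()
        return True
    except libvirt.libvirtError:
        return False

if not libvirtd_healthy():
    print('libvirtd not responding; apply the restart workaround above')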

# version info
ubuntu@fat-machine:~$ dpkg-query --show *libvirt* *nova*
libvirt-bin 1.2.2-0ubuntu13.1.9
libvirt0 1.2.2-0ubuntu13.1.9
nova-common 1:2014.1.3-0ubuntu2
nova-compute 1:2014.1.3-0ubuntu2
nova-compute-hypervisor
nova-compute-kvm 1:2014.1.3-0ubuntu2
nova-compute-libvirt 1:2014.1.3-0ubuntu2
python-libvirt 1.2.2-0ubuntu2
python-nova 1:2014.1.3-0ubuntu2
python-novaclient 1:2.17.0-0ubuntu1
python2.7-nova

ubuntu@fat-machine:~$ uname -a
Linux fat-machine 3.16.0-33-generic #44~14.04.1-Ubuntu SMP Fri Mar 13 10:33:29 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

# resources
ubuntu@fat-machine:~$ free -m
             total       used       free     shared    buffers     cached
Mem:         48289       4545      43743          1        232        899
-/+ buffers/cache:       3413      44875
Swap:         8191          0       8191

ubuntu@fat-machine:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       459G   16G  420G   4% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
udev             24G  4.0K   24G   1% /dev
tmpfs           4.8G  724K  4.8G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none             24G   72K   24G   1% /run/shm
none            100M     0  100M   0% /run/user
/dev/sdb1       465G  847M  464G   1% /var/lib/ceph/osd/ceph-2

ubuntu@fat-machine:~$ uptime
 13:55:57 up 10:25, 1 user, load average: 0.04, 0.04, 0.05

Revision history for this message
rbahumi (rbahumi) wrote :

Hi,

Is there any fix for this issue that doesn't require restarting nova-compute?

Revision history for this message
James Page (james-page) wrote :

OK, so I'm pretty sure that newer nova-compute versions do reconnect to libvirt if the connection is lost (I just checked on a Newton cloud: restarting libvirt did not cause any side effects in nova, and the compute unit is still reporting as active and is functional).

This is possibly an issue that impacted older nova versions; either way, this is not a charm problem, so I am adding a task for nova in Ubuntu (Icehouse is no longer supported upstream).

Changed in nova-compute (Juju Charms Collection):
status: New → Invalid
Changed in nova (Ubuntu):
importance: Undecided → Medium
tags: added: icehouse
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed