legacy-tempest-dsvm-cells constantly failing on stable pike and ocata due to libvirt connection reset

Bug #1745838 reported by Matt Riedemann
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Won't Fix
Importance: High
Assigned to: Unassigned

Bug Description

The cellsv1 job has been failing fairly consistently over the last week or two due to a libvirt connection reset:

http://logs.openstack.org/36/536936/1/check/legacy-tempest-dsvm-cells/a9ff792/logs/libvirt/libvirtd.txt.gz#_2018-01-28_01_25_23_762

2018-01-28 01:25:23.762+0000: 3896: error : virKeepAliveTimerInternal:143 : internal error: connection closed due to keepalive timeout
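
For context on that keepalive message: libvirtd periodically pings its clients and drops the connection when the pings go unanswered, which is what virKeepAliveTimerInternal is recording above; the client (nova-compute) then sees the dropped socket as the "Connection reset by peer" error traced below. As I understand it, answering those pings requires the client to keep pumping the libvirt event loop. A minimal standalone sketch of that mechanism, assuming only the libvirt-python bindings (this is not nova's code, though nova/virt/libvirt/host.py runs a similar event loop thread):

import threading

import libvirt

# Keepalive traffic is only processed while the libvirt event loop is being
# pumped; if the loop stalls, libvirtd eventually logs "connection closed
# due to keepalive timeout" and drops the socket.
libvirt.virEventRegisterDefaultImpl()

def _run_event_loop():
    while True:
        libvirt.virEventRunDefaultImpl()

t = threading.Thread(target=_run_event_loop)
t.daemon = True
t.start()

conn = libvirt.open('qemu:///system')
# Client-side mirror of libvirtd's keepalive_interval/keepalive_count
# settings: ping after 5s of silence, give up after 5 unanswered pings
# (the 5/5 values here are just illustrative).
conn.setKeepAlive(5, 5)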

http://logs.openstack.org/36/536936/1/check/legacy-tempest-dsvm-cells/a9ff792/logs/screen-n-cpu.txt.gz?level=TRACE#_2018-01-28_01_25_23_766

2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager [req-392410f9-c834-4bdc-a439-ac20476fe212 - -] Error updating resources for node ubuntu-xenial-inap-mtl01-0002208439.
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager Traceback (most recent call last):
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/manager.py", line 6590, in update_available_resource_for_node
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager rt.update_available_resource(context, nodename)
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/compute/resource_tracker.py", line 535, in update_available_resource
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager resources = self.driver.get_available_resource(nodename)
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5675, in get_available_resource
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager data["vcpus_used"] = self._get_vcpu_used()
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5316, in _get_vcpu_used
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager for guest in self._host.list_guests():
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 573, in list_guests
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager only_running=only_running, only_guests=only_guests)]
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 593, in list_instance_domains
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager alldoms = self.get_connection().listAllDomains(flags)
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager result = proxy_call(self._autowrap, f, *args, **kwargs)
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager rv = execute(f, *args, **kwargs)
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager six.reraise(c, e, tb)
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager rv = meth(*args, **kwargs)
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 4953, in listAllDomains
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager raise libvirtError("virConnectListAllDomains() failed", conn=self)
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager libvirtError: Cannot recv data: Connection reset by peer
2018-01-28 01:25:23.766 16360 ERROR nova.compute.manager
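
The traceback bottoms out in listAllDomains() on a connection that libvirtd has already dropped, which is how the keepalive timeout surfaces to nova as a libvirtError. For illustration only, a standalone sketch of that call path outside nova, with a naive reconnect-once guard (the URI and the retry behaviour are assumptions for the sketch, not a proposed nova change):

import libvirt

def list_domains_with_reconnect(uri='qemu:///system', conn=None):
    # Call listAllDomains(), reopening the connection once if the daemon
    # has dropped it (e.g. after a keepalive timeout).
    conn = conn or libvirt.open(uri)
    try:
        return conn, conn.listAllDomains(0)
    except libvirt.libvirtError:
        # "Cannot recv data: Connection reset by peer" lands here; a fresh
        # connection normally succeeds as long as libvirtd itself is healthy.
        conn = libvirt.open(uri)
        return conn, conn.listAllDomains(0)

conn, domains = list_domains_with_reconnect()
print([dom.name() for dom in domains])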

It seems to be totally random. I'm not sure what is different about this job running on stable vs master, but it doesn't appear to be an issue on master:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22libvirtError%3A%20Cannot%20recv%20data%3A%20Connection%20reset%20by%20peer%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22%20AND%20build_name%3A%5C%22legacy-tempest-dsvm-cells%5C%22&from=7d

Revision history for this message
Matt Riedemann (mriedem) wrote :

Comparing package versions:

master: libvirt 3.6.0 and qemu 2.10 (from the pike UCA)

pike: libvirt 2.5.0 and qemu 2.8

That doesn't really explain why things would have been fine on stable up until a couple of weeks ago.
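
For what it's worth, a quick way to confirm which libvirt and qemu a node is actually talking to, assuming the libvirt-python bindings are available there (purely a diagnostic sketch, not part of the job):

from __future__ import print_function

import libvirt

def decode(version):
    # libvirt packs versions as major * 1000000 + minor * 1000 + release.
    return '%d.%d.%d' % (version // 1000000, (version // 1000) % 1000, version % 1000)

conn = libvirt.open('qemu:///system')
print('libvirt (daemon):', decode(conn.getLibVersion()))   # e.g. 2.5.0 vs 3.6.0
print('hypervisor (qemu):', decode(conn.getVersion()))     # e.g. 2.8 vs 2.10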

Revision history for this message
Matt Riedemann (mriedem) wrote :

FWIW, the majority of these failures are on the rax-ord nodes, but I don't know why that would matter. I checked to make sure we're using the qemu virt_type in nova.conf, and we are. I was wondering whether we're somehow not getting nested virt on these systems, or whether it simply isn't enabled there.
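
A rough check one could run on one of those nodes, assuming the usual kvm_intel/kvm_amd module paths and /etc/nova/nova.conf (with virt_type=qemu the guests are pure emulation and shouldn't need KVM at all, but this would at least confirm the node state):

from __future__ import print_function

import re

def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except IOError:
        return None

print('kvm_intel nested:', read('/sys/module/kvm_intel/parameters/nested'))
print('kvm_amd nested:', read('/sys/module/kvm_amd/parameters/nested'))

# vmx/svm in /proc/cpuinfo tells us whether the (virtual) CPU exposes the
# hardware virt extensions a nested kvm guest would need.
cpuinfo = read('/proc/cpuinfo') or ''
print('vmx/svm flags:', bool(re.search(r'\b(vmx|svm)\b', cpuinfo)))

conf = read('/etc/nova/nova.conf') or ''
match = re.search(r'^\s*virt_type\s*=\s*(\S+)', conf, re.MULTILINE)
print('nova virt_type:', match.group(1) if match else 'not set')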

Revision history for this message
Matt Riedemann (mriedem) wrote :

devstack in stable/pike is using the ocata ubuntu cloud archive, this is the package list:

http://ubuntu-cloud.archive.canonical.com/ubuntu/dists/xenial-updates/ocata/main/binary-amd64/Packages

Revision history for this message
Matt Riedemann (mriedem) wrote :

We can see if the cellsv1 job will pass in stable/pike with the pike UCA here:

https://review.openstack.org/#/c/536798/

Revision history for this message
Matt Riedemann (mriedem) wrote :

This change proposes that we don't run the cellsv1 job on stable branches until the issue is resolved:

https://review.openstack.org/#/c/538619/

Revision history for this message
Matt Riedemann (mriedem) wrote :

BTW, looking at logstash, it appears things started failing around Jan 18, which is around the same time that the python packages were updated for xenial, causing other CI job issues:

http://lists.openstack.org/pipermail/openstack-dev/2018-January/126580.html

The stable branch job is running with:

libpython2.7:amd64 2.7.12-1ubuntu0~16.04.3

eventlet==0.19.0

But we're also running with similar versions on master:

http://logs.openstack.org/10/533210/8/gate/legacy-tempest-dsvm-cells/4582601/logs/

Except that eventlet on master is 0.20.0, but the ML thread says anything under 0.22 is a problem, so that difference alone shouldn't explain it.
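
A trivial way to double-check the interpreter and eventlet versions actually in play on a job node (illustrative only):

from __future__ import print_function

import platform

import eventlet

print('python:', platform.python_version())   # e.g. 2.7.12
print('eventlet:', eventlet.__version__)      # e.g. 0.19.0 on pike, 0.20.0 on master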

Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

We're about to cut RC1 for rocky, ocata/pike are old, and cells v1 is deprecated, so I don't think anyone cares about fixing this for CI.

Changed in nova:
status: Confirmed → Won't Fix