libvirt killed by kernel on general protection or stack segment traps

Bug #1646779 reported by Andrea Frittoli
This bug affects 3 people
Affects                     Status       Importance   Assigned to   Milestone
OpenStack Compute (nova)    Incomplete   Undecided    Unassigned
libvirt                     New          Undecided    Unassigned
tempest                     Invalid      Undecided    Unassigned

Bug Description

A VM fails to spawn with no host available. The nova-cpu logs reveal a problem connecting to libvirt. The following message has 84 hits since Nov 23rd:

message: "libvirtError: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused"

Recent failure: http://logs.openstack.org/66/401366/4/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/3deacc5/logs/screen-n-cpu.txt.gz?level=ERROR

2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host [req-12fbb338-7df0-4654-b686-257245421442 tempest-ImagesOneServerNegativeTestJSON-1400886372 tempest-ImagesOneServerNegativeTestJSON-1400886372] Connection to libvirt failed: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host Traceback (most recent call last):
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 453, in get_connection
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host conn = self._get_connection()
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 436, in _get_connection
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host {'msg': ex})
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host self.force_reraise()
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host six.reraise(self.type_, self.value, self.tb)
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 425, in _get_connection
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host self._wrapped_conn = self._get_new_connection()
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 370, in _get_new_connection
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host wrapped_conn = self._connect(self._uri, self._read_only)
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 226, in _connect
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host libvirt.openAuth, uri, auth, flags)
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host rv = execute(f, *args, **kwargs)
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host six.reraise(c, e, tb)
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host rv = meth(*args, **kwargs)
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 105, in openAuth
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host if ret is None:raise libvirtError('virConnectOpenAuth() failed')
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host libvirtError: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused
2016-12-01 18:16:05.117 6160 ERROR nova.virt.libvirt.host
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [req-12fbb338-7df0-4654-b686-257245421442 tempest-ImagesOneServerNegativeTestJSON-1400886372 tempest-ImagesOneServerNegativeTestJSON-1400886372] [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] Instance failed to spawn
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] Traceback (most recent call last):
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] File "/opt/stack/new/nova/nova/compute/manager.py", line 2117, in _build_resources
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] yield resources
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] File "/opt/stack/new/nova/nova/compute/manager.py", line 1924, in _build_and_run_instance
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] block_device_info=block_device_info)
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2647, in spawn
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] block_device_info=block_device_info)
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 4741, in _get_guest_xml
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] context)
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 4515, in _get_guest_config
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] instance.numa_topology, flavor, allowed_cpus, image_meta)
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 3872, in _get_guest_numa_config
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] if (not self._has_numa_support() and
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 5212, in _has_numa_support
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] if self._host.has_version(ver):
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 519, in has_version
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] lv_ver=lv_ver, hv_ver=hv_ver, hv_type=hv_type, op=operator.ne)
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 490, in _version_check
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] conn = self.get_connection()
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 462, in get_connection
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] raise exception.HypervisorUnavailable(host=CONF.host)
2016-12-01 18:16:05.123 6160 ERROR nova.compute.manager [instance: 6fa73b04-c6a7-47a8-908b-6738f36f6ffc] HypervisorUnavailable: Connection to the hypervisor is broken on host: ubuntu-xenial-rax-ord-5924789

The issue happens with different tests, so I doubt it is actually a Tempest problem, but I'm filing it here for initial triage.

Revision history for this message
Andrea Frittoli (andrea-frittoli) wrote :

message: "libvirtError: End of file while reading data: Input/output error"

This also started on the 23rd.

Revision history for this message
Matt Riedemann (mriedem) wrote :

The error in the libvirtd logs is here:

http://logs.openstack.org/66/401366/4/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/3deacc5/logs/libvirt/libvirtd.txt.gz#_2016-12-01_18_16_04_907

2016-12-01 18:16:04.907+0000: 17078: error : qemuMonitorIORead:580 : Unable to read from monitor: Connection reset by peer

Seeing quite a few virtlogd i/o errors in syslog:

http://logs.openstack.org/66/401366/4/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/3deacc5/logs/syslog.txt.gz

But virtlogd support was merged on 12/6, not 11/23:

https://github.com/openstack/nova/commit/85c781a7229e324155be3db029b7541a27f909db

Plus, that requires libvirt>=1.3.3 and we run 1.3.1 on the xenial nodes, so we're not actually exercising the virtlogd code path in the gate yet.
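
For context, nova only enables virtlogd after a libvirt version check; below is a minimal sketch of that kind of guard using the Host.has_min_version() helper. The constant name is illustrative, not necessarily the exact symbol in nova/virt/libvirt/driver.py:

# Illustrative version gate: virtlogd needs libvirt >= 1.3.3, so on the
# Xenial nodes (libvirt 1.3.1) this check fails and the virtlogd code
# path in the driver is never taken.
MIN_LIBVIRT_VIRTLOGD = (1, 3, 3)   # assumed constant name

def _virtlogd_usable(host):
    # Host.has_min_version() compares against the version reported by the
    # running libvirtd; 1.3.1 < 1.3.3, so this returns False in the gate.
    return host.has_min_version(lv_ver=MIN_LIBVIRT_VIRTLOGD)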

Sean Dague (sdague)
Changed in nova:
status: New → Incomplete
Revision history for this message
Andrea Frittoli (andrea-frittoli) wrote :

The date of 23/11 may be artificial; it could simply reflect the retention policy of the logstash cluster.

Revision history for this message
Andrea Frittoli (andrea-frittoli) wrote :

Looking at the timestamp from the libvirt log, I found the following instance log, which shows a qemu crash: http://logs.openstack.org/66/401366/4/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/3deacc5/logs/libvirt/qemu/instance-0000000c.txt.gz#_2016-12-01_18_14_13_257

warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
main-loop: WARNING: I/O thread spun for 1000 iterations
2016-12-01T18:14:13.226270Z qemu-system-x86_64: terminating on signal 15 from pid 17078
2016-12-01 18:14:13.257+0000: shutting down

Revision history for this message
Andrea Frittoli (andrea-frittoli) wrote :

The CPUID.01H:ECX.vmx warning shows up in every single VM log.
The main-loop warning, after some googling, seems to be associated with heavy I/O in the guest, which causes slowdowns.
The qemu process terminating on signal 15 is the normal shutdown of the VM.
Digging further.

Revision history for this message
Andrea Frittoli (andrea-frittoli) wrote :

In http://logs.openstack.org/66/401366/4/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/3deacc5/logs/libvirt/qemu/instance-0000001a.txt.gz#_2016-12-01_18_16_08_032 I see:

2016-12-01T18:16:04.707300Z qemu-system-x86_64: terminating on signal 15 from pid 17078

Which matches closely (200ms gap) with http://logs.openstack.org/66/401366/4/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/3deacc5/logs/libvirt/libvirtd.txt.gz#_2016-12-01_18_16_04_907

2016-12-01 18:16:04.907+0000: 17078: error : qemuMonitorIORead:580 : Unable to read from monitor: Connection reset by peer

So I think the qemuMonitorIORead errors are probably just related to VMs being shut down normally.
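
The ~200ms gap mentioned above is simply the difference between the two quoted timestamps; a quick check in plain Python (timestamp formats copied from the qemu and libvirtd logs):

from datetime import datetime

# The qemu instance log uses an ISO-8601 style timestamp with a trailing 'Z';
# the libvirtd log uses a space separator with an explicit +0000 offset.
qemu_term = datetime.strptime('2016-12-01T18:16:04.707300Z',
                              '%Y-%m-%dT%H:%M:%S.%fZ')
mon_error = datetime.strptime('2016-12-01 18:16:04.907',
                              '%Y-%m-%d %H:%M:%S.%f')

print(mon_error - qemu_term)  # 0:00:00.199700, i.e. roughly a 200ms gap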

Revision history for this message
Jordan Pittier (jordan-pittier) wrote :
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :
Revision history for this message
Andrea Frittoli (andrea-frittoli) wrote :

It looks like sometimes, when the following log line appears:

virtlogd[...] End of file while reading data: Input/output error

libvirt is killed by the kernel because of either a general protection fault or a stack segment trap.

summary: - Cannot connect to libvirt
+ libvirt killed by kernel on general protection or stack segment traps
Revision history for this message
Andrea Frittoli (andrea-frittoli) wrote :

I did a few checks on the signature of this bug, as Jordan did, and I see that test_delete_saving_image (http://git.openstack.org/cgit/openstack/tempest/tree/tempest/api/compute/images/test_images.py#n43) is always running when the issue occurs.

The test deletes a snapshot of a server while the snapshot is still being created, which temporarily pauses the server, then resumes it, and finally destroys it.
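
Roughly, the test does something like the following (paraphrased; the helper names approximate tempest's base-class helpers rather than quoting the linked code verbatim):

# Paraphrased sketch of test_delete_saving_image (not verbatim tempest code).
def test_delete_saving_image(self):
    # Boot a server and wait for it to become ACTIVE.
    server = self.create_test_server(wait_until='ACTIVE')
    # Start snapshotting the server and wait only until the image reaches
    # the 'SAVING' state; nova pauses/quiesces the guest while the
    # snapshot is being taken.
    image = self.create_image_from_server(server['id'], wait_until='SAVING')
    # Delete the snapshot while it is still being saved; the server is
    # resumed and destroyed later during test cleanup.
    self.client.delete_image(image['id'])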

The instance log is for instance: http://logs.openstack.org/82/409682/3/check/gate-tempest-dsvm-neutron-dvr-ubuntu-xenial/0b05dcf/logs/libvirt/qemu/instance-00000025.txt.gz

I tried to reproduce this locally by running something like the following a few times:

tox -eall -- '(test_delete_saving_image|test_resize_server_from_manual_to_auto|test_add_remove_fixed_ip|test_create_image_from_stopped_server|test_verify_multiple_nics_order|test_max_image_meta_exceed_limit)'

but I have not managed to reproduce it so far. I did get a kernel panic in my VM after a few runs, but that's probably unrelated.

Revision history for this message
Andrea Frittoli (andrea-frittoli) wrote :

I marked this as incomplete from a Tempest POV - I couldn't find anything wrong with the Tempest tests that seem to trigger this, apart from the fact that they sometimes trigger what looks like a libvirt issue.

Changed in tempest:
status: New → Incomplete
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi (cross post, I know),
there is a cluster of related bugs that look like at least partial duplicates of one another:
- bug 1646779
- bug 1643911
- bug 1638982
- bug 1673483

Unfortunately, I've never hit these bugs in any of my libvirt/qemu runs, and those amount to literally thousands of runs every day across all the automated tests. I also checked with our Ubuntu OpenStack team, and they haven't seen them so far either. As you will all understand, that makes it hard to debug them in depth.

But while the “signature” of each bug is different, they still share a lot (let's call it the bug "meta signature"):
- there is no way to recreate yet other than watching gate test crash statistics
- they seem to haunt the openstack gate testing more than anything else
- most are directly or indirectly related to memory corruption

As I'm unable to reproduce any of these bugs myself, I'd like help from anyone who can recreate them. I therefore ask all affected people (mostly the same on all these bugs anyway) to test the PPAs I created for bug 1673483. That is the one bug where, thanks to the great help of Kashyp, Matt and Dan (and more), at least a potential fix has been identified.

That means two PPAs for you to try:
- Backport of the minimal recommended fix: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2620
- Backport of the full series of related fixes: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2619

Especially since the problems those fixes address range from memory leaks to deadlocks, there is a chance that all of these bugs share the same root cause.

Revision history for this message
James Page (james-page) wrote :

Attempting to reproduce this outside of the OpenStack gate is proving problematic; I left a minimal Xenial/Mitaka cloud running concurrent executions (x16) of the tests described in #10 on a single compute node deployment for the last 12 hours; 4500 instances later I've still not seen any of the issues that this or the other bugs describe.

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

This is clearly not a Tempest bug.

Changed in tempest:
status: Incomplete → Invalid