nova snapshot fails with "libvirtError: internal error: early end of file from monitor: possible problem: qemu: terminating on signal 15 from pid 15765"

Bug #1526989 reported by Matt Riedemann
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

http://logs.openstack.org/18/258118/2/check/gate-tempest-dsvm-postgres-full/872af9a/logs/screen-n-cpu.txt.gz?level=TRACE#_2015-12-16_22_36_33_037

This looks to be very random:

2015-12-16 22:36:33.037 ERROR nova.compute.manager [req-706bbae7-e8b6-4c91-9d12-56d61bd2df98 tempest-ServerAddressesNegativeTestJSON-431116692 tempest-ServerAddressesNegativeTestJSON-995800820] [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] Instance failed to spawn
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] Traceback (most recent call last):
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/opt/stack/new/nova/nova/compute/manager.py", line 2173, in _build_resources
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] yield resources
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/opt/stack/new/nova/nova/compute/manager.py", line 2020, in _build_and_run_instance
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] block_device_info=block_device_info)
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 2522, in spawn
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] block_device_info=block_device_info)
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 4600, in _create_domain_and_network
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] xml, pause=pause, power_on=power_on)
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 4530, in _create_domain
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] guest.launch(pause=pause)
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 141, in launch
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] self._encoded_xml, errors='ignore')
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 204, in __exit__
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] six.reraise(self.type_, self.value, self.tb)
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/opt/stack/new/nova/nova/virt/libvirt/guest.py", line 136, in launch
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] return self._domain.createWithFlags(flags)
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] result = proxy_call(self._autowrap, f, *args, **kwargs)
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] rv = execute(f, *args, **kwargs)
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] six.reraise(c, e, tb)
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] rv = meth(*args, **kwargs)
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] File "/usr/local/lib/python2.7/dist-packages/libvirt.py", line 900, in createWithFlags
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] libvirtError: internal error: early end of file from monitor: possible problem:
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630] qemu: terminating on signal 15 from pid 15765
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630]
2015-12-16 22:36:33.037 22332 ERROR nova.compute.manager [instance: 86493959-22a5-4e5a-8226-92eb95c3c630]

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22_create_domain%5C%22%20AND%20message:%5C%22libvirtError:%20internal%20error:%20early%20end%20of%20file%20from%20monitor:%20possible%20problem%5C%22%20AND%20message:%5C%22qemu:%20terminating%20on%20signal%2015%20from%20pid%5C%22%20AND%20tags:%5C%22screen-n-cpu.txt%5C%22

4 hits in the last 7 days, check and gate, all failures.

This is most likely some latent libvirt/qemu bug.

Matt Riedemann (mriedem)
tags: added: libvirt
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Ivo Vasev (ivo-vasev) wrote :

Wondering if there is any progress on investigating this one; it does affect me.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Ivo, which version of libvirt/qemu are you running?

Revision history for this message
Matt Riedemann (mriedem) wrote :

From logstash over the last 7 days:

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22_create_domain%5C%22%20AND%20message%3A%5C%22libvirtError%3A%20internal%20error%3A%20early%20end%20of%20file%20from%20monitor%3A%20possible%20problem%5C%22%20AND%20message%3A%5C%22qemu%3A%20terminating%20on%20signal%2015%20from%20pid%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

This only fails in multinode jobs, where we set an exact cpu model match between the guest and host. I'm not sure why those tests aren't failing and the jobs are reporting success, though.

But I noticed this from one of the failing snapshots:

http://logs.openstack.org/34/283134/6/check/gate-tempest-dsvm-neutron-dvr-multinode-full/a1ab6f2/logs/subnode-2/screen-n-cpu.txt.gz?level=TRACE#_2016-03-05_11_39_37_887

<cpu mode='custom' match='exact'>
    <model fallback='allow'>gate64</model>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>
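One way to check whether the host actually supports the requested model is to feed that <cpu> element to virsh. This is a hypothetical debugging step (it assumes virsh is installed on the compute node; the /tmp/cpu.xml path is just for illustration), not something the gate currently does:

```shell
# Write the <cpu> element from the guest XML above to a file, then ask
# libvirt whether the host CPU is a superset of the requested model.
cat > /tmp/cpu.xml <<'EOF'
<cpu mode='custom' match='exact'>
  <model fallback='allow'>gate64</model>
  <topology sockets='1' cores='1' threads='1'/>
</cpu>
EOF
if command -v virsh >/dev/null 2>&1; then
  # Prints "identical", "superset", or "incompatible".
  virsh cpu-compare /tmp/cpu.xml
else
  echo "virsh not available on this host"
fi
```

An "incompatible" result here would point at a host/guest CPU model mismatch rather than a libvirt monitor bug.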

I think the multinode jobs are the only ones where we require a specific cpu model and that's for live migration to work, i.e.:

http://logs.openstack.org/34/283134/6/check/gate-tempest-dsvm-neutron-dvr-multinode-full/a1ab6f2/logs/subnode-2/etc/nova/nova.conf.txt.gz

[libvirt]
cpu_model = gate64
inject_partition = -2
live_migration_uri = qemu+ssh://stack@%s/system
use_usb_tablet = False
cpu_mode = custom
virt_type = kvm
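For reference, the mapping from those [libvirt] options to the <cpu> element seen in the guest XML can be sketched as follows. This is a simplified illustration (build_cpu_xml is a hypothetical helper, not the actual nova driver code):

```python
import xml.etree.ElementTree as ET

def build_cpu_xml(cpu_mode, cpu_model):
    """Sketch of how cpu_mode/cpu_model config becomes a <cpu> element.

    With cpu_mode = custom, the configured model name is emitted as an
    exact-match <model> child, which is what the multinode jobs rely on
    for live migration compatibility.
    """
    cpu = ET.Element('cpu', mode=cpu_mode, match='exact')
    if cpu_mode == 'custom':
        model = ET.SubElement(cpu, 'model', fallback='allow')
        model.text = cpu_model
    return ET.tostring(cpu, encoding='unicode')

print(build_cpu_xml('custom', 'gate64'))
```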

I was wondering about cpu modes between the host and guest because I saw this related forum thread:

https://bbs.archlinux.org/viewtopic.php?id=159279

We should get Daniel Berrange (danpb in #openstack-nova in freenode IRC) to take a look at this.

tags: added: multinode
summary: - nova boot fails with "libvirtError: internal error: early end of file
- from monitor: possible problem: qemu: terminating on signal 15 from pid
- 15765"
+ nova snapshot fails with "libvirtError: internal error: early end of
+ file from monitor: possible problem: qemu: terminating on signal 15 from
+ pid 15765"
Revision history for this message
Mark Doffman (mjdoffma) wrote :

So the error is basically stating that the qemu process received a SIGTERM. Is there any way for us to find out what pid 15765 is? The PID obviously changes from run to run, but some process is sending SIGTERM to qemu in the middle of the run, so the real error might lie elsewhere.
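On a Linux host you can in principle map a pid to its command name while the process is still alive; in practice the sender has usually exited by the time anyone looks at the log. A hypothetical helper (pid_comm is not part of nova), just to show the check:

```python
import os

def pid_comm(pid):
    """Return the command name for a live pid, or None if it has exited.

    Reads /proc/<pid>/comm, so this is Linux-only and only works while
    the process still exists.
    """
    try:
        with open('/proc/%d/comm' % pid) as f:
            return f.read().strip()
    except (IOError, OSError):
        return None

# If pid 15765 were still alive and were libvirtd, this would print
# 'libvirtd'; here we just demonstrate on our own pid.
print(pid_comm(os.getpid()))
```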

Revision history for this message
Mark Doffman (mjdoffma) wrote :

OK, Matt Riedemann is right. It's almost certainly the libvirtd process that is sending the SIGTERM. We need to get Daniel Berrange to take a look at this. It's going to be difficult to debug, as the issue is probably elsewhere in the libvirt code (not at the spot in the monitor code where the error is reported).
