libvirt randomly crashes on xenial nodes with "*** Error in `/usr/sbin/libvirtd': malloc(): memory corruption:"

Bug #1643911 reported by Matt Riedemann
This bug affects 13 people
Affects                    Status        Importance   Assigned to   Milestone
OpenStack Compute (nova)   In Progress   Medium       Unassigned
libvirt                    New           Undecided    Unassigned
libvirt (Ubuntu)           Confirmed     Undecided    Unassigned

Bug Description

Seen here:

http://logs.openstack.org/44/386844/17/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/91befad/logs/syslog.txt.gz#_Nov_22_00_27_46

Nov 22 00:27:46 ubuntu-xenial-rax-ord-5717228 virtlogd[16875]: End of file while reading data: Input/output error
Nov 22 00:27:46 ubuntu-xenial-rax-ord-5717228 libvirtd[16847]: *** Error in `/usr/sbin/libvirtd': malloc(): memory corruption: 0x0000558ff1c7c800 ***

http://logs.openstack.org/44/386844/17/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/91befad/logs/screen-n-cpu.txt.gz?level=TRACE#_2016-11-22_00_27_46_571

2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [req-d6b33315-636c-4ebc-99e4-8cac236e1f7f tempest-ServerDiskConfigTestJSON-191847812 tempest-ServerDiskConfigTestJSON-191847812] [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] Failed to allocate network(s)
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] Traceback (most recent call last):
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] File "/opt/stack/new/nova/nova/compute/manager.py", line 2021, in _build_resources
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] requested_networks, security_groups)
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] File "/opt/stack/new/nova/nova/compute/manager.py", line 1445, in _build_networks_for_instance
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] requested_networks, macs, security_groups, dhcp_options)
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] File "/opt/stack/new/nova/nova/compute/manager.py", line 1461, in _allocate_network
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] self._update_resource_tracker(context, instance)
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] File "/opt/stack/new/nova/nova/compute/manager.py", line 564, in _update_resource_tracker
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] self.driver.node_is_available(instance.node)):
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] File "/opt/stack/new/nova/nova/virt/driver.py", line 1383, in node_is_available
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] if nodename in self.get_available_nodes():
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] File "/opt/stack/new/nova/nova/virt/libvirt/driver.py", line 6965, in get_available_nodes
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] return [self._host.get_hostname()]
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 708, in get_hostname
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] hostname = self.get_connection().getHostname()
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] File "/opt/stack/new/nova/nova/virt/libvirt/host.py", line 420, in get_connection
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] raise exception.HypervisorUnavailable(host=CONF.host)
2016-11-22 00:27:46.571 4886 ERROR nova.compute.manager [instance: d52c0be8-eed2-47a5-bbb5-dd560bb9276e] HypervisorUnavailable: Connection to the hypervisor is broken on host: ubuntu-xenial-rax-ord-5717228

2016-11-22 00:27:46.279+0000: 16847: error : qemuMonitorIORead:580 : Unable to read from monitor: Connection reset by peer

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22***%20Error%20in%20%60%2Fusr%2Fsbin%2Flibvirtd'%3A%20malloc()%3A%20memory%20corruption%3A%5C%22%20AND%20tags%3A%5C%22syslog%5C%22&from=7d

Tags: libvirt
Matt Riedemann (mriedem)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Talking with libvirt upstream folks (Dan Berrange & Michal Privoznik) on #virt, OFTC, the way to debug this seems to be: (a) come up with a reproducer with just one test; (b) run that reproducer with libvirtd under Valgrind. Otherwise, as DanB put it: "that's not even remotely practical - the gate jobs would take days to complete under Valgrind"
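
For reference, a rough sketch of what running libvirtd under Valgrind could look like on a throwaway reproducer box (not in the gate); the flags and log path are illustrative assumptions, and libvirt-bin is the Ubuntu 16.04 service name:

# Stop the packaged daemon, then run it in the foreground under memcheck so
# the first invalid access is reported rather than the much later crash.
sudo systemctl stop libvirt-bin
sudo valgrind --tool=memcheck --leak-check=full --track-origins=yes \
     --log-file=/tmp/valgrind-libvirtd.log /usr/sbin/libvirtd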

Revision history for this message
Lee Yarwood (lyarwood) wrote :

Just saw this on stable/newton and having looked at the logs I think this is a logging issue in libvirtd when terminating an instance created by the ImagesTestJSON tests.

- Attempt to terminate the ImagesTestJSON instance prior to libvirtd dying :

http://logs.openstack.org/61/407961/1/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/3dc2b54/logs/screen-n-cpu.txt.gz#_2016-12-12_07_48_37_547

- Call to qemuDomainLogContextRead that I think is triggering this issue :

http://logs.openstack.org/61/407961/1/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/3dc2b54/logs/libvirt/libvirtd.txt.gz#_2016-12-12_07_48_37_835

- Further call logged to virLogProbablyLogMessage in syslog :

http://logs.openstack.org/61/407961/1/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/3dc2b54/logs/syslog.txt.gz#_Dec_12_07_48_37

Kashyap, could you raise this with the libvirt folks for their take on this?

~~~
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: *** Error in `/usr/sbin/libvirtd': malloc(): memory corruption: 0x000055a98c7c6810 ***
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: ======= Backtrace: =========
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fc916b557e5]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /lib/x86_64-linux-gnu/libc.so.6(+0x8181e)[0x7fc916b5f81e]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7fc916b615d4]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /lib/x86_64-linux-gnu/libc.so.6(realloc+0x358)[0x7fc916b61e68]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /lib/x86_64-linux-gnu/libc.so.6(+0xe53a2)[0x7fc916bc33a2]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /lib/x86_64-linux-gnu/libc.so.6(regexec+0xb3)[0x7fc916bc59c3]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virLogProbablyLogMessage+0x1f)[0x7fc917563a5f]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so(+0x7d080)[0x7fc8c7ddf080]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so(+0x933dc)[0x7fc8c7df53dc]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virEventPollRunOnce+0x7b3)[0x7fc91754d393]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virEventRunDefaultImpl+0x41)[0x7fc91754bc01]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virNetDaemonRun+0x14d)[0x7fc917697b4d]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /usr/sbin/libvirtd(+0x14407)[0x55a98c480407]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6107639 libvirtd[29343]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fc916afe830]
Dec 12 07:48:37 ubuntu-xenial-osic-cloud1-disk-6...


Revision history for this message
Jordan Pittier (jordan-pittier) wrote :

I found this bug report while investigating a gate issue with this query:

message: "kernel: traps: libvirtd" and message: "trap stack segment ip:" and message: "error:0 in libc-2.23.so" AND tags:"syslog"

I am not sure it's related to this issue, but so far what I know is that sometimes libvirtd crashes ('systemd[1]: libvirt-bin.service: Main process exited, code=killed, status=7/BUS') and is then restarted by systemd a few seconds later.

The job status is not always FAILED because of how Tempest works: in some tests, if deleting a server fails (the VM status goes to ERROR), Tempest doesn't mark the test as failed: https://github.com/openstack/tempest/blob/fc267faa92c5355d9971b7dac3275ffea9646cd0/tempest/api/compute/base.py#L141

Revision history for this message
Jordan Pittier (jordan-pittier) wrote :

I can confirm that this happens mainly (human check, based on analysing 6 gate jobs) with Tempest tests in test_images.py, and it always happens during VM deletion (destroy).

What's special about these Tempest tests is that they try to delete the same server twice in two quick, successive API calls (the first via the addCleanup() mechanism; the other attempt comes from the class method clear_servers() in the BaseV2ComputeTest class). So this looks like a race condition in libvirtd.
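
A hypothetical way to mimic that double-delete pattern outside Tempest (the server name below is made up, not taken from the tests) would be to fire two delete calls for the same server almost back to back:

# Two near-simultaneous deletes of the same server, roughly what the
# addCleanup()/clear_servers() combination ends up doing.
SERVER_ID=$(openstack server show test-server -f value -c id)
openstack server delete "$SERVER_ID" &
openstack server delete "$SERVER_ID" &
wait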

Revision history for this message
Christoph Fiehe (fiehe) wrote :

I am facing the same problem on one of my compute nodes. Everything had been working for days, but suddenly Nova fails to spawn a VM on one of my compute nodes. I am running OpenStack Newton on Ubuntu 16.04. I get the following error in the log:

Dec 15 14:03:55 os-compute03 libvirtd[4739]: Unable to read from monitor: Connection reset by peer
Dec 15 14:03:55 os-compute03 libvirtd[4739]: *** Error in `/usr/sbin/libvirtd': malloc(): memory corruption: 0x000055c548afa6f0 ***
Dec 15 14:03:55 os-compute03 libvirtd[4739]: ======= Backtrace: =========
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f6a1c8b57e5]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /lib/x86_64-linux-gnu/libc.so.6(+0x8181e)[0x7f6a1c8bf81e]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f6a1c8c15d4]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /lib/x86_64-linux-gnu/libc.so.6(realloc+0x358)[0x7f6a1c8c1e68]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /lib/x86_64-linux-gnu/libc.so.6(+0xe53a2)[0x7f6a1c9233a2]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /lib/x86_64-linux-gnu/libc.so.6(regexec+0xb3)[0x7f6a1c9259c3]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virLogProbablyLogMessage+0x1f)[0x7f6a1d2c3a5f]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so(+0x7d080)[0x7f69f5aa6080]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so(+0x933dc)[0x7f69f5abc3dc]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virEventPollRunOnce+0x7b3)[0x7f6a1d2ad393]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virEventRunDefaultImpl+0x41)[0x7f6a1d2abc01]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virNetDaemonRun+0x14d)[0x7f6a1d3f7b4d]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /usr/sbin/libvirtd(+0x14407)[0x55c546c4c407]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f6a1c85e830]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: /usr/sbin/libvirtd(+0x14bd9)[0x55c546c4cbd9]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: ======= Memory map: ========
Dec 15 14:03:55 os-compute03 libvirtd[4739]: 55c546c38000-55c546ca7000 r-xp 00000000 fc:00 2102952 /usr/sbin/libvirtd
Dec 15 14:03:55 os-compute03 libvirtd[4739]: 55c546ea6000-55c546ea8000 r--p 0006e000 fc:00 2102952 /usr/sbin/libvirtd
Dec 15 14:03:55 os-compute03 libvirtd[4739]: 55c546ea8000-55c546ead000 rw-p 00070000 fc:00 2102952 /usr/sbin/libvirtd
Dec 15 14:03:55 os-compute03 libvirtd[4739]: 55c548aa0000-55c548b1a000 rw-p 00000000 00:00 0 [heap]
Dec 15 14:03:55 os-compute03 libvirtd[4739]: 7f69bc000000-7f69bc021000 rw-p 00000000 00:00 0
Dec 15 14:03:55 os-compute03 libvirtd[4739]: 7f69bc021000-7f69c0000000 ---p 00000000 00:00 0
Dec 15 14:03:55 os-compute03 libvirtd[4739]: 7f69c0000000-7f69c0021000 rw-p 00000000 00:00 0
Dec 15 14:03:55 os-compute03 libvirtd[4739]: 7f69c0021000-7f...

Revision history for this message
Jakub Libosvar (libosvar) wrote :

Can we increase the importance please? It blocks patches to merge.

Revision history for this message
Andrea Frittoli (andrea-frittoli) wrote :

New occurrence, with a slightly different signature:
http://logs.openstack.org/04/426604/3/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/708771d/logs/syslog.txt.gz#_Feb_01_11_47_53

Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: *** Error in `/usr/sbin/libvirtd': realloc(): invalid next size: 0x000056108d7a7450 ***
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: ======= Backtrace: =========
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f48cb75c7e5]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /lib/x86_64-linux-gnu/libc.so.6(+0x82a5a)[0x7f48cb767a5a]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /lib/x86_64-linux-gnu/libc.so.6(realloc+0x179)[0x7f48cb768c89]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virReallocN+0x43)[0x7f48cc1362f3]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so(+0x7d0f9)[0x7f48a78ac0f9]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /usr/lib/libvirt/connection-driver/libvirt_driver_qemu.so(+0x933dc)[0x7f48a78c23dc]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virEventPollRunOnce+0x7b3)[0x7f48cc154393]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virEventRunDefaultImpl+0x41)[0x7f48cc152c01]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /usr/lib/x86_64-linux-gnu/libvirt.so.0(virNetDaemonRun+0x14d)[0x7f48cc29eb4d]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /usr/sbin/libvirtd(+0x14407)[0x56108cdf5407]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f48cb705830]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: /usr/sbin/libvirtd(+0x14bd9)[0x56108cdf5bd9]
Feb 01 11:47:53 ubuntu-xenial-osic-cloud1-s3700-7053530 libvirtd[16665]: ======= Memory map: ========

Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

On this job I've noticed that the memory corruption error [1] happens after (what I find to be) suspicious iscsi errors and libvirt logging issues:

Feb 02 08:44:04 ubuntu-xenial-infracloud-chocolate-7072289 kernel: sd 3:0:0:1: [sda] Synchronizing SCSI cache
                                                                                     ^^^^^^^^^^^^^^^^^^^^^^^^
Feb 02 08:44:05 ubuntu-xenial-infracloud-chocolate-7072289 account-replicator[31349]: Beginning replication run
Feb 02 08:44:05 ubuntu-xenial-infracloud-chocolate-7072289 account-replicator[31349]: Replication run OVER
Feb 02 08:44:05 ubuntu-xenial-infracloud-chocolate-7072289 account-replicator[31349]: Attempted to replicate 1 dbs in 0.01254 seconds (79.75111/s)
Feb 02 08:44:05 ubuntu-xenial-infracloud-chocolate-7072289 account-replicator[31349]: Removed 0 dbs
Feb 02 08:44:05 ubuntu-xenial-infracloud-chocolate-7072289 account-replicator[31349]: 1 successes, 0 failures
Feb 02 08:44:05 ubuntu-xenial-infracloud-chocolate-7072289 account-replicator[31349]: no_change:1 ts_repl:0 diff:0 rsync:0 diff_capped:0 hashmatch:0 empty:0
Feb 02 08:44:05 ubuntu-xenial-infracloud-chocolate-7072289 sudo[21121]: stack : TTY=unknown ; PWD=/opt/stack/new/devstack ; USER=root ; COMMAND=/usr/local/bin/cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --show
Feb 02 08:44:05 ubuntu-xenial-infracloud-chocolate-7072289 sudo[21121]: pam_unix(sudo:session): session opened for user root by (uid=0)
Feb 02 08:44:06 ubuntu-xenial-infracloud-chocolate-7072289 iscsid[651]: Connection2:0 to [target: iqn.2010-10.org.openstack:volume-2c20350a-011b-45ac-bc99-e2a9089e423e, portal: 15.184.70.50,3260] through [iface: default] is shutdown.
                                                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Feb 02 08:44:06 ubuntu-xenial-infracloud-chocolate-7072289 sudo[21121]: pam_unix(sudo:session): session closed for user root
Feb 02 08:44:06 ubuntu-xenial-infracloud-chocolate-7072289 sudo[21125]: stack : TTY=unknown ; PWD=/opt/stack/new/devstack ; USER=root ; COMMAND=/usr/local/bin/cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2c20350a-011b-45ac-bc99-e2a9089e423e
Feb 02 08:44:06 ubuntu-xenial-infracloud-chocolate-7072289 sudo[21125]: pam_unix(sudo:session): session opened for user root by (uid=0)
Feb 02 08:44:07 ubuntu-xenial-infracloud-chocolate-7072289 sudo[21125]: pam_unix(sudo:session): session closed for user root
Feb 02 08:44:07 ubuntu-xenial-infracloud-chocolate-7072289 sudo[21146]: stack : TTY=unknown ; PWD=/opt/stack/new/devstack ; USER=root ; COMMAND=/usr/local/bin/cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --show
Feb 02 08:44:07 ubuntu-xenial-infracloud-chocolate-7072289 sudo[21146]: pam_unix(sudo:session): session opened for user root by (uid=0)
Feb 02 08:44:07 ubuntu-xenial-infracloud-chocolate-7072289 kernel: qbrd128337a-f9: port 2(tapd128337a-f9) entered disabled state
Feb 02 08:44:07 ubuntu-xenial-infracloud-chocolate-7072289 kernel: device tapd128337a-f9 left promiscuous mode
Fe...


Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22through%20%5Biface%3A%20default%5D%20is%20shutdown.%5C%22

The iscsi shutdown issues seem to occur quite frequently over the last 1-2 weeks.
Perhaps they deserve their own bug?

Revision history for this message
Dr. David Alan Gilbert (dgilbert-h) wrote :

There seem to be 6+ people hitting it -> marking as Confirmed.

Changed in libvirt (Ubuntu):
status: New → Confirmed
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Now, we seem to be stuck in limbo here, unable to diagnose this and get to the root cause. So I asked the upstream libvirt maintainers on IRC, and Dan Berrange responds [text formatted slightly for readability]:

"Running libvirt under Valgrind will likely point to a root cause. However, it's impossible to run libvirtd under Valgrind in the openstack CI system, unless you're happy to have many hours longer running time and massively more RAM used.

"The only way to debug it is to deploy custom libvirtd builds. Meaning: whatever extra debugging info is needed in the area of code that is suspected to be broken; there's no right answer here - you just have to experiment repeatedly until you find what you need. And deploy this custom build either by providing new packages in the the repos, or by using a hack [via 'rootwrap' facility] to install custom libvirtd in the Nova startup code.

"Also, a core dump in this scenario will not be helpful. With memory corruption, a core dump is rarely useful, because the actual problem you care about will have occurred some time before the crash happens. This is especially true for multithreaded applications like libvirtd. Because the thread showing the abrt/segv is quite often not the thread which caused the corruption."

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

FYI Armando suspects that this failure is a result of general high memory consumption in gate, something that lingers all projects: http://lists.openstack.org/pipermail/openstack-dev/2017-February/111413.html

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

I faced the same calltrace on http://logs.openstack.org/59/426459/4/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/0222b58/logs

The backtrace of libvirtd is

 *** Error in `/usr/sbin/libvirtd': malloc(): memory corruption: 0x00005560f082e8d0 ***
 ======= Backtrace: =========
 /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fd02d8337e5]
 /lib/x86_64-linux-gnu/libc.so.6(+0x8181e)[0x7fd02d83d81e]
 /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7fd02d83f5d4]
 /lib/x86_64-linux-gnu/libc.so.6(realloc+0x358)[0x7fd02d83fe68]
 /lib/x86_64-linux-gnu/libc.so.6(+0xe53a2)[0x7fd02d8a13a2]
 /lib/x86_64-linux-gnu/libc.so.6(regexec+0xb3)[0x7fd02d8a39c3]
 /usr/lib/x86_64-linux-gnu/libvirt.so.0(virLogProbablyLogMessage+0x1f)[0x7fd02e241a5f]

If this were due to the oom-killer, syslog should clearly show that the oom-killer ran.
However, the syslog doesn't contain it.
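
A quick way to check that (the exact message wording is an assumption about typical kernel output, not taken from these logs):

# The kernel logs lines such as "invoked oom-killer" / "Out of memory: Kill process"
# when the OOM killer fires; no match here supports the "not OOM" conclusion.
grep -Ei 'oom-killer|out of memory' /var/log/syslog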

Revision history for this message
Clark Boylan (cboylan) wrote :

We've also fairly recently gotten the OOM-killer problems to go away, and yet these crashes remain. I doubt this is related to the OOM killer.

Revision history for this message
Ian Wienand (iwienand) wrote :

Valgrind would be great, but it is the 100-pound-gorilla approach. I'll play with some lighter-weight tools like Electric Fence, which could give us some insight. Something like that is going to segfault, so capturing cores seems a top priority. I'm probably more optimistic about the general usefulness of cores anyway ... sometimes, in my experience, dumping bits of the heap shows strings or other things that give you a clue as to what's going on.

Revision history for this message
Ian Wienand (iwienand) wrote :

(i just sent this to the list, but putting here too)

While I agree that a coredump is not that likely to help, I would also
like to come to that conclusion after inspecting a coredump :) I've
found things in the heap before that give clues as to what real
problems are.

To this end, I've proposed [2] to keep coredumps. It's a little
hackish but I think it gets the job done. [3] enables this and saves any
dumps to the logs in d-g.

As suggested, running under valgrind would be great but probably
impractical until we narrow it down a little. Another thing I've had
some success with is electric fence [4] which puts boundaries around
allocations so out-of-bounds access hits at the time of access. I've
proposed [5] to try this out, but it's not looking particularly
promising unfortunately. I'm open to suggestions, for example maybe
something like tcmalloc might give us a different failure and could be
another clue. If we get something vaguely reliable here, our best bet
might be to run a parallel non-voting job on all changes to see what
we can pick up.

[1] https://bugs.launchpad.net/nova/+bug/1643911
[2] https://review.openstack.org/451128
[3] https://review.openstack.org/451219
[4] http://elinux.org/Electric_Fence
[5] https://review.openstack.org/451136
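
For anyone trying this locally rather than via the reviews above, a hedged sketch of the two ideas; the unit name, core_pattern path and Electric Fence library path are assumptions for a xenial box and may need adjusting:

# 1) Allow libvirtd to dump core and keep the dumps somewhere findable.
sudo mkdir -p /var/crash/cores
sudo sysctl -w kernel.core_pattern=/var/crash/cores/core.%e.%p
sudo mkdir -p /etc/systemd/system/libvirt-bin.service.d
printf '[Service]\nLimitCORE=infinity\n' | \
    sudo tee /etc/systemd/system/libvirt-bin.service.d/core.conf
sudo systemctl daemon-reload && sudo systemctl restart libvirt-bin

# 2) Alternatively, run libvirtd in the foreground with Electric Fence preloaded
#    so out-of-bounds writes fault at the point of access (needs the
#    electric-fence package; the library path below is a guess).
sudo systemctl stop libvirt-bin
sudo sh -c 'LD_PRELOAD=/usr/lib/libefence.so.0.0 /usr/sbin/libvirtd'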

Revision history for this message
Matthew Booth (mbooth-9) wrote :

Is this the same bug we saw the other week? I thought Dan B had found a couple of patches missing from the libvirt shipped in Ubuntu which are likely candidates for fixing this. Kashyap, do you have those to hand?

I don't think we're likely to get much traction on this from upstream libvirt because:

* This is not the latest stable release of libvirt.
* They looked anyway, and they think it's fixed in the latest stable release of libvirt.

IOW it's a downstream Ubuntu issue. We need to report this as a bug against their package to Canonical, providing the 2 patches recommended by upstream.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

If it is the other one, that is bug 1673483, which has test builds of a wide and a narrower backport of the fixes mentioned in its comment #5.

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

@Matt Booth: This is not the same as bug 1673483, which DanB debugged the other day and identified fixes for, as the Nova stack traces are different for the two.

For bug 1673483, the Nova crash directly relates to the libvirt commits mentioned in its comment #5.

In this memory corruption bug, we don't quite know what the root cause is yet, which is why we didn't close it as a duplicate.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi (cross post I know),
there is a bit of a "somewhat dup" context around the list of the following bugs:
- bug 1646779
- bug 1643911
- bug 1638982
- bug 1673483

Unfortunately, I’ve never hit these bugs in any of my libvirt/qemu runs, and there are literally thousands of those every day due to all the automated tests. I also checked with our Ubuntu OpenStack team and they haven't seen them so far either. That makes it hard to debug them in depth, as you will all understand.

But while the “signature” on each bug is different, they still share a lot of things (let's call it the bugs' "meta signature"):
- there is no way to recreate yet other than watching gate test crash statistics
- they seem to haunt the openstack gate testing more than anything else
- most are directly or indirectly related to memory corruption

As I’m unable to reproduce any of these bugs myself, I’d like to get some help from anyone who can help to recreate them. Therefore I ask all people affected (mostly the same people on all these bugs anyway) to test the PPAs I created for bug 1673483. That is the one bug where, thanks to the great help of Kashyap, Matt and Dan (and more), at least a potential fix was identified.

That means two ppa's for you to try:
- Backport of the minimal recommended fix: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2620
- Backport of the full series of related fixes: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/2619

Especially since the potential errors these fixes address cover almost anything from memory leaks to deadlocks, there is a chance that they all might share the same root cause.
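
For anyone who wants to help test, one typical way to pull in such a PPA build on xenial would be something like the following (the short PPA names and the libvirt package names are assumptions based on the URLs above):

# Example for the "full series" backport PPA (2619); use .../2620 for the minimal one.
sudo add-apt-repository ppa:ci-train-ppa-service/2619
sudo apt-get update
sudo apt-get install libvirt-bin libvirt0
sudo systemctl restart libvirt-bin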

Revision history for this message
davidchen (synodavidchen) wrote :

@ChristianEhrhardt:

I can successfully reproduce this issue in my environment; it can be reproduced by stopping and starting multiple instances in parallel (in my case, 5 cirros instances).

I have tried both PPAs: PPA-2619 fixes the issue - after 7 days of stress libvirt still works fine - but PPA-2620 still hits the libvirt crash.

Below is my test script. The time to hit the issue is about 2 to 12 hours; libvirt will crash due to memory corruption.

The VM instances are created by the command "openstack server create --image cirros --flavor m1.nano cirrosX", X=1~5

#!/bin/bash
# Stress loop: stop and start the 5 cirros instances in parallel each round,
# waiting for all of them to reach the expected power state before moving on.

TEST_ROUND=3000

for round in `seq 1 1 $TEST_ROUND`; do

    echo "test round $round ..."

    # Stop all instances in parallel.
    for i in `seq 1 1 5`; do
        openstack server stop cirros$i &
    done

    sleep 1

    # Wait until every instance reports power state "Shutdown".
    for i in `seq 1 1 5`; do
        while true; do
            STATUS=`openstack server show cirros$i -f value -c OS-EXT-STS:power_state`
            if [ "$STATUS" == "Shutdown" ]; then
                break
            fi
            sleep 1
        done
    done

    # Start all instances in parallel.
    for i in `seq 1 1 5`; do
        openstack server start cirros$i &
    done

    sleep 1

    # Wait until every instance reports power state "Running".
    for i in `seq 1 1 5`; do
        while true; do
            STATUS=`openstack server show cirros$i -f value -c OS-EXT-STS:power_state`
            if [ "$STATUS" == "Running" ]; then
                break
            fi
            sleep 1
        done
    done

done

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Davidchen - thank you for providing the info on your repro.
m1.nano is really small; I'd hope that with a slightly bigger size of 512M, and 10 instead of 5 systems, I might crash it earlier.
If that doesn't trigger it, I will modify the test to follow the smaller sizes you had.

I wanted to take OpenStack out of the equation here, since in these cases it doesn't do much more than forward commands, which hopefully works.
Also, I wonder why that would trigger the issue when the fixes are around "asynchronizing" blockjobs, but I take every chance at a test - maybe I'm overlooking such a job at startup/shutdown.

I have started a slightly modified version and will let it run for a few days - hopefully hitting the issue as well.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Note: test script based on Davidchen's suggestion but without openstack and converted into an infinite run, now running on Horsea

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

@Davidchen - just to clarify details is on your system any other work going on, other guests or virtualization activity that might influence this?

I almost assumed that we would need the bigger version that was in PPA-2619 (which is why I created it right away) - big thanks for verifying both PPAs for your case!

Revision history for this message
davidchen (synodavidchen) wrote :

Hi ChristianEhrhardt,

You're welcome :)

My test environment does not have any other workload or guest VM, only the 5 cirros instances,

so I guess parallel instance start/stop is what triggers the libvirt crash.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Ok, thank you - my test is still running fine, but you reported 2-12 hours, so I'll give it at least a few days.

It also came to my mind: 5 x m1.nano defaults to 5 x 64 MB = 320 MB, in case this might be related to fragmentation or an overall memory shortage. Is your system low on memory, or was nano just the lowest and easiest flavor to use?

Revision history for this message
davidchen (synodavidchen) wrote :

@ChristianEhrhardt - I chose m1.nano because nano was just the lowest and easiest flavor to use. My test system has about 20 GB of memory.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks Davidchen,

FYI, now running for ~48 hours, currently on round 3511 and continuing.
So far no crash - the same libvirt PID is still running and there are no logs/crashdumps/...

I contacted a few people to consider an OpenStack setup on my test node, just in case starting/stopping via it might - contrary to what I assumed - still be involved.
If this continues to run just fine, I'll abort at some point and set up an OpenStack there.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

After slightly more than three days without a crash, of which almost exactly a day of pure CPU cycles was spent in libvirt, I'm starting to think that this won't trigger the bug as I had hoped.

I deployed a new OpenStack and now have a loop running based on openstack start/stop (using 10x m1.small, as I have slightly larger cloud images, not cirros, for now).
So this is more like your case, just slightly enhanced by my time measurements - I hope to trigger it on my end with this.

In the worst case, after the weekend I'll have to fetch cirros and create nano instances to be even more similar.

Revision history for this message
davidchen (synodavidchen) wrote :

@ChristianEhrhardt -

In my test environment, I have encountered this issue with different flavors and images,

so I think using a different image and flavor to reproduce it is ok :)

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

After about a day my keystone and percona died because they exceeded the limited size of the system I had, but no libvirt/qemu crash or failure so far.
I need to look at it again in more detail to see if I can keep it going longer, until the libvirt issue occurs.

@Davidchen: is the PPA-2619 build still running fine for you (it would be a week more by now, I guess), while the other PPA and the original libvirt failed reproducibly?

Revision history for this message
davidchen (synodavidchen) wrote :

@ChristianEhrhardt, I stopped the stress script after stressing PPA-2619 for 7 days without issue; libvirt still works fine, but there is currently no stress load, so I will start the stress script against PPA-2619 again.

On the other hand, since I can reproduce this issue in my environment, I will prepare another similar environment to reproduce it with libvirt 1.3.1-1ubuntu10.8, to see if the issue is still reproducible there.

If the issue is reproducible in that similar environment, I will try to reproduce it using virsh commands.
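
A virsh-only variant of that could look roughly like the following (a sketch, not the script actually used; it simply assumes the guests are already defined in libvirt):

# Forcibly stop and start every defined domain in parallel, bypassing the
# OpenStack API entirely, to see whether libvirtd alone can be crashed.
DOMAINS=$(virsh list --all --name)
while true; do
    for dom in $DOMAINS; do virsh destroy "$dom" & done
    sleep 5
    for dom in $DOMAINS; do virsh start "$dom" & done
    sleep 5
done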

Revision history for this message
davidchen (synodavidchen) wrote :

@ChristianEhrhardt -

In the newly deployed environment (with libvirt 1.3.1-1ubuntu10.8), libvirt has been working fine for more than 24 hours.

The stress test will keep running for two more days to check whether the issue is reproducible.

However, so far I think I may be missing some key point, which would explain why the new environment can't reproduce the issue.

I will do more tests and update here if I have any new findings :)

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

At least on my end it has run fine for ~7 days now with 20 guests.
Looking forward to hearing what you find to be the hidden trigger.

Revision history for this message
davidchen (synodavidchen) wrote :

@ChristianEhrhardt -

After a few days of stress, the issue could not be reproduced by the script I provided.

So I fell back to a simpler script, which I had used at the very beginning to reproduce the issue in my environment: a script that just starts/stops the instances and sleeps (without any VM status check).

After changing to this script, the issue is reproducible (but it took a few days in my test), with libvirt-bin (1.3.1-1ubuntu10.8), qemu (1:2.5+dfsg-5ubuntu10.11) and 20 cirros guests.

The real instance behaviour is more complex with the simple script, because sometimes the OpenStack API returns a conflict error when the VM status does not match:

Cannot 'stop' instance 971fe132-55de-4ff7-b1e8-c556390964c1 while it is in vm_state stopped (HTTP 409)
Cannot 'start' instance 01eae3e9-6330-4594-a1dd-dcdbf3a4d392 while it is in vm_state active (HTTP 409)

So some instances may be started/stopped at the same time, because some actions may not have been applied after the previous command, and some pattern of this seems to trigger the libvirt crash.

However, because the behaviour is very complex, I'm not sure whether the issue is reproducible in other environments.

For example, if the sleep time is long enough to let all VMs finish their task, then there will be no conflict errors and the script behaves like the one I provided earlier, which cannot reproduce the issue. On the other hand, if the sleep time is too short or there are too many guests, the OpenStack environment may not be able to handle the request rate, which may lead to other client or server errors.
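
For completeness, a minimal sketch of that simpler start/stop-and-sleep loop (the instance names and sleep interval are assumptions; the point is that requests deliberately overlap and conflicts are ignored):

#!/bin/bash
# Fire stop/start requests for all guests in parallel with only a short fixed
# sleep and no status check, so requests race each other and some fail with
# HTTP 409 - this overlapping pattern is what appears to trigger the crash.
while true; do
    for i in `seq 1 1 20`; do
        openstack server stop cirros$i &
    done
    sleep 5
    for i in `seq 1 1 20`; do
        openstack server start cirros$i &
    done
    sleep 5
done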

Sean Dague (sdague)
Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Clark Boylan (cboylan) wrote :

It should be noted that after switching to the Ubuntu Cloud Archive, which includes a newer libvirt, this issue went away in the gate.
