libvirtd going into a tight loop causing instances to not transition to ACTIVE

Bug #1902276 reported by Michael Johnson
This bug affects 1 person
Affects            Status        Importance  Assigned to  Milestone
libvirt (Ubuntu)   Fix Released  Undecided   Unassigned
Bionic             Incomplete    Undecided   Unassigned

Bug Description

Description
===========
This is current master branch (wallaby) of OpenStack.

We see this regularly, but it's intermittent. Said another way: we see it on jobs daily, but not on every run.

We are seeing nova instances that do not transition to ACTIVE inside five minutes. Investigating this led us to find that libvirtd seems to be going into a tight loop on an instance delete.

The 136MB log is here: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_c77/759973/3/check/octavia-v2-dsvm-scenario/c77fe63/controller/logs/libvirt/libvirtd_log.txt

The overall job logs are here:
https://zuul.opendev.org/t/openstack/build/c77fe63a94ef4298872ad5f40c5df7d4/logs

When running the Octavia scenario test suite, we occasionally see nova instances fail to become ACTIVE in a timely manner, causing timeouts and failures. In investigating this issue we found the libvirtd log was 136MB.

Most of the file is full of this repeating:
2020-10-28 23:45:06.330+0000: 20852: debug : qemuMonitorIO:767 : Error on monitor internal error: End of file from qemu monitor
2020-10-28 23:45:06.330+0000: 20852: debug : qemuMonitorIO:788 : Triggering EOF callback
2020-10-28 23:45:06.330+0000: 20852: debug : qemuProcessHandleMonitorEOF:301 : Received EOF on 0x7f6278014ca0 'instance-00000001'
2020-10-28 23:45:06.330+0000: 20852: debug : qemuProcessHandleMonitorEOF:305 : Domain is being destroyed, EOF is expected

Here is a snippet for the lead in to the repeated lines:
http://paste.openstack.org/show/799559/

It appears to be a tight loop, repeating many times per second.

Eventually it does stop and things seem to go back to normal in nova.

Here is the snippet of the end of the loop in the log:
http://paste.openstack.org/show/799560/

Tags: libvirt
description: updated
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

There is an interesting sequence of log lines in the compute log for the nova instance (instance-00000001) that is reported in the repeated libvirt logs.

Oct 28 23:45:02.776968 nested-virt-ubuntu-bionic-vexxhost-ca-ymq-1-0021169707 nova-compute[16413]: INFO nova.virt.libvirt.driver [-] [instance: 7a343084-ac33-4dcb-8343-744fe2c1fb26] Instance spawned successfully.
Oct 28 23:45:02.776968 nested-virt-ubuntu-bionic-vexxhost-ca-ymq-1-0021169707 nova-compute[16413]: INFO nova.compute.manager [None req-e609f22f-0c5b-4e75-8fed-2cc5ea78decb admin admin] [instance: 7a343084-ac33-4dcb-8343-744fe2c1fb26] Took 326.09 seconds to spawn the instance on the hypervisor.
Oct 28 23:45:02.777289 nested-virt-ubuntu-bionic-vexxhost-ca-ymq-1-0021169707 nova-compute[16413]: DEBUG nova.compute.manager [None req-e609f22f-0c5b-4e75-8fed-2cc5ea78decb admin admin] [instance: 7a343084-ac33-4dcb-8343-744fe2c1fb26] Checking state {{(pid=16413) _get_power_state /opt/stack/nova/nova/compute/manager.py:1557}}
Oct 28 23:45:02.828879 nested-virt-ubuntu-bionic-vexxhost-ca-ymq-1-0021169707 nova-compute[16413]: DEBUG nova.compute.manager [None req-1efcbd1f-1af9-4e08-8572-e04879fa08fb None None] [instance: 7a343084-ac33-4dcb-8343-744fe2c1fb26] Checking state {{(pid=16413) _get_power_state /opt/stack/nova/nova/compute/manager.py:1557}}
Oct 28 23:45:02.832224 nested-virt-ubuntu-bionic-vexxhost-ca-ymq-1-0021169707 nova-compute[16413]: DEBUG nova.compute.manager [None req-1efcbd1f-1af9-4e08-8572-e04879fa08fb None None] [instance: 7a343084-ac33-4dcb-8343-744fe2c1fb26] Synchronizing instance power state after lifecycle event "Resumed"; current vm_state: building, current task_state: deleting, current DB power_state: 0, VM power_state: 1 {{(pid=16413) handle_lifecycle_event /opt/stack/nova/nova/compute/manager.py:1294}}
Oct 28 23:45:02.865991 nested-virt-ubuntu-bionic-vexxhost-ca-ymq-1-0021169707 nova-compute[16413]: INFO nova.compute.manager [None req-1efcbd1f-1af9-4e08-8572-e04879fa08fb None None] [instance: 7a343084-ac33-4dcb-8343-744fe2c1fb26] During sync_power_state the instance has a pending task (deleting). Skip.
Oct 28 23:45:03.052749 nested-virt-ubuntu-bionic-vexxhost-ca-ymq-1-0021169707 nova-compute[16413]: DEBUG nova.compute.utils [None req-e609f22f-0c5b-4e75-8fed-2cc5ea78decb admin admin] [instance: 7a343084-ac33-4dcb-8343-744fe2c1fb26] Conflict updating instance 7a343084-ac33-4dcb-8343-744fe2c1fb26. Expected: {'task_state': ['spawning']}. Actual: {'task_state': 'deleting'} {{(pid=16413) notify_about_instance_usage /opt/stack/nova/nova/compute/utils.py:429}}
Oct 28 23:45:03.065608 nested-virt-ubuntu-bionic-vexxhost-ca-ymq-1-0021169707 nova-compute[16413]: DEBUG nova.compute.manager [None req-e609f22f-0c5b-4e75-8fed-2cc5ea78decb admin admin] [instance: 7a343084-ac33-4dcb-8343-744fe2c1fb26] Instance disappeared during build. {{(pid=16413) _do_build_and_run_instance /opt/stack/nova/nova/compute/manager.py:2266}}

nova reports that the instance was built successfully, and right after that it reports that the instance disappeared during build.

It seems that the whole libvirt error loop happens after nova resumed the guest in libvirt. The fact that the overall instance boot took mo...

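A minimal Python sketch of the compare-and-swap idea behind the "Conflict updating instance ... Expected/Actual" line above (hypothetical names and structure, not nova's actual code): the spawn path only flips the instance to ACTIVE if task_state is still 'spawning', but the concurrent delete has already set it to 'deleting', so the update is refused and the build path bails out with "Instance disappeared during build."

# Hypothetical compare-and-swap guard on task_state; not nova's real code.
class UnexpectedTaskStateError(Exception):
    pass

def finish_spawn(db, uuid):
    instance = db[uuid]
    expected = ['spawning']
    if instance['task_state'] not in expected:
        raise UnexpectedTaskStateError(
            "Expected: {'task_state': %s}. Actual: {'task_state': %r}"
            % (expected, instance['task_state']))
    instance.update(vm_state='active', task_state=None)

db = {'7a343084': {'vm_state': 'building', 'task_state': 'spawning'}}
db['7a343084']['task_state'] = 'deleting'   # the concurrent delete wins the race

try:
    finish_spawn(db, '7a343084')
except UnexpectedTaskStateError as exc:
    print(exc)   # spawn then gives up, matching "Instance disappeared during build."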

tags: added: libvirt
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Posting the start and end of the libvirt tight loop from the libvirtd log file. (As the pastebins expire.)

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

tl;dr: The libvirt version in the above logs is libvirt-5.4.0 (with libglib-2.56.4). Libvirt developers suggest that there are some fixes in this area in upstream libvirt 6.7.0, and it might potentially also be fixed in libvirt-6.1.0 (or higher).

Long
----

Talking to Michal from the upstream libvirt team, this sounds like a libvirt bug. To quote him from IRC:

"When we [libvirt] start a guest and open its [QEMU] monitor we put the FD into our event loop. And whenever QEMU emits an event, a handler is called. But in this instance the handler was called over and over. On the other hand, there were some issues in glib's event loop once we [libvirt] switched to that."

So the preferred order of testing libvirt versions (based on what's available):

- try libvirt-6.9.0; if not
- try libvirt-6.7.0; if not
- try libvirt-6.1.0

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

@Michael J.: is there a way to change the libvirt version in the octavia jobs to see if the above suggestion helps?

Revision history for this message
Michael Johnson (johnsom) wrote :

I'm not sure; it is being installed via the nova devstack plugin. Is there a way to point nova at a different version?

2020-10-28 23:07:50.003 | + lib/nova_plugins/hypervisor-libvirt:install_nova_hypervisor:84 :   install_libvirt
2020-10-28 23:07:50.523 | Get:25 http://mirror.ca-ymq-1.vexxhost.opendev.org/ubuntu-cloud-archive bionic-updates/train/main amd64 libvirt-daemon amd64 5.4.0-0ubuntu5.4~cloud0 [1,412 kB]

I also see that the job is running on bionic; we could try flipping that to focal, which might pull in a newer version.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Yeah, flipping that to Focal on master is a good idea (and was a V cycle goal anyhow).

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Are we still seeing this issue?

Revision history for this message
Michael Johnson (johnsom) wrote :
Revision history for this message
Lee Yarwood (lyarwood) wrote :

What do we want to do with this bug? Ultimately the underlying issue is with the version of libvirt shipped in Bionic, which is outside of our control. Should we move the bug to the libvirt component and ask the Ubuntu team whether a rebase is even feasible?

no longer affects: libvirt
Lee Yarwood (lyarwood)
no longer affects: nova
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thanks for the bump on this one.
I've marked it fixed in Focal according to the reports above.
That means it is also fixed in the Ubuntu Cloud Archive Ussuri, which is available for Bionic.
Only the libvirt version in Bionic itself (4.0.0-1ubuntu8.19) is still affected by this.

From your testing and discussions, has there been a chance in the meantime to identify:
a) an individual commit (or set of commits) that fixed this between 4.0 and 6.0?
   Comment 3 mentioned further improvements in 6.1 and 6.7, but since the tests report 6.0 as
   OK, maybe the change we need is smaller than one might have thought.
b) a smaller/simpler test that is easier than "run OpenStack's tests"?

Changed in libvirt (Ubuntu):
status: New → Fix Released
Changed in libvirt (Ubuntu Bionic):
status: New → Incomplete