live_migration_permit_post_copy and PAUSED vm fails

Bug #1946752 reported by Olaf Seibert
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========

Live-migrating a VM in PAUSED state fails when post-copy is permitted.

Steps to reproduce
==================
With this setting in /etc/nova/nova-compute.conf:

live_migration_permit_post_copy = True

live-migrating a QEMU VM that is in PAUSED state fails:

openstack server migrate <UUID> --host <DEST HOST> --live-migration --os-compute-api-version 2.30

Expected result
===============

Migration starts and succeeds

Actual result
=============

Migration fails. From /var/log/nova/nova-compute.log:

2021-10-12 12:15:24.847 15732 ERROR nova.virt.libvirt.driver [req-85846e61-8afe-4934-a192-7e6a1d0c7bb1 25633ac974224d6bbb8bb4b84fd77fb1 60ac4b0ca9f649e39a8e88caaa703b5e - default default] [instance: ceb414eb-932f-46bf-9bc6-27db0fb99cb1] Live Migration failure: argument unsupported: post-copy migration is not supported with non-live or paused migration: libvirtError: argument unsupported: post-copy migration is not supported with non-live or paused migration

This is rather annoying, because post-copy is very likely not needed when migrating a VM that is not doing anything.

On the other hand, post-copy is pretty much essential to get VMs migrated that change their memory faster than it can be migrated.

Environment
===========

I am seeing this problem in Queens and in Ussuri.

# dpkg -l | grep nova
ii nova-common 2:17.0.13-0ubuntu1+syseleven3~bionic all OpenStack Compute - common files
ii nova-compute 2:17.0.13-0ubuntu1+syseleven3~bionic all OpenStack Compute - compute node base
ii nova-compute-kvm 2:17.0.13-0ubuntu1+syseleven3~bionic all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:17.0.13-0ubuntu1+syseleven3~bionic all OpenStack Compute - compute node libvirt support
ii python-nova 2:17.0.13-0ubuntu1+syseleven3~bionic all OpenStack Compute Python libraries
ii python-novaclient 2:9.1.1-0ubuntu1 all client library for OpenStack Compute API - Python 2.7
ii python3-novaclient 2:9.1.1-0ubuntu1 all client library for OpenStack Compute API - 3.x

We are using QEMU + KVM + libvirt:

ii libvirt-clients 4.0.0-1ubuntu8.19 amd64 Programs for the libvirt library
ii libvirt-daemon 4.0.0-1ubuntu8.19 amd64 Virtualization daemon
ii libvirt-daemon-system 4.0.0-1ubuntu8.19 amd64 Libvirt daemon configuration files
ii libvirt0:amd64 4.0.0-1ubuntu8.19 amd64 library for interfacing with different virtualization systems
ii python-libvirt 4.0.0-1 amd64 libvirt Python bindings
ii python3-libvirt 4.0.0-1 amd64 libvirt Python 3 bindings

ii qemu-kvm 1:2.11+dfsg-1ubuntu7.37 amd64 QEMU Full virtualization on x86 hardware

We use shared storage (Quobyte) but that should not be relevant here.
We use Queens with Midonet and Ussuri with OVS which should not matter here either.

(In fact, Queens has a much smarter strategy to switch to post-copy than Ussuri...)

A workaround in Nova could be that it doesn't ask for VIR_MIGRATE_POSTCOPY if the VM is in paused state.
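
The idea can be sketched in a few lines of Python. This is only an illustration, not Nova code: flags_for_instance is a hypothetical helper, and the flag constants are defined locally (with the values from libvirt's virDomainMigrateFlags enum) so that the snippet runs without libvirt installed.

```python
# Values mirror libvirt's virDomainMigrateFlags enum; they are defined
# locally (an assumption of this sketch) so it runs without libvirt.
VIR_MIGRATE_LIVE = 1 << 0
VIR_MIGRATE_PEER2PEER = 1 << 1
VIR_MIGRATE_POSTCOPY = 1 << 15

PAUSED = "paused"  # the vm_state string that appears in the nova logs


def flags_for_instance(base_flags: int, vm_state: str) -> int:
    """Return the migration flags, without post-copy for paused VMs.

    Hypothetical helper for illustration; not part of Nova.
    """
    if vm_state == PAUSED:
        # Clear only the post-copy bit and keep all other flags.
        return base_flags & ~VIR_MIGRATE_POSTCOPY
    return base_flags


base = VIR_MIGRATE_LIVE | VIR_MIGRATE_PEER2PEER | VIR_MIGRATE_POSTCOPY
print(bool(flags_for_instance(base, "active") & VIR_MIGRATE_POSTCOPY))
print(bool(flags_for_instance(base, "paused") & VIR_MIGRATE_POSTCOPY))
```

A running VM keeps the post-copy flag, while a paused one is migrated with plain pre-copy, which libvirt accepts.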

(It would also be nice to be able to migrate SUSPENDED VMs...)

Revision history for this message
Olaf Seibert (oseibert-sys11) wrote :

Before enabling post-copy, we sometimes had VMs which did not manage to get migrated in half an hour. The memory transfer statistics showed that by that time, often around 2 orders of magnitude more memory had been transferred than the VM's memory size. So the migration had been flooding the network for half an hour at high speed, wasting lots of bandwidth and making it unavailable for other VMs.

So having post-copy available is quite important. That's why I'm disappointed that Ussuri can only use it after a fixed timeout. Queens automatically starts post-copy if the normal memory migration does not make enough progress. Sometimes that already kicks in after 30 seconds. On Ussuri you have to wait (and waste bandwidth) for several minutes (and the default timeout is waaaay too long; we set it to live_migration_completion_timeout = 10, down from 800).
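
To illustrate what a progress-based switch means, here is a simplified Python sketch. It is not the Queens code: the function name, its parameters and the 10% threshold are assumptions chosen for the example.

```python
def should_switch_to_postcopy(iteration: int,
                              remaining_now: int,
                              remaining_before: int,
                              min_progress_pct: float = 10.0) -> bool:
    """Decide whether pre-copy is converging too slowly.

    iteration        -- number of the current memory copy pass (1-based)
    remaining_now    -- bytes still to transfer after this pass
    remaining_before -- bytes still to transfer after the previous pass
    min_progress_pct -- assumed threshold; below it we give up on pre-copy
    """
    # The first pass copies all memory once, so it says nothing about
    # how fast the guest dirties pages.
    if iteration <= 1 or remaining_before <= 0:
        return False
    progress_pct = (remaining_before - remaining_now) * 100 / remaining_before
    return progress_pct < min_progress_pct


# A guest that dirties memory almost as fast as it is copied:
print(should_switch_to_postcopy(2, remaining_now=950, remaining_before=1000))
# A guest that is converging nicely:
print(should_switch_to_postcopy(2, remaining_now=500, remaining_before=1000))
```

With a check like this the switch happens as soon as a copy pass stops making headway, instead of after a fixed number of seconds.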

Revision history for this message
Artom Lifshitz (notartom) wrote :

So while we do indeed start post_copy unconditionally regardless of instance state, and this is not great for PAUSED instances, the switch to post_copy is guarded by a try/except [1] that should prevent post_copy failure from failing the migration. So I'm wondering where in the code the ERROR that you provided in your original description came from. Would you mind including the full traceback in this bug, or better yet, attaching the whole nova-compute.log?

Thanks!

[1] https://opendev.org/openstack/nova/src/branch/stable/ussuri/nova/virt/libvirt/migration.py#L588-L593

Changed in nova:
status: New → Incomplete
Revision history for this message
Olaf Seibert (oseibert-sys11) wrote :

As I recall, the error does not occur at the time when the post-copy is attempted.

Rather, the error already occurs at the very start of the migration. The error text "post-copy migration is not supported with non-live or paused migration" comes from https://github.com/libvirt/libvirt/blob/master/src/qemu/qemu_migration.c#L2362

We run locally with this patch applied:

From: Olaf Seibert <email address hidden>
Date: Fri, 15 Oct 2021 12:02:12 +0000
Subject: Disable post-copy if a VM is PAUSED

because libvirt does not support this combination.
We don't need post-copy on those VMs anyway.
---
 nova/virt/libvirt/driver.py | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/nova/virt/libvirt/driver.py b/nova/virt/libvirt/driver.py
index 3e78723..9ffe60a 100644
--- a/nova/virt/libvirt/driver.py
+++ b/nova/virt/libvirt/driver.py
@@ -9016,6 +9016,9 @@ class LibvirtDriver(driver.ComputeDriver):
             else:
                 migration_flags = self._live_migration_flags

+            if instance.vm_state == vm_states.PAUSED:
+                migration_flags &= ~libvirt.VIR_MIGRATE_POSTCOPY
+
             serial_listen_addr = libvirt_migrate.serial_listen_addr(
                 migrate_data)
             if not serial_listen_addr:
@@ -9262,6 +9265,7 @@ class LibvirtDriver(driver.ComputeDriver):
         is_post_copy_enabled = self._is_post_copy_enabled(migration_flags)
         # vpmem does not support post copy
         is_post_copy_enabled &= not bool(self._get_vpmems(instance))
+        is_post_copy_enabled &= not (instance.vm_state == vm_states.PAUSED)
         while True:
             info = guest.get_job_info()

With this, the migrations work fine.
However, the bug about two migrations in a row still applies: https://bugs.launchpad.net/nova/+bug/1947725

If you still want a log with the original errors, I will have to undo this patch to recreate it.

Revision history for this message
Olaf Seibert (oseibert-sys11) wrote :

The log doesn't show a stack backtrace anyway:

7023a7024,7029
> 2022-05-10 12:06:33.966 44202 INFO nova.compute.manager [-] [instance: d1fbcb4f-a743-4d4d-98bc-3635c320958b] Took 2.16 seconds for pre_live_migration on destination host zbk130711.zbk.sys11cloud.net.
> 2022-05-10 12:06:34.026 44202 INFO nova.virt.osinfo [-] Cannot load Libosinfo: (cannot import name Libosinfo, introspection typelib not found)
> 2022-05-10 12:06:34.088 44202 ERROR nova.virt.libvirt.driver [-] [instance: d1fbcb4f-a743-4d4d-98bc-3635c320958b] Live Migration failure: argument unsupported: post-copy migration is not supported with non-live or paused migration: libvirt.libvirtError: argument unsupported: post-copy migration is not supported with non-live or paused migration
> 2022-05-10 12:06:34.499 44202 ERROR nova.virt.libvirt.driver [-] [instance: d1fbcb4f-a743-4d4d-98bc-3635c320958b] Migration operation has aborted
> 2022-05-10 12:06:34.553 44202 INFO nova.compute.manager [-] [instance: d1fbcb4f-a743-4d4d-98bc-3635c320958b] Swapping old allocation on dict_keys(['2dced892-072f-4011-b124-0c9361e355d5']) held by migration c20c6e44-280b-4ba6-8d71-cd5f60279623 for instance
> 2022-05-10 12:06:36.301 44202 WARNING nova.compute.manager [req-4f23375d-6a38-4a50-af44-705a98a2b802 1869b07324e64160b2b8069914473fb1 6648715e73fa4fa68967b1ac8787cca1 - default default] [instance: d1fbcb4f-a743-4d4d-98bc-3635c320958b] Received unexpected event network-vif-plugged-4e391962-7cf4-4399-85cd-ea5880d8965a for instance with vm_state paused and task_state migrating.

Revision history for this message
Olaf Seibert (oseibert-sys11) wrote :

"This bug report will be marked for expiration in 38 days if no further activity occurs. (find out why)"
Closing bugs without taking action is... suboptimal.

Revision history for this message
Olaf Seibert (oseibert-sys11) wrote :

Somebody marked the ticket as "incomplete", and then it automatically expires after a while if there is no activity. I did respond to the question, yet there is still the notice "This bug report will be marked for expiration in 38 days if no further activity occurs. (find out why) " This is of course not a decent way to close bugs. So I changed the ticket status away from Incomplete.

Changed in nova:
status: Incomplete → Confirmed
Revision history for this message
Olaf Seibert (oseibert-sys11) wrote :

No progress on this one (9 months later)?

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I see the same thing in an upstream CI consuming the Antelope version of nova: https://github.com/openstack-k8s-operators/nova-operator/pull/621#issuecomment-1859848726

tags: added: live-migration
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

[I'm commenting here, since the other comments are here]

I just talked to one of the libvirt maintainers (thanks, Daniel Berrangé), and the below is the summary:

Migrating a paused instance with post-copy is the same as using pre-copy.

"When postcopy is activated the guest is started on the target host immediately. In the background the pre-copy will carry on streaming across pages, while QEMU in the target will request any pages it touches async. If the QEMU on the target is paused then there's no async page requests, so the whole thing is functionally equivalent to just using pre-copy."

That said, they acknowledge that the libvirt check is too strict: "libvirt should allow VIR_MIGRATE_POSTCOPY to be set even if a guest is paused, as that flag merely says that we want to use post-copy later".

    - - -

Meanwhile, it is relatively easy (as the draft patch above shows) to fix this in Nova by taking the instance's paused state into account when setting the post-copy flag in the libvirt driver.
