Live migration post-copy not working as expected

Bug #2052473 reported by keerthivasan
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: New
Importance: Undecided
Assigned to: keerthivasan

Bug Description

Description
===========
I am trying to enable post-copy live migration using the config below, but I am seeing a "Postcopy is not supported" error.

block_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_NON_SHARED_INC
cpu_mode = custom
cpu_model_extra_flags = -ds,-acpi,+ss,-ht,-tm,-pbe,-dtes64,-monitor,-ds_cpl,+vmx,-smx,-est,-tm2,-xtpr,+pdcm,-dca,+tsc_adjust,-intel-pt,+md-clear,+stibp,+ssbd,+pdpe1gb,-invtsc,-hle,-rtm,-mpx,-xsavec,-xgetbv1
cpu_models = Skylake-Client-IBRS
live_migration_bandwidth = 900
live_migration_downtime = 100
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE
live_migration_permit_post_copy = True
live_migration_timeout_action=force_complete
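
A minimal sketch (not part of the original report) for confirming what a compute node's nova.conf sets for the two post-copy-related options above. It assumes the options live in the `[libvirt]` section; `post_copy_settings` is a hypothetical helper name, the sample text is trimmed from the config above, and the fallback values are assumed defaults.

```python
import configparser

# Trimmed from the config in this report; the [libvirt] header is an assumption.
SAMPLE = """
[libvirt]
live_migration_downtime = 100
live_migration_permit_post_copy = True
live_migration_timeout_action=force_complete
"""


def post_copy_settings(conf_text):
    """Return the post-copy related options found in a nova.conf string."""
    # interpolation=None so values containing '%' are read literally.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return {
        # Fallbacks are assumed defaults, used when the option is absent.
        "permit_post_copy": parser.getboolean(
            "libvirt", "live_migration_permit_post_copy", fallback=False),
        "timeout_action": parser.get(
            "libvirt", "live_migration_timeout_action", fallback="abort"),
    }


print(post_copy_settings(SAMPLE))
```

Note that this only shows what nova will ask for; whether QEMU accepts post-copy on the host is a separate question.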

Steps to reproduce
==================

KVM hypervisor

Using the OpenStack Antelope base version

qemu-system-x86_64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-2ubuntu6.16)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers

libvirtd (libvirt) 8.0.0

# Create a general VM once the config is set
# Perform a live migration, either with or without block migration

After the pre-migration phase, the migration is triggered successfully, but while memory is being copied the "Postcopy is not supported" error appears.

Expected result
===============
Migration should be successful.

Actual result
=============

compute.log

Feb 05 22:16:02 cdc-appblx095-37 nova-compute[1156821]: 2024-02-05 22:16:02.988 1156821 INFO nova.virt.libvirt.migration [None req-f6650a9f-9465-40ca-9981-fff470751fc7 4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - - default default] [instance: 31fcf3ba-c0b1-4c74-afdd-685ba45a11f0] Increasing downtime to 10 ms after 0 sec elapsed time
Feb 05 22:16:03 cdc-appblx095-37 nova-compute[1156821]: 2024-02-05 22:16:03.069 1156821 INFO nova.virt.libvirt.driver [None req-f6650a9f-9465-40ca-9981-fff470751fc7 4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - - default default] [instance: 31fcf3ba-c0b1-4c74-afdd-685ba45a11f0] Migration running for 0 secs, memory 100% remaining (bytes processed=0, remaining=0, total=0); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Feb 05 22:16:03 cdc-appblx095-37 nova-compute[1156821]: 2024-02-05 22:16:03.571 1156821 DEBUG nova.virt.libvirt.migration [None req-f6650a9f-9465-40ca-9981-fff470751fc7 4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - - default default] [instance: 31fcf3ba-c0b1-4c74-afdd-685ba45a11f0] Current 10 elapsed 1 steps [(0, 10), (960, 19), (1920, 28), (2880, 37), (3840, 46), (4800, 55), (5760, 64), (6720, 73), (7680, 82), (8640, 91), (9600, 100)] update_downtime /openstack/venvs/nova-27.4.0/lib/python3.10/site-packages/nova/virt/libvirt/migration.py:512
Feb 05 22:16:03 cdc-appblx095-37 nova-compute[1156821]: 2024-02-05 22:16:03.572 1156821 DEBUG nova.virt.libvirt.migration [None req-f6650a9f-9465-40ca-9981-fff470751fc7 4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - - default default] [instance: 31fcf3ba-c0b1-4c74-afdd-685ba45a11f0] Downtime does not need to change update_downtime /openstack/venvs/nova-27.4.0/lib/python3.10/site-packages/nova/virt/libvirt/migration.py:525
Feb 05 22:16:04 cdc-appblx095-37 nova-compute[1156821]: 2024-02-05 22:16:04.074 1156821 DEBUG nova.virt.libvirt.migration [None req-f6650a9f-9465-40ca-9981-fff470751fc7 4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - - default default] [instance: 31fcf3ba-c0b1-4c74-afdd-685ba45a11f0] Current 10 elapsed 1 steps [(0, 10), (960, 19), (1920, 28), (2880, 37), (3840, 46), (4800, 55), (5760, 64), (6720, 73), (7680, 82), (8640, 91), (9600, 100)] update_downtime /openstack/venvs/nova-27.4.0/lib/python3.10/site-packages/nova/virt/libvirt/migration.py:512
Feb 05 22:16:04 cdc-appblx095-37 nova-compute[1156821]: 2024-02-05 22:16:04.075 1156821 DEBUG nova.virt.libvirt.migration [None req-f6650a9f-9465-40ca-9981-fff470751fc7 4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - - default default] [instance: 31fcf3ba-c0b1-4c74-afdd-685ba45a11f0] Downtime does not need to change update_downtime /openstack/venvs/nova-27.4.0/lib/python3.10/site-packages/nova/virt/libvirt/migration.py:525
Feb 05 22:16:04 cdc-appblx095-37 nova-compute[1156821]: 2024-02-05 22:16:04.577 1156821 DEBUG nova.virt.libvirt.migration [None req-f6650a9f-9465-40ca-9981-fff470751fc7 4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - - default default] [instance: 31fcf3ba-c0b1-4c74-afdd-685ba45a11f0] Current 10 elapsed 2 steps [(0, 10), (960, 19), (1920, 28), (2880, 37), (3840, 46), (4800, 55), (5760, 64), (6720, 73), (7680, 82), (8640, 91), (9600, 100)] update_downtime /openstack/venvs/nova-27.4.0/lib/python3.10/site-packages/nova/virt/libvirt/migration.py:512
Feb 05 22:16:04 cdc-appblx095-37 nova-compute[1156821]: 2024-02-05 22:16:04.577 1156821 DEBUG nova.virt.libvirt.migration [None req-f6650a9f-9465-40ca-9981-fff470751fc7 4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - - default default] [instance: 31fcf3ba-c0b1-4c74-afdd-685ba45a11f0] Downtime does not need to change update_downtime /openstack/venvs/nova-27.4.0/lib/python3.10/site-packages/nova/virt/libvirt/migration.py:525
Feb 05 22:16:04 cdc-appblx095-37 nova-compute[1156821]: 2024-02-05 22:16:04.638 1156821 ERROR nova.virt.libvirt.driver [None req-f6650a9f-9465-40ca-9981-fff470751fc7 4807f132b7bb47bbabbe50de9bd974c8 b61fc56101024f498d4d95e863c7333f - - default default] [instance: 31fcf3ba-c0b1-4c74-afdd-685ba45a11f0] Live Migration failure: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported: libvirt.libvirtError: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported
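
The step list in the DEBUG lines above is consistent with a linear ramp from live_migration_downtime / steps up to the full live_migration_downtime of 100 ms. The sketch below reproduces those numbers. It is an illustration inferred from the log output, not a copy of nova's code; the function name is hypothetical, and the values of 10 steps and 960 seconds between steps are read off the log rather than taken from the config.

```python
def downtime_ramp(max_downtime_ms, steps, delay_s):
    """Return (elapsed_seconds, allowed_downtime_ms) pairs for a linear ramp."""
    base = max_downtime_ms / steps               # first allowed downtime
    offset = (max_downtime_ms - base) / steps    # increase per step
    return [(int(delay_s * i), int(base + offset * i))
            for i in range(steps + 1)]


# 100 ms comes from live_migration_downtime in the report;
# 10 and 960 are inferred from the logged step list.
for elapsed, downtime in downtime_ramp(100, 10, 960):
    print(f"after {elapsed:>5} s allow {downtime:>3} ms downtime")
```

Read this way, "Increasing downtime to 10 ms after 0 sec elapsed time" is simply the first step of the ramp and is unrelated to the post-copy failure.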

libvirtd.log
-------

2024-02-05 21:27:56.254+0000: 2177204: debug : qemuMonitorSetMigrationCapabilities:3689 : mon:0x7f8d68080460 vm:0x7f8d6804b930 fd:82
2024-02-05 21:27:56.254+0000: 2177204: info : qemuMonitorSend:914 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d68080460 msg={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":true},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":true},{"capability":"multifd","state":false},{"capability":"dirty-bitmaps","state":false},{"capability":"return-path","state":true}]},"id":"libvirt-402"}^M
 fd=-1
2024-02-05 21:27:56.254+0000: 16270: info : qemuMonitorIOWrite:402 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d68080460 buf={"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":true},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":true},{"capability":"multifd","state":false},{"capability":"dirty-bitmaps","state":false},{"capability":"return-path","state":true}]},"id":"libvirt-402"}^M
 len=531 ret=531 errno=0
2024-02-05 21:27:56.258+0000: 16270: debug : qemuMonitorJSONIOProcessLine:220 : Line [{"id": "libvirt-402", "error": {"class": "GenericError", "desc": "Postcopy is not supported"}}]
2024-02-05 21:27:56.258+0000: 16270: info : qemuMonitorJSONIOProcessLine:239 : QEMU_MONITOR_RECV_REPLY: mon=0x7f8d68080460 reply={"id": "libvirt-402", "error": {"class": "GenericError", "desc": "Postcopy is not supported"}}
2024-02-05 21:27:56.258+0000: 2177204: debug : qemuMonitorJSONCheckErrorFull:385 : unable to execute QEMU command {"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"xbzrle","state":false},{"capability":"auto-converge","state":false},{"capability":"rdma-pin-all","state":false},{"capability":"postcopy-ram","state":true},{"capability":"compress","state":false},{"capability":"pause-before-switchover","state":false},{"capability":"late-block-activate","state":true},{"capability":"multifd","state":false},{"capability":"dirty-bitmaps","state":false},{"capability":"return-path","state":true}]},"id":"libvirt-402"}: {"id":"libvirt-402","error":{"class":"GenericError","desc":"Postcopy is not supported"}}
2024-02-05 21:27:56.258+0000: 2177204: error : qemuMonitorJSONCheckErrorFull:397 : internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported
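
For anyone sifting through similar libvirtd debug logs, here is a small sketch of how the QMP exchange above can be read programmatically. It relies only on what the log shows: the request carries a list of capability/state pairs, and a failed reply carries an "error" object with "class" and "desc". The helper names are hypothetical, and the request string is trimmed from the logged message.

```python
import json


def enabled_capabilities(request_text):
    """List the migration capabilities a migrate-set-capabilities call turns on."""
    request = json.loads(request_text)
    caps = request["arguments"]["capabilities"]
    return [c["capability"] for c in caps if c["state"]]


def qmp_error(reply_text):
    """Return (class, desc) if the QMP reply carries an error, else None."""
    error = json.loads(reply_text).get("error")
    if error is None:
        return None
    return error.get("class"), error.get("desc")


# Trimmed from the QEMU_MONITOR_SEND_MSG line above.
REQUEST = ('{"execute":"migrate-set-capabilities","arguments":{"capabilities":['
           '{"capability":"xbzrle","state":false},'
           '{"capability":"postcopy-ram","state":true},'
           '{"capability":"late-block-activate","state":true},'
           '{"capability":"return-path","state":true}]},"id":"libvirt-402"}')
# Copied from the QEMU_MONITOR_RECV_REPLY line above.
REPLY = ('{"id": "libvirt-402", "error": {"class": "GenericError", '
         '"desc": "Postcopy is not supported"}}')

print(enabled_capabilities(REQUEST))
print(qmp_error(REPLY))
```

The point of the exercise: libvirt did request postcopy-ram, so the nova option took effect, and the refusal came from QEMU itself.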

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Is there anything special about the VM you are migrating, in terms of what is requested by the flavour?

We have post-copy enabled by default in our downstream OpenStack distribution.

There are some features that are not supported with it, like live migrating a paused VM, and it generally does not work unless you tune some variables in the kernel.

Namely, you need to set vm.unprivileged_userfaultfd=1 in sysctl.
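
A quick way to check the current value on a compute host is sketched below. It assumes the sysctl is exposed at /proc/sys/vm/unprivileged_userfaultfd, which is the usual /proc mapping for a vm.* key; the function name is hypothetical, and kernels that do not expose the key are reported as None.

```python
from pathlib import Path

# Assumed /proc path for the vm.unprivileged_userfaultfd sysctl.
SYSCTL_PATH = Path("/proc/sys/vm/unprivileged_userfaultfd")


def unprivileged_userfaultfd_enabled(path=SYSCTL_PATH):
    """Return True if the sysctl is 1, False otherwise, None if it is absent."""
    try:
        return Path(path).read_text().strip() == "1"
    except FileNotFoundError:
        return None


state = unprivileged_userfaultfd_enabled()
if state is None:
    print("vm.unprivileged_userfaultfd is not exposed on this kernel")
elif state:
    print("vm.unprivileged_userfaultfd=1")
else:
    print("vm.unprivileged_userfaultfd is not 1; post-copy may be refused")
```

The check is read-only. Changing the value and persisting it across reboots is still done through sysctl, as the linked deployment roles do.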

https://github.com/openstack-k8s-operators/edpm-ansible/blob/main/roles/edpm_kernel/vars/main.yml#L105C3-L106

https://github.com/openstack/tripleo-heat-templates/blob/1393d39be367db3acb02508e0e858395a4e4fefa/deployment/nova/nova-compute-container-puppet.yaml#L1631-L1638

If you can provide an example of the guest XML, we may be able to spot any feature that might not be supported, although I don't know of any other than pause that would break it. We have a bug for pause:
https://bugs.launchpad.net/nova/+bug/1946752

That is fixed back to Zed (https://review.opendev.org/q/topic:%22bug/1671011%22), but it was only done a month ago, so it may not be in a packaged release yet.

keerthivasan (keerthivassan86) wrote :

Thanks @sean for your input. After setting `vm.unprivileged_userfaultfd`, it is working as expected. Changed the status of this bug to Invalid.

Changed in nova:
status: New → Invalid
Changed in nova:
status: Invalid → New
assignee: nobody → keerthivasan (keerthivassan86)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/908201

keerthivasan (keerthivassan86) wrote :

Thanks Sean. I validated the sysctl change you provided and pushed a doc update for the live migration configuration page.
