Live migration failure

Bug #1905424 reported by Luis
This bug affects 3 people

Affects           Status      Importance  Assigned to  Milestone
libvirt           Unknown     Unknown
libvirt (Ubuntu)  Incomplete  Undecided   Unassigned
qemu (Ubuntu)     Incomplete  Undecided   Unassigned

Bug Description

Performing a live migration with hugepages enabled fails.

nova-compute.log:2020-11-24 08:19:53.266 865081 INFO nova.compute.manager [-] [instance: 24063aa2-2c61-4b43-8d9f-bfd9e38547ab] Took 3.02 seconds for pre_live_migration on destination host compute02-asd001b.
nova-compute.log:2020-11-24 08:19:54.063 865081 INFO nova.virt.libvirt.migration [-] [instance: 24063aa2-2c61-4b43-8d9f-bfd9e38547ab] Increasing downtime to 50 ms after 0 sec elapsed time
nova-compute.log:2020-11-24 08:19:54.134 865081 INFO nova.virt.libvirt.driver [-] [instance: 24063aa2-2c61-4b43-8d9f-bfd9e38547ab] Migration running for 0 secs, memory 100% remaining (bytes processed=0, remaining=0, total=0); disk 100% remaining (bytes processed=0, remaining=0, total=0).
nova-compute.log:2020-11-24 08:19:55.862 865081 ERROR nova.virt.libvirt.driver [-] [instance: 24063aa2-2c61-4b43-8d9f-bfd9e38547ab] Live Migration failure: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported: libvirt.libvirtError: internal error: unable to execute QEMU command 'migrate-set-capabilities': Postcopy is not supported
nova-compute.log:2020-11-24 08:19:56.144 865081 ERROR nova.virt.libvirt.driver [-] [instance: 24063aa2-2c61-4b43-8d9f-bfd9e38547ab] Migration operation has aborted
nova-compute.log:2020-11-24 08:19:56.216 865081 INFO nova.compute.manager [-] [instance: 24063aa2-2c61-4b43-8d9f-bfd9e38547ab] Swapping old allocation on dict_keys(['d933c0a1-1eb3-4c9f-a018-334638019498']) held by migration 61a97f7a-9425-4ee6-af04-3486c69f8d09 for instance
nova-compute.log:2020-11-24 08:19:57.617 865081 WARNING nova.compute.manager [req-bf1b305f-8b37-4847-898b-04d8941ff558 f60eedb4a49c47aa8e62315ddf1a49dd 145f5574b15942dea22bf225befd219a - ea788e1be1ca4dd791c08dad188197a4 ea788e1be1ca4dd791c08dad188197a4] [instance: 24063aa2-2c61-4b43-8d9f-bfd9e38547ab] Received unexpected event network-vif-unplugged-dfa5c55f-453d-451b-8605-156500ccb0d9 for instance with vm_state active and task_state None.
nova-compute.log:2020-11-24 08:19:58.271 865081 WARNING nova.compute.manager [req-642793f8-10e6-49c3-9aa4-a757826eee97 f60eedb4a49c47aa8e62315ddf1a49dd 145f5574b15942dea22bf225befd219a - ea788e1be1ca4dd791c08dad188197a4 ea788e1be1ca4dd791c08dad188197a4] [instance: 24063aa2-2c61-4b43-8d9f-bfd9e38547ab] Received unexpected event network-vif-plugged-dfa5c55f-453d-451b-8605-156500ccb0d9 for instance with vm_state active and task_state None.

This is the configuration related to hugepages on the VM:

  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB' nodeset='0'/>
    </hugepages>
  </memoryBacking>
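For reference, a minimal sketch (assuming only Python's standard `xml.etree.ElementTree`) of pulling the hugepage settings out of a fragment like the one above, e.g. to spot guests that pin a nodeset:

```python
import xml.etree.ElementTree as ET

# Hugepage fragment from the affected guest's domain XML
FRAGMENT = """
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB' nodeset='0'/>
  </hugepages>
</memoryBacking>
"""

def hugepage_settings(xml_text):
    """Return (size, unit, nodeset) for every <page> element."""
    root = ET.fromstring(xml_text)
    return [(p.get('size'), p.get('unit'), p.get('nodeset'))
            for p in root.iter('page')]

print(hugepage_settings(FRAGMENT))
# [('2048', 'KiB', '0')]
```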

https://bugzilla.redhat.com/show_bug.cgi?id=1710687

It seems to be related to the qemu and libvirt releases and the nodeset setting.

root@compute01-asd001b:/var/log/nova# lsb_release -rd
Description: Ubuntu 18.04.5 LTS
Release: 18.04

root@compute01-asd001b:/var/log/nova# dpkg -l | grep qemu
ii ipxe-qemu 1.0.0+git-20180124.fbe8c52d-0ubuntu2.2 all PXE boot firmware - ROM images for qemu
ii ipxe-qemu-256k-compat-efi-roms 1.0.0+git-20150424.a25a16d-0ubuntu2 all PXE boot firmware - Compat EFI ROM images for qemu
ii qemu-block-extra:amd64 1:4.0+dfsg-0ubuntu9.8~cloud0 amd64 extra block backend modules for qemu-system and qemu-utils
ii qemu-kvm 1:4.0+dfsg-0ubuntu9.8~cloud0 amd64 QEMU Full virtualization on x86 hardware
ii qemu-system-common 1:4.0+dfsg-0ubuntu9.8~cloud0 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-data 1:4.0+dfsg-0ubuntu9.8~cloud0 all QEMU full system emulation (data files)
ii qemu-system-gui 1:4.0+dfsg-0ubuntu9.8~cloud0 amd64 QEMU full system emulation binaries (user interface and audio support)
ii qemu-system-x86 1:4.0+dfsg-0ubuntu9.8~cloud0 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 1:4.0+dfsg-0ubuntu9.8~cloud0 amd64 QEMU utilities

root@compute01-asd001b:/var/log/nova# dpkg -l | grep libvirt
ii libvirt-clients 5.4.0-0ubuntu5.4~cloud0 amd64 Programs for the libvirt library
ii libvirt-daemon 5.4.0-0ubuntu5.4~cloud0 amd64 Virtualization daemon
ii libvirt-daemon-driver-storage-rbd 5.4.0-0ubuntu5.4~cloud0 amd64 Virtualization daemon RBD storage driver
ii libvirt-daemon-system 5.4.0-0ubuntu5.4~cloud0 amd64 Libvirt daemon configuration files
ii libvirt0:amd64 5.4.0-0ubuntu5.4~cloud0 amd64 library for interfacing with different virtualization systems
ii nova-compute-libvirt 2:20.3.0-0ubuntu1~cloud0 all OpenStack Compute - compute node libvir support
ii python3-libvirt 5.0.0-1~cloud0 amd64 libvirt Python 3 bindings

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in libvirt (Ubuntu):
status: New → Confirmed
Changed in qemu-kvm (Ubuntu):
status: New → Confirmed
affects: qemu-kvm (Ubuntu) → qemu (Ubuntu)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The upstream bug identified (and asked the reporter to check) that removing the "nodeset" avoids the problem.
IMHO progress there depends on someone confirming on the upstream bug: "yes, indeed it is the same for me; once nodeset is gone it works, but with it it fails".

If you could make this modification in your environment and confirm it, you could breathe some life back into that upstream bug.

Also, I'd recommend verifying with the latest Ubuntu development release (21.04 / Hirsute), which has qemu 5.1 - not in general, but for your testing.
If that has it fixed we could look for an existing patch to backport if applicable.
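To test that theory, the `nodeset` attribute can be dropped from the hugepage elements of the domain XML before redefining the guest. A minimal sketch of that edit, using only Python's standard `xml.etree.ElementTree` (the fragment below is illustrative, not the reporter's full domain XML):

```python
import xml.etree.ElementTree as ET

def drop_nodeset(domain_xml):
    """Remove the nodeset attribute from every hugepage <page> element."""
    root = ET.fromstring(domain_xml)
    for page in root.iter('page'):
        page.attrib.pop('nodeset', None)  # no-op if absent
    return ET.tostring(root, encoding='unicode')

before = ("<memoryBacking><hugepages>"
          "<page size='2048' unit='KiB' nodeset='0'/>"
          "</hugepages></memoryBacking>")
after = drop_nodeset(before)
print(after)  # the <page> element no longer carries nodeset='0'
```

The rewritten XML would then be fed back via `virsh define` (or `virsh edit` done by hand) before retrying the migration.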

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI: I added a tracker task to the upstream bug.

After lunch I'll try to recreate this, if possible without hugepages (just nodeset), if there is a way to do it. That would help testability, but no promises ...

Revision history for this message
Luis (luis-ramirez) wrote :

The VMs are created by nova (OpenStack Train environment), so I cannot modify the settings for this VM, but I could try to create one and perform a live migration between 2 nodes.

Br
Luis

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

What I found interesting in your bug vs the upstream one is that you have a single node in your nodeset; that might help simplify things.

# new guest with UVT (any other probably does as well)
$ uvt-kvm create --host-passthrough --password=ubuntu h-migr-nodeset release=hirsute arch=amd64 label=daily

# verify things migrate
virsh migrate --unsafe --live h-migr-nodeset qemu+ssh://testkvm-hirsute-to/system

# make it use HP+nodeset
Add this section to the config:
  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB' nodeset='0'/>
    </hugepages>
  </memoryBacking>

This migrates just fine in Hirsute (qemu 5.1 / libvirt 6.9).
You mentioned seeing this on Bionic, so I tried the same on Bionic - but that works as well.

Hmm, so far trying to simplify this (e.g. taking OpenStack out of the equation) has failed.
Once you get to create VMs manually, could you check what the smallest set of "ingredients" is to trigger the issue?

Note: this is a non-NUMA system; I only have node 0 - this could be important. It would be awesome if we could make it trigger on such systems, but if a NUMA bare-metal system is eventually required, we can't change that.

P.S.: TBH - I'm not sure this isn't just a real limitation, but let us track the case until we know for sure. Or did you ever see this working, and it degraded in a newer release?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Subscribing James/Corey - have you seen these issues with hugepage-backed guests as generated by OpenStack?

Changed in qemu (Ubuntu):
status: Confirmed → Incomplete
Changed in libvirt (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Luis (luis-ramirez) wrote :

This is a NUMA system:

root@compute01-asd001b:~# numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0
nodebind: 0
membind: 0 1
root@compute01-asd001b:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 0 size: 64287 MB
node 0 free: 623 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 64491 MB
node 1 free: 8949 MB
node distances:
node 0 1
  0: 10 12
  1: 12 10

If you don't have NUMA nodes enabled, OpenStack cannot add the hugepages settings due to the NUMATopologyFilter.

I'll try to create the VM, and I'll let you know. Thanks in advance.

Br
Luis
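On a NUMA host like the one above, the per-node hugepage pools can be checked via sysfs. A small sketch of building the standard kernel sysfs path (a hypothetical helper; reading the file only works on a Linux host with hugepages configured):

```python
def node_hugepage_path(node, page_kib=2048):
    """sysfs file holding the hugepage pool size for one NUMA node."""
    return (f"/sys/devices/system/node/node{node}"
            f"/hugepages/hugepages-{page_kib}kB/nr_hugepages")

# e.g. for node 0 with 2 MiB pages, as used by the affected guest:
print(node_hugepage_path(0))
# /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
```

Comparing `nr_hugepages` (and `free_hugepages` in the same directory) on source and destination nodes can show whether the destination actually has pages available on the nodeset the guest pins.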
