libvirt live migration to an older-generation processor freezes the migrated VM

Bug #2003226 reported by Daniel Roche
This bug affects 1 person
Affects           Status   Importance  Assigned to  Milestone
libvirt (Ubuntu)  Expired  Undecided   Unassigned
linux (Ubuntu)    Expired  Undecided   Unassigned

Bug Description

Hi,

I have several libvirt host servers with different CPU generations, in particular:
- older generation: Intel Xeon E5-2640 v4 2.40GHz
- newer generation: Intel Xeon Gold 5215 2.50GHz

I recently reinstalled all these servers with Ubuntu Server 22.04.1,
and since then, when I live migrate a VM from a newer-generation processor to an older-generation processor,
the migrated guest freezes without generating any error logs.

If I migrate the opposite way (older CPU to newer CPU), it works perfectly.

The previous installation of the hosts (same hardware on Ubuntu 16.04) did not have this problem.

The live migration is done with the following command (issued from a third server playing the role of 'virtual-center'):

virsh -c qemu+ssh://root@new_server/system migrate --verbose --live --undefinesource --persistent --unsafe guest_name qemu+ssh://root@old_server/system

This one freezes the guest,

while the opposite migration:

virsh -c qemu+ssh://root@old_server/system migrate --verbose --live --undefinesource --persistent --unsafe guest_name qemu+ssh://root@new_server/system

works without problems.
Migrating between 2 servers with the same CPU generation also works perfectly.

The CPU configuration of the guest is generic:
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>qemu64</model>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='lahf_lm'/>
    <feature policy='disable' name='svm'/>
  </cpu>

I have tried several (almost all) other virtual CPU configurations, always with the same problem.
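
To compare guest CPU definitions between hosts, the <cpu> element above can be reduced to its model and per-feature policies. The following is a minimal Python sketch, assuming the XML is available as a string (for example copied from `virsh dumpxml guest_name`); `summarize_cpu` is a hypothetical helper name, not a libvirt API.

```python
import xml.etree.ElementTree as ET

# The <cpu> element from the guest definition quoted above.
GUEST_CPU_XML = """
<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>qemu64</model>
  <feature policy='require' name='x2apic'/>
  <feature policy='require' name='hypervisor'/>
  <feature policy='require' name='lahf_lm'/>
  <feature policy='disable' name='svm'/>
</cpu>
"""


def summarize_cpu(cpu_xml):
    """Return (model, {policy: [feature names]}) for a libvirt <cpu> element."""
    cpu = ET.fromstring(cpu_xml)
    model = cpu.findtext("model")
    policies = {}
    for feature in cpu.findall("feature"):
        # Group feature names by their policy attribute (require, disable, ...).
        policies.setdefault(feature.get("policy"), []).append(feature.get("name"))
    return model, policies


model, policies = summarize_cpu(GUEST_CPU_XML)
print(model, policies)
```

Running the same helper on the definition dumped from each host makes it easy to confirm that the guest really sees the same virtual CPU on both sides.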
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 févr. 16 12:39 seq
 crw-rw---- 1 root audio 116, 33 févr. 16 12:39 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.3
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
CasperMD5CheckResult: pass
DistroRelease: Ubuntu 22.04
InstallationDate: Installed on 2023-02-15 (0 days ago)
InstallationMedia: Ubuntu-Server 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: FUJITSU PRIMERGY RX2530 M2
Package: linux (not installed)
PciMultimedia:

ProcCmdline: BOOT_IMAGE=/boot/vmlinuz-5.15.0-60-generic root=UUID=13e84e97-ad18-49ed-8050-c8f7293e5e7d ro
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
ProcFB: 0 mgag200drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-60-generic root=UUID=13e84e97-ad18-49ed-8050-c8f7293e5e7d ro
ProcVersionSignature: Ubuntu 5.15.0-60.66-generic 5.15.78
RelatedPackageVersions:
 linux-restricted-modules-5.15.0-60-generic N/A
 linux-backports-modules-5.15.0-60-generic N/A
 linux-firmware 20220329.git681281e4-0ubuntu3.10
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: jammy
Uname: Linux 5.15.0-60-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 09/29/2016
dmi.bios.release: 1.10
dmi.bios.vendor: FUJITSU // American Megatrends Inc.
dmi.bios.version: V5.0.0.11 R1.10.0 for D3279-B1x
dmi.board.name: D3279-B1
dmi.board.vendor: FUJITSU
dmi.board.version: S26361-D3279-B12 WGS03 GS02
dmi.chassis.asset.tag: System Asset Tag
dmi.chassis.type: 23
dmi.chassis.vendor: FUJITSU
dmi.chassis.version: RX2530M2R1
dmi.modalias: dmi:bvnFUJITSU//AmericanMegatrendsInc.:bvrV5.0.0.11R1.10.0forD3279-B1x:bd09/29/2016:br1.10:svnFUJITSU:pnPRIMERGYRX2530M2:pvrGS01:rvnFUJITSU:rnD3279-B1:rvrS26361-D3279-B12WGS03GS02:cvnFUJITSU:ct23:cvrRX2530M2R1:skuABNK1565-V101-236:
dmi.product.family: SERVER
dmi.product.name: PRIMERGY RX2530 M2
dmi.product.sku: ABN:K1565-V101-236
dmi.product.version: GS01
dmi.sys.vendor: FUJITSU
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 févr. 16 12:39 seq
 crw-rw---- 1 root audio 116, 33 févr. 16 12:39 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.3
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
CasperMD5CheckResult: pass
DistroRelease: Ubuntu 22.04
InstallationDate: Installed on 2023-02-15 (0 days ago)
InstallationMedia: Ubuntu-Server 22.04.1 LTS "Jammy Jellyfish" - Release amd64 (20220809)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 004: ID 04b3:4010 IBM Corp. XClarity Controller
 Bus 001 Device 002: ID 2a4b:0400 EMULEX Corporation Pilot4 Integrated Hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Lsusb-t:
 /: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/10p, 5000M
 /: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/16p, 480M
     |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/7p, 480M
         |__ Port 6: Dev 4, If 0, Class=Communications, Driver=cdc_ether, 480M
         |__ Port 6: Dev 4, If 1, Class=CDC Data, Driver=cdc_ether, 480M
MachineType: Lenovo ThinkSystem SR530 -[7X08CTO1WW]-
Package: linux (not installed)
PciMultimedia:

ProcCmdline: BOOT_IMAGE=/boot/vmlinuz-5.15.0-60-generic root=UUID=fc2c50fd-cc47-4243-918d-75af7d00b0e4 ro
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
ProcFB: 0 mgag200drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-60-generic root=UUID=fc2c50fd-cc47-4243-918d-75af7d00b0e4 ro
ProcVersionSignature: Ubuntu 5.15.0-60.66-generic 5.15.78
RelatedPackageVersions:
 linux-restricted-modules-5.15.0-60-generic N/A
 linux-backports-modules-5.15.0-60-generic N/A
 linux-firmware 20220329.git681281e4-0ubuntu3.10
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: jammy
Uname: Linux 5.15.0-60-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 10/31/2019
dmi.bios.release: 2.41
dmi.bios.vendor: Lenovo
dmi.bios.version: -[TEE148M-2.41]-
dmi.board.asset.tag: none
dmi.board.name: -[7X08CTO1WW]-
dmi.board.vendor: Lenovo
dmi.board.version: none
dmi.chassis.asset.tag: none
dmi.chassis.type: 23
dmi.chassis.vendor: Lenovo
dmi.chassis.version: none
dmi.ec.firmware.release: 3.8
dmi.modalias: dmi:bvnLenovo:bvr-[TEE148M-2.41]-:bd10/31/2019:br2.41:efr3.8:svnLenovo:pnThinkSystemSR530-[7X08CTO1WW]-:pvr08:rvnLenovo:rn-[7X08CTO1WW]-:rvrnone:cvnLenovo:ct23:cvrnone:sku7X08CTO1WW:
dmi.product.family: ThinkSystem
dmi.product.name: ThinkSystem SR530 -[7X08CTO1WW]-
dmi.product.sku: 7X08CTO1WW
dmi.product.version: 08
dmi.sys.vendor: Lenovo

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Wow, interesting - thanks for the report.

With this CPU definition you already have almost all advanced features disabled.
And you said you even tried others - well done.

You also said that the same HW worked (with 16.04 instead of 22.04).
That is interesting.

Sadly, until we have access to comparable hardware that exposes this (I've tried two machines, not too similar, and neither showed the problem), the best we can do is try to corner the change that broke it.

So I wonder if you could try a few things. How feasible would it be to:
- try newer code from [1]; if it works, we could bisect looking for a fix
- try older code; we do not yet know if it is qemu, libvirt, kernel or anything else.
  We could check them one by one the hard way, e.g. if you say you can do so we could
  bisect between qemu 2.5 (as in xenial) and 6.2 (as in jammy). Before I go into detail on how
  to do so, would that even be possible? Or are those production machines not to be
  messed with too much?

[1]: https://launchpad.net/~canonical-server/+archive/ubuntu/server-backports

Revision history for this message
Daniel Roche (dan-y-roche) wrote :

Hello,

Unfortunately the new servers are production machines (with almost 500 VMs),
so I do not want to run exotic tests on them.

I may do tests on an older system, still in production but about to be retired.

- I will try to install qemu 7.0 from the PPA and test a migration
- I may also revert this older system to Ubuntu 16.04 (qemu 2.5)

You will have to wait until the first week of February; I will not be available before then.
I'll let you know when it's done.

Best Regards

Revision history for this message
Daniel Roche (dan-y-roche) wrote :

I have found a post on Server Fault:

https://serverfault.com/questions/1110353/libvirt-qemu-vm-freezes-when-migrating-between-specific-hosts

This person seems to have exactly the same problem,
but there is no answer there either...

Revision history for this message
Daniel Roche (dan-y-roche) wrote :

Hello,

Sorry for the delay, busy schedule...

I finally had time to do some tests, and I have some new information:

- I reinstalled 2 of my virtualization servers with Ubuntu 18.04.6:
     - 1 older (Intel Xeon E5-2640 v4 2.40GHz)
     - 1 newer (Intel Xeon Gold 5215 2.50GHz)

- First, I can confirm that with Ubuntu 18.04.6 (kernel 4.15.0-204) I do not have the problem:
  I can migrate a VM from the newer system to the older system and back without any issue.

- I manually installed kernel 5.15.0-56 (from Ubuntu 22.04) on both 18.04 servers
  and rebooted into this kernel (without changing anything else).

- With kernel 5.15.0-56, I can reproduce the problem every time:
    - migrating a VM from the older system to the newer one always works
    - migrating a VM from the newer system to the older one always crashes the VM

     The symptom is a bit different though: the VM is marked as "paused",
     and if I try to virsh resume it, I get:
        error: Failed to resume domain devansible01
        error: internal error: unable to execute QEMU command 'cont': Resetting the Virtual Machine is required

     The only thing I can do is virsh destroy and virsh start.

Since I did not change the qemu version, it seems that this is a kernel issue.

I plan to do the opposite test: reinstall both servers with Ubuntu 22.04 and downgrade the kernel,
hopefully by the end of the week. I will let you know.

Best regards.

Revision history for this message
Daniel Roche (dan-y-roche) wrote :

Hello Again,

I don't know if it is relevant, but
when I migrate a VM to the older system with kernel 5.15.0-56 (and the VM crashes),
I get the following message in /var/log/kern.log:

set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.

best regards.
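
The kernel message refers to the `dump_invalid_vmcs` parameter of the `kvm_intel` module. Parameters of loaded modules are normally exposed under /sys/module/<module>/parameters/, so the current value can be checked before reproducing the crash. The following is a minimal Python sketch, assuming the parameter is exposed on the running kernel; `read_module_param` is a hypothetical helper, and its `sysfs_root` argument exists only so the function can be exercised against a test directory.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return the raw value of a kernel module parameter, or None if it is not exposed."""
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, parameter absent, or not readable by this user.
        return None


if __name__ == "__main__":
    value = read_module_param("kvm_intel", "dump_invalid_vmcs")
    print("kvm_intel.dump_invalid_vmcs =", value)
```

How the parameter gets enabled (at module load time or at runtime) depends on the host setup and is left out of the sketch.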

Revision history for this message
Daniel Roche (dan-y-roche) wrote :

Hello Again....

I would have guessed the opposite, but it seems that it is the source kernel that matters,
not the destination.

So far I have found:

- virsh migrate from older CPU to newer CPU: always works, whatever the kernel.

- virsh migrate from newer CPU to older CPU:
   - source kernel = 5.15.0-56: always crashes the VM, whatever the destination kernel
   - source kernel = 4.15.0-204: always works

best regards

Revision history for this message
Paride Legovini (paride) wrote :

Hello Daniel, thanks for providing the results of your testing. This looks like a kernel issue. I think what we need is testing with different kernels to check when the regression happened.

The kernel team maintains kernel builds of the mainline kernel (i.e. without Ubuntu changes). Instructions on how to test those kernels are here:

  https://wiki.ubuntu.com/Kernel/MainlineBuilds

Would you be able to verify (1) whether the issue also happens with the mainline kernels and (2) which kernel version regressed (bisecting between versions)? Given that this issue is difficult to reproduce (it requires specific hardware), we need to rely on your testing to make progress.

I added a kernel task to the bug.

Changed in libvirt (Ubuntu):
status: New → Incomplete
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2003226

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Daniel Roche (dan-y-roche) wrote :

Hello Again,

Thank you for your response.
I plan to test different kernels, and I will post the apport-collect results very soon.

Meanwhile, I can confirm some new information:

I reinstalled the two systems with Ubuntu 22.04.1 (kernel 5.15.0-60-generic);
with this, I reproduce the problem every time.
I downgraded the kernel to 4.15.0-204 on both systems, without changing anything else,
and then the problem is gone: I have done dozens of virsh migrate runs without any issue.

I guess this confirms the kernel issue.

I will come back soon with the apport-collect results and more kernel tests.

Revision history for this message
Daniel Roche (dan-y-roche) wrote : CurrentDmesg.txt

apport information

tags: added: apport-collected jammy
description: updated
Revision history for this message
Daniel Roche (dan-y-roche) wrote : KernLog.txt, Lspci.txt, Lspci-vt.txt, Lsusb.txt, Lsusb-t.txt, Lsusb-v.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, WifiSyslog.txt, acpidump.txt

apport information (one attachment per file)

description: updated
Revision history for this message
Daniel Roche (dan-y-roche) wrote : CurrentDmesg.txt, KernLog.txt, Lspci.txt, Lspci-vt.txt, Lsusb-v.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, WifiSyslog.txt, acpidump.txt

apport information (one attachment per file)

Revision history for this message
Daniel Roche (dan-y-roche) wrote :

apport-collect 2003226 is done for both systems.

Revision history for this message
Daniel Roche (dan-y-roche) wrote :

So far, kernel 5.4.0-139-generic (the latest one from Ubuntu 20.04) does not have the problem.

Revision history for this message
Daniel Roche (dan-y-roche) wrote :

Hello Again,

I have done some tests with mainline kernels:

- 5.9.16-050916-generic = OK
- 5.12.19-051219-generic = OK
- 5.15.94-051594-generic = OK
- 5.16.20-051620-generic = CRASH

I will do some more tests with intermediate 5.16.xx versions.
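
The results listed above can be narrowed to a regression window: the highest version that still works and the lowest version that crashes. The following is a minimal Python sketch using only the four results listed above; the helper names are hypothetical. It assumes that once a version crashes, all later versions crash too. Stable point releases such as 5.15.94 are maintained on their own branch, so version order only approximates commit order.

```python
# Mainline kernel results as listed above (version -> outcome).
RESULTS = {
    "5.9.16": "OK",
    "5.12.19": "OK",
    "5.15.94": "OK",
    "5.16.20": "CRASH",
}


def version_key(version):
    """Turn '5.15.94' into (5, 15, 94) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def regression_window(results):
    """Return (last_good, first_bad) by version order; a side is None if missing."""
    ordered = sorted(results, key=version_key)
    bad = [v for v in ordered if results[v] == "CRASH"]
    first_bad = bad[0] if bad else None
    # Only count good versions that sort below the first bad one.
    good = [
        v for v in ordered
        if results[v] == "OK"
        and (first_bad is None or version_key(v) < version_key(first_bad))
    ]
    last_good = good[-1] if good else None
    return last_good, first_bad


print(regression_window(RESULTS))
```

Each new test result can be added to the dictionary to shrink the window further.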

Revision history for this message
Daniel Roche (dan-y-roche) wrote :

- 5.16.0-051600-generic = CRASH as well

Should I test the 5.16-rc versions?

best regards.

Revision history for this message
Bryce Harrington (bryce) wrote :

@Daniel testing rc versions might be ok, although typically at this point you'd want to consider switching to bisecting a git checkout of the kernel since there'll be a limited number of -rc's.
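
For reference, bisecting means repeatedly testing the midpoint between a known-good and a known-bad commit, so the number of kernels to build and test grows only with the logarithm of the commit count. With a kernel git checkout, `git bisect` drives this search; the following Python sketch only illustrates the idea. The function name, the commit list and the `is_bad` callback are hypothetical stand-ins for "build this commit, boot it on the source host, try a migration".

```python
def bisect_first_bad(commits, is_bad):
    """Return (index of the first bad commit, number of tests run).

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return bad, steps


# Toy example: 1000 numbered commits, regression introduced at number 637.
commits = list(range(1000))
index, steps = bisect_first_bad(commits, lambda c: c >= 637)
print(index, steps)
```

In the toy example the first bad commit is found in at most 10 tests, which is why a git bisect is usually more efficient than testing every -rc build.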

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for libvirt (Ubuntu) because there has been no activity for 60 days.]

Changed in libvirt (Ubuntu):
status: Incomplete → Expired