EPYC-Rome model without XSAVES may break live migration since the removal of the flag on the physical CPU

Bug #2048517 reported by Jan Horstmann
This bug affects 3 people
Affects         Status    Importance   Assigned to                  Milestone
nova (Ubuntu)   Triaged   Medium       Mauricio Faria de Oliveira
qemu (Ubuntu)   Invalid   Undecided    Unassigned

Bug Description

The upstream Linux kernel disabled XSAVES on AMD EPYC Rome CPUs ([1]). Upstream QEMU followed shortly with a patch adding a version of the EPYC-Rome CPU model without XSAVES ([2]).
The kernel change has been backported to Ubuntu Focal ([3]).

Without further workarounds or the adapted CPU model in QEMU, virtual machines with an EPYC-Rome CPU model created on hypervisors with newer EPYC CPUs will have the XSAVES flag enabled. This prevents live migration to hypervisors with EPYC Rome CPUs, where XSAVES is no longer available.
Therefore I would like to argue that the patch adapting the CPU model in QEMU should also be backported to Ubuntu Focal.
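
To make the failure mode concrete, here is a minimal Python sketch (not taken from QEMU, libvirt, or nova) that compares the features a guest requires with the flags a destination host advertises. The function names and the shortened flag strings are hypothetical; on a real host the text would come from /proc/cpuinfo.

```python
def host_flags(cpuinfo_text):
    """Return the CPU flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_destination(guest_required, destination_cpuinfo):
    """Features the guest requires that the destination host does not expose."""
    return sorted(set(guest_required) - host_flags(destination_cpuinfo))


# Hypothetical, heavily shortened flag lists for illustration only.
rome_patched_kernel = "processor : 0\nflags : fpu sse2 xsave xsaveopt xsavec\n"
newer_epyc = "processor : 0\nflags : fpu sse2 xsave xsaveopt xsavec xsaves\n"

guest = ["xsave", "xsaves"]
print(missing_on_destination(guest, rome_patched_kernel))  # ['xsaves']
print(missing_on_destination(guest, newer_epyc))  # []
```

A non-empty result corresponds to the situation described above: the guest was started with a feature that the destination can no longer provide.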

[1]
https://<email address hidden>/

[2]
https://<email address hidden>/

[3]
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2023420

Tags: server-todo
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu (Ubuntu):
status: New → Confirmed
Sergio Durigan Junior (sergiodj) wrote :

Hello Jan,

Thank you for reporting this bug, and for providing a good initial analysis of the problem.

Would you have access to a host with an EPYC-Rome CPU where you can run some tests? This is something that needs to be done in order to proceed here, especially if we indeed decide to SRU this change into Focal.

Thanks.

Giuseppe Petralia (peppepetra) wrote :

Hello Sergio,

I can help with testing. I have access to EPYC-Rome CPU machines.

Thanks

Changed in qemu (Ubuntu):
assignee: nobody → Sergio Durigan Junior (sergiodj)
tags: added: server-todo
Sergio Durigan Junior (sergiodj) wrote :

Hello Giuseppe,

Thank you. I went ahead and backported the necessary patches. In total, I had to backport the following upstream commits:

fb00aa61267c8b9c57a2d1a1fa1e336d02e3bcd1
d7c72735f618a7ee27ee109d8b1468193734606a
cca0a000d06f897411a8af4402e5d0522bbe450b

I uploaded a version of QEMU with the patches to:

https://launchpad.net/~sergiodj/+archive/ubuntu/qemu/+packages

The version is 1:4.2-3ubuntu6.29~ppa1. Could you take it for a spin, please? I'm interested in seeing the results of the scenarios suggested by Jan (namely, migrating VMs between hosts that have differing states for XSAVES).

Thank you.

Giuseppe Petralia (peppepetra) wrote :

Hello Sergio,

thanks for providing the patch.

I have installed the patched qemu on my test machine. The new EPYC-Rome-v4 model is now listed by qemu:

```
# qemu-system-x86_64 -enable-kvm -cpu help| grep -i epyc-rome-v
x86 EPYC-Rome-v1 AMD EPYC-Rome Processor
x86 EPYC-Rome-v2 AMD EPYC-Rome Processor
x86 EPYC-Rome-v3 AMD EPYC-Rome-v3 Processor
x86 EPYC-Rome-v4 AMD EPYC-Rome-v4 Processor (no XSAVES)
```

But I am not able to get the VM to drop xsaves.

I have stopped and started the VM running on my host, but it still requires the xsaves feature. The same goes for newly created VMs on the host: they still require xsaves, which means they can't be migrated to hosts that have already dropped the xsaves CPU flag. I have also tried restarting nova-compute and libvirtd, but the result is the same:

```
# virsh dumpxml instance-005ac280 | grep xsaves
    <feature policy='require' name='xsaves'/>
```

I believe either this patch is not enough to allow migrations between hosts with and without xsaves, or I am missing some steps.
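
The `grep` check above can also be done structurally. The following is a minimal Python sketch that reads the CPU feature policies out of domain XML using the standard library. The sample XML is a shortened, hypothetical stand-in for real `virsh dumpxml` output, and `feature_policies` is an illustrative name, not a libvirt API.

```python
import xml.etree.ElementTree as ET


def feature_policies(domain_xml):
    """Map each CPU feature name in a libvirt domain XML to its policy."""
    root = ET.fromstring(domain_xml.strip())
    return {
        feature.get("name"): feature.get("policy")
        for feature in root.findall("./cpu/feature")
    }


# Shortened, hypothetical domain XML; real `virsh dumpxml` output is much longer.
sample = """
<domain type='kvm'>
  <name>instance-005ac280</name>
  <cpu mode='custom' match='exact'>
    <model fallback='forbid'>EPYC-Rome</model>
    <feature policy='require' name='xsaves'/>
  </cpu>
</domain>
"""

print(feature_policies(sample).get("xsaves"))  # require
```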

Giuseppe Petralia (peppepetra) wrote :

One thing I noticed is that the libvirt XML for EPYC-Rome at /usr/share/libvirt/cpu_map/x86_EPYC-Rome.xml is, of course, not updated, but a new file for -v4 is not created either.

That means I can't instruct nova to use EPYC-Rome-v4 with

# grep -i epyc /etc/nova/nova.conf
cpu_model = EPYC-Rome-v4

because nova will stop working with the error

2024-04-16 07:30:04.939 4168598 ERROR oslo_service.service nova.exception.InvalidCPUInfo: Configured CPU model: EPYC-Rome-v4 is not correct, or your host CPU arch does not support this model. Please correct your config and try again.

Do we need a patch for libvirt as well to have this working?
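
As a rough way to see which model files an installed libvirt ships, here is a small Python sketch that lists names derived from `x86_<model>.xml` files in a cpu_map directory. It assumes the file naming seen in the path above and that `x86_features.xml` and `x86_vendors.xml` are the only non-model files matching the pattern. It is not how nova or libvirt validate a model, and `shipped_x86_models` is a hypothetical helper.

```python
from pathlib import Path


def shipped_x86_models(cpu_map_dir):
    """List model names derived from x86_<model>.xml files in a cpu_map directory."""
    skip = {"features", "vendors"}  # assumed non-model files in the same directory
    names = set()
    for path in Path(cpu_map_dir).glob("x86_*.xml"):
        name = path.stem[len("x86_"):]
        if name not in skip:
            names.add(name)
    return names


# On a compute node this would be the directory mentioned above, e.g.:
# print("EPYC-Rome-v4" in shipped_x86_models("/usr/share/libvirt/cpu_map"))
```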

Mauricio Faria de Oliveira (mfo) wrote :

Hi Giuseppe,

You're right, libvirt checks the specified model against its known models.

However, the EPYC-Rome model (non-v4) doesn't specify 'xsaves'; only EPYC-Milan does.
So it _seems_ the feature came from the default cpu_mode, host-model:
libvirt perhaps found the EPYC-Milan model to be a closer match for the host flags
and used that CPU model file instead, which resulted in 'xsaves' being required.

Since CPU feature flag changes have become more frequent recently (e.g., the
'xsaves' change you mentioned and the PKRU/xsave changes/regressions in kernel
5.15.0-85 per bug 2032164 comment 5), and these may apparently continue
to grow over time as errata, security vulnerabilities, and other issues
come up, maybe it's better not to rely on fixes that require CPU model
updates in packages (especially in 2 of them: qemu and libvirt).

So, in Yoga and later, Nova extends the 'libvirt.cpu_model_extra_flags' [1]
parsing with '+' and '-', so you can disable specific flags with '-', [2]
e.g., '-xsaves'.
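
To illustrate the '+'/'-' convention, here is a minimal Python sketch of how such a flag list could be turned into libvirt feature policies. This is not nova's actual parsing code; `parse_extra_flags` is a hypothetical name, and treating an unprefixed flag the same as '+' is an assumption.

```python
def parse_extra_flags(value):
    """Split a comma-separated flag list into {flag_name: libvirt_policy}."""
    policies = {}
    for raw in value.split(","):
        flag = raw.strip().lower()
        if not flag:
            continue
        # Assumption: a leading '-' disables the flag; '+' or no prefix requires it.
        policy = "disable" if flag.startswith("-") else "require"
        policies[flag.lstrip("+-")] = policy
    return policies


print(parse_extra_flags("-xsaves"))         # {'xsaves': 'disable'}
print(parse_extra_flags("+pcid, -xsaves"))  # {'pcid': 'require', 'xsaves': 'disable'}
```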

This can be used with the unpatched QEMU, as it can use existing CPU models.

I backported the patch to Focal/Ussuri's nova (it's present in Jammy/Yoga),
and built it in PPA [3].

Could you please test it, and see how it goes? More details in [1] and [2].

Thanks!

[1] https://docs.openstack.org/nova/yoga/admin/cpu-models.html#cpu-feature-flags

[2] https://opendev.org/openstack/nova/commit/bcd6b42047ea9422a58a4273d831e23f2ea27092

[3] https://launchpad.net/~mfo/+archive/ubuntu/lp2048517

Mauricio Faria de Oliveira (mfo) wrote :

Hmm, sorry, please hold; the unit tests caught an error.
I'll check that on Monday.

Mauricio Faria de Oliveira (mfo) wrote :

Hi Giuseppe,

Please test nova version 2:21.2.4-0ubuntu2.7~ppa3 in the PPA.
(It's finished building and should be published in some time.)

The config change should be something along the lines of
`cpu_model_extra_flags = -xsaves`

I couldn't test the package/functionality yet, but all of the
build-time unit tests have passed (just like official packages).

Please let us know how it goes.

Thanks,
Mauricio

Changed in qemu (Ubuntu):
status: Confirmed → Incomplete
Giuseppe Petralia (peppepetra) wrote :

Hello Mauricio,

thanks for providing the nova patch. It worked as expected.

I installed it on a node with xsaves enabled and updated nova.conf:

# dpkg -l | grep nova
ii nova-api-metadata 2:21.2.4-0ubuntu2.7~ppa3 all OpenStack Compute - metadata API frontend
ii nova-common 2:21.2.4-0ubuntu2.7~ppa3 all OpenStack Compute - common files
ii nova-compute 2:21.2.4-0ubuntu2.7~ppa3 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:21.2.4-0ubuntu2.7~ppa3 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:21.2.4-0ubuntu2.7~ppa3 all OpenStack Compute - compute node libvirt support
ii python3-nova 2:21.2.4-0ubuntu2.7~ppa3 all OpenStack Compute Python 3 libraries

# grep cpu_model /etc/nova/nova.conf
cpu_model = EPYC-Rome
cpu_model_extra_flags = -xsaves
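
For anyone scripting this check, here is a minimal Python sketch that reads the same options with the standard library's `configparser`. It assumes the options live in the `[libvirt]` section and that `cpu_mode = custom` is set alongside them; neither is shown in the grep output above. The sample text is illustrative, not a copy of the real file.

```python
import configparser


def libvirt_cpu_settings(conf_text):
    """Return the CPU-related options from the [libvirt] section of nova.conf text."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    section = parser["libvirt"] if parser.has_section("libvirt") else {}
    return {
        key: section.get(key)
        for key in ("cpu_mode", "cpu_model", "cpu_model_extra_flags")
    }


# Hypothetical nova.conf excerpt; the section name and cpu_mode line are assumptions.
sample_conf = """
[libvirt]
cpu_mode = custom
cpu_model = EPYC-Rome
cpu_model_extra_flags = -xsaves
"""

print(libvirt_cpu_settings(sample_conf))
# {'cpu_mode': 'custom', 'cpu_model': 'EPYC-Rome', 'cpu_model_extra_flags': '-xsaves'}
```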

Then I restarted nova-compute.

I had the following VM running which was using xsaves:

# virsh dumpxml instance-001499b6 | grep xsaves
    <feature policy='require' name='xsaves'/>

I stopped and started the VM, and the xsaves feature was disabled after that:

# virsh dumpxml instance-001499b6 | grep xsaves
    <feature policy='disable' name='xsaves'/>

That allowed me to migrate the VM to a node with xsaves disabled (newer kernel).

I think the next step should be to start the SRU process for this patch, correct?

Thanks,
Giuseppe

Mauricio Faria de Oliveira (mfo) wrote :

Hi Giuseppe,

That's good news; thanks for testing!

Yes, that's correct. I'll review the patch for SRU considerations (e.g., potential side-effects/regressions for existing users, not always directly clear from the code changes) and proceed if all is OK.

Changed in qemu (Ubuntu):
status: Incomplete → Invalid
Changed in nova (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Mauricio Faria de Oliveira (mfo)
Changed in qemu (Ubuntu):
assignee: Sergio Durigan Junior (sergiodj) → nobody