[SRU] Allow for specifying common baseline CPU model with disabled feature
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Expired | Undecided | Unassigned |
Ussuri | New | Undecided | Unassigned |
Victoria | Won't Fix | Undecided | Unassigned |
Wallaby | Won't Fix | Undecided | Unassigned |
Xena | Won't Fix | Undecided | Unassigned |
Yoga | New | Undecided | Unassigned |
Ubuntu Cloud Archive | Invalid | Undecided | Unassigned |
Ussuri | New | Undecided | Unassigned |
Yoga | Fix Committed | Undecided | Unassigned |
nova (Ubuntu) | Fix Released | Undecided | Unassigned |
Bionic | Won't Fix | Undecided | Unassigned |
Focal | Fix Released | Medium | Rodrigo Barbieri |
Jammy | Fix Released | Medium | Rodrigo Barbieri |
Bug Description
******** SRU TEMPLATE AT THE BOTTOM *******
Hello,
This is very similar to pad.lv/1852437 (and the related blueprint at https:/
A customer I'm working with has two classes of blades that they're trying to use. Their existing ones are Cascade Lake-based; they are presently using the Cascadelake-
The result of this is evident when I try to start nova on the new blades with the Ice Lake CPUs. Even if I specify the following in my nova.conf:
[libvirt]
cpu_mode = custom
cpu_model = Cascadelake-
cpu_model_
That is not enough to allow Nova to start; it fails in the libvirt driver in the _check_
2022-12-15 17:20:59.562 1836708 ERROR oslo_service.
If I make a custom libvirt CPU map file which removes the "<feature name='mpx'/>" feature and specify that as the cpu_model instead, I am able to make Nova start, so it does indeed seem to be specifically that single feature which is blocking me. However, editing the libvirt CPU mapping files is probably not the right way to fix this, which is why I'm filing this bug to discuss how to support cases like this.
The only "proper" workaround I'm aware of right now is to fall back to a Broadwell-based configuration, which lacks the "mpx" feature, as a common baseline. But that's a much older configuration than Cascade Lake and would mean missing out on all the other features that Cascade Lake and Ice Lake have in common. I would rather there were a way to use the Cascade Lake settings but simply remove that "mpx" feature from use.
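To make the request concrete, here is a minimal sketch of the comparison I have in mind: treat each CPU model as a set of feature flags and ignore explicitly disabled flags when checking the host. The feature sets and the `missing_features` helper are hypothetical and heavily simplified; this is not Nova's or libvirt's actual logic.

```python
# Hypothetical, heavily reduced feature sets for illustration only;
# real CPU models carry many more flags than shown here.
BASELINE_FEATURES = {"avx2", "avx512f", "mpx", "pku"}  # stand-in for the Cascade Lake model
HOST_FEATURES = {"avx2", "avx512f", "pku", "sha-ni"}   # stand-in for an Ice Lake host without MPX


def missing_features(model_features, host_features, disabled=frozenset()):
    """Return the model features the host cannot provide.

    Features listed in `disabled` are removed from the model before the
    comparison, so they never count as missing.
    """
    return (set(model_features) - set(disabled)) - set(host_features)


# Without any disabled feature, the single missing flag blocks the model.
print(sorted(missing_features(BASELINE_FEATURES, HOST_FEATURES)))
# With "mpx" disabled, nothing is missing and the model would be usable.
print(sorted(missing_features(BASELINE_FEATURES, HOST_FEATURES, disabled={"mpx"})))
```

With these example sets, the first call reports only "mpx" as missing and the second reports nothing, which matches what I see when I remove that one feature from the CPU map by hand.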
----
Steps to reproduce
==================
On an Ice Lake system lacking the MPX feature (e.g. /proc/cpuinfo reporting a model of "Intel(R) Xeon(R) Gold 5318Y"), specify the following in the [libvirt] section of nova.conf:
[libvirt]
cpu_mode = custom
cpu_model = Cascadelake-
cpu_model_
Then try to start nova.
Expected result
===============
Nova should start since Cascadelake-
Actual result
=============
Nova refuses to start, claiming the specified CPU model is incompatible. The "cpu_model_
Environment
===========
Nova/OpenStack version: OpenStack Ussuri running on Ubuntu Focal. Specifically, nova packages are at version 2:21.2.4-0ubuntu2.
Hypervisor: libvirt + KVM
Other relevant notes
====================
There are some other open related bugs. The removal of the MPX feature in some Ice Lake processors has manifested in other ways as well. Those bugs primarily concern the missing MPX feature breaking how Ice Lake processors are detected, so the nuance is somewhat different, but they may be worth reviewing as well.
* https:/
There is also an interesting comment on this bug at https:/
* https:/
===============
SRU Description
===============
[Impact]
When using IceLake CPUs alongside CascadeLake CPUs, Nova does not start because its CPU model comparison fails. It fails before even comparing the flags. Unfortunately, IceLake CPUs are detected as being compatible with Broadwell, not CascadeLake. Using Broadwell as a common denominator disables many modern features. The libvirt upstream team will not add specific support for IceLake [1]. The fix [2] in Nova is to skip the CPU check (as a configurable workaround) and let libvirt handle the added/removed flags, which is assumed to work for this specific case.
[Test case]
Because our usual lab has no Icelake and Cascadelake CPUs for testing this specific scenario, the test case could be either of the following:
1) run for this SRU is running the charmed-
2) manually deploy nova and the necessary OpenStack services to get Nova to the code point of validating the issue on a single node. I already achieved this and was able to test the fix by hacking the node code to bypass the need for other services (conductor, keystone, mysql, etc.), but for a proper validation a clean installation (without any hackery) is considered mandatory. In that case, the test case would be:
a) Deploy nova and required services in an IceLake machine
b) Make sure the nova.conf has:
cpu_mode = custom
cpu_models = Cascadelake-
cpu_model_
c) Check /var/log/
2024-07-08 15:08:48.378 8399 CRITICAL nova [-] Unhandled error: nova.exception.
d) Install the package containing the fix and confirm that the nova-compute service restarts successfully, with the log no longer containing the error and containing this instead:
2024-07-09 19:41:31.806 243487 DEBUG nova.virt.
<model>
<feature name="mpx" policy="disable"/>
</cpu>
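For reference when checking the log output in step d), the sketch below builds a guest CPU definition of the same shape, with a feature marked policy="disable", using Python's standard xml.etree module. The model name "Cascadelake-Server" is only an illustrative assumption (the value in this report is truncated), and `build_cpu_element` is a hypothetical helper, not a Nova function.

```python
import xml.etree.ElementTree as ET


def build_cpu_element(model, disabled_features):
    """Build a libvirt-style <cpu> element with some features disabled.

    Hypothetical helper for illustration; Nova generates this XML through
    its own config classes.
    """
    cpu = ET.Element("cpu", {"mode": "custom", "match": "exact"})
    ET.SubElement(cpu, "model").text = model
    for name in disabled_features:
        # policy="disable" asks libvirt to hide the feature from the guest.
        ET.SubElement(cpu, "feature", {"name": name, "policy": "disable"})
    return cpu


# "Cascadelake-Server" is an assumed example value, not taken from this report.
cpu = build_cpu_element("Cascadelake-Server", ["mpx"])
print(ET.tostring(cpu, encoding="unicode"))
```

A successful run of the fixed package should log a definition where the "mpx" feature appears with the disable policy, as in the excerpt above.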
[Regression Potential]
There is one new behavior introduced and one changed. The new behavior is gated by a new config option that must be explicitly enabled; when enabled, it skips running the CPU check. The changed behavior is the one that applies under the option's default (disabled) value. The code being backported to Yoga through Ussuri is exactly the same as the code currently in master (Caracal+), so no issues have been found with it across 4 releases, which gives some confidence that the change is unlikely to cause issues.
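The gating described above can be pictured with the following sketch. The names `check_cpu_at_startup`, `skip_check` and `IncompatibleCPU` are hypothetical placeholders chosen for illustration; they are not the option or exception names used by the actual patch.

```python
class IncompatibleCPU(Exception):
    """Hypothetical stand-in for the error raised when the host fails the check."""


def check_cpu_at_startup(host_supports_model, skip_check=False):
    """Run the startup CPU check unless the opt-in workaround is enabled.

    `host_supports_model` is a callable returning True when the configured
    model is usable on this host. `skip_check` models the new option, whose
    default (False) keeps the check in place.
    """
    if skip_check:
        # Workaround enabled: do not compare, leave flag handling to libvirt.
        return "skipped"
    if not host_supports_model():
        raise IncompatibleCPU("configured CPU model is not supported by this host")
    return "checked"


# Default behavior on a compatible host: the check runs and passes.
print(check_cpu_at_startup(lambda: True))
# Workaround enabled on a host that would otherwise fail the check.
print(check_cpu_at_startup(lambda: False, skip_check=True))
```

In this sketch, deployments that never enable the option only ever take the default path, which is why the regression surface is limited to that path.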
[Other Info]
[1] https:/
[2] https:/
summary: Allow for specifying common baseline CPU model with disabled feature → [SRU] Allow for specifying common baseline CPU model with disabled feature
description: updated
tags: added: sts sts-sru-needed
description: updated
Changed in nova (Ubuntu Jammy):
assignee: nobody → Rodrigo Barbieri (rodrigo-barbieri2010)
importance: Undecided → Medium
The reason this doesn't currently work is that Nova does not take the cpu-extra-flags into consideration when checking the host CPU model against the one provided in cpu-models [1]. I believe it should be safe to do that, so we should try to get that added.
The point about capabilities vs. domcapabilities is interesting and if the latter is the more accurate source of information, Nova should probably be using it.
[1] https://github.com/openstack/nova/blob/8a476061c5e034016668cd9e5a20c4430ef6b68d/nova/virt/libvirt/driver.py#L987
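As a rough illustration of what taking the extra flags into consideration could look like, the sketch below splits a list of flag strings into enabled and disabled sets and applies them to a model's feature set before any host comparison. It assumes a "+flag" / "-flag" convention (a bare name counting as enabled); the helper names and the feature sets are hypothetical, and this is not the code at [1].

```python
def split_extra_flags(extra_flags):
    """Split flag strings into (enabled, disabled) sets.

    Assumed convention: "+name" or a bare "name" enables a feature,
    "-name" disables it. Names are compared case-insensitively.
    """
    enabled, disabled = set(), set()
    for raw in extra_flags:
        flag = raw.strip().lower()
        if flag in ("", "+", "-"):
            continue  # ignore empty or sign-only entries
        if flag.startswith("-"):
            disabled.add(flag[1:])
        elif flag.startswith("+"):
            enabled.add(flag[1:])
        else:
            enabled.add(flag)
    return enabled, disabled


def required_features(model_features, extra_flags):
    """Features the host must provide once the extra flags are applied."""
    enabled, disabled = split_extra_flags(extra_flags)
    return (set(model_features) | enabled) - disabled


# Hypothetical model features; "-mpx" removes the flag the Ice Lake host lacks.
print(sorted(required_features({"avx2", "mpx"}, ["-mpx", "+pcid"])))
```

The host check would then compare this adjusted set, rather than the unmodified model, against what the host reports.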