When REBUILDING from UEFI to non-UEFI instance ends up in ERROR state

Bug #1997352 reported by Christian Rohmann
This bug affects 4 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | Fix Released | Undecided | Unassigned | |
| 2024.1 | Fix Committed | Undecided | Unassigned | |

Bug Description

If a UEFI instance is rebuilt using a non-UEFI image as a replacement, e.g. via:

# openstack server create --flavor c4.2xlarge --image ubuntu-22.04-x86_64-uefi --network mynetwork --key-name mykey ubuntu-uefi-test --security-group default

# openstack server rebuild --image ubuntu-22.04-x86_64 ubuntu-uefi-test

The instance ends up in an error state:

```
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/guest.py", line 285, in delete_configuration
    self._domain.undefineFlags(flags)
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 193, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 151, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 132, in execute
    six.reraise(c, e, tb)
  File "/usr/lib/python3/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 86, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/libvirt.py", line 2924, in undefineFlags
    if ret == -1: raise libvirtError ('virDomainUndefineFlags() failed', dom=self)
libvirt.libvirtError: Requested operation is not valid: cannot undefine domain with nvram

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 200, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3095, in terminate_instance
    do_terminate_instance(instance, bdms)
  File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3093, in do_terminate_instance
    self._set_instance_obj_error_state(instance)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3083, in do_terminate_instance
    self._delete_instance(context, instance, bdms)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3018, in _delete_instance
    self._shutdown_instance(context, instance, bdms)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2910, in _shutdown_instance
    self._try_deallocate_network(context, instance,
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 2897, in _shutdown_instance
    self.driver.destroy(context, instance, network_info,
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1423, in destroy
    self.cleanup(context, instance, network_info, block_device_info,
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1493, in cleanup
    return self._cleanup(
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1585, in _cleanup
    self._undefine_domain(instance)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1442, in _undefine_domain
    LOG.error('Error from libvirt during undefine. '
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1433, in _undefine_domain
    guest.delete_configuration(support_uefi)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/guest.py", line 289, in delete_configuration
    self._domain.undefine()
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 193, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 151, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 132, in execute
    six.reraise(c, e, tb)
  File "/usr/lib/python3/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/eventlet/tpool.py", line 86, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/libvirt.py", line 2888, in undefine
    if ret == -1: raise libvirtError ('virDomainUndefine() failed', dom=self)
libvirt.libvirtError: Requested operation is not valid: cannot undefine domain with nvram

```

Additionally, the instance can no longer be deleted. It can only be removed manually by issuing ``virsh undefine [instance_uuid] --nvram``.
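For reference, the same manual cleanup can be done through the libvirt Python bindings. This is a minimal sketch, not something taken from nova: the helper name `undefine_with_nvram` is hypothetical, and the flag value is written out as a literal so the snippet does not need libvirt installed to be read or imported. `lookupByUUIDString` and `undefineFlags` are the regular libvirt-python calls.

```python
# Same numeric value as libvirt.VIR_DOMAIN_UNDEFINE_NVRAM.
VIR_DOMAIN_UNDEFINE_NVRAM = 4


def undefine_with_nvram(conn, instance_uuid):
    """Undefine the domain with the given UUID and remove its NVRAM file.

    `conn` is expected to be a libvirt connection object
    (hypothetical helper, roughly `virsh undefine <uuid> --nvram`).
    """
    domain = conn.lookupByUUIDString(instance_uuid)
    domain.undefineFlags(VIR_DOMAIN_UNDEFINE_NVRAM)


# Usage on the compute host (requires the libvirt Python bindings):
#   import libvirt
#   conn = libvirt.open("qemu:///system")
#   undefine_with_nvram(conn, "<instance_uuid>")
```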

I am pretty certain this is related to https://github.com/openstack/nova/commit/539d381434ccadcdc3f5d58c2705c35558a3a065 which introduced the deletion of nvram if the machine supports UEFI. Likely this info is lost on the switch to the non-UEFI image and therefore this flag is not set.

I am wondering why this flag is not sent to libvirt by default, so that instances can always be deleted, UEFI or not?
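The suspected failure mode can be sketched without a hypervisor. The following is a small simulation, not nova's actual code: `FakeDomain`, `FakeLibvirtError` and this `delete_configuration` are hypothetical stand-ins, and the only libvirt behaviour modelled is that a domain with an NVRAM file refuses to be undefined unless an NVRAM flag is passed. The call order (first `undefineFlags`, then plain `undefine` as a fallback) follows the traceback above.

```python
# Flag values as defined in libvirt's virDomainUndefineFlagsValues enum.
VIR_DOMAIN_UNDEFINE_MANAGED_SAVE = 1
VIR_DOMAIN_UNDEFINE_NVRAM = 4
VIR_DOMAIN_UNDEFINE_KEEP_NVRAM = 8


class FakeLibvirtError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""


class FakeDomain:
    """Hypothetical domain that models only the NVRAM check on undefine."""

    def __init__(self, has_nvram):
        self.has_nvram = has_nvram
        self.defined = True

    def undefineFlags(self, flags):
        nvram_bits = VIR_DOMAIN_UNDEFINE_NVRAM | VIR_DOMAIN_UNDEFINE_KEEP_NVRAM
        if self.has_nvram and not flags & nvram_bits:
            raise FakeLibvirtError("cannot undefine domain with nvram")
        self.defined = False

    def undefine(self):
        # Plain undefine carries no flags at all.
        self.undefineFlags(0)


def delete_configuration(domain, support_uefi):
    """Assumed logic: the NVRAM flag is only added when UEFI is detected."""
    flags = VIR_DOMAIN_UNDEFINE_MANAGED_SAVE
    if support_uefi:
        flags |= VIR_DOMAIN_UNDEFINE_NVRAM
    try:
        domain.undefineFlags(flags)
    except FakeLibvirtError:
        # The fallback has no NVRAM flag either, so it fails the same way.
        domain.undefine()


# A domain created from a UEFI image still has its NVRAM file after the
# image was swapped; if UEFI support is no longer detected, both calls fail.
domain = FakeDomain(has_nvram=True)
try:
    delete_configuration(domain, support_uefi=False)
except FakeLibvirtError as exc:
    print("undefine failed:", exc)
```

With `support_uefi=True` the same call succeeds, which is why always passing the flag would sidestep the problem.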

Balazs Gibizer (balazs-gibizer) wrote :

There were a couple of changes to the UEFI support check since https://github.com/openstack/nova/commit/539d381434ccadcdc3f5d58c2705c35558a3a065 . Which version of OpenStack causes the above error? Is it reproducible on recent master?

Looking through the current code on master, the deletion of the NVRAM does not depend on the guest; it only depends on the host capabilities. So simply changing the guest image to non-UEFI should not mean that we no longer try to remove the NVRAM during delete.

Setting this to INCOMPLETE until exact openstack version / master reproducibility is determined.

tags: added: compute rebuild
Changed in nova:
status: New → Incomplete
Christian Rohmann (christian-rohmann) wrote :

This was tested using OpenStack Xena.

There were no changes to the `delete_configuration` method itself https://opendev.org/openstack/nova/blame/commit/f9dc9f259bfff5ac9424b4acb8575bf8c2463113/nova/virt/libvirt/guest.py#L298, but there was https://bugs.launchpad.net/nova/+bug/1845628 and the resulting https://review.opendev.org/c/openstack/nova/+/685678.

Going along the chain of events, I believe that with the switch to a non-UEFI image during REBUILD, the information that there WAS an NVRAM present due to the previously used UEFI-enabled image is lost. Thus delete_configuration does not work anymore, causing the rebuild to fail.

My take would be:

Why not always delete the NVRAM if the host has UEFI support, just in case? Is that call really that expensive? By calling "delete_configuration" I simply want to end up with a deleted configuration / domain.

Waldemar Reger (wreger) wrote :

This was tested on OpenStack Yoga with the same result.

Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Changed in nova:
status: Expired → Incomplete
Changed in nova:
status: Incomplete → Expired
Christian Rohmann (christian-rohmann) wrote :

Balazs Gibizer, could you please take a look at the code I referenced in https://bugs.launchpad.net/nova/+bug/1997352/comments/2.

I repeat my question: why is the NVRAM not simply deleted "just to be sure", so that the domain is "reconciled" whatever condition it is in?

Changed in nova:
status: Expired → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Changed in nova:
status: Expired → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/906380

Changed in nova:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/2024.1)

Fix proposed to branch: stable/2024.1
Review: https://review.opendev.org/c/openstack/nova/+/915051

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/906380
Committed: https://opendev.org/openstack/nova/commit/406d590a364d2c3ebc91e5f28f94011b158459d2
Submitter: "Zuul (22348)"
Branch: master

commit 406d590a364d2c3ebc91e5f28f94011b158459d2
Author: Simon Hensel <email address hidden>
Date: Tue Jan 23 16:16:17 2024 +0100

    Always delete NVRAM files when deleting instances

    When deleting an instance, always send VIR_DOMAIN_UNDEFINE_NVRAM to
    delete the NVRAM file, regardless of whether the image is of type UEFI.
    This prevents a bug when rebuilding an instance from an UEFI image to a
    non-UEFI image.

    Closes-Bug: #1997352

    Change-Id: I24648f5b7895bf5d093f222b6c6e364becbb531f
    Signed-off-by: Simon Hensel <email address hidden>
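In terms of the flag mask passed to libvirt, the change described in this commit message amounts to roughly the following. This is only an illustration under assumptions: both function names are hypothetical, and the use of VIR_DOMAIN_UNDEFINE_MANAGED_SAVE as the base flag is assumed rather than taken from the patch. The review linked above has the actual change.

```python
# Values from libvirt's virDomainUndefineFlagsValues enum.
VIR_DOMAIN_UNDEFINE_MANAGED_SAVE = 1
VIR_DOMAIN_UNDEFINE_NVRAM = 4


def undefine_flags_before(support_uefi):
    """Hypothetical pre-fix behaviour: NVRAM flag only when UEFI is detected."""
    flags = VIR_DOMAIN_UNDEFINE_MANAGED_SAVE
    if support_uefi:
        flags |= VIR_DOMAIN_UNDEFINE_NVRAM
    return flags


def undefine_flags_after():
    """Hypothetical post-fix behaviour: NVRAM flag is always included."""
    return VIR_DOMAIN_UNDEFINE_MANAGED_SAVE | VIR_DOMAIN_UNDEFINE_NVRAM


# After a rebuild to a non-UEFI image, the old logic drops the NVRAM bit ...
print(bool(undefine_flags_before(False) & VIR_DOMAIN_UNDEFINE_NVRAM))  # False
# ... while the new logic keeps it regardless of the image type.
print(bool(undefine_flags_after() & VIR_DOMAIN_UNDEFINE_NVRAM))  # True
```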

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/2024.1)

Reviewed: https://review.opendev.org/c/openstack/nova/+/915051
Committed: https://opendev.org/openstack/nova/commit/ebd45460f7d6290f42464314e0d5fb864869dcab
Submitter: "Zuul (22348)"
Branch: stable/2024.1

commit ebd45460f7d6290f42464314e0d5fb864869dcab
Author: Simon Hensel <email address hidden>
Date: Tue Jan 23 16:16:17 2024 +0100

    Always delete NVRAM files when deleting instances

    When deleting an instance, always send VIR_DOMAIN_UNDEFINE_NVRAM to
    delete the NVRAM file, regardless of whether the image is of type UEFI.
    This prevents a bug when rebuilding an instance from an UEFI image to a
    non-UEFI image.

    Closes-Bug: #1997352

    Change-Id: I24648f5b7895bf5d093f222b6c6e364becbb531f
    Signed-off-by: Simon Hensel <email address hidden>
    (cherry picked from commit 406d590a364d2c3ebc91e5f28f94011b158459d2)

sean mooney (sean-k-mooney) wrote :

As far as I'm aware, the NVRAM file can contain data set by the guest OS.

From the libvirt docs:
“””
nvram
Some UEFI firmwares may want to use a non-volatile memory to store some variables. In the host, this is represented as a file and the absolute path to the file is stored in this element. Moreover, when the domain is started up libvirt copies so called master NVRAM store file defined in qemu.conf. If needed, the template attribute can be used to per domain override map of master NVRAM stores from the config file. Note, that for transient domains if the NVRAM file has been created by libvirt it is left behind and it is management application's responsibility to save and remove file (if needed to be persistent). Since 1.2.8

Since 8.5.0, it's possible for the element to have type attribute (accepts values file, block and network) in that case the NVRAM storage is described by a <source> sub-element with the same syntax as disk's source. See Hard drives, floppy disks, CDROMs.

Note: network backed NVRAM the variables are not instantiated from the template and it's user's responsibility to provide a valid NVRAM image.

This element supports a format attribute, which has the same semantics as the attribute of the same name for the <loader> element. Since 9.2.0 (QEMU only)

It is not valid to provide this element if the loader is marked as stateless.
“””

Deleting this on rebuild may be OK. Deleting it always could be a pretty major regression and is not something we should do without careful discussion.

Nova does not support stateless firmware today. There is a request to support it going forward, but today our UEFI support is meant to be stateful and the NVRAM file should be preserved in general.

sean mooney (sean-k-mooney) wrote :

Based on https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainUndefineFlagsValues

when we are undefining a domain for any reason other than delete, I think we should be passing

VIR_DOMAIN_UNDEFINE_KEEP_NVRAM and VIR_DOMAIN_UNDEFINE_KEEP_TPM

Even in the rebuild case I believe this is correct.

It's only for nova delete that we should remove them.
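A minimal sketch of that proposal, assuming a single helper decides the flags: the `UndefineFlag` enum and the `undefine_flags` function are hypothetical names and not part of nova; only the numeric values come from libvirt's virDomainUndefineFlagsValues.

```python
from enum import IntFlag


class UndefineFlag(IntFlag):
    """Subset of libvirt's virDomainUndefineFlagsValues (same numeric values)."""

    NVRAM = 4
    KEEP_NVRAM = 8
    TPM = 32
    KEEP_TPM = 64


def undefine_flags(is_delete):
    """Hypothetical helper: remove NVRAM/TPM state only on instance delete."""
    if is_delete:
        return UndefineFlag.NVRAM | UndefineFlag.TPM
    # Rebuild and any other undefine keep the guest's firmware and TPM state.
    return UndefineFlag.KEEP_NVRAM | UndefineFlag.KEEP_TPM


print(int(undefine_flags(is_delete=True)))   # 36
print(int(undefine_flags(is_delete=False)))  # 72
```

Either combination lets libvirt undefine a domain that has an NVRAM file; the difference is whether the file survives.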

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 29.0.2

This issue was fixed in the openstack/nova 29.0.2 release.
