Ironic VIF unplug in destroy may fail due to node locked by cleaning

Bug #2019977 reported by Mark Goddard
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Unassigned

Bug Description

This bug describes a race condition during deletion of a bare metal (Ironic) instance when automated cleaning is enabled in Ironic (which is the default).

# Steps to reproduce

Being a race condition, this is not easy to reproduce reliably, although it has been seen in the wild on several occasions. The easiest way to reproduce it is in a development environment (e.g. devstack), by tweaking some configuration and adding sleeps to particular operations in nova and ironic.

nova.conf configuration: reduce the number of retries for Ironic API polling. The value still needs to be large enough to allow normal operations to complete.

[ironic]
api_max_retries = 10
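
For context, nova-compute retries Ironic API calls that fail with HTTP 409 Conflict, and api_max_retries bounds how many attempts are made before the error is propagated (api_retry_interval sets the pause between attempts). The snippet below is only an illustrative sketch of that behaviour, not nova's or ironicclient's actual retry code:

import time

class Conflict(Exception):
    """Stands in for ironicclient.common.apiclient.exceptions.Conflict."""

def call_with_retries(func, max_retries=10, retry_interval=2):
    """Retry a callable while it raises Conflict (node locked)."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Conflict:
            # Once the attempts are exhausted, the Conflict propagates to
            # the caller, which is what the traceback below shows.
            if attempt == max_retries:
                raise
            time.sleep(retry_interval)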

sleep #1: delay nova-compute in destroy() so that the node has time to move into cleaning before VIF cleanup runs (nova/virt/ironic/driver.py):

        try:
            if node.provision_state in _UNPROVISION_STATES:
                self._unprovision(instance, node)
            else:
                # NOTE(hshiina): if spawn() fails before ironic starts
                # provisioning, instance information should be
                # removed from ironic node.
                self._remove_instance_info_from_node(node, instance)
        finally:
            ##### HERE #######
            time.sleep(20)
            ##### HERE #######
            # NOTE(mgoddard): We don't need to remove instance info at this
            # point since we will have already done it. The destroy will only
            # succeed if this method returns without error, so we will end up
            # removing the instance info eventually.
            self._cleanup_deploy(node, instance, network_info,
                                 remove_instance_info=False)
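
This delay matters because _cleanup_deploy goes on to unplug the VIFs through the Ironic API, and that request needs the node lock. Below is a condensed sketch of the relevant call path, based on the call chain visible in the traceback further down; it is not the complete driver code:

    # Condensed sketch of methods on nova's IronicDriver; the real code does
    # more (error handling, logging, tolerating already-detached VIFs).
    def _cleanup_deploy(self, node, instance, network_info,
                        remove_instance_info=True):
        self._unplug_vifs(node, instance, network_info)

    def _unplug_vifs(self, node, instance, network_info):
        for vif in network_info:
            port_id = str(vif['id'])
            # Raises Conflict (HTTP 409) if the conductor holds the node
            # lock, e.g. while automated cleaning is running.
            self.ironicclient.call("node.vif_detach", node.uuid, port_id)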

sleep #2: make the ironic conductor hold the node lock for longer while cleaning (do_node_clean):

@task_manager.require_exclusive_lock
def do_node_clean(task, clean_steps=None, disable_ramdisk=False):
    """Internal RPC method to perform cleaning of a node.

    :param task: a TaskManager instance with an exclusive lock on its node
    :param clean_steps: For a manual clean, the list of clean steps to
                        perform. Is None For automated cleaning (default).
                        For more information, see the clean_steps parameter
                        of :func:`ConductorManager.do_node_clean`.
    :param disable_ramdisk: Whether to skip booting ramdisk for cleaning.
    """
    ##### HERE #####
    import time
    time.sleep(120)
    ##### HERE #####
    node = task.node
    manual_clean = clean_steps is not None
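
The @task_manager.require_exclusive_lock decorator is what turns this delay into a race window: while do_node_clean holds the exclusive lock, any other request that needs to lock the node (such as the VIF detach issued from nova) is rejected with NodeLocked, which the Ironic API returns as HTTP 409 Conflict. The following is only a toy illustration of that behaviour, not Ironic's actual TaskManager implementation:

import threading

class NodeLocked(Exception):
    """Stands in for ironic.common.exception.NodeLocked (HTTP 409)."""

class ToyNodeLock:
    """Toy stand-in for an exclusive node reservation; the real TaskManager
    reserves the node in the database and supports lock retries."""

    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self):
        # A second caller (e.g. the VIF detach) fails immediately while
        # cleaning still holds the lock.
        if not self._lock.acquire(blocking=False):
            raise NodeLocked("Node is locked, please retry after the "
                             "current operation is completed.")

    def release(self):
        self._lock.release()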

Next, create, then destroy a bare metal instance.
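
With the sleeps in place, an ordinary create/delete cycle is enough to hit the race. A minimal sketch using openstacksdk follows; the cloud, flavor, image and network names are placeholders for whatever the devstack environment provides:

import openstack

conn = openstack.connect(cloud='devstack')
flavor = conn.compute.find_flavor('baremetal')
image = conn.compute.find_image('centos-baremetal')
network = conn.network.find_network('private')

server = conn.compute.create_server(
    name='vif-unplug-race-test',
    flavor_id=flavor.id,
    image_id=image.id,
    networks=[{'uuid': network.id}])
server = conn.compute.wait_for_server(server, wait=3600)

# Delete the instance: the VIF unplug in destroy() now races against the
# conductor, which is holding the node lock for automated cleaning.
conn.compute.delete_server(server)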

# Expected results

The node is deleted.

# Actual results

The node moves to an ERROR state.

Logs from nova-compute:

2023-05-17 14:30:30.493 7 ERROR ironicclient.common.http [req-c8d567dc-7088-4bad-9a09-6ba33b6c7b4d 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] Error contacting Ironic server: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409). Attempt 6 of 6: ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server [req-c8d567dc-7088-4bad-9a09-6ba33b6c7b4d 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] Exception during message handling: ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/exception_wrapper.py", line 72, in wrapped
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server context, exc, binary)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server self.force_reraise()
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server raise self.value
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/exception_wrapper.py", line 63, in wrapped
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 10693, in external_instance_event
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server event.tag)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 10498, in _process_instance_vif_deleted_event
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server self.driver.detach_interface(context, instance, vif)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1630, in detach_interface
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server self.unplug_vifs(instance, [vif])
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1591, in unplug_vifs
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server self._unplug_vifs(node, instance, network_info)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1558, in _unplug_vifs
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server port_id)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/client_wrapper.py", line 180, in call
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server return self._multi_getattr(client, method)(*args, **kwargs)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/v1/node.py", line 512, in vif_detach
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server global_request_id=global_request_id)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/v1/node.py", line 408, in delete
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server global_request_id=global_request_id)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/base.py", line 292, in _delete
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server headers=headers)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 426, in raw_request
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server return self._http_request(url, method, **kwargs)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 287, in wrapper
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server return func(self, url, method, **kwargs)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 387, in _http_request
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server error_json.get('debuginfo'), method, url)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)

# Analysis

This issue is similar to https://bugs.launchpad.net/nova/+bug/1815799, which concerned the cleanup of instance_uuid and instance_info. Ironic has deleted VIF attachments during tear down since Rocky [1], so it should also be safe to skip the VIF unplug when the node has already been deprovisioned. Debug logs confirm that the VIF cleanup is a no-op:

2023-05-17 14:24:11.791 7 DEBUG nova.virt.ironic.driver [req-e419f330-fe14-4c4b-8d5a-b7a40bb77651 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] VIF a5c1622b-eb13-43d3-a46d-aa5781a85cdb isn't attached to Ironic node 3152ff2d-a0cb-4e7a-bcc7-81266106fef2 _unplug_vifs /var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py:1561

[1] https://review.opendev.org/q/I8d683d2d506c97535b5a8f9a5de4c070c7e887df
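
One possible shape of such a fix, sketched against the destroy() snippet quoted above: only unplug VIFs when the node was not just deprovisioned. This is only a sketch of the idea; the actual proposed change is in the review linked in the comment below, and the unplug_vifs keyword argument here is hypothetical:

        try:
            if node.provision_state in _UNPROVISION_STATES:
                self._unprovision(instance, node)
                # Ironic already detached the VIFs during tear down [1], so
                # skipping the unplug avoids contending with the cleaning
                # lock held by the conductor.
                unplug = False
            else:
                self._remove_instance_info_from_node(node, instance)
                unplug = True
        finally:
            self._cleanup_deploy(node, instance, network_info,
                                 remove_instance_info=False,
                                 unplug_vifs=unplug)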

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/883411

Changed in nova:
status: New → In Progress