Attempting to delete a baremetal server places server into ERROR state

Bug #1815799 reported by Lars Kellogg-Stedman
Affects                    Status         Importance  Assigned to   Milestone
OpenStack Compute (nova)   Fix Released   Medium      Mark Goddard
  Rocky                    Fix Committed  Medium      Julia Kreger
  Stein                    Fix Committed  Medium      Julia Kreger

Bug Description

When deleting a baremetal server, we see:

  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] Traceback (most recent call last):
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2698, in do_terminate_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._delete_instance(context, instance, bdms)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/hooks.py", line 154, in inner
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] rv = f(*args, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2635, in _delete_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._shutdown_instance(context, instance, bdms)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2527, in _shutdown_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] requested_networks)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self.force_reraise()
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] six.reraise(self.type_, self.value, self.tb)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2514, in _shutdown_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] block_device_info)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1245, in destroy
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._cleanup_deploy(node, instance, network_info)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 461, in _cleanup_deploy
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._remove_instance_info_from_node(node, instance)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 396, in _remove_instance_info_from_node
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self.ironicclient.call('node.update', node.uuid, patch)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/client_wrapper.py", line 170, in call
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] return self._multi_getattr(client, method)(*args, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 360, in update
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] params=params)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 232, in _update
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] resp, body = self.api.json_request(method, url, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 678, in json_request
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] resp = self._http_request(url, method, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 287, in wrapper
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] return func(self, url, method, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 660, in _http_request
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] error_json.get('debuginfo'), method, url)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated while a state transition is in progress. (HTTP 409)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a]

This places the server into an ERROR state:

  $ openstack server list
  +--------------------------------------+-----------+--------+----------+-------+-----------+
  | ID                                   | Name      | Status | Networks | Image | Flavor    |
  +--------------------------------------+-----------+--------+----------+-------+-----------+
  | dcb4f055-cda4-4d61-ba8f-976645c4e92a | larstest2 | ERROR  |          |       | baremetal |
  +--------------------------------------+-----------+--------+----------+-------+-----------+

A subsequent "openstack server delete" will successfully delete the server.

On the ironic side, we see the following message logged several times before Nova reports the conflict:

  2019-02-13 12:58:15.138 21 DEBUG wsme.api [req-b6f0a1a3-e2cc-4b13-a979-3676757b0972 7a9b1ac45e084e7cbeb9cb740ffe8d08 41ea8af8d00e46438c7be3b182bbb53f - default default] Client-side error: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated while a state transition is in progress. format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222

This is with nova @ ad842aa and ironic @ 4404292.
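
For context, the update that fails is a JSON-PATCH request the ironic virt driver sends to clear the node's instance fields (nova/virt/ironic/driver.py in the traceback above). A minimal sketch of that call path, with the patch contents assumed for illustration:

  from ironicclient import exc as ironic_exc

  # Sketch of the failing call (the patch contents are an assumption).
  # Any node.update is rejected with HTTP 409 while the node is in a
  # state transition, here automated cleaning triggered by tear down.
  def remove_instance_info_from_node(ironicclient, node):
      patch = [{'op': 'remove', 'path': '/instance_info'},
               {'op': 'remove', 'path': '/instance_uuid'}]
      try:
          ironicclient.call('node.update', node.uuid, patch)
      except ironic_exc.Conflict:
          # "Node ... can not be updated while a state transition is in
          # progress. (HTTP 409)" -- the error in the traceback above.
          raise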

Tags: ironic
Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm not sure what nova can do about this besides waiting / retrying the delete of the ironic node before giving up. The libvirt driver has some retry stuff in it:

https://github.com/openstack/nova/blob/63e5cba88af5cd121b498cc46358eef19f83b69f/nova/virt/libvirt/driver.py#L888
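
A sketch of the retry approach suggested here, loosely modeled on the driver's existing behaviour (it already retries via the [ironic] api_retry_interval and api_max_retries options; the helper below is hypothetical, not nova code):

  import time

  from ironicclient import exc as ironic_exc

  # Hypothetical helper illustrating the suggestion: keep retrying the
  # node update while ironic reports a state transition in progress,
  # rather than failing the delete once the retry budget is exhausted.
  def update_node_with_retry(ironicclient, node_uuid, patch,
                             attempts=60, interval=2.0):
      for attempt in range(1, attempts + 1):
          try:
              return ironicclient.call('node.update', node_uuid, patch)
          except ironic_exc.Conflict:
              if attempt == attempts:
                  raise  # give up; the caller places the instance in ERROR
              time.sleep(interval)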

Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
Julia Kreger (juliaashleykreger) wrote :

In this specific case the update is not needed: ironic has long since cleared the field anyway. I'll put up a patch to gracefully handle this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/643484

Changed in nova:
assignee: nobody → Julia Kreger (juliaashleykreger)
status: Confirmed → In Progress
Revision history for this message
Mark Goddard (mgoddard) wrote :

Also hitting this on the Rocky release. It reproduces readily; I'm not sure how we haven't seen it sooner.

While the error message mentions locking, I think this is a red herring - I added debug logging and there is no lock held. The update is rejected because a state transition is in progress while cleaning is performed.
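
A conceptual illustration of the distinction Mark draws (illustrative only, not ironic's actual source): the 409 is driven by the node's transient provision state during cleaning, not by the conductor's reservation lock.

  # Illustrative only, not ironic's code. Even with no lock held
  # (node.reservation is None), updates are refused while the provision
  # state is mid-transition, e.g. during automated cleaning.
  TRANSIENT_PROVISION_STATES = {'deploying', 'cleaning', 'deleting'}

  def update_allowed(node):
      return node.provision_state not in TRANSIENT_PROVISION_STATES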

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/659809

Changed in nova:
assignee: Julia Kreger (juliaashleykreger) → Mark Goddard (mgoddard)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Julia Kreger (<email address hidden>) on branch: master
Review: https://review.opendev.org/643484

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/659809
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7
Submitter: Zuul
Branch: master

commit 060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7
Author: Mark Goddard <email address hidden>
Date: Fri May 17 14:58:59 2019 +0100

    [ironic] Don't remove instance info twice in destroy

    During teardown, at the end of the ironic virt driver's destroy method,
    we call _cleanup_deploy, which since
    https://review.opendev.org/#/c/563722/ removes instance_info from the
    ironic node. Given that we're destroying the node, the instance_info
    will have been cleared from the node anyway, so we don't need to do
    this.

    Further, it can cause tear down to fail if automated cleaning is
    enabled in ironic. This happens because ironic prevents updates to nodes
    that are undergoing a state transition, as is the case during cleaning.
    The virt driver will retry this request by default every 2 seconds with
    60 attempts. Typically this is not long enough for cleaning to complete,
    so tear down fails with the following error:

      Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated
      while a state transition is in progress. (HTTP 409)

    This change skips the instance info update in _cleanup_deploy in the
    case of tear down.

    Change-Id: Iea337f73c231db2cb9d9f639b92475daaede6793
    Closes-Bug: #1815799

Changed in nova:
status: In Progress → Fix Released
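
The shape of the merged change, condensed into a standalone sketch (see the review above for the exact diff; the class, stubs, and the remove_instance_info flag name as shown here are assumptions):

  # Condensed sketch of the fix (see https://review.opendev.org/659809
  # for the real diff); this standalone class and its stubs are
  # assumptions, not nova's code.
  class IronicDriverSketch(object):
      def _unplug_vifs(self, node, instance, network_info):
          pass  # stub: detach VIFs from the ironic node

      def _remove_instance_info_from_node(self, node, instance):
          pass  # stub: the node.update patch that returned HTTP 409

      def _cleanup_deploy(self, node, instance, network_info=None,
                          remove_instance_info=True):
          self._unplug_vifs(node, instance, network_info)
          if remove_instance_info:
              self._remove_instance_info_from_node(node, instance)

      def destroy(self, node, instance, network_info):
          # During tear down ironic clears instance_info itself, so the
          # redundant update that collided with cleaning is skipped.
          self._cleanup_deploy(node, instance, network_info,
                               remove_instance_info=False)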
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/662479

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/662584

Matt Riedemann (mriedem)
Changed in nova:
importance: Low → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/662479
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=95ace7cf8daf4ff935ff6df877554fdfddef9e4a
Submitter: Zuul
Branch: stable/stein

commit 95ace7cf8daf4ff935ff6df877554fdfddef9e4a
Author: Mark Goddard <email address hidden>
Date: Fri May 17 14:58:59 2019 +0100

    [ironic] Don't remove instance info twice in destroy

    During teardown, at the end of the ironic virt driver's destroy method,
    we call _cleanup_deploy, which since
    https://review.opendev.org/#/c/563722/ removes instance_info from the
    ironic node. Given that we're destroying the node, the instance_info
    will have been cleared from the node anyway, so we don't need to do
    this.

    Further, it can cause tear down to fail if automated cleaning is
    enabled in ironic. This happens because ironic prevents updates to nodes
    that are undergoing a state transition, as is the case during cleaning.
    The virt driver will retry this request by default every 2 seconds with
    60 attempts. Typically this is not long enough for cleaning to complete,
    so tear down fails with the following error:

      Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated
      while a state transition is in progress. (HTTP 409)

    This change skips the instance info update in _cleanup_deploy in the
    case of tear down.

    Change-Id: Iea337f73c231db2cb9d9f639b92475daaede6793
    Closes-Bug: #1815799
    (cherry picked from commit 060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.1

This issue was fixed in the openstack/nova 19.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/662584
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1ed3f8f9e487c233fdf714edad57d126fe6b1cd7
Submitter: Zuul
Branch: stable/rocky

commit 1ed3f8f9e487c233fdf714edad57d126fe6b1cd7
Author: Mark Goddard <email address hidden>
Date: Fri May 17 14:58:59 2019 +0100

    [ironic] Don't remove instance info twice in destroy

    During teardown, at the end of the ironic virt driver's destroy method,
    we call _cleanup_deploy, which since
    https://review.opendev.org/#/c/563722/ removes instance_info from the
    ironic node. Given that we're destroying the node, the instance_info
    will have been cleared from the node anyway, so we don't need to do
    this.

    Further, it can cause tear down to fail if automated cleaning is
    enabled in ironic. This happens because ironic prevents updates to nodes
    that are undergoing a state transition, as is the case during cleaning.
    The virt driver will retry this request by default every 2 seconds with
    60 attempts. Typically this is not long enough for cleaning to complete,
    so tear down fails with the following error:

      Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated
      while a state transition is in progress. (HTTP 409)

    This change skips the instance info update in _cleanup_deploy in the
    case of tear down.

    Change-Id: Iea337f73c231db2cb9d9f639b92475daaede6793
    Closes-Bug: #1815799
    (cherry picked from commit 060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7)
    (cherry picked from commit 95ace7cf8daf4ff935ff6df877554fdfddef9e4a)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.2.1

This issue was fixed in the openstack/nova 18.2.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/883411
