Attempting to delete a baremetal server places server into ERROR state

Bug #1815799 reported by Lars Kellogg-Stedman on 2019-02-13
This bug affects 1 person
Affects                     Importance   Assigned to
OpenStack Compute (nova)    Medium       Mark Goddard
  Rocky                     Medium       Julia Kreger
  Stein                     Medium       Julia Kreger

Bug Description

When deleting a baremetal server, we see:

  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] Traceback (most recent call last):
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2698, in do_terminate_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._delete_instance(context, instance, bdms)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/hooks.py", line 154, in inner
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] rv = f(*args, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2635, in _delete_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._shutdown_instance(context, instance, bdms)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2527, in _shutdown_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] requested_networks)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self.force_reraise()
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] six.reraise(self.type_, self.value, self.tb)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2514, in _shutdown_instance
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] block_device_info)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1245, in destroy
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._cleanup_deploy(node, instance, network_info)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 461, in _cleanup_deploy
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self._remove_instance_info_from_node(node, instance)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 396, in _remove_instance_info_from_node
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] self.ironicclient.call('node.update', node.uuid, patch)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/nova/virt/ironic/client_wrapper.py", line 170, in call
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] return self._multi_getattr(client, method)(*args, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/v1/node.py", line 360, in update
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] params=params)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/base.py", line 232, in _update
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] resp, body = self.api.json_request(method, url, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 678, in json_request
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] resp = self._http_request(url, method, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 287, in wrapper
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] return func(self, url, method, **kwargs)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] File "/usr/lib/python2.7/site-packages/ironicclient/common/http.py", line 660, in _http_request
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] error_json.get('debuginfo'), method, url)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a] Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated while a state transition is in progress. (HTTP 409)
  2019-02-13 12:58:16.856 7 ERROR nova.compute.manager [instance: dcb4f055-cda4-4d61-ba8f-976645c4e92a]

This places the server into an ERROR state:

  $ openstack server list
  +--------------------------------------+-----------+--------+----------+-------+-----------+
  | ID                                   | Name      | Status | Networks | Image | Flavor    |
  +--------------------------------------+-----------+--------+----------+-------+-----------+
  | dcb4f055-cda4-4d61-ba8f-976645c4e92a | larstest2 | ERROR  |          |       | baremetal |
  +--------------------------------------+-----------+--------+----------+-------+-----------+

A subsequent "openstack server delete" will successfully delete the server.

On the ironic side, we see the following message logged several times before Nova reports the conflict:

  2019-02-13 12:58:15.138 21 DEBUG wsme.api [req-b6f0a1a3-e2cc-4b13-a979-3676757b0972 7a9b1ac45e084e7cbeb9cb740ffe8d08 41ea8af8d00e46438c7be3b182bbb53f - default default] Client-side error: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated while a state transition is in progress. format_exception /usr/lib/python2.7/site-packages/wsme/api.py:222

This is with nova @ ad842aa and ironic @ 4404292.

tags: added: ironic
Matt Riedemann (mriedem) wrote:

I'm not sure what nova can do about this besides waiting / retrying the delete of the ironic node before giving up. The libvirt driver has some retry stuff in it:

https://github.com/openstack/nova/blob/63e5cba88af5cd121b498cc46358eef19f83b69f/nova/virt/libvirt/driver.py#L888
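
For illustration, a minimal sketch of that wait-and-retry idea applied to the node.update call from the traceback above (a hypothetical helper, not nova's code; in practice the ironic client wrapper's retry behaviour is controlled by the existing [ironic]/api_max_retries and [ironic]/api_retry_interval options in nova.conf):

  # Hypothetical sketch only, not nova's actual code: retry the failing
  # node.update while ironic returns HTTP 409 Conflict, then give up.
  import time

  from ironicclient import exc as ironic_exc

  def update_node_with_retry(ironicclient, node_uuid, patch,
                             attempts=60, interval=2):
      """Retry 'node.update' while the node is locked or changing state."""
      for attempt in range(1, attempts + 1):
          try:
              return ironicclient.call('node.update', node_uuid, patch)
          except ironic_exc.Conflict:
              # 409: node is locked or mid state-transition (e.g. cleaning).
              if attempt == attempts:
                  raise
              time.sleep(interval)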

Changed in nova:
status: New → Confirmed
importance: Undecided → Low

Julia Kreger (juliaashleykreger) wrote:

In this specific case the update is not needed; Ironic has long since nuked the field itself. I'll put up a patch to handle this gracefully.
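
A minimal sketch of what "gracefully handle" could mean here, i.e. tolerating the 409 rather than retrying it (hypothetical and simplified, not the proposed patch; the helper mirrors _remove_instance_info_from_node from the traceback):

  # Hypothetical, simplified sketch -- not the proposed patch. If ironic
  # rejects the update because the node is mid state-transition, tear down
  # has already cleared the field, so log and carry on instead of failing
  # the instance delete.
  from ironicclient import exc as ironic_exc
  from oslo_log import log as logging

  LOG = logging.getLogger(__name__)

  def remove_instance_info_from_node(ironicclient, node, instance):
      patch = [{'op': 'remove', 'path': '/instance_info'}]
      try:
          ironicclient.call('node.update', node.uuid, patch)
      except ironic_exc.Conflict:
          LOG.debug('Skipping instance_info removal for node %(node)s; '
                    'ironic is tearing it down and has already cleared '
                    'the field (instance %(inst)s).',
                    {'node': node.uuid, 'inst': instance.uuid})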

Fix proposed to branch: master
Review: https://review.openstack.org/643484

Changed in nova:
assignee: nobody → Julia Kreger (juliaashleykreger)
status: Confirmed → In Progress
Mark Goddard (mgoddard) wrote:

Also hitting this on the Rocky release. It was pretty obvious; not sure how we've not seen it sooner.

While the error message mentions locking, I think this is a red herring: I added debug logging and there is no lock held. The update is actually prevented because a state transition is in progress while cleaning is performed.
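
The distinction is visible on the node itself: a held lock shows up as a non-empty 'reservation' field, while cleaning shows up as a transient provision_state. A hypothetical debug helper along those lines (assumed names, not code from the patch):

  # Hypothetical debug helper, not part of nova: log the fields that
  # distinguish "locked" (non-empty reservation, i.e. a conductor holds the
  # node) from "state transition in progress" (provision_state such as
  # 'cleaning' or 'clean wait' during automated cleaning).
  def log_node_teardown_state(ironicclient, node_uuid, log):
      node = ironicclient.call('node.get', node_uuid)
      log.debug('node %s: provision_state=%s target_provision_state=%s '
                'reservation=%s',
                node_uuid, node.provision_state,
                node.target_provision_state, node.reservation)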

Fix proposed to branch: master
Review: https://review.opendev.org/659809

Changed in nova:
assignee: Julia Kreger (juliaashleykreger) → Mark Goddard (mgoddard)

Change abandoned by Julia Kreger (<email address hidden>) on branch: master
Review: https://review.opendev.org/643484

Reviewed: https://review.opendev.org/659809
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7
Submitter: Zuul
Branch: master

commit 060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7
Author: Mark Goddard <email address hidden>
Date: Fri May 17 14:58:59 2019 +0100

    [ironic] Don't remove instance info twice in destroy

    During teardown, at the end of the ironic virt driver's destroy method,
    we call _cleanup_deploy, which since
    https://review.opendev.org/#/c/563722/ removes instance_info from the
    ironic node. Given that we're destroying the node, the instance_info
    will have been cleared from the node anyway, so we don't need to do
    this.

    Further, it can cause tear down to fail if automated cleaning is
    enabled in ironic. This happens because ironic prevents updates to nodes
    that are undergoing a state transition, as is the case during cleaning.
    The virt driver will retry this request by default every 2 seconds with
    60 attempts. Typically this is not long enough for cleaning to complete,
    so tear down fails with the following error:

      Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated
      while a state transition is in progress. (HTTP 409)

    This change skips the instance info update in _cleanup_deploy in the
    case of tear down.

    Change-Id: Iea337f73c231db2cb9d9f639b92475daaede6793
    Closes-Bug: #1815799
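
For readers who just want the shape of the change, here is a simplified, self-contained sketch; the class, helpers and argument names are illustrative stubs, not the actual nova diff:

  # Simplified, self-contained sketch of the idea in the change above; the
  # helpers are stubs and the names are illustrative, not the real nova diff.
  class IronicDriverSketch(object):

      def _remove_instance_info_from_node(self, node, instance):
          print('PATCH node %s: remove /instance_info' % node)

      def _unplug_vifs(self, node, instance, network_info):
          print('unplug VIFs for instance %s' % instance)

      def _cleanup_deploy(self, node, instance, network_info=None,
                          remove_instance_info=True):
          # Before the fix this always patched the node, which returns
          # HTTP 409 while automated cleaning is in progress.
          if remove_instance_info:
              self._remove_instance_info_from_node(node, instance)
          self._unplug_vifs(node, instance, network_info)

      def destroy(self, node, instance, network_info):
          # Ironic's own tear down clears instance_info, so skip the
          # node.update call that was racing with cleaning.
          self._cleanup_deploy(node, instance, network_info,
                               remove_instance_info=False)

  IronicDriverSketch().destroy('a00696d5', 'dcb4f055', network_info=[])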

Changed in nova:
status: In Progress → Fix Released
Matt Riedemann (mriedem) on 2019-06-03
Changed in nova:
importance: Low → Medium

Reviewed: https://review.opendev.org/662479
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=95ace7cf8daf4ff935ff6df877554fdfddef9e4a
Submitter: Zuul
Branch: stable/stein

commit 95ace7cf8daf4ff935ff6df877554fdfddef9e4a
Author: Mark Goddard <email address hidden>
Date: Fri May 17 14:58:59 2019 +0100

    [ironic] Don't remove instance info twice in destroy

    During teardown, at the end of the ironic virt driver's destroy method,
    we call _cleanup_deploy, which since
    https://review.opendev.org/#/c/563722/ removes instance_info from the
    ironic node. Given that we're destroying the node, the instance_info
    will have been cleared from the node anyway, so we don't need to do
    this.

    Further, it can cause tear down to fail if automated cleaning is
    enabled in ironic. This happens because ironic prevents updates to nodes
    that are undergoing a state transition, as is the case during cleaning.
    The virt driver will retry this request by default every 2 seconds with
    60 attempts. Typically this is not long enough for cleaning to complete,
    so tear down fails with the following error:

      Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated
      while a state transition is in progress. (HTTP 409)

    This change skips the instance info update in _cleanup_deploy in the
    case of tear down.

    Change-Id: Iea337f73c231db2cb9d9f639b92475daaede6793
    Closes-Bug: #1815799
    (cherry picked from commit 060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7)

This issue was fixed in the openstack/nova 19.0.1 release.

Reviewed: https://review.opendev.org/662584
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1ed3f8f9e487c233fdf714edad57d126fe6b1cd7
Submitter: Zuul
Branch: stable/rocky

commit 1ed3f8f9e487c233fdf714edad57d126fe6b1cd7
Author: Mark Goddard <email address hidden>
Date: Fri May 17 14:58:59 2019 +0100

    [ironic] Don't remove instance info twice in destroy

    During teardown, at the end of the ironic virt driver's destroy method,
    we call _cleanup_deploy, which since
    https://review.opendev.org/#/c/563722/ removes instance_info from the
    ironic node. Given that we're destroying the node, the instance_info
    will have been cleared from the node anyway, so we don't need to do
    this.

    Further, it can cause tear down to fail if automated cleaning is
    enabled in ironic. This happens because ironic prevents updates to nodes
    that are undergoing a state transition, as is the case during cleaning.
    The virt driver will retry this request by default every 2 seconds with
    60 attempts. Typically this is not long enough for cleaning to complete,
    so tear down fails with the following error:

      Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated
      while a state transition is in progress. (HTTP 409)

    This change skips the instance info update in _cleanup_deploy in the
    case of tear down.

    Change-Id: Iea337f73c231db2cb9d9f639b92475daaede6793
    Closes-Bug: #1815799
    (cherry picked from commit 060e42b7fd185bf3b827b7aa5c9ab6131f6a21d7)
    (cherry picked from commit 95ace7cf8daf4ff935ff6df877554fdfddef9e4a)

This issue was fixed in the openstack/nova 18.2.1 release.
