compute service failed to delete

Bug #1860312 reported by tanghang
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

Description
===========
I deployed OpenStack with openstack-helm on Kubernetes. When the nova-compute service (driver=ironic, deployment replica count of 1) breaks down, Kubernetes may reschedule it onto another node. When I then try to delete the old compute service (status down), the deletion fails.

Steps to reproduce
==================
First, OpenStack was deployed in a Kubernetes cluster, with the replica count of nova-compute-ironic set to 1.
* I deleted the pod nova-compute-ironic-xxxxx
* Then I waited for the new pod to start
* Then I ran openstack compute service list; it showed two compute services for ironic, and the old one was down.
* Then I tried to delete the old compute service
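
As a rough illustration of the last two steps, the stale entry can be picked out of the service list programmatically. The sketch below assumes the list was exported as JSON (for example with the CLI's `-f json` formatter) and that each entry carries `ID`, `Binary`, `Host` and `State` keys; those key names are an assumption, and the sample data, IDs and host names are invented for the example.

```python
import json


def stale_compute_services(services_json):
    """Return the nova-compute entries whose state is reported as down."""
    services = json.loads(services_json)
    return [
        svc for svc in services
        if svc.get("Binary") == "nova-compute" and svc.get("State") == "down"
    ]


# Hypothetical sample data; IDs and host names are made up for illustration.
sample = json.dumps([
    {"ID": 7, "Binary": "nova-compute", "Host": "old-host", "State": "down"},
    {"ID": 9, "Binary": "nova-compute", "Host": "new-host", "State": "up"},
    {"ID": 3, "Binary": "nova-scheduler", "Host": "ctl-host", "State": "up"},
])

for svc in stale_compute_services(sample):
    # These are the candidates for "openstack compute service delete <ID>".
    print(svc["ID"], svc["Host"])
```

Filtering on the binary name as well as the state keeps other down services, such as a scheduler, out of the candidate list.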

Expected result
===============
The old compute service is deleted successfully.

Actual result
=============
The deletion failed and returned an HTTP 500.

Environment
===========
1. Exact version of OpenStack you are running. See the following
   18.2.2, rocky

2. Which hypervisor did you use?
   Libvirt + KVM

3. Which storage type did you use?
   Ceph

4. Which networking type did you use?
   Neutron with Open vSwitch

Logs & Configs
==============
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi [req-922cc601-9aa1-4c3d-ad9c-71f73a341c28 40e7b8c3d59943e08a52acd24fe30652 d13f1690c08d41ac854d720ea510a710 - default default] Unexpected exception in API method: ComputeHostNotFound: Compute host mgt-slave03 could not be found.
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 801, in wrapped
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/api/openstack/compute/services.py", line 252, in delete
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi context, service.host)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi result = fn(cls, context, *args, **kwargs)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/objects/compute_node.py", line 443, in get_all_by_host
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi use_slave=use_slave)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 213, in wrapper
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/objects/compute_node.py", line 438, in _db_compute_node_get_all_by_host
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return db.compute_node_get_all_by_host(context, host)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/api.py", line 291, in compute_node_get_all_by_host
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return IMPL.compute_node_get_all_by_host(context, host)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 258, in wrapped
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi return f(context, *args, **kwargs)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 659, in compute_node_get_all_by_host
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi raise exception.ComputeHostNotFound(host=host)
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi ComputeHostNotFound: Compute host mgt-slave03 could not be found.
2020-01-20 06:44:53.480 1 ERROR nova.api.openstack.wsgi

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

I think I found the problem: we delete the ComputeNode record if the virt driver doesn't support the related node:

https://github.com/openstack/nova/blob/85521691a843b9606d4a8aa050f4452ba025eb02/nova/compute/manager.py#L8215-L8221 (Rocky code)

Could you please reproduce the issue and tell us whether you get the above LOG.info()?

If so, I think we should modify https://github.com/openstack/nova/blob/85521691a843b9606d4a8aa050f4452ba025eb02/nova/api/openstack/compute/services.py#L249 to
allow for the possibility of getting ZERO compute nodes for a service.

Please put the bug status back to 'New' once you reply, and ideally ping 'bauzas' on IRC to confirm it so I can triage this bug.

tags: added: api resource-tracker
Changed in nova:
status: New → Incomplete
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

There is a related bug where rebalancing Ironic nodes leads to missing compute nodes in the DB:
https://bugs.launchpad.net/nova/+bug/1853009

Back to my comment #1: I think a better solution for this bug is to fix the root cause of the DB node removal instead of trying to work around it in the os-services API delete.

This is debatable, so I won't officially mark this one as a duplicate of bug 1853009, but I would argue against any bugfix related to the os-services API until we fix the Ironic problem.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
Revision history for this message
Artom Lifshitz (notartom) wrote :

I'd argue this is a valid bug, because it can be hit in the following scenario:

1. Run Ironic

2. Need to replace compute nodes

3. Remove all compute nodes from a single compute service

4. Delete the compute service from step 3. Yes, this is technically operator error - why would you want to delete the service at this point? But Nova has no constraint that says you must not delete a compute service with no associated compute nodes, so this step should still succeed.

Expected:

Step 4 to successfully delete the compute service

Actual:

Step 4 returns error 500 stemming from a ComputeHostNotFound.
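
The failure path can be modelled in a few lines of Python. This is a simplified sketch, not nova's code: `ComputeHostNotFound` and `compute_node_get_all_by_host` are named after entries in the traceback from the bug description, while the in-memory tables, `delete_service_unguarded` and `api_call` are hypothetical stand-ins invented for the illustration.

```python
class ComputeHostNotFound(Exception):
    """Stand-in for the exception seen in the traceback."""


# Toy "database": services by host, and compute nodes per host.
SERVICES = {"host-with-nodes", "host-without-nodes"}
COMPUTE_NODES = {"host-with-nodes": ["node-1"]}


def compute_node_get_all_by_host(host):
    """Mimic the DB lookup: raise when the host has no compute nodes."""
    nodes = COMPUTE_NODES.get(host, [])
    if not nodes:
        raise ComputeHostNotFound(
            "Compute host %s could not be found." % host)
    return nodes


def delete_service_unguarded(host):
    """Delete handler that does not expect a service with zero nodes."""
    compute_node_get_all_by_host(host)  # raises for an empty service
    COMPUTE_NODES.pop(host, None)
    SERVICES.discard(host)
    return 204


def api_call(handler, host):
    """Hypothetical wrapper: any unexpected exception becomes a 500."""
    try:
        return handler(host)
    except Exception:
        return 500


print(api_call(delete_service_unguarded, "host-without-nodes"))
print("host-without-nodes" in SERVICES)
```

In this model the service record survives the failed call, which matches the report: the down service stays in the list and cannot be removed through the API.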

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Actually, the operator would be deleting the compute service after removing the compute nodes.
You should remove the compute service first, but we should fix this regardless.

You should be able to recreate this bug by just creating a compute service
and then deleting it.

Changed in nova:
status: Expired → Triaged
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/802697

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/802840

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/802841

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/802842

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/802843

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/802846

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/802847

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/802848

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/802849

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/802697
Committed: https://opendev.org/openstack/nova/commit/32257a2a6d159406577c885a9d7e366cbf0c72b9
Submitter: "Zuul (22348)"
Branch: master

commit 32257a2a6d159406577c885a9d7e366cbf0c72b9
Author: Artom Lifshitz <email address hidden>
Date: Wed Jul 28 12:48:50 2021 +0200

    Reproducer unit test for bug 1860312

    Consider the following situation:

    - Using the Ironic virt driver
    - Replacing (so removing and re-adding) all baremetal nodes
      associated with a single nova-compute service

    The update resources periodic will have destroyed the compute node
    records because they're no longer being reported by the virt driver.
    If we then attempt to manually delete the compute service record, the
    database layer will raise an exception, as there are no longer any
    compute node records for the host. This exception gets bubbled up as
    an error 500 in the API. This patch adds a unit test to demonstrate
    this.

    Related bug: 1860312
    Change-Id: I03eec634b25582ec9643cacf3e5868c101176983

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/801285
Committed: https://opendev.org/openstack/nova/commit/880611df0b6b967adabd3f08886e385d0a100c5c
Submitter: "Zuul (22348)"
Branch: master

commit 880611df0b6b967adabd3f08886e385d0a100c5c
Author: Artom Lifshitz <email address hidden>
Date: Mon Jul 19 12:58:29 2021 +0200

    Allow deletion of compute service with no compute nodes

    Consider the following situation:

    - Using the Ironic virt driver
    - Replacing (so removing and re-adding) all baremetal nodes
      associated with a single nova-compute service

    The update resources periodic will have destroyed the compute node
    records because they're no longer being reported by the virt driver.
    If we then attempt to manually delete the compute service record, the
    database layer will raise an exception, as there are no longer any
    compute node records for the host. Previously, this exception would
    get bubbled up as an error 500 in the API. This patch catches it and
    allows service deletion to complete successfully.

    Closes bug: 1860312
    Change-Id: I2f9ad3df25306e070c8c3538bfed1212d6d8682f
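
The approach described in this commit message (catch the lookup failure and carry on with the deletion) can be sketched as follows. This is a minimal illustration under stated assumptions, not the merged patch: `FakeDB`, `service_destroy` and `delete_service` are hypothetical names, and only `ComputeHostNotFound` and `compute_node_get_all_by_host` are taken from the traceback in the bug description.

```python
class ComputeHostNotFound(Exception):
    """Stand-in for the exception seen in the traceback."""


class FakeDB:
    """Hypothetical in-memory replacement for the database layer."""

    def __init__(self, services, nodes_by_host):
        self.services = set(services)
        self.nodes_by_host = dict(nodes_by_host)

    def compute_node_get_all_by_host(self, host):
        nodes = self.nodes_by_host.get(host)
        if not nodes:
            raise ComputeHostNotFound(
                "Compute host %s could not be found." % host)
        return list(nodes)

    def service_destroy(self, host):
        self.services.remove(host)
        self.nodes_by_host.pop(host, None)


def delete_service(db, host):
    """Delete a service, treating 'no compute nodes' as an empty list."""
    try:
        nodes = db.compute_node_get_all_by_host(host)
    except ComputeHostNotFound:
        # The periodic task may already have removed every node record.
        nodes = []
    # Any per-node cleanup would iterate over `nodes` here.
    db.service_destroy(host)
    return len(nodes)


db = FakeDB(services={"ironic-host"}, nodes_by_host={})
removed = delete_service(db, "ironic-host")
print(removed, db.services)
```

Catching only the specific exception keeps other database errors visible instead of silently swallowing them.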

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/802840
Committed: https://opendev.org/openstack/nova/commit/e6cd23c3b4928b421b8c706f9cc218020779e367
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit e6cd23c3b4928b421b8c706f9cc218020779e367
Author: Artom Lifshitz <email address hidden>
Date: Wed Jul 28 12:48:50 2021 +0200

    Reproducer unit test for bug 1860312

    Consider the following situation:

    - Using the Ironic virt driver
    - Replacing (so removing and re-adding) all baremetal nodes
      associated with a single nova-compute service

    The update resources periodic will have destroyed the compute node
    records because they're no longer being reported by the virt driver.
    If we then attempt to manually delete the compute service record, the
    database layer will raise an exception, as there are no longer any
    compute node records for the host. This exception gets bubbled up as
    an error 500 in the API. This patch adds a unit test to demonstrate
    this.

    Related bug: 1860312
    Change-Id: I03eec634b25582ec9643cacf3e5868c101176983
    (cherry picked from commit 32257a2a6d159406577c885a9d7e366cbf0c72b9)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/802841
Committed: https://opendev.org/openstack/nova/commit/df5158bf3f80fd4362725dc280de67b88ece9952
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit df5158bf3f80fd4362725dc280de67b88ece9952
Author: Artom Lifshitz <email address hidden>
Date: Mon Jul 19 12:58:29 2021 +0200

    Allow deletion of compute service with no compute nodes

    Consider the following situation:

    - Using the Ironic virt driver
    - Replacing (so removing and re-adding) all baremetal nodes
      associated with a single nova-compute service

    The update resources periodic will have destroyed the compute node
    records because they're no longer being reported by the virt driver.
    If we then attempt to manually delete the compute service record, the
    database layer will raise an exception, as there are no longer any
    compute node records for the host. Previously, this exception would
    get bubbled up as an error 500 in the API. This patch catches it and
    allows service deletion to complete successfully.

    Closes bug: 1860312
    Change-Id: I2f9ad3df25306e070c8c3538bfed1212d6d8682f
    (cherry picked from commit 880611df0b6b967adabd3f08886e385d0a100c5c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/nova/+/802842
Committed: https://opendev.org/openstack/nova/commit/9efdd0b085733a0a4c4192aab8fc870c8aadf316
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 9efdd0b085733a0a4c4192aab8fc870c8aadf316
Author: Artom Lifshitz <email address hidden>
Date: Wed Jul 28 12:48:50 2021 +0200

    Reproducer unit test for bug 1860312

    Consider the following situation:

    - Using the Ironic virt driver
    - Replacing (so removing and re-adding) all baremetal nodes
      associated with a single nova-compute service

    The update resources periodic will have destroyed the compute node
    records because they're no longer being reported by the virt driver.
    If we then attempt to manually delete the compute service record, the
    database layer will raise an exception, as there are no longer any
    compute node records for the host. This exception gets bubbled up as
    an error 500 in the API. This patch adds a unit test to demonstrate
    this.

    Related bug: 1860312
    Change-Id: I03eec634b25582ec9643cacf3e5868c101176983
    (cherry picked from commit 32257a2a6d159406577c885a9d7e366cbf0c72b9)
    (cherry picked from commit e6cd23c3b4928b421b8c706f9cc218020779e367)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/nova/+/802843
Committed: https://opendev.org/openstack/nova/commit/e238cc9cd6ee7ba9d556c0c29f6f69dc3e3a7af9
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit e238cc9cd6ee7ba9d556c0c29f6f69dc3e3a7af9
Author: Artom Lifshitz <email address hidden>
Date: Mon Jul 19 12:58:29 2021 +0200

    Allow deletion of compute service with no compute nodes

    Consider the following situation:

    - Using the Ironic virt driver
    - Replacing (so removing and re-adding) all baremetal nodes
      associated with a single nova-compute service

    The update resources periodic will have destroyed the compute node
    records because they're no longer being reported by the virt driver.
    If we then attempt to manually delete the compute service record, the
    database layer will raise an exception, as there are no longer any
    compute node records for the host. Previously, this exception would
    get bubbled up as an error 500 in the API. This patch catches it and
    allows service deletion to complete successfully.

    Closes bug: 1860312
    Change-Id: I2f9ad3df25306e070c8c3538bfed1212d6d8682f
    (cherry picked from commit 880611df0b6b967adabd3f08886e385d0a100c5c)
    (cherry picked from commit df5158bf3f80fd4362725dc280de67b88ece9952)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.0.0.0rc1

This issue was fixed in the openstack/nova 24.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 22.3.0

This issue was fixed in the openstack/nova 22.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.1.0

This issue was fixed in the openstack/nova 23.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/802848
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/802849
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ussuri)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/802846
Reason: stable/ussuri branch of openstack/nova transitioned to End of Life and is about to be deleted. To be able to do that, all open patches need to be abandoned.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/802847
Reason: stable/ussuri branch of openstack/nova transitioned to End of Life and is about to be deleted. To be able to do that, all open patches need to be abandoned.
