removing compute node causes ComputeHostNotFound in nova-api

Bug #1646255 reported by Vance Morris
This bug affects 3 people
Affects                    Status          Importance   Assigned to      Milestone
OpenStack Compute (nova)   Fix Released    Medium       Matt Riedemann
Newton                     Fix Committed   Medium       Lee Yarwood

Bug Description

I am trying to remove a compute node properly.

Steps to reproduce
==================
1) remove all instances from the hypervisor:
(env) vance@zs95k5:~$ nova hypervisor-servers zs93k23
+----+------+---------------+---------------------+
| ID | Name | Hypervisor ID | Hypervisor Hostname |
+----+------+---------------+---------------------+
+----+------+---------------+---------------------+

2) disable the hypervisor:
(env) vance@zs95k5:~$ nova service-list
+----+----------------+---------------------+----------+----------+-------+----------------------------+-----------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+----+----------------+---------------------+----------+----------+-------+----------------------------+-----------------+
| 3 | nova-cert | juju-605709-2-lxd-3 | internal | enabled | up | 2016-11-30T21:13:34.000000 | - |
| 4 | nova-scheduler | juju-605709-2-lxd-3 | internal | enabled | up | 2016-11-30T21:13:27.000000 | - |
| 5 | nova-conductor | juju-605709-2-lxd-3 | internal | enabled | up | 2016-11-30T21:13:30.000000 | - |
| 14 | nova-compute | u27-maas-machine-1 | nova | disabled | up | 2016-11-30T21:13:28.000000 | - |
| 16 | nova-compute | zs95k181 | nova | enabled | up | 2016-11-30T21:13:33.000000 | - |
| 17 | nova-compute | zs93k23 | nova | enabled | up | 2016-11-30T21:13:33.000000 | - |
+----+----------------+---------------------+----------+----------+-------+----------------------------+-----------------+
(env) vance@zs95k5:~$ nova service-disable zs93k23 nova-compute
+---------+--------------+----------+
| Host | Binary | Status |
+---------+--------------+----------+
| zs93k23 | nova-compute | disabled |
+---------+--------------+----------+

3) delete the compute service
(env) vance@zs95k5:~$ nova service-delete 17
(env) vance@zs95k5:~$ nova service-list
+----+----------------+---------------------+----------+----------+-------+----------------------------+-----------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+----+----------------+---------------------+----------+----------+-------+----------------------------+-----------------+
| 3 | nova-cert | juju-605709-2-lxd-3 | internal | enabled | up | 2016-11-30T21:14:54.000000 | - |
| 4 | nova-scheduler | juju-605709-2-lxd-3 | internal | enabled | up | 2016-11-30T21:14:47.000000 | - |
| 5 | nova-conductor | juju-605709-2-lxd-3 | internal | enabled | up | 2016-11-30T21:14:56.000000 | - |
| 14 | nova-compute | u27-maas-machine-1 | nova | disabled | up | 2016-11-30T21:14:48.000000 | - |
| 16 | nova-compute | zs95k181 | nova | enabled | up | 2016-11-30T21:14:53.000000 | - |
+----+----------------+---------------------+----------+----------+-------+----------------------------+-----------------+

4) delete the neutron agent
(env) vance@zs95k5:~$ openstack network agent list
+--------------------------------------+--------------------+--------------------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+--------------------+--------------------+-------------------+-------+-------+---------------------------+
| 039e4b7a-3dbe-4e87-a9a5-d4b569d3113d | Open vSwitch agent | u27-maas-machine-2 | None | True | UP | neutron-openvswitch-agent |
| 2aa13570-0e62-4198-96d9-dfe732d7874d | DHCP agent | u27-maas-machine-2 | nova | True | UP | neutron-dhcp-agent |
| 2ab2320a-69cb-4a7f-8ae4-541d3b2bdd3b | L3 agent | u27-maas-machine-2 | nova | True | UP | neutron-l3-agent |
| 48d6d83b-e459-46c8-945c-eea4197c01ec | Open vSwitch agent | zs95k181 | None | True | UP | neutron-openvswitch-agent |
| a36eecd5-1fb4-436a-becd-fccc737518fd | Metering agent | u27-maas-machine-2 | None | True | UP | neutron-metering-agent |
| aaee3bf0-f8bd-41b7-94ed-f9213f120016 | Open vSwitch agent | zs93k23 | None | True | UP | neutron-openvswitch-agent |
| c6039c81-ad20-4258-a926-bc4a90dccc96 | Loadbalancer agent | u27-maas-machine-2 | None | True | UP | neutron-lbaas-agent |
| cfecc66c-2888-4c3d-8241-1d3fcd018a1f | Metadata agent | u27-maas-machine-2 | None | True | UP | neutron-metadata-agent |
| f60cbf28-f030-43ae-a598-0d0182529804 | Open vSwitch agent | u27-maas-machine-1 | None | True | UP | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------+-------------------+-------+-------+---------------------------+
(env) vance@zs95k5:~$ openstack network agent delete aaee3bf0-f8bd-41b7-94ed-f9213f120016
(env) vance@zs95k5:~$ openstack network agent list
+--------------------------------------+--------------------+--------------------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+--------------------+--------------------+-------------------+-------+-------+---------------------------+
| 039e4b7a-3dbe-4e87-a9a5-d4b569d3113d | Open vSwitch agent | u27-maas-machine-2 | None | True | UP | neutron-openvswitch-agent |
| 2aa13570-0e62-4198-96d9-dfe732d7874d | DHCP agent | u27-maas-machine-2 | nova | True | UP | neutron-dhcp-agent |
| 2ab2320a-69cb-4a7f-8ae4-541d3b2bdd3b | L3 agent | u27-maas-machine-2 | nova | True | UP | neutron-l3-agent |
| 48d6d83b-e459-46c8-945c-eea4197c01ec | Open vSwitch agent | zs95k181 | None | True | UP | neutron-openvswitch-agent |
| a36eecd5-1fb4-436a-becd-fccc737518fd | Metering agent | u27-maas-machine-2 | None | True | UP | neutron-metering-agent |
| c6039c81-ad20-4258-a926-bc4a90dccc96 | Loadbalancer agent | u27-maas-machine-2 | None | True | UP | neutron-lbaas-agent |
| cfecc66c-2888-4c3d-8241-1d3fcd018a1f | Metadata agent | u27-maas-machine-2 | None | True | UP | neutron-metadata-agent |
| f60cbf28-f030-43ae-a598-0d0182529804 | Open vSwitch agent | u27-maas-machine-1 | None | True | UP | neutron-openvswitch-agent |
+--------------------------------------+--------------------+--------------------+-------------------+-------+-------+---------------------------+

5) check hypervisor list
(env) vance@zs95k5:~$ nova hypervisor-list
ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'nova.exception.ComputeHostNotFound'> (HTTP 500) (Request-ID: req-2b773323-be4f-4c8c-be05-a41144ebed76)

Okay, here is the error from the log:
http://paste.ubuntu.com/23559892/

Expected result
===============
The Nova compute node is properly removed from the OpenStack database.

Environment
===========
1. OpenStack nova version 2:13.1.2-0ubuntu2

2. Hypervisor:
   Libvirt+KVM (KVM for IBM z Systems)

3. Networking type: OVS+GRE

Vance Morris (vmorris)
tags: added: api
tags: added: db
Vance Morris (vmorris)
description: updated
Frank Heimes (fheimes)
tags: added: openstack-ibm
Matt Riedemann (mriedem) wrote :

Can you dump the services and compute_nodes tables from the database? Does zs93k23 still show up? It might, but the deleted column should be a non-zero value.
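
A minimal sketch of that check, using an in-memory SQLite database in place of the real Nova database. The three-column schema and the compute node id are simplifying assumptions for illustration (the real services and compute_nodes tables have many more columns); the sketch only shows the soft-delete convention, where a live row has deleted = 0 and a soft-deleted row has a non-zero value.

```python
import sqlite3


def rows_for_host(conn, host):
    """Return (table, id, deleted) for every row that references the host."""
    found = []
    for table in ("services", "compute_nodes"):
        query = "SELECT id, deleted FROM {} WHERE host = ?".format(table)
        for row_id, deleted in conn.execute(query, (host,)):
            found.append((table, row_id, deleted))
    return found


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (id INTEGER, host TEXT, deleted INTEGER)")
conn.execute("CREATE TABLE compute_nodes (id INTEGER, host TEXT, deleted INTEGER)")
# Service 17 was removed with "nova service-delete 17": deleted is non-zero.
conn.execute("INSERT INTO services VALUES (17, 'zs93k23', 17)")
# The compute node row is still live (deleted = 0); its id is made up.
conn.execute("INSERT INTO compute_nodes VALUES (2, 'zs93k23', 0)")

result = rows_for_host(conn, "zs93k23")
for table, row_id, deleted in result:
    state = "soft-deleted" if deleted else "live"
    print(table, row_id, state)
```

A soft-deleted services row next to a live compute_nodes row for the same host is the mismatch being asked about.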

Matt Riedemann (mriedem) wrote :

Looks like it fails here:

https://github.com/openstack/nova/blob/13.1.2/nova/api/openstack/compute/hypervisors.py#L98

Which means the service record is deleted, but the compute_node record isn't.
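
The failure mode can be sketched in a few lines. This is a simplified model, not the Nova code: the exception class is a stand-in for nova.exception.ComputeHostNotFound, and the function names and dict-based records are hypothetical.

```python
class ComputeHostNotFound(Exception):
    """Stand-in for nova.exception.ComputeHostNotFound."""


def service_get_by_compute_host(services, host):
    # Look up the nova-compute service record for a host.
    for svc in services:
        if svc["host"] == host and svc["binary"] == "nova-compute":
            return svc
    raise ComputeHostNotFound(host)


def list_hypervisors(compute_nodes, services):
    # Unguarded loop: one orphaned compute node aborts the whole listing.
    result = []
    for node in compute_nodes:
        svc = service_get_by_compute_host(services, node["host"])
        result.append((node["host"], svc["status"]))
    return result


services = [{"host": "zs95k181", "binary": "nova-compute", "status": "enabled"}]
# zs93k23 still has a compute node record but no service record.
compute_nodes = [{"host": "zs95k181"}, {"host": "zs93k23"}]

try:
    list_hypervisors(compute_nodes, services)
    failed = False
except ComputeHostNotFound as exc:
    failed = True
    print("lookup failed for host:", exc)
```

Because the exception is not caught inside the loop, the healthy host zs95k181 is not returned either, which matches the HTTP 500 seen for the whole hypervisor-list call.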

I'm not sure what deletes the compute node record in the database. The resource tracker running in the nova-compute service on the host creates the compute node record, but I would have to dig into what destroys it.

Is the nova-compute service still running on the host? I ask because I think that even if you deleted the compute node record, a service that is still running will automatically recreate it in the update_available_resource periodic task on the compute.
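
A toy model of that suspected behaviour, assuming the periodic task simply creates the record whenever it is missing. The function body and the record fields are hypothetical; only the task name comes from the comment above.

```python
def update_available_resource(compute_nodes, host):
    """Simplified periodic task: ensure a compute node record exists."""
    if host not in compute_nodes:
        # Record is missing, so the running service creates it again.
        compute_nodes[host] = {"host": host, "vcpus_used": 0}
        return "created"
    return "updated"


compute_nodes = {"zs93k23": {"host": "zs93k23", "vcpus_used": 0}}

# The record is removed while nova-compute is still running on the host.
del compute_nodes["zs93k23"]

# The next run of the periodic task puts the record back.
outcome = update_available_resource(compute_nodes, "zs93k23")
print(outcome, sorted(compute_nodes))
```

Under this assumption, removing the record only sticks if the nova-compute process on the host is stopped first.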

Matt Riedemann (mriedem) wrote :

Looks like this is where the orphaned compute node record would be deleted:

https://github.com/openstack/nova/blob/13.1.2/nova/compute/manager.py#L6513

Alex Xu (xuhj) wrote :

Looks like https://github.com/openstack/nova/blob/13.1.2/nova/compute/manager.py#L6513 doesn't help with removing the compute node record. It only deletes compute nodes that the hypervisor doesn't know about.

We probably have the same problem when the compute service is down: if the user then removes the service record, nothing is left that can remove the compute node record.
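
The distinction can be illustrated with a small sketch. It assumes the cleanup compares the compute node names stored in the database with the names the virt driver currently reports; the function and the "old-node" name are hypothetical.

```python
def orphaned_nodes(db_nodes, driver_nodes):
    """Return DB compute node names that the driver no longer reports."""
    reported = set(driver_nodes)
    return [name for name in db_nodes if name not in reported]


# Case 1: the driver stopped reporting "old-node", so it would be cleaned up.
case1 = orphaned_nodes(["zs93k23", "old-node"], ["zs93k23"])

# Case 2: the service record was deleted, but the hypervisor still reports
# zs93k23, so its compute node record is not treated as orphaned.
case2 = orphaned_nodes(["zs93k23"], ["zs93k23"])

print(case1, case2)
```

In the second case nothing is selected for deletion, and if the compute service is down the cleanup never runs at all, so the record stays either way.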

Matt Riedemann (mriedem) wrote :

Looks like you have to delete the compute node from the database manually as we have no APIs to do it:

http://www.thereluctanttecchie.com/openstack-removing-a-compute-node-in-icehouse/

There are several articles and support pages similar to the one above, so we need to handle that 404 in the services API code and ignore it.

Changed in nova:
status: New → Triaged
assignee: nobody → Matt Riedemann (mriedem)
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/406627

Changed in nova:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/406627
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f0d44c5b09f3f3c84038d40b621bb629a1f8110e
Submitter: Jenkins
Branch: master

commit f0d44c5b09f3f3c84038d40b621bb629a1f8110e
Author: Matt Riedemann <email address hidden>
Date: Sun Dec 4 15:08:04 2016 -0500

    Handle ComputeHostNotFound when listing hypervisors

    Compute node resources must currently be deleted manually
    in the database, and as such they can reference service
    records which have been deleted via the services delete API.
    Because of this when listing hypervisors (compute nodes), we
    may get a ComputeHostNotFound error when trying to lookup a
    service record for a compute node where the service was
    deleted. This causes the API to fail with a 500 since it's not
    handled.

    This change handles the ComputeHostNotFound when looping over
    compute nodes in the hypervisors index and detail methods and
    simply ignores them.

    Change-Id: I2717274bb1bd370870acbf58c03dc59cee30cc5e
    Closes-Bug: #1646255
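
The approach described in the commit message can be sketched as follows. This is an illustration only, not the merged patch: the exception class is a stand-in for nova.exception.ComputeHostNotFound, and get_service, list_hypervisors and the dict-based records are hypothetical names.

```python
import logging

LOG = logging.getLogger(__name__)


class ComputeHostNotFound(Exception):
    """Stand-in for nova.exception.ComputeHostNotFound."""


def get_service(services_by_host, host):
    # Hypothetical lookup; raises when the service record was deleted.
    try:
        return services_by_host[host]
    except KeyError:
        raise ComputeHostNotFound(host)


def list_hypervisors(compute_nodes, services_by_host):
    hypervisors = []
    for node in compute_nodes:
        try:
            service = get_service(services_by_host, node["host"])
        except ComputeHostNotFound:
            # Orphaned compute node: log it and leave it out of the listing.
            LOG.debug("No service record for compute node %s", node["host"])
            continue
        hypervisors.append(
            {"hostname": node["host"], "status": service["status"]})
    return hypervisors


services_by_host = {"zs95k181": {"status": "enabled"}}
compute_nodes = [{"host": "zs95k181"}, {"host": "zs93k23"}]
listing = list_hypervisors(compute_nodes, services_by_host)
print(listing)
```

Catching the exception per compute node means an orphaned record no longer fails the whole request; the remaining hypervisors are still listed.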

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/407961

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/407961
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5931f555568eb52235dd28bb520f354d884b7ea4
Submitter: Jenkins
Branch: stable/newton

commit 5931f555568eb52235dd28bb520f354d884b7ea4
Author: Matt Riedemann <email address hidden>
Date: Sun Dec 4 15:08:04 2016 -0500

    Handle ComputeHostNotFound when listing hypervisors

    Compute node resources must currently be deleted manually
    in the database, and as such they can reference service
    records which have been deleted via the services delete API.
    Because of this when listing hypervisors (compute nodes), we
    may get a ComputeHostNotFound error when trying to lookup a
    service record for a compute node where the service was
    deleted. This causes the API to fail with a 500 since it's not
    handled.

    This change handles the ComputeHostNotFound when looping over
    compute nodes in the hypervisors index and detail methods and
    simply ignores them.

    Change-Id: I2717274bb1bd370870acbf58c03dc59cee30cc5e
    Closes-Bug: #1646255
    (cherry picked from commit f0d44c5b09f3f3c84038d40b621bb629a1f8110e)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.0.0.0b2

This issue was fixed in the openstack/nova 15.0.0.0b2 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 14.0.3

This issue was fixed in the openstack/nova 14.0.3 release.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/553598

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/553598
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e0c43ac7a7b9edbd3eb4921d87f476a33364549b
Submitter: Zuul
Branch: master

commit e0c43ac7a7b9edbd3eb4921d87f476a33364549b
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 15 16:05:46 2018 -0400

    api-ref: add a note in DELETE /os-services about deleting computes

    As seen in I2717274bb1bd370870acbf58c03dc59cee30cc5e, if an
    operator deletes a nova-compute service via the API but fails
    to stop the actual nova-compute process, the API will delete the
    services table and compute_nodes table records, but the nova-compute
    process will hit the 'update_available_resource' periodic task and
    re-create the compute node record, which at that point is orphaned
    since it does not have an associated nova-compute service.

    Restarting the nova-compute service _should_ recreate the service
    table record for that host and then the user could attempt to
    delete the service and compute node record again via this API, but
    it's better to not have to find this out the hard way.

    So this change adds a simple reminder that the nova-compute process
    on the host should be stopped before deleting the compute service.

    While in here, the link to the install guide about the various
    compute services is also fixed.

    Change-Id: I68f2074814c3ae890888a5c75fd2870bb99f0e08
    Related-Bug: #1646255
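
An operator-side guard for that reminder might look like the sketch below. It is an assumption-laden illustration: safe_to_delete is a hypothetical helper, the row is shaped like a line of `nova service-list` output, and treating state "down" as evidence that the process was stopped is a heuristic rather than something the API enforces.

```python
def safe_to_delete(service):
    """Return True if the row suggests nova-compute has been stopped."""
    return (service["binary"] == "nova-compute"
            and service["status"] == "disabled"
            and service["state"] == "down")


# Service 17 as it would look right after "nova service-disable":
# disabled, but the process is still reporting in.
row = {"id": 17, "binary": "nova-compute", "host": "zs93k23",
       "status": "disabled", "state": "up"}

print(safe_to_delete(row))                      # still running
print(safe_to_delete(dict(row, state="down")))  # process stopped
```

Checking the state before calling the delete API avoids the orphaned compute node record described in the commit message.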

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/563233

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/563233
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1b7ba3a200ff3b7d7a26c38de9135f7439caa4eb
Submitter: Zuul
Branch: stable/queens

commit 1b7ba3a200ff3b7d7a26c38de9135f7439caa4eb
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 15 16:05:46 2018 -0400

    api-ref: add a note in DELETE /os-services about deleting computes

    As seen in I2717274bb1bd370870acbf58c03dc59cee30cc5e, if an
    operator deletes a nova-compute service via the API but fails
    to stop the actual nova-compute process, the API will delete the
    services table and compute_nodes table records, but the nova-compute
    process will hit the 'update_available_resource' periodic task and
    re-create the compute node record, which at that point is orphaned
    since it does not have an associated nova-compute service.

    Restarting the nova-compute service _should_ recreate the service
    table record for that host and then the user could attempt to
    delete the service and compute node record again via this API, but
    it's better to not have to find this out the hard way.

    So this change adds a simple reminder that the nova-compute process
    on the host should be stopped before deleting the compute service.

    While in here, the link to the install guide about the various
    compute services is also fixed.

    Change-Id: I68f2074814c3ae890888a5c75fd2870bb99f0e08
    Related-Bug: #1646255
    (cherry picked from commit e0c43ac7a7b9edbd3eb4921d87f476a33364549b)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/580494
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=93854e47e09110ea8fd1b6111813f87d5d8501cc
Submitter: Zuul
Branch: stable/pike

commit 93854e47e09110ea8fd1b6111813f87d5d8501cc
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 15 16:05:46 2018 -0400

    api-ref: add a note in DELETE /os-services about deleting computes

    As seen in I2717274bb1bd370870acbf58c03dc59cee30cc5e, if an
    operator deletes a nova-compute service via the API but fails
    to stop the actual nova-compute process, the API will delete the
    services table and compute_nodes table records, but the nova-compute
    process will hit the 'update_available_resource' periodic task and
    re-create the compute node record, which at that point is orphaned
    since it does not have an associated nova-compute service.

    Restarting the nova-compute service _should_ recreate the service
    table record for that host and then the user could attempt to
    delete the service and compute node record again via this API, but
    it's better to not have to find this out the hard way.

    So this change adds a simple reminder that the nova-compute process
    on the host should be stopped before deleting the compute service.

    While in here, the link to the install guide about the various
    compute services is also fixed.

    Change-Id: I68f2074814c3ae890888a5c75fd2870bb99f0e08
    Related-Bug: #1646255
    (cherry picked from commit e0c43ac7a7b9edbd3eb4921d87f476a33364549b)
    (cherry picked from commit 1b7ba3a200ff3b7d7a26c38de9135f7439caa4eb)

tags: added: in-stable-pike