Nova tries to connect to deleted VM causing API error and Horizon error

Bug #1506054 reported by Nell Jerram
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

This is a problem I'm seeing on a 3-node DevStack system, stacked today (14th October 2015). When I click on Hypervisors in the System section of the Horizon UI to see hypervisor information, Horizon shows an error box: "Unable to retrieve hypervisor information".

/var/log/apache2/horizon_error.log includes this:

2015-10-14 11:04:00.379206 REQ: curl -g -i 'http://calico-vm18:8774/v2.1/3c8d3e6c106c44b6ad37140023b92df8/os-hypervisors/detail' -X GET -H "Accept: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Project-Id: 3c8d3e6c106c44b6ad37140023b92df8" -H "X-Auth-Token: {SHA1}db34b44f5be5800c12f48ab59251dc8277eb0b0c"
2015-10-14 11:04:00.420325 RESP: [500] {'content-length': '208', 'x-compute-request-id': 'req-6881cfe9-5962-4105-bf46-2d164ef4451c', 'vary': 'X-OpenStack-Nova-API-Version', 'connection': 'keep-alive', 'x-openstack-nova-api-version': '2.1', 'date': 'Wed, 14 Oct 2015 11:04:00 GMT', 'content-type': 'application/json; charset=UTF-8'}
2015-10-14 11:04:00.420358 RESP BODY: {"computeFault": {"message": "Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.\\n<class 'nova.exception.ComputeHostNotFound'>", "code": 500}}
2015-10-14 11:04:00.420365
2015-10-14 11:04:00.420842 Recoverable error: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.

Otherwise the cluster appears to be working; for example, I've successfully launched 10 cirros instances across the 3 available compute nodes.
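
For reference, the same failure can be reproduced outside Horizon with python-novaclient. The snippet below is a minimal sketch only; it assumes the usual OS_* admin credentials are exported in the shell (as with a standard DevStack admin openrc):

    import os
    from novaclient import client

    # Build a client against the same v2.1 endpoint that Horizon uses.
    nova = client.Client('2.1',
                         os.environ['OS_USERNAME'],
                         os.environ['OS_PASSWORD'],
                         os.environ['OS_TENANT_NAME'],
                         os.environ['OS_AUTH_URL'])

    # Mirrors Horizon's GET /os-hypervisors/detail; with a stale compute node
    # record present, novaclient raises a 500 ClientException here instead of
    # returning the hypervisor list.
    for hyp in nova.hypervisors.list():
        print(hyp.hypervisor_hostname, hyp.hypervisor_type)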

ubuntu@calico-vm18:/opt/stack/nova$ git log -1
commit 1d97b0e7308975cd4a912dda00df8726ba776fe7
Merge: 7200b6a 11e20dd
Author: Jenkins <email address hidden>
Date: Wed Oct 14 08:50:59 2015 +0000

    Merge "Ignore errorcode=4 when executing `cryptsetup remove` command"

Tags: compute
Revision history for this message
Nell Jerram (neil-jerram) wrote :

The n-api.log includes several instances of this:

2015-10-14 12:45:41.416 DEBUG nova.api.openstack.wsgi [req-722ba30b-5b9d-4f57-95dd-c6173f8d427c admin admin] Calling method '<bound method HypervisorsController.detail of <nova.api.openstack.compute.hypervisors.HypervisorsController object at 0x7fd447b7eed0>>' _process_stack /opt/stack/nova/nova/api/openstack/wsgi.py:792
2015-10-14 12:45:41.432 ERROR nova.api.openstack.extensions [req-722ba30b-5b9d-4f57-95dd-c6173f8d427c admin admin] Unexpected exception in API method
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions Traceback (most recent call last):
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/api/openstack/extensions.py", line 478, in wrapped
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions return f(*args, **kwargs)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/api/openstack/compute/hypervisors.py", line 101, in detail
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions for hyp in compute_nodes])
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/compute/api.py", line 3478, in service_get_by_compute_host
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions return objects.Service.get_by_compute_host(context, host_name)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 171, in wrapper
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions result = fn(cls, context, *args, **kwargs)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/objects/service.py", line 223, in get_by_compute_host
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions db_service = db.service_get_by_compute_host(context, host)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/db/api.py", line 145, in service_get_by_compute_host
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions use_slave=use_slave)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 512, in service_get_by_compute_host
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions raise exception.ComputeHostNotFound(host=host)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions ComputeHostNotFound: Compute host calico-vm23 could not be found.
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions
2015-10-14 12:45:41.488 INFO nova.api.openstack.wsgi [req-722ba30b-5b9d-4f57-95dd-c6173f8d427c admin admin] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'nova.exception.ComputeHostNotFound'>
2015-10-14 12:45:41.488 DEBUG nova.api.openstack.wsgi [req-722ba30b-5b9d-4f57-95dd-c6173f8d427c admin admin] Returning 500 to user: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
...
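
The traceback shows the os-hypervisors handler iterating the compute node records and then looking up the service entry for each host, so the ComputeHostNotFound for calico-vm23 points at a compute_nodes row whose host no longer has a matching services record. As a quick check of which compute services the controller still knows about, the following hedged sketch (same OS_* credential assumptions as the earlier snippet) lists them:

    import os
    from novaclient import client

    nova = client.Client('2.1',
                         os.environ['OS_USERNAME'],
                         os.environ['OS_PASSWORD'],
                         os.environ['OS_TENANT_NAME'],
                         os.environ['OS_AUTH_URL'])

    # If calico-vm23 is absent here while the hypervisor listing still trips
    # over it, the leftover record is in compute_nodes rather than services.
    for svc in nova.services.list(binary='nova-compute'):
        print(svc.host, svc.state, svc.status)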


Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

The important information is this one, I guess:

    2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions
    ComputeHostNotFound: Compute host calico-vm23 could not be found.

    calico-vm23 is a node that was previously set up as a compute node
    (actually from a stacking a couple of weeks ago) but has now been
    wiped.

    So I guess the question is: how do I get Nova on the controller to
    stop trying to communicate with calico-vm23 ?

The "nova/compute/manager.py" has a periodic task which should cleanup
this king of things, IIUC:

    @periodic_task.periodic_task(spacing=CONF.update_resources_interval)
    def update_available_resource(self, context):
        """See driver.get_available_resource()

        Periodic process that keeps that the compute host's understanding of
        resource availability and usage in sync with the underlying hypervisor.

        :param context: security context
        """
        new_resource_tracker_dict = {}

        compute_nodes_in_db = self._get_compute_nodes_in_db(context,
                                                            use_slave=True)
        nodenames = set(self.driver.get_available_nodes())
        for nodename in nodenames:
            rt = self._get_resource_tracker(nodename)
            try:
                rt.update_available_resource(context)
            except exception.ComputeHostNotFound:
                # NOTE(comstud): We can get to this case if a node was
                # marked 'deleted' in the DB and then re-added with a
                # different auto-increment id. The cached resource
                # tracker tried to update a deleted record and failed.
                # Don't add this resource tracker to the new dict, so
                # that this will resolve itself on the next run.
                LOG.info(_LI("Compute node '%s' not found in "
                             "update_available_resource."), nodename)
                continue
            except Exception as e:
                LOG.error(_LE("Error updating resources for node "
                              "%(node)s: %(e)s"),
                          {'node': nodename, 'e': e})
            new_resource_tracker_dict[nodename] = rt

        # NOTE(comstud): Replace the RT cache before looping through
        # compute nodes to delete below, as we can end up doing greenthread
        # switches there. Best to have everyone using the newest cache
        # ASAP.
        self._resource_tracker_dict = new_resource_tracker_dict

        # Delete orphan compute node not reported by driver but still in db
        for cn in compute_nodes_in_db:
            if cn.hypervisor_hostname not in nodenames:
                LOG.info(_LI("Deleting orphan compute node %s") % cn.id)
                cn.destroy()

That's at least the only occurrence of the ComputeNode method "destroy()", which should remove this entry from the database. I didn't find a REST API to trigger this as an end user.
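
In the absence of an end-user API, one possible manual workaround is to soft-delete the orphaned row directly in the nova database. The following is only a sketch: the connection URL is hypothetical (take the real one from the [database] section of nova.conf), the table and column names are assumptions based on the Liberty-era schema, and direct database surgery should only be attempted with a backup.

    import datetime
    from sqlalchemy import create_engine, text

    # Hypothetical connection string; substitute the real one from nova.conf.
    engine = create_engine('mysql+pymysql://root:secret@127.0.0.1/nova')

    STALE_HOST = 'calico-vm23'  # the wiped compute node from the traceback

    with engine.begin() as conn:
        rows = conn.execute(
            text("SELECT id, hypervisor_hostname FROM compute_nodes "
                 "WHERE hypervisor_hostname = :host AND deleted = 0"),
            {'host': STALE_HOST}).fetchall()
        for cn_id, hostname in rows:
            print('soft-deleting orphan compute_nodes row', cn_id, hostname)
            # Nova's soft-delete convention: set deleted to the row id.
            conn.execute(
                text("UPDATE compute_nodes SET deleted = id, deleted_at = :now "
                     "WHERE id = :cn_id"),
                {'now': datetime.datetime.utcnow(), 'cn_id': cn_id})

Soft-deleting rather than removing the row is intended to mirror what cn.destroy() does via nova's soft-delete convention, but again, verify against your own schema first.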

tags: added: compute
Revision history for this message
Augustina Ragwitz (auggy) wrote :

Hi Neil, I just wanted to follow up to see if you're still experiencing this error. If so, it's possible the compute manager cleanup didn't happen as it should have, and we can look into it further. If, however, you are no longer experiencing this, then maybe you can provide some info (if you still remember or have it handy) as to how long it took Nova to stop trying to connect to the deleted VM? Thanks!

summary: - Can't retrieve hypervisor info - Nova Unexpected API error
+ Nova tries to connect to deleted VM causing API error and Horizon error
Changed in nova:
status: New → Incomplete
Revision history for this message
Ross Golder (ross-golder) wrote :

FWIW, I've just started looking into exactly the same problem as described above (same symptoms, same stack traces in the logs), in this case on a three-node Ubuntu Wily based stack.

Revision history for this message
Nell Jerram (neil-jerram) wrote :

@Augustina - No, I'm not still seeing this myself. I think it originally occurred when one of my compute node installs failed in some way, and I've since improved my automation so that kind of failure no longer happens.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Ross+@Neil: Maybe the same node exists twice in the database but with different IDs. Could you add the contents of the nova db "instances" table?

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired