Nova tries to connect to deleted VM causing API error and Horizon error

Bug #1506054 reported by Nell Jerram
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

This is a problem I'm seeing on a 3-node DevStack system, stacked today (14th October 2015). When I click on Hypervisors in the System section of the Horizon UI to see hypervisor information, Horizon shows an error box: "Unable to retrieve hypervisor information".

/var/log/apache2/horizon_error.log includes this:

2015-10-14 11:04:00.379206 REQ: curl -g -i 'http://calico-vm18:8774/v2.1/3c8d3e6c106c44b6ad37140023b92df8/os-hypervisors/detail' -X GET -H "Accept: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Project-Id: 3c8d3e6c106c44b6ad37140023b92df8" -H "X-Auth-Token: {SHA1}db34b44f5be5800c12f48ab59251dc8277eb0b0c"
2015-10-14 11:04:00.420325 RESP: [500] {'content-length': '208', 'x-compute-request-id': 'req-6881cfe9-5962-4105-bf46-2d164ef4451c', 'vary': 'X-OpenStack-Nova-API-Version', 'connection': 'keep-alive', 'x-openstack-nova-api-version': '2.1', 'date': 'Wed, 14 Oct 2015 11:04:00 GMT', 'content-type': 'application/json; charset=UTF-8'}
2015-10-14 11:04:00.420358 RESP BODY: {"computeFault": {"message": "Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.\\n<class 'nova.exception.ComputeHostNotFound'>", "code": 500}}
2015-10-14 11:04:00.420365
2015-10-14 11:04:00.420842 Recoverable error: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.

Otherwise the cluster appears to be working; for example, I've successfully launched 10 cirros instances across the 3 available compute nodes.
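
For reference, the same failure can be reproduced outside Horizon with python-novaclient. The snippet below is a minimal sketch only; it assumes the usual OS_* admin credentials are exported in the shell (as with a standard DevStack admin openrc):

    import os
    from novaclient import client

    # Build a client against the same v2.1 endpoint that Horizon uses.
    nova = client.Client('2.1',
                         os.environ['OS_USERNAME'],
                         os.environ['OS_PASSWORD'],
                         os.environ['OS_TENANT_NAME'],
                         os.environ['OS_AUTH_URL'])

    # Mirrors Horizon's GET /os-hypervisors/detail; with a stale compute node
    # record present, novaclient raises a 500 ClientException here instead of
    # returning the hypervisor list.
    for hyp in nova.hypervisors.list():
        print(hyp.hypervisor_hostname, hyp.hypervisor_type)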

ubuntu@calico-vm18:/opt/stack/nova$ git log -1
commit 1d97b0e7308975cd4a912dda00df8726ba776fe7
Merge: 7200b6a 11e20dd
Author: Jenkins <email address hidden>
Date: Wed Oct 14 08:50:59 2015 +0000

    Merge "Ignore errorcode=4 when executing `cryptsetup remove` command"

Tags: compute
Revision history for this message
Nell Jerram (neil-jerram) wrote :

The n-api.log includes several instances of this:

2015-10-14 12:45:41.416 DEBUG nova.api.openstack.wsgi [req-722ba30b-5b9d-4f57-95dd-c6173f8d427c admin admin] Calling method '<bound method HypervisorsController.detail of <nova.api.openstack.compute.hypervisors.HypervisorsController object at 0x7fd447b7eed0>>' _process_stack /opt/stack/nova/nova/api/openstack/wsgi.py:792
2015-10-14 12:45:41.432 ERROR nova.api.openstack.extensions [req-722ba30b-5b9d-4f57-95dd-c6173f8d427c admin admin] Unexpected exception in API method
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions Traceback (most recent call last):
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/api/openstack/extensions.py", line 478, in wrapped
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions return f(*args, **kwargs)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/api/openstack/compute/hypervisors.py", line 101, in detail
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions for hyp in compute_nodes])
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/compute/api.py", line 3478, in service_get_by_compute_host
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions return objects.Service.get_by_compute_host(context, host_name)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 171, in wrapper
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions result = fn(cls, context, *args, **kwargs)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/objects/service.py", line 223, in get_by_compute_host
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions db_service = db.service_get_by_compute_host(context, host)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/db/api.py", line 145, in service_get_by_compute_host
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions use_slave=use_slave)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 512, in service_get_by_compute_host
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions raise exception.ComputeHostNotFound(host=host)
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions ComputeHostNotFound: Compute host calico-vm23 could not be found.
2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions
2015-10-14 12:45:41.488 INFO nova.api.openstack.wsgi [req-722ba30b-5b9d-4f57-95dd-c6173f8d427c admin admin] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'nova.exception.ComputeHostNotFound'>
2015-10-14 12:45:41.488 DEBUG nova.api.openstack.wsgi [req-722ba30b-5b9d-4f57-95dd-c6173f8d427c admin admin] Returning 500 to user: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
...
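
The traceback shows the os-hypervisors handler iterating the compute node records and then looking up the service entry for each host, so the ComputeHostNotFound for calico-vm23 points at a compute_nodes row whose host no longer has a matching services record. As a quick check of which compute services the controller still knows about, the following hedged sketch (same OS_* credential assumptions as the earlier snippet) lists them:

    import os
    from novaclient import client

    nova = client.Client('2.1',
                         os.environ['OS_USERNAME'],
                         os.environ['OS_PASSWORD'],
                         os.environ['OS_TENANT_NAME'],
                         os.environ['OS_AUTH_URL'])

    # If calico-vm23 is absent here while the hypervisor listing still trips
    # over it, the leftover record is in compute_nodes rather than services.
    for svc in nova.services.list(binary='nova-compute'):
        print(svc.host, svc.state, svc.status)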


Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

The important information is this one, I guess:

    2015-10-14 12:45:41.432 8564 ERROR nova.api.openstack.extensions
    ComputeHostNotFound: Compute host calico-vm23 could not be found.

    calico-vm23 is a node that was previously set up as a compute node
    (actually from a stacking a couple of weeks ago) but has now been
    wiped.

    So I guess the question is: how do I get Nova on the controller to
    stop trying to communicate with calico-vm23 ?

The "nova/compute/manager.py" has a periodic task which should cleanup
this king of things, IIUC:

    @periodic_task.periodic_task(spacing=CONF.update_resources_interval)
    def update_available_resource(self, context):
        """See driver.get_available_resource()

        Periodic process that keeps that the compute host's understanding of
        resource availability and usage in sync with the underlying hypervisor.

        :param context: security context
        """
        new_resource_tracker_dict = {}

        compute_nodes_in_db = self._get_compute_nodes_in_db(context,
                                                            use_slave=True)
        nodenames = set(self.driver.get_available_nodes())
        for nodename in nodenames:
            rt = self._get_resource_tracker(nodename)
            try:
                rt.update_available_resource(context)
            except exception.ComputeHostNotFound:
                # NOTE(comstud): We can get to this case if a node was
                # marked 'deleted' in the DB and then re-added with a
                # different auto-increment id. The cached resource
                # tracker tried to update a deleted record and failed.
                # Don't add this resource tracker to the new dict, so
                # that this will resolve itself on the next run.
                LOG.info(_LI("Compute node '%s' not found in "
                             "update_available_resource."), nodename)
                continue
            except Exception as e:
                LOG.error(_LE("Error updating resources for node "
                              "%(node)s: %(e)s"),
                          {'node': nodename, 'e': e})
            new_resource_tracker_dict[nodename] = rt

        # NOTE(comstud): Replace the RT cache before looping through
        # compute nodes to delete below, as we can end up doing greenthread
        # switches there. Best to have everyone using the newest cache
        # ASAP.
        self._resource_tracker_dict = new_resource_tracker_dict

        # Delete orphan compute node not reported by driver but still in db
        for cn in compute_nodes_in_db:
            if cn.hypervisor_hostname not in nodenames:
                LOG.info(_LI("Deleting orphan compute node %s") % cn.id)
                cn.destroy()

That's at least the only occurrence of the ComputeNode method "destroy()", which should remove this entry from the database. I didn't find a REST API to trigger this as an end user.
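
In the absence of an end-user API, one possible manual workaround is to soft-delete the orphaned row directly in the nova database. The following is only a sketch: the connection URL is hypothetical (take the real one from the [database] section of nova.conf), the table and column names are assumptions based on the Liberty-era schema, and direct database surgery should only be attempted with a backup.

    import datetime
    from sqlalchemy import create_engine, text

    # Hypothetical connection string; substitute the real one from nova.conf.
    engine = create_engine('mysql+pymysql://root:secret@127.0.0.1/nova')

    STALE_HOST = 'calico-vm23'  # the wiped compute node from the traceback

    with engine.begin() as conn:
        rows = conn.execute(
            text("SELECT id, hypervisor_hostname FROM compute_nodes "
                 "WHERE hypervisor_hostname = :host AND deleted = 0"),
            {'host': STALE_HOST}).fetchall()
        for cn_id, hostname in rows:
            print('soft-deleting orphan compute_nodes row', cn_id, hostname)
            # Nova's soft-delete convention: set deleted to the row id.
            conn.execute(
                text("UPDATE compute_nodes SET deleted = id, deleted_at = :now "
                     "WHERE id = :cn_id"),
                {'now': datetime.datetime.utcnow(), 'cn_id': cn_id})

Soft-deleting rather than removing the row is intended to mirror what cn.destroy() does via nova's soft-delete convention, but again, verify against your own schema first.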

tags: added: compute
Revision history for this message
Augustina Ragwitz (auggy) wrote :

Hi Neil, I just wanted to follow up to see if you're still experiencing this error. If so, it's possible the compute manager cleanup didn't happen as it should have, and we can look into it further. If, however, you are no longer experiencing this, then maybe you can provide some info (if you still remember or have it handy) as to how long it took Nova to stop trying to connect to the deleted VM? Thanks!

summary: - Can't retrieve hypervisor info - Nova Unexpected API error
+ Nova tries to connect to deleted VM causing API error and Horizon error
Changed in nova:
status: New → Incomplete
Revision history for this message
Ross Golder (ross-golder) wrote :

FWIW, I've just started looking into exactly the same problem as described above (same symptoms, same stack traces in the logs), in this case on a three-node Ubuntu Wily based stack.

Revision history for this message
Nell Jerram (neil-jerram) wrote :

@Augustina - No, I'm not still seeing this myself. I think it originally occurred when one of my compute node installs failed in some way, and I've since improved my automation so that kind of failure no longer happens.

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

@Ross+@Neil: Maybe the same node exists twice in the database but with different IDs. Could you add the contents of the nova db "instances" table?

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired