When the rest API gets the server, it returns an empty list. Nova-api says Cell "cellid"is not responding and hence is being omitted from the results

Bug #1970087 reported by fanzijie
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Expired
Undecided
Unassigned

Bug Description

Thank you in advance for seeing this question

When I view it on Web pages and CLI commands, the instance is fine

openstack 4.0.2
train

nova-api.log
2022-04-24 17:42:04.142 2487 WARNING nova.compute.multi_cell_list [req-53aa6a4d-61b7-4248-a7e6-b2a5544808e4 db9e37bfd51a4853b045ec98388a2ffb - - default -] Cell 00000000-0000-0000-0000-000000000000 is not responding and hence is being omitted from the results
2022-04-24 17:42:04.143 2487 WARNING nova.compute.multi_cell_list [req-53aa6a4d-61b7-4248-a7e6-b2a5544808e4 db9e37bfd51a4853b045ec98388a2ffb - - default -] Cell cd67e4f8-daa9-4cf3-a1ee-64a11959f6ad is not responding and hence is being omitted from the results

When I ask http://control:8774/v2.1/servers/"instance ID" , an error will be reported
nova-api.log

2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi [req-33b68053-5d7b-42d8-861b-60931ed03100 db9e37bfd51a4853b045ec98388a2ffb - - default -] Unexpected exception in API method: NovaException: Cell cd67e4f8-daa9-4cf3-a1ee-64a11959f6ad is not responding and hence instance info is not available.
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 671, in wrapped
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/openstack/compute/servers.py", line 476, in show
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/openstack/compute/servers.py", line 379, in _get_server
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/openstack/common.py", line 473, in get_instance
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi cell_down_support=cell_down_support)
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 2657, in get
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi expected_attrs, cell_down_support=cell_down_support)
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 2604, in _get_instance
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi expected_attrs, cell_down_support)
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 2597, in _get_instance_from_cell
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi "info is not available.") % cell_uuid)
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi NovaException: Cell cd67e4f8-daa9-4cf3-a1ee-64a11959f6ad is not responding and hence instance info is not available.
2022-04-24 17:40:10.891 2487 ERROR nova.api.openstack.wsgi
2022-04-24 17:40:10.913 2487 INFO nova.api.openstack.wsgi [req-33b68053-5d7b-42d8-861b-60931ed03100 db9e37bfd51a4853b045ec98388a2ffb - - default -] HTTP exception

fanzijie (lonebrave)
description: updated
description: updated
description: updated
description: updated
fanzijie (lonebrave)
description: updated
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

The error indicates that the communication is failed between the nova-api service and the nova-conductor service in cell cd67e4f8-daa9-4cf3-a1ee-64a11959f6ad.

You can list details the cells in your deployment with:
$ nova-manage cell_v2 list_cells

There you will see what is the message bus configuration of the cell in question. Check that the given message bus is working correctly. Also check the logs of the nova-conductor service that is running in the given cell. To see if there are any errors raised there.

I'm setting this as Incomplete. Please set it back to New if you can provide more information.

Changed in nova:
status: New → Incomplete
fanzijie (lonebrave)
Changed in nova:
status: Incomplete → Opinion
status: Opinion → New
Revision history for this message
fanzijie (lonebrave) wrote (last edit ):
Download full text (5.3 KiB)

First of all, thank you for your answer

I think your point is right, but I don't know how to solve it when I ask http://control:8774/v2.1/servers/serverid I didn't see a detailed error report when

I will list the logs of the cell list and conductor services

+-------+--------------------------------------+---------------------------------------+----------------------------------------------+----------+
| Name | UUID | Transport URL | Database connection | Disabled |
+-------+--------------------------------------+---------------------------------------+----------------------------------------------+----------+
| cell0 | 00000000-0000-0000-0000-000000000000 | none:/ | mysql+pymysql://nova:****@control/nova_cell0 | False |
| cell1 | cd67e4f8-daa9-4cf3-a1ee-64a11959f6ad | rabbit://openstack:****@control:5672/ | mysql+pymysql://nova:****@control/nova | False |
+-------+--------------------------------------+---------------------------------------+----------------------------------------------+----------+

nova-conductor.log
2022-04-26 14:38:14.613 2707 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2022-04-26 14:38:14.963 2707 INFO oslo_service.service [-] Waiting on 2 children to exit
2022-04-26 14:38:14.999 2707 INFO oslo_service.service [-] Child 2894 killed by signal 15
2022-04-26 14:38:15.001 2707 INFO oslo_service.service [-] Child 2893 killed by signal 15
2022-04-26 14:38:24.791 29258 WARNING oslo_config.cfg [-] Deprecated: Option "use_neutron" from group "DEFAULT" is deprecated for removal (
nova-network is deprecated, as are any related configuration options.
). Its value may be silently ignored in the future.
2022-04-26 14:38:29.823 29258 INFO oslo_service.service [-] Starting 2 workers
2022-04-26 14:38:29.847 29258 WARNING oslo_config.cfg [-] Deprecated: Option "dhcpbridge" from group "DEFAULT" is deprecated for removal (
nova-network is deprecated, as are any related configuration options.
). Its value may be silently ignored in the future.
2022-04-26 14:38:29.850 29258 WARNING oslo_config.cfg [-] Deprecated: Option "dhcpbridge_flagfile" from group "DEFAULT" is deprecated for removal (
nova-network is deprecated, as are any related configuration options.
). Its value may be silently ignored in the future.
2022-04-26 14:38:29.851 29288 INFO nova.service [-] Starting conductor node (version 20.6.0-1.el7)
2022-04-26 14:38:29.864 29258 WARNING oslo_config.cfg [-] Deprecated: Option "firewall_driver" from group "DEFAULT" is deprecated for removal (
nova-network is deprecated, as are any related configuration options.
). Its value may be silently ignored in the future.
2022-04-26 14:38:29.865 29258 WARNING oslo_config.cfg [-] Deprecated: Option "force_dhcp_release" from group "DEFAULT" is deprecated for removal (
nova-network is deprecated, as are any related configuration options.
). Its value may be silently ignored in the future.
2022-04-26 14:38:29.863 29289 INFO nova.service [-] Starting conductor node (version 20.6.0-1.el7)
2022-04-26 14:40:22.275 29258 INFO oslo_service.service [-] Ca...

Read more...

Revision history for this message
melanie witt (melwitt) wrote :
Download full text (3.1 KiB)

Looking at the code where this message comes from [1], it's unfortunately a bit misleading:

            if context.is_cell_failure_sentinel(item._db_record):
                if (not CONF.api.list_records_by_skipping_down_cells and
                        not cell_down_support):
                    # Value the config
                    # ``CONF.api.list_records_by_skipping_down_cells`` only if
                    # cell_down_support is False and generate the exception
                    # if CONF.api.list_records_by_skipping_down_cells is False.
                    # In all other cases the results from the down cell should
                    # be skipped now to either construct minimal constructs
                    # later if cell_down_support is True or to simply return
                    # the skipped results if cell_down_support is False.
                    raise exception.NovaException(
                        _('Cell %s is not responding but configuration '
                          'indicates that we should fail.')
                          % item.cell_uuid)
                LOG.warning('Cell %s is not responding and hence is '
                            'being omitted from the results',
                            item.cell_uuid)
                if item._db_record == context.did_not_respond_sentinel:
                    self._cells_timed_out.add(item.cell_uuid)
                elif isinstance(item._db_record, Exception):
                    self._cells_failed.add(item.cell_uuid)

This error message means either there was a timeout waiting for a response from the cell OR an exception was raised when trying to retrieve a DB record from a cell. In the latter case, the exception traceback is logged only for non-NovaException [2]. Are there any additional errors in nova-api.log that you have not included by chance?

I also notice in the code for http://control:8774/v2.1/servers/"instance ID" [3] we only propagate the common InstanceNotFound NovaException up:

        elif isinstance(result[cell_uuid], exception.InstanceNotFound):
            raise exception.InstanceNotFound(instance_id=uuid)

and then unfortunately if everything falls through to the last exception handling, it does not log the content of the exception and is something we need to improve:

        else:
            raise exception.NovaException(
                _("Cell %s is not responding and hence instance "
                  "info is not available.") % cell_uuid)

I'm afraid I can't yet tell what is happening in your environment given the lack of data. If there were a DB error, I think it would have been logged by [2] ... so that implies it's a NovaException that is not InstanceNotFound *and* it's raised by both cell0 and cell1 during an instance list, which seems strange.

If you would be able to "hack" in a print or log statement in the above code ^ and do the http://control:8774/v2.1/servers/"instance ID" again, it would give some clues.

[1] https://github.com/openstack/nova/blob/stable/train/nova/compute/multi_cell_list.py#L429
[2] https://github.com/openstack/nova/blob/stable/train/nova/context.py#L426
[3] https://github.com/openstack/nova/blob/stable/train/no...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/840260

Revision history for this message
melanie witt (melwitt) wrote :

I have proposed a patch to add logging of the exception content during a compute API get() for an instance ^. If you are able to try that patch and post what it logs, we will be able to debug further.

Changed in nova:
status: New → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/840260
Committed: https://opendev.org/openstack/nova/commit/1d4dbfd4680d32d0619e84dbe563deed892e0506
Submitter: "Zuul (22348)"
Branch: master

commit 1d4dbfd4680d32d0619e84dbe563deed892e0506
Author: melanie witt <email address hidden>
Date: Tue May 3 01:48:36 2022 +0000

    Log the exception returned from a cell during API.get()

    When getting an instance using the compute.API we call
    scatter_gather_single_cell() to be able to capture details when we fail
    to retrieve a result from a cell such as timeouts and exceptions.

    Currently however, we aren't logging the content of an exception if
    scatter_gather_single_cell() returns an exception as the result. The
    scatter gather method itself logs exceptions that are not of type
    NovaException as these represent definite unexpected errors such as
    database errors but NovaException handling are left for the caller to
    decide whether they want to log it or re-raise it and so on.

    It can be difficult to debug a situation where a cell is returning a
    NovaException result so this adds logging of the exception content in
    the compute API when we encounter an unexpected NovaException.

    The existing log message has been updated to more accurately reflect
    what has happened (did not respond vs exception). The assignment of the
    exception object in scatter gather has also been updated to not
    unnecessarily construct a new exception object because it (a) wasn't
    necessary and (b) made asserting the LOG.exception() call argument in
    the unit test difficult.

    Related-Bug: #1970087

    Change-Id: Iae1c61c72be5b6017b934293e3dc079a24eeb0e7

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.