CantStartEngineError in cell conductor during reschedule - get_host_availability_zone up-call

Bug #1781286 reported by Matthew Edmonds on 2018-07-11
Affects                     Status      Importance   Assigned to   Milestone
OpenStack Compute (nova)                Medium       Unassigned
Pike                                    Undecided    Unassigned
Queens                                  Undecided    Unassigned

Bug Description

In a stable/queens devstack environment with multiple PowerVM compute nodes, every time I see this in the <email address hidden> logs:

Jul 11 15:48:57 myhostname nova-conductor[3796]: DEBUG nova.conductor.manager [None req-af22375c-f920-4747-bd2f-0de80ee69465 admin admin] Rescheduling: True {{(pid=4108) build_instances /opt/stack/nova/nova/conductor/manager.py:571}}

it is shortly thereafter followed by:

Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server [None req-af22375c-f920-4747-bd2f-0de80ee69465 admin admin] Exception during message handling: CantStartEngineError: No sql_connection parameter is established
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/conductor/manager.py", line 652, in build_instances
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server host.service_host))
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/availability_zones.py", line 95, in get_host_availability_zone
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server key='availability_zone')
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 184, in wrapper
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server result = fn(cls, context, *args, **kwargs)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/objects/aggregate.py", line 541, in get_by_host
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server _get_by_host_from_db(context, host, key=key)]
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 987, in wrapper
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server with self._transaction_scope(context):
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server return self.gen.next()
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 1037, in _transaction_scope
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server context=context) as resource:
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server return self.gen.next()
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 640, in _session
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server bind=self.connection, mode=self.mode)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 404, in _create_session
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server self._start()
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 491, in _start
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server engine_args, maker_args)
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 513, in _setup_for_connection
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server "No sql_connection parameter is established")
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server CantStartEngineError: No sql_connection parameter is established
Jul 11 15:48:57 myhostname nova-conductor[3796]: ERROR oslo_messaging.rpc.server

The nova_cell1.conf does have [database]connection set:

[database]
connection = mysql+pymysql://root:mysql@127.0.0.1/nova_cell1?charset=utf8

This may be related to https://bugs.launchpad.net/nova/+bug/1736946 , though that is supposedly fixed in stable/queens and the trace is different, hence the new defect.

Matt Riedemann (mriedem) wrote :

It's not the [database]/connection that matters; build requests live in the API DB, which isn't configured for the cell conductor in a default devstack.

If you have a recreate, can you log the filter_properties when this fails? The logic is based on the num_attempts count from the retries:

        except Exception as exc:
            num_attempts = filter_properties.get(
                'retry', {}).get('num_attempts', 1)
            updates = {'vm_state': vm_states.ERROR, 'task_state': None}
            for instance in instances:
                self._set_vm_state_and_notify(
                    context, instance.uuid, 'build_instances', updates,
                    exc, request_spec)
                # If num_attempts > 1, we're in a reschedule and probably
                # either hit NoValidHost or MaxRetriesExceeded. Either way,
                # the build request should already be gone and we probably
                # can't reach the API DB from the cell conductor.
                if num_attempts <= 1:
                    try:
                        # If the BuildRequest stays around then instance
                        # show/lists will pull from it rather than the errored
                        # instance.
                        self._destroy_build_request(context, instance)
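
For anyone trying to capture that data, a throwaway debug line in build_instances would be enough; something like this hypothetical temporary addition (not in the tree), relying on the module-level LOG that nova/conductor/manager.py already sets up:

    from oslo_log import log as logging
    LOG = logging.getLogger(__name__)  # already present in nova/conductor/manager.py

    # Temporary diagnostic only: dump the retry bookkeeping when rescheduling.
    LOG.debug('Rescheduling with filter_properties: %s', filter_properties)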

Matt Riedemann (mriedem) wrote :

I don't see how this could happen. When we initially schedule a server, we populate the retry field in the filter_properties and set num_attempts to 1:

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/conductor/manager.py#L1208

We do the same here with what should be the same filter_properties dict passed from conductor -> compute -> conductor during the reschedule loop:

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/conductor/manager.py#L563

That second call to populate_retry should increment num_attempts to 2:

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/scheduler/utils.py#L646

The only thing I can figure is maybe you have the max_attempts config option value set to 1 or you're forcing the host/node during the server create?

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/scheduler/utils.py#L634

In that case we don't set the retry entry in filter_properties.
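
For reference, the retry bookkeeping in scheduler/utils.populate_retry() boils down to roughly the following. This is a simplified paraphrase, not the exact nova code; max_attempts is passed in here instead of being read from CONF, and the forced host/node check is collapsed:

    def populate_retry(filter_properties, max_attempts, force_hosts=(), force_nodes=()):
        # Reschedules disabled: no 'retry' entry is added at all.
        if max_attempts == 1 or force_hosts or force_nodes:
            return
        # Initial schedule: num_attempts becomes 1.
        # Reschedule: the same dict comes back from compute and is bumped to 2, 3, ...
        retry = filter_properties.setdefault('retry', {'num_attempts': 0, 'hosts': []})
        retry['num_attempts'] += 1
        if retry['num_attempts'] > max_attempts:
            raise Exception('Exceeded max scheduling attempts')  # MaxRetriesExceeded in nova

    props = {}
    populate_retry(props, max_attempts=3)  # initial create -> num_attempts == 1
    populate_retry(props, max_attempts=3)  # reschedule     -> num_attempts == 2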

tags: added: conductor
Matthew Edmonds (edmondsw) wrote :

The except block in comment 1 is not entered, because the exception isn't raised until after that point, at line 652 in stable/queens:

https://github.com/openstack/nova/blob/7ae2dc840a0bccb868122bb4b77e8958a0e842a7/nova/conductor/manager.py#L650-L652

The links in comment 2 all appear to be from master. I wouldn't be surprised if master is also affected, but I'm seeing this in stable/queens, so we probably want to look there first.

[scheduler]max_attempts is not changed from the default (3) and the create requests do not appear to be specifying a specific host/node.

Matt Riedemann (mriedem) wrote :

OK, looking at the stack trace, I see it's not the _destroy_build_request call that's blowing up on reschedule; it's the up-call to get the availability zone for the next chosen host from the list of alternates:

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/conductor/manager.py#L647

And if an AZ is not requested during server create, we are free to move the instance to another AZ during reschedule. So it seems we've fallen into a dreaded up-call hole here that needs to be tracked:

https://docs.openstack.org/nova/latest/user/cellsv2-layout.html#operations-requiring-upcalls
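
The offending lookup, paraphrased from nova/availability_zones.py as it appears in the traceback (the exact return handling may differ slightly), is roughly:

    def get_host_availability_zone(context, host):
        # AggregateList.get_by_host() reads the aggregates table, which lives
        # in the API database -> an "up-call" the cell conductor can't make.
        metadata = objects.AggregateList.get_by_host(
            context, host, key='availability_zone')
        if 'availability_zone' in metadata:
            return list(metadata['availability_zone'])[0]
        return CONF.default_availability_zone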

summary: - CantStartEngineError in cell conductor during rebuild
+ CantStartEngineError in cell conductor during reschedule -
+ get_host_availability_zone up-call
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
tags: added: cells
Matthew Edmonds (edmondsw) wrote :

This appears to trace back to https://review.openstack.org/#/c/446053/

But it's unclear to me what actually needs to consider the AZ if the user didn't specify one... e.g. migrate doesn't care about the AZ in that case and could move the instance to a different AZ (see https://review.openstack.org/#/c/567701/)

Matt Riedemann (mriedem) wrote :

One idea for fixing this would be to set the AZ for each host in the list of Selection objects that come back from the scheduler - that happens at "the top" where we have access to the API DB and thus the aggregates table.

We send the list of Selection objects down to the compute service, which, during a reschedule, would pass them back to the cell conductor. Then, rather than calling get_host_availability_zone(), we could just get the AZ for the alternate host from the Selection object and avoid the "up call".

This could potentially also be used to fix bug 1497253, where we're doing boot from volume, nova-compute creates the volume, and [cinder]/cross_az_attach=False, so nova has to tell cinder to create the volume in the same AZ as the instance.
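
Very roughly, that idea would look something like this. It is only a sketch of the proposal: an availability_zone field on Selection does not exist at this point and is assumed here, while service_host is a real Selection field:

    # In the superconductor/scheduler layer, where the API DB is reachable:
    for selection in selections:  # Selection objects returned by the scheduler
        # Hypothetical field: stash the AZ alongside each candidate host.
        selection.availability_zone = availability_zones.get_host_availability_zone(
            context, selection.service_host)

    # Later, in the cell conductor during a reschedule, no API DB access needed:
    instance.availability_zone = selection.availability_zone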

Matt Riedemann (mriedem) wrote :

I had to re-learn why this part of comment 2 isn't a problem:

"""
The only thing I can figure is maybe you have the max_attempts config option value set to 1 or you're forcing the host/node during the server create?

https://github.com/openstack/nova/blob/39b05ee9e34ae7e7c1854439f887588ec157bc69/nova/scheduler/utils.py#L634

In that case we don't set the retry entry in filter_properties.
"""

That's because if CONF.scheduler.max_attempts=1 or force_hosts/force_nodes are set, we don't set the 'retry' entry in the filter_properties dict passed between conductor and compute:

https://github.com/openstack/nova/blob/536e5fa57f72f71217fd9f2160df0284a76102e1/nova/scheduler/utils.py#L638

And then if we hit a build failure in compute, we don't cast back to the cell conductor to reschedule, we just fail:

https://github.com/openstack/nova/blob/536e5fa57f72f71217fd9f2160df0284a76102e1/nova/compute/manager.py#L1825

So if we get here in conductor during a reschedule:

https://github.com/openstack/nova/blob/536e5fa57f72f71217fd9f2160df0284a76102e1/nova/conductor/manager.py#L585

The 'retry' key is going to be set and num_attempts won't be 1.
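
In other words, the compute-side check that the second link points at is essentially this (paraphrased, not the exact code):

    # nova/compute/manager.py, when a build fails on the chosen host (paraphrased):
    retry = filter_properties.get('retry')
    if not retry:
        # Reschedules are disabled (max_attempts=1 or a forced host/node),
        # so abort the build here; we never cast back to the cell conductor.
        return build_results.FAILED
    # Otherwise we cast back to conductor build_instances(), which is why
    # num_attempts there is always > 1 during a reschedule.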

Related fix proposed to branch: master
Review: https://review.openstack.org/582412

Reviewed: https://review.openstack.org/582412
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=276130c6d1637f8e0f39ba274e2619a115a8fa1a
Submitter: Zuul
Branch: master

commit 276130c6d1637f8e0f39ba274e2619a115a8fa1a
Author: Matt Riedemann <email address hidden>
Date: Thu Jul 12 18:25:07 2018 -0400

    Add note about reschedules and num_attempts in filter_properties

    The "retry" entry in filter_properties is not set if reschedules
    are disabled, which happens in these cases:

    1. [scheduler]/max_attempts=1
    2. The server is forced to a specific host and/or node.

    More times than I'd like to admit, I've had to re-learn that
    filter_properties['retry']['num_attempts'] will always be >1 in
    conductor build_instances during a reschedule because if
    reschedules are disabled, the compute service aborts the build
    on failure and we don't even get back to conductor.

    This change adds a note since it's hard to keep in your head how
    the retry logic is all tied together from the API, superconductor,
    compute and cell conductor during a reschedule scenario.

    Change-Id: I83536b179000f41f9618a4b6f2a16b4440fd61ba
    Related-Bug: #1781286

Reviewed: https://review.openstack.org/581910
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1160ae7a6585c1567e2e91badb4c312533d15686
Submitter: Zuul
Branch: master

commit 1160ae7a6585c1567e2e91badb4c312533d15686
Author: Matt Riedemann <email address hidden>
Date: Wed Jul 11 19:21:44 2018 -0400

    Add another up-call to the cells v2 caveats list

    Due to change I8d426f2635232ffc4b510548a905794ca88d7f99 in Pike,
    which ironically was meant to avoid up-calls (I think), it
    introduced an up-call during reschedules for server create and
    resize to set the instance.availability_zone based on the
    alternate host selected during the reschedule.

    This adds the up-call to our list of known issues in the cells
    v2 docs so we can track the issue and make people aware of it.

    Change-Id: Id819f91477613a013b89b1fb0b2def3b0fd4b08c
    Related-Bug: #1781286

Matt Riedemann (mriedem) wrote :

Note related bug 1781300.

Matt Riedemann (mriedem) wrote :

I think this is also for reschedules during server create because of this call:

https://github.com/openstack/nova/blob/4c37ff72e5446c835a48d569dd5a1416fcd36c71/nova/conductor/manager.py#L657

build_instances in the cell conductor won't be able to hit aggregates in the API DB if the cell conductor isn't configured for the API DB.
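
To spell that out with the devstack layout from the bug description (connection strings are just examples): the cell conductor only has a cell database configured, so there is no engine at all for the API DB. Adding an [api_database] connection to the cell conductor should sidestep the CantStartEngineError, at the cost of reintroducing the up-call dependency:

    # nova_cell1.conf (cell conductor) -- default devstack layout
    [database]
    connection = mysql+pymysql://root:mysql@127.0.0.1/nova_cell1?charset=utf8

    # No [api_database] section, so AggregateList.get_by_host() has no engine
    # with which to reach the aggregates table. An illustrative addition:
    #
    # [api_database]
    # connection = mysql+pymysql://root:mysql@127.0.0.1/nova_api?charset=utf8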

Matthew Edmonds (edmondsw) wrote :

Yes Matt, I believe it was server create where I originally hit this.

Matt Riedemann (mriedem) wrote :

Yup, I hit it during server create as well (this is on a Train devstack with 2 cells and 3 computes, where 2 computes are in cell1, and that's where I got a reschedule on server create because of a port binding failure on the first compute attempted). Note also that the server is stuck in BUILD status due to this; it is not set to ERROR status:

stack@crosscell:~$ sudo journalctl -a -u <email address hidden> | grep req-c5a8d5f3-8270-4a75-ac66-a9908b6f209d
Apr 02 23:29:17 crosscell nova-conductor[25503]: ERROR nova.scheduler.utils [None req-c5a8d5f3-8270-4a75-ac66-a9908b6f209d admin admin] [instance: 7a69a14e-6a00-426a-a35c-5340597c30af] Error from last host: crosscell2 (node crosscell2): [u'Traceback (most recent call last):\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 1950, in _do_build_and_run_instance\n filter_properties, request_spec)\n', u' File "/opt/stack/nova/nova/compute/manager.py", line 2320, in _build_and_run_instance\n instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance 7a69a14e-6a00-426a-a35c-5340597c30af was re-scheduled: Binding failed for port 6079221f-1de8-4406-a7f1-842bc44ed0fd, please check neutron logs for more information.\n']
Apr 02 23:29:18 crosscell nova-conductor[25503]: DEBUG nova.conductor.manager [None req-c5a8d5f3-8270-4a75-ac66-a9908b6f209d admin admin] Rescheduling: True {{(pid=25503) build_instances /opt/stack/nova/nova/conductor/manager.py:618}}
Apr 02 23:29:18 crosscell nova-conductor[25503]: DEBUG nova.scheduler.utils [None req-c5a8d5f3-8270-4a75-ac66-a9908b6f209d admin admin] Attempting to claim resources in the placement API for instance 7a69a14e-6a00-426a-a35c-5340597c30af {{(pid=25503) claim_resources /opt/stack/nova/nova/scheduler/utils.py:1002}}
Apr 02 23:29:18 crosscell nova-conductor[25503]: ERROR oslo_messaging.rpc.server [None req-c5a8d5f3-8270-4a75-ac66-a9908b6f209d admin admin] Exception during message handling: CantStartEngineError: No sql_connection parameter is established

stack@crosscell:~$ openstack server list
+--------------------------------------+--------------+--------+----------+--------------------------+---------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+--------------+--------+----------+--------------------------+---------+
| 7a69a14e-6a00-426a-a35c-5340597c30af | server1cell1 | BUILD | | cirros-0.4.0-x86_64-disk | m1.tiny |
+--------------------------------------+--------------+--------+----------+--------------------------+---------+
