resize reschedule results in CantStartEngineError during up-call to InstanceMappings table

Bug #1781300 reported by Matt Riedemann
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   Low         Matt Riedemann
Pike                      Won't Fix      Low         Unassigned
Queens                    Confirmed      Low         Unassigned
Rocky                     Fix Committed  Low         Matt Riedemann
Stein                     Fix Committed  Low         Matt Riedemann

Bug Description

Seen here:

http://logs.openstack.org/27/581727/1/check/tempest-full-py3/15d7fdc/controller/logs/screen-n-cpu.txt#_Jul_11_13_32_54_822996

Jul 11 13:32:54.822996 ubuntu-xenial-rax-ord-0000660028 nova-compute[22966]: ERROR nova.compute.manager [None req-2b322ff2-8b41-4066-921d-f801f9defdaf tempest-DeleteServersTestJSON-1048472163 tempest-DeleteServersTestJSON-1048472163] [instance: 968d92c5-c972-4368-a2ce-fe8aac8c656c] Error trying to reschedule: oslo_messaging.rpc.client.RemoteError: Remote error: CantStartEngineError No sql_connection parameter is established
Jul 11 13:32:54.823342 ubuntu-xenial-rax-ord-0000660028 nova-compute[22966]: ['Traceback (most recent call last):\n', ' File "/usr/local/lib/python3.5/dist-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming\n res = self.dispatcher.dispatch(message)\n', ' File "/usr/local/lib/python3.5/dist-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch\n return self._do_dispatch(endpoint, method, ctxt, args)\n', ' File "/usr/local/lib/python3.5/dist-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch\n result = func(ctxt, **new_args)\n', ' File "/usr/local/lib/python3.5/dist-packages/oslo_messaging/rpc/server.py", line 226, in inner\n return func(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/conductor/manager.py", line 71, in wrapper\n context, instance.uuid)\n', ' File "/usr/local/lib/python3.5/dist-packages/oslo_versionedobjects/base.py", line 184, in wrapper\n result = fn(cls, context, *args, **kwargs)\n', ' File "/opt/stack/nova/nova/objects/instance_mapping.py", line 72, in get_by_instance_uuid\n db_mapping = cls._get_by_instance_uuid_from_db(context, instance_uuid)\n', ' File "/usr/local/lib/python3.5/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 992, in wrapper\n with self._transaction_scope(context):\n', ' File "/usr/lib/python3.5/contextlib.py", line 59, in __enter__\n return next(self.gen)\n', ' File "/usr/local/lib/python3.5/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 1042, in _transaction_scope\n context=context) as resource:\n', ' File "/usr/lib/python3.5/contextlib.py", line 59, in __enter__\n return next(self.gen)\n', ' File "/usr/local/lib/python3.5/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 645, in _session\n bind=self.connection, mode=self.mode)\n', ' File "/usr/local/lib/python3.5/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 409, in _create_session\n self._start()\n', ' File "/usr/local/lib/python3.5/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 496, in _start\n engine_args, maker_args)\n', ' File "/usr/local/lib/py
Jul 11 13:32:54.824136 ubuntu-xenial-rax-ord-0000660028 nova-compute[22966]: thon3.5/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 518, in _setup_for_connection\n "No sql_connection parameter is established")\n', 'oslo_db.exception.CantStartEngineError: No sql_connection parameter is established\n'].

This happens because, in devstack's default superconductor mode, the n-cpu and n-cond-cell1 services aren't configured for the nova API DB and so can't reach the instance mappings table there. When nova-compute casts to the cell conductor's migrate_server method, that method is decorated with @targets_cell, which tries to look up the instance mapping to find the instance's cell and blows up.

In this reschedule scenario, the instance (and context) are actually already targeted to a cell so we should be able to just short-circuit the targets_cell decorator.
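
To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the nova code: `CantStartEngineError` here is a local stand-in for the oslo.db exception, and `API_DB_CONNECTION`, `get_instance_mapping`, `CellConductor` and `try_reschedule` are hypothetical names used only for illustration. The decorator is a simplified shape that always performs the mapping lookup before calling the wrapped method.

```python
import functools


class CantStartEngineError(Exception):
    """Local stand-in for oslo_db.exception.CantStartEngineError."""


# Hypothetical flag: None models a cell conductor with no API DB connection.
API_DB_CONNECTION = None


def get_instance_mapping(context, instance_uuid):
    """Stand-in for the InstanceMapping lookup, which needs the API DB."""
    if API_DB_CONNECTION is None:
        raise CantStartEngineError("No sql_connection parameter is established")
    return {"instance_uuid": instance_uuid, "cell": "cell1"}


def targets_cell(fn):
    """Simplified pre-fix shape: always look up the mapping before calling fn."""
    @functools.wraps(fn)
    def wrapper(self, context, instance, *args, **kwargs):
        # This is the "up call": it fails when the API DB is not configured.
        mapping = get_instance_mapping(context, instance["uuid"])
        context["cell"] = mapping["cell"]
        return fn(self, context, instance, *args, **kwargs)
    return wrapper


class CellConductor:
    @targets_cell
    def migrate_server(self, context, instance):
        return "rescheduled"


def try_reschedule():
    """Call migrate_server the way a reschedule would and report the outcome."""
    try:
        return CellConductor().migrate_server({}, {"uuid": "968d92c5"})
    except CantStartEngineError as exc:
        return "failed: %s" % exc
```

In this sketch the wrapped method is never reached: the lookup raises first, which mirrors the error in the log above.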

Matt Riedemann (mriedem) wrote :

This goes back to Pike when targets_cell() was added:

https://review.openstack.org/#/c/438022/

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/581912

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/581912
Reason: This is the wrong fix, as Mel pointed out: the cell targeting isn't persisted across RPC, and we only get away with it in unit/functional tests because of the CheatingSerializer.

Matt Riedemann (mriedem) wrote :

Note related bug 1781286.

Matt Riedemann (mriedem) wrote :

The fix is here: https://review.openstack.org/#/c/581912/ (it wasn't tracked by Launchpad for some reason).

Changed in nova:
assignee: Matt Riedemann (mriedem) → nobody
status: In Progress → Triaged
status: Triaged → In Progress
assignee: nobody → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) wrote :

Marking this as low severity since most deployments are probably running all services (except maybe nova-compute) with the api_db/connection configured so they wouldn't hit this, but it's definitely something we can hit in a multinode devstack environment in the gate.

Changed in nova:
importance: Medium → Low
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/581912
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=22dd4fca3837d6cea3f983b021aa6907695a540b
Submitter: Zuul
Branch: master

commit 22dd4fca3837d6cea3f983b021aa6907695a540b
Author: Matt Riedemann <email address hidden>
Date: Thu Sep 13 13:28:24 2018 -0400

    Noop CantStartEngineError in targets_cell if API DB not configured

    For reschedules during resize (migrate_server method),
    the InstanceMapping query in targets_cell is an "up call"
    to the API DB which will fail with a CantStartEngineError
    if the cell conductor is not configured for the API DB.

    This changes the targets_cell decorator to handle the
    CantStartEngineError and if the API DB is not configured,
    we assume we're in the cell conductor and just ignore the
    error, otherwise if the API DB is configured we assume
    we're in the super-conductor and reraise as before.

    Change-Id: I0a413eb4f8a94500941e53b9a294d7cdb45d2a1c
    Closes-Bug: #1781300
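
The behavior described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged change: `CantStartEngineError` is a local stand-in for the oslo.db exception, and `FakeConf`, `api_db_connection`, `lookup_cell` and `Conductor` are hypothetical names. The lookup always fails here so that both branches of the handler can be exercised.

```python
import functools


class CantStartEngineError(Exception):
    """Local stand-in for oslo_db.exception.CantStartEngineError."""


class FakeConf:
    """Hypothetical config holder; api_db_connection=None means not configured."""
    def __init__(self, api_db_connection=None):
        self.api_db_connection = api_db_connection


def lookup_cell(conf, instance_uuid):
    """Stand-in for the InstanceMapping query; always fails in this sketch."""
    raise CantStartEngineError("No sql_connection parameter is established")


def targets_cell(fn):
    @functools.wraps(fn)
    def wrapper(self, context, instance, *args, **kwargs):
        try:
            context["cell"] = lookup_cell(self.conf, instance["uuid"])
        except CantStartEngineError:
            if self.conf.api_db_connection is not None:
                # API DB is configured: assume super-conductor and reraise.
                raise
            # API DB is not configured: assume cell conductor, ignore the error.
        return fn(self, context, instance, *args, **kwargs)
    return wrapper


class Conductor:
    def __init__(self, conf):
        self.conf = conf

    @targets_cell
    def migrate_server(self, context, instance):
        return "rescheduled"
```

With `FakeConf(None)` the call falls through to `migrate_server`; with a connection string set, the same error propagates to the caller as before.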

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/663030

Matt Riedemann (mriedem) wrote :

This is low priority since most production deployments are probably not running with split conductors and the cell conductor isolated from the API DB; it's mostly an issue we only hit in devstack. I therefore probably won't backport this past Stein unless someone needs it.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/663030
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d4bc147c38c5fe7c4adb5925225ced16d2ff4757
Submitter: Zuul
Branch: stable/stein

commit d4bc147c38c5fe7c4adb5925225ced16d2ff4757
Author: Matt Riedemann <email address hidden>
Date: Thu Sep 13 13:28:24 2018 -0400

    Noop CantStartEngineError in targets_cell if API DB not configured

    For reschedules during resize (migrate_server method),
    the InstanceMapping query in targets_cell is an "up call"
    to the API DB which will fail with a CantStartEngineError
    if the cell conductor is not configured for the API DB.

    This changes the targets_cell decorator to handle the
    CantStartEngineError and if the API DB is not configured,
    we assume we're in the cell conductor and just ignore the
    error, otherwise if the API DB is configured we assume
    we're in the super-conductor and reraise as before.

    Change-Id: I0a413eb4f8a94500941e53b9a294d7cdb45d2a1c
    Closes-Bug: #1781300
    (cherry picked from commit 22dd4fca3837d6cea3f983b021aa6907695a540b)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.2

This issue was fixed in the openstack/nova 19.0.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/686276

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/686276
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=20de81c7c0dfe234805fa9552b1b83f3ea6df7a1
Submitter: Zuul
Branch: stable/rocky

commit 20de81c7c0dfe234805fa9552b1b83f3ea6df7a1
Author: Matt Riedemann <email address hidden>
Date: Thu Sep 13 13:28:24 2018 -0400

    Noop CantStartEngineError in targets_cell if API DB not configured

    For reschedules during resize (migrate_server method),
    the InstanceMapping query in targets_cell is an "up call"
    to the API DB which will fail with a CantStartEngineError
    if the cell conductor is not configured for the API DB.

    This changes the targets_cell decorator to handle the
    CantStartEngineError and if the API DB is not configured,
    we assume we're in the cell conductor and just ignore the
    error, otherwise if the API DB is configured we assume
    we're in the super-conductor and reraise as before.

    Change-Id: I0a413eb4f8a94500941e53b9a294d7cdb45d2a1c
    Closes-Bug: #1781300
    (cherry picked from commit 22dd4fca3837d6cea3f983b021aa6907695a540b)
    (cherry picked from commit d4bc147c38c5fe7c4adb5925225ced16d2ff4757)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.3.0

This issue was fixed in the openstack/nova 18.3.0 release.
