Conductor: fails to clean up networking resources due to _destroy_build_request CantStartEngineError

Bug #1736946 reported by Gary Kotton
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Matt Riedemann
Newton
Won't Fix
High
Unassigned
Ocata
Fix Committed
High
Matt Riedemann
Pike
Fix Committed
High
Matt Riedemann

Bug Description

If libvirt fails to deploy instance - for example due to problematic vif type being passed. The conductor will fail to clean up resources. This fails with the exception below. This is due to the fact that the cell mapping was not invoked.

Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00mTraceback (most recent call last):
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    res = self.dispatcher.dispatch(message)
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    return self._do_dispatch(endpoint, method, ctxt, args)
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    result = func(ctxt, **new_args)
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/opt/stack/nova/nova/conductor/manager.py", line 559, in build_instances
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    self._destroy_build_request(context, instance)
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/opt/stack/nova/nova/conductor/manager.py", line 477, in _destroy_build_request
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    context, instance.uuid)
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 184, in wrapper
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    result = fn(cls, context, *args, **kwargs)
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/opt/stack/nova/nova/objects/build_request.py", line 176, in get_by_instance_uuid
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    db_req = cls._get_by_instance_uuid_from_db(context, instance_uuid)
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 983, in wrapper
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    with self._transaction_scope(context):
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    return self.gen.next()
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 1033, in _transaction_scope
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    context=context) as resource:
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    return self.gen.next()
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 638, in _session
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    bind=self.connection, mode=self.mode)
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 403, in _create_session
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    self._start()
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 489, in _start
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    engine_args, maker_args)
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m  File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 511, in _setup_for_connection
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m    "No sql_connection parameter is established")
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00mCantStartEngineError: No sql_connection parameter is established
Dec  7 09:12:50 utu1604template nova-conductor[22761]: ERROR oslo_messaging.rpc.server #033[01;35m#033[00m

Tags: cells
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/526356

Changed in nova:
assignee: nobody → Gary Kotton (garyk)
status: New → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote : Re: Conductor: fails to clean up networking resources
Changed in nova:
assignee: Gary Kotton (garyk) → Matt Riedemann (mriedem)
Revision history for this message
Matt Riedemann (mriedem) wrote :

This was regressed in newton: https://review.openstack.org/#/c/364005/

Changed in nova:
importance: Undecided → High
tags: added: cells
Matt Riedemann (mriedem)
summary: - Conductor: fails to clean up networking resources
+ Conductor: fails to clean up networking resources due to
+ _destroy_build_request CantStartEngineError
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by garyk (<email address hidden>) on branch: master
Review: https://review.openstack.org/526356
Reason: https://review.openstack.org/#/c/528835/

Revision history for this message
Matt Riedemann (mriedem) wrote :

Newton is basically end of life at this point (we were actually supposed to do that in October I think).

While this bug was introduced in Newton, I don't think we need to hold newton-eol up for this bug, because this will only fail if your conductor service is not configured for the API database and I assume that was pretty common config until at least Ocata when we started requiring cells v2 and more so in Pike when we started pushing for split MQ for cells v2 deployments (where you have a top-level conductor service that can talk to the API DB and a conductor service per cell with it's own MQ and DB, and was isolated from the top-level MQ and API DB).

So in the interest of newton-eol, I'm not going to hold that up for this.

Changed in nova:
assignee: Matt Riedemann (mriedem) → Ed Leafe (ed-leafe)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Ed Leafe (ed-leafe) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Ed Leafe (ed-leafe)
Matt Riedemann (mriedem)
Changed in nova:
assignee: Ed Leafe (ed-leafe) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/529697

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/529703

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/528835
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f8747407fc6ac0dd2669065a79094dd2c284a4b2
Submitter: Zuul
Branch: master

commit f8747407fc6ac0dd2669065a79094dd2c284a4b2
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 18 17:41:26 2017 -0500

    Don't try to delete build request during a reschedule

    If populate_retry failed because of MaxRetriesExceeded,
    don't try to delete build requests because they should
    already be gone from the initial create attempt, plus
    we should assume the cell conductor can't reach the API
    database anyway.

    Similar for hitting NoValidHost during a reschedule. We
    can tell if we're doing a reschedule by the num_attempts
    value in filter_properties, populated via populate_retry,
    which will be >1 during a reschedule.

    Change-Id: I0b3ec6bb098ca32ffd32a61d4f9dcf426c3faf46
    Closes-Bug: #1736946

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/529697
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=96acf3db0bc9ecaddb04fc9effc4b9ca1553103d
Submitter: Zuul
Branch: stable/pike

commit 96acf3db0bc9ecaddb04fc9effc4b9ca1553103d
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 18 17:41:26 2017 -0500

    Don't try to delete build request during a reschedule

    If populate_retry failed because of MaxRetriesExceeded,
    don't try to delete build requests because they should
    already be gone from the initial create attempt, plus
    we should assume the cell conductor can't reach the API
    database anyway.

    Similar for hitting NoValidHost during a reschedule. We
    can tell if we're doing a reschedule by the num_attempts
    value in filter_properties, populated via populate_retry,
    which will be >1 during a reschedule.

    Change-Id: I0b3ec6bb098ca32ffd32a61d4f9dcf426c3faf46
    Closes-Bug: #1736946
    (cherry picked from commit cf88a27c6250043859306d47ae82a63535ad507f)

Revision history for this message
Gary Kotton (garyk) wrote :

The bug still occurs when an invalid vnic is returned by neutron. The code from Matt does not address it

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0b3

This issue was fixed in the openstack/nova 17.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.0

This issue was fixed in the openstack/nova 16.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ocata)

Reviewed: https://review.openstack.org/529703
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f70119c842958e730d64db44d6684c1501ea3049
Submitter: Zuul
Branch: stable/ocata

commit f70119c842958e730d64db44d6684c1501ea3049
Author: Matt Riedemann <email address hidden>
Date: Mon Dec 18 17:41:26 2017 -0500

    Don't try to delete build request during a reschedule

    If populate_retry failed because of MaxRetriesExceeded,
    don't try to delete build requests because they should
    already be gone from the initial create attempt, plus
    we should assume the cell conductor can't reach the API
    database anyway.

    Similar for hitting NoValidHost during a reschedule. We
    can tell if we're doing a reschedule by the num_attempts
    value in filter_properties, populated via populate_retry,
    which will be >1 during a reschedule.

    Change-Id: I0b3ec6bb098ca32ffd32a61d4f9dcf426c3faf46
    Closes-Bug: #1736946
    (cherry picked from commit cf88a27c6250043859306d47ae82a63535ad507f)
    (cherry picked from commit 96acf3db0bc9ecaddb04fc9effc4b9ca1553103d)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 15.1.1

This issue was fixed in the openstack/nova 15.1.1 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.