Placement API crashes with 500s in Rocky upgrade with downed compute nodes

Bug #1799892 reported by Mohammed Naser
Affects                    Status         Importance  Assigned to       Milestone
OpenStack Compute (nova)   Fix Released   Medium      Tetsuro Nakamura
Rocky                      Fix Committed  Medium      Matt Riedemann

Bug Description

I ran into this while upgrading another environment to Rocky and worked around it by deleting the problematic resource provider, but I just hit it again while upgrading a different environment, so something is wonky. Here's the traceback:

=============
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap [req-8ad1c999-7646-4b0a-91c0-cd26a3581766 b61d42657d364008bfdc6fa715e67daf a894e8109af3430aa7ae03e0c49a0aa0 - default default] Placement API unexpected error: 19: KeyError: 19
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap Traceback (most recent call last):
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/fault_wrap.py", line 40, in __call__
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap return self.application(environ, start_response)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/webob/dec.py", line 129, in __call__
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap resp = self.call_func(req, *args, **kw)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/webob/dec.py", line 193, in call_func
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap return self.func(req, *args, **kwargs)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/microversion_parse/middleware.py", line 80, in __call__
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap response = req.get_response(self.application)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/webob/request.py", line 1313, in send
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap application, catch_exc_info=False)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/webob/request.py", line 1277, in call_application
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap app_iter = application(self.environ, start_response)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/handler.py", line 209, in __call__
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap return dispatch(environ, start_response, self._map)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/handler.py", line 146, in dispatch
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap return handler(environ, start_response)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/webob/dec.py", line 129, in __call__
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap resp = self.call_func(req, *args, **kw)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi_wrapper.py", line 29, in call_func
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap super(PlacementWsgify, self).call_func(req, *args, **kwargs)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/webob/dec.py", line 193, in call_func
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap return self.func(req, *args, **kwargs)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/microversion.py", line 164, in decorated_func
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap return _find_method(f, version, status_code)(req, *args, **kwargs)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/util.py", line 81, in decorated_function
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap return f(req)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/handlers/allocation_candidate.py", line 316, in list_allocation_candidates
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap context, requests, limit=limit, group_policy=group_policy)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/objects/resource_provider.py", line 3965, in get_by_requests
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap context, requests, limit=limit, group_policy=group_policy)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 993, in wrapper
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap return fn(*args, **kwargs)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/objects/resource_provider.py", line 4071, in _get_by_requests
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap context, request, sharing, has_trees)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/objects/resource_provider.py", line 4045, in _get_by_one_request
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap return _alloc_candidates_single_provider(context, resources, rp_ids)
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/objects/resource_provider.py", line 3490, in _alloc_candidates_single_provider
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap rp_summary = summaries[rp_id]
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap KeyError: 19
2018-10-25 09:18:29.853 7431 ERROR nova.api.openstack.placement.fault_wrap
=============

The resource provider (nova-compute) with ID 19 was down during the upgrade (it had been taken down a long time ago). The only oddity I found was in the database: `root_provider_id` was set to NULL for that record. After I deleted the resource provider, the placement API stopped returning 500s when scheduling new VMs.
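For anyone checking whether they are affected, the lookup can be sketched as follows. This is a minimal sketch, not a supported tool: it assumes the `resource_providers` table with `id`, `uuid`, `name`, `root_provider_id` and `parent_provider_id` columns, and it uses an in-memory SQLite database with made-up rows as a stand-in for the real API database. The helper names `find_rootless_providers` and `make_demo_db` are hypothetical.

```python
import sqlite3


def find_rootless_providers(conn):
    """Return (id, uuid, name) for providers whose root_provider_id is NULL."""
    cur = conn.execute(
        "SELECT id, uuid, name FROM resource_providers "
        "WHERE root_provider_id IS NULL ORDER BY id"
    )
    return cur.fetchall()


def make_demo_db():
    """Build a throwaway database with one healthy and one rootless provider.

    The rows are illustrative only; they are not taken from a real deployment.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resource_providers ("
        "id INTEGER PRIMARY KEY, uuid TEXT, name TEXT, "
        "root_provider_id INTEGER, parent_provider_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO resource_providers VALUES (?, ?, ?, ?, ?)",
        [
            (18, "uuid-18", "compute-18", 18, None),    # root set correctly
            (19, "uuid-19", "compute-19", None, None),  # root never populated
        ],
    )
    return conn


if __name__ == "__main__":
    for row in find_rootless_providers(make_demo_db()):
        print(row)
```

Against a real deployment the same SELECT would be run on the database that holds the placement tables, which depends on how the environment is configured.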

In the other environment that had the problem, the culprit was also a downed compute node.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/613304

OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/613305

tags: added: placement upgrades
Matt Riedemann (mriedem) wrote :

OK, so it looks like the root_provider_uuid should always be set, and if a resource provider is created without a parent_provider_uuid, then it is itself the root. But this is a data migration problem: any existing providers created before Queens ( https://review.openstack.org/#/c/377138/ ) wouldn't have either of those fields, and we didn't have any kind of online data migration to set the root_provider_id for existing providers.
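The rule stated above (a provider with no parent is its own root) suggests what a backfill could look like. The following is only a sketch under assumptions: it is not the migration that shipped, the function name `backfill_root_provider_ids` is hypothetical, and an in-memory SQLite table with made-up rows stands in for the real `resource_providers` table.

```python
import sqlite3


def backfill_root_provider_ids(conn):
    """Make each parentless provider with a NULL root its own root.

    Returns the number of rows updated. Providers that have a parent are
    left alone, because their root cannot be derived from the row itself.
    """
    cur = conn.execute(
        "UPDATE resource_providers SET root_provider_id = id "
        "WHERE root_provider_id IS NULL AND parent_provider_id IS NULL"
    )
    conn.commit()
    return cur.rowcount


def make_demo_db():
    """Illustrative data: provider 19 predates the root/parent columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resource_providers ("
        "id INTEGER PRIMARY KEY, uuid TEXT, "
        "root_provider_id INTEGER, parent_provider_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO resource_providers VALUES (?, ?, ?, ?)",
        [(18, "uuid-18", 18, None), (19, "uuid-19", None, None)],
    )
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    print("rows updated:", backfill_root_provider_ids(demo))
```

The update is idempotent: a second run matches no rows, which is the property an online data migration needs in order to be safely re-run.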

Matt Riedemann (mriedem) wrote :

There is an online data migration:

https://review.openstack.org/#/c/377138/62/nova/objects/resource_provider.py@917

But it only runs when listing/showing resource providers. The allocation candidates code must be getting the providers and relying on the root_provider_id using sqla model objects rather than the versioned objects that do the online data migration.

This is where something like "placement-manage db online_data_migrations" would be useful.
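To illustrate the suspected failure mode, here is a small sketch of how a join on `root_provider_id` silently drops a provider whose root is NULL, so that a later lookup by provider ID raises `KeyError`. It is not the placement code or the eventual patch: the two queries, the `summaries` helper, and the demo rows are all assumptions made for illustration, run against an in-memory SQLite table.

```python
import sqlite3

# Inner join on the root: a NULL root_provider_id never compares equal,
# so the affected provider is missing from the result.
STRICT_QUERY = (
    "SELECT rp.id, root.id FROM resource_providers AS rp "
    "JOIN resource_providers AS root ON rp.root_provider_id = root.id"
)

# Tolerant variant: fall back to the provider's own id when the root is NULL.
TOLERANT_QUERY = (
    "SELECT rp.id, root.id FROM resource_providers AS rp "
    "JOIN resource_providers AS root "
    "ON COALESCE(rp.root_provider_id, rp.id) = root.id"
)


def summaries(conn, query):
    """Build a dict of provider id -> summary from one of the queries above."""
    return {
        rp_id: {"root_provider_id": root_id}
        for rp_id, root_id in conn.execute(query)
    }


def make_demo_db():
    """Illustrative data: provider 19 has a NULL root_provider_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resource_providers ("
        "id INTEGER PRIMARY KEY, root_provider_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO resource_providers VALUES (?, ?)",
        [(18, 18), (19, None)],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    strict = summaries(conn, STRICT_QUERY)
    try:
        strict[19]
    except KeyError as exc:
        print("strict join lost provider:", exc)
    print("tolerant join:", summaries(conn, TOLERANT_QUERY)[19])
```

Whether the real query should tolerate NULL roots or the data should be backfilled first is a design choice; the sketch only shows why the two paths disagree about provider 19.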

Changed in nova:
status: New → Triaged
Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
no longer affects: nova/queens
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/613304
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=95d0ebc3d896a1af95c369d5d12b413e36c1c8d9
Submitter: Zuul
Branch: master

commit 95d0ebc3d896a1af95c369d5d12b413e36c1c8d9
Author: Tetsuro Nakamura <email address hidden>
Date: Fri Oct 19 23:09:33 2018 +0900

    Add recreate test for bug 1799892

    There are cases where ``root_provider_id`` of a resource provider is
    set to NULL just after it is upgraded to the Rocky release. In such
    cases getting allocation candidates raises a Keyerror.

    This patch recreate that bug by simulating the situation by
    inserting the records to the database directly.

    Change-Id: Iaed912314f3e8fef2f46453a6bf12011702ae1dd
    Related-Bug:#1799892

Changed in nova:
assignee: nobody → Tetsuro Nakamura (tetsuro0907)
status: Triaged → In Progress
Changed in nova:
assignee: Tetsuro Nakamura (tetsuro0907) → Eric Fried (efried)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/619075

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/619076

Eric Fried (efried)
Changed in nova:
assignee: Eric Fried (efried) → Tetsuro Nakamura (tetsuro0907)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/613305
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cdbedac920f407e102a6916f2d943e50a1b0943d
Submitter: Zuul
Branch: master

commit cdbedac920f407e102a6916f2d943e50a1b0943d
Author: Tetsuro Nakamura <email address hidden>
Date: Fri Oct 19 23:12:20 2018 +0900

    Consider root id is None in the database case

    There are cases where ``root_provider_id`` of a resource provider is
    set to NULL just after it is upgraded to the Rocky release. In such
    cases getting allocation candidates raises a Keyerror.

    This patch fixes that bug for cases there is no sharing or nested
    providers in play.

    Change-Id: I9639d852078c95de506110f24d3f35e7cf5e361e
    Closes-Bug:#1799892

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/619075
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2ddb689f8789f1e9e845f4103617f267d990bc2b
Submitter: Zuul
Branch: stable/rocky

commit 2ddb689f8789f1e9e845f4103617f267d990bc2b
Author: Tetsuro Nakamura <email address hidden>
Date: Fri Oct 19 23:09:33 2018 +0900

    Add recreate test for bug 1799892

    There are cases where ``root_provider_id`` of a resource provider is
    set to NULL just after it is upgraded to the Rocky release. In such
    cases getting allocation candidates raises a Keyerror.

    This patch recreate that bug by simulating the situation by
    inserting the records to the database directly.

    Change-Id: Iaed912314f3e8fef2f46453a6bf12011702ae1dd
    Related-Bug:#1799892
    (cherry picked from commit 95d0ebc3d896a1af95c369d5d12b413e36c1c8d9)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/619076
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e90e89219410a771f9b6b0c4200edb0480360afe
Submitter: Zuul
Branch: stable/rocky

commit e90e89219410a771f9b6b0c4200edb0480360afe
Author: Tetsuro Nakamura <email address hidden>
Date: Fri Oct 19 23:12:20 2018 +0900

    Consider root id is None in the database case

    There are cases where ``root_provider_id`` of a resource provider is
    set to NULL just after it is upgraded to the Rocky release. In such
    cases getting allocation candidates raises a Keyerror.

    This patch fixes that bug for cases there is no sharing or nested
    providers in play.

    Change-Id: I9639d852078c95de506110f24d3f35e7cf5e361e
    Closes-Bug:#1799892
    (cherry picked from commit cdbedac920f407e102a6916f2d943e50a1b0943d)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.1.0

This issue was fixed in the openstack/nova 18.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.
