Race during ironic re-balance corrupts local RT ProviderTree and compute_nodes cache

Bug #1841481 reported by Matt Riedemann on 2019-08-26
This bug affects 1 person
Affects                    Importance   Assigned to
OpenStack Compute (nova)   Medium       Matt Riedemann
Ocata                      Undecided    Unassigned
Pike                       Undecided    Unassigned
Queens                     Undecided    Unassigned
Rocky                      Undecided    Unassigned
Stein                      Undecided    Unassigned
Train                      Undecided    Unassigned

Bug Description

Seen with an ironic re-balance in this job:

https://d01b2e57f0a56cb7edf0-b6bc206936c08bb07a5f77cfa916a2d4.ssl.cf5.rackcdn.com/678298/4/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/92c65ac/

On the subnode we see the RT detect that the node is moving hosts:

Aug 26 18:41:38.818412 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: INFO nova.compute.resource_tracker [None req-a894abee-a2f1-4423-8ede-2a1b9eef28a4 None None] ComputeNode 61dbc9c7-828b-4c42-b19c-a3716037965f moving from ubuntu-bionic-rax-ord-0010443317 to ubuntu-bionic-rax-ord-0010443319

On that new host, the ProviderTree cache is getting updated with refreshed associations for inventory:

Aug 26 18:41:38.881026 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: DEBUG nova.scheduler.client.report [None req-a894abee-a2f1-4423-8ede-2a1b9eef28a4 None None] Refreshing inventories for resource provider 61dbc9c7-828b-4c42-b19c-a3716037965f {{(pid=747) _refresh_associations /opt/stack/nova/nova/scheduler/client/report.py:761}}

aggregates:

Aug 26 18:41:38.953685 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: DEBUG nova.scheduler.client.report [None req-a894abee-a2f1-4423-8ede-2a1b9eef28a4 None None] Refreshing aggregate associations for resource provider 61dbc9c7-828b-4c42-b19c-a3716037965f, aggregates: None {{(pid=747) _refresh_associations /opt/stack/nova/nova/scheduler/client/report.py:770}}

and traits - but by the time we fetch the traits, the provider is gone:

Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager [None req-a894abee-a2f1-4423-8ede-2a1b9eef28a4 None None] Error updating resources for node 61dbc9c7-828b-4c42-b19c-a3716037965f.: ResourceProviderTraitRetrievalFailed: Failed to get traits for resource provider with UUID 61dbc9c7-828b-4c42-b19c-a3716037965f
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager Traceback (most recent call last):
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/manager.py", line 8250, in _update_available_resource_for_node
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager startup=startup)
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 715, in update_available_resource
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager self._update_available_resource(context, resources, startup=startup)
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 328, in inner
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager return f(*args, **kwargs)
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 738, in _update_available_resource
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager is_new_compute_node = self._init_compute_node(context, resources)
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 561, in _init_compute_node
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager if self._check_for_nodes_rebalance(context, resources, nodename):
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 516, in _check_for_nodes_rebalance
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager self._update(context, cn)
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1054, in _update
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager self._update_to_placement(context, compute_node, startup)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 49, in wrapped_f
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager return Retrying(*dargs, **dkw).call(f, *args, **kw)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 206, in call
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager return attempt.get(self._wrap_exception)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 247, in get
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager six.reraise(self.value[0], self.value[1], self.value[2])
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 200, in call
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 970, in _update_to_placement
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager context, compute_node.uuid, name=compute_node.hypervisor_hostname)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/opt/stack/nova/nova/scheduler/client/report.py", line 858, in get_provider_tree_and_ensure_root
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager parent_provider_uuid=parent_provider_uuid)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/opt/stack/nova/nova/scheduler/client/report.py", line 666, in _ensure_resource_provider
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager self._refresh_associations(context, uuid_to_refresh, force=True)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/opt/stack/nova/nova/scheduler/client/report.py", line 778, in _refresh_associations
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager trait_info = self.get_provider_traits(context, rp_uuid)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager File "/opt/stack/nova/nova/scheduler/client/report.py", line 381, in get_provider_traits
Aug 26 18:41:38.998320 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager raise exception.ResourceProviderTraitRetrievalFailed(uuid=rp_uuid)
Aug 26 18:41:38.998320 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager ResourceProviderTraitRetrievalFailed: Failed to get traits for resource provider with UUID 61dbc9c7-828b-4c42-b19c-a3716037965f
Aug 26 18:41:38.998320 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager

That's because, back on the original host, the resource tracker deleted the compute node that is no longer reported there:

Aug 26 18:41:38.832749 ubuntu-bionic-rax-ord-0010443317 nova-compute[19290]: INFO nova.compute.manager [None req-d5a9c4b6-f197-4f6c-8b12-8f736bbdb11c None None] Deleting orphan compute node 6 hypervisor host is 61dbc9c7-828b-4c42-b19c-a3716037965f, nodes are set([u'1d23263a-31d4-49d9-ad68-be19219c3bae', u'be80f41d-73ed-46ad-b8e4-cefb0193de36', u'f3c6add0-3eda-47d9-9624-c1f73d488066', u'2c909342-b5dc-4203-b9cb-05a8f29c6c35', u'4921f5d8-8b39-4d03-8423-8e8404128ece'])
Aug 26 18:41:38.962237 ubuntu-bionic-rax-ord-0010443317 nova-compute[19290]: INFO nova.scheduler.client.report [None req-d5a9c4b6-f197-4f6c-8b12-8f736bbdb11c None None] Deleted resource provider 61dbc9c7-828b-4c42-b19c-a3716037965f

Every 60 seconds or so after that, the update_available_resource periodic task on the new host should correct this, but there are a couple of problems, and we continue to see that the resource provider is not re-created:

Aug 26 18:42:37.122768 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.resource_tracker [None req-ab8d1a0e-385f-4333-bbb5-7b82250968fb None None] Skipping removal of allocations for deleted instances: Failed to retrieve allocations for resource provider 61dbc9c7-828b-4c42-b19c-a3716037965f: {"errors": [{"status": 404, "request_id": "req-46f04ab5-2bd7-4a13-add9-4b9073587138", "detail": "The resource could not be found.\n\n Resource provider '61dbc9c7-828b-4c42-b19c-a3716037965f' not found: No resource provider with uuid 61dbc9c7-828b-4c42-b19c-a3716037965f found ", "title": "Not Found"}]}: ResourceProviderAllocationRetrievalFailed: Failed to retrieve allocations for resource provider 61dbc9c7-828b-4c42-b19c-a3716037965f: {"errors": [{"status": 404, "request_id": "req-46f04ab5-2bd7-4a13-add9-4b9073587138", "detail": "The resource could not be found.\n\n Resource provider '61dbc9c7-828b-4c42-b19c-a3716037965f' not found: No resource provider with uuid 61dbc9c7-828b-4c42-b19c-a3716037965f found ", "title": "Not Found"}]}

First, we don't go through the RT._update flow again because when the new host detects the moved node, it adds it to the compute_nodes dict:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/compute/resource_tracker.py#L513

It then fails the _update call while fetching the traits, because the old host deleted the provider concurrently:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/compute/resource_tracker.py#L516

After that, update_available_resource will see the node in RT.compute_nodes already and not call _update:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/compute/resource_tracker.py#L546
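
To make that flow concrete, here is a minimal, runnable toy of problem #1. The class and method names are invented for illustration (this is not nova's code); it only mirrors the ordering described above: cache the node first, sync to placement second, and short-circuit on the cache in later runs.

class ProviderGone(Exception):
    pass


class ToyResourceTracker(object):
    def __init__(self, placement_has_provider=False):
        self.compute_nodes = {}
        self.placement_has_provider = placement_has_provider

    def _update(self, node_uuid):
        # Stands in for RT._update -> get_provider_tree_and_ensure_root ->
        # _refresh_associations, which raised
        # ResourceProviderTraitRetrievalFailed in the log above.
        if not self.placement_has_provider:
            raise ProviderGone(node_uuid)

    def rebalance(self, node_uuid):
        # Mirrors the ordering at resource_tracker.py L513/L516: cache the
        # moved node first, then try to sync it to placement.
        self.compute_nodes[node_uuid] = object()
        self._update(node_uuid)

    def periodic(self, node_uuid):
        # Mirrors the early return at resource_tracker.py L546: an existing
        # cache entry means the placement sync is never retried.
        if node_uuid in self.compute_nodes:
            return
        self.compute_nodes[node_uuid] = object()
        self._update(node_uuid)


rt = ToyResourceTracker()
try:
    rt.rebalance('61dbc9c7')           # the old host already deleted the provider
except ProviderGone:
    pass
assert '61dbc9c7' in rt.compute_nodes  # stale entry left behind by the failure
rt.periodic('61dbc9c7')                # later runs silently skip the sync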

Another issue: there was a race between the new host refreshing and adding the resource provider to its local ProviderTree cache:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/scheduler/client/report.py#L640

and the old host deleting the provider. As a result, the new host's ProviderTree cache now has the provider locally even though it is actually gone from placement - remember that this is where we failed to get the traits:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/scheduler/client/report.py#L778
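
The same stale-cache pattern can be modeled on the report client side. Again this is a toy, not nova's SchedulerReportClient: one set stands in for the ProviderTree cache and another for placement's view of the world.

class ToyReportClient(object):
    def __init__(self, placement_providers):
        self.local_cache = set()               # stands in for the ProviderTree cache
        self.placement = placement_providers   # stands in for placement itself

    def ensure_resource_provider(self, uuid):
        if uuid in self.local_cache:
            # Cached locally: trust the cache and only refresh associations,
            # which is where the 404 on traits surfaces.
            if uuid not in self.placement:
                raise RuntimeError('failed to get traits for %s' % uuid)
            return uuid
        # Not cached: this is the branch that would (re)create the provider.
        self.placement.add(uuid)
        self.local_cache.add(uuid)
        return uuid


client = ToyReportClient(placement_providers=set())
client.local_cache.add('61dbc9c7')   # stale entry from the earlier refresh
try:
    client.ensure_resource_provider('61dbc9c7')
except RuntimeError:
    pass
assert '61dbc9c7' not in client.placement   # never re-created, run after run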

So there seem to be two things to clean up:

1. If we fail here:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/compute/resource_tracker.py#L516

We should remove the node from the RT.compute_nodes dict - similar to this fix: https://review.opendev.org/#/c/675704/. That will mean we go through RT._update on the next update_available_resource periodic task run. (Both cleanups are sketched after item 2 below.)

2. If we fail here:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/scheduler/client/report.py#L666

We should remove the provider from the local ProviderTree cache so that the next run will find that the provider does not exist and re-create it here:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/scheduler/client/report.py#L642

Now maybe the "remove from ProviderTree cache on failure" logic there needs to be conditional on whether or not created_rp is None, I'm not sure. It might be best to just always remove the entry from the cache if we get an error trying to refresh its associations, so we're clean next time - that's likely a better question for Eric Fried.
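
To make the two cleanups concrete, here is a sketch of what fix #1 could look like in miniature. This is not the actual nova patch; the function is a stand-in for the rebalance path, with the placement sync passed in as a callable:

def rebalance_node(compute_nodes, nodename, node, update_placement):
    """compute_nodes: the RT-style dict cache; update_placement: may raise."""
    compute_nodes[nodename] = node
    try:
        update_placement(node)
    except Exception:
        # Undo the cache entry on failure so the next periodic run goes
        # through the full _update path again instead of short-circuiting.
        compute_nodes.pop(nodename, None)
        raise
    return True

And a similarly hedged sketch of fix #2, where sets stand in for the ProviderTree cache and for placement, and the refresh step is a callable that may raise:

def ensure_provider(local_cache, placement, uuid, refresh):
    if uuid in local_cache:
        try:
            refresh(uuid)
        except Exception:
            # The provider vanished from placement underneath us; forget our
            # cached copy so the re-create branch below runs next time.
            local_cache.discard(uuid)
            raise
        return uuid
    # Provider not cached: (re)create it in placement and cache it.
    placement.add(uuid)
    local_cache.add(uuid)
    return uuid

In both cases the idea is the same: if the sync or refresh fails, undo the local caching so the next periodic run starts from a clean slate.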

Eric Fried (efried) wrote :

Yup, we ran into this race in update_from_provider_tree as well, for which we made _clear_provider_cache_for_tree().

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/scheduler/client/report.py#L1330-L1341

We should invoke same when _refresh_associations fails.

...Possibly *from* _refresh_associations itself.

...Keeping in mind that that guy is recursive :)
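
A tiny sketch of wiring that up (hypothetical only; in nova the two callables would be the report client's _refresh_associations and _clear_provider_cache_for_tree, and doing the eviction in the caller is one way to sidestep the recursion concern):

def refresh_or_evict(rp_uuid, refresh, clear_tree):
    try:
        refresh(rp_uuid)
    except Exception:
        # The refresh failed (e.g. the provider was deleted concurrently):
        # drop the whole cached tree so the next ensure call re-creates it.
        clear_tree(rp_uuid)
        raise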

Eric Fried (efried) on 2019-08-27
Changed in nova:
assignee: nobody → Eric Fried (efried)

Related fix proposed to branch: master
Review: https://review.opendev.org/678958

Related fix proposed to branch: master
Review: https://review.opendev.org/678959

Changed in nova:
status: Triaged → In Progress
Matt Riedemann (mriedem) wrote :

I'm wondering if there is a way to recreate this with a functional test similar to https://review.opendev.org/#/c/675705/ where we'd have two compute services and a single compute node. The node would start on host1; then we'd update the node in the DB to move from host1 to host2, run the update_available_resource periodic on host1 (which should delete the resource provider), and then run that same periodic on host2 to see if it fails. One thing we'd probably have to stub is injecting the provider into the host2 ProviderTree cache after host1 deletes the provider but before host2 tries to refresh associations for it, which is a bit icky but kind of the only way to recreate a race in tests.
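
As a strawman, the scenario could first be pinned down with a self-contained toy test before wiring it into nova's functional test framework. Everything below is invented (no nova test base classes or fixtures); it just encodes the sequence described above, including the "icky" injection of the provider into host2's cache:

import unittest


class FakePlacement(object):
    def __init__(self):
        self.providers = set()


class FakeHost(object):
    def __init__(self, placement):
        self.placement = placement
        self.provider_cache = set()   # stands in for the ProviderTree cache

    def periodic(self, node_uuid, owns_node):
        if not owns_node:
            # host1: the node moved away, so delete the orphan provider.
            self.placement.providers.discard(node_uuid)
            return
        if node_uuid in self.provider_cache:
            # host2: cached entry, so only "refresh", which 404s.
            if node_uuid not in self.placement.providers:
                raise RuntimeError('provider %s is gone' % node_uuid)
            return
        self.placement.providers.add(node_uuid)
        self.provider_cache.add(node_uuid)


class TestIronicRebalanceRace(unittest.TestCase):
    def test_rebalance_race(self):
        placement = FakePlacement()
        node = '61dbc9c7'
        host1, host2 = FakeHost(placement), FakeHost(placement)
        placement.providers.add(node)    # the node originally lived on host1
        host2.provider_cache.add(node)   # the "icky" stale cache injection
        host1.periodic(node, owns_node=False)   # host1 deletes the provider
        # Without a cache-eviction fix, host2 keeps failing run after run:
        self.assertRaises(RuntimeError, host2.periodic, node, owns_node=True)
        self.assertNotIn(node, placement.providers)


if __name__ == '__main__':
    unittest.main()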

Matt Riedemann (mriedem) wrote :

I think we can consider this a regression going back to https://review.opendev.org/#/c/526540/, which changed the _get_provider_traits method from returning None to raising an error. That doesn't mean the change was wrong, just that it's probably the earliest point from which this issue is recreatable (and therefore backportable).

Matt Riedemann (mriedem) wrote :

Arguably fix #1, removing the entry from the ResourceTracker.compute_nodes dict, could go back to Ocata because this is in Ocata:

https://review.opendev.org/#/q/I4253cffca3dbf558c875eed7e77711a31e9e3406

Fix proposed to branch: master
Review: https://review.opendev.org/684849

Changed in nova:
assignee: Eric Fried (efried) → Matt Riedemann (mriedem)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/684840
Reason: Let's just go with Mark's series here:

https://review.opendev.org/#/q/topic:bug/1839560+status:open

I've lost context on the bug and fixes and likely not going to drive any of this forward so we'll just go with Mark's changes.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/684849
Reason: Let's just go with Mark's series here:

https://review.opendev.org/#/q/topic:bug/1839560+status:open

I've lost context on the bug and fixes and likely not going to drive any of this forward so we'll just go with Mark's changes.
