Updates to placement api fail if placement endpoint changes

Bug #1826382 reported by Liam Young on 2019-04-25
This bug affects 1 person
Affects: OpenStack Compute (nova) (Importance: Undecided; Assigned to: Liam Young)
Affects: OpenStack nova-compute charm (Importance: Undecided; Assigned to: Unassigned)

Bug Description

If the URL of the placement API changes after nova-compute has started, placement updates fail because nova-compute appears to cache the old endpoint URL.

To reproduce, update the placement endpoint in keystone to something incorrect and restart nova-compute. Errors contacting the placement API will be reported every minute or so. Now correct the entry in keystone. The errors continue even though the catalogue is now correct; restarting nova-compute fixes the issue.

In my deployment this occurred when the placement endpoint switched from http to https after the nova-compute node had started, resulting in the following in the nova-compute log:

2019-04-25 09:58:12.175 31793 ERROR nova.scheduler.client.report [req-18b4f522-e702-4ee1-ba85-e565c8e9ac1e - - - - -] [None] Failed to retrieve resource provider tree from placement API for UUID 4f7c6844-d3b8-4710-be2c-8691a93fb58b. Got 400: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
Reason: You're speaking plain HTTP to an SSL-enabled server port.<br />
 Instead use the HTTPS scheme to access this URL, please.<br />
</p>
<hr>
<address>Apache/2.4.29 (Ubuntu) Server at 10.5.0.36 Port 443</address>
</body></html>
.
2019-04-25 09:58:12.176 31793 DEBUG oslo_concurrency.lockutils [req-18b4f522-e702-4ee1-ba85-e565c8e9ac1e - - - - -] Lock "compute_resources" released by "nova.compute.resource_tracker.ResourceTracker._update_available_resource" :: held 0.099s inner /usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py:285
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager [req-18b4f522-e702-4ee1-ba85-e565c8e9ac1e - - - - -] Error updating resources for node juju-7a9f5c-zaza-19a393f3689b-16.project.serverstack.: nova.exception.ResourceProviderRetrievalFailed: Failed to get resource provider with UUID 4f7c6844-d3b8-4710-be2c-8691a93fb58b
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager Traceback (most recent call last):
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 7778, in _update_available_resource_for_node
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager rt.update_available_resource(context, nodename)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/nova/compute/resource_tracker.py", line 721, in update_available_resource
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager self._update_available_resource(context, resources)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 274, in inner
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager return f(*args, **kwargs)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/nova/compute/resource_tracker.py", line 798, in _update_available_resource
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager self._update(context, cn)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/retrying.py", line 49, in wrapped_f
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager return Retrying(*dargs, **dkw).call(f, *args, **kw)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/retrying.py", line 206, in call
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager return attempt.get(self._wrap_exception)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager six.reraise(self.value[0], self.value[1], self.value[2])
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager raise value
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/nova/compute/resource_tracker.py", line 960, in _update
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager self._update_to_placement(context, compute_node)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/nova/compute/resource_tracker.py", line 912, in _update_to_placement
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager context, compute_node.uuid, name=compute_node.hypervisor_hostname)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/nova/scheduler/client/__init__.py", line 35, in __run_method
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager return getattr(self.instance, __name)(*args, **kwargs)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/nova/scheduler/client/report.py", line 1006, in get_provider_tree_and_ensure_root
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager parent_provider_uuid=parent_provider_uuid)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/nova/scheduler/client/report.py", line 668, in _ensure_resource_provider
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager rps_to_refresh = self._get_providers_in_tree(context, uuid)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/nova/scheduler/client/report.py", line 74, in wrapper
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager return f(self, *a, **k)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager File "/usr/lib/python3/dist-packages/nova/scheduler/client/report.py", line 535, in _get_providers_in_tree
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager raise exception.ResourceProviderRetrievalFailed(uuid=uuid)
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager nova.exception.ResourceProviderRetrievalFailed: Failed to get resource provider with UUID 4f7c6844-d3b8-4710-be2c-8691a93fb58b
2019-04-25 09:58:12.177 31793 ERROR nova.compute.manager

Liam Young (gnuoy) on 2019-04-25
description: updated
Changed in nova:
assignee: nobody → Liam Young (gnuoy)
Matt Riedemann (mriedem) wrote :

What release is this?

What does your [placement] configuration in nova.conf on the compute host look like?

I'm wondering if keystoneauth1 is caching something about the endpoint URL?

tags: added: compute placement
Matt Riedemann (mriedem) wrote :

Looking at stein, the SchedulerReportClient used by the ResourceTracker is a singleton per nova-compute process, and that client adapter is stored within the report client upon init:

https://github.com/openstack/nova/blob/stable/stein/nova/scheduler/client/report.py#L198

The _get_providers_in_tree method getting called has the @safe_connect decorator on it:

https://github.com/openstack/nova/blob/stable/stein/nova/scheduler/client/report.py#L493

and that should reconstruct the client adapter if we hit EndpointNotFound:

https://github.com/openstack/nova/blob/stable/stein/nova/scheduler/client/report.py#L77
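That pattern is roughly the following (a hypothetical simplification with toy names; the real decorator handles several keystoneauth1 failures and logs warnings):

```python
import functools

class EndpointNotFound(Exception):
    """Stands in for keystoneauth1.exceptions.EndpointNotFound."""

def safe_connect(func):
    """Hypothetical simplification of nova's @safe_connect: if the
    request can't even reach an endpoint, rebuild the client adapter
    so a later periodic task re-reads the service catalog."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        try:
            return func(self, *args, **kwargs)
        except EndpointNotFound:
            self._create_client()  # hypothetical rebuild hook
            return None
    return wrapper

class ToyReportClient:
    def __init__(self):
        self.rebuilt = False

    def _create_client(self):
        self.rebuilt = True

    @safe_connect
    def _get_providers_in_tree(self, uuid):
        # Simulate the endpoint being unreachable.
        raise EndpointNotFound()

client = ToyReportClient()
result = client._get_providers_in_tree('4f7c6844')
```

The key point: the adapter is only rebuilt when the request raises EndpointNotFound, which is why a 400 response slips past it.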

Ah, but we don't hit that; we get a response from placement and log it here:

https://github.com/openstack/nova/blob/stable/stein/nova/scheduler/client/report.py#L521

Which is this:

2019-04-25 09:58:12.175 31793 ERROR nova.scheduler.client.report [req-18b4f522-e702-4ee1-ba85-e565c8e9ac1e - - - - -] [None] Failed to retrieve resource provider tree from placement API for UUID 4f7c6844-d3b8-4710-be2c-8691a93fb58b. Got 400: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
Reason: You're speaking plain HTTP to an SSL-enabled server port.<br />
 Instead use the HTTPS scheme to access this URL, please.<br />
</p>
<hr>
<address>Apache/2.4.29 (Ubuntu) Server at 10.5.0.36 Port 443</address>
</body></html>

But is placement returning that or keystoneauth1? If it were KSA, I'd think it should raise an exception rather than a 400 so we could handle it.

I'm not sure how we can easily detect that and recreate the client adapter object.

And since you're not using https://docs.openstack.org/nova/latest/configuration/config.html#placement.endpoint_override we're getting the endpoint from the service catalog and our cached client adapter is stale...meaning you wouldn't be changing nova.conf when you changed the placement endpoint anyway, it's all dynamic through the service catalog (as intended).
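The staleness can be sketched like this (all names hypothetical; this just mimics an adapter that resolves the catalog URL once, at construction, and reuses it for every later request):

```python
class FakeCatalog:
    """Stands in for the keystone service catalog."""
    def __init__(self, url):
        self.url = url

class FakeAdapter:
    """Stands in for the ksa adapter held by SchedulerReportClient."""
    def __init__(self, catalog):
        # Endpoint is looked up once here and cached for the lifetime
        # of the adapter, i.e. the nova-compute process.
        self._endpoint = catalog.url

    def get(self, path):
        return self._endpoint + path

catalog = FakeCatalog('http://placement.example:8778')
client = FakeAdapter(catalog)

# The operator later fixes the catalog entry to https...
catalog.url = 'https://placement.example:8778'

# ...but the cached adapter keeps building requests on the stale
# plain-HTTP URL until the process is restarted.
stale_url = client.get('/resource_providers')
```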

We could potentially just reset the client adapter upon a SIGHUP:

https://github.com/openstack/nova/blob/stable/stein/nova/compute/manager.py#L548

But SIGHUP is performing a full restart of the service (see bug 1715374), so that's not very helpful (plus having to SIGHUP all of your computes when the placement endpoint changes kind of sucks even if HUP worked properly).

One option is we could write a decorator for the get/post/put/delete methods:

https://github.com/openstack/nova/blob/stable/stein/nova/scheduler/client/report.py#L247

Such a decorator would, if the response is a 400 with that text, reconstruct the client adapter and retry the request. I'm not sure if that's too heavyweight, though.
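A rough sketch of that decorator idea (all names are hypothetical, including `_create_client`; the real report client methods and response objects differ):

```python
import functools

def retry_on_400(func):
    """If a request comes back 400, rebuild the client adapter once
    (re-reading the service catalog) and retry the request."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        resp = func(self, *args, **kwargs)
        if resp.status_code == 400:
            self._create_client()  # hypothetical: re-resolves the endpoint
            resp = func(self, *args, **kwargs)
        return resp
    return wrapper

# Toy client demonstrating the retry path.
class Response:
    def __init__(self, status_code):
        self.status_code = status_code

class ToyReportClient:
    def __init__(self):
        self.rebuilds = 0
        self._stale = True  # pretend our cached endpoint is wrong

    def _create_client(self):
        self.rebuilds += 1
        self._stale = False

    @retry_on_400
    def get(self, path):
        return Response(400 if self._stale else 200)

client = ToyReportClient()
resp = client.get('/resource_providers')
```

Retrying on any 400 (rather than matching the Apache error text) keeps the decorator simple, at the cost of one extra round trip whenever placement legitimately returns a 400.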

Changed in nova:
status: New → Triaged
Matt Riedemann (mriedem) wrote :

I'm also not sure how common this should be expected to be, so it could be very low priority to try to deal with, outside of the SIGHUP suggestion earlier (if SIGHUP weren't broken at the moment).

Liam Young (gnuoy) wrote :

Hi Matt,
    Thank you for taking the time to triage the bug. I have tried out the decorator approach you mentioned (https://paste.ubuntu.com/p/PDtwdNrqbx/) and it seems to work well. I am recreating a client whenever a 400 is returned, because parsing the error string looking for the http/https error message looked brittle (and Apache-specific). Do you think that's ok?

I did also look at checking whether the endpoint has changed on error by creating a new client and checking its endpoint against self._client.get_endpoint() but that felt expensive.

I'd be really grateful for any feedback here, or I could work up some unit tests and submit a patch and we could continue the discussion there.

Chris Dent (cdent) wrote :

Based on the error message, this looks like a problem with how apache (or something else in the HTTP stack) is configured. When the configured endpoint was changed from http to https, something in apache (or haproxy, maybe) was left behind that forwards port 80 to port 443, and apache is saying (correctly) "You're speaking plain HTTP to an SSL-enabled server port."

This means that the originally configured URL is still valid in a halfway fashion. That should not be the case. Either the old URL should not connect at all (which should make the report client recreate its session, re-querying keystone for an endpoint), or it should 404, or it should redirect (via HTTP, not TCP) to the SSL-enabled service.

However, doing the redirect would not be successful for later PUTs and POSTs, so better would be to make it fail.
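For example, a hypothetical Apache fragment that makes the stale plain-HTTP port fail hard instead of being TCP-forwarded to the SSL port (the port number is assumed, not taken from this deployment):

```apache
# Hypothetical vhost for the old plain-HTTP placement port: answer
# with a hard 404 rather than forwarding the connection on to the
# SSL port, so clients holding the stale URL fail cleanly.
<VirtualHost *:8778>
    Redirect 404 /
</VirtualHost>
```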

Changed in charm-nova-compute:
status: New → Triaged