Comment 2 for bug 1826382

Matt Riedemann (mriedem) wrote:

Looking at Stein, the SchedulerReportClient used by the ResourceTracker is a singleton per nova-compute process, and the client adapter is stored on the report client when it's initialized:

https://github.com/openstack/nova/blob/stable/stein/nova/scheduler/client/report.py#L198
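
For reference, the caching pattern looks roughly like this (just a sketch using the keystoneauth1 loading helpers, not the actual report.py code; the class name here is made up):

    from keystoneauth1 import loading as ks_loading

    class SchedulerReportClientSketch(object):
        """Sketch only: the adapter is built once at init and reused."""

        def __init__(self, conf):
            auth = ks_loading.load_auth_from_conf_options(conf, 'placement')
            sess = ks_loading.load_session_from_conf_options(
                conf, 'placement', auth=auth)
            # The Adapter resolves the placement endpoint from the service
            # catalog (unless endpoint_override is set) and is then cached
            # on the report client for the life of the nova-compute process.
            self._client = ks_loading.load_adapter_from_conf_options(
                conf, 'placement', session=sess)

        def get(self, url, **kwargs):
            # Every placement request goes through the same cached adapter.
            return self._client.get(url, raise_exc=False, **kwargs)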

The _get_providers_in_tree method that gets called has the @safe_connect decorator on it:

https://github.com/openstack/nova/blob/stable/stein/nova/scheduler/client/report.py#L493

and that should reconstruct the client adapter if we hit EndpointNotFound:

https://github.com/openstack/nova/blob/stable/stein/nova/scheduler/client/report.py#L77
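
From memory the decorator does roughly this (a paraphrased sketch, not the literal stein code):

    import functools
    import logging

    from keystoneauth1 import exceptions as ks_exc

    LOG = logging.getLogger(__name__)

    def safe_connect(f):
        @functools.wraps(f)
        def wrapper(self, *args, **kwargs):
            try:
                return f(self, *args, **kwargs)
            except ks_exc.EndpointNotFound:
                LOG.warning('The placement API endpoint was not found.')
                # Rebuild the adapter so the next call re-reads the
                # service catalog instead of using the stale endpoint.
                self._client = self._create_client()
            except ks_exc.MissingAuthPlugin:
                LOG.warning('No authentication information found for '
                            'placement API.')
            except ks_exc.ConnectFailure:
                LOG.warning('Placement API service is not responding.')
        return wrapper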

Ah, but we don't hit that; we get a response from placement and log it here:

https://github.com/openstack/nova/blob/stable/stein/nova/scheduler/client/report.py#L521

Which is this:

2019-04-25 09:58:12.175 31793 ERROR nova.scheduler.client.report [req-18b4f522-e702-4ee1-ba85-e565c8e9ac1e - - - - -] [None] Failed to retrieve resource provider tree from placement API for UUID 4f7c6844-d3b8-4710-be2c-8691a93fb58b. Got 400: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>400 Bad Request</title>
</head><body>
<h1>Bad Request</h1>
<p>Your browser sent a request that this server could not understand.<br />
Reason: You're speaking plain HTTP to an SSL-enabled server port.<br />
 Instead use the HTTPS scheme to access this URL, please.<br />
</p>
<hr>
<address>Apache/2.4.29 (Ubuntu) Server at 10.5.0.36 Port 443</address>
</body></html>

But is placement itself returning that, or keystoneauth1? If it were KSA, I'd expect it to raise an exception rather than return a 400 so we could handle it.

I'm not sure how we can easily detect that and recreate the client adapter object.

And since you're not using https://docs.openstack.org/nova/latest/configuration/config.html#placement.endpoint_override, we're getting the endpoint from the service catalog and our cached client adapter is stale. That means you wouldn't be changing nova.conf when you changed the placement endpoint anyway; it's all dynamic through the service catalog (as intended).
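
(For reference, the override would just be something like this in nova.conf; the URL is only an example:)

    [placement]
    endpoint_override = https://10.5.0.36/placement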

We could potentially just reset the client adapter upon a SIGHUP:

https://github.com/openstack/nova/blob/stable/stein/nova/compute/manager.py#L548

But SIGHUP currently performs a full restart of the service (see bug 1715374), so that's not very helpful (plus having to SIGHUP all of your computes whenever the placement endpoint changes kind of sucks, even if HUP worked properly).
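
If we went that route, it would be something along these lines in the manager's reset hook (a sketch only; how the manager reaches the report client and its _create_client helper is hand-waved here):

    import logging

    LOG = logging.getLogger(__name__)

    class ComputeManagerSketch(object):
        """Sketch: where an adapter reset could hang off the SIGHUP reset()."""

        def __init__(self, reportclient):
            self.reportclient = reportclient

        def reset(self):
            LOG.info('Beginning service reset due to SIGHUP')
            # Rebuild the report client's ksa adapter so the next placement
            # request re-reads the service catalog rather than reusing the
            # stale cached endpoint.
            self.reportclient._client = self.reportclient._create_client()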

One option is to write a decorator for the get/post/put/delete methods:

https://github.com/openstack/nova/blob/stable/stein/nova/scheduler/client/report.py#L247

If the response is a 400 with that text, the decorator would reconstruct the client adapter and retry the request. I'm not sure if that's too heavyweight, though.
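
Very roughly, I'm picturing something like this (just a sketch to show the shape of it, assuming the report client's existing _create_client helper can be reused to rebuild the adapter):

    import functools

    # Body text Apache sends back when we talk plain HTTP to the TLS port.
    _STALE_ENDPOINT_MARKER = 'speaking plain HTTP to an SSL-enabled server port'

    def retry_on_stale_endpoint(f):
        """Sketch: retry a placement request once after rebuilding the adapter."""
        @functools.wraps(f)
        def wrapper(self, url, *args, **kwargs):
            resp = f(self, url, *args, **kwargs)
            if (resp is not None and resp.status_code == 400
                    and _STALE_ENDPOINT_MARKER in resp.text):
                # The cached adapter is pointing at the old plain-HTTP
                # endpoint; rebuild it from the service catalog and retry.
                self._client = self._create_client()
                resp = f(self, url, *args, **kwargs)
            return resp
        return wrapper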