In the periodic task that syncs cluster statuses, we catch [1] the exception raised when the stack doesn't exist.
However, building the heat client also throws an exception when the trust has already been deleted.
We need to catch that authorization exception (AuthorizationFailure) too.
2017-10-26 15:12:56.275 20618 ERROR magnum.common.keystone [req-9997650a-47da-41cc-8211-3afbd9071403 - - - 4cb76a98145b11e793ae92361f002671 -] Keystone API connection failed: no password, trust_id or token found.
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall [req-9997650a-47da-41cc-8211-3afbd9071403 - - - 4cb76a98145b11e793ae92361f002671 -] Fixed interval looping call 'magnum.service.periodic.ClusterUpdateJob.update_status' failed: AuthorizationFailure: reason Keystone API connection failed: no password, trust_id or token found.
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall Traceback (most recent call last):
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 137, in _run_loop
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/service/periodic.py", line 70, in update_status
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall cdriver.update_cluster_status(self.ctx, self.cluster)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/drivers/heat/driver.py", line 83, in update_cluster_status
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall poller.poll_and_check()
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/drivers/heat/driver.py", line 172, in poll_and_check
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall stack = self.openstack_client.heat().stacks.get(
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/common/exception.py", line 57, in wrapped
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall return func(*args, **kw)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/common/clients.py", line 93, in heat
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall region_name=region_name)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/common/clients.py", line 44, in url_for
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall return self.keystone().session.get_endpoint(**kwargs)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/common/keystone.py", line 57, in session
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall auth = self._get_auth()
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/common/keystone.py", line 97, in _get_auth
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall message='reason %s' % msg)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall AuthorizationFailure: reason Keystone API connection failed: no password, trust_id or token found.
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall
http://paste.openstack.org/raw/626507/
[1] http://git.openstack.org/cgit/openstack/magnum/tree/magnum/drivers/heat/driver.py#n187
So we have two AuthorizationFailures at the same line of code:
2017-11-16 13:36:33.998 776 ERROR oslo.service.loopingcall AuthorizationFailure: unexpected keystone client error occurred: The request you have made requires authentication. (HTTP 401) (Request-ID: req-1d28d2b8-a043-4ac4-b01f-1ce15f170c1e)
AND
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall AuthorizationFailure: reason Keystone API connection failed: no password, trust_id or token found.
There is apparently a synchronization problem:
1. magnum requests heat to delete the stack
2. heat receives the request, deletes the stack, and then deletes the trust and trustee user
3. magnum tries to sync the status after that, but it cannot reach heat with the trust credentials [1]
We need to either create the cluster context every time we query heat, OR catch the exception and:
a. For CREATE_IN_PROGRESS, pass. If there is a problem and something has already been deleted or has not been created yet, the cluster creation will time out and the cluster will become CREATE_FAILED. (The user can then delete the cluster.)
b. For DELETE_IN_PROGRESS, handle the stack as missing, since heat has already deleted the stack and, after that, the trust and trustee user.
[1] http://git.openstack.org/cgit/openstack/magnum/tree/magnum/common/context.py#n115
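
A rough sketch of the "catch the exception" option (a. and b. above) is given here; it is not the actual magnum patch. The two exception classes are local stand-ins for the AuthorizationFailure seen in the traceback and for the existing "stack not found" case, and poll_stack, get_stack and the returned outcome strings are hypothetical names used only for illustration. Re-raising for any other status is an assumption, since the discussion only covers CREATE_IN_PROGRESS and DELETE_IN_PROGRESS.

```python
class AuthorizationFailure(Exception):
    """Local stand-in for the AuthorizationFailure shown in the traceback."""


class StackNotFound(Exception):
    """Local stand-in for the 'stack does not exist' error already caught."""


def poll_stack(get_stack, cluster_status):
    """Query heat once and classify the result.

    get_stack is a hypothetical zero-argument callable that performs the
    heat query. Returns a (outcome, stack) tuple where outcome is one of
    "found", "missing" or "skip".
    """
    try:
        return ("found", get_stack())
    except StackNotFound:
        # Existing behaviour: the stack is gone.
        return ("missing", None)
    except AuthorizationFailure:
        if cluster_status == "DELETE_IN_PROGRESS":
            # Option b: heat deleted the stack and then the trust, so
            # treat the stack as missing.
            return ("missing", None)
        if cluster_status == "CREATE_IN_PROGRESS":
            # Option a: do nothing this cycle; a stuck creation is left
            # to time out and become CREATE_FAILED.
            return ("skip", None)
        # Assumption: for any other status the failure is unexpected.
        raise
```

With this shape, the periodic job would handle "missing" the same way it handles a stack that no longer exists, and would simply return on "skip" and try again on the next interval.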