Catch AuthorizationFailure when doing poll_and_check periodically

Bug #1732684 reported by Spyros Trigazis
This bug affects 1 person
Affects: Magnum
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

In the periodic task to sync cluster statuses, we catch [1] an exception when the stack doesn't exist.

However, the heat client call also fails with an AuthorizationFailure when the trust has already been deleted.

We need to catch the authentication exception too.

2017-10-26 15:12:56.275 20618 ERROR magnum.common.keystone [req-9997650a-47da-41cc-8211-3afbd9071403 - - - 4cb76a98145b11e793ae92361f002671 -] Keystone API connection failed: no password, trust_id or token found.
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall [req-9997650a-47da-41cc-8211-3afbd9071403 - - - 4cb76a98145b11e793ae92361f002671 -] Fixed interval looping call 'magnum.service.periodic.ClusterUpdateJob.update_status' failed: AuthorizationFailure: reason Keystone API connection failed: no password, trust_id or token found.
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall Traceback (most recent call last):
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 137, in _run_loop
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall result = func(*self.args, **self.kw)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/service/periodic.py", line 70, in update_status
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall cdriver.update_cluster_status(self.ctx, self.cluster)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/drivers/heat/driver.py", line 83, in update_cluster_status
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall poller.poll_and_check()
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/drivers/heat/driver.py", line 172, in poll_and_check
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall stack = self.openstack_client.heat().stacks.get(
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/common/exception.py", line 57, in wrapped
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall return func(*args, **kw)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/common/clients.py", line 93, in heat
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall region_name=region_name)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/common/clients.py", line 44, in url_for
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall return self.keystone().session.get_endpoint(**kwargs)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/common/keystone.py", line 57, in session
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall auth = self._get_auth()
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall File "/usr/lib/python2.7/site-packages/magnum/common/keystone.py", line 97, in _get_auth
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall message='reason %s' % msg)
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall AuthorizationFailure: reason Keystone API connection failed: no password, trust_id or token found.
2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall

http://paste.openstack.org/raw/626507/

[1] http://git.openstack.org/cgit/openstack/magnum/tree/magnum/drivers/heat/driver.py#n187

Spyros Trigazis (strigazi) wrote :

So we have two AuthorizationFailures at the same line of code:
2017-11-16 13:36:33.998 776 ERROR oslo.service.loopingcall AuthorizationFailure: unexpected keystone client error occurred: The request you have made requires authentication. (HTTP 401) (Request-ID: req-1d28d2b8-a043-4ac4-b01f-1ce15f170c1e)

AND

2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall AuthorizationFailure: reason Keystone API connection failed: no password, trust_id or token found.

There is apparently a synchronization problem:
1. magnum requests heat to delete the stack
2. heat receives the request and deletes the trust and trustee user
3. magnum then tries to sync the status, but it cannot reach heat with the trust credentials [1]

We need to either create the cluster context every time we query heat, OR catch the exception and:
a. Pass for CREATE_IN_PROGRESS. If there is a problem and something has already been deleted or has not been created yet, the cluster creation will time out and become CREATE_FAILED. (Then the user can delete the cluster.)
b. For DELETE_IN_PROGRESS, handle the stack as missing, since heat has already deleted the stack and, after that, the trust and trustee user.
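
A minimal sketch of options (a) and (b), not the actual Magnum patch. `AuthorizationFailure` and `StackNotFound` below are local stand-ins for the Magnum and heat client exceptions, and `Cluster`, `sync_status`, `get_stack`, and `_handle_missing_stack` are hypothetical names used only for illustration. The mapping of a missing stack to DELETE_COMPLETE is also an assumption of this sketch:

```python
class AuthorizationFailure(Exception):
    """Stand-in for magnum.common.exception.AuthorizationFailure."""


class StackNotFound(Exception):
    """Stand-in for the heat client's 'stack not found' error."""


class Cluster(object):
    """Hypothetical, minimal cluster record with only a status field."""

    def __init__(self, status):
        self.status = status


def _handle_missing_stack(cluster):
    # Assumption: a missing stack during deletion means deletion finished.
    if cluster.status == 'DELETE_IN_PROGRESS':
        cluster.status = 'DELETE_COMPLETE'


def sync_status(cluster, get_stack):
    """Poll heat once and update cluster.status.

    get_stack is any callable that returns the heat stack status string,
    or raises StackNotFound / AuthorizationFailure.
    """
    try:
        cluster.status = get_stack()
    except StackNotFound:
        # Case already handled today: the stack no longer exists.
        _handle_missing_stack(cluster)
    except AuthorizationFailure:
        if cluster.status == 'DELETE_IN_PROGRESS':
            # (b) The trust is gone, so treat the stack as missing.
            _handle_missing_stack(cluster)
        # (a) For CREATE_IN_PROGRESS (and, in this sketch, any other
        # status) skip this round; a stuck creation is expected to time
        # out and become CREATE_FAILED elsewhere.
    return cluster.status
```

With this shape the looping call no longer dies on the exception; a cluster stuck in DELETE_IN_PROGRESS is resolved, and a creating cluster is simply polled again on the next interval.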

[1] http://git.openstack.org/cgit/openstack/magnum/tree/magnum/common/context.py#n115

Spyros Trigazis (strigazi) wrote :

The error: "2017-10-26 15:12:56.275 20618 ERROR oslo.service.loopingcall AuthorizationFailure: reason Keystone API connection failed: no password, trust_id or token found."

appeared for a stack that was DELETE_COMPLETE in heat; the trust was empty and the cluster in magnum had been in DELETE_IN_PROGRESS for 18 months. Apparently this cluster was carried over from an old magnum version. That bug can't happen again since the trust is deleted only when the stack is deleted. [1]

The second bug can happen in an HA environment where one instance deleted the trust and stack while another was trying to authenticate and check the stack status. When the trust is deleted, we can either ensure that the db is updated correctly or just pass.
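
One possible reading of "ensure that the db is updated correctly or just pass" is sketched here, under stated assumptions: `db` is a plain dict standing in for the Magnum database, and `AuthorizationFailure`, `check_stack`, and `poll_once` are hypothetical names, not Magnum's API. After an authorization failure the poller re-reads the cluster record; if another instance has already removed it there is nothing left to do, otherwise it skips this round and retries on the next interval.

```python
class AuthorizationFailure(Exception):
    """Stand-in for magnum.common.exception.AuthorizationFailure."""


def poll_once(cluster_id, db, check_stack):
    """Run one status check; return 'updated', 'gone' or 'skipped'."""
    try:
        # The assignment only happens if check_stack() does not raise.
        db[cluster_id] = check_stack()
        return 'updated'
    except AuthorizationFailure:
        # Another instance may have deleted the trust and the stack
        # between our read of the cluster and this call.
        if cluster_id not in db:
            return 'gone'     # deletion already recorded, nothing to do
        return 'skipped'      # just pass; retry on the next interval
```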

[1] https://github.com/openstack/magnum/blob/master/magnum/drivers/heat/driver.py#L214
