[Errno 110] ETIMEDOUT with endpoint discovery and token request

Bug #1840235 reported by Rabi Mishra
This bug affects 1 person
Affects:      keystoneauth
Status:       In Progress
Importance:   Undecided
Assigned to:  Lance Bragstad

Bug Description

When updating large heat stacks with many nested stacks, we create auth plugins and request trust tokens concurrently for every nested stack, and we see ETIMEDOUT errors [1] in spite of having a large number of keystone workers. Adding retries in session.request() helps fix those errors. At present we don't seem to have a way to pass connect_retries when doing endpoint discovery or when requesting tokens with the plugin's get_auth_ref().

It would be a good idea to initialize the session with connect_retries and use that value unless one is specified in the request.

This seems to have been mentioned as future work in https://github.com/openstack/keystoneauth/commit/34c94a6a207bfa3cf665852b0c84bb47f37e4e0a.
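
For illustration, a minimal sketch of the asymmetry described above, assuming a typical keystoneauth1 setup (the auth URL and credentials are placeholders):

    from keystoneauth1 import session
    from keystoneauth1.identity import v3

    auth = v3.Password(auth_url='http://192.168.0.1:35357/v3',
                       username='admin', password='secret',
                       project_name='admin', user_domain_id='default',
                       project_domain_id='default')
    sess = session.Session(auth=auth)

    # Explicit API calls can opt into connection retries per request:
    sess.get('/servers', endpoint_filter={'service_type': 'compute'},
             connect_retries=3)

    # ...but the requests made internally during endpoint discovery and
    # token issuance accept no such option, so an ETIMEDOUT there fails
    # immediately:
    auth_ref = auth.get_auth_ref(sess)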
[1]

2019-08-14 02:27:08.142 164299 WARNING keystoneauth.identity.generic.base [req-8ab27215-d75f-43c0-8b5f-f85d7c2cba3e - admin - - -] Failed to discover available identity versions when contacting http://192.168.0.1:35357. Attempting to parse version from URL.: ConnectFailure: Unable to establish connection to http://192.168.0.1:35357: HTTPConnectionPool(host='192.168.0.1', port=35357): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f435def5f10>: Failed to establish a new connection: [Errno 110] ETIMEDOUT',))
2019-08-14 02:27:08.142 164298 DEBUG heat.engine.scheduler [req-8ab27215-d75f-43c0-8b5f-f85d7c2cba3e - admin - default default] Task update from TemplateResource "14" [f751b50e-1dea-4f32-b4fb-8f26863e8f69] Stack "overcloud-5039msCompute-bhf4fit4v7l3" [75b36683-3847-4894-b935-b21909a104e5] running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:209
2019-08-14 02:27:08.143 164307 DEBUG heat.engine.scheduler [req-8ab27215-d75f-43c0-8b5f-f85d7c2cba3e - admin - default default] Task update from TemplateResource "NodeUserData" [cbd22f69-1773-44f3-aadf-48d7d63c4e76] Stack "overcloud-5039msCompute-bhf4fit4v7l3-83-iuqbefazivqr" [a5e591d6-5fcc-4eec-a549-d86634daada4] running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:209
2019-08-14 02:27:08.143 164297 DEBUG heat.engine.scheduler [req-8ab27215-d75f-43c0-8b5f-f85d7c2cba3e - admin - default default] Task update from TemplateResource "NodeTimesyncUserData" [16b08cd3-0c61-4922-8e4e-7b8bc629123c] Stack "overcloud-1029uCompute-xbw6u6ajet4g-27-zomtrymyl4we" [cd9c6798-7eb0-4288-96a2-03bbf11d197a] running step /usr/lib/python2.7/site-packages/heat/engine/scheduler.py:209
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server [req-8ab27215-d75f-43c0-8b5f-f85d7c2cba3e - admin - - -] Exception during message handling: DiscoveryFailure: Could not determine a suitable URL for the plugin

2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 158, in wrapper
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server result = f(*args, **kwargs)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 409, in wrapped
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server return func(self, ctx, *args, **kwargs)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 998, in update_stack
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server is_registered_policy=True)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/heat/common/policy.py", line 182, in enforce_stack
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server is_registered_policy=is_registered_policy)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/heat/common/policy.py", line 163, in enforce
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server is_registered_policy=is_registered_policy)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/heat/common/policy.py", line 148, in _enforce
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server target=target, is_registered_policy=is_registered_policy)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/heat/common/policy.py", line 106, in enforce
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server is_registered_policy=is_registered_policy)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/heat/common/policy.py", line 77, in _check
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server credentials = context.to_policy_values()
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 208, in to_policy_values
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server policy = super(RequestContext, self).to_policy_values()
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_context/context.py", line 314, in to_policy_values
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server 'roles': self.roles,
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 301, in roles
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server self._load_keystone_data()
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 292, in _load_keystone_data
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server auth_ref = self.auth_plugin.get_access(self.keystone_session)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/keystoneauth1/identity/base.py", line 134, in get_access
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server self.auth_ref = self.get_auth_ref(session)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/keystoneauth1/identity/generic/base.py", line 199, in get_auth_ref
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server self._plugin = self._do_create_plugin(session)
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/keystoneauth1/identity/generic/base.py", line 194, in _do_create_plugin
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server raise exceptions.DiscoveryFailure('Could not determine a suitable URL '
2019-08-14 02:27:08.143 164299 ERROR oslo_messaging.rpc.server DiscoveryFailure: Could not determine a suitable URL for the plugin

Rabi Mishra (rabi)
Changed in keystoneauth:
assignee: nobody → Rabi Mishra (rabi)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to keystoneauth (master)

Fix proposed to branch: master
Review: https://review.opendev.org/676648

Changed in keystoneauth:
status: New → In Progress
Rabi Mishra (rabi) wrote :

Traceback for a similar ConnectTimeout issue (coming from neutron) when nova talks to neutron.

2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi [req-e1323d84-1a93-4635-b627-8cd11099c5d6 - - - - -] Unexpected exception in API method: ConnectTimeout: Request to http://192.168.0.1:9696/v2.0/ports/c3af379c-b6c2-4fcb-82d8-5d457a331350 timed out
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 788, in wrapped
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 108, in wrapper
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/api/openstack/compute/servers.py", line 557, in create
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi **create_kwargs)
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi File "/usr/lib/python2.7/site-packages/nova/hooks.py", line 154, in inner
2019-08-19 02:15:00.995 305551 ERROR nova.api.openstack.wsgi rv = f(*args, **kwargs)
2019-08-19 02:15:00.995 30...


Colleen Murphy (krinkle) wrote :

The nova traceback shows it is using a keystoneauth adapter which already exposes several retry options, so I don't think that is relevant to the proposal to add it to the session object or to the bug subject "[Errno 110] ETIMEDOUT with endpoint discovery and token request".
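
For reference, the adapter-level retry option Colleen mentions looks roughly like this (the service type and values are illustrative):

    from keystoneauth1 import adapter, session
    from keystoneauth1.identity import v3

    auth = v3.Password(auth_url='http://192.168.0.1:35357/v3',
                       username='admin', password='secret',
                       project_name='admin', user_domain_id='default',
                       project_domain_id='default')
    sess = session.Session(auth=auth)

    # Every request made through this adapter retries connection
    # failures up to three times before giving up.
    neutron = adapter.Adapter(session=sess, service_type='network',
                              connect_retries=3)
    neutron.get('/v2.0/ports')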

Changed in keystoneauth:
assignee: Rabi Mishra (rabi) → Morgan Fainberg (mdrnstm)
Lance Bragstad (lbragstad) wrote :

Hi Rabi,

The beginning of the bug report claims you observed timeouts despite having multiple keystone servers available to respond to incoming requests (it doesn't sound like they were under enough load to result in timeouts?).

In that case, do you have any more information as to what was causing those specific timeouts?

Lance Bragstad (lbragstad) wrote :

Please disregard my question in comment #4. Zane was able to provide some insight on IRC [0].

After discussing in IRC with a few other keystone maintainers and Zane, it sounds like we have a couple of options.

1.) Expose retry logic in the get_access()[1] and get_endpoint()[2] methods
2.) Expose retry logic from the Adapter, which is what the current patch does [3] (a sketch contrasting the two approaches follows the links below)

Since both approaches would be adding new functionality to keystoneauth, it's unlikely they would be backported, but it looks like heat has a workaround that can be backported to stable/queens [4]. If we move forward with an approach in keystoneauth, heat could drop the workaround in master and use the keystoneauth approach.

There was some hesitation raised in IRC about constantly retrying a request when a client receives an ETIMEDOUT, exacerbating the problem if the service is already struggling. Approach #1 would keep the retry logic keystone-specific.

[0] http://eavesdrop.openstack.org/irclogs/%23openstack-keystone/%23openstack-keystone.2019-08-23.log.html#t2019-08-23T16:06:27
[1] https://opendev.org/openstack/keystoneauth/src/commit/38cd5fc6c39c38a51c11683884caf9696ce5f367/keystoneauth1/identity/base.py#L112
[2] https://opendev.org/openstack/keystoneauth/src/commit/38cd5fc6c39c38a51c11683884caf9696ce5f367/keystoneauth1/adapter.py#L265
[3] https://review.opendev.org/#/c/676648/
[4] https://review.opendev.org/#/c/678193/
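
For concreteness, a hedged sketch of how the two approaches would differ from a caller's point of view; the keyword argument in the approach-#1 line is hypothetical and does not exist in keystoneauth today:

    # Approach #1 (HYPOTHETICAL signature, illustration only): pass
    # retries explicitly to the auth plugin methods that currently
    # expose no retry knob.
    auth_ref = plugin.get_access(sess, connect_retries=3)

    # Approach #2: configure retries once at a higher level (the
    # session or adapter) so that the implicit discovery and token
    # requests inherit them.
    sess = session.Session(auth=plugin, connect_retries=3)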

Rabi Mishra (rabi) wrote :

> There was some hesitation raised in IRC about constantly retrying a request when a client receives an ETIMEDOUT, exacerbating the problem if the service is already struggling

I guess that's a choice for the consumers of keystoneauth, no? I don't see how retries would exacerbate the problem, other than perhaps delaying a failure, unless a caller has a way to ignore these timeout errors.

> Approach #1 would keep the retry logic keystone-specific.

It would, but then all internal calls to get_auth_ref() in keystoneauth would also require changes. For example, when using the adapter interface for client calls to other services, get_auth_headers()[1] calls the plugin's get_auth_ref() to get a token, and we would need retries there too.

The failure traceback [2] below, from a call to nova using the adapter interface, provides more insight into the issue I mentioned above.

[1] https://github.com/openstack/keystoneauth/blob/stable/queens/keystoneauth1/session.py#L573

[2]
2019-08-25 14:54:38.459 92046 ERROR heat.engine.resource RemoteError: Remote error: ConnectFailure Unable to establish connection to http://192.168.0.1:35357/v3/auth/tokens: HTTPConnectionPool(host='192.168.0.1', port=35357): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f30fb4e2250>: Failed to establish a new connection: [Errno 110] ETIMEDOUT',))
2019-08-25 14:54:38.459 92046 ERROR heat.engine.resource [u'Traceback (most recent call last):\n', u' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 163, in _process_incoming\n res = self.dispatcher.dispatch(message)\n', u' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 220, in dispatch\n return self._do_dispatch(endpoint, method, ctxt, args)\n', u' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 190, in _do_dispatch\n result = func(ctxt, **new_args)\n', u' File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 158, in wrapper\n result = f(*args, **kwargs)\n', u' File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 424, in wrapped\n return func(self, ctx, *args, **kwargs)\n', u' File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 1010, in update_stack\n environment_files, files, args, template_id)\n', u' File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 966, in _prepare_stack_updates\n updated_stack.validate()\n', u' File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 158, in wrapper\n result = f(*args, **kwargs)\n', u' File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 849, in validate\n validate_value=self.strict_validate)\n', u' File "/usr/lib/python2.7/site-packages/heat/engine/parameters.py", line 538, in validate\n param.validate(validate_value, context)\n', u' File "/usr/lib/python2.7/site-packages/heat/engine/parameters.py", line 246, in validate\n self._validate(self.default(), context)\n', u' File "/usr/lib/python2.7/site-packages/heat/engine/parameters.py", line 366, in _validate\n ...


Changed in keystoneauth:
assignee: Morgan Fainberg (mdrnstm) → Lance Bragstad (lbragstad)
OpenStack Infra (hudson-openstack) wrote : Fix merged to keystoneauth (master)

Reviewed: https://review.opendev.org/676648
Committed: https://git.openstack.org/cgit/openstack/keystoneauth/commit/?id=373cbdbda8ab37ee48671258bad4c7645cd8ceae
Submitter: Zuul
Branch: master

commit 373cbdbda8ab37ee48671258bad4c7645cd8ceae
Author: Rabi Mishra <email address hidden>
Date: Thu Aug 15 12:35:18 2019 +0530

    Allow initializing session with connection retries

    Though we can now set ``connect_retries`` when creating an adapter object,
    which allows retries in case of connection timeouts (e.g. with session
    clients derived from Adapter/LegacyJsonAdapter), it can't be used in
    certain scenarios, such as endpoint discovery with the auth plugin's
    get_discovery() or getting AccessInfo with get_access()/get_auth_ref().

    Having ``connect_retries`` in the Session constructor gives users the
    option of setting it when creating session objects (if they want), and
    it can be overridden per service with the adapter interface.

    This commit also changes the default value of ``connect_retries`` from
    0 to None so that adapters can override retries on the session object.

    Depends-On: https://review.opendev.org/#/c/680497/

    Change-Id: Iffb671fefae23926b1f09017d9db438341eae238
    Partial-Bug: #1840235
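
Per the commit message, a minimal usage sketch (the service type and values are illustrative): the session-wide default now covers the implicit discovery and token requests, and an adapter can still override it per service.

    from keystoneauth1 import adapter, session
    from keystoneauth1.identity import v3

    auth = v3.Password(auth_url='http://192.168.0.1:35357/v3',
                       username='admin', password='secret',
                       project_name='admin', user_domain_id='default',
                       project_domain_id='default')

    # Session-wide default: discovery and token requests retry too.
    sess = session.Session(auth=auth, connect_retries=3)

    # A per-service adapter can still override the session default.
    heat = adapter.Adapter(session=sess, service_type='orchestration',
                           connect_retries=5)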
