libra_api hitting timeout limits

Bug #1191137 reported by Patrick Crews
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
libra
Fix Released
Critical
Andrew Hutchings

Bug Description

We are seeing errors where libra_api connections timeout and this is causing things to go badly.

After running high stress workloads on a test environment, we see the database contains a number of lb's that are stuck in PENDING_DELETE / PENDING_UPDATE / BUILD states, despite the libra_worker code showing that these actions were successfully completed.

One idea forwarded from discussions was to increase the timeout value (and perhaps to make this configurable).
2013-06-14 10:27:21,648: root - INFO - 15.185.163.106 - - [14/Jun/2013 10:27:21] "PUT /v1.1/loadbalancers/188392 HTTP/1.1" 202 130 0.094199

2013-06-14 10:27:21,654: root - INFO - (17122) accepted ('15.185.163.106', 50115)

2013-06-14 10:27:51,651: root - INFO - Traceback (most recent call last):

2013-06-14 10:27:51,651: root - INFO - File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 346, in fire_timers

2013-06-14 10:27:51,651: root - INFO - timer()

2013-06-14 10:27:51,651: root - INFO - File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 56, in __call__

2013-06-14 10:27:51,651: root - INFO - cb(*args, **kw)

2013-06-14 10:27:51,651: root - INFO - File "/usr/local/lib/python2.7/dist-packages/libra/api/library/gearman_client.py", line 44, in client_job

2013-06-14 10:27:51,651: root - INFO - client.send_update(data)

2013-06-14 10:27:51,652: root - INFO - File "/usr/local/lib/python2.7/dist-packages/libra/api/library/gearman_client.py", line 165, in send_update

2013-06-14 10:27:51,652: root - INFO - filter(LoadBalancer.status != 'DELETED').\

2013-06-14 10:27:51,652: root - INFO - File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2104, in all

2013-06-14 10:27:51,652: root - INFO - return list(self)

2013-06-14 10:27:51,652: root - INFO - File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2216, in __iter__

2013-06-14 10:27:51,652: root - INFO - return self._execute_and_instances(context)

2013-06-14 10:27:51,652: root - INFO - File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2229, in _execute_and_instances

2013-06-14 10:27:51,652: root - INFO - close_with_result=True)

2013-06-14 10:27:51,652: root - INFO - File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2220, in _connection_from_session

2013-06-14 10:27:51,652: root - INFO - **kw)

2013-06-14 10:27:51,652: root - INFO - File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 798, in connection

2013-06-14 10:27:51,653: root - INFO - close_with_result=close_with_result)

2013-06-14 10:27:51,653: root - INFO - File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 802, in _connection_for_bind

2013-06-14 10:27:51,653: root - INFO - return self.transaction._connection_for_bind(engine)

2013-06-14 10:27:51,653: root - INFO - File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 298, in _connection_for_bind

2013-06-14 10:27:51,653: root - INFO - conn = bind.contextual_connect()

2013-06-14 10:27:51,653: root - INFO - File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect

2013-06-14 10:27:51,653: root - INFO - self.pool.connect(),

2013-06-14 10:27:51,653: root - INFO - File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 272, in connect

2013-06-14 10:27:51,653: root - INFO - return _ConnectionFairy(self).checkout()

2013-06-14 10:27:51,653: root - INFO - File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 425, in __init__

2013-06-14 10:27:51,653: root - INFO - rec = self._connection_record = pool._do_get()

2013-06-14 10:27:51,654: root - INFO - File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 769, in _do_get

2013-06-14 10:27:51,654: root - INFO - (self.size(), self.overflow(), self._timeout))

2013-06-14 10:27:51,654: root - INFO - TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30

2013-06-14 10:27:51,747: root - INFO - Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/wsgi.py", line 383, in handle_one_response
    result = self.application(self.environ, start_response)
  File "/usr/lib/python2.7/dist-packages/hp/middleware/cs_auth_token.py", line 241, in __call__
    return super(CsAuthProtocol, self).__call__(env, start_response)
  File "/usr/lib/python2.7/dist-packages/keystone/middleware/auth_token.py", line 184, in __call__
    return self.app(env, start_response)
  File "/usr/local/lib/python2.7/dist-packages/pecan/middleware/recursive.py", line 56, in __call__
    return self.application(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/pecan/middleware/errordocument.py", line 75, in __call__
    app_iter = self.app(environ, replacement_start_response)
  File "/usr/local/lib/python2.7/dist-packages/pecan/core.py", line 559, in __call__
    self.handle_request(state.request, state.response)
  File "/usr/local/lib/python2.7/dist-packages/pecan/core.py", line 495, in handle_request
    result = controller(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/libra/api/controllers/load_balancers.py", line 86, in get
    filter(LoadBalancer.id == load_balancer_id).\
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2145, in first
    ret = list(self[0:1])
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2012, in __getitem__
    return list(res)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2216, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2229, in _execute_and_instances
    close_with_result=True)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2220, in _connection_from_session
    **kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 798, in connection
    close_with_result=close_with_result)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 802, in _connection_for_bind
    return self.transaction._connection_for_bind(engine)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 298, in _connection_for_bind
    conn = bind.contextual_connect()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1661, in contextual_connect
    self.pool.connect(),
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 272, in connect
    return _ConnectionFairy(self).checkout()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 425, in __init__
    rec = self._connection_record = pool._do_get()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 769, in _do_get
    (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30

Revision history for this message
Patrick Crews (patrick-crews) wrote :
Revision history for this message
Patrick Crews (patrick-crews) wrote :

Also - gearman timeout is being hit / we should alter that as well.

Revision history for this message
Patrick Crews (patrick-crews) wrote :

api.log:8257:2013-06-14 00:52:11,985: libra.api.library.gearman_client - WARNING - Gearman timeout talking to <haproxy_node>

Revision history for this message
Patrick Crews (patrick-crews) wrote :

Currently testing with this patch:
diff --git a/libra/api/library/gearman_client.py b/libra/api/library/gearman_client.py
index c3f5df2..2a0de23 100644
--- a/libra/api/library/gearman_client.py
+++ b/libra/api/library/gearman_client.py
@@ -199,7 +199,7 @@ class GearmanClientThread(object):
     def _send_message(self, message):
         job_status = self.gearman_client.submit_job(
             self.host, message, background=False, wait_until_complete=True,
- max_retries=3, poll_timeout=30.0
+ max_retries=10, poll_timeout=120.0
         )
         if job_status.state == 'UNKNOWN':
             # Gearman server connection failed

Revision history for this message
Andrew Hutchings (linuxjedi) wrote :

The timeout was due to threads not freeing up the SQLAlchemy pool when finished. They stacked up and depleted the pool causing a timeout whilst queries waited for pool allocation.

This has been fixed by:

1. All threads and API commands closing their pool session when done

2. SQLAlchemy pool increased to 20 with an emergency pool of 10.

In stage it has been running for 8 hours maintaining the pool at 5 connections.

This has also been applied to the Admin API which shouldn't be hit by the same problems (but helps cleanup if an error occurs)

Changed in libra:
assignee: nobody → Andrew Hutchings (linuxjedi)
importance: Undecided → Critical
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.