Distributed Cloud: seeing QueuePool limit exception for dcorch audit

Bug #1889292 reported by Gerry Kopec
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Medium      Yuxing

Bug Description

Brief Description
-----------------
While running a distributed cloud system with many subclouds, QueuePool limit exceptions appear during the dcorch audit. When this happens, the audit is unable to verify whether the affected subcloud resource is synced between the system controller and the subcloud until the next audit attempt.

Severity
--------
Minor

Steps to Reproduce
------------------
Set up a large distributed cloud system and leave it running in steady state.

Expected Behavior
------------------
Observe /var/log/dcorch/dcorch.log. It should run clean, without exceptions.

Actual Behavior
----------------
The log shows a QueuePool error:
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
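
For readers unfamiliar with the message, the numbers mean that at most 5 + 10 = 15 connections can be checked out at once, and a further caller waits up to 30 seconds before failing. The sketch below is a toy, standard-library model of that behaviour, not SQLAlchemy's implementation; the class name ToyQueuePool and its methods are hypothetical, and the timeout is shortened so the example finishes quickly.

```python
import threading


class ToyQueuePool:
    """Toy stand-in for a bounded connection pool (not SQLAlchemy's code)."""

    def __init__(self, size=5, max_overflow=10, timeout=30.0):
        self.size = size
        self.max_overflow = max_overflow
        self.timeout = timeout
        # At most size + max_overflow connections may be checked out at once.
        self._slots = threading.BoundedSemaphore(size + max_overflow)

    def checkout(self):
        # Wait up to `timeout` seconds for a free slot, then give up.
        if not self._slots.acquire(timeout=self.timeout):
            raise TimeoutError(
                "QueuePool limit of size %d overflow %d reached, "
                "connection timed out, timeout %s"
                % (self.size, self.max_overflow, self.timeout))
        return object()  # placeholder for a real DB connection

    def checkin(self, conn):
        # Returning a connection frees one slot for a waiting caller.
        self._slots.release()


pool = ToyQueuePool(size=5, max_overflow=10, timeout=0.1)
held = [pool.checkout() for _ in range(15)]  # 5 + 10 checkouts succeed

try:
    pool.checkout()  # the 16th caller waits for the timeout, then fails
    sixteenth_failed = False
except TimeoutError:
    sixteenth_failed = True
```

In the audit case, the callers holding the slots would be concurrent per-subcloud audit threads, which is consistent with the error appearing only on systems with many subclouds.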

Reproducibility
---------------
Intermittent: affects roughly 5% of the 10-minute audits. When it happens, multiple resources are affected during the same audit.

System Configuration
--------------------
Distributed Cloud - system controller

Branch/Pull Time/Commit
-----------------------
2020-06-27_18-35-20

Last Pass
---------
none

Timestamp/Logs
--------------
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread [-] Unexpected error while auditing users: TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread Traceback (most recent call last):
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib/python2.7/site-packages/dcorch/engine/sync_thread.py", line 513, in sync_audit
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread abort_resources)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib/python2.7/site-packages/dcorch/engine/sync_thread.py", line 614, in audit_find_missing
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread db_sc_resource = self.get_db_subcloud_resource(m_rsrc_db.id)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib/python2.7/site-packages/dcorch/engine/sync_thread.py", line 226, in get_db_subcloud_resource
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread self.ctxt, rsrc_id, self.subcloud_engine.subcloud.id)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib/python2.7/site-packages/dcorch/objects/subcloud_resource.py", line 70, in get_by_resource_and_subcloud
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread context, res_id, subcloud_id)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib/python2.7/site-packages/dcorch/db/api.py", line 209, in subcloud_resource_get_by_resource_and_subcloud
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread context, resource_id, subcloud_id)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib/python2.7/site-packages/dcorch/db/sqlalchemy/api.py", line 149, in wrapper
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread return f(*args, **kwargs)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib/python2.7/site-packages/dcorch/db/sqlalchemy/api.py", line 641, in subcloud_resource_get_by_resource_and_subcloud
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread return query.one()
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2814, in one
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread ret = self.one_or_none()
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2784, in one_or_none
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread ret = list(self)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2855, in __iter__
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread return self._execute_and_instances(context)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2876, in _execute_and_instances
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread close_with_result=True)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2885, in _get_bind_args
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread **kw
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2867, in _connection_from_session
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread conn = self.session.connection(**kw)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 998, in connection
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread execution_options=execution_options)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1005, in _connection_for_bind
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread conn = engine.contextual_connect(**kw)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2112, in contextual_connect
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread self._wrap_pool_connect(self.pool.connect, None),
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread return fn()
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 387, in connect
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread return _ConnectionFairy._checkout(self)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 766, in _checkout
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread fairy = _ConnectionRecord.checkout(pool)
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 516, in checkout
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread rec = pool._do_get()
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 1131, in _do_get
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread (self.size(), self.overflow(), self._timeout))
2020-07-24 17:40:47.233 1576608 ERROR dcorch.engine.sync_thread TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30

Test Activity
-------------
Distributed Cloud system testing

Workaround
----------
none

Ghada Khalil (gkhalil)
tags: added: stx.distcloud
Bart Wensley (bartwensley) wrote :

It looks like we previously increased the DB connection pool limits for dcmanager; we need to do the same for dcorch:
https://review.opendev.org/#/c/718490

Changed in starlingx:
assignee: nobody → Yuxing (yuxing)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/745141

Changed in starlingx:
status: New → In Progress
Frank Miller (sensfan22) wrote :

Marking as stx.5.0 gating.

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.5.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/745141
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=7be9954a560e93cfac422d2f86265f6aa934aea2
Submitter: Zuul
Branch: master

commit 7be9954a560e93cfac422d2f86265f6aa934aea2
Author: Yuxing Jiang <email address hidden>
Date: Wed Aug 5 12:26:59 2020 -0400

    Database connection exhausts during dcorch audit

    In a distributed cloud system with lots of subclouds, seeing QueuePool
    limit exception during dcorch audit, and the particular subcloud cannot
    synced with the system controller until the next audit attempt. This
    change increases the overflow limit to 500 on the db connection and
    restrict the db connection size to 1 to prevent the idle connection
    remaining in the pool.

    Change-Id: I93eae7b7e8927f4a70d0073bf4619bcf0de15df7
    Closes-Bug: 1889292
    Signed-off-by: Yuxing Jiang <email address hidden>
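
The trade-off described in the commit message (pool size 1, overflow 500) can be illustrated with a toy model. It assumes the usual queue-pool behaviour in which up to `size` returned connections are kept open for reuse and any extra (overflow) connections are closed on return. The class ToyPool and the helper burst are hypothetical names for illustration, not code from stx-puppet or dcorch.

```python
class ToyPool:
    """Toy model: keeps at most `size` idle connections, closes the rest."""

    def __init__(self, size, max_overflow):
        self.size = size
        self.max_overflow = max_overflow
        self.idle = 0     # open connections waiting for reuse
        self.in_use = 0   # connections currently checked out
        self.closed = 0   # connections closed on return

    def checkout(self):
        if self.idle:
            self.idle -= 1  # reuse an idle connection
        elif self.in_use >= self.size + self.max_overflow:
            raise RuntimeError("pool exhausted")
        self.in_use += 1

    def checkin(self):
        self.in_use -= 1
        if self.idle < self.size:
            self.idle += 1    # keep it open for the next caller
        else:
            self.closed += 1  # overflow connection: close it


def burst(pool, n):
    """Check out n connections at once, return them all, report the result."""
    for _ in range(n):
        pool.checkout()
    for _ in range(n):
        pool.checkin()
    return pool.idle, pool.closed


# Old limits from the error message: ceiling of 15, 5 connections stay idle.
old_idle, old_closed = burst(ToyPool(size=5, max_overflow=10), 15)
# Limits from the commit message: ceiling of 501, only 1 connection stays idle.
new_idle, new_closed = burst(ToyPool(size=1, max_overflow=500), 200)
```

Under these assumptions a burst of 200 concurrent audits fits within the new ceiling of 501 connections, and once the burst ends only one connection remains open.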

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/762919
