convergence: QueuePool limit of size 5 overflow 10 reached

Bug #1491185 reported by Angus Salkeld
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Heat
Fix Released
Medium
Anant Patil

Bug Description

If you have a really large template with a lot of small resources that can be created quickly, the DB gets hit very hard.

see: http://lists.openstack.org/pipermail/openstack-dev/2015-September/073401.html

u'TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30\n'

I think limiting the number of resource checks to a number based on the CPU count might be a good idea.
This should also prevent one user from "swamping" the service with jobs.

Angus Salkeld (asalkeld)
Changed in heat:
importance: Undecided → Medium
status: New → Triaged
Angus Salkeld (asalkeld)
tags: added: convergence-bugs
Rico Lin (rico-lin)
Changed in heat:
assignee: nobody → Rico Lin (rico-lin)
Revision history for this message
Angus Salkeld (asalkeld) wrote :

I think this bug should focus on finding the root cause of the excessive sessions.

For background: http://lists.openstack.org/pipermail/openstack-dev/2015-September/073587.html
Also see: http://stackoverflow.com/questions/3360951/sql-alchemy-connection-time-out

We are probably leaking sessions; we need to debug that first before jumping to conclusions.

summary: - convergence: limit the number of concurrent resource actions
+ convergence: QueuePool limit of size 5 overflow 10 reached
Revision history for this message
Angus Salkeld (asalkeld) wrote :

Note: http://docs.sqlalchemy.org/en/latest/core/pooling.html#using-connection-pools-with-multiprocessing

since we have 5 connections and 8 engines, if they are shared, this could be a problem.

"
It’s critical that when using a connection pool, and by extension when using an Engine created via create_engine(), that the pooled connections are not shared to a forked process.
"

our workers fork, see:
https://github.com/openstack/oslo.service/blob/master/oslo_service/service.py#L424

I haven't looked into oslo.db to see if it handles this.
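
For background, here is a minimal sketch (using plain SQLAlchemy, not Heat's oslo.db wrapper; the file path is hypothetical) of the pattern the pooling docs recommend when a process that owns an Engine forks: the child disposes of the inherited pool (or creates its own engine) before issuing queries, so pooled connections are never shared across the fork.

import os

from sqlalchemy import create_engine, text
from sqlalchemy.pool import QueuePool

# Hypothetical file-based DB just so the sketch is runnable; Heat would be
# talking to MySQL/PostgreSQL here.
engine = create_engine("sqlite:///example.db", poolclass=QueuePool,
                       pool_size=5, max_overflow=10)

pid = os.fork()
if pid == 0:
    # Child process: per the SQLAlchemy pooling docs, drop the pool inherited
    # from the parent so the child opens its own connections.
    engine.dispose()
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    os._exit(0)
else:
    os.waitpid(pid, 0)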

Revision history for this message
Angus Salkeld (asalkeld) wrote :

If I run heat-engine under pdb and put a break here:

(Pdb) b heat/db/sqlalchemy/api.py:49
Breakpoint 1 at /home/angus/work/heat/heat/db/sqlalchemy/api.py:49
(Pdb) b /usr/local/lib/python2.7/dist-packages/oslo_service/service.py:422
Breakpoint 2 at /usr/local/lib/python2.7/dist-packages/oslo_service/service.py:422

(to see if we run "fork" before "get_engine")

> /usr/local/lib/python2.7/dist-packages/oslo_service/service.py(422)_start_child()
-> pid = os.fork()
(Pdb) c
INFO oslo_service.service [] Started child 5498
> /usr/local/lib/python2.7/dist-packages/oslo_service/service.py(422)_start_child()
-> pid = os.fork()
(Pdb) WARNING oslo_config.cfg [] Option "qpid_hostname" from group "DEFAULT" is deprecated. Use option "qpid_hostname" from group "oslo_messaging_qpid".
INFO heat.engine.worker [] Starting engine_worker (1.2) in engine f4366532-eca5-42d9-a108-3474fe9164a7.
> /home/angus/work/heat/heat/db/sqlalchemy/api.py(49)get_facade()
-> if not _facade:

So we ARE forking before setting up the DB, so this is unlikely to be the problem.

Rico Lin (rico-lin)
Changed in heat:
assignee: Rico Lin (rico-lin) → nobody
Rico Lin (rico-lin)
Changed in heat:
assignee: nobody → Rico Lin (rico-lin)
Revision history for this message
Zane Bitter (zaneb) wrote :

There was a reference loop in the client plugin code that was probably keeping the DB session alive (the context would not get removed until the garbage collector ran). That would probably have looked a bit like a session leak. It would be interesting to retest this with the fix for bug 1508134 and see if this still happens.

Revision history for this message
Anant Patil (ananta) wrote :

This issue is still seen when I test with a large template. The error is easily reproducible when I have a resource group with more than 200 resources (I could reproduce it with RandomString and TestResource).

When all 200 resources are triggered, 200 workers are spawned to work on them simultaneously; all of them try to open a DB session and fail after hitting the limit.

I am thinking of two possible solutions:
1. Cache the sessions in the worker for a stack request, so that all workers use the same session when serving requests for a stack. Caching has its own problems, such as deciding when to clear the cache.
2. Implement request throttling for convergence resource requests. Basically, the WorkerService would en-queue the resource request and hand it to the (eventlet) thread pool when a thread is free (see the sketch after this list). This would also help us scale well for bigger stacks. The pool size should be configurable and would depend on the host configuration/environment.
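
As a rough illustration of option 2 (the names here are hypothetical, not Heat's actual WorkerService API), resource checks could be funnelled through an eventlet GreenPool sized from the CPU count, so only a bounded number of checks hold DB sessions at any one time:

import multiprocessing

import eventlet

# Bounded pool: at most this many resource checks (and DB sessions) at once.
pool = eventlet.GreenPool(size=multiprocessing.cpu_count() * 4)


def check_resource(resource_id):
    # Placeholder for the real convergence check; each check would open
    # its own DB session, so the pool size caps concurrent sessions.
    print("checking resource %s" % resource_id)


def handle_check_resource_request(resource_id):
    # spawn_n blocks only when the pool is full, which gives exactly the
    # throttling behaviour described in option 2.
    pool.spawn_n(check_resource, resource_id)


for rid in range(200):
    handle_check_resource_request(rid)
pool.waitall()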

Revision history for this message
Anant Patil (ananta) wrote :

I tried with the following configuration and things worked. Tested on my machine with an i5 processor (4 CPUs) and 12 GB RAM.

[DEFAULT]
max_resources_per_stack = -1
rpc_response_timeout = 3600

[database]
max_overflow=1000
max_pool_size=32

The maximum number of greenlets the RPC server can spawn to serve requests is 1000 by default (from what I could find, oslo.messaging uses the futurist library's GreenThreadPoolExecutor for the "eventlet" executor type, and the default size there is 1000. Ref: https://github.com/openstack/futurist/blob/d9231f56bf237c35d37b304d4ecdf037be1fdb25/futurist/_futures.py#L324 ). Hence, I configured max_overflow to 1000 and kept max_pool_size at 32.

For the sake of analysis, I have noted the time taken to create a stack containing a resource group of various sizes, with the above configuration.

| Size | Time taken (secs) | max_db_cons |
|------|-------------------|-------------|
| 200  | 130               | 190         |
| 400  | 610               | 260         |
| 800  | 2040              | 260         |
| 1600 | 4680              | 260         |

max_db_cons is taken from mysql's "show processlist" command. So, there were around 260 connections at a time made by the heat engines (4 of them).

I also found that most of the time is spent updating the sync_point of the stack. Every resource in the resource group tries to update the sync_point of its stack, so a lot of queries contend for DB connections. Because we use atomic_key to avoid synchronization issues, the updates become almost serialized: all 400 or 800 resources try to update a single row, and most of the time and memory is spent there, while the actual resources complete pretty quickly.
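
For readers unfamiliar with the pattern being described, here is a rough sketch of that optimistic-locking update (the table, column, and function names are hypothetical, not Heat's actual schema or code). Each writer re-reads the row and retries until its atomic_key check succeeds, which is why concurrent updates to the one stack row effectively serialize:

import json

import sqlalchemy
from sqlalchemy import Column, Integer, MetaData, String, Table

metadata = MetaData()
sync_point = Table(
    "sync_point", metadata,
    Column("entity_id", String, primary_key=True),
    Column("atomic_key", Integer, nullable=False),
    Column("input_data", String, nullable=False),
)


def update_sync_point(engine, entity_id, new_input):
    """Merge new_input into input_data, retrying on concurrent writers."""
    while True:
        with engine.begin() as conn:
            row = conn.execute(
                sqlalchemy.select(sync_point.c.atomic_key,
                                  sync_point.c.input_data)
                .where(sync_point.c.entity_id == entity_id)
            ).first()
            data = json.loads(row.input_data)
            data.update(new_input)
            result = conn.execute(
                sync_point.update()
                .where(sync_point.c.entity_id == entity_id)
                .where(sync_point.c.atomic_key == row.atomic_key)
                .values(atomic_key=row.atomic_key + 1,
                        input_data=json.dumps(data))
            )
            if result.rowcount == 1:  # nobody else bumped atomic_key
                return
        # Another writer won the race; read the row again and retry.


if __name__ == "__main__":
    engine = sqlalchemy.create_engine("sqlite://")  # in-memory DB for the demo
    metadata.create_all(engine)
    with engine.begin() as conn:
        conn.execute(sync_point.insert().values(
            entity_id="stack-1", atomic_key=0, input_data=json.dumps({})))
    update_sync_point(engine, "stack-1", {"resource-1": "done"})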

Revision history for this message
Rico Lin (rico-lin) wrote :

Cool, I'm doing exactly the same test!
Anant: thanks for the details; they give us more clues about this.
I think you're right about this.
I will update a patch for this issue.

Revision history for this message
Anant Patil (ananta) wrote :

I tried solution (1) from my earlier comment (comment #5) and it doesn't seem to work. Since all the tasks actually run simultaneously in eventlets, a session cannot be shared across them. When I tried, I saw a lot of mysql exceptions, mainly:

DBError: reentrant call inside <_io.BufferedReader name=10>
InvalidRequestError: This session is in 'prepared' state; no further SQL can be emitted within this transaction.
DBConnectionError: (pymysql.err.OperationalError) (2014, 'Command Out of Sync')

It probably was a bad idea as the DB sessions are not designed to work this way.

So we are left with the option of capping the number of resources heat processes simultaneously. I will try this and post my findings.

Revision history for this message
Anant Patil (ananta) wrote :

Digging more into oslo.messaging, I found that the greenlet queue pool size is set to 64 by default: https://github.com/openstack/oslo.messaging/blob/213176657d55a65ecd2dfebc7320651f5442761c/oslo_messaging/_executors/impl_pooledexecutor.py#L28

This overrides the default size of 1000 in the futurist library: https://github.com/openstack/futurist/blob/d9231f56bf237c35d37b304d4ecdf037be1fdb25/futurist/_futures.py#L324

This is the reason why max_db_connections stayed at 260 even when I had set max_overflow to 1000 (see comment #6). 4 workers with 64 greenlets each spawned 256 threads working on resources, and each opened a DB session, so around 256 DB sessions would be open. By default, the maximum number of DB sessions is 15 per worker (pool size 5 plus max overflow 10), which means 60 DB sessions are allowed for 4 workers, and that is not enough when we actually need at least 256.

Trying with max_overflow set to 64 worked well. So, one of the solutions could be to configure max_overflow to 64 (or whatever RPC pool size is configured) when initializing the DB connection with sqlalchemy.
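
For reference, the manual workaround described above would look roughly like this in heat.conf (the values are illustrative; executor_thread_pool_size is the oslo.messaging option and max_overflow the oslo.db one):

[DEFAULT]
executor_thread_pool_size = 64

[database]
max_overflow = 64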

Revision history for this message
Anant Patil (ananta) wrote :

While exploring how we could throttle the requests, I found that the GreenThreadPoolExecutor already allows us to do so. There is a configuration option exposed by oslo.messaging with which we can set the number of eventlets the executor can have: https://github.com/openstack/oslo.messaging/blob/213176657d55a65ecd2dfebc7320651f5442761c/oslo_messaging/_executors/impl_pooledexecutor.py#L28

By default it is 64.

Since sqlalchemy by default allows a maximum of 15 sessions to be open in parallel, I wanted to limit the size of this GreenThreadPoolExecutor to less than 16.

When I set executor_thread_pool_size to 10, I found the following results:
| Size | Time | max_db_cons |
|------|------|-------------|
| 200  | 63   | 40          |
| 400  | 166  | 44          |
| 800  | 510  | 45          |
| 1600 | 1584 | 45          |

Time is in seconds. Size is the resource group size of the RandomString resource. max_db_cons is the maximum number of connections opened to the DB.

The results are far better than those in comment #6, where I set the database's max_pool_size and max_overflow to higher numbers to match the greenthread pool size.

When I set the pool size to 8, the results were a little better than with 10:
| Size | Time | max_db_cons |
|------|------|-------------|
| 200  | 57   | 36          |
| 400  | 150  | 36          |
| 800  | 453  | 36          |
| 1600 | 1531 | 37          |

I think we should set the default executor_thread_pool_size to 10 or so, so that we don't see the "QueuePool limit" error. The convergence worker service can set this option by default if it is not set in heat.conf.

The downside of setting it to 10 is that fewer resources can be provisioned in parallel, but I guess we can live with that as long as we document that one can increase executor_thread_pool_size to improve the number of parallel tasks. BTW, by default the heat engine spawns as many worker processes as there are processors, so having a pool size of 10 is not too bad.

Also, we could automatically set the database's max_overflow to executor_thread_pool_size so that we never hit this issue and the user doesn't have to change the database option whenever they increase the thread pool size. It can be taken care of automatically.
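
One way that automatic wiring might look (a hedged sketch, not necessarily what an eventual patch would do; it assumes oslo.messaging and oslo.db have already registered their options on the ConfigOpts instance):

from oslo_config import cfg

CONF = cfg.CONF


def setup_db_pool_defaults(conf=CONF):
    # executor_thread_pool_size is registered by oslo.messaging in DEFAULT;
    # max_overflow is registered by oslo.db in the [database] group.
    conf.set_default('max_overflow',
                     conf.executor_thread_pool_size,
                     group='database')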

Revision history for this message
Anant Patil (ananta) wrote :

Rico Lin, could I assign this to myself, if that's ok?

Revision history for this message
Anant Patil (ananta) wrote :

One more thing worth mentioning: when the green thread pool size is big, the memory consumed by the process is very high. So, by default, we should try to limit it to a small number.

Revision history for this message
Rico Lin (rico-lin) wrote :

Anant: of course, yes. You have already done great research on this.
I think you have found the best answer to this.

Changed in heat:
assignee: Rico Lin (rico-lin) → Anant Patil (ananta)
Revision history for this message
Rico Lin (rico-lin) wrote :

I am also thinking about reducing the chance of conflicts by introducing a waiting time in the sync function, letting the while loop sleep for a short time; this would also avoid a high rate of requests hitting the database or the queue.
I have given it a test and it seems to solve the problem.
I think we can work on this issue together, although your solution is already capable of solving it.

Revision history for this message
Anant Patil (ananta) wrote :

Rico Lin, sure. We could probably choose a random time to sleep before updating the database, but I am not sure that will bring any improvement; we need to test the solutions before deciding. Effectively avoiding contention will definitely help things run faster.

Another thing I was thinking of is making the input_data of the stack's sync_point a counter instead of JSON-based input data (a quick sketch follows below). Incrementing a counter may be faster than updating JSON, especially when the resource count is very high; in the case of a resource group of size 1000, the JSON of the input model is pretty big.
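
A hypothetical sketch of the counter idea (the table and column names are illustrative only, and assume a sync_point table with an input_count column): a single UPDATE that increments a column needs no read-modify-write cycle or retry loop, unlike merging data into a JSON blob.

import sqlalchemy

# One atomic statement; the database applies the increment, so concurrent
# resources never re-read and re-serialize a large JSON document.
INCREMENT = sqlalchemy.text(
    "UPDATE sync_point SET input_count = input_count + 1 "
    "WHERE entity_id = :entity_id"
)


def signal_resource_done(engine, entity_id):
    with engine.begin() as conn:
        conn.execute(INCREMENT, {"entity_id": entity_id})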

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/261234

Changed in heat:
status: Triaged → In Progress
Revision history for this message
Anant Patil (ananta) wrote :

After some thought, I am planning to submit a patch that sets the database's max_overflow option to an appropriate value: it should match executor_thread_pool_size, since the worker can spawn that many eventlets in parallel under load.

Reducing the thread pool size instead would compromise the number of resources/stacks that can be processed simultaneously. It is not uncommon to choose a higher value for max_overflow, especially when systems are designed to scale horizontally.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Anant Patil (<email address hidden>) on branch: master
Review: https://review.openstack.org/261234
Reason: The fix should set the database max_overflow config instead of reducing the executor pool size. A reduced pool size limits the number of workers that can work on resources.

Setting the DB configuration to match the number of workers is a better option and will help heat scale.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/262374

Revision history for this message
Rico Lin (rico-lin) wrote :

Anant: I have done some investigation into changing the input data format as you suggested above, and I think it's a great idea, since we essentially only use it as a counter and lock in the sync point. Is it ok if I pick it up?

Changed in heat:
milestone: none → mitaka-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/262374
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=d7c2de60cb9ed0d3ddb077d603b7028f8d91d238
Submitter: Jenkins
Branch: master

commit d7c2de60cb9ed0d3ddb077d603b7028f8d91d238
Author: Anant Patil <email address hidden>
Date: Wed Dec 30 09:51:39 2015 +0530

    Set DB max_overflow to match thread pool size

    Set max_overflow config option of database to match the number of
    workers that can be spawned simultaneously. The maximum number of RPC
    workers that can be spawned is determined by executor_thread_pool_size
    config option.

    This ensures there won't be lack of DB sessions when many worker
    threads/eventlets are spawned when the engine is under load.

    The DB connections that are part of overflow bucket are closed
    automatically when they are not in use, so heat doesn't run out of
    'lsof' limit.

    Closes-Bug: #1491185
    Change-Id: I4fcfe50b53b4d476f7483bb860667c42d0c54a19

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/heat 6.0.0.0b3

This issue was fixed in the openstack/heat 6.0.0.0b3 development milestone.
