QueuePool ERROR in ceilometer-collector log after successful tempest run

Bug #1243251 reported by David Kranz
This bug affects 5 people
Affects: Ceilometer
Status: Fix Released
Importance: High
Assigned to: Mehdi Abaakouk

Bug Description

From http://logs.openstack.org/29/43829/13/check/gate-tempest-devstack-vm-neutron-large-ops/49983d5

2013-10-21 21:03:14.615 5337 ERROR ceilometer.collector.dispatcher.database [-] Failed to record metering data: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database Traceback (most recent call last):
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/opt/stack/new/ceilometer/ceilometer/collector/dispatcher/database.py", line 65, in record_metering_data
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database self.storage_conn.record_metering_data(meter)
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/opt/stack/new/ceilometer/ceilometer/storage/impl_sqlalchemy.py", line 171, in record_metering_data
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database source = session.query(Source).get(data['source'])
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 775, in get
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database return self._load_on_ident(key)
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2514, in _load_on_ident
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database return q.one()
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2184, in one
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database ret = list(self)
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2227, in __iter__
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database return self._execute_and_instances(context)
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2240, in _execute_and_instances
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database close_with_result=True)
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2231, in _connection_from_session
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database **kw)
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 784, in connection
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database close_with_result=close_with_result)
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 788, in _connection_for_bind
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database return self.transaction._connection_for_bind(engine)
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 313, in _connection_for_bind
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database conn = bind.contextual_connect()
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2489, in contextual_connect
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database self.pool.connect(),
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 236, in connect
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database return _ConnectionFairy(self).checkout()
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 401, in __init__
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database rec = self._connection_record = pool._do_get()
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 738, in _do_get
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database (self.size(), self.overflow(), self._timeout))
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
2013-10-21 21:03:14.615 5337 TRACE ceilometer.collector.dispatcher.database

Mehdi Abaakouk (sileht)
Changed in ceilometer:
assignee: nobody → Mehdi Abaakouk (sileht)
importance: Undecided → High
Revision history for this message
gordon chung (chungg) wrote :

this looks like a scalability issue. the logs suggest the collector is flooded with meters: http://logs.openstack.org/68/52768/1/check/check-tempest-devstack-vm-full/76f83a4/logs/screen-ceilometer-collector.txt.gz

we use oslo code to handle sessions/flushing... that might be a place to look (maybe there is a config option we need to set to handle such loads).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/53396

Changed in ceilometer:
status: New → In Progress
Revision history for this message
gordon chung (chungg) wrote :

i put some logging in place to capture sessions in my environment: http://paste.openstack.org/show/49269/

from the looks of it, ceilometer is releasing sessions after meters are recorded; it's just that the Tempest tests are opening sessions quicker than previous sessions are released, so eventually we exhaust the default pool size. we could rework the sql model so we can write quicker, or add sharding support... but is there a way to increase the pool size for the devstack setup? this seems more a configuration problem than a Ceilometer bug right now (although we do need to improve performance to handle high-volume environments)
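For reference, the pool limits that matter here are the oslo.db/SQLAlchemy ones exposed in ceilometer.conf. The exact option names have varied across releases, so treat the following as an illustrative sketch rather than a drop-in snippet; the values shown are hypothetical tuning, not recommended defaults:

```ini
[database]
# Defaults at the time of this bug were effectively:
# pool_size=5, max_overflow=10, pool_timeout=30
max_pool_size = 20
max_overflow = 40
pool_timeout = 60
```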

Julien Danjou (jdanjou)
tags: added: havana-backport-potential
Julien Danjou (jdanjou)
Changed in ceilometer:
importance: High → Critical
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: none → icehouse-1
Revision history for this message
Julien Danjou (jdanjou) wrote :

https://review.openstack.org/#/c/56173/ was proposed in devstack-gate instead, but it was refused, as this is clearly a problem we want to fix in Ceilometer if possible.

For now, it would be good to have this patch put into devstack instead, as requested by the devstack-gate maintainers.

Changed in ceilometer:
status: In Progress → Confirmed
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: icehouse-1 → icehouse-2
Julien Danjou (jdanjou)
Changed in ceilometer:
importance: Critical → High
Revision history for this message
Mehdi Abaakouk (sileht) wrote :

The issue seems to occur less often than before.

My last checks report 3 occurrences in 1 week: http://bit.ly/IGUCx9

My understanding of the issue:

greenthread1(
   rpc IO: get a sample from the rpc wire
   DB IO: record this sample to the db
          * acquire a session from the sqlalchemy pool
          * do the mysql requests (eventlet can switch to another greenthread here)
          * release the session
) > perhaps this greenthread will be paused during the DB IO and another one started:
   greenthread2(
       rpc IO: get another sample and try to record it
       DB IO: record the second sample to the db
   )

When does eventlet switch back to the first one? The question is how eventlet orchestrates all of this IO; it seems to do so badly sometimes.

During a tempest test (with errors), ~6 samples/sec are retrieved by the collector, but only one sample is actually recorded in the db every 4 seconds; just before the sqlalchemy error we can see that the previous db operation took 1 minute.

The default maximum sqlalchemy pool size is 15 (5 persistent connections plus 10 created on demand if needed).

Conclusion: if eventlet schedules 16 RPC IO operations before the first SQL request finishes, the sqlalchemy session pool is exhausted.
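The exhaustion mechanism described above can be reproduced with a minimal stand-in: a bounded queue of "connections" with a checkout timeout plays the role of SQLAlchemy's QueuePool, and plain threads stand in for greenthreads. The sizes mirror the defaults from the traceback (5 persistent + 10 overflow = 15 total); the timeout is shortened from 30s to keep the demo fast. This is an illustration of the failure mode, not Ceilometer code.

```python
import queue
import threading
import time

POOL_SIZE = 15          # 5 persistent + 10 overflow, as in the traceback
CHECKOUT_TIMEOUT = 0.2  # stand-in for QueuePool's 30s timeout
HOLD_TIME = 0.5         # each "db write" holds its connection this long

# The "pool": a bounded queue of connection tokens.
pool = queue.Queue()
for i in range(POOL_SIZE):
    pool.put("conn-%d" % i)

errors = []

def record_sample():
    """Check out a connection, 'write' for HOLD_TIME, then release it."""
    try:
        conn = pool.get(timeout=CHECKOUT_TIMEOUT)
    except queue.Empty:
        # Same shape of failure as: TimeoutError: QueuePool limit of
        # size 5 overflow 10 reached, connection timed out
        errors.append("TimeoutError: pool exhausted")
        return
    time.sleep(HOLD_TIME)
    pool.put(conn)

# 16 concurrent writers against a 15-slot pool, each holding its
# connection longer than the checkout timeout: the 16th times out.
threads = [threading.Thread(target=record_sample) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(errors)
```

With 16 writers and 15 slots, exactly one checkout waits past its timeout and fails, matching the "16 RPC IOs before the first SQL request finishes" scenario above.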

Revision history for this message
Sean Dague (sdague) wrote :

@gordon

There is an incorrect meme that tempest is *OMG Load* in the base case, and it really isn't. I could understand tuning defaults lower under tempest because of the environment, but the fact that the defaults can't handle starting 125 guests in a cloud over 30 minutes seems crazy.

So I'd say this remains a critical bug on the ceilometer side. If it can't handle this moderate load, I'm not sure how it would function in a real environment.

So the defaults should increase in the project, not in devstack.

Changed in ceilometer:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ceilometer (master)

Reviewed: https://review.openstack.org/63424
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=64bc862f6fa936dbe5dd2d402240a81da40bc5a5
Submitter: Jenkins
Branch: master

commit 64bc862f6fa936dbe5dd2d402240a81da40bc5a5
Author: Mehdi Abaakouk <email address hidden>
Date: Fri Dec 20 15:07:39 2013 +0100

    Ensure we are not exhausting the sqlalchemy pool

    Because of eventlet, notification messages can be consumed more quickly
    than they are written to the database; this can exhaust the sqlalchemy
    connection pool and raise:

    TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection
    timed out, timeout 30

    The patch creates an eventlet pool to ensure that the connection pool
    is never exhausted, and prints a warning message in the log to note
    that the pool size may need to be increased.

    Fixes bug #1243251

    Change-Id: I5ccdc141c441db32fab361783a31a96015516878
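The shape of the merged fix — bounding how many records can be in flight at once so the connection pool is never oversubscribed — can be sketched as follows. This uses `concurrent.futures.ThreadPoolExecutor` as a stand-in for eventlet's `GreenPool`, and `record_metering_data` here is an illustrative placeholder, not the actual patch:

```python
from concurrent.futures import ThreadPoolExecutor

# Keep in-flight DB writes below the connection-pool capacity
# (5 persistent + 10 overflow = 15), so a burst of notifications
# queues up here instead of exhausting the SQLAlchemy pool.
MAX_IN_FLIGHT = 10

results = []

def record_metering_data(sample):
    # Illustrative stand-in for the storage driver's write; in the
    # real collector this is where a pooled session is checked out.
    results.append(sample)

# A burst of 100 incoming samples: the executor admits at most
# MAX_IN_FLIGHT writers at a time; the rest wait in its queue
# rather than contending for database connections.
with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as executor:
    for sample in range(100):
        executor.submit(record_metering_data, sample)

print(len(results))
```

The key property is back-pressure: the worker pool, not the database pool, absorbs the burst, and the patch additionally logs a warning when this happens so operators know to raise the pool size.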

Changed in ceilometer:
status: In Progress → Fix Committed
Revision history for this message
gordon chung (chungg) wrote :

Sean, i wasn't trying to imply that Tempest was putting an unrealistic load on Ceilometer -- we definitely need to improve performance.

the devstack preference was more that, given the highly variable environments Ceilometer can interact with, i would expect deployers to configure pool size values (and the like) at deployment rather than relying on Ceilometer's default values.

that said, Mehdi had a clever solution to catch these errors and emit warnings to reconfigure Ceilometer. as for the log errors in tempest, those should be resolved.

Thierry Carrez (ttx)
Changed in ceilometer:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in ceilometer:
milestone: icehouse-2 → 2014.1