Too many reconnections to the SQLAlchemy engine

Bug #1549516 reported by Dina Belova
This bug affects 2 people
Affects: OpenStack Identity (keystone) - Status: Invalid, Importance: Undecided, Assigned to: Unassigned
Affects: oslo.db - Status: Invalid, Importance: Medium, Assigned to: Unassigned

Bug Description

=== Issue Description ===

It looks like oslo.db is reconnecting to the SQLAlchemy engine for every DB request, which leads to a "SELECT 1" query against the database for every meaningful request.

=== Prelude (DinaBelova) ===

I was testing osprofiler (OpenStack profiler) library changes that are currently under review for Nova, Neutron and Keystone + OSprofiler integration, trying to perform nova boot requests. After generating the trace for this request, I got the following HTML report: https://dinabelova.github.io/nova-boot-keystone-cache-turn-on.html . The total number of DB operations done for this request is 417, which seems too much for instance creation. Half of these requests are "SELECT 1" queries, which are issued by oslo.db per engine connection via the _connect_ping_listener function - https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/engines.py#L53
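For illustration, here is a minimal sketch (not the exact oslo.db code linked above) of how such an "engine_connect" ping listener ends up issuing a "SELECT 1" on every checkout, assuming the SQLAlchemy 1.x event API; the connection URL is a placeholder:

import sqlalchemy
from sqlalchemy import event, exc, select


def _ping_listener(connection, branch):
    if branch:
        # "branch" connections reuse the parent's DBAPI connection,
        # so there is nothing to ping
        return
    try:
        # this is the "SELECT 1" that shows up in the trace
        connection.scalar(select([1]))
    except exc.DBAPIError as err:
        if err.connection_invalidated:
            # the pool invalidated a stale connection; retrying once
            # re-validates it with a fresh DBAPI connection
            connection.scalar(select([1]))
        else:
            raise


engine = sqlalchemy.create_engine("mysql+pymysql://user:pass@127.0.0.1/nova")
event.listen(engine, "engine_connect", _ping_listener)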

I verified that all of these requests come from this method by adding tracing to _connect_ping_listener: https://dinabelova.github.io/nova-boot-oslodb-ping-listener-profiled.html - so we can see that all "SELECT 1" requests are placed under the db_ping_listener section of the trace.

These "SELECT 1"s are in fact spending 1/3 of all time SQLalchemy engine in oslo.db is spending on all requests. This seems to be a bug.

=== Env description & steps to reproduce ===

I have a devstack environment with the latest osprofiler (1.1.0) installed. To install the profiler on the devstack env I used the following additions to local.conf:

enable_plugin ceilometer https://git.openstack.org/openstack/ceilometer master
enable_plugin osprofiler https://git.openstack.org/openstack/osprofiler master

Additionally, I've applied the following changes:
- Nova: https://review.openstack.org/254703
- Nova client: https://review.openstack.org/#/c/254699/
- Neutron: https://review.openstack.org/273951
- Neutron client: https://review.openstack.org/281508
- Keystone: https://review.openstack.org/103368

I've also modified the standard keystone.conf to turn on memcache caching:

[cache]

memcache_servers = 127.0.0.1:11211
backend = oslo_cache.memcache_pool
enabled = True

Then you can simply run

nova --profile SECRET_KEY boot --image <image_id> --flavor 42 vm1

to generate all the notifications, and then

osprofiler trace show --html <trace_id> --out nova-boot.html

using the trace id printed at the bottom of the nova boot output.

Revision history for this message
Mike Bayer (zzzeek) wrote :

"SELECT 1" is emitted on every pool checkout and does not correspond to an actual new database connection. are you just seeing the "SELECT 1" or are you also seeing many new TCP connections?

Revision history for this message
Mike Bayer (zzzeek) wrote :

Both nova and keystone are moving towards the modernized "enginefacade", which should reduce the number of individual connection pool uses. Right now, every time nova does get_session(), we get another one of these connects. With the modern enginefacade the connects would all be shared. Not sure if the nova being used here is the one with that new API yet.
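Roughly, the enginefacade pattern referred to here looks like the sketch below (not nova's or keystone's actual code; the table names and SQL are illustrative, and the connection URL normally comes from the service's [database]/connection option). DB access methods decorated as readers or writers share one transaction context and pooled engine per request context, instead of each call creating its own session:

from oslo_db.sqlalchemy import enginefacade


@enginefacade.reader
def count_instances(context):
    # context.session is injected by the facade and backed by the shared
    # engine/pool for this request context
    return context.session.execute(
        "SELECT COUNT(*) FROM instances").scalar()


@enginefacade.writer
def set_instance_name(context, instance_uuid, name):
    context.session.execute(
        "UPDATE instances SET display_name = :name WHERE uuid = :uuid",
        {"name": name, "uuid": instance_uuid})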

Longer term, it is possible to just do away with the "SELECT 1". OpenStack uses this "pessimistic" approach across the board, but I think it is largely there to mitigate lots of "can't connect" issues that are due to misconfigured HAProxy setups (for a long time people had HAProxy setups with a 90s timeout, which should be 90m for pooled connections), as well as the excessive greenlet issue described at http://lists.openstack.org/pipermail/openstack-dev/2015-December/082717.html.

However, without the pessimistic approach, if a single MySQL db is restarted, all OpenStack services would report an error when they are hit with some connections in the pool, which would then resolve on the next access, unless all methods everywhere were wrapped in a retry loop.
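A sketch of that retry-loop alternative using oslo.db's wrap_db_retry decorator (the parameter values and SQL are illustrative, and the retry_on_disconnect option is assumed to be available): instead of pinging pessimistically on every checkout, each DB method retries after a stale connection has been invalidated:

from oslo_db import api as oslo_db_api


@oslo_db_api.wrap_db_retry(max_retries=5, retry_on_disconnect=True)
def get_flavor_name(session, flavor_id):
    # if MySQL was restarted, the first attempt may hit a dead pooled
    # connection, raise a disconnect error and be retried here
    result = session.execute(
        "SELECT name FROM flavors WHERE flavorid = :id", {"id": flavor_id})
    return result.scalar()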

Revision history for this message
Dina Belova (dbelova) wrote :

Some results of the IRC discussion:

- the situation should become better for keystone if the test is rerun with the https://review.openstack.org/#/c/257458/ change applied (merged ~4 hours ago)
- for Nova the situation should be better due to the engine facade refactoring done ~2 weeks ago - need to recheck whether my env included this commit

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

for keystone :)

Changed in keystone:
status: New → Invalid
Revision history for this message
Mike Bayer (zzzeek) wrote :

There is another potential way to mitigate this more short term, which is to do the "SELECT 1" ping only some of the time. E.g. if a pool checkout occurs and we see that we've been "idle" for some time, we do a SELECT 1; then, for subsequent checkouts within a short time period after that one, we don't do it again. This would still cover the case of the MySQL database being restarted and allow the app to recover without error once the DB is running again.
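A sketch of that time-throttled ping, assuming the SQLAlchemy 1.x "engine_connect" event API; the 30-second window, listener name and URL are illustrative:

import time

import sqlalchemy
from sqlalchemy import event, exc, select

_PING_INTERVAL = 30.0        # seconds between pings (illustrative value)
_last_ping = {"at": 0.0}


def _throttled_ping_listener(connection, branch):
    if branch:
        return
    now = time.monotonic()
    if now - _last_ping["at"] < _PING_INTERVAL:
        # pinged recently enough, skip the SELECT 1 for this checkout
        return
    try:
        connection.scalar(select([1]))
    except exc.DBAPIError as err:
        if err.connection_invalidated:
            # stale connection was invalidated by the pool; retry once
            # so the caller still gets a working connection
            connection.scalar(select([1]))
        else:
            raise
    _last_ping["at"] = now


engine = sqlalchemy.create_engine("mysql+pymysql://user:pass@127.0.0.1/nova")
event.listen(engine, "engine_connect", _throttled_ping_listener)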

Changed in oslo.db:
status: New → Confirmed
importance: Undecided → Medium
D G Lee (lyqgenius)
Changed in oslo.db:
status: Confirmed → Invalid