neutron-api lost connections to MySQL server

Bug #1806784 reported by John George on 2018-12-04
This bug affects 4 people
Affects                        Status    Importance   Assigned to
OpenStack AODH Charm           Triaged   Medium       Unassigned
OpenStack Designate Charm      Triaged   Medium       Unassigned
OpenStack neutron-api charm    Triaged   Medium       Unassigned

Bug Description

Foundation cloud builds are seeing many lost connections to the MySQL server from neutron-api.

The bundle and crashdump are attached.

2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines [req-ea783de8-dbd5-4b41-972e-8d76c0a135fb - - - - -] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines context)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 508, in do_execute
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines cursor.execute(statement, parameters)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 165, in execute
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines result = self._query(query)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 321, in _query
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines conn.query(q)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 860, in query
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1061, in _read_query_result
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines result.read()
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1349, in read
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines first_packet = self.connection._read_packet()
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 991, in _read_packet
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines packet_header = self._read_bytes(4)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1037, in _read_bytes
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')

John George (jog) wrote :
description: updated
James Page (james-page) wrote :

I just de-duped three other bugs related to this against this bug report; it would appear that we have a mismatch between the 'idle_timeout' set in the OpenStack services and the idle timeout in MySQL, which results in this type of error message in the OpenStack service logs.

idle_timeout has been superseded by connection_recycle_timeout, so at a minimum we need to update the templates to use the new configuration option name, and we should set sane defaults for both the OpenStack services and MySQL (in the form of percona-cluster).
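
For reference, a minimal sketch of how the renamed option is expressed in an OpenStack service configuration (the [database] section; 3600 seconds is the oslo.db default and is shown here purely for illustration):

[database]
# formerly idle_timeout; pooled connections older than this are recycled
connection_recycle_timeout = 3600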

Changed in charm-neutron-api:
status: New → Triaged
importance: Undecided → High
Changed in charm-designate:
status: New → Triaged
Changed in charm-aodh:
status: New → Triaged
Changed in charm-designate:
importance: Undecided → High
Changed in charm-aodh:
importance: Undecided → High
importance: High → Medium
Changed in charm-designate:
importance: High → Medium
Changed in charm-neutron-api:
importance: High → Medium
James Page (james-page) wrote :

Note that this is not service impacting - the query being run is the connection-checker query used to validate a connection before it's released from the connection pool; the stale connection will simply be discarded and a new one created.

James Page (james-page) wrote :

The configuration option switched from idle_timeout to connection_recycle_timeout at Queens.

James Page (james-page) wrote :

(It defaults to 3600 seconds, which is fairly sane.)

James Page (james-page) wrote :

The default in percona-cluster should be 28800 seconds for the termination of idle connections.
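
As a sketch of what that would look like via the charm (the wait-timeout option already exists on percona-cluster, as shown further down; the value here is illustrative):

$ juju config mysql wait-timeout=28800

which ends up in mysqld.cnf as:

# Seconds before clearing idle connections
wait_timeout = 28800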

Dmitrii Shcherbakov (dmitriis) wrote :

sqlalchemy docs for reference on "SELECT 1":

http://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-pessimistic
"Dealing with Disconnects
The connection pool has the ability to refresh individual connections as well as its entire set of connections, setting the previously pooled connections as “invalid”. A common use case is allow the connection pool to gracefully recover when the database server has been restarted, and all previously established connections are no longer functional. There are two approaches to this.

Disconnect Handling - Pessimistic
The pessimistic approach refers to emitting a test statement on the SQL connection at the start of each connection pool checkout, to test that the database connection is still viable. Typically, this is a simple statement like “SELECT 1”, but may also make use of some DBAPI-specific method to test the connection for liveness."
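
As a minimal illustration of that pessimistic approach in plain SQLAlchemy (oslo.db implements its own variant of this ping, which is what produces the 'SELECT 1' in the traceback above; the connection URL below is hypothetical):

from sqlalchemy import create_engine

# Emit a lightweight test statement on every pool checkout; a connection
# that fails the ping is discarded and transparently replaced.
engine = create_engine(
    "mysql+pymysql://user:password@db-host/designate",
    pool_pre_ping=True,
)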

This bug affects a project where all of the timeouts are commented out (i.e. left at their defaults) on the designate units:

# grep timeout designate.conf
#tcp_recv_timeout = 0.5
#poll_timeout = 30
#idle_timeout = 3600
#timeout = 30
#idle_timeout = 3600

And MySQL is configured with a 180-second timeout (apparently the FCE templates are using this 180 value):

From /etc/mysql/percona-xtradb-cluster.conf.d/mysqld.cnf:

# Seconds before clearing idle connections
wait_timeout = 180

$ juju config mysql wait-timeout
180

Due to this discrepancy I see the following message appearing in /var/log/designate/designate-mdns.log:

2019-02-25 15:26:43.058 141152 ERROR oslo_db.sqlalchemy.engines DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1']

Increasing the mysql wait-timeout to 3600 doesn't really seem like a good idea (it may increase memory usage in MySQL due to long-lived connections), and there is no option in the designate charm to change idle_timeout to 3600.
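
For comparison, the effective server-side value can be confirmed directly on a percona-cluster unit (a sketch; the value simply reflects the charm config shown above):

mysql> SHOW GLOBAL VARIABLES LIKE 'wait_timeout';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wait_timeout  | 180   |
+---------------+-------+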

James Page (james-page) wrote :

It appears that the impacted deployments have the wait-timeout option configured to 180 seconds, which means there is an inconsistency between what oslo.db is expecting (3600 seconds) and what PXC is doing (180 seconds); this would explain why this error is seen.
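
Put side by side, the two halves of that mismatch look like this (values taken from the comments above, for illustration only):

# PXC side (/etc/mysql/percona-xtradb-cluster.conf.d/mysqld.cnf)
wait_timeout = 180

# oslo.db side ([database] section of the service configuration, default)
connection_recycle_timeout = 3600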
