neutron-api lost connections to MySQL server

Bug #1806784 reported by John George
This bug affects 4 people
Affects                                  Status       Importance  Assigned to      Milestone
OpenStack AODH Charm                     Triaged      Medium      Unassigned
OpenStack Designate Charm                Triaged      Medium      Unassigned
OpenStack Neutron API Charm              Triaged      Medium      Unassigned
OpenStack Nova Cloud Controller Charm    Triaged      Undecided   Gabriel Ramirez
OpenStack Percona Cluster Charm          In Progress  Undecided   Gabriel Ramirez

Bug Description

Foundation cloud builds are seeing many lost connections to the MySQL server from neutron-api.

The bundle and crashdump are attached.

2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines [req-ea783de8-dbd5-4b41-972e-8d76c0a135fb - - - - -] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines context)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 508, in do_execute
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines cursor.execute(statement, parameters)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 165, in execute
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines result = self._query(query)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 321, in _query
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines conn.query(q)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 860, in query
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1061, in _read_query_result
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines result.read()
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1349, in read
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines first_packet = self.connection._read_packet()
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 991, in _read_packet
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines packet_header = self._read_bytes(4)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1037, in _read_bytes
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')

Revision history for this message
John George (jog) wrote :
description: updated
Revision history for this message
James Page (james-page) wrote :

I just de-duplicated three other bugs related to this against this bug report; it would appear that we have a mismatch between the 'idle_timeout' set in the OpenStack services and the idle timeout in MySQL, which results in this type of error message in the OpenStack service logs.

idle_timeout has been superseded by connection_recycle_timeout, so at a minimum we need to update the templates to use this new configuration option name, and we should set sane defaults for both the OpenStack services and MySQL (in the form of percona-cluster).
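
For illustration only, a minimal sketch of what the rendered [database] section of an affected service's config (neutron.conf here) might look like once the templates use the new option name; the connection URL and the values shown are assumptions, not charm-rendered output:

[database]
# hypothetical connection string, not taken from the attached bundle
connection = mysql+pymysql://neutron:secret@10.0.0.10/neutron
# formerly idle_timeout; pooled connections older than this are recycled on checkout
connection_recycle_timeout = 3600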

Changed in charm-neutron-api:
status: New → Triaged
importance: Undecided → High
Changed in charm-designate:
status: New → Triaged
Changed in charm-aodh:
status: New → Triaged
Changed in charm-designate:
importance: Undecided → High
Changed in charm-aodh:
importance: Undecided → High
importance: High → Medium
Changed in charm-designate:
importance: High → Medium
Changed in charm-neutron-api:
importance: High → Medium
Revision history for this message
James Page (james-page) wrote :

Note that this is not service-impacting - the query being run is the connection checker query used to validate a connection before it's released from the connection pool; the stale connection will be discarded and a new connection created.

Revision history for this message
James Page (james-page) wrote :

The configuration option switched from idle_timeout to connection_recycle_timeout at Queens.

Revision history for this message
James Page (james-page) wrote :

(It defaults to 3600 seconds, which is fairly sane.)

Revision history for this message
James Page (james-page) wrote :

The default in percona-cluster should be 28800 seconds for termination of idle connections.
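
For illustration only, checking and aligning the charm option on a deployed model might look like the following (the application name 'mysql' is an assumption taken from the deployment shown later in this bug; 28800 seconds is MySQL's own upstream default for wait_timeout):

$ juju config mysql wait-timeout
180
$ juju config mysql wait-timeout=28800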

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

sqlalchemy docs for reference on "SELECT 1":

http://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-pessimistic
"Dealing with Disconnects
The connection pool has the ability to refresh individual connections as well as its entire set of connections, setting the previously pooled connections as “invalid”. A common use case is to allow the connection pool to gracefully recover when the database server has been restarted, and all previously established connections are no longer functional. There are two approaches to this.

Disconnect Handling - Pessimistic
The pessimistic approach refers to emitting a test statement on the SQL connection at the start of each connection pool checkout, to test that the database connection is still viable. Typically, this is a simple statement like “SELECT 1”, but may also make use of some DBAPI-specific method to test the connection for liveness."
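
A minimal, self-contained sketch of the pessimistic approach described above, using SQLAlchemy's pool_pre_ping flag; the connection URL is a placeholder and this is not how the charms configure oslo.db, just an illustration of the mechanism:

from sqlalchemy import create_engine, text

# pool_pre_ping emits a lightweight ping (effectively "SELECT 1") on every
# checkout; if the ping fails with a disconnect error, the stale connection
# is invalidated and replaced transparently before the caller's query runs.
engine = create_engine(
    "mysql+pymysql://user:secret@10.0.0.10/example",  # placeholder URL
    pool_pre_ping=True,
    pool_recycle=3600,  # same role as oslo.db's connection_recycle_timeout
)

with engine.connect() as conn:
    # Even if MySQL dropped the pooled connection while it sat idle, this
    # query runs on a freshly created connection rather than raising.
    print(conn.execute(text("SELECT 1")).scalar())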

Revision history for this message
Fabio Augusto Miranda Martins (fabio.martins) wrote :

This bug affects a project where all timeouts are commented out in designate units:

# grep timeout designate.conf
#tcp_recv_timeout = 0.5
#poll_timeout = 30
#idle_timeout = 3600
#timeout = 30
#idle_timeout = 3600

And MySQL is configured with a 180-second timeout (apparently the FCE templates use this value of 180):

From /etc/mysql/percona-xtradb-cluster.conf.d/mysqld.cnf:

# Seconds before clearing idle connections
wait_timeout = 180

$ juju config mysql wait-timeout
180

Due to this discrepancy, I see the following message in /var/log/designate/designate-mdns.log:

2019-02-25 15:26:43.058 141152 ERROR oslo_db.sqlalchemy.engines DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1']

Increasing the mysql wait-timeout to 3600 doesn't really seem like a good idea (it may add memory overhead to MySQL due to long-lasting connections), and there is no option in the designate charm to change idle_timeout to 3600.
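
As a general illustration (not a charm-rendered default), the mismatch can be resolved from either side: raise the server-side wait_timeout, or keep the service-side recycle interval below it, e.g. something like:

# designate.conf, [database] section -- illustrative value only
# recycle pooled connections before MySQL's wait_timeout (180 s here) can drop them
connection_recycle_timeout = 170

Either way, the key invariant is that pooled connections are recycled before the server's idle timeout expires.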

Revision history for this message
James Page (james-page) wrote :

It appears that the impacted deployments have the wait-timeout option configured to 180 seconds, which means there is an inconsistency between what oslo expects (3600 seconds) and what PXC does (180 seconds); this would explain why this error is seen.

Changed in charm-neutron-api:
assignee: nobody → Gabriel Ramirez (gabriel1109)
Changed in charm-designate:
assignee: nobody → Gabriel Ramirez (gabriel1109)
Changed in charm-aodh:
assignee: nobody → Gabriel Ramirez (gabriel1109)
Changed in charm-percona-cluster:
assignee: nobody → Gabriel Ramirez (gabriel1109)
Changed in charm-nova-cloud-controller:
assignee: nobody → Gabriel Ramirez (gabriel1109)
Changed in charm-neutron-api:
assignee: Gabriel Ramirez (gabriel1109) → nobody
Changed in charm-designate:
assignee: Gabriel Ramirez (gabriel1109) → nobody
Changed in charm-aodh:
assignee: Gabriel Ramirez (gabriel1109) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/729414

Changed in charm-percona-cluster:
status: New → In Progress
Changed in charm-nova-cloud-controller:
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-percona-cluster (master)

Change abandoned by "Alex Kavanagh <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-percona-cluster/+/729414
Reason: Abandoning for a couple of reasons: one, there has been no action from the OP on the reviewer's feedback, and two, the percona charm is essentially in maintenance mode.
