neutron-api lost connections to MySQL server

Bug #1806784 reported by John George on 2018-12-04
This bug affects 4 people
Affects                        Status    Importance   Assigned to
OpenStack AODH Charm           Triaged   Medium       Unassigned
OpenStack Designate Charm      Triaged   Medium       Unassigned
OpenStack neutron-api charm    Triaged   Medium       Unassigned

Bug Description

Foundation cloud builds are seeing many lost connections to the MySQL server from neutron-api.

The bundle and crashdump are attached.

2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines [req-ea783de8-dbd5-4b41-972e-8d76c0a135fb - - - - -] Database connection was found disconnected; reconnecting: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines context)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 508, in do_execute
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines cursor.execute(statement, parameters)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 165, in execute
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines result = self._query(query)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 321, in _query
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines conn.query(q)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 860, in query
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1061, in _read_query_result
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines result.read()
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1349, in read
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines first_packet = self.connection._read_packet()
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 991, in _read_packet
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines packet_header = self._read_bytes(4)
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1037, in _read_bytes
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2018-12-04 00:21:16.595 182093 ERROR oslo_db.sqlalchemy.engines pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')

John George (jog) wrote :
description: updated
James Page (james-page) wrote :

I just de-duped three other bugs related to this against this bug report; it would appear that we have a mismatch between the 'idle_timeout' set in the OpenStack services and the idle timeout in MySQL, which results in this type of error message in the OpenStack service logs.

idle_timeout has been superseded by connection_recycle_timeout, so at a minimum we need to update the templates to use the new configuration option name, and we should set sane defaults for both the OpenStack services and MySQL (in the form of percona-cluster).
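
For reference, a minimal sketch of how the renamed option is expressed in an OpenStack service configuration (the [database] section; 3600 seconds is the oslo.db default and is shown here purely for illustration):

[database]
# formerly idle_timeout; pooled connections older than this are recycled
connection_recycle_timeout = 3600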

Changed in charm-neutron-api:
status: New → Triaged
importance: Undecided → High
Changed in charm-designate:
status: New → Triaged
Changed in charm-aodh:
status: New → Triaged
Changed in charm-designate:
importance: Undecided → High
Changed in charm-aodh:
importance: Undecided → High
importance: High → Medium
Changed in charm-designate:
importance: High → Medium
Changed in charm-neutron-api:
importance: High → Medium
James Page (james-page) wrote :

Note that this is not service impacting - the query being run is the connection-checker query used to validate a connection before it's released from the connection pool; the stale connection will simply be discarded and a new one created.

James Page (james-page) wrote :

The configuration option switched from idle_timeout to connection_recycle_timeout at Queens.

James Page (james-page) wrote :

(It defaults to 3600 seconds, which is fairly sane.)

James Page (james-page) wrote :

The default in percona-cluster should be 28800 seconds for the termination of idle connections.
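
As a sketch of what that would look like via the charm (the wait-timeout option already exists on percona-cluster, as shown further down; the value here is illustrative):

$ juju config mysql wait-timeout=28800

which ends up in mysqld.cnf as:

# Seconds before clearing idle connections
wait_timeout = 28800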

Dmitrii Shcherbakov (dmitriis) wrote :

sqlalchemy docs for reference on "SELECT 1":

http://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-pessimistic
"Dealing with Disconnects
The connection pool has the ability to refresh individual connections as well as its entire set of connections, setting the previously pooled connections as “invalid”. A common use case is allow the connection pool to gracefully recover when the database server has been restarted, and all previously established connections are no longer functional. There are two approaches to this.

Disconnect Handling - Pessimistic
The pessimistic approach refers to emitting a test statement on the SQL connection at the start of each connection pool checkout, to test that the database connection is still viable. Typically, this is a simple statement like “SELECT 1”, but may also make use of some DBAPI-specific method to test the connection for liveness."
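
As a minimal illustration of that pessimistic approach in plain SQLAlchemy (oslo.db implements its own variant of this ping, which is what produces the 'SELECT 1' in the traceback above; the connection URL below is hypothetical):

from sqlalchemy import create_engine

# Emit a lightweight test statement on every pool checkout; a connection
# that fails the ping is discarded and transparently replaced.
engine = create_engine(
    "mysql+pymysql://user:password@db-host/designate",
    pool_pre_ping=True,
)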

This bug affects a project where all of the timeouts are commented out (i.e. left at their defaults) on the designate units:

# grep timeout designate.conf
#tcp_recv_timeout = 0.5
#poll_timeout = 30
#idle_timeout = 3600
#timeout = 30
#idle_timeout = 3600

And MySQL is configured with a 180-second timeout (apparently the FCE templates are using this 180 value):

From /etc/mysql/percona-xtradb-cluster.conf.d/mysqld.cnf:

# Seconds before clearing idle connections
wait_timeout = 180

$ juju config mysql wait-timeout
180

Due to this discrepancy I see the following message appearing in /var/log/designate/designate-mdns.log:

2019-02-25 15:26:43.058 141152 ERROR oslo_db.sqlalchemy.engines DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1']

Increasing the mysql wait-timeout to 3600 doesn't really seem like a good idea (it may increase memory usage in MySQL due to long-lived connections), and there is no option in the designate charm to change idle_timeout to 3600.
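
For comparison, the effective server-side value can be confirmed directly on a percona-cluster unit (a sketch; the value simply reflects the charm config shown above):

mysql> SHOW GLOBAL VARIABLES LIKE 'wait_timeout';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wait_timeout  | 180   |
+---------------+-------+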

James Page (james-page) wrote :

It appears that the impacted deployments have the wait-timeout option configured to 180 seconds, which means there is an inconsistency between what oslo.db is expecting (3600 seconds) and what PXC is doing (180 seconds); this would explain why this error is seen.
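
Put side by side, the two halves of that mismatch look like this (values taken from the comments above, for illustration only):

# PXC side (/etc/mysql/percona-xtradb-cluster.conf.d/mysqld.cnf)
wait_timeout = 180

# oslo.db side ([database] section of the service configuration, default)
connection_recycle_timeout = 3600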
