cs:xenial/percona-cluster-247 startup race condition: MySQL server has gone away

Bug #1654086 reported by David Britton
Affects                                    Status        Importance  Assigned to
Landscape Server                           Fix Released  High        David Britton
OpenStack Percona Cluster Charm            Fix Released  High        David Ames
percona-cluster (Juju Charms Collection)   Invalid       High        David Ames

Bug Description

Found this in the mysql unit logs:

2017-01-04 20:06:13 INFO juju-log shared-db:71: Writing file /var/lib/charm/mysql/mysql-glance.passwd root:root 660
2017-01-04 20:06:13 INFO shared-db-relation-changed Traceback (most recent call last):
2017-01-04 20:06:13 INFO shared-db-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/shared-db-relation-changed", line 726, in <module>
2017-01-04 20:06:13 INFO shared-db-relation-changed main()
2017-01-04 20:06:13 INFO shared-db-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/shared-db-relation-changed", line 719, in main
2017-01-04 20:06:13 INFO shared-db-relation-changed hooks.execute(sys.argv)
2017-01-04 20:06:13 INFO shared-db-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/charmhelpers/core/hookenv.py", line 715, in execute
2017-01-04 20:06:13 INFO shared-db-relation-changed self._hooks[hook_name]()
2017-01-04 20:06:13 INFO shared-db-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/shared-db-relation-changed", line 533, in shared_db_changed
2017-01-04 20:06:13 INFO shared-db-relation-changed db_helper)
2017-01-04 20:06:13 INFO shared-db-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/shared-db-relation-changed", line 475, in configure_db_for_hosts
2017-01-04 20:06:13 INFO shared-db-relation-changed password = db_helper.configure_db(host, database, username)
2017-01-04 20:06:13 INFO shared-db-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/charmhelpers/contrib/database/mysql.py", line 319, in configure_db
2017-01-04 20:06:13 INFO shared-db-relation-changed self.create_grant(database, username, remote_ip, password)
2017-01-04 20:06:13 INFO shared-db-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/charmhelpers/contrib/database/mysql.py", line 118, in create_grant
2017-01-04 20:06:13 INFO shared-db-relation-changed password))
2017-01-04 20:06:13 INFO shared-db-relation-changed File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 226, in execute
2017-01-04 20:06:13 INFO shared-db-relation-changed self.errorhandler(self, exc, value)
2017-01-04 20:06:13 INFO shared-db-relation-changed File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
2017-01-04 20:06:13 INFO shared-db-relation-changed raise errorvalue
2017-01-04 20:06:13 INFO shared-db-relation-changed _mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
2017-01-04 20:06:13 ERROR juju.worker.uniter.operation runhook.go:107 hook "shared-db-relation-changed" failed: exit status 1

Getting on the node wasn't a lot more help... The service can't be restarted because of already-existing processes, and killing those yields different errors. It feels like something has gotten into a pretty bad bind.
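
For reference, the error at the bottom of that traceback is MySQL client errno 2006, which MySQLdb raises when the server closes the connection underneath it. A minimal sketch of how it surfaces (the helper name and connection parameters are placeholders, not the charm's actual code):

    import MySQLdb
    from MySQLdb import OperationalError

    def grant_with_diagnosis(host, user, passwd, sql):
        # Placeholder credentials; the charm reads its own from the
        # /var/lib/charm/mysql/*.passwd files seen in the log above.
        conn = MySQLdb.connect(host=host, user=user, passwd=passwd)
        try:
            conn.cursor().execute(sql)
        except OperationalError as exc:
            if exc.args[0] == 2006:
                # "MySQL server has gone away": the server dropped the
                # connection. Here that was mysqld crashing with signal 11
                # during SST, so client-side retries would not have helped.
                print("server dropped the connection; check the mysqld error log")
            raise
        finally:
            conn.close()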

Now, it looks like it originally hit this error:

WSREP_SST: [INFO] Streaming with xbstream (20170104 20:05:55.435)
WSREP_SST: [INFO] Using socat as streamer (20170104 20:05:55.453)
WSREP_SST: [INFO] Using /tmp/tmp.TN5S6deAAd as innobackupex temporary directory (20170104 20:05:56.241)
WSREP_SST: [INFO] Streaming GTID file before SST (20170104 20:05:56.295)
WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | socat -u stdio TCP:10.5.1.90:4444; RC=( ${PIPESTATUS[@]} ) (20170104 20:05:56.320)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20170104 20:05:56.378)
WSREP_SST: [INFO] Streaming the backup to joiner at 10.5.1.90 4444 (20170104 20:06:06.408)
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/mysql/my.cnf --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio TCP:10.5.1.90:4444; RC=( ${PIPESTATUS[@]} ) (20170104 20:06:06.410)
20:06:11 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=2
max_threads=25002
thread_count=4
connection_count=2
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 10013238 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x171a450
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f945c031e58 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0x9176db]
/usr/sbin/mysqld(handle_fatal_signal+0x378)[0x663418]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f958ae2e390]
[0x7f92080000a8]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 188
Status: KILL_CONNECTION

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
WSREP_SST: [ERROR] innobackupex finished with error: 9. Check /var/lib/percona-xtradb-cluster//innobackup.backup.log (20170104 20:06:11.702)
WSREP_SST: [ERROR] Cleanup after exit with status:22 (20170104 20:06:11.747)
WSREP_SST: [INFO] Cleaning up temporary directories (20170104 20:06:11.757)

After talking with thedac, it seems our max-connections setting is too high for xenial (25000). Thedac and beisner suggested 2000 as a default. I suggest the charm should probably do this as well.
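
Plugging the numbers from the crash log into mysqld's own worst-case formula shows the scale of the problem; a small sketch follows. sort_buffer_size does not appear in the excerpt, so the stock 5.6 default of 256 KiB is assumed, and max_threads is the 25002 mysqld reported for max-connections=25000:

    # Worst-case memory estimate, using mysqld's formula from the crash log.
    key_buffer_size = 33554432    # bytes, from the log
    read_buffer_size = 131072     # bytes, from the log
    sort_buffer_size = 262144     # bytes, assumed 5.6 default (not in the log)

    def worst_case_gib(max_threads):
        total = key_buffer_size + (read_buffer_size + sort_buffer_size) * max_threads
        return total / float(1 << 30)

    print(round(worst_case_gib(25002), 1))  # ~9.2 GiB with max-connections=25000
    print(round(worst_case_gib(2002), 2))   # ~0.76 GiB with the suggested 2000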

Tags: landscape
David Britton (dpb)
Changed in landscape:
milestone: none → 16.12
tags: added: landscape
Ryan Beisner (1chb1n) wrote :

Theory: I think it's very specifically a 5.5 to 5.6 thing, not necessarily the init system or the server release. performance_schema flipped its default value, and now it grabs the memory in advance [1]

So with 5.5, there was no downside to setting max connections way higher than it needed to be or should be. But with 5.6, you pay for that with memory allocation (or potentially OOM, which causes a weird wild goose chase of errors).

I'd lean toward keeping the charm's low default max conn so that it is usable for non-OpenStack contexts out of the box. Exposing a config flag to bit-flip the performance_schema config might be worth exploring. That, or 5.6 users will have to get nitty-gritty about tuning their db settings to the resources available.

[1] http://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-6.html
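
For anyone wanting to confirm this on a running 5.6 node, both the flag and the memory it pre-allocated can be inspected directly. A sketch with placeholder credentials:

    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="root", passwd="secret")
    cur = conn.cursor()

    cur.execute("SHOW VARIABLES LIKE 'performance_schema'")
    print(cur.fetchone())  # ('performance_schema', 'ON') on stock 5.6

    # 5.6 reports what the instrumentation grabbed up front; rows are
    # (Type, Name, Status) tuples.
    cur.execute("SHOW ENGINE PERFORMANCE_SCHEMA STATUS")
    for _, name, value in cur.fetchall():
        if name == "performance_schema.memory":
            print("%s = %s bytes" % (name, value))
    conn.close()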

Changed in landscape:
status: New → In Progress
assignee: nobody → David Britton (davidpbritton)
importance: Undecided → High
Changed in landscape:
status: In Progress → Fix Committed
David Ames (thedac) wrote :

The issue with percona cluster is the 5.6 version of mysql defaulting to performance_schema=on [1] at server startup. This allocates all the memory that would be required to handle max-connections plus several other memory settings. See the mysql memory calculator [2] to see how this plays out.

With 5.5 this was not an issue. We occasionally saw connection exhaustion in HA clouds with max-connections near 2000. The common practice became to set max-connections unrealistically high, near 10k or 20k. In the move to 5.6 on Xenial this became a problem.

Discussed this with the team and the plan is to make performance_schema [1] configurable and default to off at startup in the percona cluster charm. This will get us 5.5-like behavior and avoid over-committing memory based on max-connections.

Note this is somewhat of a duplicate of [3]

[1] http://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-6.html#mysqld-5-6-6-performance-schema
[2] http://www.mysqlcalculator.com/
[3] https://bugs.launchpad.net/charms/+source/percona-cluster/+bug/1600001
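
The 5.5-like behavior amounts to a single line of mysqld configuration; a minimal sketch of the end state (the drop-in file path is an illustrative assumption, since the charm renders its own my.cnf template):

    # Force 5.5-like behavior: stop pre-allocating memory for the worst
    # case implied by max-connections. The path is an assumption, not the
    # charm's actual template target.
    OVERRIDE = "[mysqld]\nperformance_schema = OFF\n"

    with open("/etc/mysql/conf.d/perf_schema.cnf", "w") as f:
        f.write(OVERRIDE)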

Changed in percona-cluster (Juju Charms Collection):
status: New → In Progress
importance: Undecided → High
assignee: nobody → David Ames (thedac)
milestone: none → 17.01
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (master)

Fix proposed to branch: master
Review: https://review.openstack.org/417131

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.openstack.org/417131
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=e47c4ad13517996963867e6d6a4765a0a8aa29f4
Submitter: Jenkins
Branch: master

commit e47c4ad13517996963867e6d6a4765a0a8aa29f4
Author: David Ames <email address hidden>
Date: Thu Jan 5 10:08:55 2017 -0800

    Default performance schema to off

    The 5.6 version of mysql defaults to performance_schema=on at server
    startup. This allocates all the memory that would be required to
    handle max-connections plus several other memory settings.

    We occasionally saw connection exhaustion in HA clouds with
    max-connections near 2000. The common practice became to set
    max-connections unrealistically high near 10k or 20k. In the move to
    5.6 on Xenial this became a problem as the memory requirements for
    values that high are unrealistic.

    This change adds performance schema as a configuration option and
    defaults to off giving 5.5 like behavior. It also considerably
    updates documentation and highlights the need to carefully consider
    memory related configuration settings for production environments.

    Make the amulet test python2 and python3 capable.

    Change-Id: Id01095ac2f73fa041b25b2602496a4c04a13e3ef
    Partial-Bug: #1654086
    Closes-Bug: #1600001

Changed in landscape:
milestone: 16.12 → 17.01
Chad Smith (chad.smith)
Changed in landscape:
status: Fix Committed → Fix Released
James Page (james-page)
Changed in charm-percona-cluster:
assignee: nobody → David Ames (thedac)
importance: Undecided → High
status: New → In Progress
Changed in percona-cluster (Juju Charms Collection):
status: In Progress → Invalid
James Page (james-page) wrote :

The improved default buffer size handling in the master branch will also improve this problem; I'm wondering whether we need to do one last change here to set max-connections better when the user does not provide any information.
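
Inverting the crash-log formula gives one way such a default could be derived from the memory actually available. A sketch only, reusing the buffer-size assumptions from above; this is not the charm's implemented heuristic:

    # Derive a max-connections default from available memory by inverting
    # mysqld's worst-case formula. Buffer sizes as assumed earlier.
    KEY_BUFFER = 33554432
    PER_THREAD = 131072 + 262144  # read_buffer_size + sort_buffer_size

    def safe_max_connections(mem_bytes, headroom=0.8):
        budget = mem_bytes * headroom - KEY_BUFFER
        return max(100, int(budget // PER_THREAD))

    print(safe_max_connections(4 << 30))  # ~8650 on a 4 GiB machine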

Changed in charm-percona-cluster:
milestone: none → 17.05
David Ames (thedac)
Changed in charm-percona-cluster:
status: In Progress → Fix Committed
James Page (james-page)
Changed in charm-percona-cluster:
milestone: 17.05 → 17.08
James Page (james-page)
Changed in charm-percona-cluster:
status: Fix Committed → Fix Released