Failed to start mysql (max retries reached)

Bug #1745432 reported by Jason Hobbs
This bug affects 1 person
Affects: OpenStack Percona Cluster Charm
Status: Fix Released
Importance: High
Assigned to: David Ames
Milestone: 18.02

Bug Description

An HA deployment of charm-percona-cluster failed because a unit entered 'error' state:

http://paste.ubuntu.com/26459277/

from /var/log/mysql/error.log:

http://paste.ubuntu.com/26459298/

The unit later made it to Ready state, but any unit going to 'error' state at any time is fatal for us.

crashdump is attached - the error was on mysql/1.

This is with commit dc19ecb from https://github.com/openstack/charm-percona-cluster

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
description: updated
Revision history for this message
David Ames (thedac) wrote :

This bug is caused by a collision when two or more units try to join/restart at the same time. From the logs you can see that mysql/1 (the failing unit) and mysql/2 both connect near 15:48:0x in [0].

We already have the distributed_wait function [1] that rabbitmq-server uses; it should address this issue (a rough sketch of wiring it in follows the references below).

[0] https://pastebin.ubuntu.com/26459496/
[1] https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/hahelpers/cluster.py#L370
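
For illustration, a minimal sketch of how distributed_wait could gate the restart path. Only distributed_wait itself comes from the charm-helpers module in [1]; the restart_percona() wrapper and the 'modulo-nodes'/'known-wait' option names are assumptions borrowed from the rabbitmq-server usage, not the percona charm's actual code.

    # Sketch only: stagger mysqld restarts so peer units never bounce the
    # service at the same moment. The wrapper and option names are hypothetical.
    from charmhelpers.contrib.hahelpers.cluster import distributed_wait
    from charmhelpers.core.hookenv import config
    from charmhelpers.core.host import service_restart

    def restart_percona():
        # Each unit sleeps for a slot derived from its unit number before
        # restarting, so joins/restarts are serialized across the cluster.
        distributed_wait(modulo=config('modulo-nodes'),
                         wait=config('known-wait'),
                         operation_name='percona restart')
        service_restart('mysql')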

Changed in charm-percona-cluster:
status: New → Triaged
importance: Undecided → High
assignee: nobody → David Ames (thedac)
milestone: none → 18.02
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (master)

Fix proposed to branch: master
Review: https://review.openstack.org/538062

Changed in charm-percona-cluster:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.openstack.org/538062
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=bd5474ce2f7c9eb659d4b1ead8cfcc26992e986a
Submitter: Zuul
Branch: master

commit bd5474ce2f7c9eb659d4b1ead8cfcc26992e986a
Author: David Ames <email address hidden>
Date: Thu Jan 25 15:20:29 2018 -0800

    Avoid simultaneous restarts

    It is possible for two or more percona-cluster units to simultaneously
    attempt to restart and join the cluster. When this race condition
    occurs one unit may error with:
    "Failed to start mysql (max retries reached)"

    We already have the control mechanism distributed_wait used in other
    charms. This change implements this mechanism for percona-cluster.

    Configuration options allow for fine tuning. The balance is time vs
    tolerance for collision errors. CI systems may tolerate the occasional
    false positive for time saved, whereas production deployments can
    sacrifice a bit of time for a guaranteed deploy.

    Change-Id: I52e7f8e410ecd77a7a142d44b43414e33eff3a6e
    Closes-Bug: #1745432
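
The commit message above frames the tuning as time versus collision tolerance. A rough sketch of that trade-off, assuming a unit waits roughly (unit_number % modulo) * wait seconds as in the rabbitmq-server scheme; the option names here are assumptions, not necessarily the final percona-cluster ones.

    # Sketch: worst-case extra deploy time introduced by staggered restarts.
    def worst_case_stagger(modulo_nodes, known_wait):
        # The slowest unit waits at most (modulo_nodes - 1) * known_wait seconds.
        return (modulo_nodes - 1) * known_wait

    print(worst_case_stagger(3, 10))  # 20s added; a CI system may accept the risk
    print(worst_case_stagger(3, 30))  # 60s added; safer for production deploys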

Changed in charm-percona-cluster:
status: In Progress → Fix Committed
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this again on bd5474ce. A crashdump is attached; the error was on mysql/0.

Changed in charm-percona-cluster:
status: Fix Committed → New
Revision history for this message
Liam Young (gnuoy) wrote :

This does seem to be the same bug. I've confirmed the deployed charm has bd5474ce, but looking at the logs, mysql/0 and mysql/1 have interleaved restarts:

2018-01-27 17:21:23 mysql/1 Percona shuts down
2018-01-27 17:21:25 mysql/0 Percona shuts down

2018-01-27 17:21:26 mysql/1 Percona starts
2018-01-27 17:21:28 mysql/0 Percona starts

2018-01-27 17:21:31 mysql/1 Percona shuts down
2018-01-27 17:21:33 mysql/0 Percona shuts down

2018-01-27 17:21:34 mysql/1 Percona starts
2018-01-27 17:21:37 mysql/0 Percona starts

2018-01-27 17:21:37 mysql/0 Percona generates a new UUID: 87dc7bcb-0386-11e8-b483-00163e4c67e0
2018-01-27 17:21:39 mysql/1 Percona shuts down

2018-01-27 17:21:42 mysql/1 Percona starts
2018-01-27 17:21:47 mysql/1 Percona shuts down

2018-01-27 17:21:50 mysql/1 Percona starts
2018-01-27 17:21:50 mysql/1 Percona generates a new UUID: 8fb95848-0386-11e8-b483-00163ef4ad83

2018-01-27 17:21:54 mysql/0 Percona shuts down
2018-01-27 17:21:59 mysql/0 Percona panics

Revision history for this message
Liam Young (gnuoy) wrote :

The units fail to provide service until 17:26, but in the previous five minutes mysql/0 and mysql/1 repeatedly come up and abort due to timeouts. I think systemd is restarting Percona at very close intervals. First and foremost we need to ensure that units are not trying to bootstrap at the same time, but having inspected the logs, other pertinent questions are:

a) Are the timeouts when Percona is talking to itself genuine?
b) Does it matter that Percona tries to cluster with rabbit?
c) Given that mysql/2 is the leader and is started ahead of mysql/0 and mysql/1, why do units 0 and 1 not try to cluster with it immediately?

Revision history for this message
Liam Young (gnuoy) wrote :

Was the percona charm option 'min-cluster-size' set for this deployment? I can't seem to find the deployment configuration options in the crash dump.

Revision history for this message
Ashley Lai (alai) wrote :

Please see the attached bundle. We don't run any 'juju config' commands on this deployment. There is also an overlay for this deployment; I will attach it next.

Revision history for this message
Ashley Lai (alai) wrote :
Revision history for this message
Ashley Lai (alai) wrote :
Revision history for this message
Liam Young (gnuoy) wrote :

I am most confused by the fact that mysql/0 and mysql/1 try to bootstrap without the leader (mysql/2).

mysql/0: 192.168.33.154
mysql/1: 192.168.33.163
mysql/2: 192.168.33.148

Checking the logs the leader does not seem to change from mysql/2:
$ grep "promoted to leadership" mysql_*/var/log/juju/unit-mysql-*
mysql_2/var/log/juju/unit-mysql-2.log:2018-01-27 17:06:21 INFO juju.worker.leadership tracker.go:200 mysql/2 promoted to leadership of mysql

mysql_1/var/log/juju/unit-mysql-1.log:
2018-01-27 17:22:02 DEBUG juju-log Cluster is bootstrapped - configuring mysql on this node
2018-01-27 17:22:02 DEBUG juju-log Leader is bootstrapped uuid: 215ad116-0385-11e8-a972-2f8ebbc25a6b
2018-01-27 17:22:04 DEBUG juju-log Writing file /etc/mysql/percona-xtradb-cluster.conf.d/mysqld.cnf root:root 444

mysql_1/var/log/mysql/error.log
2018-01-27 17:23:12 54028 [ERROR] Aborting
2018-01-27 17:23:12 54028 [ERROR] WSREP: wsrep::connect(gcomm://192.168.33.163,192.168.33.154) failed: 7

Revision history for this message
Liam Young (gnuoy) wrote :

OK, I missed that the percona charm in this deployment did not have fac45afc, so it was missing the code that forces the leader IP into the cluster host list.

The bug present in juju-crashdump-c7e971f0-89a4-4b78-91a8-6ace13aeebdf.tar.gz was fixed by fac45afc.
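
For clarity, this is the behaviour fac45afc restores, sketched in Python; the helper name and call site are illustrative, not the charm's actual functions. Without it, mysql/1's wsrep host list above only contains .163 and .154 and never the leader at 192.168.33.148.

    # Sketch only: always force the leader's address into the wsrep host list
    # so non-leader units bootstrap against the already-bootstrapped leader.
    def cluster_hosts(peer_addresses, leader_address):
        hosts = list(peer_addresses)
        if leader_address and leader_address not in hosts:
            hosts.insert(0, leader_address)
        return hosts

    # Matches the failed wsrep::connect() host list in the comment above:
    print(cluster_hosts(['192.168.33.163', '192.168.33.154'], '192.168.33.148'))
    # ['192.168.33.148', '192.168.33.163', '192.168.33.154']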

Changed in charm-percona-cluster:
status: New → Fix Committed
Ryan Beisner (1chb1n)
Changed in charm-percona-cluster:
status: Fix Committed → Fix Released
Revision history for this message
Chris Gregan (cgregan) wrote :

I think this defect has regressed.
