Failed to start mysql (max retries reached)

Bug #1745432 reported by Jason Hobbs
This bug affects 1 person
Affects: OpenStack Percona Cluster Charm
Status: Fix Released
Importance: High
Assigned to: David Ames
Milestone: 18.02

Bug Description

An HA deployment of charm-percona-cluster failed because a unit entered 'error' state:

http://paste.ubuntu.com/26459277/

from /var/log/mysql/error.log:

http://paste.ubuntu.com/26459298/

The unit later made it to Ready state, but any unit going to 'error' state at any time is fatal for us.

crashdump is attached - the error was on mysql/1.

This is with commit dc19ecb from https://github.com/openstack/charm-percona-cluster

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
description: updated
Revision history for this message
David Ames (thedac) wrote :

This bug is caused by a collision when two or more units try to join/restart at the same time. From the logs you can see that mysql/1 (the failing unit) and mysql/2 both connect near 15:48:0x in [0].

We already have the distributed_wait function [1] that rabbitmq-server uses; it should address this issue (a rough sketch of wiring it in follows the references below).

[0] https://pastebin.ubuntu.com/26459496/
[1] https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/hahelpers/cluster.py#L370
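
For illustration, a minimal sketch of how distributed_wait could gate the restart path. Only distributed_wait itself comes from the charm-helpers module in [1]; the restart_percona() wrapper and the 'modulo-nodes'/'known-wait' option names are assumptions borrowed from the rabbitmq-server usage, not the percona charm's actual code.

    # Sketch only: stagger mysqld restarts so peer units never bounce the
    # service at the same moment. The wrapper and option names are hypothetical.
    from charmhelpers.contrib.hahelpers.cluster import distributed_wait
    from charmhelpers.core.hookenv import config
    from charmhelpers.core.host import service_restart

    def restart_percona():
        # Each unit sleeps for a slot derived from its unit number before
        # restarting, so joins/restarts are serialized across the cluster.
        distributed_wait(modulo=config('modulo-nodes'),
                         wait=config('known-wait'),
                         operation_name='percona restart')
        service_restart('mysql')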

Changed in charm-percona-cluster:
status: New → Triaged
importance: Undecided → High
assignee: nobody → David Ames (thedac)
milestone: none → 18.02
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (master)

Fix proposed to branch: master
Review: https://review.openstack.org/538062

Changed in charm-percona-cluster:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.openstack.org/538062
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=bd5474ce2f7c9eb659d4b1ead8cfcc26992e986a
Submitter: Zuul
Branch: master

commit bd5474ce2f7c9eb659d4b1ead8cfcc26992e986a
Author: David Ames <email address hidden>
Date: Thu Jan 25 15:20:29 2018 -0800

    Avoid simultaneous restarts

    It is possible for two or more percona-cluster units to simultaneously
    attempt to restart and join the cluster. When this race condition
    occurs one unit may error with:
    "Failed to start mysql (max retries reached)"

    We already have the control mechanism distributed_wait used in other
    charms. This change implements this mechanism for percona-cluster.

    Configuration options allow for fine tuning. The balance is time vs
    tolerance for collision errors. CI systems may tolerate the occasional
    false positive for time saved, whereas production deployments can
    sacrifice a bit of time for a guaranteed deploy.

    Change-Id: I52e7f8e410ecd77a7a142d44b43414e33eff3a6e
    Closes-Bug: #1745432
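
The commit message above frames the tuning as time versus collision tolerance. A rough sketch of that trade-off, assuming a unit waits roughly (unit_number % modulo) * wait seconds as in the rabbitmq-server scheme; the option names here are assumptions, not necessarily the final percona-cluster ones.

    # Sketch: worst-case extra deploy time introduced by staggered restarts.
    def worst_case_stagger(modulo_nodes, known_wait):
        # The slowest unit waits at most (modulo_nodes - 1) * known_wait seconds.
        return (modulo_nodes - 1) * known_wait

    print(worst_case_stagger(3, 10))  # 20s added; a CI system may accept the risk
    print(worst_case_stagger(3, 30))  # 60s added; safer for production deploys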

Changed in charm-percona-cluster:
status: In Progress → Fix Committed
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this again on bd5474ce. A crashdump is attached; the error was on mysql/0.

Changed in charm-percona-cluster:
status: Fix Committed → New
Revision history for this message
Liam Young (gnuoy) wrote :

This does seem to be the same bug. I've confirmed the deployed charm has bd5474ce, but looking at the logs, mysql/0 and mysql/1 have interleaved restarts:

2018-01-27 17:21:23 mysql/1 Percona shuts down
2018-01-27 17:21:25 mysql/0 Percona shuts down

2018-01-27 17:21:26 mysql/1 Percona starts
2018-01-27 17:21:28 mysql/0 Percona starts

2018-01-27 17:21:31 mysql/1 Percona shuts down
2018-01-27 17:21:33 mysql/0 Percona shuts down

2018-01-27 17:21:34 mysql/1 Percona starts
2018-01-27 17:21:37 mysql/0 Percona starts

2018-01-27 17:21:37 mysql/0 Percona generates a new UUID: 87dc7bcb-0386-11e8-b483-00163e4c67e0
2018-01-27 17:21:39 mysql/1 Percona shuts down

2018-01-27 17:21:42 mysql/1 Percona starts
2018-01-27 17:21:47 mysql/1 Percona shuts down

2018-01-27 17:21:50 mysql/1 Percona starts
2018-01-27 17:21:50 mysql/1 Percona generates a new UUID: 8fb95848-0386-11e8-b483-00163ef4ad83

2018-01-27 17:21:54 mysql/0 Percona shuts down
2018-01-27 17:21:59 mysql/0 Percona panics

Revision history for this message
Liam Young (gnuoy) wrote :

The units fail to provide service until 17:26, but in the previous five minutes mysql/0 and mysql/1 repeatedly come up and abort due to timeouts. I think systemd is restarting Percona at very close intervals. First and foremost we need to ensure that units are not trying to bootstrap at the same time, but having inspected the logs, other pertinent questions are:

a) Are the timeouts when Percona is talking to itself genuine?
b) Does it matter that Percona tries to cluster with rabbit?
c) Given that mysql/2 is the leader and is started ahead of mysql/0 and mysql/1, why do units 0 and 1 not try to cluster with it immediately?

Revision history for this message
Liam Young (gnuoy) wrote :

Was the percona charm option 'min-cluster-size' set for this deployment? I can't seem to find the deployment configuration options in the crash dump.

Revision history for this message
Ashley Lai (alai) wrote :

Please see the attached bundle. We don't run any 'juju config' commands on this deployment. There is also an overlay for this deployment; I will attach it next.

Revision history for this message
Ashley Lai (alai) wrote :
Revision history for this message
Ashley Lai (alai) wrote :
Revision history for this message
Liam Young (gnuoy) wrote :

I am most confused by the fact that mysql/0 and mysql/1 try to bootstrap without the leader (mysql/2).

mysql/0: 192.168.33.154
mysql/1: 192.168.33.163
mysql/2: 192.168.33.148

Checking the logs the leader does not seem to change from mysql/2:
$ grep "promoted to leadership" mysql_*/var/log/juju/unit-mysql-*
mysql_2/var/log/juju/unit-mysql-2.log:2018-01-27 17:06:21 INFO juju.worker.leadership tracker.go:200 mysql/2 promoted to leadership of mysql

mysql_1/var/log/juju/unit-mysql-1.log:
2018-01-27 17:22:02 DEBUG juju-log Cluster is bootstrapped - configuring mysql on this node
2018-01-27 17:22:02 DEBUG juju-log Leader is bootstrapped uuid: 215ad116-0385-11e8-a972-2f8ebbc25a6b
2018-01-27 17:22:04 DEBUG juju-log Writing file /etc/mysql/percona-xtradb-cluster.conf.d/mysqld.cnf root:root 444

mysql_1/var/log/mysql/error.log
2018-01-27 17:23:12 54028 [ERROR] Aborting
2018-01-27 17:23:12 54028 [ERROR] WSREP: wsrep::connect(gcomm://192.168.33.163,192.168.33.154) failed: 7

Revision history for this message
Liam Young (gnuoy) wrote :

OK, I missed that the percona charm in this deployment did not have fac45afc, so it was missing the code that forces the leader IP into the cluster host list.

The bug present in juju-crashdump-c7e971f0-89a4-4b78-91a8-6ace13aeebdf.tar.gz was fixed by fac45afc.
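
For clarity, this is the behaviour fac45afc restores, sketched in Python; the helper name and call site are illustrative, not the charm's actual functions. Without it, mysql/1's wsrep host list above only contains .163 and .154 and never the leader at 192.168.33.148.

    # Sketch only: always force the leader's address into the wsrep host list
    # so non-leader units bootstrap against the already-bootstrapped leader.
    def cluster_hosts(peer_addresses, leader_address):
        hosts = list(peer_addresses)
        if leader_address and leader_address not in hosts:
            hosts.insert(0, leader_address)
        return hosts

    # Matches the failed wsrep::connect() host list in the comment above:
    print(cluster_hosts(['192.168.33.163', '192.168.33.154'], '192.168.33.148'))
    # ['192.168.33.148', '192.168.33.163', '192.168.33.154']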

Changed in charm-percona-cluster:
status: New → Fix Committed
Ryan Beisner (1chb1n)
Changed in charm-percona-cluster:
status: Fix Committed → Fix Released
Revision history for this message
Chris Gregan (cgregan) wrote :

I think this defect has regressed.
