Percona cluster doesn't establish, crm errors

Bug #1452365 reported by David Britton
This bug affects 3 people
Affects: percona-cluster (Juju Charms Collection)
Status: Fix Released
Importance: High
Assigned to: Liam Young

Bug Description

On node 0 (notice 15:37:00):

2015-05-06 15:37:00 INFO unit.mysql/0.juju-log cmd.go:247 ha:24: Cluster configured, notifying other services
2015-05-06 15:37:18 WARNING unit.mysql/0.juju-log cmd.go:247 shared-db:32: eligible_leader is deprecated. Please consider using is_elected_leader instead.
2015-05-06 15:37:18 DEBUG unit.mysql/0.juju-log cmd.go:247 shared-db:32: No allowed_units - not setting relation settings
2015-05-06 15:37:18 DEBUG unit.mysql/0.juju-log cmd.go:247 shared-db:32: No return data - not setting relation settings
2015-05-06 15:37:19 WARNING unit.mysql/0.juju-log cmd.go:247 shared-db:33: eligible_leader is deprecated. Please consider using is_elected_leader instead.

On node 1 (Notice 15:36:01):

2015-05-06 15:34:25 INFO unit.mysql/1.juju-log cmd.go:247 cluster:0: Making dir /etc/mysql root:root 755
2015-05-06 15:34:25 INFO unit.mysql/1.juju-log cmd.go:247 cluster:0: Writing file /etc/mysql/my.cnf root:root 444
2015-05-06 15:34:25 INFO unit.mysql/1.juju-log cmd.go:247 cluster:0: '/usr/lib/ocf/resource.d/percona/mysql_monitor' already exists, skipping
2015-05-06 15:35:30 WARNING unit.mysql/1.juju-log cmd.go:247 ha:24: is_leader is deprecated. Please consider using is_crm_leader instead.
2015-05-06 15:35:30 INFO unit.mysql/1.juju-log cmd.go:247 ha:24: Retrying 'is_crm_leader' 5 more times (delay=2)
2015-05-06 15:35:32 INFO unit.mysql/1.juju-log cmd.go:247 ha:24: Retrying 'is_crm_leader' 4 more times (delay=4)
2015-05-06 15:35:36 INFO unit.mysql/1.juju-log cmd.go:247 ha:24: Retrying 'is_crm_leader' 3 more times (delay=6)
2015-05-06 15:35:42 INFO unit.mysql/1.juju-log cmd.go:247 ha:24: Retrying 'is_crm_leader' 2 more times (delay=8)
2015-05-06 15:35:50 INFO unit.mysql/1.juju-log cmd.go:247 ha:24: Retrying 'is_crm_leader' 1 more times (delay=10)
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 Traceback (most recent call last):
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/ha-relation-changed", line 488, in <module>
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 main()
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/ha-relation-changed", line 482, in main
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 hooks.execute(sys.argv)
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/core/hookenv.py", line 544, in execute
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 self._hooks[hook_name]()
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/ha-relation-changed", line 440, in ha_relation_changed
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 if (clustered and is_leader(LEADER_RES)):
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/contrib/hahelpers/cluster.py", line 127, in is_leader
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 return is_crm_leader(resource)
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/core/decorators.py", line 42, in _retry_on_exception_inner_2
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 return f(*args, **kwargs)
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/contrib/hahelpers/cluster.py", line 119, in is_crm_leader
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 raise CRMResourceNotFound("CRM resource %s not found" % (resource))
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 charmhelpers.contrib.hahelpers.cluster.CRMResourceNotFound: CRM resource grp_percona_cluster not found
2015-05-06 15:36:01 ERROR juju.worker.uniter.operation runhook.go:86 hook "ha-relation-changed" failed: exit status 1

Node 2 is similar (15:36:03 failure timestamp)

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Dave, this is often a result of the corosync resource used to determine the leader dithering across units at deploy time (which I assume is the case for you). A couple of points here: firstly, this problem will simply go away once we have juju leadership election supported in all the charms (since we will no longer rely on a corosync resource). Second, can you try setting the hacluster cluster_count option to the number of units in your mysql cluster? This will prevent corosync from reaching quorum until the specified number of units have joined the cluster. In practice that has drastically reduced the incidence of this kind of issue for me. Hope that helps.
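
For reference, that would be something like the following (juju 1.x syntax; the hacluster application name hacluster-mysql and the count of 3 are only examples, match them to your deployment):

juju set hacluster-mysql cluster_count=3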

Revision history for this message
David Britton (dpb) wrote : Re: [Bug 1452365] Re: not waiting long enough for cluster lead to establish?

I confirmed this is set in this deployment. I'm guessing it's not enough to prevent what I'm running into. :(


Revision history for this message
David Britton (dpb) wrote : Re: not waiting long enough for cluster lead to establish?

juju resolved --retry on the failed units seems to fix it.
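
For example, for each failed unit (the unit name here is illustrative):

juju resolved --retry mysql/1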

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Here are some logs from the failure I just had.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

juju status output

Revision history for this message
Adam Collard (adam-collard) wrote :

unit-hacluster-mysql-2[4474]: 2015-05-12 19:55:29 INFO unit.hacluster-mysql/2.hanode-relation-changed logger.go:40 element rsc_colocation: validity error : IDREF attribute with-rsc references an unknown ID "cl_mysql_monitor"
unit-hacluster-mysql-2[4474]: 2015-05-12 19:55:29 INFO unit.hacluster-mysql/2.hanode-relation-changed logger.go:40 error: main: #011CIB did not pass DTD/schema validation
unit-hacluster-mysql-2[4474]: 2015-05-12 19:55:29 INFO unit.hacluster-mysql/2.hanode-relation-changed logger.go:40 Errors found during check: config not valid
unit-hacluster-mysql-2[4474]: 2015-05-12 19:55:29 INFO unit.hacluster-mysql/2.hanode-relation-changed logger.go:40 INFO: commit forced

Interesting log lines from Andreas' logs above.

tags: added: openstack
summary: - not waiting long enough for cluster lead to establish?
+ Percona cluster doesn't establish, crm errors
Changed in percona-cluster (Juju Charms Collection):
status: New → Confirmed
Revision history for this message
Liam Young (gnuoy) wrote :

I think this is caused by r54 and the introduction of the mysql monitor. The monitor initially fails (I think because of a timeout when the vip is switching around). Corosync appears to be waiting for that monitor to complete before declaring that grp_percona_cluster is running. This wait for the check pushes the is_crm_leader retry past its window.

Below is from a recent deploy which showed this error:

As the stack trace shows the error is coming from is_crm_leader. Which runs

crm resource show grp_percona_cluster

It is run 5 times with a delay to give corosync time to settle. The logs show that
it does attempt 5 times and then the hook errors with the following exception...

From percona/0 juju-log

2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 Traceback (most recent call last):
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/ha-relation-changed", line 488, in <module>
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 main()
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/ha-relation-changed", line 482, in main
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 hooks.execute(sys.argv)
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/charmhelpers/core/hookenv.py", line 557, in execute
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 self._hooks[hook_name]()
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/ha-relation-changed", line 440, in ha_relation_changed
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 if (clustered and is_leader(LEADER_RES)):
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/charmhelpers/contrib/hahelpers/cluster.py", line 127, in is_leader
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 return is_crm_leader(resource)
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/charmhelpers/core/decorators.py", line 42, in _retry_on_exception_inner_2
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 return f(*args, **kwargs)
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/charmhelpers/contrib/hahelpers/cluster.py", line 119, in is_crm_leader
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 raise CRMResourceNotFound("CRM resource %s not found" % (resource))
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 charmhelpers.contrib.hahelpers.cluster.CRMResourceNotFound: CRM resource grp_percona_cluster not found
2015-05-14 10:08:13 ERROR juju.worker.uniter.operation runhook.go:89 hook "ha-relation-changed" failed: exit status 1

crm status and ...


Revision history for this message
Liam Young (gnuoy) wrote :

Unsurprisingly, increasing the retry delay:

=== modified file 'hooks/charmhelpers/contrib/hahelpers/cluster.py'
--- hooks/charmhelpers/contrib/hahelpers/cluster.py 2015-03-03 02:26:12 +0000
+++ hooks/charmhelpers/contrib/hahelpers/cluster.py 2015-05-14 10:39:30 +0000
@@ -95,7 +95,7 @@
     return False

-@retry_on_exception(5, base_delay=2, exc_type=CRMResourceNotFound)
+@retry_on_exception(5, base_delay=8, exc_type=CRMResourceNotFound)
 def is_crm_leader(resource, retry=False):
     """
     Returns True if the charm calling this is the elected corosync leader,

fixed it for me. I'd like to do some more digging on how the monitor works before I propose it as a fix.
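
For context, assuming the decorator sleeps base_delay * attempt between tries (which matches the delay=2,4,6,8,10 sequence in the node 1 log above), this change roughly quadruples the total retry window. A minimal sketch of that arithmetic:

# total back-off across 5 retries with a linearly increasing delay
def total_wait(retries=5, base_delay=2):
    return sum(base_delay * attempt for attempt in range(1, retries + 1))

print(total_wait(base_delay=2))   # 30s  (2+4+6+8+10), the original window
print(total_wait(base_delay=8))   # 120s (8+16+24+32+40), after the change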

Revision history for this message
Felipe Reyes (freyes) wrote :

Regarding whether the monitor interferes with grp_percona_cluster: yes, it does. grp_percona_cluster doesn't have a node to be allocated to until the monitor sets the writable/readable properties[0], and that won't happen until the Percona cluster is functional according to /usr/bin/clustercheck[1].

Considering this is happening because we need is_leader(), I suggest using another approach to decide who the leader is: the current DC. From the Pacemaker documentation[2]:

"The DC (Designated Controller) node is where all the decisions
 are made and if the current DC fails a new one is elected from
 the remaining cluster nodes. The choice of DC is of no
 significance to an administrator beyond the fact that its logs
 will generally be more interesting."

The cluster decides who the DC will be at a very early stage, because it is needed by Pacemaker internals. This should allow us to reduce (if not remove entirely) a large number of the failures due to latency in the cluster's configuration.

$ sudo crm status | grep 'Current DC'
Current DC: trusty-pcmkr-2 (1001) - partition with quorum

[0] http://bazaar.launchpad.net/~openstack-charmers/charms/trusty/percona-cluster/next/view/head:/hooks/percona_hooks.py#L422
[1] http://bazaar.launchpad.net/~openstack-charmers/charms/trusty/percona-cluster/next/view/head:/ocf/percona/mysql_monitor#L526
[2] http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-cluster-status.html
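
A rough sketch of what a DC-based check could look like (illustrative only, not the actual charm-helpers code; the parsing assumes the 'Current DC:' line format shown above):

import socket
import subprocess

def is_crm_dc():
    """Return True if this node is the current Designated Controller (sketch)."""
    status = subprocess.check_output(['crm', 'status']).decode()
    for line in status.splitlines():
        if line.strip().startswith('Current DC:'):
            # e.g. "Current DC: trusty-pcmkr-2 (1001) - partition with quorum"
            return line.split()[2] == socket.gethostname()
    return False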

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I'm trying a deployment with percona-cluster synced with this charm-helpers change that got merged.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

So far, two deploys, two successes.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I had many more successes, but that was just with the charm-helpers change/sync, which can't fix this on its own. So they must have been pure luck.

Changed in percona-cluster (Juju Charms Collection):
status: Confirmed → In Progress
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Can we get a backport into the stable series?

Changed in percona-cluster (Juju Charms Collection):
status: In Progress → Fix Committed
Felipe Reyes (freyes)
Changed in percona-cluster (Juju Charms Collection):
assignee: nobody → Liam Young (gnuoy)
milestone: none → 15.07
Revision history for this message
Billy Olsen (billy-olsen) wrote :

@Andreas - I'm also +1 on a backport, but would like to see it bake in /next just a bit longer before backporting.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Testing a bit more today, we found some issues.

- If the DC is none or an error occurs when trying to query the DC, then the check for the DC simply returns False, whereas it previously would raise an Exception and retry.

- For some reason, it seems that the relation data needed for the resources to be set up properly is not being set in all cases. It may or may not be specifically related to this change, but it is an observation.

I plan on spending a bit more time investigating what is going on and reverting this change if necessary until we can resolve these issues.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

FWIW, the relation data not being set was deemed to be https://bugs.launchpad.net/charm-helpers/+bug/1459175. The offending code has been rolled back.

tags: added: backport-potential
James Page (james-page)
Changed in percona-cluster (Juju Charms Collection):
status: Fix Committed → Fix Released
David Britton (dpb)
tags: added: kanban-cross-team