Percona cluster doesn't establish, crm errors

Bug #1452365 reported by David Britton
This bug affects 3 people
Affects: percona-cluster (Juju Charms Collection)
Status: Fix Released
Importance: High
Assigned to: Liam Young

Bug Description

On node 0 (notice 15:37:00):

2015-05-06 15:37:00 INFO unit.mysql/0.juju-log cmd.go:247 ha:24: Cluster configured, notifying other services
2015-05-06 15:37:18 WARNING unit.mysql/0.juju-log cmd.go:247 shared-db:32: eligible_leader is deprecated. Please consider using is_elected_leader instead.
2015-05-06 15:37:18 DEBUG unit.mysql/0.juju-log cmd.go:247 shared-db:32: No allowed_units - not setting relation settings
2015-05-06 15:37:18 DEBUG unit.mysql/0.juju-log cmd.go:247 shared-db:32: No return data - not setting relation settings
2015-05-06 15:37:19 WARNING unit.mysql/0.juju-log cmd.go:247 shared-db:33: eligible_leader is deprecated. Please consider using is_elected_leader instead.

On node 1 (Notice 15:36:01):

2015-05-06 15:34:25 INFO unit.mysql/1.juju-log cmd.go:247 cluster:0: Making dir /etc/mysql root:root 755
2015-05-06 15:34:25 INFO unit.mysql/1.juju-log cmd.go:247 cluster:0: Writing file /etc/mysql/my.cnf root:root 444
2015-05-06 15:34:25 INFO unit.mysql/1.juju-log cmd.go:247 cluster:0: '/usr/lib/ocf/resource.d/percona/mysql_monitor' already exists, skipping
2015-05-06 15:35:30 WARNING unit.mysql/1.juju-log cmd.go:247 ha:24: is_leader is deprecated. Please consider using is_crm_leader instead.
2015-05-06 15:35:30 INFO unit.mysql/1.juju-log cmd.go:247 ha:24: Retrying 'is_crm_leader' 5 more times (delay=2)
2015-05-06 15:35:32 INFO unit.mysql/1.juju-log cmd.go:247 ha:24: Retrying 'is_crm_leader' 4 more times (delay=4)
2015-05-06 15:35:36 INFO unit.mysql/1.juju-log cmd.go:247 ha:24: Retrying 'is_crm_leader' 3 more times (delay=6)
2015-05-06 15:35:42 INFO unit.mysql/1.juju-log cmd.go:247 ha:24: Retrying 'is_crm_leader' 2 more times (delay=8)
2015-05-06 15:35:50 INFO unit.mysql/1.juju-log cmd.go:247 ha:24: Retrying 'is_crm_leader' 1 more times (delay=10)
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 Traceback (most recent call last):
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/ha-relation-changed", line 488, in <module>
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 main()
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/ha-relation-changed", line 482, in main
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 hooks.execute(sys.argv)
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/core/hookenv.py", line 544, in execute
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 self._hooks[hook_name]()
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/ha-relation-changed", line 440, in ha_relation_changed
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 if (clustered and is_leader(LEADER_RES)):
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/contrib/hahelpers/cluster.py", line 127, in is_leader
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 return is_crm_leader(resource)
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/core/decorators.py", line 42, in _retry_on_exception_inner_2
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 return f(*args, **kwargs)
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-mysql-1/charm/hooks/charmhelpers/contrib/hahelpers/cluster.py", line 119, in is_crm_leader
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 raise CRMResourceNotFound("CRM resource %s not found" % (resource))
2015-05-06 15:36:01 INFO unit.mysql/1.ha-relation-changed logger.go:40 charmhelpers.contrib.hahelpers.cluster.CRMResourceNotFound: CRM resource grp_percona_cluster not found
2015-05-06 15:36:01 ERROR juju.worker.uniter.operation runhook.go:86 hook "ha-relation-changed" failed: exit status 1

Node 2 is similar (15:36:03 failure timestamp)

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Dave, this is often a result of the corosync resource used to determine the leader dithering across units at deploy time (which I assume is the case for you). A couple of points here: firstly, this problem will simply go away once we have juju leadership election supported in all the charms (since we will no longer rely on a corosync resource). Second, can you try setting the hacluster cluster_count option to the number of units in your mysql cluster? This will prevent corosync from reaching quorum until the specified number of units have joined the cluster. In practice that has drastically reduced the incidence of this kind of issue for me. Hope that helps.
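
For reference, that would be something like the following (juju 1.x syntax; the hacluster application name hacluster-mysql and the count of 3 are only examples, match them to your deployment):

juju set hacluster-mysql cluster_count=3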

Revision history for this message
David Britton (dpb) wrote : Re: [Bug 1452365] Re: not waiting long enough for cluster lead to establish?

I confirmed this is set in this deployment. I'm guessing it's not enough to prevent what I'm running into. :(


Revision history for this message
David Britton (dpb) wrote : Re: not waiting long enough for cluster lead to establish?

juju resolved --retry on the failed units seems to fix it.
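
For example, for each failed unit (the unit name here is illustrative):

juju resolved --retry mysql/1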

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Here are some logs from the failure I just had.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

juju status output

Revision history for this message
Adam Collard (adam-collard) wrote :

unit-hacluster-mysql-2[4474]: 2015-05-12 19:55:29 INFO unit.hacluster-mysql/2.hanode-relation-changed logger.go:40 element rsc_colocation: validity error : IDREF attribute with-rsc references an unknown ID "cl_mysql_monitor"
unit-hacluster-mysql-2[4474]: 2015-05-12 19:55:29 INFO unit.hacluster-mysql/2.hanode-relation-changed logger.go:40 error: main: #011CIB did not pass DTD/schema validation
unit-hacluster-mysql-2[4474]: 2015-05-12 19:55:29 INFO unit.hacluster-mysql/2.hanode-relation-changed logger.go:40 Errors found during check: config not valid
unit-hacluster-mysql-2[4474]: 2015-05-12 19:55:29 INFO unit.hacluster-mysql/2.hanode-relation-changed logger.go:40 INFO: commit forced

Interesting log lines from Andreas' logs above.

tags: added: openstack
summary: - not waiting long enough for cluster lead to establish?
+ Percona cluster doesn't establish, crm errors
Changed in percona-cluster (Juju Charms Collection):
status: New → Confirmed
Revision history for this message
Liam Young (gnuoy) wrote :

I think this is caused by r54 and the introduction of the mysql monitor. The monitor initially fails (I think because of a timeout when the vip is switching around). Corosync appears to be waiting for that monitor to complete before declaring that grp_percona_cluster is running. This wait for the check pushes the is_crm_leader retry past its window.

Below is from a recent deploy which showed this error:

As the stack trace shows the error is coming from is_crm_leader. Which runs

crm resource show grp_percona_cluster

It is run 5 times with a delay to give corosync time to settle. The logs show that
it does attempt 5 times and then the hook errors with the following exception...

From percona/0 juju-log

2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 Traceback (most recent call last):
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/ha-relation-changed", line 488, in <module>
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 main()
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/ha-relation-changed", line 482, in main
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 hooks.execute(sys.argv)
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/charmhelpers/core/hookenv.py", line 557, in execute
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 self._hooks[hook_name]()
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/ha-relation-changed", line 440, in ha_relation_changed
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 if (clustered and is_leader(LEADER_RES)):
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/charmhelpers/contrib/hahelpers/cluster.py", line 127, in is_leader
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 return is_crm_leader(resource)
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/charmhelpers/core/decorators.py", line 42, in _retry_on_exception_inner_2
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 return f(*args, **kwargs)
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 File "/var/lib/juju/agents/unit-percona-0/charm/hooks/charmhelpers/contrib/hahelpers/cluster.py", line 119, in is_crm_leader
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 raise CRMResourceNotFound("CRM resource %s not found" % (resource))
2015-05-14 10:08:13 INFO unit.percona/0.ha-relation-changed logger.go:40 charmhelpers.contrib.hahelpers.cluster.CRMResourceNotFound: CRM resource grp_percona_cluster not found
2015-05-14 10:08:13 ERROR juju.worker.uniter.operation runhook.go:89 hook "ha-relation-changed" failed: exit status 1

crm status and ...


Revision history for this message
Liam Young (gnuoy) wrote :

Unsurprisingly, increasing the retry delay:

=== modified file 'hooks/charmhelpers/contrib/hahelpers/cluster.py'
--- hooks/charmhelpers/contrib/hahelpers/cluster.py 2015-03-03 02:26:12 +0000
+++ hooks/charmhelpers/contrib/hahelpers/cluster.py 2015-05-14 10:39:30 +0000
@@ -95,7 +95,7 @@
     return False

-@retry_on_exception(5, base_delay=2, exc_type=CRMResourceNotFound)
+@retry_on_exception(5, base_delay=8, exc_type=CRMResourceNotFound)
 def is_crm_leader(resource, retry=False):
     """
     Returns True if the charm calling this is the elected corosync leader,

fixed it for me. I'd like to do some more digging on how the monitor works before I propose it as a fix.
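
For context, assuming the decorator sleeps base_delay * attempt between tries (which matches the delay=2,4,6,8,10 sequence in the node 1 log above), this change roughly quadruples the total retry window. A minimal sketch of that arithmetic:

# total back-off across 5 retries with a linearly increasing delay
def total_wait(retries=5, base_delay=2):
    return sum(base_delay * attempt for attempt in range(1, retries + 1))

print(total_wait(base_delay=2))   # 30s  (2+4+6+8+10), the original window
print(total_wait(base_delay=8))   # 120s (8+16+24+32+40), after the change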

Revision history for this message
Felipe Reyes (freyes) wrote :

Regarding whether the monitor interferes with grp_percona_cluster: yes, it does. grp_percona_cluster doesn't have a node to be allocated to until the monitor sets the writable/readable properties[0], and that won't happen until the Percona cluster is functional according to /usr/bin/clustercheck[1].

Considering this is happening because we need is_leader(), I suggest using another approach to decide who the leader is: the current DC. From the Pacemaker documentation[2]:

"The DC (Designated Controller) node is where all the decisions
 are made and if the current DC fails a new one is elected from
 the remaining cluster nodes. The choice of DC is of no
 significance to an administrator beyond the fact that its logs
 will generally be more interesting."

The cluster decides who the DC will be at a very early stage, because it is needed by Pacemaker internals. This should allow us to reduce (if not remove entirely) a large number of the failures due to latency in the cluster's configuration.

$ sudo crm status | grep 'Current DC'
Current DC: trusty-pcmkr-2 (1001) - partition with quorum

[0] http://bazaar.launchpad.net/~openstack-charmers/charms/trusty/percona-cluster/next/view/head:/hooks/percona_hooks.py#L422
[1] http://bazaar.launchpad.net/~openstack-charmers/charms/trusty/percona-cluster/next/view/head:/ocf/percona/mysql_monitor#L526
[2] http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-cluster-status.html
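
A rough sketch of what a DC-based check could look like (illustrative only, not the actual charm-helpers code; the parsing assumes the 'Current DC:' line format shown above):

import socket
import subprocess

def is_crm_dc():
    """Return True if this node is the current Designated Controller (sketch)."""
    status = subprocess.check_output(['crm', 'status']).decode()
    for line in status.splitlines():
        if line.strip().startswith('Current DC:'):
            # e.g. "Current DC: trusty-pcmkr-2 (1001) - partition with quorum"
            return line.split()[2] == socket.gethostname()
    return False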

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I'm trying a deployment with percona-cluster synced with this charm-helpers change that got merged.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

So far, two deploys, two successes.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I had many more successes, but that was just with the charm-helpers change/sync, which can't fix this on its own. So they must have been pure luck.

Changed in percona-cluster (Juju Charms Collection):
status: Confirmed → In Progress
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Can we get a backport into the stable series?

Changed in percona-cluster (Juju Charms Collection):
status: In Progress → Fix Committed
Felipe Reyes (freyes)
Changed in percona-cluster (Juju Charms Collection):
assignee: nobody → Liam Young (gnuoy)
milestone: none → 15.07
Revision history for this message
Billy Olsen (billy-olsen) wrote :

@Andreas - I'm also +1 on a backport, but would like to see it bake in /next just a bit longer before backporting.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Testing a bit more today, we found some issues.

- If the DC is none or an error occurs when trying to query the DC, then the check for the DC simply returns False, whereas it previously would raise an Exception and retry.

- For some reason, it seems that the relation data needed for the resources to be set up properly is not being set in all cases. It may or may not be specifically related to this change, but it is an observation.

I plan on spending a bit more time investigating what is going on and reverting this change if necessary until we can resolve these issues.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

FWIW, the relation data not being set was deemed to be https://bugs.launchpad.net/charm-helpers/+bug/1459175. The offending code has been rolled back.

tags: added: backport-potential
James Page (james-page)
Changed in percona-cluster (Juju Charms Collection):
status: Fix Committed → Fix Released
David Britton (dpb)
tags: added: kanban-cross-team