2 of 3 hacluster-mysql units blocked with "Resource: res_mysql_monitor not running"

Bug #1837401 reported by Jason Hobbs on 2019-07-22
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack hacluster charm
Undecided
Liam Young
OpenStack percona-cluster charm
Undecided
James Page

Bug Description

Using the -next charms for an FCB deploy, two of three hacluster-mysql units got stuck in blocked state like this:

  hacluster-mysql/2 blocked idle 10.244.41.56 Resource: res_mysql_monitor not running
  hacluster-mysql/1 blocked idle 10.244.40.235 Resource: res_mysql_monitor not running
  hacluster-mysql/0* active idle 10.244.40.217 Unit is ready and clustered

Jason Hobbs (jason-hobbs) wrote :
Jason Hobbs (jason-hobbs) wrote :
tags: added: cdo-release-blocker
James Page (james-page) on 2019-07-22
Changed in charm-percona-cluster:
status: New → In Progress
assignee: nobody → James Page (james-page)
Ryan Beisner (1chb1n) wrote :

For clarity, the percona-cluster (mysql) charm application status is:
    application-status:
      current: active
      message: Unit is ready
      since: 21 Jul 2019 13:40:23Z

It is the hacluster (hacluster-mysql) charm application state that is problematic:
    application-status:
      current: blocked
      message: 'Resource: res_mysql_monitor not running'
      since: 21 Jul 2019 13:28:12Z

This info is in the attached juju-crashdump juju_status.yaml file.

Ryan Beisner (1chb1n) wrote :

FYI, mysqld is running and listening on all 3 mysql units:

tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 171285/mysqld
tcp 0 0 0.0.0.0:4567 0.0.0.0:* LISTEN 171285/mysqld

tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 91359/mysqld
tcp 0 0 0.0.0.0:4567 0.0.0.0:* LISTEN 91359/mysqld

tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 154422/mysqld
tcp 0 0 0.0.0.0:4567 0.0.0.0:* LISTEN 154422/mysqld

This info is in the attached juju-crashdump.

Reviewed: https://review.opendev.org/672126
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=ccb44ce2bcd23fa0748bb9f51718fc2e3917e565
Submitter: Zuul
Branch: master

commit ccb44ce2bcd23fa0748bb9f51718fc2e3917e565
Author: Chris MacNaughton <email address hidden>
Date: Mon Jul 22 13:21:12 2019 -0500

    Add logging when a resource is not running

    Change-Id: I34ca8068b5e373c6e0374d61656a7e3cb22ce901
    Partial-Bug: #1837401

James Page (james-page) wrote :

@jason-hobbs any chance of a repeat test with the improved logging in hacluster charm?

Jason Hobbs (jason-hobbs) wrote :

We had a successful run yesterday, we have more runs coming today

Dmitrii Shcherbakov (dmitriis) wrote :

It might also be useful to create /tmp/mysql_monitor.ocf.ra.debug dir in the charm which is used by mysql_monitor, see:

https://github.com/openstack/charm-percona-cluster/blob/1c4d99123235d9f4334a05988982d0dac54ebbd3/ocf/percona/mysql_monitor#L587-L606

I would use something under /var/log though instead of /tmp (we could modify the resource inline in the charm) because we capture /var/log/ in crashdumps.

Dmitrii Shcherbakov (dmitriis) wrote :

I have another environment which was idle for some time with the same symptoms and the following message in /var/log/pacemaker.log printed in several places (just like in pacemaker.log in the original crashdump):

Jul 22 10:43:13 [36479] juju-5e21cd-1-lxd-6 pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for res_mysql_monitor:1 on juju-5e21cd-0-lxd-6: not installed (5)

It also has one unit with the pacemaker resource in the "started" state and a unit which has a failed action and the "cl_mysql_monitor" resource in the "stopped' state.

Jason Hobbs (jason-hobbs) wrote :

Here's a crashdump from a reproduction in our lab, using the newest version of the charms

Dmitrii Shcherbakov (dmitriis) wrote :

Filed a separate bug about increasing Pacemaker verbosity in general: https://bugs.launchpad.net/charm-hacluster/+bug/1837668

It looks like the percona charm is asking the hacluster charm, via relation, to start monitoring the mysql resource *potentially* before the actual percona cluster is a cluster:

2019-07-23 20:00:32 DEBUG juju-log cluster:31: Notifying peers that percona is bootstrapped (uuid=07928334-ad82-11e9-8e6c-7f00ef700f43)
2019-07-23 20:00:32 DEBUG juju-log cluster:31: Waiting on hacluster to complete clustering, not clustered yet.
2019-07-23 20:00:32 INFO juju-log cluster:31: Installing /usr/lib/ocf/resource.d/percona/mysql_monitor
2019-07-23 20:00:33 DEBUG juju-log cluster:31: VIP HA: VIP is set 192.168.33.11
2019-07-23 20:00:33 DEBUG juju-log cluster:31: Checking for minimum of 3 peer units
2019-07-23 20:00:33 INFO juju-log cluster:31: Sufficient number of peer units to form cluster 3
2019-07-23 20:00:33 DEBUG juju-log cluster:31: mysql/0 is not yet clustered
2019-07-23 20:00:33 DEBUG juju-log cluster:31: Waiting on hacluster to complete clustering, not clustered yet.

The referenced "mysql_monitor" file is dropped at https://github.com/openstack/charm-percona-cluster/blob/master/hooks/percona_hooks.py#L583 , and then the hacluster's HA relation is updated.

If the hacluster charm sets the crm resource before percona is clustered, this failure can occur.

Liam Young (gnuoy) on 2019-07-24
Changed in charm-hacluster:
assignee: nobody → Liam Young (gnuoy)

Reviewed: https://review.opendev.org/672492
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=9844f2fd3e1246a53f1088055c4d5f9cea703278
Submitter: Zuul
Branch: master

commit 9844f2fd3e1246a53f1088055c4d5f9cea703278
Author: Liam Young <email address hidden>
Date: Wed Jul 24 10:29:12 2019 +0000

    Never give up on res_mysql_monitor

    Configure pacemaker to never give up on the res_mysql_monitor
    resource and to recheck 5 seconds after a failure. This is
    achieved using migration-threshold and failure-timeout
    options *1.

    *1 https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_failure_response.html

    Change-Id: If19bea77eb5dee9e9eeff105ab98dce1b2de9f74
    Closes-Bug: #1837401

Changed in charm-percona-cluster:
status: In Progress → Fix Committed

The linked fix hasn't resolved this issue as it was reproduced today with the fix

Changed in charm-percona-cluster:
status: Fix Committed → Confirmed
Jason Hobbs (jason-hobbs) wrote :

We hit it again with charm: cs:~thedac/percona-cluster-5 being used.

Artifacts, including bundle and crashdump, are here: https://oil-jenkins.canonical.com/artifacts/03b16f47-45c9-49cb-84d8-b27eb84e8f22/index.html

Reviewed: https://review.opendev.org/673137
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=9391662e19ad3bde997258d9f665d238692c971c
Submitter: Zuul
Branch: master

commit 9391662e19ad3bde997258d9f665d238692c971c
Author: Liam Young <email address hidden>
Date: Sat Jul 27 12:13:16 2019 +0000

    Ensure ocf file is installed

    Ensure ocf file is installed early so that mysql monitor resource can
    start as soon as it is defined. Otherwise it will be down and may
    not come backup without manual intervention.

    Change-Id: I5da6da4ddde25219e06ef47bba5f3004ccf678ee
    Closes-Bug: #1837401

Changed in charm-percona-cluster:
status: Confirmed → Fix Committed
James Page (james-page) on 2019-08-07
Changed in charm-percona-cluster:
milestone: none → 19.07
David Ames (thedac) on 2019-08-12
Changed in charm-percona-cluster:
status: Fix Committed → Fix Released

Change abandoned by Chris MacNaughton (icey) (<email address hidden>) on branch: master
Review: https://review.opendev.org/673081
Reason: Abandoned in favor of alternate approach

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers