2 of 3 hacluster-mysql units blocked with "Resource: res_mysql_monitor not running"

Bug #1837401 reported by Jason Hobbs
This bug affects 1 person
Affects                          Status        Importance  Assigned to  Milestone
OpenStack HA Cluster Charm       Invalid       Undecided   Liam Young
OpenStack Percona Cluster Charm  Fix Released  Undecided   James Page

Bug Description

Using the -next charms for an FCB deploy, two of three hacluster-mysql units got stuck in a blocked state like this:

  hacluster-mysql/2 blocked idle 10.244.41.56 Resource: res_mysql_monitor not running
  hacluster-mysql/1 blocked idle 10.244.40.235 Resource: res_mysql_monitor not running
  hacluster-mysql/0* active idle 10.244.40.217 Unit is ready and clustered

tags: added: cdo-release-blocker
James Page (james-page)
Changed in charm-percona-cluster:
status: New → In Progress
assignee: nobody → James Page (james-page)
Revision history for this message
Ryan Beisner (1chb1n) wrote :

For clarity, the percona-cluster (mysql) charm application status is:
    application-status:
      current: active
      message: Unit is ready
      since: 21 Jul 2019 13:40:23Z

It is the hacluster (hacluster-mysql) charm application state that is problematic:
    application-status:
      current: blocked
      message: 'Resource: res_mysql_monitor not running'
      since: 21 Jul 2019 13:28:12Z

This info is in the juju_status.yaml file from the attached juju-crashdump.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

FYI, mysqld is running and listening on all 3 mysql units:

tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 171285/mysqld
tcp 0 0 0.0.0.0:4567 0.0.0.0:* LISTEN 171285/mysqld

tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 91359/mysqld
tcp 0 0 0.0.0.0:4567 0.0.0.0:* LISTEN 91359/mysqld

tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 154422/mysqld
tcp 0 0 0.0.0.0:4567 0.0.0.0:* LISTEN 154422/mysqld

This info is in the attached juju-crashdump.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/672126
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=ccb44ce2bcd23fa0748bb9f51718fc2e3917e565
Submitter: Zuul
Branch: master

commit ccb44ce2bcd23fa0748bb9f51718fc2e3917e565
Author: Chris MacNaughton <email address hidden>
Date: Mon Jul 22 13:21:12 2019 -0500

    Add logging when a resource is not running

    Change-Id: I34ca8068b5e373c6e0374d61656a7e3cb22ce901
    Partial-Bug: #1837401
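
For illustration of the kind of diagnostic this change adds, here is a minimal sketch of logging a resource's location from the charm side, using the stock crm_resource CLI and charm-helpers logging; the function name and log wording are illustrative, not the actual patch:

    # Illustrative sketch, not the charm's actual code: log where pacemaker
    # thinks a resource is running (or that it is not running at all).
    import subprocess

    from charmhelpers.core.hookenv import log, WARNING

    def log_resource_state(resource):
        try:
            out = subprocess.check_output(
                ['crm_resource', '--resource', resource, '--locate'],
                stderr=subprocess.STDOUT).decode()
        except subprocess.CalledProcessError as err:
            out = err.output.decode()
        log('crm_resource --locate for {}: {}'.format(resource, out.strip()),
            level=WARNING)

For example, log_resource_state('res_mysql_monitor') would record output along the lines of "resource res_mysql_monitor is NOT running" in the unit log.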

Revision history for this message
James Page (james-page) wrote :

@jason-hobbs any chance of a repeat test with the improved logging in the hacluster charm?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We had a successful run yesterday; we have more runs coming today.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

It might also be useful to have the charm create the /tmp/mysql_monitor.ocf.ra.debug directory, which mysql_monitor uses for debug output; see:

https://github.com/openstack/charm-percona-cluster/blob/1c4d99123235d9f4334a05988982d0dac54ebbd3/ocf/percona/mysql_monitor#L587-L606

I would use something under /var/log instead of /tmp, though (we could modify the resource agent inline in the charm), because we capture /var/log/ in crashdumps.
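
A rough sketch of that suggestion, assuming the agent is installed at /usr/lib/ocf/resource.d/percona/mysql_monitor and keeps its debug-directory path as a literal string (the paths and the in-place rewrite are assumptions for illustration, not current charm behaviour):

    # Illustrative sketch: point the mysql_monitor OCF agent's debug directory
    # at /var/log (captured by juju-crashdump) and create it so tracing kicks in.
    import os

    RA_PATH = '/usr/lib/ocf/resource.d/percona/mysql_monitor'
    OLD_DEBUG_DIR = '/tmp/mysql_monitor.ocf.ra.debug'
    NEW_DEBUG_DIR = '/var/log/mysql_monitor.ocf.ra.debug'

    def enable_ra_debug():
        with open(RA_PATH) as f:
            agent = f.read()
        with open(RA_PATH, 'w') as f:
            f.write(agent.replace(OLD_DEBUG_DIR, NEW_DEBUG_DIR))
        os.makedirs(NEW_DEBUG_DIR, exist_ok=True)  # the agent only traces if this dir exists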

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I have another environment, which was idle for some time, with the same symptoms and the following message printed in several places in /var/log/pacemaker.log (just as in the pacemaker.log in the original crashdump):

Jul 22 10:43:13 [36479] juju-5e21cd-1-lxd-6 pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for res_mysql_monitor:1 on juju-5e21cd-0-lxd-6: not installed (5)

It also has one unit with the pacemaker resource in the "started" state and a unit with a failed action and the "cl_mysql_monitor" resource in the "stopped" state.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here's a crashdump from a reproduction in our lab, using the newest version of the charms.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Filed a separate bug about increasing Pacemaker verbosity in general: https://bugs.launchpad.net/charm-hacluster/+bug/1837668

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

It looks like the percona charm is asking the hacluster charm, via the relation, to start monitoring the mysql resource *potentially* before the percona cluster has actually formed:

2019-07-23 20:00:32 DEBUG juju-log cluster:31: Notifying peers that percona is bootstrapped (uuid=07928334-ad82-11e9-8e6c-7f00ef700f43)
2019-07-23 20:00:32 DEBUG juju-log cluster:31: Waiting on hacluster to complete clustering, not clustered yet.
2019-07-23 20:00:32 INFO juju-log cluster:31: Installing /usr/lib/ocf/resource.d/percona/mysql_monitor
2019-07-23 20:00:33 DEBUG juju-log cluster:31: VIP HA: VIP is set 192.168.33.11
2019-07-23 20:00:33 DEBUG juju-log cluster:31: Checking for minimum of 3 peer units
2019-07-23 20:00:33 INFO juju-log cluster:31: Sufficient number of peer units to form cluster 3
2019-07-23 20:00:33 DEBUG juju-log cluster:31: mysql/0 is not yet clustered
2019-07-23 20:00:33 DEBUG juju-log cluster:31: Waiting on hacluster to complete clustering, not clustered yet.

The referenced "mysql_monitor" file is dropped at https://github.com/openstack/charm-percona-cluster/blob/master/hooks/percona_hooks.py#L583, and then the hacluster's HA relation is updated.

If the hacluster charm sets the crm resource before percona is clustered, this failure can occur.
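
To make the ordering problem concrete, here is a sketch of the kind of guard that would avoid it on the percona side: only publish the monitor resource on the ha relation once the cluster has formed and the OCF agent file is actually on disk. The helper name, guard condition, and relation payload are simplified assumptions, not the charm's real hook code:

    # Illustrative sketch of the ordering guard: do not hand res_mysql_monitor
    # to hacluster until the agent file exists and percona is clustered.
    import json
    import os

    from charmhelpers.core.hookenv import relation_set

    OCF_AGENT = '/usr/lib/ocf/resource.d/percona/mysql_monitor'

    def publish_monitor_resource(relation_id, clustered):
        if not clustered or not os.path.exists(OCF_AGENT):
            return  # defining the resource now risks a failed monitor op, e.g. "not installed (5)"
        relation_set(
            relation_id=relation_id,
            resources=json.dumps(
                {'res_mysql_monitor': 'ocf:percona:mysql_monitor'}),
            resource_params=json.dumps(
                {'res_mysql_monitor': 'op monitor interval="10s"'}),
        )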

Liam Young (gnuoy)
Changed in charm-hacluster:
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.opendev.org/672492
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=9844f2fd3e1246a53f1088055c4d5f9cea703278
Submitter: Zuul
Branch: master

commit 9844f2fd3e1246a53f1088055c4d5f9cea703278
Author: Liam Young <email address hidden>
Date: Wed Jul 24 10:29:12 2019 +0000

    Never give up on res_mysql_monitor

    Configure pacemaker to never give up on the res_mysql_monitor
    resource and to recheck 5 seconds after a failure. This is
    achieved using migration-threshold and failure-timeout
    options *1.

    *1 https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_failure_response.html

    Change-Id: If19bea77eb5dee9e9eeff105ab98dce1b2de9f74
    Closes-Bug: #1837401
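
For background on the two pacemaker options the commit relies on: migration-threshold=INFINITY stops pacemaker from banning the resource from a node after repeated failures, and failure-timeout=5s expires the fail count so the monitor is retried. A sketch of how such meta attributes might be attached to the resource definition handed to hacluster (the monitor interval and the exact string are assumptions for illustration):

    # Illustrative sketch: resource parameters that tell pacemaker to keep
    # retrying res_mysql_monitor rather than giving up after failures.
    resource_params = {
        'res_mysql_monitor': ('op monitor interval="10s" '
                              'meta migration-threshold="INFINITY" '
                              'failure-timeout="5s"'),
    }
    # migration-threshold=INFINITY: never ban the resource from a node on failure.
    # failure-timeout=5s: let the failure count expire so the monitor is re-run.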

Changed in charm-percona-cluster:
status: In Progress → Fix Committed
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The linked fix hasn't resolved this issue; it was reproduced again today with the fix in place.

Changed in charm-percona-cluster:
status: Fix Committed → Confirmed
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit it again with charm cs:~thedac/percona-cluster-5 in use.

Artifacts, including bundle and crashdump, are here: https://oil-jenkins.canonical.com/artifacts/03b16f47-45c9-49cb-84d8-b27eb84e8f22/index.html

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/673137
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=9391662e19ad3bde997258d9f665d238692c971c
Submitter: Zuul
Branch: master

commit 9391662e19ad3bde997258d9f665d238692c971c
Author: Liam Young <email address hidden>
Date: Sat Jul 27 12:13:16 2019 +0000

    Ensure ocf file is installed

    Ensure ocf file is installed early so that mysql monitor resource can
    start as soon as it is defined. Otherwise it will be down and may
    not come back up without manual intervention.

    Change-Id: I5da6da4ddde25219e06ef47bba5f3004ccf678ee
    Closes-Bug: #1837401
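
A minimal sketch of the "install early" approach described above: copy the agent into the OCF resource directory during install/upgrade, before any ha-relation handling, so pacemaker can start the resource the moment it is defined (the source path follows the ocf/percona/mysql_monitor layout referenced earlier; the function name and hook placement are assumptions):

    # Illustrative sketch: install the mysql_monitor OCF agent up front so the
    # resource can start as soon as hacluster defines it.
    import os
    import shutil

    from charmhelpers.core.hookenv import charm_dir

    OCF_DIR = '/usr/lib/ocf/resource.d/percona'

    def install_mysql_monitor():
        os.makedirs(OCF_DIR, exist_ok=True)
        src = os.path.join(charm_dir(), 'ocf', 'percona', 'mysql_monitor')
        dst = os.path.join(OCF_DIR, 'mysql_monitor')
        shutil.copy(src, dst)
        os.chmod(dst, 0o755)  # OCF agents must be executable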

Changed in charm-percona-cluster:
status: Confirmed → Fix Committed
James Page (james-page)
Changed in charm-percona-cluster:
milestone: none → 19.07
David Ames (thedac)
Changed in charm-percona-cluster:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-percona-cluster (master)

Change abandoned by Chris MacNaughton (icey) (<email address hidden>) on branch: master
Review: https://review.opendev.org/673081
Reason: Abandoned in favor of alternate approach

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

The issue was on the percona-cluster charm; the hacluster charm was correctly reporting the missing resource monitor.

Changed in charm-hacluster:
status: New → Invalid
Revision history for this message
David Ames (thedac) wrote :

This is not quite a duplicate as the underlying problem was different.

For errors like:
crmd[14520]: error: Failed to retrieve meta-data for ocf:percona:mysql_monitor

The fix in bug #1890470 resolves this error.

https://bugs.launchpad.net/charm-hacluster/+bug/1890470
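
For anyone triaging which failure mode they are hitting, one way to check the meta-data path by hand is to invoke the agent's meta-data action directly, which is the call crmd is failing to make in the error above (the agent path and OCF_ROOT are the usual Ubuntu defaults, assumed here):

    # Illustrative sketch: run the agent's meta-data action the way pacemaker
    # would, to distinguish a missing/broken agent from a stopped resource.
    import os
    import subprocess

    env = dict(os.environ, OCF_ROOT='/usr/lib/ocf')
    subprocess.run(
        ['/usr/lib/ocf/resource.d/percona/mysql_monitor', 'meta-data'],
        env=env, check=True)  # prints the agent's XML metadata on success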
