hacluster-mysql unit remains blocked reporting res_mysql_monitor not running

Bug #1890470 reported by John George
This bug affects 1 person
Affects                          Status        Importance  Assigned to  Milestone
OpenStack HA Cluster Charm       Invalid       High        Unassigned   (none)
OpenStack Percona Cluster Charm  Fix Released  High        David Ames   20.08

Bug Description

This solutionsqa openstack-charmers-next release test timed out waiting for one of the three hacluster-mysql units, which remained blocked with the juju status message:
Resource: res_mysql_monitor not running

syslog on the unit (hacluster-mysql/1) shows the following related messages:
Aug 5 07:25:48 juju-1e2915-4-lxd-13 crmd[14520]: error: Failed to retrieve meta-data for ocf:percona:mysql_monitor
Aug 5 07:25:49 juju-1e2915-4-lxd-13 crmd[14520]: notice: Result of probe operation for res_mysql_e9b8d8d_vip on juju-1e2915-4-lxd-13: 7 (not running)

The logs and juju crashdump for this bionic/ussuri bundle are available by following the artifacts link at the bottom of the test run page:

https://solutions.qa.canonical.com/qa/testRun/63a0e101-bb85-4c6a-ae0e-2e56b8a0da29

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Please link the bundle so that we can determine charm config and topology, as well as source/origin version details. Thank you.

Changed in charm-hacluster:
status: New → Incomplete
Changed in charm-hacluster:
status: Incomplete → New
John George (jog)
tags: added: cdo-qa cdo-release-blocker foundations-engine
Revision history for this message
David Ames (thedac) wrote :

Previously, when we have seen CRM fail to set up during deployment, we added the following to the resource definition to make sure it retries:

"meta migration-threshold=INFINITY"

It looks like percona-cluster is in fact doing that [0].

[0] https://github.com/openstack/charm-percona-cluster/blob/master/hooks/percona_hooks.py#L164
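
For illustration, here is a minimal Python sketch of how a charm can attach that meta attribute to the monitor resource it hands to the hacluster subordinate. The parameter values and function name are assumptions for illustration, not the exact percona-cluster code at [0]:

    # Pacemaker resource definition for the percona monitor agent.
    # 'meta migration-threshold=INFINITY' tells Pacemaker to keep retrying
    # the resource on failure instead of giving up after a fixed count,
    # and 'failure-timeout' expires old failures so retries continue.
    RES_MONITOR_PARAMS = ('params user="sstuser" password="%(sstpass)s" '
                          'op monitor interval="1s" timeout="30s" '
                          'meta migration-threshold=INFINITY '
                          'failure-timeout=5s')

    def ha_relation_data(sst_password):
        """Build the resources payload sent over the ha relation."""
        return {
            'resources': {'res_mysql_monitor': 'ocf:percona:mysql_monitor'},
            'resource_params': {
                'res_mysql_monitor': RES_MONITOR_PARAMS % {
                    'sstpass': sst_password},
            },
        }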

Have we seen this more than once? Is it truly a release blocker?

Revision history for this message
John George (jog) wrote :

So far we've only hit it once, but we had not seen this bug prior to the proposed release under test, hence the cdo-release-blocker tag. If further investigation rules it out as an OpenStack issue, we'll be happy to remove the release-blocker tag.

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

The file /usr/lib/ocf/resource.d/percona/mysql_monitor has been installed on the mysql/1 unit:

2020-08-05 07:26:29 INFO juju-log Installing /usr/lib/ocf/resource.d/percona/mysql_monitor
...
2020-08-05 07:36:35 INFO juju-log '/usr/lib/ocf/resource.d/percona/mysql_monitor' already exists, skipping

The content of this file shows a very simple implementation for retrieving the meta-data:
https://opendev.org/openstack/charm-percona-cluster/src/branch/master/ocf/percona/mysql_monitor#L610
https://opendev.org/openstack/charm-percona-cluster/src/branch/master/ocf/percona/mysql_monitor#L91

However, the following crmd error happened 3 times before the file was installed:

Aug 5 07:25:48 juju-1e2915-4-lxd-13 crmd[14520]: error: Failed to retrieve meta-data for ocf:percona:mysql_monitor

So I think it is normal for crmd to complain about not being able to find the meta-data up to the point where the mysql_monitor script gets installed. Still digging.
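
To make the timing concrete, here is a small Python sketch of what crmd effectively does when retrieving meta-data: it executes the agent file itself with the "meta-data" action. The OCF path is the real one; the helper is a hypothetical illustration. The call fails not only while the file is missing, but also (as it turns out later in this bug) when the file is present without the execute bit:

    import subprocess

    AGENT = '/usr/lib/ocf/resource.d/percona/mysql_monitor'

    def fetch_meta_data(agent=AGENT):
        # crmd runs the agent directly, e.g. 'mysql_monitor meta-data'.
        # File not installed yet  -> FileNotFoundError
        # Installed but mode 644  -> PermissionError
        # Either way, Pacemaker logs "Failed to retrieve meta-data".
        return subprocess.run([agent, 'meta-data'],
                              capture_output=True, check=True).stdout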

Changed in charm-hacluster:
status: New → In Progress
importance: Undecided → High
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

/var/log/mysql/error.log shows things going wrong. So we're looking at 3/lxd/7, a.k.a. 192.168.33.35, and this file first shows that mysql successfully starts as member 11a2d158 of a cluster made of 3 members:

2020-08-05T07:29:04.728476Z 0 [Note] WSREP: (573a685c, 'tcp://0.0.0.0:4567') connection established to 11a2d158 tcp://192.168.33.35:4567
...
2020-08-05T07:29:05.228630Z 0 [Note] WSREP: declaring 11a2d158 at tcp://192.168.33.35:4567 stable
2020-08-05T07:29:05.228664Z 0 [Note] WSREP: declaring 516530b4 at tcp://192.168.33.59:4567 stable
2020-08-05T07:29:05.228928Z 0 [Note] WSREP: Node 11a2d158 state primary
2020-08-05T07:29:05.229319Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,11a2d158,3)
memb {
 11a2d158,0
 516530b4,0
 573a685c,0
 }
joined {
 }
left {
 }
partitioned {
 }
)
...
2020-08-05T07:29:05.728504Z 0 [Note] WSREP: Quorum results:
 version = 4,
 component = PRIMARY,
 conf_id = 2,
 members = 1/3 (primary/total),
 act_id = 2,
 last_appl. = -1,
 protocols = 0/7/3 (gcs/repl/appl),
 group UUID = 79180de3-d6eb-11ea-902c-42e22f62df67
...
2020-08-05T07:29:05.728627Z 2 [Note] WSREP: New cluster view: global state: 79180de3-d6eb-11ea-902c-42e22f62df67:2, view# 3: Primary, number of nodes: 3, my index: 2, protocol version 3

Then things go wrong:
2020-08-05T07:29:07.464959Z 2 [Note] WSREP: Failed to prepare for incremental state transfer: Local state UUID (f7a54291-d6ec-11ea-be12-b2c563befc83) does not match group state UUID (79180de3-d6eb-11ea-902c-42e22f62df67): 1 (Operation not permitted)
  at galera/src/replicator_str.cpp:prepare_for_IST():538. IST will be unavailable.
2020-08-05T07:29:07.468408Z 0 [Warning] WSREP: Member 2.0 (juju-1e2915-4-lxd-13) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2020-08-05T07:29:07.468605Z 2 [Note] WSREP: Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s)
2020-08-05T07:29:08.228163Z 0 [Note] WSREP: (573a685c, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-08-05T07:29:08.474006Z 0 [Warning] WSREP: Member 2.0 (juju-1e2915-4-lxd-13) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2020-08-05T07:29:08.810129Z 0 [Warning] WSREP: Could not find peer:
2020-08-05T07:29:08.810164Z 0 [Note] WSREP: 0.0 (juju-1e2915-3-lxd-7): State transfer to -1.-1 (left the group) complete.

Still digging.

Changed in charm-hacluster:
status: In Progress → Triaged
importance: High → Medium
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Actually, it recovers later:

2020-08-05T07:30:57.894347Z 0 [Note] WSREP: declaring 73c5323a at tcp://192.168.33.35:4567 stable
2020-08-05T07:30:57.894383Z 0 [Note] WSREP: declaring 9a61e375 at tcp://192.168.33.59:4567 stable
2020-08-05T07:30:57.894777Z 0 [Note] WSREP: Node 73c5323a state primary
2020-08-05T07:30:57.895210Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,73c5323a,9)
memb {
 73c5323a,0
 91adc7f9,0
 9a61e375,0
 }
joined {
 }
left {
 }
partitioned {
 }
)

but they all seem to be considered primary, which makes no sense:

2020-08-05T07:30:58.394324Z 0 [Note] WSREP: Quorum results:
 version = 4,
 component = PRIMARY,
 conf_id = 8,
 members = 3/3 (primary/total),
 act_id = 2,
 last_appl. = 0,
 protocols = 0/7/3 (gcs/repl/appl),
 group UUID = 79180de3-d6eb-11ea-902c-42e22f62df67

Changed in charm-hacluster:
status: Triaged → In Progress
importance: Medium → High
assignee: nobody → Aurelien Lourot (aurelien-lourot)
Changed in charm-hacluster:
status: In Progress → Triaged
assignee: Aurelien Lourot (aurelien-lourot) → nobody
David Ames (thedac)
Changed in charm-hacluster:
assignee: nobody → David Ames (thedac)
Changed in charm-percona-cluster:
status: New → Triaged
importance: Undecided → High
assignee: nobody → David Ames (thedac)
milestone: none → 20.08
Changed in charm-hacluster:
status: Triaged → Invalid
assignee: David Ames (thedac) → nobody
Revision history for this message
David Ames (thedac) wrote :

I believe I have root-caused the problem, thanks to many hints from Aurelien and a subtle hint from a Google search [0].

The key to the problem is the error:
"Aug 5 07:24:10 juju-1e2915-4-lxd-13 crmd[14520]: error: Failed to retrieve meta-data for ocf:percona:mysql_monitor"

This error is quite different from an actual mysql check failing, or from the file being missing entirely.

From [0] came the suggestion: "Could it be as simple as [the ocf file's execute] bit [not being] set (for the user)?"

Yes. Yes it can. Percona copies the mysql_monitor file early but does not set the execute bit [1].

I'm doing some more verification, then a fix will be on the way.

[0] https://lists.clusterlabs.org/pipermail/users/2018-September/024866.html
[1] https://opendev.org/openstack/charm-percona-cluster/src/branch/master/hooks/percona_utils.py#L441

Revision history for this message
David Ames (thedac) wrote :

I can re-create the problem by setting the mysql_monitor OCF file's permissions to 644.

I do not know what might be changing the permissions, as the source file that gets copied has the correct permissions.

This change [0] will add several gratuitous chmods to guarantee the mysql_monitor OCF file is executable.

[0] https://review.opendev.org/745202
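
A rough Python sketch of the shape of that fix (the paths match the OCF layout discussed above; the function name is an assumption, and the merged change at [0] is authoritative):

    import os
    import shutil

    SRC = 'ocf/percona/mysql_monitor'  # file shipped inside the charm
    DST = '/usr/lib/ocf/resource.d/percona/mysql_monitor'

    def install_mysql_monitor(src=SRC, dst=DST):
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        if not os.path.exists(dst):
            shutil.copy(src, dst)
        # The "gratuitous" chmod: force rwxr-xr-x even if the file was
        # already in place with mode 644, so Pacemaker can always
        # execute the agent.
        os.chmod(dst, 0o755)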

David Ames (thedac)
Changed in charm-percona-cluster:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.opendev.org/745202
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=7eddd6074d83b61e28c4fd325cde32e919331243
Submitter: Zuul
Branch: master

commit 7eddd6074d83b61e28c4fd325cde32e919331243
Author: David Ames <email address hidden>
Date: Thu Aug 6 11:52:05 2020 -0700

    Guarantee executable OCF mysql_monitor file

    Closes-Bug: #1890470
    Change-Id: Iee0aa8b1ae0f364d01d30a61381af222a264b090

Changed in charm-percona-cluster:
status: In Progress → Fix Committed
Revision history for this message
David Ames (thedac) wrote :

For completeness, this bug is related to LP Bug#1837401 but not quite a duplicate.

The resolution in Bug#1837401 handles cases where the mysql_monitor file is missing entirely. This bug handles the case where the file has incorrect permissions.

https://bugs.launchpad.net/charm-percona-cluster/+bug/1837401
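
A minimal Python check distinguishing the two failure modes (illustrative only, not charm code):

    import os

    AGENT = '/usr/lib/ocf/resource.d/percona/mysql_monitor'

    def agent_state(path=AGENT):
        if not os.path.exists(path):
            return 'missing'         # the Bug #1837401 case
        if not os.access(path, os.X_OK):
            return 'not executable'  # this bug's case
        return 'ok'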

Changed in charm-percona-cluster:
status: Fix Committed → Fix Released