WSREP has not yet prepared node for application use

Bug #1742683 reported by Jason Hobbs
Affects: OpenStack Percona Cluster Charm
Status: Fix Released
Importance: High
Assigned to: David Ames
Milestone: 18.02

Bug Description

A percona cluster unit in my deploy is failing db-admin-relation-changed with this error:

http://paste.ubuntu.com/26366038/

Here is the bundle: http://paste.ubuntu.com/26366049/

I'm seeing this with cs:~openstack-charmers-next/percona-cluster-279 and juju 2.3.1.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
tags: added: foundations-engine
removed: cpe-foundation
Revision history for this message
Ryan Beisner (1chb1n) wrote :

This appears to be a race with a telegraf subordinate:

<snip>
    units:
      mysql/0:
        workload-status:
          current: error
          message: 'hook failed: "db-admin-relation-changed" for telegraf:mysql'
</snip>

I think these existing telegraf and juju core bugs are likely related:

https://bugs.launchpad.net/telegraf-charm/+bug/1718259
https://bugs.launchpad.net/charm-percona-cluster/+bug/1717590

Revision history for this message
Ryan Beisner (1chb1n) wrote :
Revision history for this message
Liam Young (gnuoy) wrote :

Looking into this now

Changed in charm-percona-cluster:
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

One thing that would be useful to know is whether there is a way to recover when we hit this without doing a full redeploy.

Revision history for this message
Liam Young (gnuoy) wrote :

While trying to reproduce this bug I hit a new bug, Bug #1744961.

I think this bug is probably due to the db-* relations not having the same "if not seeded()" check, but I'd like to be able to reproduce it first.
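
For context, a minimal sketch of the kind of gate being described, assuming seeded() checks a marker file written once the node has been bootstrapped (helper names here are illustrative, not the charm's actual code):

    # Illustrative sketch only; helper names and the marker path are
    # assumptions, not the charm's actual implementation.
    import os

    from charmhelpers.core import hookenv

    # Marker file assumed to be written once this unit has seeded the DB.
    SEEDED_MARKER = '/var/lib/mysql/seeded'

    def seeded():
        """Return True once this unit has finished seeding the database."""
        return os.path.exists(SEEDED_MARKER)

    def db_changed():
        # Without a gate like this, the hook goes on to open a DB connection
        # while Galera can still reply with error 1047 ("WSREP has not yet
        # prepared node for application use").
        if not seeded():
            hookenv.log('Not yet seeded; deferring db relation handling.')
            return
        # ... create databases, users and grants here ...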

Revision history for this message
Chris Gregan (cgregan) wrote :

Escalated to Field-high.

Revision history for this message
Liam Young (gnuoy) wrote :

Reading around, it seems that "WSREP has not yet prepared node for application use" may indicate that another node in the cluster exited ungracefully. Sure enough, the supplied crash dump shows:

grep -E "WSREP: caught exception in PC, s|Abort" mysql_*/var/log/mysql/error*
mysql_0/var/log/mysql/error.log:2018-01-11 09:17:39 93347 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
mysql_1/var/log/mysql/error.log:2018-01-11 09:17:39 67643 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
mysql_1/var/log/mysql/error.log:2018-01-11 09:17:39 67643 [ERROR] Aborting

$ grep "WSREP has not yet prepared node for application use" mysql_0/var/log/juju/unit-mysql-0.log | head -1
2018-01-11 09:27:16 DEBUG db-admin-relation-changed _mysql_exceptions.OperationalError: (1047, 'WSREP has not yet prepared node for application use')

So far the closest thing I can find is https://jira.percona.com/browse/PXC-1352, but that claims to be fixed in 5.6.15-25.4 and this deploy has 5.6.37-82.2.
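
For reference, error 1047 is what Galera returns while the local node is not synced, and the wsrep status variables make that state visible. A quick check (a sketch, not part of the charm; connection details are placeholders) looks like:

    # Sketch: inspect Galera status variables to see whether this node is
    # ready for application traffic. Connection details are placeholders.
    import MySQLdb

    conn = MySQLdb.connect(host='localhost', user='root', passwd='secret')
    cur = conn.cursor()
    cur.execute("SHOW STATUS LIKE 'wsrep_ready'")
    print(cur.fetchone())   # ('wsrep_ready', 'ON') when the node is usable
    cur.execute("SHOW STATUS LIKE 'wsrep_cluster_status'")
    print(cur.fetchone())   # ('wsrep_cluster_status', 'Primary')
    # If wsrep_ready is OFF (for example after a peer aborted and the node
    # dropped out of the primary component), ordinary statements fail with
    # error 1047.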

James Page (james-page)
Changed in charm-percona-cluster:
importance: Undecided → High
status: New → In Progress
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this again on commit bd5474c.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.openstack.org/538979
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=fac45afc6057a2fd18ddb59423965e521e1963e8
Submitter: Zuul
Branch: master

commit fac45afc6057a2fd18ddb59423965e521e1963e8
Author: David Ames <email address hidden>
Date: Mon Jan 29 09:05:03 2018 -0800

    Gate db{-admin} relations until cluster is ready

    Percona-cluster was responding to db and db-admin relations before it
    was ready. This led to the error: "WSREP has not yet prepared node for
    application use."

    This change applies the same gating the shared-db relation already has
    to the db and db-admin relations. It also condenses code used in both
    instances.

    This change guarantees the rendered configuration will not
    auto-bootstrap for non-leaders. This addresses Bug 1738896.

    Closes-Bug: #1742683
    Closes-Bug: #1738896
    Change-Id: If525595fd109e6a738071a3f016b9c2eabec529e
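
Roughly, the "cluster is ready" gate the commit describes amounts to the leader refusing to answer db/db-admin requests until enough peers report clustered. A sketch under assumed helper names (not the merged code):

    # Sketch of a "cluster ready" gate; helper names are assumptions and
    # this is not the merged implementation.
    from charmhelpers.core import hookenv

    def expected_units():
        # min-cluster-size is assumed here to carry the expected unit count.
        return hookenv.config('min-cluster-size') or 1

    def cluster_ready():
        """True when enough peers on the cluster relation report clustered."""
        clustered = 0
        for rid in hookenv.relation_ids('cluster'):
            units = hookenv.related_units(rid) + [hookenv.local_unit()]
            for unit in units:
                if hookenv.relation_get('clustered', unit=unit, rid=rid):
                    clustered += 1
        return clustered >= expected_units()

    def db_admin_changed():
        if not hookenv.is_leader() or not cluster_ready():
            hookenv.log('Deferring db-admin request until cluster is ready.')
            return
        # ... safe to create users and grants now ...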

Changed in charm-percona-cluster:
status: In Progress → Fix Committed
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Changed in charm-percona-cluster:
status: Fix Committed → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this again on 295b05e.

Changed in charm-percona-cluster:
status: New → Fix Committed
status: Fix Committed → New
Revision history for this message
David Ames (thedac) wrote :

Analysis

For the current crashdump:
mysql/0* running db-admin-relation-changed 192.168.33.145
mysql/1 running leader-settings-changed 192.168.33.151
mysql/2 running leader-settings-changed 192.168.33.177

The gating from the above change is doing what it says on the tin: db-admin-relation-changed does not run until the cluster reports as up. And yet we still see the 'WSREP has not yet prepared node for application use' error.

The error does NOT correlate with a restart of any of the nodes.

The nodes are actively attempting to communicate with each other but experiencing a number of timeouts.

The leader node mysql/0 is running db-admin-relation-changed when this occurs.

2018-02-01 22:58:32 DEBUG juju-log db-admin:152: All 3 percona units reporting clustered
2018-02-01 22:58:33 DEBUG juju-log db-admin:152: Opening db connection for root@localhost
2018-02-01 22:58:34 DEBUG db-admin-relation-changed Traceback (most recent call last):
2018-02-01 22:58:34 DEBUG db-admin-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/db-admin-relation-changed", line 833, in <module>
2018-02-01 22:58:34 DEBUG db-admin-relation-changed main()
2018-02-01 22:58:34 DEBUG db-admin-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/db-admin-relation-changed", line 823, in main
2018-02-01 22:58:34 DEBUG db-admin-relation-changed hooks.execute(sys.argv)
2018-02-01 22:58:34 DEBUG db-admin-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/charmhelpers/core/hookenv.py", line 800, in execute
2018-02-01 22:58:34 DEBUG db-admin-relation-changed self._hooks[hook_name]()
2018-02-01 22:58:34 DEBUG db-admin-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/db-admin-relation-changed", line 500, in db_changed
2018-02-01 22:58:34 DEBUG db-admin-relation-changed password = db_helper.configure_db(addr, db_name, username, admin=admin)
2018-02-01 22:58:34 DEBUG db-admin-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/charmhelpers/contrib/database/mysql.py", line 411, in configure_db
2018-02-01 22:58:34 DEBUG db-admin-relation-changed self.create_admin_grant(username, remote_ip, password)
2018-02-01 22:58:34 DEBUG db-admin-relation-changed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/charmhelpers/contrib/database/mysql.py", line 145, in create_admin_grant
2018-02-01 22:58:34 DEBUG db-admin-relation-changed password))
2018-02-01 22:58:34 DEBUG db-admin-relation-changed File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 226, in execute
2018-02-01 22:58:34 DEBUG db-admin-rel...

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1742683] Re: WSREP has not yet prepared node for application use

I checked DNS reverse lookups and network health; it all looks fine:

http://paste.ubuntu.com/26503247/

On Thu, Feb 1, 2018 at 6:12 PM, David Ames <email address hidden> wrote:
> <snip - quoted text of David Ames's analysis comment above>

Revision history for this message
David Ames (thedac) wrote :

Currently investigating setting gmcast.peer_timeout.

 http://galeracluster.com/documentation-webpages/galeraparameters.html#gmcast-peer-timeout

This version has peer_timeout hard coded to PT15S: cs:~thedac/percona-cluster-4. If that appears to help in resource-constrained environments, we can make it configurable.
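
For illustration, the Galera side of this is just an extra key in wsrep_provider_options; a small sketch (the PT15S value is only an example) of how the rendered my.cnf line is composed:

    # Sketch: compose the wsrep_provider_options value with a peer timeout.
    # PT15S is an ISO 8601 duration; Galera's documented default is PT3S.
    peer_timeout = 'PT15S'

    provider_options = ['gmcast.peer_timeout={}'.format(peer_timeout)]

    # The rendered my.cnf line (under [mysqld]) would look like:
    #   wsrep_provider_options="gmcast.peer_timeout=PT15S"
    print('wsrep_provider_options="{}"'.format(';'.join(provider_options)))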

Changed in charm-percona-cluster:
assignee: Liam Young (gnuoy) → David Ames (thedac)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-percona-cluster (master)

Fix proposed to branch: master
Review: https://review.openstack.org/541499

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-percona-cluster (master)

Reviewed: https://review.openstack.org/541499
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=c82ca2be29ad3cbebc57248c87255351aad3db98
Submitter: Zuul
Branch: master

commit c82ca2be29ad3cbebc57248c87255351aad3db98
Author: David Ames <email address hidden>
Date: Tue Feb 6 15:24:25 2018 -0800

    Allow setting gmcast.peer_timeout value

    In some resource-constrained environments, particularly during deploy
    time, percona-cluster nodes can experience timeouts during inter-node
    communication.

    This change makes gmcast.peer_timeout configurable, based on the
    Galera cluster documentation:
    http://galeracluster.com/documentation-webpages/galeraparameters.html

    Warning: Changing this value from the default may have unintended
    consequences. This should only be used when constraints call for it.

    Closes-Bug: #1742683
    Change-Id: If93d6ba9d0e99b6af59358a7f3ea40e3aa2a6dbc

Changed in charm-percona-cluster:
status: In Progress → Fix Committed
James Page (james-page)
Changed in charm-percona-cluster:
milestone: none → 18.02
Ryan Beisner (1chb1n)
Changed in charm-percona-cluster:
status: Fix Committed → Fix Released