Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Freshly started joiner gets stuck in joining state

Bug #1396601 reported by Jay Janssen on 2014-11-26

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	New	Undecided	Unassigned

Bug Description

I have an intermittent problem with freshly installed nodes (nodes with a clean datadir and fresh package install via vagrant) that get wedged in a 'Joining' state.

[root@pxc3 mysql]# mysql -e "show global status like 'wsrep%'"
+------------------------------+-------------------------------------------------------+
| Variable_name | Value |
+------------------------------+-------------------------------------------------------+
| wsrep_local_state_uuid | cabef144-756b-11e4-ae70-326c53cc04d5 |
| wsrep_protocol_version | 6 |
| wsrep_last_committed | 12 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 2 |
| wsrep_received_bytes | 276 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 392 |
| wsrep_local_recv_queue_max | 392 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 194.507614 |
| wsrep_local_cached_downto | 18446744073709551615 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 1 |
| wsrep_local_state_comment | Joining |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.000000 |
| wsrep_incoming_addresses | 172.28.128.4:3306,172.28.128.7:3306,172.28.128.3:3306 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 67850d43-756d-11e4-93cc-82163e0cb5e1 |
| wsrep_cluster_conf_id | 7 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | cabef144-756b-11e4-ae70-326c53cc04d5 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_provider_version | 3.8(rf6147dd) |
| wsrep_ready | OFF |
+------------------------------+-------------------------------------------------------+

The node is stuck and I have to kill it. After restarting mysqld, it joins the cluster fine and normally. I cannot reproduce the 'Joining' wedge after the first time unless I rebuild the entire node from scratch.

Oddly, the other nodes see it as a member of the cluster and it receives replication (wsrep_local_recv_queue grows).

[root@pxc3 mysql]# rpm -qa | grep -i percona
percona-toolkit-2.2.11-1.noarch
percona-xtrabackup-2.2.6-5042.el7.x86_64
Percona-XtraDB-Cluster-devel-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-galera-3-3.8-1.3390.rhel7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.21-25.8.938.el7.x86_64

I will attach the error log.

Revision history for this message

Jay Janssen (jay-janssen) wrote on 2014-11-26:

Error log of wedged node Edit (20.7 KiB, text/plain)

Revision history for this message

Jay Janssen (jay-janssen) wrote on 2014-11-26:

my.cnf of wedged node Edit (1010 bytes, text/plain)

Revision history for this message

Teemu Ollakka (teemu-ollakka) wrote on 2014-11-26:

Corresponding Galera ticket: https://github.com/codership/galera/issues/143

Revision history for this message

Jay Janssen (jay-janssen) wrote on 2014-11-26:

Steps to reproduce:

Install virtualbox
Install vagrant
vagrant plugin install vagrant-hostmanager

git clone
cd vagrant-percona
git submodule init; git submodule update
ln -sf Vagrantfile.pxc.rb Vagrantfile

vagrant up

# This will bring up a 3 node cluster by default. Sometimes all three nodes start fine, sometimes they do not. To reproduce the problem, try to repeatedly:

vagrant destroy pxc3 -f && vagrant up pxc3

until the pxc3 node gets errors. The errors (via puppet) will look like this:

==> pxc3: Notice: /Stage[main]/Test::User/Exec[create_test_localhost_user]/returns: ERROR 1047 (08S01) at line 1: Unknown command
==> pxc3: Error: mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'test'@'localhost' IDENTIFIED BY 'test'" returned 1 instead of one of [0]
==> pxc3: Error: /Stage[main]/Test::User/Exec[create_test_localhost_user]/returns: change from notrun to 0 failed: mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'test'@'localhost' IDENTIFIED BY 'test'" returned 1 instead of one of [0]
==> pxc3: Notice: /Stage[main]/Percona::Cluster::Sstuser/Exec[create_sst_user]/returns: ERROR 1047 (08S01) at line 1: Unknown command
==> pxc3: Error: mysql -e "GRANT LOCK TABLES, RELOAD, REPLICATION CLIENT ON *.* TO 'sst'@'localhost' IDENTIFIED BY 'secret'" returned 1 instead of one of [0]
==> pxc3: Error: /Stage[main]/Percona::Cluster::Sstuser/Exec[create_sst_user]/returns: change from notrun to 0 failed: mysql -e "GRANT LOCK TABLES, RELOAD, REPLICATION CLIENT ON *.* TO 'sst'@'localhost' IDENTIFIED BY 'secret'" returned 1 instead of one of [0]
==> pxc3: Notice: /Stage[main]/Test::User/Exec[create_test_global_user]/returns: ERROR 1047 (08S01) at line 1: Unknown command
==> pxc3: Error: mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'test'@'%' IDENTIFIED BY 'test'" returned 1 instead of one of [0]
==> pxc3: Error: /Stage[main]/Test::User/Exec[create_test_global_user]/returns: change from notrun to 0 failed: mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'test'@'%' IDENTIFIED BY 'test'" returned 1 instead of one of [0]
==> pxc3: Notice: /Stage[main]/Percona::Cluster::Clustercheckuser/Exec[create_clustercheck_user]: Dependency Exec[create_sst_user] has failures: true
==> pxc3: Warning: /Stage[main]/Percona::Cluster::Clustercheckuser/Exec[create_clustercheck_user]: Skipping because of failed dependencies
==> pxc3: Info: Creating state file /var/lib/puppet/state/state.yaml

Steps to reproduce:

Install virtualbox
Install vagrant
vagrant plugin install vagrant-hostmanager

git clone 
cd vagrant-percona
git submodule init; git submodule update
ln -sf Vagrantfile.pxc.rb Vagrantfile

vagrant up

# This will bring up a 3 node cluster by default.  Sometimes all three nodes start fine, sometimes they do not.  To reproduce the problem, try to repeatedly:

vagrant destroy pxc3 -f && vagrant up pxc3

until the pxc3 node gets errors.  The errors (via puppet) will look like this:

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2014-11-26:

The issue with https://github.com/codership/galera/issues/143 was restricted to 5.6.20/galera 3.5.

@Jay,

Assuming you see with these packages,
percona-toolkit-2.2.11-1.noarch
percona-xtrabackup-2.2.6-5042.el7.x86_64
Percona-XtraDB-Cluster-devel-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-galera-3-3.8-1.3390.rhel7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.21-25.8.938.el7.x86_64

": Synchronized with group, ready for connections" is missing in your error log as well.

However, in case of gh143, the issue didn't occur when rsync was the SST method. Do you see the same? Also, with xtrabackup SST method + latest xtrabackup it was not reproducible in case of others.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2014-11-26:

Also, this may be soemthing VM specific, since an earlier reported issue on this was not reproducible on bare-metal.

Revision history for this message

David Bennett (dbpercona) wrote on 2014-11-26:

Same problem as 1373796

Note the missing '[Note] WSREP: Synchronized with group, ready for connections' at the end of the wedged node log.

Revision history for this message

Jay Janssen (jay-janssen) wrote on 2014-11-26:

I'm a little skeptical of the duplicate status, since that bug is marked 'Fixed Released' and I'm reproducing on the latest release.

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2014-11-27:

Jay, gdb stack dump from the hanging node would be awesome.

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2014-11-27:

#10

Also, could you try rsync SST instead?

Revision history for this message

Jay Janssen (jay-janssen) wrote on 2014-12-01:

#11

gdb.txt Edit (13.4 KiB, text/plain)

Attaching -- don't have debuginfo installed in my test, but I probably could if you needed it.

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2014-12-03:

#12

Huh, is that all my.cnf? Only one slave thread? Any errors in the error log between JOINED and SYNCED?
Suspicion: two subsequent monitor drains with the same seqno.

Revision history for this message

Jay Janssen (jay-janssen) wrote on 2014-12-03: Re: [Bug 1396601] Freshly started joiner gets stuck in joining state

#13

Download full text (7.3 KiB)

That is the entire my.cnf, yes.

> On Dec 2, 2014, at 8:25 PM, Alex Yurchenko <email address hidden> wrote:
>
> *** This bug is a duplicate of bug 1373796 ***
> https://bugs.launchpad.net/bugs/1373796
>
> Huh, is that all my.cnf? Only one slave thread? Any errors in the error log between JOINED and SYNCED?
> Suspicion: two subsequent monitor drains with the same seqno.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1396601
>
> Title:
> Freshly started joiner gets stuck in joining state
>
> Status in Percona XtraDB Cluster - HA scalable solution for MySQL:
> New
>
> Bug description:
> I have an intermittent problem with freshly installed nodes (nodes
> with a clean datadir and fresh package install via vagrant) that get
> wedged in a 'Joining' state.
>
>
> [root@pxc3 mysql]# mysql -e "show global status like 'wsrep%'"
> +------------------------------+-------------------------------------------------------+
> | Variable_name | Value |
> +------------------------------+-------------------------------------------------------+
> | wsrep_local_state_uuid | cabef144-756b-11e4-ae70-326c53cc04d5 |
> | wsrep_protocol_version | 6 |
> | wsrep_last_committed | 12 |
> | wsrep_replicated | 0 |
> | wsrep_replicated_bytes | 0 |
> | wsrep_repl_keys | 0 |
> | wsrep_repl_keys_bytes | 0 |
> | wsrep_repl_data_bytes | 0 |
> | wsrep_repl_other_bytes | 0 |
> | wsrep_received | 2 |
> | wsrep_received_bytes | 276 |
> | wsrep_local_commits | 0 |
> | wsrep_local_cert_failures | 0 |
> | wsrep_local_replays | 0 |
> | wsrep_local_send_queue | 0 |
> | wsrep_local_send_queue_max | 1 |
> | wsrep_local_send_queue_min | 0 |
> | wsrep_local_send_queue_avg | 0.000000 |
> | wsrep_local_recv_queue | 392 |
> | wsrep_local_recv_queue_max | 392 |
> | wsrep_local_recv_queue_min | 0 |
> | wsrep_local_recv_queue_avg | 194.507614 ...

That is the entire my.cnf, yes.

> On Dec 2, 2014, at 8:25 PM, Alex Yurchenko <1396601@bugs.launchpad.net> wrote:
> 
> *** This bug is a duplicate of bug 1373796 ***
>    https://bugs.launchpad.net/bugs/1373796
> 
> Huh, is that all my.cnf? Only one slave thread? Any errors in the error log between JOINED and SYNCED?
> Suspicion: two subsequent monitor drains with the same seqno.
> 
> -- 
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1396601
> 
> Title:
>  Freshly started joiner gets stuck in joining state
> 
> Status in Percona XtraDB Cluster - HA scalable solution for MySQL:
>  New
> 
> Bug description:
>  I have an intermittent problem with freshly installed nodes (nodes
>  with a clean datadir and fresh package install via vagrant) that get
>  wedged in a 'Joining' state.
> 
> 
>  [root@pxc3 mysql]# mysql -e "show global status like 'wsrep%'"
>  +------------------------------+-------------------------------------------------------+
>  | Variable_name                | Value                                                 |
>  +------------------------------+-------------------------------------------------------+
>  | wsrep_local_state_uuid       | cabef144-756b-11e4-ae70-326c53cc04d5                  |
>  | wsrep_protocol_version       | 6                                                     |
>  | wsrep_last_committed         | 12                                                    |
>  | wsrep_replicated             | 0                                                     |
>  | wsrep_replicated_bytes       | 0                                                     |
>  | wsrep_repl_keys              | 0                                                     |
>  | wsrep_repl_keys_bytes        | 0                                                     |
>  | wsrep_repl_data_bytes        | 0                                                     |
>  | wsrep_repl_other_bytes       | 0                                                     |
>  | wsrep_received               | 2                                                     |
>  | wsrep_received_bytes         | 276                                                   |
>  | wsrep_local_commits          | 0                                                     |
>  | wsrep_local_cert_failures    | 0                                                     |
>  | wsrep_local_replays          | 0                                                     |
>  | wsrep_local_send_queue       | 0                                                     |
>  | wsrep_local_send_queue_max   | 1                                                     |
>  | wsrep_local_send_queue_min   | 0                                                     |
>  | wsrep_local_send_queue_avg   | 0.000000                                              |
>  | wsrep_local_recv_queue       | 392                                                   |
>  | wsrep_local_recv_queue_max   | 392                                                   |
>  | wsrep_local_recv_queue_min   | 0                                                     |
>  | wsrep_local_recv_queue_avg   | 194.507614                                            |
>  | wsrep_local_cached_downto    | 18446744073709551615                                  |
>  | wsrep_flow_control_paused_ns | 0                                                     |
>  | wsrep_flow_control_paused    | 0.000000                                              |
>  | wsrep_flow_control_sent      | 0                                                     |
>  | wsrep_flow_control_recv      | 0                                                     |
>  | wsrep_cert_deps_distance     | 0.000000                                              |
>  | wsrep_apply_oooe             | 0.000000                                              |
>  | wsrep_apply_oool             | 0.000000                                              |
>  | wsrep_apply_window           | 0.000000                                              |
>  | wsrep_commit_oooe            | 0.000000                                              |
>  | wsrep_commit_oool            | 0.000000                                              |
>  | wsrep_commit_window          | 0.000000                                              |
>  | wsrep_local_state            | 1                                                     |
>  | wsrep_local_state_comment    | Joining                                               |
>  | wsrep_cert_index_size        | 0                                                     |
>  | wsrep_causal_reads           | 0                                                     |
>  | wsrep_cert_interval          | 0.000000                                              |
>  | wsrep_incoming_addresses     | 172.28.128.4:3306,172.28.128.7:3306,172.28.128.3:3306 |
>  | wsrep_evs_delayed            |                                                       |
>  | wsrep_evs_evict_list         |                                                       |
>  | wsrep_evs_repl_latency       | 0/0/0/0/0                                             |
>  | wsrep_evs_state              | OPERATIONAL                                           |
>  | wsrep_gcomm_uuid             | 67850d43-756d-11e4-93cc-82163e0cb5e1                  |
>  | wsrep_cluster_conf_id        | 7                                                     |
>  | wsrep_cluster_size           | 3                                                     |
>  | wsrep_cluster_state_uuid     | cabef144-756b-11e4-ae70-326c53cc04d5                  |
>  | wsrep_cluster_status         | Primary                                               |
>  | wsrep_connected              | ON                                                    |
>  | wsrep_local_bf_aborts        | 0                                                     |
>  | wsrep_local_index            | 1                                                     |
>  | wsrep_provider_name          | Galera                                                |
>  | wsrep_provider_vendor        | Codership Oy <info@codership.com>                     |
>  | wsrep_provider_version       | 3.8(rf6147dd)                                         |
>  | wsrep_ready                  | OFF                                                   |
>  +------------------------------+-------------------------------------------------------+
> 
> 
>  The node is stuck and I have to kill it.  After restarting mysqld, it joins the cluster fine and normally.  I cannot reproduce the 'Joining' wedge after the first time unless I rebuild the entire node from scratch.
> 
>  Oddly, the other nodes see it as a member of the cluster and it
>  receives replication (wsrep_local_recv_queue grows).
> 
> 
>  [root@pxc3 mysql]# rpm -qa | grep -i percona
>  percona-toolkit-2.2.11-1.noarch
>  percona-xtrabackup-2.2.6-5042.el7.x86_64
>  Percona-XtraDB-Cluster-devel-56-5.6.21-25.8.938.el7.x86_64
>  Percona-XtraDB-Cluster-shared-56-5.6.21-25.8.938.el7.x86_64
>  Percona-XtraDB-Cluster-client-56-5.6.21-25.8.938.el7.x86_64
>  Percona-XtraDB-Cluster-galera-3-3.8-1.3390.rhel7.x86_64
>  Percona-XtraDB-Cluster-server-56-5.6.21-25.8.938.el7.x86_64
> 
> 
>  I will attach the error log.
> 
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1396601/+subscriptions

Jay Janssen, Managing Consultant, Percona
http://about.me/jay.janssen

Revision history for this message

Marc Castrovinci (marc-castrovinci) wrote on 2015-04-23:

#14

I've still been having this issue with the below packages. It always seemed like an intermittent problem. Recently I was able to track down something interesting. The node gets stuck in 'joining' whenever it gets SST from a node that was started with bootstrap-pxc.

Reproduce failure:

Build and start two servers.
Build a 3rd server and specify the donor to be the node started with bootstrap-pxc
It should get hung on joining

Reproduce success:

Build and start two servers.
Build a 3rd server and specify the donor to be the node started with a standard 'service mysql start'
It should connect fine.

Any thoughts?

percona-xtrabackup_2.2.10-1.precise_amd64.deb
percona-xtradb-cluster-56_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-client-5.6_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-common-5.6_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-server-5.6_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-galera-3.x_3.9.3494.precise_amd64.deb
percona-xtradb-cluster-galera-3_3.9.3494.precise_amd64.deb

Revision history for this message

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-18:

#15

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1774

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.