Freshly started joiner gets stuck in joining state

Bug #1396601 reported by Jay Janssen on 2014-11-26
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
New
Undecided
Unassigned

Bug Description

I have an intermittent problem with freshly installed nodes (nodes with a clean datadir and fresh package install via vagrant) that get wedged in a 'Joining' state.

[root@pxc3 mysql]# mysql -e "show global status like 'wsrep%'"
+------------------------------+-------------------------------------------------------+
| Variable_name | Value |
+------------------------------+-------------------------------------------------------+
| wsrep_local_state_uuid | cabef144-756b-11e4-ae70-326c53cc04d5 |
| wsrep_protocol_version | 6 |
| wsrep_last_committed | 12 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 2 |
| wsrep_received_bytes | 276 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 392 |
| wsrep_local_recv_queue_max | 392 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 194.507614 |
| wsrep_local_cached_downto | 18446744073709551615 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 1 |
| wsrep_local_state_comment | Joining |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0.000000 |
| wsrep_incoming_addresses | 172.28.128.4:3306,172.28.128.7:3306,172.28.128.3:3306 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 67850d43-756d-11e4-93cc-82163e0cb5e1 |
| wsrep_cluster_conf_id | 7 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | cabef144-756b-11e4-ae70-326c53cc04d5 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_provider_version | 3.8(rf6147dd) |
| wsrep_ready | OFF |
+------------------------------+-------------------------------------------------------+

The node is stuck and I have to kill it. After restarting mysqld, it joins the cluster fine and normally. I cannot reproduce the 'Joining' wedge after the first time unless I rebuild the entire node from scratch.

Oddly, the other nodes see it as a member of the cluster and it receives replication (wsrep_local_recv_queue grows).

[root@pxc3 mysql]# rpm -qa | grep -i percona
percona-toolkit-2.2.11-1.noarch
percona-xtrabackup-2.2.6-5042.el7.x86_64
Percona-XtraDB-Cluster-devel-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-galera-3-3.8-1.3390.rhel7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.21-25.8.938.el7.x86_64

I will attach the error log.

Jay Janssen (jay-janssen) wrote :
Jay Janssen (jay-janssen) wrote :
Teemu Ollakka (teemu-ollakka) wrote :
Jay Janssen (jay-janssen) wrote :

Steps to reproduce:

Install virtualbox
Install vagrant
vagrant plugin install vagrant-hostmanager

git clone
cd vagrant-percona
git submodule init; git submodule update
ln -sf Vagrantfile.pxc.rb Vagrantfile

vagrant up

# This will bring up a 3 node cluster by default. Sometimes all three nodes start fine, sometimes they do not. To reproduce the problem, try to repeatedly:

vagrant destroy pxc3 -f && vagrant up pxc3

until the pxc3 node gets errors. The errors (via puppet) will look like this:

==> pxc3: Notice: /Stage[main]/Test::User/Exec[create_test_localhost_user]/returns: ERROR 1047 (08S01) at line 1: Unknown command
==> pxc3: Error: mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'test'@'localhost' IDENTIFIED BY 'test'" returned 1 instead of one of [0]
==> pxc3: Error: /Stage[main]/Test::User/Exec[create_test_localhost_user]/returns: change from notrun to 0 failed: mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'test'@'localhost' IDENTIFIED BY 'test'" returned 1 instead of one of [0]
==> pxc3: Notice: /Stage[main]/Percona::Cluster::Sstuser/Exec[create_sst_user]/returns: ERROR 1047 (08S01) at line 1: Unknown command
==> pxc3: Error: mysql -e "GRANT LOCK TABLES, RELOAD, REPLICATION CLIENT ON *.* TO 'sst'@'localhost' IDENTIFIED BY 'secret'" returned 1 instead of one of [0]
==> pxc3: Error: /Stage[main]/Percona::Cluster::Sstuser/Exec[create_sst_user]/returns: change from notrun to 0 failed: mysql -e "GRANT LOCK TABLES, RELOAD, REPLICATION CLIENT ON *.* TO 'sst'@'localhost' IDENTIFIED BY 'secret'" returned 1 instead of one of [0]
==> pxc3: Notice: /Stage[main]/Test::User/Exec[create_test_global_user]/returns: ERROR 1047 (08S01) at line 1: Unknown command
==> pxc3: Error: mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'test'@'%' IDENTIFIED BY 'test'" returned 1 instead of one of [0]
==> pxc3: Error: /Stage[main]/Test::User/Exec[create_test_global_user]/returns: change from notrun to 0 failed: mysql -e "GRANT ALL PRIVILEGES ON *.* TO 'test'@'%' IDENTIFIED BY 'test'" returned 1 instead of one of [0]
==> pxc3: Notice: /Stage[main]/Percona::Cluster::Clustercheckuser/Exec[create_clustercheck_user]: Dependency Exec[create_sst_user] has failures: true
==> pxc3: Warning: /Stage[main]/Percona::Cluster::Clustercheckuser/Exec[create_clustercheck_user]: Skipping because of failed dependencies
==> pxc3: Info: Creating state file /var/lib/puppet/state/state.yaml

The issue with https://github.com/codership/galera/issues/143 was restricted to 5.6.20/galera 3.5.

@Jay,

Assuming you see with these packages,
 percona-toolkit-2.2.11-1.noarch
percona-xtrabackup-2.2.6-5042.el7.x86_64
Percona-XtraDB-Cluster-devel-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.21-25.8.938.el7.x86_64
Percona-XtraDB-Cluster-galera-3-3.8-1.3390.rhel7.x86_64
Percona-XtraDB-Cluster-server-56-5.6.21-25.8.938.el7.x86_64

": Synchronized with group, ready for connections" is missing in your error log as well.

However, in case of gh143, the issue didn't occur when rsync was the SST method. Do you see the same? Also, with xtrabackup SST method + latest xtrabackup it was not reproducible in case of others.

Also, this may be soemthing VM specific, since an earlier reported issue on this was not reproducible on bare-metal.

David Bennett (dbpercona) wrote :

Same problem as 1373796

Note the missing '[Note] WSREP: Synchronized with group, ready for connections' at the end of the wedged node log.

Jay Janssen (jay-janssen) wrote :

I'm a little skeptical of the duplicate status, since that bug is marked 'Fixed Released' and I'm reproducing on the latest release.

Alex Yurchenko (ayurchen) wrote :

Jay, gdb stack dump from the hanging node would be awesome.

Alex Yurchenko (ayurchen) wrote :

Also, could you try rsync SST instead?

Jay Janssen (jay-janssen) wrote :

Attaching -- don't have debuginfo installed in my test, but I probably could if you needed it.

Alex Yurchenko (ayurchen) wrote :

Huh, is that all my.cnf? Only one slave thread? Any errors in the error log between JOINED and SYNCED?
Suspicion: two subsequent monitor drains with the same seqno.

Download full text (7.3 KiB)

That is the entire my.cnf, yes.

> On Dec 2, 2014, at 8:25 PM, Alex Yurchenko <email address hidden> wrote:
>
> *** This bug is a duplicate of bug 1373796 ***
> https://bugs.launchpad.net/bugs/1373796
>
> Huh, is that all my.cnf? Only one slave thread? Any errors in the error log between JOINED and SYNCED?
> Suspicion: two subsequent monitor drains with the same seqno.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1396601
>
> Title:
> Freshly started joiner gets stuck in joining state
>
> Status in Percona XtraDB Cluster - HA scalable solution for MySQL:
> New
>
> Bug description:
> I have an intermittent problem with freshly installed nodes (nodes
> with a clean datadir and fresh package install via vagrant) that get
> wedged in a 'Joining' state.
>
>
> [root@pxc3 mysql]# mysql -e "show global status like 'wsrep%'"
> +------------------------------+-------------------------------------------------------+
> | Variable_name | Value |
> +------------------------------+-------------------------------------------------------+
> | wsrep_local_state_uuid | cabef144-756b-11e4-ae70-326c53cc04d5 |
> | wsrep_protocol_version | 6 |
> | wsrep_last_committed | 12 |
> | wsrep_replicated | 0 |
> | wsrep_replicated_bytes | 0 |
> | wsrep_repl_keys | 0 |
> | wsrep_repl_keys_bytes | 0 |
> | wsrep_repl_data_bytes | 0 |
> | wsrep_repl_other_bytes | 0 |
> | wsrep_received | 2 |
> | wsrep_received_bytes | 276 |
> | wsrep_local_commits | 0 |
> | wsrep_local_cert_failures | 0 |
> | wsrep_local_replays | 0 |
> | wsrep_local_send_queue | 0 |
> | wsrep_local_send_queue_max | 1 |
> | wsrep_local_send_queue_min | 0 |
> | wsrep_local_send_queue_avg | 0.000000 |
> | wsrep_local_recv_queue | 392 |
> | wsrep_local_recv_queue_max | 392 |
> | wsrep_local_recv_queue_min | 0 |
> | wsrep_local_recv_queue_avg | 194.507614 ...

Read more...

I've still been having this issue with the below packages. It always seemed like an intermittent problem. Recently I was able to track down something interesting. The node gets stuck in 'joining' whenever it gets SST from a node that was started with bootstrap-pxc.

Reproduce failure:

Build and start two servers.
Build a 3rd server and specify the donor to be the node started with bootstrap-pxc
It should get hung on joining

Reproduce success:

Build and start two servers.
Build a 3rd server and specify the donor to be the node started with a standard 'service mysql start'
It should connect fine.

Any thoughts?

percona-xtrabackup_2.2.10-1.precise_amd64.deb
percona-xtradb-cluster-56_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-client-5.6_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-common-5.6_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-server-5.6_5.6.22-25.8-978.precise_amd64.deb
percona-xtradb-cluster-galera-3.x_3.9.3494.precise_amd64.deb
percona-xtradb-cluster-galera-3_3.9.3494.precise_amd64.deb

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1774

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers