settting wsrep_provider=none hangs

Bug #1208493 reported by Seppo Jaakola
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Fix Committed
Low
Seppo Jaakola
5.5
Fix Committed
Low
Seppo Jaakola
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.5
Fix Released
Undecided
Unassigned
5.6
Fix Released
Undecided
Unassigned

Bug Description

Setting wsrep_provider=none command hangs in MariaDB build.

Problem seems to be that closing slave applier thread does not broadcast COND_thread_count. Strange thing is that MySQL version works well, although the relevant code path seems to be identical.

Changed in codership-mysql:
status: New → In Progress
importance: Undecided → Low
assignee: nobody → Seppo Jaakola (seppo-jaakola)
milestone: none → 5.5.32-23.7.6
Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Is it because the mariadb code has added logic in thd kill?

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Seppo, isn't it a duplicate of https://bugs.launchpad.net/codership-mysql/+bug/1136571 ? I have encountered that at least once. Maybe this fix is also due for MySQL-wsrep?

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Fix, that solves replication stop problem in MariaDB: http://bazaar.launchpad.net/~codership/codership-mysql/5.5-23/revision/3897

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Fix, that solves hanging when killed client connection tries to access LOCK_global_system_variables when replication stopping is progressing: http://bazaar.launchpad.net/~codership/codership-mysql/5.5-23/revision/3898

This resource conflict was hurting both MySQL and MariaDB versions. Also hanging in lp:1136571 seems to be due to this deadlock.

Note that this fix introduces a potential race condition, which will surface if there are several connections issuing wsrep_provider=none.

Changed in percona-xtradb-cluster:
milestone: none → 5.5.33-23.7.6
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :
Download full text (39.6 KiB)

It still seems to hang when

'set global wsrep_provider=none' is done.

=================================================
sudo /pxc/bin/mysqld --defaults-file=/pxc/etc/my.cnf.local --basedir=/pxc --user=mysql --wsrep-cluster-address="gcomm://?pc.ignore_sb=true" --wsrep-start-position='dd00df13-e87a-11e2-0800-5cbe45ea7ee5:38570'
[sudo] password for raghavendra:
Sorry, try again.
[sudo] password for raghavendra:
130816 6:44:07 [Warning] WSREP: wsrep_sst_receive_address is set to '127.0.0.1:4001' which makes it impossible for another host to reach this one. Please set it to the address which this node can be connected at by other cluster members.
130816 6:44:07 [Note] WSREP: wsrep_start_position var submitted: 'dd00df13-e87a-11e2-0800-5cbe45ea7ee5:38570'
130816 6:44:07 [Note] WSREP: Read nil XID from storage engines, skipping position init
130816 6:44:07 [Note] WSREP: wsrep_load(): loading provider library '/pxc/lib/libgalera_smm.so'
130816 6:44:07 [Note] WSREP: wsrep_load(): Galera 2.6(r300) by Codership Oy <email address hidden> loaded succesfully.
130816 6:44:07 [Note] WSREP: Found saved state: dd00df13-e87a-11e2-0800-5cbe45ea7ee5:51367
130816 6:44:07 [Note] WSREP: Reusing existing '/pxc/datadir//galera.cache'.
130816 6:44:07 [Note] WSREP: Passing config to GCS: base_host = 127.0.0.1; base_port = 4567; cert.log_conflicts = no; gcache.dir = /pxc/datadir/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /pxc/datadir//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; gmcast.listen_addr = tcp://127.0.0.1:4010; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
130816 6:44:07 [Note] WSREP: Assign initial position for certification: 51367, protocol version: -1
130816 6:44:07 [Note] WSREP: wsrep_sst_grab()
130816 6:44:07 [Note] WSREP: Start replication
130816 6:44:07 [Note] WSREP: Setting initial position to dd00df13-e87a-11e2-0800-5cbe45ea7ee5:51367
130816 6:44:07 [Note] WSREP: protonet asio version 0
130816 6:44:07 [Note] WSREP: backend: asio
130816 6:44:07 [Note] WSREP: GMCast version 0
130816 6:44:07 [Note] WSREP: (260c3139-0611-11e3-9874-ebebb5b53a4b, 'tcp://127.0.0.1:4010') listening at tcp://127.0.0.1:4010
130816 6:44:07 [Note] WSREP: (260c3139-0611-11e3-9874-ebebb5b53a4b, 'tcp://127.0.0.1:4010') multicast: , ttl: 1
130816 6:44:07 [Note] WSREP: EVS version 0
130816 6:44:07 [Note] WSREP: PC version 0
130816 6:44:07 [Note] WSREP: gcomm: connecting to group 'Archie', peer ''
130816 6:44:07 [Note] WSREP: Node 260c3139-0611-11e3-9874-ebebb5b53a4b state prim
130816 6:44:07 [Note] WSREP: view(view_id(PRIM,260c3139-0611-11e3-9874-ebebb5b53a4b,1) memb {
        260c3139-0611-11e3-9874-ebebb5b53a4b,
} joined {
} left {
} partitioned {
})
130816 6:44:07 [Note] WSREP: gcomm: connected
130816 6:44:07 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
130816 6:44:07 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
130816 6:44:07 [Note] WSRE...

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Full backtrace from a hung server.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

However, sending SIGQUIT explicitly (with Ctrl-\) makes it shutdown, so this is fine I guess.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :
Download full text (4.5 KiB)

Ignore #6, and #7, they seem to comply with behavior in #8,

however, setting wsrep_provider to galera library doesn't go down well.

=============================
Version: '5.5.33-23.7.5-debug' socket: '/pxc/datadir/pxc.sock' port: 4000 Percona XtraDB Cluster (GPL) 5.5.33-23.7.5, Revision 465, wsrep_23.7.5.r465
130816 7:30:52 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130816 7:30:52 [Note] WSREP: Assign initial position for certification: 38570, protocol version: 2
130816 7:30:52 [Note] WSREP: Synchronized with group, ready for connections
130816 7:30:52 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130816 7:30:54 [Note] WSREP: Stop replication
130816 7:30:54 [Note] WSREP: Closing send monitor...
130816 7:30:54 [Note] WSREP: Closed send monitor.
130816 7:30:54 [Note] WSREP: gcomm: terminating thread
130816 7:30:54 [Note] WSREP: gcomm: joining thread
130816 7:30:54 [Note] WSREP: gcomm: closing backend
130816 7:30:54 [Note] WSREP: view((empty))
130816 7:30:54 [Note] WSREP: Received self-leave message.
130816 7:30:54 [Note] WSREP: gcomm: closed
130816 7:30:54 [Note] WSREP: Flow-control interval: [0, 0]
130816 7:30:54 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130816 7:30:54 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 38570)
130816 7:30:54 [Note] WSREP: RECV thread exiting 0: Success
130816 7:30:54 [Note] WSREP: New cluster view: global state: dd00df13-e87a-11e2-0800-5cbe45ea7ee5:38570, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
130816 7:30:54 [Note] WSREP: recv_thread() joined.
130816 7:30:54 [Note] WSREP: Closing replication queue.
130816 7:30:54 [Note] WSREP: Closing slave action queue.
130816 7:30:54 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130816 7:30:54 [Note] WSREP: applier thread exiting (code:0)
130816 7:30:54 [Note] WSREP: applier thread exiting (code:5)
130816 7:30:56 [Note] WSREP: rollbacker thread exiting
130816 7:30:56 [Note] WSREP: dtor state: CLOSED
130816 7:30:56 [Note] WSREP: apply mon: entered 0
130816 7:30:56 [Note] WSREP: apply mon: entered 0
130816 7:30:56 [Note] WSREP: mon: entered 3 oooe fraction 0 oool fraction 0
130816 7:30:56 [Note] WSREP: cert index usage at exit 0
130816 7:30:56 [Note] WSREP: cert trx map usage at exit 0
130816 7:30:56 [Note] WSREP: deps set usage at exit 0
130816 7:30:56 [Note] WSREP: avg deps dist 0
130816 7:30:56 [Note] WSREP: wsdb trx map usage 0 conn query map usage 0
130816 7:30:56 [Note] WSREP: Shifting CLOSED -> DESTROYED (TO: 38570)
130816 7:30:56 [Note] WSREP: Flushing memory map to disk...
130816 7:30:56 [Note] WSREP: Initial position: dd00df13-e87a-11e2-0800-5cbe45ea7ee5:38570
130816 7:30:56 [Note] WSREP: wsrep_load(): loading provider library 'none'
130816 7:30:56 [ERROR] WSREP: Failed to get provider options
130816 7:31:10 [Note] WSREP: Stop replication
130816 7:31:12 [Note] WSREP: Initial position: dd00df13-e87a-11e2-0800-5cbe45ea7ee5:38570
130816 7:31:12 [Note] WSREP: wsrep_load(): loading provider library '/pxc/lib/libgalera_smm.so'
130816 7:31:12 [Note] WSREP: wsrep_load(): Galera 2.6(r300) by Codership Oy <info...

Read more...

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Backtrace for #9.

Changed in percona-xtradb-cluster:
status: New → In Progress
Changed in codership-mysql:
milestone: 5.5.33-23.7.6 → none
Changed in percona-xtradb-cluster:
milestone: 5.5.33-23.7.6 → future-5.5
Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Raghu, setting provider to 'none' and back is well defined. And according to your description it seemed to have been a success. You seem to have forgotten to connect to or bootstrap a primary component. "Unknown command" is a typical response to a non-primary component.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :
Download full text (4.2 KiB)

@Alex,

Ah yes, that could be the reason.
But I tried to bootstrap it, it doesn't work.

""
SET GLOBAL wsrep_provider_options="pc.bootstrap=1";
ERROR 1210 (HY000): Incorrect arguments to SET
""

This may be because after setting the provider again (from
'none') I see following in error log (and in variables)

131212 15:51:50 [Note] WSREP: Passing config to GCS: base_host = 127.0.0.1; base_port = 4567; cert.log_conflicts = no; gcache.dir = /pxc/datadir/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /pxc/datadir//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.proto_max = 5

There are two issues here:

a) The config set in my.cnf is lost, it is not refreshed to
provider.

b) There are no pc, protonet, ist etc. parameters above. So they
are also lost. Hence bootstrapping is not possible.

Full log:
=================================================================================

131212 15:51:48 [Note] WSREP: wsrep_provider_update: /pxc56/lib/libgalera_smm.so
131212 15:51:48 [Note] WSREP: Stop replication
131212 15:51:48 [Note] WSREP: Provider disconnect
131212 15:51:50 [Note] WSREP: waiting for client connections to close: 1
131212 15:51:50 [Note] WSREP: Setting wsrep_ready to 0
131212 15:51:50 [Note] WSREP: Read WSREPXid from InnoDB: 5805bbc4-3038-11e3-937d-aa946afe7370:15933
131212 15:51:50 [Note] WSREP: Initial position: 5805bbc4-3038-11e3-937d-aa946afe7370:15933
131212 15:51:50 [Note] WSREP: wsrep_load(): loading provider library '/pxc56/lib/libgalera_smm.so'
131212 15:51:50 [Note] WSREP: wsrep_load(): Galera 3.2(rXXXX) by Codership Oy <email address hidden> loaded successfully.
131212 15:51:50 [Note] WSREP: CRC-32C: using hardware acceleration.
131212 15:51:50 [Note] WSREP: Found saved state: 5805bbc4-3038-11e3-937d-aa946afe7370:15933
131212 15:51:50 [Note] WSREP: Passing config to GCS: base_host = 127.0.0.1; base_port = 4567; cert.log_conflicts = no; gcache.dir = /pxc/datadir/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /pxc/datadir//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.proto_max = 5
131212 15:51:50 [Note] WSREP: Assign initial position for certification: 15933, protocol version: -1
131212 15:51:50 [Note] WSREP: wsrep_cluster_address_init: null -> null
131212 15:51:50 [Note] WSREP: refresh_provider_options: null
131212 15:53:20 [Warning] WSREP: Unknown parameter 'pc.bootstrap'
131212 15:53:20 [ERROR] WSREP: Set options returned 7
131212 15:53:20 [Note] WSREP: refresh_provider_options: pc.bootstrap=1;
131212 15:53:22 [Warning] WSREP: Unk...

Read more...

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

The " 131212 15:51:41 [ERROR] WSREP: Failed to get provider
options" may be normal for 'none' provider. So, that is fine.

Now, regarding partial restore of options, ie. without gmcast,
pc, etc. may be due to

char* wsrep_get_params(const galera::Replicator& repl)
{
    std::ostringstream os;
    os << repl.params();
    return strdup(os.str().c_str());
}

a) Only a few parameters are restored.

b) The ones restored are different from the ones set earlier in
my.cnf (I tested this by adding repl.commit_order in cnf).

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Since this issue deals with hanging related to 'none' I will
report #12, #13 separately.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Reported provider issues here https://bugs.launchpad.net/codership-mysql/+bug/1260283

As far as the main issue of this bug is concerned, setting provider to 'none' didn't hang for me. (though didn't test under high load).

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Tested with a heavy sysbench workload + setting provider=none, there was no hang (in 5.5).

131212 16:41:05 [Note] WSREP: wsrep_provider_update: none
131212 16:41:05 [Note] WSREP: Stop replication
131212 16:41:05 [Note] WSREP: Provider disconnect
131212 16:41:05 [Note] WSREP: Closing send monitor...
131212 16:41:05 [Note] WSREP: Closed send monitor.
131212 16:41:05 [Note] WSREP: gcomm: terminating thread
131212 16:41:05 [Note] WSREP: gcomm: joining thread
131212 16:41:05 [Note] WSREP: gcomm: closing backend
131212 16:41:05 [Note] WSREP: view((empty))
131212 16:41:05 [Note] WSREP: Received self-leave message.
131212 16:41:05 [Note] WSREP: gcomm: closed
131212 16:41:05 [Note] WSREP: Flow-control interval: [0, 0]
131212 16:41:05 [Note] WSREP: Received SELF-LEAVE. Closing connection.
131212 16:41:05 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 18490)
131212 16:41:05 [Note] WSREP: RECV thread exiting 0: Success
131212 16:41:05 [Note] WSREP: recv_thread() joined.
131212 16:41:05 [Note] WSREP: Closing replication queue.
131212 16:41:05 [Note] WSREP: Closing slave action queue.
131212 16:41:05 [Note] WSREP: closing connection 34
131212 16:41:05 [Note] WSREP: closing connection 33
131212 16:41:05 [Note] WSREP: closing connection 32
131212 16:41:05 [Note] WSREP: closing connection 31
131212 16:41:05 [Note] WSREP: closing connection 30
131212 16:41:05 [Note] WSREP: closing connection 29
131212 16:41:05 [Note] WSREP: closing connection 28
131212 16:41:05 [Note] WSREP: closing connection 27
131212 16:41:05 [Note] WSREP: closing connection 26
131212 16:41:05 [Note] WSREP: closing connection 25
131212 16:41:05 [Note] WSREP: closing connection 24
131212 16:41:05 [Note] WSREP: aborting replicated trx: 140488473822976
131212 16:41:05 [Note] WSREP: wsrep_abort_thd, by: 140488473822976, victim: 140488473822976
131212 16:41:05 [Note] WSREP: abort transaction: BF: COMMIT victim: COMMIT
131212 16:41:05 [Note] WSREP: cluster conflict due to high priority abort for threads:
131212 16:41:05 [Note] WSREP: Winning thread:
   THD: 23, mode: local, state: committing, conflict: no conflict, seqno: 18488
   SQL: COMMIT
131212 16:41:05 [Note] WSREP: Victim thread:
   THD: 23, mode: local, state: committing, conflict: no conflict, seqno: 18488
   SQL: COMMIT
1
............................
.............................

Full log http://paste.wnohang.net/a2c974

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Fix https://bazaar.launchpad.net/~codership/codership-mysql/5.5-23/revision/3898 is not in 5.6 tree yet. The other fixes are there (in different location).

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Raghu,

Original config is lost because tachnically we don't know if it is suitable any more for a new provider. Maybe you want to reconfigure the whole thing. So the proper way to load the provider in run time should be:

1) SET GLOBAL wsrep_provider_options='...';
2) SET GLOBAL wsrep_provider='...';
3) SET GLOBAL wsrep_cluster_address='...'; (or bootstrap if applicable)

- but there is a bug that prevents proper setting of wsrep_provider_options in runtime: https://bugs.launchpad.net/codership-mysql/5.6/+bug/1260290

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Setting wsrep_provider_options='none' hangs for me alright - on idle node. It is still there.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Alex,

@18,

Do we need to provide full wsrep_provider_options in #1? or just the ones provided in my.cnf. Also, what explains error while setting pc.bootstrap to 1? (since pc itself is missing from provider options along with others).

@19,

Do you mean wsrep_provider='none'?

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Raghu,

@18: you need to set only the options you want, same as with my.cnf

@19: indeed

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Merged the fix to release LOCK_global_system_variables when calling wsrep_stop_replication, also into wsrep-5.6.
It turned out, with new regression test, that 5.6 version is also vulnerable.
Note that wsrep_stop_replication is called both when wsrep_provider or wsrep_cluster_address is changed, so had to do the mutex release in both variable update handlers.

Fix in wsrep-5.6 is in revision: http://bazaar.launchpad.net/~codership/codership-mysql/5.6/revision/4031

wsrep-5.5 does not have protection in wsrep_cluster_address variable's update handler. So this poart of fix needs to merged in wsrep-5.5 as well.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Merged the fix for wsrep_cluster_address update into wsrep-5.5, and added two more fixes for wsrep_provider changes:
* counting applier threads by wsrep_running_threads variable, which is accessed under LOCK_thread_count mutex
    There was a race condition, when appliers were counted by scanning threads, which did not hold LOCK_thread_count
* avoiding to put wsrep applier threads back to thread cache. Appliers' closing does not currently send correctly
    COND_thread_cache signal, and applier might hang when appending to thread cache

These fixes were pushed in revision: http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/3936
and later cleanups in following revisions #3937 and #3938

Changed in codership-mysql:
status: In Progress → Fix Committed
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1411

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.