PXC donor node got stuck during SST (using thread pool plugin)

Bug #1206565 reported by Kenny Gryp
This bug affects 2 people
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: Fix Released
Importance: Undecided
Assigned to: Raghavendra D Prabhu
Milestone: 5.5.32-23.7.6

Bug Description

- a PXC cluster with 1 active node (just bootstrapped)
- another node is SST'ing with xtrabackup

- Suddenly (I think just after the backup had been taken), the donor node locks up and you get:

130730 6:03:10 [Note] WSREP: Provider paused at a6280df4-f907-11e2-b79a-3fa52bb11ac5:261230
130730 6:04:40 [ERROR] Threadpool could not create additional thread to handle queries, because the number of allowed threads was reached. Increasing 'thread_pool_max_threads' parameter can help in this situation.
 If 'extra_port' parameter is set, you can still connect to the database with superuser account (it must be TCP connection using extra_port as TCP port) and troubleshoot the situation. A likely cause of pool blocks are clients that lock resources for long time. 'show
processlist' or 'show engine innodb status' can give additional hints.
130730 6:04:40 [Note] Threadpool has been blocked for 30 seconds

- A few minutes later, the innobackupex prepare on the JOINER was finished

- I got TOO MANY CONNECTIONS on the donor, and --extra-port wasn't set
- I tried changing max_connections on the fly using Aurimas's method (http://www.mysqlperformanceblog.com/2010/03/23/too-many-connections-no-problem/), but it did not work (see the gdb sketch further down)

- poor man's profiler output:
    458 pthread_cond_timedwait@@GLIBC_2.3.2,MDL_wait::timed_wait(THD*,,MDL_context::acquire_lock(MDL_request*,,open_table(THD*,,open_tables(THD*,,mysql_update(THD*,,mysql_execute_command(THD*),mysql_parse(THD*,,??,dispatch_command(enum_server_command,,do_command(THD*),threadpool_process_request(THD*),??,start_thread,clone,??
     40 epoll_wait,io_poll_wait(int,,??,get_event(worker_thread_t*,,??,start_thread,clone,??
     10 ??,??,??,??,start_thread,clone,??
      8 pthread_cond_wait@@GLIBC_2.3.2,gu_fifo_get_head,gcs_recv,galera::GcsActionSource::process(void*),galera::ReplicatorSMM::async_recv(void*),galera_recv,wsrep_replication_process(THD*),start_wsrep_THD,start_thread,clone,??
      3 pthread_cond_timedwait@@GLIBC_2.3.2,??,??,start_thread,clone,??
      2 pthread_cond_wait@@GLIBC_2.3.2,??,??,start_thread,clone,??
      2 pthread_cond_timedwait@@GLIBC_2.3.2,MDL_wait::timed_wait(THD*,,MDL_context::acquire_lock(MDL_request*,,open_table(THD*,,open_tables(THD*,,open_and_lock_tables(THD*,,mysql_insert(THD*,,mysql_execute_command(THD*),mysql_parse(THD*,,??,dispatch_command(enum_server_command,,do_command(THD*),threadpool_process_request(THD*),??,start_thread,clone,??
      1 select,??,??,start_thread,clone,??
      1 read,_IO_file_underflow,_IO_default_uflow,_IO_getline_info,fgets,??,??,start_thread,clone,??
      1 pthread_cond_wait@@GLIBC_2.3.2,wsrep_rollback_process(THD*),start_wsrep_THD,start_thread,clone,??
      1 pthread_cond_wait@@GLIBC_2.3.2,galera::ServiceThd::thd_func(void*),start_thread,clone,??
      1 pthread_cond_wait@@GLIBC_2.3.2,??,gcs_core_recv,??,start_thread,clone,??
      1 pthread_cond_timedwait@@GLIBC_2.3.2,??,start_thread,clone,??
      1 poll,handle_connections_sockets(),mysqld_main(int,,__libc_start_main,_start
      1 epoll_wait,??,asio::detail::task_io_service::do_one(asio::detail::scoped_lock<asio::detail::posix_mutex>&,,??,gcomm::AsioProtonet::event_loop(gu::datetime::Period,GCommConn::run(),GCommConn::run_fn(void*),start_thread,clone,??
      1 do_sigwait,sigwait,signal_hand,start_thread,clone,??
      1
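
Output in this form typically comes from the poor man's profiler (pt-pmp in Percona Toolkit is a packaged version of the same idea). A minimal sketch of the classic one-liner, assuming mysqld is resolvable via pidof and gdb is installed:

    # Dump all thread stacks once, keep only the function name of each frame,
    # and aggregate identical stacks by count (most frequent first).
    gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p $(pidof mysqld) | \
    awk 'BEGIN { s = "" }
         /^Thread/ { print s; s = "" }
         /^#/      { if (s != "") { s = s "," $4 } else { s = $4 } }
         END       { print s }' | \
    sort | uniq -c | sort -rn

The first /^Thread/ match prints an empty accumulator, which is likely why a bare count with no stack (the trailing "1" above) can show up in the output.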

In order to resolve this, I killed all the SST processes and MySQL processes and started MySQL again.
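
For reference, the method from that blog post amounts to attaching gdb to the running mysqld and raising the max_connections global in place. A minimal sketch, assuming a default pid file location (use pidof mysqld if the path differs on your installation):

    # Bump max_connections on a live server without restarting it.
    # It presumably did not help here because the thread pool workers were
    # blocked on metadata locks rather than the server merely being out of
    # connection slots.
    gdb -p $(cat /var/run/mysqld/mysqld.pid) -ex "set max_connections=5000" -batch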

Raghu mentioned it could be related to some FTWRL (FLUSH TABLES WITH READ LOCK)...

Revision history for this message
Kenny Gryp (gryp) wrote :

Will ask the customer to add --extra-port and get SHOW PROCESSLIST and 'SHOW ENGINE INNODB STATUS' if it happens again.
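
For the record, once extra_port is set, those diagnostics can be collected even while the regular port is saturated, because connections on the extra port are handled outside the thread pool (the error message above spells out that it must be a TCP connection with a superuser account). A minimal sketch, with an assumed port number (any free port works):

    # Must be a TCP connection to the extra port, as a superuser account.
    mysql --host=127.0.0.1 --port=33066 --protocol=TCP -u root -p \
          -e "SHOW PROCESSLIST; SHOW ENGINE INNODB STATUS\G"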

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Kenny,

a) Need error logs of donor and joiner.

b) innobackupex logs too

c) When the PMP was taken (presumably on the donor), was it taken while the provider was paused (and not yet resumed), or was it taken later? Provide the timestamp of the PMP if possible.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

So, the solution to this is to add an SST option to use extra_port and use that extra port (after detecting that the thread pool plugin is in use) with innobackupex, like:

innobackupex --host --port ..

This will make innobackupex use TCP on extra_port.

Note that innobackupex uses the mysql connection only for obtaining status variables, so this shouldn't last long or cause any issues.

Moreover, the extra_port is specifically meant for maintenance and/or admin tasks like backups.

Also, it would be recommended to set extra_port to 2 or more (instead of the default 1), so that while one connection is being used for the backup, another can be used for logging in through the client (if necessary).
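
A rough sketch of what that could look like on the donor; the port number, user, and backup directory are placeholders, and the actual SST-script support for this is what this bug tracks:

    # my.cnf on the donor: reserve an admin/maintenance port alongside the
    # thread pool, and allow more than one connection on it (this is the
    # extra_max_connections setting clarified in the next comment).
    [mysqld]
    extra_port            = 33066
    extra_max_connections = 2

    # Point innobackupex at the extra port over TCP instead of the saturated
    # regular port.
    innobackupex --host=127.0.0.1 --port=33066 --user=sst --password=... /backups/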

Changed in percona-xtradb-cluster:
milestone: none → 5.5.32-23.7.6
status: New → Triaged
assignee: nobody → Raghavendra D Prabhu (raghavendra-prabhu)
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

"Also, it would be recommended to have extra_port set to 2 or more (than default 1), so that while one is being used for backup, the other can be used for logging through client (if necessary)."

That was supposed to say extra_max_connections, not extra_port.

Revision history for this message
Kenny Gryp (gryp) wrote :

The PMP was taken later on, after the database had been stuck for over an hour.

DONOR:
The SST started around 4:09. However, the logs have since been removed from the server, so I cannot access them anymore.
Then nothing was logged for a long time, and at the moment the database got stuck (confirmed by the customer), we got these two errors:

130730 6:03:10 [Note] WSREP: Provider paused at a6280df4-f907-11e2-b79a-3fa52bb11ac5:261230
130730 6:04:40 [ERROR] Threadpool could not create additional thread to handle queries, because the number of allowed threads was reached. Increasing 'thread_pool_max_threads' parameter can help in this situation.
 If 'extra_port' parameter is set, you can still connect to the database with superuser account (it must be TCP connection using extra_port as TCP port) and troubleshoot the situation. A likely cause of pool blocks are clients that lock resources for long time. 'show
processlist' or 'show engine innodb status' can give additional hints.
130730 6:04:40 [Note] Threadpool has been blocked for 30 seconds

JOINER:
130730 04:09:52 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
130730 04:09:52 mysqld_safe Skipping wsrep-recover for empty datadir: /var/lib/mysql
130730 04:09:52 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
130730 4:09:52 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
130730 4:09:52 [ERROR] An old style --language value with language specific part detected: /usr/share/mysql/english/
130730 4:09:52 [ERROR] Use --lc-messages-dir without language specific part instead.
130730 4:09:52 [Note] WSREP: Read nil XID from storage engines, skipping position init
130730 4:09:52 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
130730 4:09:52 [Note] WSREP: wsrep_load(): Galera 2.5(r150) by Codership Oy <email address hidden> loaded succesfully.
130730 4:09:52 [Note] WSREP: Found saved state: cfa1bbe6-ad61-11e2-0800-ddbec9296193:1136915979
130730 4:09:52 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
130730 4:09:52 [Note] WSREP: Passing config to GCS: base_host = 192.168.20.122; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 2G; gcs.fc_debug = 0; gcs.fc_factor = 0.99; gcs.fc_limit = 256; gcs.fc_master_slave = yes; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
130730 4:09:52 [Note] WSREP: Assign initial position for certification: 1136915979, protocol version: -1
130730 4:09:52 [Note] WSREP: wsrep_sst_grab()
130730 4:09:52 [Note] WSREP: Start replication
130730 4:09:52 [Note] WSREP: Setting initial position to cfa1bbe6-ad61-11e2-0800-ddbec9296193:1136915979
130730 4:09:52 [Note] WSREP: protonet asio version 0
130730 4:09:52 [Note] WSREP: backend: asio
130730 4:09:52 [Note] WSREP: GMCast version 0
130730 4:09:52 [Note] WSREP: (8e9245e4-f908-11e2-0800-...

Changed in percona-xtradb-cluster:
status: Triaged → Fix Committed
Revision history for this message
Roel Van de Paar (roel11) wrote :

See also bug 1221608

Changed in percona-xtradb-cluster:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1407
