Cluster Stalled with Threads in 'query end' State

Bug #1149755 reported by Jervin R on 2013-03-06
This bug affects 1 person
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Under certain concurrency, a whole cluster can stall when the threads occupying the InnoDB queue on one of the nodes are in the 'query end' state. In this case we have found several things:

- wsrep_replicate_myisam is enabled
- The oldest transaction is a fulltext search on a MyISAM table, although it was not inside the InnoDB kernel
- wsrep_flow_control is not getting paused
- wsrep_last_committed is frozen
- From the GDB dump, all the threads inside InnoDB have this code path:

48 pthread_cond_wait,gu_fifo_get_head(libgalera_smm.so),gcs_recv(libgalera_smm.so),galera::GcsActionSource::process(libgalera_smm.so),galera::ReplicatorSMM::async_recv(libgalera_smm.so),galera_recv(libgalera_smm.so),wsrep_replication_process(sql_parse.cc:8343),start_wsrep_THD(mysqld.cc:4529),start_thread(libpthread.so.0),clone(libc.so.6),??

- oprofile shows that the majority of CPU time is spent in the LRU dump

revin@hq /cygdrive/l/dld/29762/pt-stalk $> cat 2013_02_26_18_31_21-opreport|head -n 10
CPU: Intel Architectural Perfmon, speed 3066.1 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name  symbol name
46709    92.4546  mysqld      buf_LRU_file_dump
513      1.0154   mysqld      sync_array_print_long_waits
410      0.8115   mysqld      mysqld_list_processes(THD*, char const*, bool)
255      0.5047   mysqld      srv_lock_timeout_thread
162      0.3207   mysqld      find_mpvio_user(MPVIO_EXT*)
156      0.3088   mysqld      fill_schema_processlist(THD*, TABLE_LIST*, Item*)
136      0.2692   mysqld      my_pthread_fastmutex_lock

revin@hq /cygdrive/l/dld/29762/pt-stalk $> cat 2013_02_26_18_38_57-opreport|head -n10
CPU: Intel Architectural Perfmon, speed 3066.1 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name  symbol name
30095    93.5819  mysqld      buf_LRU_file_dump
430      1.3371   mysqld      sync_array_print_long_waits
394      1.2252   mysqld      add_to_status(system_status_var*, system_status_var*)
225      0.6996   mysqld      srv_lock_timeout_thread
56       0.1741   mysqld      calc_sum_of_all_status(system_status_var*)
48       0.1493   mysqld      my_pthread_fastmutex_lock
41       0.1275   mysqld      strnmov

- All nodes have a buffer pool of about 90G
- This is a 2-node cluster, and the problem also happens on a single node.

Even though we see buf_LRU_file_dump at the top of the oprofile output here, it is not conclusive because of lp:1152571: in short, the galera libs are not included in that profile. We also see gcs_recv and other galera functions in the GDB dump.

However, the issue here is the setting "innodb_buffer_pool_restore_at_startup 1".

That is a very small interval and is certain to cause issues, for the following reason:

You are asking the LRU_dump thread to scan the LRU of the entire 90G buffer pool every second (while holding the LRU list mutex).

I would go with a duration like 60s or so.
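
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the default 16 KiB InnoDB page size and that each dump pass walks every page in the pool; neither is confirmed for this installation, and the function names are only illustrative.

```python
# Rough estimate of how many pages the LRU dump has to walk.
# Assumptions (not confirmed in this report): 16 KiB InnoDB pages (the
# default) and one full pass over every page in the pool per interval.

PAGE_SIZE_BYTES = 16 * 1024          # assumed default InnoDB page size
BUFFER_POOL_BYTES = 90 * 1024 ** 3   # "about 90G" from the report


def pages_in_pool(pool_bytes: int = BUFFER_POOL_BYTES,
                  page_bytes: int = PAGE_SIZE_BYTES) -> int:
    """Number of pages that fit in the buffer pool."""
    return pool_bytes // page_bytes


def pages_walked_per_second(interval_seconds: int) -> float:
    """Average pages walked per second if one full pass runs every interval."""
    return pages_in_pool() / interval_seconds


if __name__ == "__main__":
    for interval in (1, 60):
        rate = pages_walked_per_second(interval)
        print(f"interval={interval:>2}s: {rate:,.0f} pages/s on average")
```

Under those assumptions the pool holds roughly 5.9 million pages, so a 1s interval means walking all of them every second, while 60s brings the average down to about 98 thousand pages per second.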

I am marking this issue Invalid. If anything related to this is found that is not due to the LRU dump, set the status back to "New".

Changed in percona-xtradb-cluster:
status: New → Invalid

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1305
