PXC crashes if fc_limit=20000 and under stress

Bug #1579987 reported by Kenn Takara
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: Fix Released
Importance: Undecided
Assigned to: Kenn Takara
Milestone: 5.6.29-25.15

Bug Description

On a CentOS 7 machine with 256MB of RAM, running PXC 5.6:
(1) Start up a 3-node cluster
(2) Set fc_limit=20000 on node 3
(3) Run "flush tables with read lock;" on node 3
(4) Start a stress workload on node 1

Sometimes node 3 will block and its receive queue will eventually get stuck
at ~32K entries while the other nodes keep running; other times the queue
will keep growing.

(5) After the limit has been reached, run "unlock tables;" on node 3

This starts draining the queue, and at some point the drain triggers an
assert.

========== gdb stack trace ==========

#0  0x00007ffff5f635f7 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff5f64ce8 in __GI_abort () at abort.c:90
#2  0x00007ffff5f5c566 in __assert_fail_base (
    fmt=0x7ffff60ac228 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7fffda6e12be "conn->recv_q_size >= size",
    file=file@entry=0x7fffda6e030c "gcs/src/gcs.cpp", line=line@entry=1775,
    function=function@entry=0x7fffda6e1f80 <GCS_FIFO_POP_HEAD(gcs_conn*, long)::__PRETTY_FUNCTION__> "void GCS_FIFO_POP_HEAD(gcs_conn_t*, ssize_t)") at assert.c:92
#3  0x00007ffff5f5c612 in __GI___assert_fail (
    assertion=0x7fffda6e12be "conn->recv_q_size >= size",
    file=0x7fffda6e030c "gcs/src/gcs.cpp", line=1775,
    function=0x7fffda6e1f80 <GCS_FIFO_POP_HEAD(gcs_conn*, long)::__PRETTY_FUNCTION__> "void GCS_FIFO_POP_HEAD(gcs_conn_t*, ssize_t)") at assert.c:101
#4  0x00007fffda64864a in GCS_FIFO_POP_HEAD (conn=0x197eee0, size=140736414615528)
    at gcs/src/gcs.cpp:1775
#5  0x00007fffda648814 in gcs_recv (conn=0x197eee0, action=0x7fffd8146cf0)
    at gcs/src/gcs.cpp:1811
#6  0x00007fffda6a6927 in galera::Gcs::recv (this=0x1978030, act=...)
    at galera/src/galera_gcs.hpp:118
#7  0x00007fffda681399 in galera::GcsActionSource::process (this=0x1978178,
    recv_ctx=0x7fffb80009a0, exit_loop=@0x7fffd8146d9f: false)
    at galera/src/gcs_action_source.cpp:175
#8  0x00007fffda69dcf9 in galera::ReplicatorSMM::async_recv (this=0x1977a50,
    recv_ctx=0x7fffb80009a0) at galera/src/replicator_smm.cpp:368
#9  0x00007fffda6bac6e in galera_recv (gh=0x1946c70, recv_ctx=0x7fffb80009a0)
    at galera/src/wsrep_provider.cpp:239
#10 0x000000000065b3c3 in wsrep_replication_process (thd=0x7fffb80009a0)
    at /root/dev/pxc/sql/wsrep_thd.cc:313
#11 0x00000000006377f2 in start_wsrep_THD (arg=0x65b30b <wsrep_replication_process(THD*)>)
    at /root/dev/pxc/sql/mysqld.cc:5792
#12 0x00007ffff7bc6dc5 in start_thread (arg=0x7fffd8148700) at pthread_create.c:308
#13 0x00007ffff602421d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
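
For context, the assertion in frames #2-#4 guards the queue's byte
accounting: a pop must never claim more bytes than the queue believes it
holds. Below is a minimal sketch of that invariant; only the asserted
condition is taken verbatim from the trace, while the struct, names, and the
decrement are assumptions made for illustration.

========== C sketch: the failing invariant ==========

#include <assert.h>
#include <sys/types.h>

/* Hedged sketch of the accounting guarded by the assertion at
 * gcs/src/gcs.cpp:1775 (frame #4); not the actual gcs.cpp code. */
struct conn_sketch { ssize_t recv_q_size; /* total bytes queued */ };

static void fifo_pop_head_sketch(struct conn_sketch *conn, ssize_t size)
{
    /* Fires when a popped entry reports a size larger than the queue's
     * byte count -- e.g. a header read from freed memory. */
    assert(conn->recv_q_size >= size);
    conn->recv_q_size -= size;
}

int main(void)
{
    struct conn_sketch c = { 4096 };
    fifo_pop_head_sketch(&c, 1024);              /* fine: 4096 >= 1024 */
    fifo_pop_head_sketch(&c, 140736414615528L);  /* aborts, as in frame #0 */
    return 0;
}

The size in frame #4 (140736414615528, i.e. in the 0x7fff... range typical
of stack addresses) is implausibly large for a replication action, which is
at least consistent with the pop reading an entry whose backing memory had
already been freed.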

Revision history for this message
Kenn Takara (kenn-takara) wrote:

Set fc_limit=20000, then block the node ("flush tables with read lock").
The queue will eventually approach the flow-control limit (which in practice
is ~34800), but before that it reaches the capacity of the receive queue
(which is fixed at startup; on a 256MB machine it holds 32K entries). If you
then run "unlock tables", this hits an assert in the code.

There is a bug in the FIFO queue implementation (galerautils/src/gu_fifo.c):
if the circular buffer has wrapped (so that -->tail ... head-->) and head
and tail fall in the same row, then when head reaches the end of that row it
will free() the row, erasing some of the queued entries.
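
To make the failure mode concrete, here is a self-contained C sketch of such
a two-level FIFO: a ring of lazily allocated rows, with head/tail treated as
absolute slot indices. Every name here (fifo_t, fifo_push, fifo_pop, N_ROWS,
ROW_LEN) is hypothetical rather than the gu_fifo.c API; the tail-row check
in fifo_pop() stands in for the guard described as missing above.

========== C sketch: row freed while tail still occupies it ==========

#include <stdio.h>
#include <stdlib.h>

#define N_ROWS  4                  /* slots in the ring of rows */
#define ROW_LEN 4                  /* items per row             */
#define CAP     (N_ROWS * ROW_LEN)

typedef struct {
    long *rows[N_ROWS];            /* rows allocated on demand     */
    long  head;                    /* absolute index of next pop   */
    long  tail;                    /* absolute index of next push  */
    long  used;
} fifo_t;

static void fifo_push(fifo_t *q, long v)
{
    long row = (q->tail / ROW_LEN) % N_ROWS;
    if (q->rows[row] == NULL)                  /* allocate row lazily */
        q->rows[row] = malloc(ROW_LEN * sizeof(long));
    q->rows[row][q->tail % ROW_LEN] = v;
    q->tail = (q->tail + 1) % CAP;
    q->used++;
}

static long fifo_pop(fifo_t *q)
{
    long row = (q->head / ROW_LEN) % N_ROWS;
    long col =  q->head % ROW_LEN;
    long v   =  q->rows[row][col];

    q->head = (q->head + 1) % CAP;
    q->used--;

    if (col == ROW_LEN - 1) {
        /* head just stepped off this row.  Freeing it unconditionally
         * is the reported failure mode: after a wrap, tail may occupy
         * the *same* row, so free() would discard live entries.  The
         * tail-row check below is the guard that prevents this. */
        if ((q->tail / ROW_LEN) % N_ROWS != row) {
            free(q->rows[row]);
            q->rows[row] = NULL;
        }
    }
    return v;
}

int main(void)
{
    fifo_t q = {0};
    long i;

    for (i = 0; i < CAP - 1; i++)
        fifo_push(&q, i);          /* fill: tail -> slot 15          */
    fifo_pop(&q);
    fifo_pop(&q);                  /* head -> slot 2                 */
    fifo_push(&q, 100);            /* slot 15                        */
    fifo_push(&q, 101);            /* tail wraps into row 0, slot 0  */

    /* head and tail now share row 0; without the guard, finishing
     * row 0 would free() it and lose entry 101. */
    while (q.used > 0)
        printf("%ld ", fifo_pop(&q));
    printf("\n");                  /* 2 3 ... 14 100 101             */

    for (i = 0; i < N_ROWS; i++)
        free(q.rows[i]);           /* release whatever is left       */
    return 0;
}

Removing the tail-row check reproduces the described corruption: after the
wrap in main(), head finishes row 0 while tail has already written entry 101
into it, so an unconditional free() would discard that entry and later pops
would read freed memory.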

The upper_limit and lower_limit can also be set higher than the capacity of
the receive queue, which does not seem to be the behavior we want.
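
A hedged sketch of the kind of guard that observation implies (the name and
placement are illustrative, not the actual gcs.cpp change):

========== C sketch: capping flow-control limits ==========

static long cap_fc_limit(long requested_limit, long recv_q_capacity)
{
    /* Clamp the flow-control thresholds to the queue's capacity so
     * flow control can engage before the queue is physically full. */
    return requested_limit < recv_q_capacity ? requested_limit
                                             : recv_q_capacity;
}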

Changed in percona-xtradb-cluster:
status: New → Fix Committed
Changed in percona-xtradb-cluster:
milestone: none → 5.6.29-25.15
Changed in percona-xtradb-cluster:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote:

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1901
