PXC crashes if fc_limit=20000 and under stress

Bug #1579987 reported by Kenn Takara on 2016-05-10
Affects:      Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status:       Fix Released
Importance:   Undecided
Assigned to:  Kenn Takara
Milestone:    5.6.29-25.15

Bug Description

On a 256MB CentOS 7 machine running PXC 5.6:
(1) Start up a 3-node cluster
(2) Set fc_limit=20000 on node 3
(3) "flush tables with read lock;" on node 3
(4) Start a stress run on node 1

Sometimes node 3 will block and its receive queue will eventually get stuck at
a 32K size while the other nodes keep on running. Other times the queue will
keep on growing.

(5) after the limit has been reached, "unlock tables;" on node 3

This will start to drain the queue, which will then trigger an assert at some
point.

========== gdb stack trace ========

#0 0x00007ffff5f635f7 in __GI_raise (sig=sig@entry=6)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff5f64ce8 in __GI_abort () at abort.c:90
#2 0x00007ffff5f5c566 in __assert_fail_base (
fmt=0x7ffff60ac228 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7fffda6e12be "conn->recv_q_size >= size",
file=file@entry=0x7fffda6e030c "gcs/src/gcs.cpp", line=line@entry=1775,
function=function@entry=0x7fffda6e1f80 <GCS_FIFO_POP_HEAD(gcs_conn*, long)::__PRETTY_FUNCTION__> "void GCS_FIFO_POP_HEAD(gcs_conn_t*, ssize_t)") at assert.c:92
#3 0x00007ffff5f5c612 in __GI___assert_fail (
assertion=0x7fffda6e12be "conn->recv_q_size >= size",
file=0x7fffda6e030c "gcs/src/gcs.cpp", line=1775,
function=0x7fffda6e1f80 <GCS_FIFO_POP_HEAD(gcs_conn*, long)::__PRETTY_FUNCTION__> "void GCS_FIFO_POP_HEAD(gcs_conn_t*, ssize_t)") at assert.c:101
#4 0x00007fffda64864a in GCS_FIFO_POP_HEAD (conn=0x197eee0, size=140736414615528)
at gcs/src/gcs.cpp:1775
#5 0x00007fffda648814 in gcs_recv (conn=0x197eee0, action=0x7fffd8146cf0)
at gcs/src/gcs.cpp:1811
#6 0x00007fffda6a6927 in galera::Gcs::recv (this=0x1978030, act=...)
at galera/src/galera_gcs.hpp:118
#7 0x00007fffda681399 in galera::GcsActionSource::process (this=0x1978178,
recv_ctx=0x7fffb80009a0, exit_loop=@0x7fffd8146d9f: false)
at galera/src/gcs_action_source.cpp:175
#8 0x00007fffda69dcf9 in galera::ReplicatorSMM::async_recv (this=0x1977a50,
recv_ctx=0x7fffb80009a0) at galera/src/replicator_smm.cpp:368
#9 0x00007fffda6bac6e in galera_recv (gh=0x1946c70, recv_ctx=0x7fffb80009a0)
at galera/src/wsrep_provider.cpp:239
#10 0x000000000065b3c3 in wsrep_replication_process (thd=0x7fffb80009a0)
at /root/dev/pxc/sql/wsrep_thd.cc:313
#11 0x00000000006377f2 in start_wsrep_THD (arg=0x65b30b <wsrep_replication_process(THD*)>)
at /root/dev/pxc/sql/mysqld.cc:5792
#12 0x00007ffff7bc6dc5 in start_thread (arg=0x7fffd8148700) at pthread_create.c:308
#13 0x00007ffff602421d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Kenn Takara (kenn-takara) wrote:

Set fc_limit=20000 and then block the node ("flush tables with read lock").
The queue will eventually reach the limit (which is actually ~34800), but
before that it will reach the size of the receive queue (which is set at
startup and on a 256MB machine will hold 32K entries). If you then
"unlock tables", this will hit an assert in the code.

There is a bug in the FIFO queue implementation (galerautils/src/gu_fifo.c):
if the circular buffer has wrapped (so that -->tail ... head-->) and head and
tail are in the same row, then when head reaches the end of that row it will
free() the row, thus erasing part of the entries still in the queue.

The upper_limit and lower_limit can also be set higher than the size of the
receive queue, which doesn't seem to be the behavior we want.
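
For illustration, here is a minimal, self-contained sketch of that wrap-around
failure mode. It is NOT the actual gu_fifo.c code: the struct, the names
(fifo_t, slot, push, pop_buggy) and the tiny ROWS/ROW_ITEMS sizes are all
hypothetical, chosen only to show how freeing a row as soon as head leaves it
can destroy entries that a wrapped tail has already stored in the same row.

========== illustrative fifo sketch (hypothetical, not gu_fifo.c) ========

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define ROWS      4
#define ROW_ITEMS 4
#define CAPACITY  (ROWS * ROW_ITEMS)

typedef struct {
    long *row[ROWS];   /* lazily allocated rows of ROW_ITEMS items */
    long  head, tail;  /* ever-increasing pop/push counters        */
} fifo_t;

/* Map a counter to its slot, allocating the row on first use. */
static long *slot(fifo_t *f, long counter)
{
    long pos = counter % CAPACITY;
    long r   = pos / ROW_ITEMS;
    if (!f->row[r])
        f->row[r] = calloc(ROW_ITEMS, sizeof(long));
    return &f->row[r][pos % ROW_ITEMS];
}

static void push(fifo_t *f, long val)
{
    assert(f->tail - f->head < CAPACITY);   /* queue not full */
    *slot(f, f->tail++) = val;
}

static long pop_buggy(fifo_t *f)
{
    assert(f->head < f->tail);              /* queue not empty */
    long pos = f->head % CAPACITY;
    long val = *slot(f, f->head);           /* re-creates a freed row as zeros */
    f->head++;
    /* BUG: the row is freed as soon as head steps past its last slot,
     * without checking whether the wrapped tail still has live entries
     * stored in that same row.                                          */
    if (f->head % ROW_ITEMS == 0) {
        free(f->row[pos / ROW_ITEMS]);
        f->row[pos / ROW_ITEMS] = NULL;
    }
    return val;
}

int main(void)
{
    fifo_t f = {0};
    long i;

    for (i = 0; i < 14; i++) push(&f, i);   /* nearly fill the buffer        */
    for (i = 0; i < 2;  i++) pop_buggy(&f); /* make room at the front        */
    for (i = 14; i < 17; i++) push(&f, i);  /* tail wraps into row 0, which
                                               head still occupies           */

    /* Drain: when head crosses the row 0 / row 1 boundary, the buggy free
     * throws away row 0 and with it entry 16 that the tail had wrapped
     * around and stored there.                                            */
    while (f.head < f.tail) {
        long expect = f.head;               /* we pushed the counter value   */
        long got    = pop_buggy(&f);
        if (got != expect)
            printf("lost entry: expected %ld, got %ld\n", expect, got);
    }
    return 0;
}

Draining the queue in this sketch reports a lost entry; in the real code the
mismatch surfaces instead as the `conn->recv_q_size >= size' assertion in
gcs/src/gcs.cpp shown in the stack trace above.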

Changed in percona-xtradb-cluster:
status: New → Fix Committed
Changed in percona-xtradb-cluster:
milestone: none → 5.6.29-25.15
Changed in percona-xtradb-cluster:
status: Fix Committed → Fix Released

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1901
