PXC crashes if fc_limit=20000 and under stress
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC | Fix Released | Undecided | Kenn Takara | |
Bug Description
On a 256MB CentOS 7 machine running PXC 5.6:
(1) Start up a 3 node cluster
(2) Set fc_limit=20000 on node 3
(3) "flush tables with read lock;" on node 3
(4) Start a stress on node 1
Sometimes node 3 will block and the queue will eventually get stuck at a size of 32K entries while the other nodes keep running; other times the queue will keep growing.
(5) after the limit has been reached, "unlock tables;" on node 3
This starts to drain the queue, which then triggers an assert at some point.
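A sketch of the steps above as a shell session. The host names and the sysbench invocation are assumptions, not taken from the report; in Galera-based clusters fc_limit is normally set through wsrep_provider_options:

```shell
# Node 3: raise the flow-control limit.
mysql -h node3 -e "SET GLOBAL wsrep_provider_options='gcs.fc_limit=20000';"

# Node 3: block the applier. Run this in an interactive mysql session and
# keep the session open -- the lock is released when the session closes.
#   mysql> FLUSH TABLES WITH READ LOCK;

# Node 1: generate write load (sysbench shown as one possibility).
sysbench oltp_read_write --mysql-host=node1 --tables=4 --table-size=100000 run

# Node 3: watch the receive queue grow past the 32K mark.
mysql -h node3 -e "SHOW STATUS LIKE 'wsrep_local_recv_queue';"

# Node 3: once the queue is large, release the lock in the open session;
# draining the queue is what triggered the assert in gcs.cpp:
#   mysql> UNLOCK TABLES;
```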
========== gdb stack trace ==========
#0 0x00007ffff5f635f7 in __GI_raise (sig=sig@entry=6)
at ../nptl/
#1 0x00007ffff5f64ce8 in __GI_abort () at abort.c:90
#2 0x00007ffff5f5c566 in __assert_fail_base (
fmt=0x7ffff60ac228 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=
file=file@
function=
#3 0x00007ffff5f5c612 in __GI___assert_fail (
assertion=
file=0x7fffda6e030c "gcs/src/gcs.cpp", line=1775,
function=
#4 0x00007fffda64864a in GCS_FIFO_POP_HEAD (conn=0x197eee0, size=1407364146
at gcs/src/
#5 0x00007fffda648814 in gcs_recv (conn=0x197eee0, action=
at gcs/src/
#6 0x00007fffda6a6927 in galera::Gcs::recv (this=0x1978030, act=...)
at galera/
#7 0x00007fffda681399 in galera:
recv_ctx=
at galera/
#8 0x00007fffda69dcf9 in galera:
recv_ctx=
#9 0x00007fffda6bac6e in galera_recv (gh=0x1946c70, recv_ctx=
at galera/
#10 0x000000000065b3c3 in wsrep_replicati
at /root/dev/
#11 0x00000000006377f2 in start_wsrep_THD (arg=0x65b30b <wsrep_
at /root/dev/
#12 0x00007ffff7bc6dc5 in start_thread (arg=0x7fffd814
#13 0x00007ffff602421d in clone () at ../sysdeps/
Changed in percona-xtradb-cluster:
milestone: none → 5.6.29-25.15
Changed in percona-xtradb-cluster:
status: Fix Committed → Fix Released
Set fc_limit=20000 and then block the node ("flush tables with read lock").
The queue will eventually reach the limit (which is actually ~34800), but
before that it will reach the size of the receive queue (which is set at
startup, and on a 256MB machine holds 32K entries). If you then
"unlock tables", this will hit an assert in the code.
There is a bug in the FIFO queue implementation (galerautils/src/gu_fifo.c):
if the circular buffer has wrapped (so that -->tail ... head-->) and head and
tail are in the same row, then when head reaches the end of the row it will
free() the row, erasing some of the still-queued entries.
The upper_limit and lower_limit can also be set higher than the size
of the receive queue, which does not seem to be the behavior we want.