SEGV in XtraDB Cluster

Bug #1078346 reported by Dan Rogers
34
This bug affects 6 people
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Fix Released
Medium
Seppo Jaakola
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Fix Released
Undecided
Unassigned

Bug Description

While in the process of switching a database server from Percona Server 5.5.28-rel29 (in a master-master relationship) to Percona XtraDB Cluster, I encountered a situation where any attempt to start the slave on the Cluster server would cause a SEGV.

A similar (or possibly identical) crash was reported as part of https://bugs.launchpad.net/galera/+bug/1036774 back in September.

As with the above report, the only way to get past that point in the binlogs was to switch back to the base Percona server version.

We've also seen 4.5 million "WSREP: skipping FK key append" messages, as described in https://bugs.launchpad.net/codership-mysql/+bug/1057910

Here's the dump from the logs. Config file is attached.

---

121108 12:24:45 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.002763' at position 84270110, relay log '/mysqlroot/mysql/relaylog/relay.000270' position: 84270256
121108 12:24:45 [Note] Slave I/O thread: connected to master '<email address hidden>:3306',replication started in log 'mysql-bin.002764' at position 402960631
18:24:45 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=16777216
read_buffer_size=524288
max_used_connections=45
max_threads=750
thread_count=35
connection_count=35
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1945858 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x8fd29400
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fcf47cb6e58 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7c5fb5]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6a00f4]
/lib64/libpthread.so.0(+0xf500)[0x7fdbef6a6500]
/usr/sbin/mysqld(wsrep_append_foreign_key+0xa2)[0x816cc2]
/usr/sbin/mysqld[0x84dc80]
/usr/sbin/mysqld[0x85100e]
/usr/sbin/mysqld[0x85218a]
/usr/sbin/mysqld[0x83ba01]
/usr/sbin/mysqld[0x81bc2f]
/usr/sbin/mysqld(_ZN7handler13ha_delete_rowEPKh+0x5e)[0x6a4aee]
/usr/sbin/mysqld(_ZN21Delete_rows_log_event11do_exec_rowEPK14Relay_log_info+0x148)[0x7428f8]
/usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEPK14Relay_log_info+0x22d)[0x7480fd]
/usr/sbin/mysqld(_Z26apply_event_and_update_posP9Log_eventP3THDP14Relay_log_info+0x125)[0x5317b5]
/usr/sbin/mysqld[0x535af7]
/usr/sbin/mysqld(handle_slave_sql+0xa45)[0x537025]
/lib64/libpthread.so.0(+0x7851)[0x7fdbef69e851]
/lib64/libc.so.6(clone+0x6d)[0x7fdbee92411d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 783
Status: NOT_KILLED

You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.

Revision history for this message
Dan Rogers (drogers-l) wrote :
Revision history for this message
pservit (pservit) wrote :

Percona-XtraDB-Cluster-server-5.5.28-23.7.369.rhel6.x86_64

Today 2 of 3 nodes crashed with similar backtrace:

16:40:13 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=296
max_threads=1000
thread_count=91
connection_count=91
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1197416 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x588a2c00
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
121121 10:40:15 [Note] WSREP: (0cb18548-322e-11e2-0800-bf8bb6ded1ca, 'tcp://0.0.0.0:4567') reconnecting to b2aaa747-32f1-11e2-
0800-9dbb5a52fe4a (tcp://10.0.1.11:4567), attempt 0
stack_bottom = 7fe9da186e48 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7c83f5]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6a1e04]
/lib64/libpthread.so.0(+0xf500)[0x7ff77cf01500]
/usr/sbin/mysqld(wsrep_append_foreign_key+0xa2)[0x83baf2]
/usr/sbin/mysqld[0x874ab0]
/usr/sbin/mysqld[0x877e3e]
/usr/sbin/mysqld[0x878fba]
/usr/sbin/mysqld[0x862831]
/usr/sbin/mysqld[0x840daf]
/usr/sbin/mysqld(_ZN7handler13ha_delete_rowEPKh+0x5e)[0x6a68ee]
/usr/sbin/mysqld(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP10SQL_I_ListI8st_orderEyy+0xafd)[0x78653d]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x2833)[0x595f93]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x333)[0x598de3]
/usr/sbin/mysqld[0x5997a0]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1a82)[0x59c072]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x169)[0x59c499]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x15f)[0x63710f]
/usr/sbin/mysqld(handle_one_connection+0x51)[0x6372d1]
/lib64/libpthread.so.0(+0x7851)[0x7ff77cef9851]
/lib64/libc.so.6(clone+0x6d)[0x7ff77c17f11d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fe91833f960): is an invalid pointer
Connection ID (thread ID): 98095
Status: NOT_KILLED

You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.
121121 10:40:17 mysqld_safe Number of processes running now: 0
121121 10:40:17 mysqld_safe WSREP: not restarting wsrep node automatically
------------------------------------------------------------------

I tried to start mysql again but it crashed with same backtrace.
After downgrade to 5.5.24-23.6 it started successfully.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

> Percona-XtraDB-Cluster-server-5.5.28-23.7.369.rhel6.x86_64

What is galera version there?
What is wsrep_slave_threads value?

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

The log message: "WSREP: skipping FK key append" has been dropped from the latest release. Do you still see these messages in your log, or are they from time when running with previous PXC release?

The problem happens when processing a delete for a table with foreign key constraint (probably cascading). Can you pinpoint which table(s) might be accessed during this delete? Can you show the table definition for the table(s)? If sensitive information, you can send through email to: <email address hidden>.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Do you have a 'core' file in your mysql data directory, related to this crash? That would help in troubleshooting this crash.

Changed in galera:
assignee: nobody → Seppo Jaakola (seppo-jaakola)
Revision history for this message
pservit (pservit) wrote :

>> Percona-XtraDB-Cluster-server-5.5.28-23.7.369.rhel6.x86_64

> What is galera version there?

wsrep_provider_version | 2.2(r137)

But it seems like bug in percona cluster, not galera because after downgrade to 5.5.24-23.6 with same galera version I can start server.

> What is wsrep_slave_threads value?

wsrep_slave_threads | 64

> The problem happens when processing a delete for a table with foreign key constraint (probably cascading).
> Can you pinpoint which table(s) might be accessed during this delete? Can you show the table definition for the table(s)?

Sorry, not clues in log files. It's very busy server, around 90G of data and 10k requests per second.

> Do you have a 'core' file in your mysql data directory, related to this crash? That would help in troubleshooting this crash.

No. I set ulimit -c unlimited but core file not generated.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

> No. I set ulimit -c unlimited but core file not generated.

could you please set it in /etc/security/limits.conf and also add "core-file" option to [mysqld] section in my.cnf. Core file would be truly lovely.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

This is hard to diagnose with the information at hand. Can you double check that you are not seeing warnings like:
"WSREP: skipping FK key append"
any more. They should not be produced by PXC 5.5.28 version

If you can show the full database schema, it may be possible to track the potential offending foreign key construct from there. e.g. mysqldump with --no-data option would produce that. (As schema spec is probably confidential, you better not attach it here)

Revision history for this message
Dan Rogers (drogers-l) wrote :

I've emailed Seppo a copy of our database schema. Hopefully this will help.

Changed in codership-mysql:
assignee: nobody → Seppo Jaakola (seppo-jaakola)
status: New → In Progress
Revision history for this message
Dan Rogers (drogers-l) wrote :

I've reinstalled Percona XtraDB Cluster on one of our production slaves, and enabled core files. Hopefully I can get it to crash so a core file will be generated.

FYI, it was also necessary to perform the steps from http://www.mysqlperformanceblog.com/2011/08/26/getting-mysql-core-file-on-linux/ to cause a core file to be dumped.

Changed in codership-mysql:
status: In Progress → Fix Committed
milestone: none → 5.5.28-23.7
importance: Undecided → Medium
no longer affects: galera
Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Thanks Dan!

I have pushed a potential fix for the problem discovered in pservit's case. It would be good to compare with the core analysis we can get from your setup.

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :
Changed in codership-mysql:
status: Fix Committed → Fix Released
Changed in codership-mysql:
status: Fix Released → In Progress
Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

The problem still happens with the suggested fix.

Dan proposed a new fix which seems to work in his case, many thanks for this extended troubleshooting, I will push this fix in after some more testing.

Fundamental issue seems to be that Percona Server either does not load all tables, or drops some tables, from dictionary cache. Dan's fix will pull the referenced table back in cache.

Changed in codership-mysql:
milestone: 5.5.28-23.7 → 5.5.28-23.8
status: In Progress → Fix Committed
Changed in percona-xtradb-cluster:
status: New → Confirmed
Changed in percona-xtradb-cluster:
milestone: none → 5.5.29-23.8
Changed in percona-xtradb-cluster:
milestone: 5.5.29-23.8 → none
Revision history for this message
Dan Rogers (drogers-l) wrote :

Does that mean that this will no longer be in the next release of Percona Cluster?

This bug is a show stopper for us in one of our clusters, and will prevent us from upgrading to a version that doesn't have my hand-patched change.

Changed in codership-mysql:
milestone: 5.5.29-24.8 → 5.5.29-23.7.1
Changed in percona-xtradb-cluster:
milestone: none → 5.5.29-23.7.1
status: Confirmed → Fix Committed
Revision history for this message
Dan Rogers (drogers-l) wrote :

Thanks guys. :)

Changed in percona-xtradb-cluster:
status: Fix Committed → Fix Released
Revision history for this message
Alex Yurchenko (ayurchen) wrote :

fix released in 23.7.3

Changed in codership-mysql:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1251

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.