Reproducible crash of Percona XtraDB Cluster using large updates

Bug #1269842 reported by Ales Perme on 2014-01-16
Affects: Galera (Status: Undecided, Unassigned)
Affects: Percona XtraDB Cluster, moved to https://jira.percona.com/projects/PXC (status tracked in 5.6)
  5.5: Invalid, Importance: Medium, Unassigned
  5.6: Invalid, Importance: Medium, Unassigned

Bug Description

We operate a three-server Percona XtraDB Cluster, version 5.5.34-31.1, Percona XtraDB Cluster (GPL), Release 31.1, wsrep_25.9.r3928.

The database has a table with the following structure:
CREATE TABLE `docStatsDetail` (
  `date` date NOT NULL,
  `adId` varchar(36) COLLATE utf8_slovenian_ci NOT NULL,
  `userId` varchar(100) COLLATE utf8_slovenian_ci DEFAULT NULL,
  `adImpressions` int(11) DEFAULT '0',
  `adClicks` int(11) DEFAULT '0',
  `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`date`,`adId`),
  KEY `TS` (`ts`),
  KEY `_adId` (`adId`),
  KEY `_userId` (`userId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_slovenian_ci;

The table has 120 million records, and if we try to delete more than 20 million of them using `DELETE FROM docStatsDetail LIMIT 20000000;`, the server crashes with this report:

10:57:55 UTC - mysqld got signal 7 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=14
max_threads=102
thread_count=10
connection_count=10
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 759054 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x5b7cdb0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f31a953ae70 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x7e5c1e]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6bb5a4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f3266e54cb0]
/lib/x86_64-linux-gnu/libc.so.6(+0x14fecd)[0x7f326659becd]
/usr/lib/libgalera_smm.so(_ZN6galera12MappedBuffer7reserveEm+0x263)[0x7f3264b33fa3]
/usr/lib/libgalera_smm.so(_ZN6galera12MappedBuffer6resizeEm+0x19)[0x7f3264b34369]
/usr/lib/libgalera_smm.so(_ZN6galera9TrxHandle5flushEm+0xb5)[0x7f3264b5ce65]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM9replicateEPNS_9TrxHandleEP14wsrep_trx_meta+0xd7)[0x7f3264b72537]
/usr/lib/libgalera_smm.so(galera_pre_commit+0xbb)[0x7f3264b88eeb]
/usr/sbin/mysqld(_Z22wsrep_run_wsrep_commitP3THDP10handlertonb+0xadf)[0x66cf9f]
/usr/sbin/mysqld[0x66db3b]
/usr/sbin/mysqld(_Z15ha_commit_transP3THDb+0x2f5)[0x6be3d5]
/usr/sbin/mysqld(_Z17trans_commit_stmtP3THD+0x29)[0x662489]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x77c)[0x5ac7bc]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x314)[0x5b4534]
/usr/sbin/mysqld[0x5b4e58]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x2502)[0x5b7972]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x167)[0x5b80f7]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x14f)[0x65490f]
/usr/sbin/mysqld(handle_one_connection+0x51)[0x654ae1]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f3266e4ce9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f32665403fd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f317c004b90): is an invalid pointer
Connection ID (thread ID): 48
Status: NOT_KILLED
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
140116 11:57:55 mysqld_safe Number of processes running now: 0
140116 11:57:55 mysqld_safe WSREP: not restarting wsrep node automatically
140116 11:57:55 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

The servers have the following settings:

# ------------------------------------------------------------
# wsrep provider configuration: basic wsrep options
# ------------------------------------------------------------
wsrep_on = 1
wsrep_debug = 0
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_provider_options = "gcache.size=16G;gcache.page_size=512M;"

# An empty gcomm address is used only while the cluster is being bootstrapped. Don't use it later: it will form a new cluster!
#wsrep_cluster_address = gcomm://
wsrep_cluster_address = gcomm://ae-01
wsrep_cluster_name = bolha_cluster

wsrep_node_name = as-02.bolha.com
wsrep_node_address = ae-02

wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = "sstuser:mXXXXXX9"
wsrep_slave_threads = 8

wsrep_notify_cmd = /etc/mysql/wsrep_notify

# Added by AP
wsrep_max_ws_rows = 128K
wsrep_max_ws_size = 4096M

# ------------------------------------------------------------
# InnoDB
# ------------------------------------------------------------
innodb_log_file_size = 4096M
innodb_log_files_in_group = 4
innodb_buffer_pool_size = 1024M

innodb_flush_log_at_trx_commit = 0
innodb_log_buffer_size = 256M
innodb_file_per_table = 1
innodb_open_files = 131072
innodb_io_capacity = 1000
innodb_flush_method = O_DIRECT

# Added by AP
innodb_adaptive_flushing_method = keep_average
innodb_flush_neighbor_pages = none
innodb_max_dirty_pages_pct = 60
innodb_read_io_threads = 8
innodb_write_io_threads = 8

The servers run on SSDs, and the error is reproducible on every node.

The servers are Linux-based:
Linux XXX 3.2.0-57-generic #87-Ubuntu SMP Tue Nov 12 21:35:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise

PLEASE HELP!

a) Looks like a crash in galera, so adding galera project here as well.

b) Also, this looks similar to https://bugs.launchpad.net/codership-mysql/+bug/1254179

c)

wsrep_max_ws_rows = 128K
wsrep_max_ws_size = 4096M

The large value of ws_size may be causing issues here.
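As a quick sanity check on the configured value (a minimal arithmetic sketch; the 2 GB Galera transaction limit is the one discussed further down this thread):

```python
# wsrep_max_ws_size is set to 4096M above, but Galera's transaction
# (writeset) limit is 2 GB, so the configured value exceeds what can
# ever actually be honoured.
configured_ws_size = 4096 * 2**20   # 4096M from the settings above
galera_limit = 2 * 2**30            # 2 GB Galera transaction limit

print(configured_ws_size)                  # 4294967296
print(configured_ws_size > galera_limit)   # True
```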

@Ales,

d) Does this happen only when a certain limit (in this case 20000000) is exceeded, i.e. DELETE FROM docStatsDetail LIMIT 20000000? If so, it looks more like (c). Also, I see the buffer reserve/resize in the last frames of the crash.

Ales Perme (ales-perme) wrote :

Pranams Raghavendra Prabhu!

Yes, it looks like it is a problem in Galera. I did check (b), but since there is no Percona XtraDB Cluster 5.6 yet, I can't check whether it is the same bug. I'd gladly give it a try.

I did first run with the default XtraDB Cluster settings:
wsrep_max_ws_rows = 128K
wsrep_max_ws_size = 1024M

But I hit the same bug. Because I thought it might be a transaction size issue, I configured a bigger size, to no avail. I also increased gcache.size in wsrep_provider_options from 1G to 16G, without success.

Here are all the options: "base_host = ae-02; base_port = 4567; cert.log_conflicts = no; evs.causal_keepalive_period = PT1S; evs.debug_log_mask = 0x1; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.info_log_mask = 0; evs.install_timeout = PT15S; evs.join_retrans_period = PT1S; evs.keepalive_period = PT1S; evs.max_install_timeouts = 1; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.use_aggregate = true; evs.user_send_window = 2; evs.version = 0; evs.view_forget_timeout = PT5M; gcache.dir = /data/percona/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /data/percona/mysql//galera.cache; gcache.page_size = 512M; gcache.size = 16G; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; gmcast.listen_addr = tcp://0.0.0.0:4567; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT3S; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = ae-02; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.linger = PT20S; pc.npvo = false; pc.version = 0; pc.weight = 1; protonet.backend = asio; protonet.version = 0; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3"

I increased the values because I calculated that at around 20 million records I'd hit a 1 GB transaction size: the average row length is about 58 bytes, so 20,000,000 x 58 bytes = 1.08 GB (give or take a few thousand records, since I can't get the exact row length). So I did some guessing about buffer sizes, log sizes, etc., but without success. Alas! :-)
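Redoing the back-of-the-envelope estimate above (the 58-byte average row length is the reporter's estimate, not a measured value):

```python
# ~58 bytes per row times 20 million rows is a little over 1 GiB of raw
# row data in a single transaction.
rows = 20_000_000
avg_row_bytes = 58                    # estimated average row length

total_bytes = rows * avg_row_bytes
print(total_bytes)                    # 1160000000
print(round(total_bytes / 2**30, 2))  # 1.08 (GiB)
```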

Re (d): yes, this happens at a certain limit. I can delete 15 million records without a problem. I just can't pin down the exact threshold, since it would take a lot of time: it takes me about an hour to restore the cluster, so finding the exact number would be a lengthy operation... You can understand.

I can successfully perform deletes or INSERT INTO xxx SELECT * FROM docStatsDetail... using LIMIT offset, count. The crash happens with large transactions, which is why I thought it might be a problem with the sizes...

I would understand if MySQL reported an error, but a crash that takes down the cluster?! That is nasty... For a "quick" fix I usually bootstrap from a healthy cluster node and then perform an SST, but in a production environment this is unacceptable... :-(

What should I do? And thank you again for all the help.

Kind regards,
Ales

Alex Yurchenko (ayurchen) wrote :

Correct me if I'm wrong, but from what I understand the crash is happening only on the master node. And it is happening at the stage of writeset creation, so it does not look like https://bugs.launchpad.net/codership-mysql/+bug/1254179.

Signal 7 is a "bus error" and must be caused by a bug in how Galera 2.x manages its mmapped output buffer; it is definitely related to sizes. Usually signal 7 is the result of an mmapped file not fitting on disk. How much free disk space is there in the mysql data directory? Note that with such a big transaction you need more than double the disk space, since the transactional IO cache will be spilled to disk as well.
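That failure mode can be reproduced in miniature. The sketch below is an illustration, not taken from the bug report: it is Linux-specific and assumes 4 KiB pages. Writing through a MAP_SHARED mapping whose backing file has shrunk delivers SIGBUS, which is signal 7 on Linux, the same signal seen in the mysqld crash report above.

```python
import signal
import subprocess
import sys

# Child process: create a 4-page temp file, mmap it, shrink the file
# underneath the mapping, then write to a page that no longer has backing
# store. On Linux the kernel answers with SIGBUS ("bus error", signal 7).
child = r"""
import mmap, os, tempfile
fd, path = tempfile.mkstemp()
os.unlink(path)                 # file lives only as long as the fd
os.ftruncate(fd, 4096 * 4)      # file backs 4 pages
m = mmap.mmap(fd, 4096 * 4)     # MAP_SHARED mapping over all 4 pages
os.ftruncate(fd, 4096)          # shrink: the later pages lose backing
m[4096 * 2] = 1                 # touch an unbacked page -> SIGBUS
"""

proc = subprocess.run([sys.executable, "-c", child])
# A child killed by a signal reports minus the signal number.
print(proc.returncode == -signal.SIGBUS)  # True on Linux
```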

However,

1) ATM there is no effective check for a transaction exceeding wsrep_max_ws_size.
2) A transaction writeset contains not only row data but also all unique keys, which take additional, arbitrary space.
3) The Galera transaction limit is 2 GB. In Galera 2.x it is checked at replication time, but the crash happens well before that.
4) 20M rows: no replication will thank you for that. The pt-archiver tool is the way to prune large tables. Anyway, with Galera 2.x it will likely require several gigabytes of spare RAM to process such a writeset, which you likely don't have. And if you don't have swap it will get even uglier. So if you insist on going that way, you should be using Galera 3.x.
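Point 4 is about keeping each transaction, and hence each writeset, small; that is what pt-archiver automates by deleting in bounded chunks and committing per chunk. A minimal sketch of the same idea, using SQLite from the Python standard library as a stand-in for MySQL (the table name mirrors the report; the row count and batch size are arbitrary):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docStatsDetail (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO docStatsDetail (payload) VALUES (?)",
                 [("row",)] * 10_000)
conn.commit()

BATCH = 1_000  # small enough that no single transaction gets huge

while True:
    # Delete a bounded chunk per transaction; pt-archiver does essentially
    # this against MySQL, committing after every chunk.
    cur = conn.execute(
        "DELETE FROM docStatsDetail WHERE id IN "
        "(SELECT id FROM docStatsDetail ORDER BY id LIMIT ?)",
        (BATCH,))
    conn.commit()
    if cur.rowcount == 0:
        break

remaining = conn.execute(
    "SELECT COUNT(*) FROM docStatsDetail").fetchone()[0]
print(remaining)  # 0
```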

Ales Perme (ales-perme) wrote :

Hi Alex, Raghavendra Prabhu...

The error is happening only on the master node, yes, you are right. I did some digging on signal 7 yesterday; since it can be a consequence of a hardware error, I tested the error on more than one node. It is reproducible, and thus a software bug/feature rather than a hardware problem.

Thank you for confirming that it is a Galera 2.x bug.

Regarding 1,2,3 & 4:

1. :-( Baaad... :((((
2. Okay, it makes sense, yes.
3. OMG, only 2G?! And it looks like it happens before the check.
4. I can delete with a TRUNCATE, but I can't do the insert with one "INSERT...SELECT..." statement. I could use limits, though. I would sleep better knowing the database will remain intact if some sysadmin misses the point and forgets to use the limits, or uses improper ones. RAM? The server has 32 GB. Disk? Plenty of room: 190 GB.

So it looks like I'll have to wait for Percona 5.6 with Galera 3.x and try again to see if it works any better. Am I right?

information type: Private Security → Private
Ales Perme (ales-perme) wrote :

I'm in the process of upgrading to the Percona XtraDB Cluster 5.6 RC with Galera 3.x. I will contact you soon if the bug/behaviour remains the same.

information type: Private → Public
Ales Perme (ales-perme) on 2014-01-17
description: updated

Note: if you want to test Galera 3.x, you can use it with PXC 5.5 as well. You need to install Percona-XtraDB-Cluster-galera-3 instead of Percona-XtraDB-Cluster-galera-2.

Ales Perme (ales-perme) wrote :

Just about to finalize the installation of PXC 5.6... then the testing starts.

Ales Perme (ales-perme) wrote :

A test with 25 million records works, and PXC 5.6 looks faster. I will let you know if I hit another wall. I have a dataset of roughly 95 million records.

Ales Perme (ales-perme) wrote :

I can successfully transfer 50 million records but hit a wall at 60 million, with this message in the mysql.err log:

2014-01-20 08:57:04 195615 [ERROR] WSREP: Maximum wirteset size exceeded by 129676357: 90 (Message too long)
         at galera/src/write_set_ng.hpp:check_size():652
2014-01-20 08:57:04 195615 [ERROR] WSREP: unknown connection failure

The good news is that the server does not crash.

The thread, however, hangs on the master server even if I kill it:
mysql> INSERT INTO docStatsDetail_2013 SELECT * FROM docStatsDetail WHERE date>='2013-01-01' AND date <='2013-12-31';

There seems to be no disk I/O, which suggests it is in a "hanging" state:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 3 0 19716 1235680 184652 8389168 0 0 0 33 2168 1605 5 0 95 0
 1 0 19716 1235132 184652 8389220 0 0 32 1 1665 994 6 0 94 0
 2 0 19716 1234988 184660 8389236 0 0 16 25 2240 1245 7 0 93 0
 1 0 19716 1234956 184660 8389292 0 0 68 5 2231 1324 6 0 94 0
 2 0 19716 1233892 184660 8389312 0 0 16 84 2520 1839 6 0 94 0
 0 0 19716 1233312 184660 8389352 0 0 184 13 1474 930 4 0 96 0
 2 0 19716 1232740 184660 8389556 0 0 0 1 1905 1043 5 0 95 0

I will leave it "hanging" for, let's say, 4 hours and see what happens.

The table, however, looks "operational" (i.e. not locked):
mysql> show processlist;
+------+-------------+-----------+------------+---------+-------+---------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+-------------+-----------+------------+---------+-------+---------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
| 1 | system user | | NULL | Sleep | 43605 | wsrep aborter idle | NULL | 0 | 0 |
| 2 | system user | | NULL | Sleep | 2609 | committed 4951387 | NULL | 0 | 0 |
| 3 | system user | | NULL | Sleep | 2609 | committed 4951394 | NULL | 0 | 0 |
| 4 | system user | | NULL | Sleep | 2609 | committed 4951389 | NULL | 0 | 0 |
| 5 | system user | | NULL | Sleep | 2609 | committed 4951393 ...


Alex Yurchenko (ayurchen) wrote :

Yes, a rollback should happen in this situation and may take a while; 10x the initial execution time is a good bet. And it may be expected that you can't kill a thread during rollback. However, it may also be a bug in handling the wsrep error at this stage. So it would be really great if you could wait a while to see whether the thread gets unblocked.

Ales Perme (ales-perme) wrote :

Alex, I'll wait! I'm also interested myself in what will happen.

Ales Perme (ales-perme) wrote :

The master server stopped with this message:

=====================================
2014-01-20 16:30:39 7f0c7af3b700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 20 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 9135 srv_active, 0 srv_shutdown, 51375 srv_idle
srv_master_thread log flush and writes: 60503
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 652345
--Thread 139691519579904 has waited at lock0lock.cc line 4766 for 935.00 seconds the semaphore:
Mutex at 0x9c86078 '&lock_sys->mutex', lock var 1
waiters flag 1
OS WAIT ARRAY INFO: signal count 2548472
Mutex spin waits 388600, rounds 2017742, OS waits 47705
RW-shared spins 903106, rounds 10322271, OS waits 147512
RW-excl spins 421647, rounds 16753936, OS waits 452880
Spin rounds per wait: 5.19 mutex, 11.43 RW-shared, 39.73 RW-excl
FAIL TO OBTAIN LOCK MUTEX, SKIP LOCK INFO PRINTING
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (read thread)
I/O thread 7 state: waiting for completed aio requests (read thread)
I/O thread 8 state: waiting for completed aio requests (read thread)
I/O thread 9 state: waiting for completed aio requests (read thread)
I/O thread 10 state: waiting for completed aio requests (write thread)
I/O thread 11 state: waiting for completed aio requests (write thread)
I/O thread 12 state: waiting for completed aio requests (write thread)
I/O thread 13 state: waiting for completed aio requests (write thread)
I/O thread 14 state: waiting for completed aio requests (write thread)
I/O thread 15 state: waiting for completed aio requests (write thread)
I/O thread 16 state: waiting for completed aio requests (write thread)
I/O thread 17 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0, 0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0, 0, 0, 0, 0] ,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
3207133 OS file reads, 4384051 OS file writes, 153435 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 18200, seg size 18202, 96998 merges
merged operations:
 insert 661141, delete mark 1535, delete 1
discarded operations:
 insert 0, delete mark 0, delete 0
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 122997258092
Log flushed up to 122997258092
Pages flushed up to 122997258092
Last checkpoint at 122997258092
Max checkpoint age 13914979615
Checkpoint age target 13480136503
Modified age 0
Checkpoint age 0
0 pending log writes, 0 ...

Ales Perme (ales-perme) wrote :

After a restart the daemon is running but is "unconnectable": I can't connect with a client, and it is using disk I/O heavily. I guess I have to wait another 10 hours... :-/ Presumably mysqld is retrying the rollback after the restart, hoping to be more successful running only this single thread, without "external" impulses?

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 4 5 17808 3344272 167180 10741064 0 0 19 229 1 1 5 0 94 0
 2 1 17808 3300160 167204 10747768 0 0 2960 103145 9619 14716 9 1 88 2
 2 2 17808 3256020 167220 10753936 0 0 3388 106185 9899 14819 10 1 87 3
 3 1 17808 3224908 167268 10761292 0 0 3502 117150 11639 18158 11 1 86 3
 4 1 17808 3156580 167288 10768228 0 0 4087 125337 9606 13510 9 1 87 3

Behaviour like this is definitely unusable in a production environment, unfortunately... :-(

I guess we will hear from each other again tomorrow with more feedback.

Ales Perme (ales-perme) wrote :

More feedback: after the "master node" crashed, the node became unusable:

mysql> use cAnalytics;
ERROR 1047 (08S01): Unknown command

I can only list databases but can't actually do anything. I guess I have to perform an SST from the "slave node" to the master.

Ales Perme (ales-perme) wrote :

The cluster went into split-brain since the arbitrator was not running. My bad!

Alex Yurchenko (ayurchen) wrote :

Ok, so it is a bug in wsrep error handling. I reported a separate bug for that.

The separate bug report will close out this issue.

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1095
