Reproducible crash of Percona XtraDB Cluster using large updates

Bug #1269842 reported by Ales Perme on 2014-01-16
Affects: Galera (Status: Undecided, Unassigned)
Affects: Percona XtraDB Cluster, moved to https://jira.percona.com/projects/PXC (status tracked in 5.6)
  5.5: Invalid, Importance: Medium, Unassigned
  5.6: Invalid, Importance: Medium, Unassigned

Bug Description

We operate a three-server Percona XtraDB Cluster, version 5.5.34-31.1, Percona XtraDB Cluster (GPL), Release 31.1, wsrep_25.9.r3928.

The database has a table with the following structure:
CREATE TABLE `docStatsDetail` (
  `date` date NOT NULL,
  `adId` varchar(36) COLLATE utf8_slovenian_ci NOT NULL,
  `userId` varchar(100) COLLATE utf8_slovenian_ci DEFAULT NULL,
  `adImpressions` int(11) DEFAULT '0',
  `adClicks` int(11) DEFAULT '0',
  `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`date`,`adId`),
  KEY `TS` (`ts`),
  KEY `_adId` (`adId`),
  KEY `_userId` (`userId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_slovenian_ci;

The table has 120 million records, and if we try to delete more than 20 million of them using `DELETE FROM docStatsDetail LIMIT 20000000;`, the server crashes with this report:

10:57:55 UTC - mysqld got signal 7 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=14
max_threads=102
thread_count=10
connection_count=10
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 759054 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x5b7cdb0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f31a953ae70 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x7e5c1e]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6bb5a4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f3266e54cb0]
/lib/x86_64-linux-gnu/libc.so.6(+0x14fecd)[0x7f326659becd]
/usr/lib/libgalera_smm.so(_ZN6galera12MappedBuffer7reserveEm+0x263)[0x7f3264b33fa3]
/usr/lib/libgalera_smm.so(_ZN6galera12MappedBuffer6resizeEm+0x19)[0x7f3264b34369]
/usr/lib/libgalera_smm.so(_ZN6galera9TrxHandle5flushEm+0xb5)[0x7f3264b5ce65]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM9replicateEPNS_9TrxHandleEP14wsrep_trx_meta+0xd7)[0x7f3264b72537]
/usr/lib/libgalera_smm.so(galera_pre_commit+0xbb)[0x7f3264b88eeb]
/usr/sbin/mysqld(_Z22wsrep_run_wsrep_commitP3THDP10handlertonb+0xadf)[0x66cf9f]
/usr/sbin/mysqld[0x66db3b]
/usr/sbin/mysqld(_Z15ha_commit_transP3THDb+0x2f5)[0x6be3d5]
/usr/sbin/mysqld(_Z17trans_commit_stmtP3THD+0x29)[0x662489]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x77c)[0x5ac7bc]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x314)[0x5b4534]
/usr/sbin/mysqld[0x5b4e58]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x2502)[0x5b7972]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x167)[0x5b80f7]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x14f)[0x65490f]
/usr/sbin/mysqld(handle_one_connection+0x51)[0x654ae1]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f3266e4ce9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f32665403fd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f317c004b90): is an invalid pointer
Connection ID (thread ID): 48
Status: NOT_KILLED
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
140116 11:57:55 mysqld_safe Number of processes running now: 0
140116 11:57:55 mysqld_safe WSREP: not restarting wsrep node automatically
140116 11:57:55 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

The servers have the following settings:

# ------------------------------------------------------------
# wsrep provider configuration: basic wsrep options
# ------------------------------------------------------------
wsrep_on = 1
wsrep_debug = 0
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_provider_options = "gcache.size=16G;gcache.page_size=512M;"

# An empty gcomm address is used only while the cluster is being bootstrapped. Don't use it later: it will form a new cluster!
#wsrep_cluster_address = gcomm://
wsrep_cluster_address = gcomm://ae-01
wsrep_cluster_name = bolha_cluster

wsrep_node_name = as-02.bolha.com
wsrep_node_address = ae-02

wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = "sstuser:mXXXXXX9"
wsrep_slave_threads = 8

wsrep_notify_cmd = /etc/mysql/wsrep_notify

# Added by AP
wsrep_max_ws_rows = 128K
wsrep_max_ws_size = 4096M

# ------------------------------------------------------------
# InnoDB
# ------------------------------------------------------------
innodb_log_file_size = 4096M
innodb_log_files_in_group = 4
innodb_buffer_pool_size = 1024M

innodb_flush_log_at_trx_commit = 0
innodb_log_buffer_size = 256M
innodb_file_per_table = 1
innodb_open_files = 131072
innodb_io_capacity = 1000
innodb_flush_method = O_DIRECT

# Added by AP
innodb_adaptive_flushing_method = keep_average
innodb_flush_neighbor_pages = none
innodb_max_dirty_pages_pct = 60
innodb_read_io_threads = 8
innodb_write_io_threads = 8

The servers run on SSDs, and the error is reproducible on every node.

The servers are Linux-based:
Linux XXX 3.2.0-57-generic #87-Ubuntu SMP Tue Nov 12 21:35:10 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.4 LTS
Release: 12.04
Codename: precise

PLEASE HELP!

a) Looks like a crash in galera, so adding galera project here as well.

b) Also, this looks similar to https://bugs.launchpad.net/codership-mysql/+bug/1254179

c)

wsrep_max_ws_rows = 128K
wsrep_max_ws_size = 4096M

The large value of ws_size may be causing issues here.
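As a quick sanity check on the configured value (a minimal arithmetic sketch; the 2 GB Galera transaction limit is the one discussed further down this thread):

```python
# wsrep_max_ws_size is set to 4096M above, but Galera's transaction
# (writeset) limit is 2 GB, so the configured value exceeds what can
# ever actually be honoured.
configured_ws_size = 4096 * 2**20   # 4096M from the settings above
galera_limit = 2 * 2**30            # 2 GB Galera transaction limit

print(configured_ws_size)                  # 4294967296
print(configured_ws_size > galera_limit)   # True
```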

@Ales,

d) Does this happen only when a certain limit (in this case 20000000) is exceeded, i.e. DELETE FROM docStatsDetail LIMIT 20000000? If so, it looks more like (c). Also, I see the buffer reserve/resize in the last frames of the crash.

Ales Perme (ales-perme) wrote :

Pranams Raghavendra Prabhu!

Yes, it looks like it is a problem in Galera. I did check (b), but since there is no Percona XtraDB Cluster 5.6 yet, I can't check whether it is the same bug. I'd gladly give it a try.

I did first run with the default XtraDB Cluster settings:
wsrep_max_ws_rows = 128K
wsrep_max_ws_size = 1024M

But I hit the same bug. Because I thought it might be a transaction size issue, I configured a bigger size, to no avail. I also increased gcache.size in wsrep_provider_options from 1G to 16G, without success.

Here are all the options: "base_host = ae-02; base_port = 4567; cert.log_conflicts = no; evs.causal_keepalive_period = PT1S; evs.debug_log_mask = 0x1; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.info_log_mask = 0; evs.install_timeout = PT15S; evs.join_retrans_period = PT1S; evs.keepalive_period = PT1S; evs.max_install_timeouts = 1; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.use_aggregate = true; evs.user_send_window = 2; evs.version = 0; evs.view_forget_timeout = PT5M; gcache.dir = /data/percona/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /data/percona/mysql//galera.cache; gcache.page_size = 512M; gcache.size = 16G; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; gmcast.listen_addr = tcp://0.0.0.0:4567; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT3S; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = ae-02; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.linger = PT20S; pc.npvo = false; pc.version = 0; pc.weight = 1; protonet.backend = asio; protonet.version = 0; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3"

I increased the values because I calculated that at around 20 million records I'd hit a 1 GB transaction size: the average row length is about 58 bytes, so 20,000,000 x 58 bytes = 1.08 GB (give or take a few thousand records, since I can't get the exact row length). So I did some guessing about buffer sizes, log sizes, etc., but without success. Alas! :-)
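Redoing the back-of-the-envelope estimate above (the 58-byte average row length is the reporter's estimate, not a measured value):

```python
# ~58 bytes per row times 20 million rows is a little over 1 GiB of raw
# row data in a single transaction.
rows = 20_000_000
avg_row_bytes = 58                    # estimated average row length

total_bytes = rows * avg_row_bytes
print(total_bytes)                    # 1160000000
print(round(total_bytes / 2**30, 2))  # 1.08 (GiB)
```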

Re (d): yes, this happens at a certain limit. I can delete 15 million records without a problem. I just can't pin down the exact threshold, since it would take a lot of time: it takes me about an hour to restore the cluster, so finding the exact number would be a lengthy operation... You can understand.

I can successfully perform deletes or INSERT INTO xxx SELECT * FROM docStatsDetail... using LIMIT offset, count. The crash happens with large transactions, which is why I thought it might be a problem with the sizes...

I would understand if MySQL reported an error, but a crash that takes down the cluster?! That is nasty... For a "quick" fix I usually bootstrap from a healthy cluster node and then perform an SST, but in a production environment this is unacceptable... :-(

What should I do? And thank you again for all the help.

Kind regards,
Ales

Alex Yurchenko (ayurchen) wrote :

Correct me if I'm wrong, but from what I understand the crash is happening only on the master node. And it is happening at the stage of writeset creation, so it does not look like https://bugs.launchpad.net/codership-mysql/+bug/1254179.

Signal 7 is a "bus error" and must be caused by a bug in how Galera 2.x manages its mmapped output buffer; it is definitely related to sizes. Usually signal 7 is the result of an mmapped file not fitting on disk. How much free disk space is there in the mysql data directory? Note that with such a big transaction you need more than double the disk space, since the transactional IO cache will be spilled to disk as well.
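That failure mode can be reproduced in miniature. The sketch below is an illustration, not taken from the bug report: it is Linux-specific and assumes 4 KiB pages. Writing through a MAP_SHARED mapping whose backing file has shrunk delivers SIGBUS, which is signal 7 on Linux, the same signal seen in the mysqld crash report above.

```python
import signal
import subprocess
import sys

# Child process: create a 4-page temp file, mmap it, shrink the file
# underneath the mapping, then write to a page that no longer has backing
# store. On Linux the kernel answers with SIGBUS ("bus error", signal 7).
child = r"""
import mmap, os, tempfile
fd, path = tempfile.mkstemp()
os.unlink(path)                 # file lives only as long as the fd
os.ftruncate(fd, 4096 * 4)      # file backs 4 pages
m = mmap.mmap(fd, 4096 * 4)     # MAP_SHARED mapping over all 4 pages
os.ftruncate(fd, 4096)          # shrink: the later pages lose backing
m[4096 * 2] = 1                 # touch an unbacked page -> SIGBUS
"""

proc = subprocess.run([sys.executable, "-c", child])
# A child killed by a signal reports minus the signal number.
print(proc.returncode == -signal.SIGBUS)  # True on Linux
```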

However,

1) ATM there is no effective check for a transaction exceeding wsrep_max_ws_size.
2) A transaction writeset contains not only row data but also all unique keys, which take additional, arbitrary space.
3) The Galera transaction limit is 2 GB. In Galera 2.x it is checked at replication time, but the crash happens well before that.
4) 20M rows: no replication will thank you for that. The pt-archiver tool is the way to prune large tables. Anyway, with Galera 2.x it will likely require several gigabytes of spare RAM to process such a writeset, which you likely don't have. And if you don't have swap it will get even uglier. So if you insist on going that way, you should be using Galera 3.x.
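Point 4 is about keeping each transaction, and hence each writeset, small; that is what pt-archiver automates by deleting in bounded chunks and committing per chunk. A minimal sketch of the same idea, using SQLite from the Python standard library as a stand-in for MySQL (the table name mirrors the report; the row count and batch size are arbitrary):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docStatsDetail (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO docStatsDetail (payload) VALUES (?)",
                 [("row",)] * 10_000)
conn.commit()

BATCH = 1_000  # small enough that no single transaction gets huge

while True:
    # Delete a bounded chunk per transaction; pt-archiver does essentially
    # this against MySQL, committing after every chunk.
    cur = conn.execute(
        "DELETE FROM docStatsDetail WHERE id IN "
        "(SELECT id FROM docStatsDetail ORDER BY id LIMIT ?)",
        (BATCH,))
    conn.commit()
    if cur.rowcount == 0:
        break

remaining = conn.execute(
    "SELECT COUNT(*) FROM docStatsDetail").fetchone()[0]
print(remaining)  # 0
```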

Ales Perme (ales-perme) wrote :

Hi Alex, Raghavendra Prabhu...

The error is happening only on the master node, yes, you are right. I did some digging on signal 7 yesterday; since it can be a consequence of a hardware error, I tested the error on more than one node. It is reproducible, and thus a software bug/feature rather than a hardware problem.

Thank you for confirming that it is a Galera 2.x bug.

Regarding 1,2,3 & 4:

1. :-( Baaad... :((((
2. Okay, it makes sense, yes.
3. OMG, only 2G?! And it looks like it happens before the check.
4. I can delete with a TRUNCATE, but I can't do the insert with one "INSERT...SELECT..." statement. I could use limits, though. I would sleep better knowing the database will remain intact if some sysadmin misses the point and forgets to use the limits, or uses improper ones. RAM? The server has 32 GB. Disk? Plenty of room: 190 GB.

So it looks like I'll have to wait for Percona 5.6 with Galera 3.x and try again to see if it works any better. Am I right?

information type: Private Security → Private
Ales Perme (ales-perme) wrote :

I'm in the process of upgrading to the Percona XtraDB Cluster 5.6 RC with Galera 3.x. I will contact you soon if the bug/behaviour remains the same.

information type: Private → Public
Ales Perme (ales-perme) on 2014-01-17
description: updated

Note: if you want to test Galera 3.x, you can use it with PXC 5.5 as well. You need to install Percona-XtraDB-Cluster-galera-3 instead of Percona-XtraDB-Cluster-galera-2.

Ales Perme (ales-perme) wrote :

Just about to finalize the installation of PXC 5.6... then the testing starts.

Ales Perme (ales-perme) wrote :

A test with 25 million records works, and PXC 5.6 looks faster. I will let you know if I hit another wall. I have a dataset of roughly 95 million records.

Ales Perme (ales-perme) wrote :

I can successfully transfer 50 million records but hit a wall at 60 million, with this message in the mysql.err log:

2014-01-20 08:57:04 195615 [ERROR] WSREP: Maximum wirteset size exceeded by 129676357: 90 (Message too long)
         at galera/src/write_set_ng.hpp:check_size():652
2014-01-20 08:57:04 195615 [ERROR] WSREP: unknown connection failure

The good news is that the server does not crash.

The thread, however, hangs on the master server even if I kill it:
mysql> INSERT INTO docStatsDetail_2013 SELECT * FROM docStatsDetail WHERE date>='2013-01-01' AND date <='2013-12-31';

There seems to be no disk I/O, which suggests it is in a "hanging" state:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 3 0 19716 1235680 184652 8389168 0 0 0 33 2168 1605 5 0 95 0
 1 0 19716 1235132 184652 8389220 0 0 32 1 1665 994 6 0 94 0
 2 0 19716 1234988 184660 8389236 0 0 16 25 2240 1245 7 0 93 0
 1 0 19716 1234956 184660 8389292 0 0 68 5 2231 1324 6 0 94 0
 2 0 19716 1233892 184660 8389312 0 0 16 84 2520 1839 6 0 94 0
 0 0 19716 1233312 184660 8389352 0 0 184 13 1474 930 4 0 96 0
 2 0 19716 1232740 184660 8389556 0 0 0 1 1905 1043 5 0 95 0

I will leave it "hanging" for, let's say, 4 hours and see what happens.

The table, however, looks "operational" (i.e. not locked):
mysql> show processlist;
+------+-------------+-----------+------------+---------+-------+---------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+-------------+-----------+------------+---------+-------+---------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
| 1 | system user | | NULL | Sleep | 43605 | wsrep aborter idle | NULL | 0 | 0 |
| 2 | system user | | NULL | Sleep | 2609 | committed 4951387 | NULL | 0 | 0 |
| 3 | system user | | NULL | Sleep | 2609 | committed 4951394 | NULL | 0 | 0 |
| 4 | system user | | NULL | Sleep | 2609 | committed 4951389 | NULL | 0 | 0 |
| 5 | system user | | NULL | Sleep | 2609 | committed 4951393 ...


Alex Yurchenko (ayurchen) wrote :

Yes, a rollback should happen in this situation and may take a while; 10x the initial execution time is a good bet. And it may be expected that you can't kill a thread during rollback. However, it may also be a bug in handling the wsrep error at this stage. So it would be really great if you could wait a while to see whether the thread gets unblocked.

Ales Perme (ales-perme) wrote :

Alex, I'll wait! I'm also interested myself in what will happen.

Ales Perme (ales-perme) wrote :

The master server stopped with this message:

=====================================
2014-01-20 16:30:39 7f0c7af3b700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 20 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 9135 srv_active, 0 srv_shutdown, 51375 srv_idle
srv_master_thread log flush and writes: 60503
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 652345
--Thread 139691519579904 has waited at lock0lock.cc line 4766 for 935.00 seconds the semaphore:
Mutex at 0x9c86078 '&lock_sys->mutex', lock var 1
waiters flag 1
OS WAIT ARRAY INFO: signal count 2548472
Mutex spin waits 388600, rounds 2017742, OS waits 47705
RW-shared spins 903106, rounds 10322271, OS waits 147512
RW-excl spins 421647, rounds 16753936, OS waits 452880
Spin rounds per wait: 5.19 mutex, 11.43 RW-shared, 39.73 RW-excl
FAIL TO OBTAIN LOCK MUTEX, SKIP LOCK INFO PRINTING
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (read thread)
I/O thread 7 state: waiting for completed aio requests (read thread)
I/O thread 8 state: waiting for completed aio requests (read thread)
I/O thread 9 state: waiting for completed aio requests (read thread)
I/O thread 10 state: waiting for completed aio requests (write thread)
I/O thread 11 state: waiting for completed aio requests (write thread)
I/O thread 12 state: waiting for completed aio requests (write thread)
I/O thread 13 state: waiting for completed aio requests (write thread)
I/O thread 14 state: waiting for completed aio requests (write thread)
I/O thread 15 state: waiting for completed aio requests (write thread)
I/O thread 16 state: waiting for completed aio requests (write thread)
I/O thread 17 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0, 0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0, 0, 0, 0, 0] ,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
3207133 OS file reads, 4384051 OS file writes, 153435 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 18200, seg size 18202, 96998 merges
merged operations:
 insert 661141, delete mark 1535, delete 1
discarded operations:
 insert 0, delete mark 0, delete 0
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 122997258092
Log flushed up to 122997258092
Pages flushed up to 122997258092
Last checkpoint at 122997258092
Max checkpoint age 13914979615
Checkpoint age target 13480136503
Modified age 0
Checkpoint age 0
0 pending log writes, 0 ...

Ales Perme (ales-perme) wrote :

After a restart the daemon is running but is "unconnectable": I can't connect with a client, and it is using disk I/O heavily. I guess I have to wait another 10 hours... :-/ Presumably mysqld is retrying the rollback after the restart, hoping to be more successful running only this single thread, without "external" impulses?

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r b swpd free buff cache si so bi bo in cs us sy id wa
 4 5 17808 3344272 167180 10741064 0 0 19 229 1 1 5 0 94 0
 2 1 17808 3300160 167204 10747768 0 0 2960 103145 9619 14716 9 1 88 2
 2 2 17808 3256020 167220 10753936 0 0 3388 106185 9899 14819 10 1 87 3
 3 1 17808 3224908 167268 10761292 0 0 3502 117150 11639 18158 11 1 86 3
 4 1 17808 3156580 167288 10768228 0 0 4087 125337 9606 13510 9 1 87 3

Behaviour like this is definitely unusable in a production environment, unfortunately... :-(

I guess we will hear from each other again tomorrow with more feedback.

Ales Perme (ales-perme) wrote :

More feedback: after the "master node" crashed, the node became unusable:

mysql> use cAnalytics;
ERROR 1047 (08S01): Unknown command

I can only list databases but can't actually do anything. I guess I have to perform an SST from the "slave node" to the master.

Ales Perme (ales-perme) wrote :

The cluster went into split-brain since the arbitrator was not running. My bad!

Alex Yurchenko (ayurchen) wrote :

Ok, so it is a bug in wsrep error handling. I reported a separate bug for that.

The separate bug report will close out this issue.

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1095
