cluster crashes on importing data

Bug #1267507 reported by Axel on 2014-01-09
This bug affects 5 people
Affects                    Importance   Assigned to
Galera                     Undecided    Alex Yurchenko
Percona XtraDB Cluster     (status tracked in 5.6)
  5.5                      Undecided    Unassigned
  5.6                      Undecided    Unassigned

Bug Description

OS: Ubuntu 12.04 x64 - ALL updates made!
uname: Linux srzotrsDB01 3.5.0-45-generic #68~precise1-Ubuntu SMP Wed Dec 4 16:18:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

running under VMWare with 8GB RAM
top shows mysqld is using 4.5G RAM right before it crashes

mysql -V:
mysql Ver 14.14 Distrib 5.6.15, for Linux (x86_64) using EditLine wrapper

phpMyAdmin shows:
Server-Type: Percona Server
Server Version: 5.6.15-62.0-log - Percona XtraDB Cluster (GPL), Release 62.0, wsrep_25.2.r4027

==> output of mysql client that is trying to load the data:

User time 493.34, System time 38.32
Maximum resident set size 33532, Integral resident set size 0
Non-physical pagefaults 9844, Physical pagefaults 178, Swaps 0
Blocks in 51202720 out 51202688, Messages in 0 out 0, Signals 0
Voluntary context switches 703541, Involuntary context switches 1211587
Thu Jan 9 14:44:00 CET 2014

==> /var/log/mysql/error.log <==
2014-01-09 14:36:49 2070 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 134217728 bytes
13:36:52 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=268435456
read_buffer_size=268435456
max_used_connections=4
max_threads=252
thread_count=3
connection_count=3
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 198446771 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7faa18000990
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7faa503f2a50 thread_stack 0x80000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x8e91ae]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x68c1e4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7faa52a56cb0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7faa51eab425]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b)[0x7faa51eaeb8b]
/usr/lib/libgalera_smm.so(_ZN6galera13Certification16purge_for_trx_v3EPNS_9TrxHandleE+0x245)[0x7faa50704295]
/usr/lib/libgalera_smm.so(_ZN6galera13Certification16purge_trxs_upto_Elb+0x158)[0x7faa507058c8]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM18process_commit_cutEll+0x85)[0x7faa50732215]
/usr/lib/libgalera_smm.so(_ZN6galera15GcsActionSource8dispatchEPvRK10gcs_actionRb+0x405)[0x7faa50713d75]
/usr/lib/libgalera_smm.so(_ZN6galera15GcsActionSource7processEPvRb+0x5e)[0x7faa507148ee]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x78)[0x7faa50739958]
/usr/lib/libgalera_smm.so(galera_recv+0x1e)[0x7faa5074ec8e]
/usr/sbin/mysqld[0x5d7ab1]
/usr/sbin/mysqld(start_wsrep_THD+0x41d)[0x5c09ad]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7faa52a4ee9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7faa51f693fd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 2
Status: NOT_KILLED

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
140109 14:36:57 mysqld_safe Number of processes running now: 0
140109 14:36:57 mysqld_safe WSREP: not restarting wsrep node automatically
140109 14:36:57 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
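
For reference, the memory estimate printed in the crash banner can be roughly reproduced from the logged values. The following is only an illustrative sketch in Python, not part of the original report: the function name is made up, and sort_buffer_size = 512M is an assumption taken from the my.cnf posted further down in this thread, since the banner itself does not print it.

```python
def banner_memory_estimate_kib(key_buffer_size, read_buffer_size,
                               sort_buffer_size, max_threads):
    """Evaluate the formula printed in the mysqld crash banner, in KiB."""
    total_bytes = key_buffer_size + (read_buffer_size + sort_buffer_size) * max_threads
    return total_bytes // 1024


MIB = 1024 * 1024
estimate = banner_memory_estimate_kib(
    key_buffer_size=268435456,    # from the banner (256M)
    read_buffer_size=268435456,   # from the banner (256M)
    sort_buffer_size=512 * MIB,   # assumption: value from the posted my.cnf
    max_threads=252,              # from the banner
)
print(estimate)  # 198443008
```

The result is within a few thousand KiB of the logged 198446771 K; the remainder presumably comes from overhead that the printed formula does not show. Either way, the worst-case figure is roughly 189 GiB on a VM with 8GB RAM, so the buffer settings are far larger than the machine could ever back if many threads were active.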

Related branches

lp:galera
David Bennett: Pending requested 2014-07-25
Seppo Jaakola (seppo-jaakola) wrote :

How are you importing data to this cluster?

You are probably trying to run one huge transaction, which chokes the first node. If so, try to find a way to split your import into a series of reasonably sized transactions.
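
One way to split an import into bounded transactions is sketched below in Python. This is only an illustration under stated assumptions: each input element is one complete SQL statement (for example one INSERT per line, as mysqldump usually writes them), the function name and the batch size are made up, and only DML should be grouped this way because DDL statements commit implicitly in MySQL.

```python
from itertools import islice


def batched_transactions(statements, batch_size=1000):
    """Yield SQL text in which every batch_size statements share one transaction."""
    it = iter(statements)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield "START TRANSACTION;\n" + "\n".join(batch) + "\nCOMMIT;\n"


# Tiny demonstration with a hypothetical table `t`.
demo = [f"INSERT INTO t VALUES ({i});" for i in range(5)]
chunks = list(batched_transactions(demo, batch_size=2))
print(len(chunks))  # 3
```

Each yielded chunk can be piped to the mysql client in turn, so that no single transaction grows beyond batch_size statements.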

Alex Yurchenko (ayurchen) wrote :

It was not a very big transaction: the GCache page created was only 128MB. How easily can this be reproduced?

Axel (ajurak) wrote :

hi,

thank you for your quick replies!
:)

data is imported "locally" via mysql client.
i tried it with 3 nodes, and just with one in bootstrap-pxc mode.
same result.

the line is not exactly the same every time, so i think it is not a single "INSERT" statement that is wrong.
but it crashes within a range of +/- 100 lines (insert statements)

oh, with a 5.5 version this was no problem!
just this "brand new" 5.6 version has trouble.

what other info do you require?

I did the import many times, so i made this script to get the same results every time:

IMPORT SCRIPT: /storage/admin/scripts/do_import__test.sh
#!/bin/sh

( date; mysql --show-warnings --unbuffered --debug-info -v -v -v -f otrs_test < /storage/dbbackup/otrs_prod.sql; date ) > /storage/log/import__test.log 2>&1

# EOF

----------------------------------------------

the SQL import file :
-rw-r--r-- 1 root root 25G Jan 4 02:45 /storage/dbbackup/otrs_prod.sql

CONFIG: egrep -v '^#' my.cnf

[client]
port = 3306
socket = /var/run/mysqld/mysqld.sock

[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0

[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql

bind-address = 0.0.0.0

myisam-recover = BACKUP

explicit_defaults_for_timestamp = true

slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow_queries.log

log-output = FILE

long_query_time = 10000
log-queries-not-using-indexes = 1

general_log_file = /var/log/mysql/general.log
general-log = 0

log-error = /var/log/mysql/error.log

log_warnings = 3
verbose

join_buffer_size = 128M

thread_stack = 512k
thread_cache_size = 80

table_open_cache = 1024

key_buffer_size = 256M
max_allowed_packet = 64M
net_buffer_length = 1M

read_rnd_buffer_size = 256M
read_buffer_size = 256M
sort_buffer_size = 512M

myisam_sort_buffer_size = 1024M

max_connections = 250
wait_timeout = 600
interactive_timeout = 3000

tmp_table_size = 2048M
max_tmp_tables = 256

max_heap_table_size = 512M

innodb_data_file_path = ibdata1:10M:autoextend

innodb_buffer_pool_size = 4096M

innodb_log_file_size = 128M
innodb_log_buffer_size = 128M

innodb_file_per_table

default_storage_engine = InnoDB

query_cache_type = 0
query_cache_size = 0

log_slave_updates = 1

skip-name-resolve
transaction-isolation = 'READ-COMMITTED'
expire_logs_days = 2
max_binlog_size = 100M
binlog_format = ROW

log_bin = /storage/bin-logs/DB01/mysql-bin.log

wsrep_cluster_name = "Percona-XtraDB-Cluster-LLL-01"
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_notify_cmd = /etc/mysql/bin/wsrep_notify

wsrep_cluster_address = gcomm://10.10.17.73:4567,10.10.17.74:4567,10.10.17.75:4567

server-id = 101
wsrep_node_name = "LLLC1Node1"
wsrep_node_address = 10.10.17.7...


Alex Yurchenko (ayurchen) wrote :

1) Any chance to test it with codership's binaries? https://launchpad.net/galera/+download https://launchpad.net/codership-mysql/+download
2) Any chance for us to get our hands on otrs_prod.sql (the first part that crashes the cluster)? How many insert statements on average does it go through before the crash?

Seppo Jaakola (seppo-jaakola) wrote :

Does it always crash when inserting to same table? Can you show the crashing table definition?

Axel (ajurak) wrote :

Hi Alex, Hi Seppo,

thank you for your answers.

1)
The cluster is in its testing phase, so it shouldn't be a problem to test "your" binaries.

2)
Sorry, no, this is not possible. This data may not leave the company.

3)
The software itself is OTRS 3.2.8, the structure of the table is:

CREATE TABLE IF NOT EXISTS `article_attachment` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `article_id` bigint(20) NOT NULL,
  `filename` varchar(250) DEFAULT NULL,
  `content_size` varchar(30) DEFAULT NULL,
  `content_type` text,
  `content_id` varchar(250) DEFAULT NULL COMMENT 'bei Upgrade von 2.4 auf 3.0.11 von graz4u hinzugefügt',
  `content_alternative` varchar(50) DEFAULT NULL COMMENT 'bei Upgrade von 2.4 auf 3.0.11 von graz4u hinzugefügt',
  `content` longblob NOT NULL,
  `create_time` datetime NOT NULL,
  `create_by` int(11) NOT NULL,
  `change_time` datetime NOT NULL,
  `change_by` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `article_attachment_article_id` (`article_id`),
  KEY `create_by` (`create_by`),
  KEY `change_by` (`change_by`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=28344 ;

It should have 23076 rows, but it mostly crashes around 20233 (+/- 30) rows.

4) to round out the picture:
- first, the article table is filled. it has 9450187 rows and is about 16.7 GByte in size.
- second, the article_attachment table is loaded: 23076 rows, and it should be about 2 GByte in size.

it always crashes at this table.

i will try to create a new export only of the article_attachments table, which i will try to import into the cluster.

Axel (ajurak) wrote :

oke, i just imported ONLY the article_attachment table.
same result.
except, that this time it crashes with signal 11, not 6.

with the last tests i was always running the "cluster" in bootstrap-pxc mode ....

-----------

first try: 20446 rows (AND the following log)

2014-01-10 14:47:48 2502 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 134217728 bytes
2014-01-10 14:47:49 2502 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000
13:48:00 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=268435456
read_buffer_size=268435456
max_used_connections=3
max_threads=252
thread_count=3
connection_count=3
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 198446771 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x9568f70
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f90b0125e00 thread_stack 0x80000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x8e91ae]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x68c1e4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f90dbc3bcb0]
/usr/lib/libgalera_smm.so(_ZNSt3tr110_HashtableIPN6galera10KeyEntryNGES3_SaIS3_ESt9_IdentityIS3_ENS1_18KeyEntryPtrEqualNGENS1_17KeyEntryPtrHashNGENS_8__detail18_Mod_range_hashingENS9_20_Default_ranged_hashENS9_20_Prime_rehash_policyELb0ELb1ELb1EE4findERKS3_+0x51)[0x7f90bcadac31]
/usr/lib/libgalera_smm.so(_ZN6galera13Certification10do_test_v3EPNS_9TrxHandleEb+0x18d)[0x7f90bcad731d]
/usr/lib/libgalera_smm.so(_ZN6galera13Certification7do_testEPNS_9TrxHandleEb+0x3ef)[0x7f90bcad986f]
/usr/lib/libgalera_smm.so(_ZN6galera13Certification4testEPNS_9TrxHandleEb+0x28)[0x7f90bcad9cc8]
/usr/lib/libgalera_smm.so(_ZN6galera13Certification10append_trxEPNS_9TrxHandleE+0xcc)[0x7f90bcad9dac]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM4certEPNS_9TrxHandleE+0x88)[0x7f90bcb03ee8]
/usr/lib/libgalera_smm.so(_ZN6galera13ReplicatorSMM10pre_commitEPNS_9TrxHandleEP14wsrep_trx_meta+0x59)[0x7f90bcb053e9]
/usr/lib/libgalera_smm.so(galera_pre_commit+0x140)[0x7f90bcb1ce30]
/usr/sbin/mysqld(_Z22wsrep_run_wsrep_commitP3THDP10handlertonb+0xa17)[0x7ac1e7]
/usr/sbin/mysqld[0x7acd2b]
/usr/sbin/mysqld(_Z14ha_prepare_lowP3THDb+0x87)[0x5de227]
/usr/sbin/mysqld(_Z15ha_commit_transP3THDbb+0x17a)[0x5dd38a]
/usr/sbin/mysqld(_Z17trans_commit_stmtP3THD+0x37)[0x791d67]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x97c)[0x70639c]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x5f8)[0x70d678]
/usr/sbin/mysqld[0x70def8]
/usr/sbin/mys...

Axel (ajurak) wrote :

could HW-acceleration be the problem since this is a VMWare box?

2014-01-10 15:14:37 8029 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2014-01-10 15:14:37 8029 [Note] WSREP: wsrep_load(): Galera 3.2(r170) by Codership Oy <email address hidden> loaded successfully.
2014-01-10 15:14:37 8029 [Note] WSREP: CRC-32C: using hardware acceleration.
2014-01-10 15:14:37 8029 [Note] WSREP: Found saved state: ccbfba4f-0f57-11e3-b322-53da687002f1:-1
2014-01-10 15:14:37 8029 [Note] WSREP: Passing config to GCS: base_host = 10.10.17.73; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.proto_max = 5
2014-01-10 15:14:37 8029 [Note] WSREP: Assign initial position for certification: 151134, protocol version: -1
2014-01-10 15:14:37 8029 [Note] WSREP: wsrep_sst_grab()
2014-01-10 15:14:37 8029 [Note] WSREP: Start replication
2014-01-10 15:14:37 8029 [Note] WSREP: Setting initial position to ccbfba4f-0f57-11e3-b322-53da687002f1:151134
2014-01-10 15:14:37 8029 [Note] WSREP: protonet asio version 0
2014-01-10 15:14:37 8029 [Note] WSREP: Using CRC-32C (optimized) for message checksums.
2014-01-10 15:14:37 8029 [Note] WSREP: backend: asio
2014-01-10 15:14:37 8029 [Note] WSREP: GMCast version 0
2014-01-10 15:14:37 8029 [Note] WSREP: (89bb5ae3-7a01-11e3-a0fc-1754c0c173de, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2014-01-10 15:14:37 8029 [Note] WSREP: (89bb5ae3-7a01-11e3-a0fc-1754c0c173de, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2014-01-10 15:14:37 8029 [Note] WSREP: EVS version 0
2014-01-10 15:14:37 8029 [Note] WSREP: PC version 0

---

however, i will try to install your binary versions and run the tests again.....

Axel (ajurak) wrote :

HA! Your binaries work! :)

All 23076 rows can be inserted! Nice.

Now I'll try to load the complete DB....

OK, that worked as well.

---------------

But, I can NOT load the libgalera provider:
2014-01-11 01:32:55 12174 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2014-01-11 11:24:11 14543 [Note] WSREP: Initial position: bfecfd3a-7a57-11e3-abaa-031560a01952:0
2014-01-11 11:24:11 14543 [Note] WSREP: wsrep_load(): loading provider library 'none'
2014-01-11 11:24:11 14543 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.14-log' socket: '/var/run/mysqld/mysqld.sock' port: 3306 MySQL Community Server (GPL), wsrep_25.1.r4019

If i start the node with "/etc/init.d/mysql start" it looks like the config file is not read, or only half read????
very strange!
some vars are set, some not.

these are not set:
wsrep-cluster-name = my_wsrep_cluster
wsrep-notify-cmd = <empty>
wsrep-provider = <empty>

these are set to correct node name from cfg - just picked a few:
wsrep-node-name
log-bin
innodb-data-file-path
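
To see which wsrep options a given option file actually sets, the file can be parsed offline. The sketch below is a hedged Python illustration: the function name is made up, it reads only the text it is given (mysqld merges several option files and command-line flags, so this can differ from the effective configuration), and it treats '-' and '_' in option names as equivalent.

```python
import configparser


def wsrep_settings(cnf_text, section="mysqld"):
    """Return wsrep options from one my.cnf section, with '-' normalised to '_'."""
    # MySQL-specific directives such as !includedir are not INI syntax; drop them.
    lines = [ln for ln in cnf_text.splitlines() if not ln.lstrip().startswith("!")]
    parser = configparser.ConfigParser(allow_no_value=True, strict=False,
                                       interpolation=None)
    parser.read_string("\n".join(lines))
    if not parser.has_section(section):
        return {}
    return {
        key.replace("-", "_"): (value or "").strip('"')
        for key, value in parser.items(section)
        if key.replace("-", "_").startswith("wsrep_")
    }


# Excerpt modelled on the my.cnf posted earlier in this thread.
sample = """
[mysqld]
verbose
wsrep_cluster_name = "Percona-XtraDB-Cluster-LLL-01"
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep-notify-cmd = /etc/mysql/bin/wsrep_notify
"""
found = wsrep_settings(sample)
print(found)
```

Comparing this output with what the running server reports (for example via SHOW VARIABLES LIKE 'wsrep%') should show whether the file being parsed is the one the server really read.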

---------------

This is really not a stable environment.
I think I'll just go back to 5.5 and try the upgrade to 5.6 in a year or so... :/

Axel (ajurak) wrote :

oops! I found something very strange!

The Server loaded all data, but the table size is by far smaller! ????

See the attachment - this is the galera server with your binaries

---------------

Server-Type: MySQL
Server Version: 5.6.14-log - MySQL Community Server (GPL), wsrep_25.1.r4019

article_attachment:

Space usage
Data 616.5 MiB
Index 544 KiB
Total 617 MiB

Row statistics
Format Compact
Collation utf8_general_ci
Next autoindex 28,344
Rows 23076 total

---------------

Server-Type: MySQL
Server Version: 5.1.69-0ubuntu0.10.04.1-log - (Ubuntu)

article_attachment

Space usage
Data 2 GiB
Index 1.4 MiB
Total 2 GiB

Row statistics
Format Compact
Collation utf8_general_ci
Next autoindex 28,344
Rows 23076 total

Axel (ajurak) wrote :

this is the production system running under standard mysql 5.1 under ubuntu 10.04

Alex Yurchenko (ayurchen) wrote :

CRC acceleration can't have any effect there, it is engaged only at replication stage. The crash happens way after.

No wonder it worked this time - galera was not loaded, and the crash happens in galera. It looks like wsrep_provider was not configured in the effective my.cnf.

I would not worry about physical size on disk that much - 5.6 may be using a different file format. What's important is the number of rows and the table definition.

Alex Yurchenko (ayurchen) wrote :

Could you please try with the attached galera binary? It is an unstripped debug build. It would also be great if you could attach gdb to the mysqld process before running the load and print the stack trace when it crashes.

Axel (ajurak) wrote :

Hi Alex,

you are right.
That's what I said. Somehow the my.cnf file is NOT fully loaded. See #9.

When I start the server without the init script like this:
/usr/sbin/mysqld --wsrep_provider=/usr/lib/galera/libgalera_smm.so --wsrep_cluster_address=gcomm://

With this setup the wsrep provider is loaded and working (checked it with: show status like 'wsrep%'; )
OK, fine.

But the question remains why the SAME config is working with percona-56 but not with "your" binaries.
I tried your init script & the one from percona - both do not work with your binary, the my.cnf is only partially read...

Then the import is NOT working:

2014-01-11 13:55:04 26168 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 134217728 bytes
2014-01-11 13:55:20 26168 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000
2014-01-11 14:00:54 26168 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000001 of size 134217728 bytes
13:00:59 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=268435456
read_buffer_size=268435456
max_used_connections=4
max_threads=250
thread_count=3
connection_count=3
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 196873675 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fde80000990
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fde9669da70 thread_stack 0x80000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x8d52ce]
/usr/sbin/mysqld(handle_fatal_signal+0x481)[0x6adba1]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7fdec0fa8cb0]
/usr/lib/galera/libgalera_smm.so(_ZN6galera13Certification16purge_for_trx_v3EPNS_9TrxHandleE+0x27d)[0x7fdea104cf7d]
/usr/lib/galera/libgalera_smm.so(_ZNK6galera13Certification15PurgeAndDiscardclERSt4pairIKlPNS_9TrxHandleEE+0xc0)[0x7fdea1054cb0]
/usr/lib/galera/libgalera_smm.so(_ZSt8for_eachISt17_Rb_tree_iteratorISt4pairIKlPN6galera9TrxHandleEEENS3_13Certification15PurgeAndDiscardEET0_T_SB_SA_+0x2c)[0x7fdea1054edc]
/usr/lib/galera/libgalera_smm.so(_ZN6galera13Certification16purge_trxs_upto_Elb+0x72)[0x7fdea104e722]
/usr/lib/galera/libgalera_smm.so(_ZN6galera13Certification15purge_trxs_uptoElb+0x4b)[0x7fdea107c2bb]
/usr/lib/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM18process_commit_cutEll+0x94)[0x7fdea1076864]
/usr/lib/galera/libgalera_smm.so(_ZN6galera15GcsActionSource8dispatchEPvRK10gcs_actionRb+0x43d)[0x7fdea105a8ad]
/usr/lib/galera/libgalera_smm.so(_ZN6galera15GcsActionSource7processEPvRb+0x5b)[0x7fdea105b4eb]
/usr/lib/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x63)[0x7fdea107ad03]
/usr/lib/galera/libgalera_smm.so(galera_recv+...


Axel (ajurak) wrote :

started with:
/usr/sbin/mysqld --wsrep_provider=/usr/lib/galera/libgalera_smm.so --wsrep_cluster_address=gcomm://

(/usr/lib/galera/libgalera_smm.so is the one from you)

-------------------------------

session from gdb:

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7f1632a67700 (LWP 18364)]
0x00007f165bd1c425 in raise () from /lib/x86_64-linux-gnu/libc.so.6

(gdb) bt
#0 0x00007f165bd1c425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f165bd1fb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f165bd150ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f165bd15192 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f16587ebaa2 in galera::Certification::purge_for_trx_v3 (this=0x22852f8,
    trx=0x7f14e525cf30) at galera/src/certification.cpp:94
#5 0x00007f16587ebbf7 in galera::Certification::purge_for_trx (this=0x22852f8,
    trx=0x7f14e525cf30) at galera/src/certification.cpp:115
#6 0x00007f16587f1920 in galera::Certification::PurgeAndDiscard::operator() (
    this=0x7f1632a658a0, vt=...) at galera/src/certification.hpp:147
#7 0x00007f16587f2cc0 in std::for_each<std::_Rb_tree_iterator<std::pair<long const, galera::TrxHandle*> >, galera::Certification::PurgeAndDiscard> (__first=..., __last=..., __f=...)
    at /usr/include/c++/4.6/bits/stl_algo.h:4379
#8 0x00007f16587ef699 in galera::Certification::purge_trxs_upto_ (this=0x22852f8,
    seqno=15300, handle_gcache=true) at galera/src/certification.cpp:919
#9 0x00007f1658820001 in galera::Certification::purge_trxs_upto (this=0x22852f8, seqno=15300,
    handle_gcache=true) at galera/src/certification.hpp:82
#10 0x00007f165881be1e in galera::ReplicatorSMM::process_commit_cut (this=0x2284a30,
    seq=15300, seqno_l=21866) at galera/src/replicator_smm.cpp:1245
#11 0x00007f16587f9db5 in galera::GcsActionSource::dispatch (this=0x2285040,
    recv_ctx=0x7f1628000990, act=..., exit_loop=@0x7f1632a6658d: false)
    at galera/src/gcs_action_source.cpp:127
#12 0x00007f16587fa0a8 in galera::GcsActionSource::process (this=0x2285040,
    recv_ctx=0x7f1628000990, exit_loop=@0x7f1632a6658d: false)
    at galera/src/gcs_action_source.cpp:177
#13 0x00007f1658816ef8 in galera::ReplicatorSMM::async_recv (this=0x2284a30,
    recv_ctx=0x7f1628000990) at galera/src/replicator_smm.cpp:352
#14 0x00007f165883307e in galera_recv (gh=0x2259b10, recv_ctx=0x7f1628000990)
    at galera/src/wsrep_provider.cpp:222
#15 0x0000000000609215 in ?? ()
#16 0x0000000080040b00 in ?? ()
#17 0x0000000100000002 in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb)

-------------------------------

Content of the Log:

2014-01-11 18:11:43 18358 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 134217728 bytes
2014-01-11 18:12:00 18358 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.000000
2014-01-11 18:18:18 18358 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000001 of size 134217728 bytes
mysqld: galera/src/certification.cpp:94: void galera::Certification::purge_for_trx_v3(galera::TrxHandle*): Assertion `ci != cert_index_ng_.end()' failed.
18:32:45 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that...


Axel (ajurak) wrote :

oh, now it crashed in another table!
now it's the article_plain.

article, article_attachment, article_flag were loaded successfully

but the next one, article_plain, crashed.
it should have 27406 rows, but has only 24876.

--
-- Table structure for table `article_plain`
--

CREATE TABLE IF NOT EXISTS `article_plain` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `article_id` bigint(20) NOT NULL,
  `body` longblob NOT NULL,
  `create_time` datetime NOT NULL,
  `create_by` int(11) NOT NULL,
  `change_time` datetime NOT NULL,
  `change_by` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `article_plain_article_id` (`article_id`),
  KEY `create_by` (`create_by`),
  KEY `change_by` (`change_by`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=26360 ;

Alex Yurchenko (ayurchen) wrote :

Hi Axel, thanks for the stack trace, it is very helpful. I'll be offline for the next 5 days; hopefully I'll come up with a solution by then.

Axel (ajurak) wrote :

Hi Alex, oke fine, thanx :)

Changed in percona-xtradb-cluster:
status: New → Incomplete
status: Incomplete → New

Adding galera to the bug since it is crashing there.

Alex Yurchenko (ayurchen) wrote :

A duplicate: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1271533
It looks like the size of writesets makes a difference.

Sami Ahlroos (sami-ahlroos) wrote :

Hi,
I was having exactly this problem in exactly the same situation: importing a large DB into a new cluster and hitting that same crash. Setting wsrep_provider_options='gcs.max_throttle=0.0;gcs.fc_limit=512' as a work-around let me import the data.
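
To check that such a work-around really took effect, the option string can be compared with the values shown in the "Passing config to GCS" log line. The following Python sketch is only an illustration: the function name is made up, and the "logged" string is a short excerpt of the values from the log posted earlier in this thread.

```python
def parse_provider_options(text):
    """Split a 'key = value; key = value' option string into a dict."""
    options = {}
    for item in text.split(";"):
        if "=" not in item:
            continue  # skip empty or truncated fragments
        key, value = item.split("=", 1)
        options[key.strip()] = value.strip()
    return options


workaround = parse_provider_options("gcs.max_throttle=0.0;gcs.fc_limit=512")
logged = parse_provider_options(
    "gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.max_throttle = 0.25"
)
# Map each option the work-around changes to (logged value, new value).
changed = {k: (logged.get(k), v) for k, v in workaround.items() if logged.get(k) != v}
print(changed)
```

After a restart with the new settings, the same comparison against the fresh log line should come back empty.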

Alex Yurchenko (ayurchen) wrote :

I don't see how gcs.max_throttle may help there, but a low enough gcs.fc_limit will lessen applying concurrency... so that may be helping.

Axel (ajurak) wrote :

Hi Sami,
Hi Alex,

thank you for your answers & hints. I didn't have time to test these settings.
For now we are using version 5.5.

Any idea when the bug will be fixed in 5.6?
Our customer wants to upgrade asap. ;-)

Thank you & best regards
 Axel

Alex Yurchenko (ayurchen) wrote :

Hi Axel, I still haven't had a chance to look at it in depth. However, Percona made a PXC 5.6 release a few days ago. While it does not fix this bug specifically, it may have better chances of success.

Axel (ajurak) wrote :

hi alex,
ok, thanx for the info! :)

A duplicate, but no VMware involved: https://mariadb.atlassian.net/browse/MDEV-5720 Let me know if you need more information.

Axel (ajurak) wrote :

Hello again,

sorry to bother you again, but we really need a solution by now!
Can you please tell me when we can expect an update or fix for this problem?

Our customer is not amused about the situation and is urging for a solution.

Thank you very much & best regards
 Axel

Alex Yurchenko (ayurchen) wrote :

Looks like it was an off-by-one error in certification index cleanup.
Fix pushed in http://bazaar.launchpad.net/~codership/galera/3.x/revision/178

Changed in galera:
assignee: nobody → Alex Yurchenko (ayurchen)
milestone: none → 25.3.5
status: New → Fix Committed
Changed in galera:
status: Fix Committed → Fix Released
Axel (ajurak) wrote :

Hi Alex,

hmmm, just tried it again with the newest ubuntu package available.
crashed again, same result. :(

LOGFILE:

140403 10:13:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140403 10:13:22 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.fpLUPy' --pid-file='/var/lib/mysql/srzotrsDB01-recover.pid'
140403 10:13:37 mysqld_safe WSREP: Recovered position a2a91839-9fb6-11e3-a366-f7e082d5377b:27494
2014-04-03 10:13:37 0 [Note] WSREP: wsrep_start_position var submitted: 'a2a91839-9fb6-11e3-a366-f7e082d5377b:27494'
2014-04-03 10:13:37 8593 [Note] WSREP: Setting wsrep_ready to 0
2014-04-03 10:13:37 8593 [Note] WSREP: Read nil XID from storage engines, skipping position init
2014-04-03 10:13:37 8593 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2014-04-03 10:13:37 8593 [Note] WSREP: wsrep_load(): Galera 3.4(r176) by Codership Oy <email address hidden> loaded successfully.
2014-04-03 10:13:37 8593 [Note] WSREP: CRC-32C: using hardware acceleration.
2014-04-03 10:13:37 8593 [Note] WSREP: Found saved state: a2a91839-9fb6-11e3-a366-f7e082d5377b:-1
2014-04-03 10:13:37 8593 [Note] WSREP: Passing config to GCS: base_host = 10.10.17.73; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 1; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.weight = 1; protonet
2014-04-03 10:13:37 8593 [Note] WSREP: Assign initial position for certification: 27494, protocol version: -1
2014-04-03 10:13:37 8593 [Note] WSREP: Assign initial position for certification: 27494, protocol version: -1
2014-04-03 10:13:37 8593 [Note] WSREP: wsrep_sst_grab()
2014-04-03 10:13:37 8593 [Note] WSREP: Start replication
2014-04-03 10:13:37 8593 [Note] WSREP: Setting initial position to a2a91839-9fb6-11e3-a366-f7e082d5377b:27494
2014-04-03 10:13:37 8593 [Note] WSREP: protonet asio version 0
2014-04-03 10:13:37 8593 [Note] WSREP: Using CRC-32C (optimized) for message checksums.
2014-04-03 10:13:37 8593 [Note] WSREP: backend: asio
2014-04-03 10:13:37 8593 [Note] WSREP: GMCast version 0
2014-04-03 10:13:37 8593 [Note] WSREP: (dbcf1bdd-bb07-11e3-ac06-1b749f0aa051, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2014-04-03 10:13:37 8593 [Note] WSREP: (dbcf1bdd-bb07-11e3-ac06-1b749f0aa051, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2014-04-03 10:13:37 8593 [No...

Alex Yurchenko (ayurchen) wrote :

Axel,

> 2014-04-03 10:13:37 8593 [Note] WSREP: wsrep_load(): Galera 3.4(r176) by Codership Oy <email address hidden> loaded successfully.

You need the 3.5 release.
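
The loaded provider version can be read straight from the wsrep_load() log line. The Python sketch below is a hedged illustration: the function name is made up, and comparing the rN number with 178 assumes that it is the revision number of the galera 3.x branch in which the fix was pushed, which this thread suggests but does not state outright.

```python
import re

FIX_REVISION = 178  # revision of the fix named in this thread


def provider_revision(log_line):
    """Extract (version, revision) from a 'wsrep_load(): Galera X.Y(rN)' log line."""
    match = re.search(r"Galera (\d+(?:\.\d+)*)\(r(\d+)\)", log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


line = ("2014-04-03 10:13:37 8593 [Note] WSREP: wsrep_load(): "
        "Galera 3.4(r176) by Codership Oy <email address hidden> loaded successfully.")
version, revision = provider_revision(line)
print(version, revision, revision >= FIX_REVISION)  # 3.4 176 False
```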

Axel (ajurak) wrote :

aha, oke.
well, that hasn't been published yet by percona! :(

i have in /etc/apt/sources.list :

deb http://repo.percona.com/apt precise main
deb-src http://repo.percona.com/apt precise main

and the command
   apt-get update ; apt-get upgrade

tells me that there is nothing to install/upgrade.

so, i have to wait......

@Axel,

The 3.6 package is now available in the Debian experimental repo. https://launchpad.net/percona-xtradb-cluster/+milestone/galera-3.6 is the milestone associated with it.
