Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Silent abort (crash) at gcs/src/gcs_core.cpp:1152

Bug #1549704 reported by Vladimir on 2016-02-25

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Percona XtraDB Cluster	Fix Released	Undecided	Unassigned	5.6.29-25.15

Bug Description

Description:

After a period of stable operation, Percona XtraDB Cluster nodes start crashing and cannot rejoin the cluster, whether via IST or SST.
The only way to restore the cluster is to stop all nodes and re-bootstrap.

Steps to reproduce (reproducibility - 100%):

1. Bootstrap cluster
2. Wait for 15-20 days
3. Some XtraDB node will crash
4. Try to join cluster (i.e. systemctl start mysql).

Actual results:

Crash and inability to join cluster.

2016-02-05 10:42:27 6383 [Warning] WSREP: 1.0 (mysql-rw0): State transfer to 0.0 (mysql-rw1) failed: -12 (Cannot allocate memory)
2016-02-05 10:42:27 6383 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():731: Will never receive state. Need to abort.
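
The -12 in the warning above looks like a negated errno value: on Linux, errno 12 is ENOMEM, and its message is the "Cannot allocate memory" text that follows the number. A minimal C sketch of that decoding, assuming the status really is a negated errno (the helper name status_to_text is hypothetical, not a Galera function):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Turn a negative status code, as printed in the WSREP warning, into text.
 * Returns NULL for non-negative values, which are not treated as errors
 * in this sketch. */
const char *status_to_text(int status)
{
    if (status >= 0)
        return NULL;
    return strerror(-status);   /* -12 -> strerror(ENOMEM) on Linux */
}
```

Calling status_to_text(-12) on a Linux system returns the same string as strerror(ENOMEM).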

Expected results:

JOINER->SYNCED

Additional info:

I tried to enable core dumps, but it seems that Galera disables them in its sources, so I ran mysqld manually (i.e. without the mysqld_safe wrapper) with gdb attached to catch the fault and get a backtrace.

# rpm -qa | grep Percona
Percona-XtraDB-Cluster-server-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-devel-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-test-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-galera-3-debuginfo-3.14-1.rhel7.x86_64
Percona-XtraDB-Cluster-full-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-galera-3-3.14-1.rhel7.x86_64
Percona-XtraDB-Cluster-garbd-3-3.14-1.rhel7.x86_64
Percona-XtraDB-Cluster-56-debuginfo-5.6.28-25.14.1.el7.x86_64

# cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)

# hostnamectl
   Static hostname: mysql-rw2
           Chassis: container
    Virtualization: lxc-libvirt
  Operating System: CentOS Linux 7 (Core)
       CPE OS Name: cpe:/o:centos:centos:7
            Kernel: Linux 3.10.0-229.20.1.el7.x86_64
      Architecture: x86-64


Tags: galera

Vladimir (amigo-elite) wrote on 2016-02-25:

joiner node GDB debug session (5.0 KiB, text/plain)

Vladimir (amigo-elite) wrote on 2016-02-25:

donor mysqld.log on joiner crash (6.0 KiB, text/plain)

Vladimir (amigo-elite) wrote on 2016-02-25:

joiner my.cnf (2.3 KiB, text/plain)

Vladimir (amigo-elite) wrote on 2016-02-25:

mysqld.log of joiner after initial fault (329.6 KiB, text/plain)

Vladimir (amigo-elite) wrote on 2016-02-25:

journald mysql messages of joiner (4.8 KiB, text/plain)

Vladimir (amigo-elite) wrote on 2016-02-25:

joiner mysqld.log on crash (13.8 KiB, text/plain)

description:	updated
tags:	added: galera

Krunal Bauskar (krunal-bauskar) wrote on 2016-03-24:

I see that the error occurs because memory allocation fails during SST, but at first glance I am not sure why.

While we look into this, please check whether you can find the cause of the memory allocation failure.

Vladimir (amigo-elite) wrote on 2016-03-24:

Could this be the reason?

# cat /proc/meminfo
MemTotal: 9007199254740991 kB
MemFree: 9007199225892055 kB
MemAvailable: 3341132 kB
Buffers: 0 kB
Cached: 2360676 kB
SwapCached: 114024 kB
Active: 25532372 kB
Inactive: 3316148 kB
Active(anon): 24423012 kB
Inactive(anon): 2122176 kB
Active(file): 1109360 kB
Inactive(file): 1193972 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2097148 kB
SwapFree: 12 kB
Dirty: 5484 kB
Writeback: 0 kB
AnonPages: 26910188 kB
Mapped: 391772 kB
Shmem: 1483464 kB
Slab: 656116 kB
SReclaimable: 481404 kB
SUnreclaim: 174712 kB
KernelStack: 8576 kB
PageTables: 123736 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 18436208 kB
Committed_AS: 39987788 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 330704 kB
VmallocChunk: 34359403516 kB
HardwareCorrupted: 0 kB
AnonHugePages: 2[...] kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 726880 kB
DirectMap2M: 19091456 kB
DirectMap1G: 15728640 kB

# cat /proc/mounts | grep meminfo
libvirt /proc/meminfo fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0

AFAIR this is normal libvirt-lxc behaviour, and the following code works as expected (it is killed by the OOM killer at around 25 GB):

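    #include <stdlib.h>
    #include <string.h>
    int main(void) {
        char *buf;
        /* touch each 1 MiB block so the memory is really committed */
        while ((buf = malloc(1024 * 1024)) != NULL)
            memset(buf, 0, 1024 * 1024);
    }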
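
The meminfo listing above can also be checked from code. The following is only a sketch, assuming the usual "Key: value kB" layout of /proc/meminfo; the function names and the 1 PiB threshold are hypothetical choices for illustration, not something used by libvirt or Galera:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Look up one "Key: value kB" entry in /proc/meminfo-style text.
 * Returns the value in kB, or -1 if the key is absent or malformed. */
long long meminfo_kb(const char *text, const char *key)
{
    assert(text != NULL && key != NULL);
    size_t klen = strlen(key);
    const char *line = text;
    while (line != NULL && *line != '\0') {
        /* Match the key only at the start of a line, followed by ':' */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long long value;
            if (sscanf(line + klen + 1, "%lld", &value) == 1)
                return value;
            return -1;
        }
        line = strchr(line, '\n');   /* advance to the next line, if any */
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Flag a MemTotal above 2^40 kB (1 PiB), an arbitrary sanity threshold.
 * The value reported in the listing above (2^53 - 1 kB) is far beyond it. */
int memtotal_implausible(const char *text)
{
    return meminfo_kb(text, "MemTotal") > (1LL << 40);
}
```

Fed the text of the listing above, memtotal_implausible() flags the MemTotal line, while fields such as MemAvailable and Committed_AS still parse to ordinary values.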

Krunal Bauskar (krunal-bauskar) wrote on 2016-03-25:

Looking at your joiner log, things failed while trying to fork() wsrep_sst_xtrabackup-v2, and the error was returned by the fork() call itself, so it is quite likely that memory consumption on your system is higher than expected.

I see you have used 18G of buffer pool; is that needed?

How big are your swap space and physical RAM?
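
How a failed fork() of a helper script would surface can be sketched in C. This is an illustration only and not the actual wsrep code path; run_helper is a hypothetical name, and the convention of returning a negated errno is an assumption that mirrors the -12 seen in the log:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] as a child process and return its exit status.
 * Returns -errno if fork() or waitpid() fails, so an out-of-memory fork
 * shows up as -ENOMEM, and 127 if the program cannot be executed. */
int run_helper(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        int err = errno;             /* save before any other call */
        fprintf(stderr, "fork() failed: %d (%s)\n", -err, strerror(err));
        return -err;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                  /* only reached if exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -errno;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return 128 + WTERMSIG(status);   /* child was killed by a signal */
}
```

Because fork() has to account for the parent's writable memory, a very large mysqld process can get ENOMEM here even though a small helper script is all that would be started.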

Vladimir (amigo-elite) wrote on 2016-03-26:

#10

>I see you have used 18G of buffer pool; is that needed?
No, but the crashes are unrelated to this setting: mysql continues to crash even if I comment it out.

>How big are your swap space and physical RAM?

On the host:
# free -m
              total        used        free      shared  buff/cache   available
Mem:          31912       22907         291        2484        8714        6105
Swap:          2047        1863         184

# virsh -c lxc:// dumpxml mysql-rw1 | grep KiB
<memory unit='KiB'>32677888</memory>
<currentMemory unit='KiB'>32677888</currentMemory>

# sysctl -a | grep overcommit
vm.nr_overcommit_hugepages = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
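
These settings tie back to the CommitLimit line in the meminfo listing. The kernel documentation describes CommitLimit as swap plus overcommit_ratio percent of RAM (minus huge pages), and it is only enforced when vm.overcommit_memory is 2; with the value 0 shown here the kernel applies a heuristic instead. A small C sketch of the formula, ignoring huge pages (the function name is hypothetical, and the RAM figure in the usage note is back-computed from the reported CommitLimit, not read from the system):

```c
/* CommitLimit as documented for /proc/meminfo, ignoring huge pages:
 * swap + RAM * overcommit_ratio / 100, with all sizes in kB. */
long long commit_limit_kb(long long ram_kb, long long swap_kb, int ratio_pct)
{
    return swap_kb + ram_kb * ratio_pct / 100;
}
```

With SwapTotal 2097148 kB, a ratio of 50 and an assumed 32678120 kB of RAM (about the 31912 MiB that free -m reports on the host), commit_limit_kb() gives 18436208 kB, which matches the CommitLimit value in the listing.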

Krunal Bauskar (krunal-bauskar) on 2016-05-07

Changed in percona-xtradb-cluster:
status:	New → Fix Committed

Hrvoje Matijakovic (hrvojem) on 2016-05-16

Changed in percona-xtradb-cluster:
milestone:	none → 5.6.29-25.15

Hrvoje Matijakovic (hrvojem) on 2016-05-20

Changed in percona-xtradb-cluster:
status:	Fix Committed → Fix Released

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-18:

#11

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1886
