Silent abort (crash) at gcs/src/gcs_core.cpp:1152

Bug #1549704 reported by Vladimir
This bug affects 1 person
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description:

After a period of stable operation, Percona XtraDB Cluster nodes start crashing and cannot rejoin the cluster (regardless of whether IST or SST is used).
The only way to restore the cluster is to stop all nodes and re-bootstrap.

Steps to reproduce (reproducibility - 100%):

1. Bootstrap the cluster.
2. Wait 15-20 days.
3. One of the XtraDB nodes crashes.
4. Try to rejoin the cluster (i.e. systemctl start mysql).

Actual results:

Crash and inability to join cluster.

2016-02-05 10:42:27 6383 [Warning] WSREP: 1.0 (mysql-rw0): State transfer to 0.0 (mysql-rw1) failed: -12 (Cannot allocate memory)
2016-02-05 10:42:27 6383 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():731: Will never receive state. Need to abort.

Expected results:

JOINER->SYNCED

Additional info:

I tried to enable core dumps, but it seems that Galera disables them in its sources, so I ran mysqld manually (i.e. without the mysqld_safe wrapper) with gdb attached to catch the fault and get a backtrace.

# rpm -qa | grep Percona
Percona-XtraDB-Cluster-server-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-devel-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-client-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-test-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-galera-3-debuginfo-3.14-1.rhel7.x86_64
Percona-XtraDB-Cluster-full-56-5.6.28-25.14.1.el7.x86_64
Percona-XtraDB-Cluster-galera-3-3.14-1.rhel7.x86_64
Percona-XtraDB-Cluster-garbd-3-3.14-1.rhel7.x86_64
Percona-XtraDB-Cluster-56-debuginfo-5.6.28-25.14.1.el7.x86_64

# cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)

# hostnamectl
   Static hostname: mysql-rw2
           Chassis: container
    Virtualization: lxc-libvirt
  Operating System: CentOS Linux 7 (Core)
       CPE OS Name: cpe:/o:centos:centos:7
            Kernel: Linux 3.10.0-229.20.1.el7.x86_64
      Architecture: x86-64

Tags: crash galera
Revision history for this message
Vladimir (amigo-elite) wrote :
description: updated
tags: added: galera
Revision history for this message
Krunal Bauskar (krunal-bauskar) wrote :

I can see the error occurs because memory allocation fails during SST, but at first glance I am not sure why.

While we look into this, could you check whether you can find out the cause of the memory allocation failure?

Revision history for this message
Vladimir (amigo-elite) wrote :

Could this be the cause?

# cat /proc/meminfo
MemTotal: 9007199254740991 kB
MemFree: 9007199225892055 kB
MemAvailable: 3341132 kB
Buffers: 0 kB
Cached: 2360676 kB
SwapCached: 114024 kB
Active: 25532372 kB
Inactive: 3316148 kB
Active(anon): 24423012 kB
Inactive(anon): 2122176 kB
Active(file): 1109360 kB
Inactive(file): 1193972 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 2097148 kB
SwapFree: 12 kB
Dirty: 5484 kB
Writeback: 0 kB
AnonPages: 26910188 kB
Mapped: 391772 kB
Shmem: 1483464 kB
Slab: 656116 kB
SReclaimable: 481404 kB
SUnreclaim: 174712 kB
KernelStack: 8576 kB
PageTables: 123736 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 18436208 kB
Committed_AS: 39987788 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 330704 kB
VmallocChunk: 34359403516 kB
HardwareCorrupted: 0 kB
AnonHugePages: 2[...] kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 726880 kB
DirectMap2M: 19091456 kB
DirectMap1G: 15728640 kB

# cat /proc/mounts | grep meminfo
libvirt /proc/meminfo fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0

AFAIR this is normal libvirt-lxc behaviour, and the following code works as expected (it is killed by the OOM killer at around 25 GB):

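    #include <stdlib.h>
    #include <string.h>
    int main(void) {
        char *buf;
        /* allocate and touch 1 MiB per iteration until malloc fails or the OOM killer fires */
        while ((buf = malloc(1024 * 1024)) != NULL)
            memset(buf, 0, 1024 * 1024);
    }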
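
A note on the MemTotal value above: 9007199254740991 is exactly 2^53 - 1, which is LLONG_MAX >> 10 for a 64-bit long long, so it looks like an "unlimited" marker leaking through the FUSE-mounted /proc/meminfo rather than a real size. That reading is an assumption; nothing in this report confirms where the number comes from. A minimal C sketch that flags such lines (the function name and the threshold are illustrative, not part of libvirt or Galera):

```c
#include <limits.h>
#include <stdio.h>

/* Assumption: a value of LLONG_MAX >> 10 kB (2^53 - 1 with a 64-bit
 * long long) or above means "no limit", not a real amount of memory. */
#define UNLIMITED_KB (LLONG_MAX >> 10)

/* Parses one /proc/meminfo line such as "MemTotal: 123 kB".
 * Returns 1 if the value looks like the "unlimited" marker,
 * 0 if it looks like a real size, -1 if the line cannot be parsed. */
int meminfo_is_unlimited(const char *line)
{
    char key[64];
    long long kb;

    if (sscanf(line, "%63[^:]: %lld", key, &kb) != 2)
        return -1;
    return kb >= UNLIMITED_KB;
}
```

Applied to the output above, only the MemTotal line is flagged; MemFree is slightly below the marker and MemAvailable is an ordinary value.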

Revision history for this message
Krunal Bauskar (krunal-bauskar) wrote :

Looking at your joiner log, it suggests that things failed while trying to fork() wsrep_sst_xtrabackup-v2 ... and the error was returned by the fork() call, so it is quite likely that memory consumption on your system is a bit higher than expected.

I see you have configured an 18G buffer pool. Is that needed?

How big are your swap space and physical RAM?
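
For reference, the -12 in the log is the negated errno ENOMEM (12 on Linux), which fork() sets when the kernel cannot allocate memory for the new process. The sketch below is not the server's code; run_helper is a hypothetical name, and it only illustrates how a failed fork() would surface as a negative errno like the one logged:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs `program` (looked up in PATH) in a child process.
 * Returns the child's exit status, 128 + signal number if it was
 * killed by a signal, or a negative errno value if the child could
 * not be created or waited for (for example -ENOMEM from fork()). */
int run_helper(const char *program)
{
    pid_t pid = fork();
    if (pid < 0) {
        int err = errno;  /* save before any other call can change it */
        fprintf(stderr, "fork() failed: %d (%s)\n", -err, strerror(err));
        return -err;
    }
    if (pid == 0) {
        execlp(program, program, (char *)NULL);
        _exit(127);       /* exec failed; 127 follows the shell convention */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -errno;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    return 128 + WTERMSIG(status);
}
```

A caller can distinguish "the helper ran and failed" (positive return) from "the helper could not be started at all" (negative return), which is the case the joiner log points to.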

Revision history for this message
Vladimir (amigo-elite) wrote :

>I see you have configured an 18G buffer pool. Is that needed?
No, but the crashes are unrelated to this setting: mysql continues to crash even if I comment it out.

>How big are your swap space and physical RAM?

On the host:
# free -m
              total        used        free      shared  buff/cache   available
Mem:          31912       22907         291        2484        8714        6105
Swap:          2047        1863         184

# virsh -c lxc:// dumpxml mysql-rw1 | grep KiB
  <memory unit='KiB'>32677888</memory>
  <currentMemory unit='KiB'>32677888</currentMemory>

# sysctl -a | grep overcommit
vm.nr_overcommit_hugepages = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
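
For context on these settings: the kernel documents CommitLimit as (RAM - huge pages) * overcommit_ratio / 100 + swap, and enforces it strictly only when vm.overcommit_memory = 2. With the value 0 shown above the kernel uses a heuristic instead, so CommitLimit is informational here. Still, the meminfo output earlier reports Committed_AS (39987788 kB) at more than twice CommitLimit (18436208 kB). A small sketch of the arithmetic, assuming vm.overcommit_kbytes is 0 so the ratio applies (commit_limit_kb is an illustrative helper, not a kernel API):

```c
/* CommitLimit in kB, following the formula in the kernel's
 * overcommit-accounting documentation:
 *   (RAM - huge pages) * overcommit_ratio / 100 + swap
 * All sizes are in kB; ratio is a percentage such as 50. */
long long commit_limit_kb(long long ram_kb, long long hugetlb_kb,
                          long long swap_kb, int ratio)
{
    return (ram_kb - hugetlb_kb) * ratio / 100 + swap_kb;
}
```

With SwapTotal = 2097148 kB, a ratio of 50 and no huge pages, a CommitLimit of 18436208 kB corresponds to about 32678120 kB of RAM, in line with the 31912 MB that free -m reports on the host.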

Changed in percona-xtradb-cluster:
status: New → Fix Committed
Changed in percona-xtradb-cluster:
milestone: none → 5.6.29-25.15
Changed in percona-xtradb-cluster:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1886
