Issue:
In some cases corosync, even with debug mode off, can generate many core dumps within a very short time, consuming all free space under /var/log (and therefore on the root partition as well). Scheduled logrotate jobs cannot handle this either.
Corosync logging config:
logging {
  fileline: off
  to_stderr: no
  to_logfile: no
  to_syslog: yes
  logfile: /var/log/corosync.log
  syslog_facility: daemon
  # We don't really want corosync debugs, it is TOO verbose
  # debug: off
  debug: off
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: off
    tags: enter|leave|trace1|trace2|trace3|trace4|trace6
  }
}
Example of coredumps in /var/log/coredumps:
161734 -rw------- 1 root root 780M Nov 20 15:46 core.corosync.14831.node-7.test.domain.local.1384962366
161735 -rw------- 1 root root 731M Nov 20 15:46 core.corosync.14832.node-7.test.domain.local.1384962366
161736 -rw------- 1 root root 679M Nov 20 15:46 core.corosync.14833.node-7.test.domain.local.1384962367
161737 -rw------- 1 root root 207M Nov 20 15:46 core.corosync.14836.node-7.test.domain.local.1384962367
161738 -rw------- 1 root root 163M Nov 20 15:46 core.corosync.14837.node-7.test.domain.local.1384962367
161739 -rw------- 1 root root 130M Nov 20 15:46 core.corosync.14840.node-7.test.domain.local.1384962367
161740 -rw------- 1 root root 57M Nov 20 15:46 core.corosync.14841.node-7.test.domain.local.1384962367
161741 -rw------- 1 root root 50M Nov 20 15:46 core.corosync.14842.node-7.test.domain.local.1384962367
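To put numbers on the listing above, here is a minimal Python sketch that sums the dump sizes and measures the time window they were written in. It assumes that the numeric suffix of each file name is a Unix epoch timestamp (this matches the `ls` modification time, but is not confirmed elsewhere in this report) and that the `M` suffix means MiB. The names `LISTING`, `ROW` and `summarize` are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Rows copied verbatim from the listing above.
LISTING = """\
161734 -rw------- 1 root root 780M Nov 20 15:46 core.corosync.14831.node-7.test.domain.local.1384962366
161735 -rw------- 1 root root 731M Nov 20 15:46 core.corosync.14832.node-7.test.domain.local.1384962366
161736 -rw------- 1 root root 679M Nov 20 15:46 core.corosync.14833.node-7.test.domain.local.1384962367
161737 -rw------- 1 root root 207M Nov 20 15:46 core.corosync.14836.node-7.test.domain.local.1384962367
161738 -rw------- 1 root root 163M Nov 20 15:46 core.corosync.14837.node-7.test.domain.local.1384962367
161739 -rw------- 1 root root 130M Nov 20 15:46 core.corosync.14840.node-7.test.domain.local.1384962367
161740 -rw------- 1 root root 57M Nov 20 15:46 core.corosync.14841.node-7.test.domain.local.1384962367
161741 -rw------- 1 root root 50M Nov 20 15:46 core.corosync.14842.node-7.test.domain.local.1384962367
"""

# Assumption: human-readable sizes use binary units (K/M/G = KiB/MiB/GiB).
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30}

# size + unit, then core.corosync.<pid>.<hostname>.<epoch> at end of line.
ROW = re.compile(r"\s(\d+)([KMG])\s.*\score\.corosync\.(\d+)\..*\.(\d+)$")


def summarize(listing):
    """Return (total_bytes, first_epoch, last_epoch) for the listed dumps."""
    total = 0
    stamps = []
    for line in listing.splitlines():
        match = ROW.search(line)
        if match is None:
            continue  # skip rows that do not look like corosync dumps
        size, unit, _pid, epoch = match.groups()
        total += int(size) * UNITS[unit]
        stamps.append(int(epoch))
    return total, min(stamps), max(stamps)


total, first, last = summarize(LISTING)
print(f"{total // 2**20} MiB written between "
      f"{datetime.fromtimestamp(first, tz=timezone.utc).isoformat()} and "
      f"{datetime.fromtimestamp(last, tz=timezone.utc).isoformat()}")
```

Under these assumptions the eight dumps add up to 2797 MiB, written within two consecutive seconds.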
Example of remote log records from /var/log/remote/node-7.test.domain.local/crmd.log. A single second holds 8609 records, i.e. roughly 10k records per second:
# grep "2013-11-20T15:46:35" /var/log/remote/node-7.test.domain.local/crmd.log | wc -l
8609
Sample records:
2013-11-20T15:46:35.564651+00:00 warning: warning: do_pe_control: Setup of client connection failed, not adding channel to mainloop
2013-11-20T15:46:35.564655+00:00 warning: warning: do_log: FSA: Input I_FAIL from do_pe_control() received in state S_INTEGRATION
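The per-second rate can also be computed for every second at once rather than for one hand-picked timestamp. The following is a rough Python sketch of that idea; it assumes every record starts with an ISO 8601 timestamp in the format shown above, and the function name `records_per_second` is hypothetical.

```python
from collections import Counter


def records_per_second(lines):
    """Count log records per one-second bucket.

    Assumes each record starts with a timestamp such as
    2013-11-20T15:46:35.564651+00:00, so the first 19 characters
    (date, 'T', HH:MM:SS) identify the second.
    """
    counts = Counter()
    for line in lines:
        stamp = line[:19]
        # Crude sanity check so that continuation lines are ignored.
        if len(stamp) == 19 and stamp[10] == "T":
            counts[stamp] += 1
    return counts


# The two records quoted above, used as sample input.
sample = [
    "2013-11-20T15:46:35.564651+00:00 warning: warning: do_pe_control: "
    "Setup of client connection failed, not adding channel to mainloop",
    "2013-11-20T15:46:35.564655+00:00 warning: warning: do_log: FSA: "
    "Input I_FAIL from do_pe_control() received in state S_INTEGRATION",
]

counts = records_per_second(sample)
for second, n in counts.most_common(5):
    print(second, n)
```

On a real node the same function could be fed an open file object for crmd.log; the busiest seconds then appear first in `most_common()`.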
I've reproduced it at least 2 times on the 5.1 ISO. Currently on:
{
  "api": "1.0",
  "astute_sha": "694b5a55695e01e1c42185bfac9cc7a641a9bd48",
  "build_id": "2014-06-27_00-31-14",
  "build_number": "274",
  "fuellib_sha": "acc99fcd0ba9eeef0a504dc26507eb91ce757220",
  "fuelmain_sha": "bf8660309601cee2f8f3e1bb881d272e638dcffa",
  "mirantis": "yes",
  "nailgun_sha": "5f2944a8d5077a1c96acb076ba9194f670b818e8",
  "ostf_sha": "a4978638de3951dbc229276608a839a19ece2b70",
  "production": "docker",
  "release": "5.1"
}
Steps to reproduce:
1. Deploy HA on CentOS with Neutron VLAN and 3 controllers.
2. When the deployment finishes successfully, ssh to a controller and check where the VIPs are running (crm status).
3. ssh to the node where vip__management is running.
4. Shut down br-mgmt.
Result on the node where br-mgmt was shut down:
# df -h
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/os-root        15G   15G     0 100% /
tmpfs                     939M   37M  903M   4% /dev/shm
/dev/md0                  194M   25M  160M  14% /boot
/dev/mapper/image-glance  130G   45M  130G   1% /var/lib/glance
# du -hs /var/log/coredump/
12G /var/log/coredump/
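The same check (how much of the filesystem one directory occupies) can be scripted. This is only an illustrative Python sketch using the standard library: `dir_size` and `usage_report` are hypothetical helper names, and the directory size is the sum of apparent file sizes, which can differ from the allocated blocks that `du` reports.

```python
import os
import shutil


def dir_size(path):
    """Sum the apparent sizes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue  # do not follow or count symlinks
            try:
                total += os.path.getsize(full)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def usage_report(path):
    """Compare a directory's size with the filesystem that holds it."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    size = dir_size(path)
    return {
        "dir_bytes": size,
        "fs_total": usage.total,
        "fs_free": usage.free,
        "share": size / usage.total,
    }


if __name__ == "__main__":
    # Path taken from the du command above; adjust as needed.
    target = "/var/log/coredump/"
    if os.path.isdir(target):
        report = usage_report(target)
        print(f"{target}: {report['dir_bytes']} bytes, "
              f"{report['share']:.0%} of the filesystem, "
              f"{report['fs_free']} bytes free")
```

With the figures above (12G of dumps on a 15G root filesystem) the reported share would be about 80%.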
Example of dumps:
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14512.node-1.test.domain.local.1404238421
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14513.node-1.test.domain.local.1404238421
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14514.node-1.test.domain.local.1404238421
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14580.node-1.test.domain.local.1404238424
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14691.node-1.test.domain.local.1404238425
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14712.node-1.test.domain.local.1404238425
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14713.node-1.test.domain.local.1404238425
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14714.node-1.test.domain.local.1404238425
Example of "ps":
189 15090 0.0 0.8 148768 16088 ? S 14:42 0:04 \_ /usr/libexec/pacemaker/crmd
root 12553 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12574 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12575 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12576 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12577 0.1 0.0 0 0 ? Z 18:12 0:01 \_ [corosync] <defunct>
root 12578 0.1 0.0 0 0 ? Z 18:12 0:01 \_ [corosync] <defunct>
root 12579 0.1 0.0 0 0 ? Z 18:12 0:01 \_ [corosync] <defunct>
root 12582 0.0 0.0 0 0 ? Z 18:12 0:01 \_ [corosync] <defunct>
root 12673 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12829 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12850 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12851 0.0 0.0 0 0 ? Z 18...