corosync coredumps might consume all free space at /var/log

Bug #1253577 reported by Bogdan Dobrelya
This bug affects 1 person

Affects: Fuel for OpenStack
Status: Won't Fix
Importance: Medium
Assigned to: Fuel Library (Deprecated)

Bug Description

Issue:
In some cases corosync (even with debug mode off) can generate a flood of coredumps within a very short time, consuming all free space in /var/log (and, since /var/log resides on the root partition, filling that as well). Scheduled logrotate jobs cannot cope with this either: they run on a fixed schedule and rotate log files, not core dump files.

Corosync logging config:
logging {
  fileline: off
  to_stderr: no
  to_logfile: no
  to_syslog: yes
  logfile: /var/log/corosync.log
  syslog_facility: daemon
# We don't really want corosync debugs, it is TOO verbose
# debug: off
  debug: off
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: off
    tags: enter|leave|trace1|trace2|trace3|trace4|trace6
  }
}
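
The dump file names below follow the kernel core_pattern placeholders core.%e.%p.%h.%t (executable, PID, hostname, Unix timestamp). A hypothetical sysctl setting that would produce such files, inferred from the names rather than quoted from the affected nodes:

# Inferred from the dump names, not taken from the report:
# write kernel core dumps to /var/log/coredumps as core.<exe>.<pid>.<host>.<time>
kernel.core_pattern = /var/log/coredumps/core.%e.%p.%h.%t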

Example of coredumps in /var/log/coredumps:
161734 -rw------- 1 root root 780M Nov 20 15:46 core.corosync.14831.node-7.test.domain.local.1384962366
161735 -rw------- 1 root root 731M Nov 20 15:46 core.corosync.14832.node-7.test.domain.local.1384962366
161736 -rw------- 1 root root 679M Nov 20 15:46 core.corosync.14833.node-7.test.domain.local.1384962367
161737 -rw------- 1 root root 207M Nov 20 15:46 core.corosync.14836.node-7.test.domain.local.1384962367
161738 -rw------- 1 root root 163M Nov 20 15:46 core.corosync.14837.node-7.test.domain.local.1384962367
161739 -rw------- 1 root root 130M Nov 20 15:46 core.corosync.14840.node-7.test.domain.local.1384962367
161740 -rw------- 1 root root 57M Nov 20 15:46 core.corosync.14841.node-7.test.domain.local.1384962367
161741 -rw------- 1 root root 50M Nov 20 15:46 core.corosync.14842.node-7.test.domain.local.1384962367

Example of remote log records from /var/log/remote/node-7.test.domain.local/crmd.log. A single second matches 8609 records, i.e. close to 10k log records per second:

grep "2013-11-20T15:46:35" /var/log/remote/node-7.test.domain.local/crmd.log | wc -l
8609
2013-11-20T15:46:35.564651+00:00 warning: warning: do_pe_control: Setup of client connection failed, not adding channel to mainloop
2013-11-20T15:46:35.564655+00:00 warning: warning: do_log: FSA: Input I_FAIL from do_pe_control() received in state S_INTEGRATION

Tags: library
Changed in fuel:
importance: Undecided → Low
Mike Scherbakov (mihgen)
Changed in fuel:
milestone: none → 4.1
Changed in fuel:
milestone: 4.1 → 5.1
importance: Low → Critical
status: New → Confirmed
assignee: nobody → Fuel Library Team (fuel-library)
Aleksandr Didenko (adidenko) wrote:

I've reproduced it at least twice on the 5.1 ISO. Current build info:
{
    "api": "1.0",
    "astute_sha": "694b5a55695e01e1c42185bfac9cc7a641a9bd48",
    "build_id": "2014-06-27_00-31-14",
    "build_number": "274",
    "fuellib_sha": "acc99fcd0ba9eeef0a504dc26507eb91ce757220",
    "fuelmain_sha": "bf8660309601cee2f8f3e1bb881d272e638dcffa",
    "mirantis": "yes",
    "nailgun_sha": "5f2944a8d5077a1c96acb076ba9194f670b818e8",
    "ostf_sha": "a4978638de3951dbc229276608a839a19ece2b70",
    "production": "docker",
    "release": "5.1"
}

Steps to reproduce:
1. Deploy an HA environment on CentOS with Neutron VLAN segmentation and 3 controllers.
2. When the deployment finishes successfully, ssh to a controller and check where the VIPs are running (crm status).
3. ssh to the node where vip__management is running.
4. Shut down br-mgmt (a command sketch follows this list).
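
A minimal sketch of step 4, assuming the standard iproute2 tooling on the node (the report does not state the exact command used):

# Hypothetical: take the management bridge down on the node holding the VIP
ip link set dev br-mgmt down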

Result on the node where br-mgmt was shut down:

# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/os-root 15G 15G 0 100% /
tmpfs 939M 37M 903M 4% /dev/shm
/dev/md0 194M 25M 160M 14% /boot
/dev/mapper/image-glance
                      130G 45M 130G 1% /var/lib/glance

# du -hs /var/log/coredump/
12G /var/log/coredump/

Example of dumps:
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14512.node-1.test.domain.local.1404238421
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14513.node-1.test.domain.local.1404238421
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14514.node-1.test.domain.local.1404238421
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14580.node-1.test.domain.local.1404238424
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14691.node-1.test.domain.local.1404238425
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14712.node-1.test.domain.local.1404238425
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14713.node-1.test.domain.local.1404238425
-rw------- 1 root root 62390272 Jul 1 18:13 core.corosync.14714.node-1.test.domain.local.1404238425

Example of "ps" output:
189 15090 0.0 0.8 148768 16088 ? S 14:42 0:04 \_ /usr/libexec/pacemaker/crmd
root 12553 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12574 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12575 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12576 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12577 0.1 0.0 0 0 ? Z 18:12 0:01 \_ [corosync] <defunct>
root 12578 0.1 0.0 0 0 ? Z 18:12 0:01 \_ [corosync] <defunct>
root 12579 0.1 0.0 0 0 ? Z 18:12 0:01 \_ [corosync] <defunct>
root 12582 0.0 0.0 0 0 ? Z 18:12 0:01 \_ [corosync] <defunct>
root 12673 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12829 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12850 0.0 0.0 0 0 ? Z 18:12 0:00 \_ [corosync] <defunct>
root 12851 0.0 0.0 0 0 ? Z 18...

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Aleksandr Didenko (adidenko)
Aleksandr Didenko (adidenko) wrote:

Corosync core dump backtrace:

(gdb) bt
#0 0x00007fba6caeb925 in raise () from /lib64/libc.so.6
#1 0x00007fba6caed105 in abort () from /lib64/libc.so.6
#2 0x00007fba6a0749d0 in send_plugin_msg_raw () from /usr/libexec/lcrso/pacemaker.lcrso
#3 0x00007fba6a074d15 in route_ais_message () from /usr/libexec/lcrso/pacemaker.lcrso
#4 0x00007fba6a07575c in pcmk_ipc () from /usr/libexec/lcrso/pacemaker.lcrso
#5 0x00007fba6d4787cd in pthread_ipc_consumer (conn=0xbda0f0) at coroipcs.c:720
#6 0x00007fba6d0589d1 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fba6cba1b5d in clone () from /lib64/libc.so.6

It looks like the problem with full-message-queue handling by pacemaker described here: http://lists.corosync.org/pipermail/discuss/2012-February/000959.html
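
For reference, a backtrace like the one above can be extracted from a dump file with gdb (binary and core paths are illustrative, taken from the listing in the description):

# Print all thread stacks from a corosync core file non-interactively
gdb -batch -ex 'thread apply all bt' /usr/sbin/corosync \
    /var/log/coredumps/core.corosync.14831.node-7.test.domain.local.1384962366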

OpenStack Infra (hudson-openstack) wrote: Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/104200

Changed in fuel:
status: Confirmed → In Progress
Aleksandr Didenko (adidenko) wrote:

Lowering to Medium since coredumps can easily be disabled via sysctl and the issue is intermittent.
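
A minimal sketch of what disabling coredumps via sysctl could look like, assuming the dumps are produced through kernel.core_pattern (illustrative values, not the content of the proposed fix):

# Pipe core dumps to /bin/false so nothing is written to disk
sysctl -w 'kernel.core_pattern=|/bin/false'
# Persist the setting across reboots
echo 'kernel.core_pattern = |/bin/false' >> /etc/sysctl.conf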

Changed in fuel:
importance: Critical → Medium
Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Vladimir Kuklin (vkuklin)
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Fuel Library Team (fuel-library)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
status: In Progress → Triaged
Dmitry Ilyin (idv1985) wrote:

We can disable core files via sysctl, but if core files appear at all, there are underlying problems with the corosync binaries.
I don't see core dumps on my deployment, and the package has already been updated several times. Perhaps they are gone.

Changed in fuel:
status: Triaged → Invalid
Dmitry Borodaenko (angdraug) wrote:

Invalid is the wrong resolution here: the problem exists; it just can't be solved without a tradeoff between safety and debuggability that we don't want to make.

Changed in fuel:
status: Invalid → Won't Fix
OpenStack Infra (hudson-openstack) wrote: Change abandoned on fuel-library (master)

Change abandoned by Aleksandr Didenko (<email address hidden>) on branch: master
Review: https://review.openstack.org/104200
Reason: No longer needed.
