corosync 2.3.4 memory leak

Bug #1598229 reported by Richard
This bug affects 1 person
Affects             Status        Importance  Assigned to      Milestone
Mirantis OpenStack  Fix Released  High        MOS Linux
6.1.x               Won't Fix     High        MOS Maintenance
7.0.x               Fix Released  High        Sergii Rizvan
8.0.x               Fix Released  High        Sergii Rizvan
9.x                 Fix Released  High        MOS Linux

Bug Description:

Encountered a memory leak with corosync on all three nodes in a cluster:

Jun 13 20:36:35 XXXXXXXXX1 kernel: [929808.525991] Out of memory: Kill process 4846 (corosync) score 941 or sacrifice child
Jun 13 20:36:35 XXXXXXXXX1 kernel: [929808.620411] Killed process 4846 (corosync) total-vm:267928256kB, anon-rss:257475632kB, file-rss:37816kB
Jun 29 02:26:17 XXXXXXXXX1 kernel: [2247790.069557] Out of memory: Kill process 27791 (corosync) score 938 or sacrifice child
Jun 29 02:26:17 XXXXXXXXX1 kernel: [2247790.166524] Killed process 27791 (corosync) total-vm:265216168kB, anon-rss:255941644kB, file-rss:28580kB

Jun 14 14:00:03 XXXXXXXXX2 kernel: [993027.615377] Out of memory: Kill process 5167 (corosync) score 943 or sacrifice child
Jun 14 14:00:03 XXXXXXXXX2 kernel: [993027.709419] Killed process 5167 (corosync) total-vm:265023016kB, anon-rss:256668244kB, file-rss:33844kB
Jun 28 22:56:30 XXXXXXXXX2 kernel: [2235753.617203] Out of memory: Kill process 27073 (corosync) score 941 or sacrifice child
Jun 28 22:56:30 XXXXXXXXX2 kernel: [2235753.713521] Killed process 27073 (corosync) total-vm:261875792kB, anon-rss:255939160kB, file-rss:24760kB
Mar 21 22:19:17 XXXXXXXXX2 kernel: [956727.096937] Out of memory: Kill process 5422 (corosync) score 942 or sacrifice child
Mar 21 22:19:17 XXXXXXXXX2 kernel: [956727.191025] Killed process 5422 (corosync) total-vm:264643868kB, anon-rss:256189360kB, file-rss:33976kB
Apr 26 00:30:04 XXXXXXXXX2 kernel: [1017203.359940] Out of memory: Kill process 5183 (corosync) score 927 or sacrifice child
Apr 26 00:30:04 XXXXXXXXX2 kernel: [1017203.455015] Killed process 5183 (corosync) total-vm:271136904kB, anon-rss:251953372kB, file-rss:33760kB

Jun 29 09:00:02 XXXXXXXXX3 kernel: [2276334.347836] Out of memory: Kill process 24183 (corosync) score 937 or sacrifice child
Jun 29 09:00:02 XXXXXXXXX3 kernel: [2276334.444000] Killed process 24183 (corosync) total-vm:270476488kB, anon-rss:255257476kB, file-rss:32248kB
Mar 22 04:58:18 XXXXXXXXX3 kernel: [979377.041372] Out of memory: Kill process 5088 (corosync) score 941 or sacrifice child
Mar 22 04:58:18 XXXXXXXXX3 kernel: [979377.135414] Killed process 5088 (corosync) total-vm:265582012kB, anon-rss:255851792kB, file-rss:36000kB
Apr 26 09:26:02 XXXXXXXXX3 kernel: [1014911.175029] Out of memory: Kill process 5255 (corosync) score 925 or sacrifice child
Apr 26 09:26:02 XXXXXXXXX3 kernel: [1014911.270203] Killed process 5255 (corosync) total-vm:269154272kB, anon-rss:251736288kB, file-rss:35740kB
Jun 13 22:46:23 XXXXXXXXX3 kernel: [942502.987771] Out of memory: Kill process 5230 (corosync) score 940 or sacrifice child
Jun 13 22:46:23 XXXXXXXXX3 kernel: [942503.081826] Killed process 5230 (corosync) total-vm:265560916kB, anon-rss:256339740kB, file-rss:33788kB
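
To put the kernel figures in perspective, the "Killed process" lines can be parsed and the sizes converted to GiB. The following is a minimal Python sketch, not part of the original report. The names parse_oom_kill and kb_to_gib are hypothetical, and the regular expression assumes the exact line format shown in the excerpts above.

```python
import re

# Matches the "Killed process" lines written by the kernel OOM killer,
# assuming the format seen in the log excerpts in this report.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return a dict for a 'Killed process' line, or None if it does not match."""
    m = KILLED_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "name": m.group("name"),
        "total_vm_kb": int(m.group("total_vm")),
        "anon_rss_kb": int(m.group("anon_rss")),
        "file_rss_kb": int(m.group("file_rss")),
    }


def kb_to_gib(kb):
    """Convert kB (as reported by the kernel, 1 kB = 1024 bytes) to GiB."""
    return kb / (1024 * 1024)


if __name__ == "__main__":
    sample = (
        "Jun 13 20:36:35 XXXXXXXXX1 kernel: [929808.620411] Killed process 4846 "
        "(corosync) total-vm:267928256kB, anon-rss:257475632kB, file-rss:37816kB"
    )
    rec = parse_oom_kill(sample)
    print(rec["name"], rec["pid"], round(kb_to_gib(rec["anon_rss_kb"]), 1), "GiB")
```

For the first kill on XXXXXXXXX1, an anon-rss of 257475632 kB works out to roughly 245.5 GiB of anonymous memory held by corosync at the time it was killed.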

The memory leak was confirmed through analysis of atop logs, which showed corosync's memory utilization growing from 47% to 97% over the course of several days before the process was killed.
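
Where atop logs are not available, the same growth trend could be tracked by sampling the resident set size of the corosync process from /proc. The sketch below is illustrative only and was not used in this report. The function names are hypothetical, and it assumes a Linux /proc/<pid>/status file containing a VmRSS line.

```python
import re
import sys
import time


def parse_vmrss_kb(status_text):
    """Extract VmRSS in kB from the text of /proc/<pid>/status, or None if absent."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def read_rss_kb(pid):
    """Read the current resident set size (kB) of a running process."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kb(f.read())


def growth_kb_per_hour(samples):
    """Average RSS growth between the first and last sample.

    samples is a list of (unix_timestamp, rss_kb) tuples, oldest first.
    """
    if len(samples) < 2:
        return 0.0
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (r1 - r0) * 3600.0 / (t1 - t0)


if __name__ == "__main__":
    # Usage: python rss_watch.py <corosync pid> [interval seconds]
    pid = int(sys.argv[1])
    interval = int(sys.argv[2]) if len(sys.argv) > 2 else 300
    samples = []
    while True:
        samples.append((time.time(), read_rss_kb(pid)))
        print(samples[-1][1], "kB, growth",
              round(growth_kb_per_hour(samples)), "kB/hour")
        time.sleep(interval)
```

A steadily positive growth rate over days, with no plateau, would be consistent with the leak pattern described above.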

There are many memory leaks identified for the current version of corosync in MOS6.1:

# dpkg -l | grep corosync
ii corosync 2.3.4-0u~u14.04+mos1 amd64 Standards-based cluster framework (daemon and modules)
ii libcorosync-common4 2.3.4-0u~u14.04+mos1 amd64 Standards-based cluster framework, common library

Steps to reproduce:

Unsure how to reproduce at this point, as logging is not detailed enough. Will enable debug when possible.

Expected results:

Impact:

corosync has been OOM-killed repeatedly on all three nodes; it is unclear whether this has also occurred in other zones.

Environment description:

- Operating system: Ubuntu 14.04.2 LTS - 3.13.0-61-generic
- Versions of components:

# dpkg -l | egrep 'corosync|pacemaker'
ii corosync 2.3.4-0u~u14.04+mos1 amd64 Standards-based cluster framework (daemon and modules)
ii crmsh 2.1.0-1~u14.04+mos1 all CRM shell for the pacemaker cluster manager
ii libcorosync-common4 2.3.4-0u~u14.04+mos1 amd64 Standards-based cluster framework, common library
ii pacemaker 1.1.12-0u~u14.04+mos6.1 amd64 HA cluster resource manager
ii pacemaker-cli-utils 1.1.12-0u~u14.04+mos6.1 amd64 Command line interface utilities for Pacemaker

# uname -r
3.13.0-61-generic
- Reference architecture:
MOS6.1 - unable to provide more information due to restrictions, but at scale
- Network model:
Neutron+GRE+vlan
- Related projects installed:
N/A

Richard (rkuo)
description: updated
tags: added: customer-found
Revision history for this message
Dmitry Teselkin (teselkin-d) wrote :

@Richard, could you please test packages with patch?
https://drive.google.com/open?id=0B2LmvHYmdMhkYXlLa0lMWXJwSXc

Richard (rkuo) wrote :

Hello Dmitry,

Sorry for the late reply. The patch will have to go through the scrum team for testing. I will start the process, but it may take a long time.

Richard

Albert Syriy (asyriy) wrote :

The commit has been tested with custom ubuntu bvt_2
https://custom-ci.infra.mirantis.net/view/9.0/job/9.0.custom.ubuntu.bvt_2/112/

Dmitry Teselkin (teselkin-d) wrote :
tags: added: on-verification
Sergii Turivnyi (sturivnyi) wrote :

@rkuo: Could you please specify the steps to reproduce?

Sergii Turivnyi (sturivnyi) wrote :

@Richard: Could you please specify the steps to reproduce?

Timur Nurlygayanov (tnurlygayanov) wrote :

So we don't have steps to reproduce or a way to verify the issue. Let's change the status to Fix Released for MOS 9.1 unless the issue is reproduced again (meaning it was not fixed, or was fixed only partially). We believe the issue was fixed properly, but we don't know how to reproduce it.

Richard (rkuo) wrote :

@Sergii and @Timur, we are working on getting the client to test. The issue occurs reliably and very frequently at one site, although we are not sure what is triggering it.

Sergii Rizvan (srizvan) wrote :

Closed as Won't Fix for 6.1 because we no longer provide support for this version.

TatyanaGladysheva (tgladysheva) wrote :

We still don't have steps to reproduce or a way to verify the issue, so we'll move the bug to Fix Released status for MOS 7.0 + MU6 updates, as was done for 9.1.

The fix has been included in the 7.0 MU6 updates:

root@node-1:~# dpkg -l | grep 'corosync'
ii corosync 2.3.4-0u~u14.04+mos3 amd64 Standards-based cluster framework (daemon and modules)
ii libcorosync-common4 2.3.4-0u~u14.04+mos3 amd64 Standards-based cluster framework, common library

TatyanaGladysheva (tgladysheva) wrote :

We still don't have steps to reproduce or a way to verify the issue, so we'll move the bug to Fix Released status for MOS 8.0 + MU4 updates, as was done for 9.1.

The fix has been included in the 8.0 MU4 updates:

root@node-7:~# dpkg -l | grep 'corosync'
ii corosync 2.3.4-0u~u14.04+mos3 amd64 Standards-based cluster framework (daemon and modules)
ii libcorosync-common4 2.3.4-0u~u14.04+mos3 amd64 Standards-based cluster framework, common library
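
The package check shown in the dpkg output above could be scripted across nodes. The following Python sketch is an assumption-laden illustration, not an official verification tool. The function names are hypothetical, and it treats any "+mosN" suffix with N >= 3 as containing the fix, whereas this thread only confirms that +mos1 is affected and +mos3 includes the fix. For real version ordering, `dpkg --compare-versions` is the authoritative tool.

```python
import re


def parse_dpkg_line(line):
    """Return (package, version) from an installed ('ii') dpkg -l line, or None."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "ii":
        return None
    return fields[1], fields[2]


def mos_revision(version):
    """Return N from a trailing '+mosN' suffix, or None if there is no such suffix."""
    m = re.search(r"\+mos(\d+)$", version)
    return int(m.group(1)) if m else None


def has_fix(line, min_revision=3):
    """True if the dpkg line shows a +mos revision at or above min_revision.

    min_revision=3 is an assumption based on the versions quoted in this thread.
    """
    parsed = parse_dpkg_line(line)
    if parsed is None:
        return False
    rev = mos_revision(parsed[1])
    return rev is not None and rev >= min_revision


if __name__ == "__main__":
    line = ("ii corosync 2.3.4-0u~u14.04+mos3 amd64 "
            "Standards-based cluster framework (daemon and modules)")
    print(parse_dpkg_line(line), has_fix(line))
```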
