Memory Leak when new cluster configuration is formed.

Bug #1563089 reported by Jorge Niedbalski
This bug affects 1 person
Affects            Status        Importance  Assigned to       Milestone
corosync (Ubuntu)  Fix Released  High        Jorge Niedbalski
Trusty             Fix Released  High        Jorge Niedbalski
Wily               Won't Fix     High        Jorge Niedbalski

Bug Description

[Environment]

Trusty 14.04.3

Packages:

ii corosync 2.3.3-1ubuntu1 amd64 Standards-based cluster framework (daemon and modules)
ii libcorosync-common4 2.3.3-1ubuntu1 amd64 Standards-based cluster framework, common library

[Reproducer]

1) I deployed an HA environment using this bundle (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml)
with a 3-node installation of cinder related to an HACluster subordinate unit.

$ juju-deployer -c next-ha.yaml -w 600 trusty-kilo

2) I changed the default corosync transport mode to unicast.

$ juju set cinder-hacluster corosync_transport=udpu

3) I verified that the 3 units were quorate:

cinder/0# corosync-quorumtool
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
    Nodeid Votes Name
      1002 1 10.5.1.57 (local)
      1001 1 10.5.1.58
      1000 1 10.5.1.59

The primary unit was holding the VIP resource 10.5.105.1/16

root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP group default qlen 1000
    link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff
    inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0
       valid_lft forever preferred_lft forever

4) I manually added a TC queue for the eth0 interface on the node holding the VIP resource, introducing a 350 ms delay.

$ sudo tc qdisc add dev eth0 root netem delay 350ms

5) Right after adding the 350 ms delay on the cinder/0 unit, the corosync process reports that one of the processors failed and that it is forming a new
cluster configuration.

Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A processor failed, forming new configuration.
Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM] Members[3]: 1002 1001 1000
Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN ] Completed service synchronization, ready to provide service.

This happens on all of the units.
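To relate the memory growth to the number of reconfigurations, the TOTEM events shown above can be counted per unit. The following is only a sketch: `count_totem_events` is a hypothetical helper name, and the default log path `/var/log/syslog` is an assumption about where corosync messages end up on these units.

```shell
# Count corosync TOTEM reconfiguration events in a log file.
# count_totem_events is a hypothetical helper; the default path is an assumption.
count_totem_events() {
    local log=${1:-/var/log/syslog}
    local failed formed
    # grep -c prints 0 (and returns non-zero) when nothing matches,
    # so the counts are always numeric.
    failed=$(grep -c '\[TOTEM *] A processor failed' "$log")
    formed=$(grep -c '\[TOTEM *] A new membership' "$log")
    echo "failed=$failed formed=$formed"
}
```

Running `count_totem_events /var/log/syslog` on each unit before and after a test run gives a rough count of how many new configurations were formed.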

6) After receiving this message, I removed the queue from eth0:

$ sudo tc qdisc del dev eth0 root netem

Then the following messages were logged on the master node:

Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM] Members[3]: 1002 1001 1000
Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN ] Completed service synchronization, ready to provide service.
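Steps 4 to 6 can be repeated in a loop rather than by hand. This is a minimal sketch, not the exact procedure used in this report: `toggle_delay` and the `DRY_RUN` variable are hypothetical names, and the hold and pause durations are placeholder values that should be long enough for the "A processor failed" message to appear. It must be run as root unless `DRY_RUN=1` is set.

```shell
# Repeatedly add and remove a netem delay on an interface (steps 4 to 6).
# toggle_delay and DRY_RUN are hypothetical names used only in this sketch.
# Arguments: interface, delay, cycles, seconds to hold the delay, seconds to pause.
toggle_delay() {
    local dev=${1:-eth0}
    local delay=${2:-350ms}
    local cycles=${3:-5}
    local hold=${4:-60}    # placeholder duration, an assumption
    local pause=${5:-60}   # placeholder duration, an assumption
    local run=""
    # With DRY_RUN=1 the tc commands are printed instead of executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi
    local i=1
    while [ "$i" -le "$cycles" ]; do
        $run tc qdisc add dev "$dev" root netem delay "$delay"
        sleep "$hold"
        $run tc qdisc del dev "$dev" root netem
        sleep "$pause"
        i=$((i + 1))
    done
}
```

For example, `DRY_RUN=1 toggle_delay eth0 350ms 2 0 0` prints the tc commands for two cycles without touching the interface.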

7) While repeating steps 5 and 6:

root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms
root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem

I ran the following loop to track the VSZ and RSS memory usage of the corosync process:

$ while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done

The results show that both VSZ and RSS grow rapidly over time.

25476 4036

... (after 5 minutes).

135644 10352
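The growth between the first and last sample in memory-usage.log can be summarised with a small helper. This is a sketch under the assumption that each sample line holds exactly two integers (VSZ and RSS, which ps reports in KiB); `summarize_growth` is a hypothetical name, not part of corosync or procps.

```shell
# Print the VSZ and RSS difference between the first and last sample of a log.
# summarize_growth is a hypothetical helper; each sample line is assumed to be
# "<vsz> <rss>" in KiB, as written by the monitoring loop.
summarize_growth() {
    awk '
        # Only consider lines made of exactly two integers.
        NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            if (!seen) { v0 = $1; r0 = $2; seen = 1 }
            v1 = $1; r1 = $2
        }
        END {
            if (!seen) { print "no samples"; exit 1 }
            printf "vsz_delta_kib=%d rss_delta_kib=%d\n", v1 - v0, r1 - r0
        }
    ' "${1:-memory-usage.log}"
}
```

With the two samples quoted above as input, this reports a VSZ increase of 110168 KiB and an RSS increase of 6316 KiB.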

[Fix]

Based on this reproducer, my preliminary assessment is that this commit (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9)
is a good candidate to be backported to Ubuntu Trusty.

[Test Case]

* See reproducer

[Backport Impact]

* Not identified

summary: - Memory Leak when new configuration is formed.
+ Memory Leak when new cluster configuration is formed.
tags: added: sts-needs-review
description: updated
Changed in corosync (Ubuntu):
status: New → In Progress
Changed in corosync (Ubuntu Trusty):
status: New → In Progress
Changed in corosync (Ubuntu):
importance: Undecided → High
Changed in corosync (Ubuntu Trusty):
importance: Undecided → High
Changed in corosync (Ubuntu):
assignee: nobody → Jorge Niedbalski (niedbalski)
Changed in corosync (Ubuntu Trusty):
assignee: nobody → Jorge Niedbalski (niedbalski)
Changed in corosync (Ubuntu Wily):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Jorge Niedbalski (niedbalski)
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "Xenial Patch" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Jorge Niedbalski (niedbalski) wrote :
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package corosync - 2.3.5-3ubuntu1

---------------
corosync (2.3.5-3ubuntu1) xenial; urgency=high

  * debian/patches/Totempg-Fix-memory-leak.patch: Fixes memory leak on
    Totempg. (LP: #1563089).

 -- Jorge Niedbalski <email address hidden> Fri, 01 Apr 2016 15:52:13 +0200

Changed in corosync (Ubuntu):
status: In Progress → Fix Released
Louis Bouchard (louis) wrote :

Fix accepted to the development release.

Fix sponsored for Trusty and Wily; unsubscribing the ubuntu-sponsors team.

Chris J Arges (arges) wrote : Please test proposed package

Hello Jorge, or anyone else affected,

Accepted corosync into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/corosync/2.3.3-1ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in corosync (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Changed in corosync (Ubuntu Wily):
status: In Progress → Fix Committed
Chris J Arges (arges) wrote :

Hello Jorge, or anyone else affected,

Accepted corosync into wily-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/corosync/2.3.4-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Jorge Niedbalski (niedbalski) wrote :

Hello,

I ran the verification for the Trusty version.

root@juju-niedbalski-sec-machine-15:/home/ubuntu# dpkg -l|grep corosync
ii corosync 2.3.3-1ubuntu3 amd64 Standards-based cluster framework (daemon and modules)
ii libcorosync-common4 2.3.3-1ubuntu3 amd64 Standards-based cluster framework, common library

I configured a 3-node nova-cloud-controller environment related to hacluster.

ubuntu@niedbalski-sec-bastion:~/openstack-charm-testing/bundles/dev$ juju run --service nova-cloud-controller "sudo corosync-quorumtool -s|grep votes"
- MachineId: "15"
  Stdout: |
    Expected votes: 3
    Total votes: 3
  UnitId: nova-cloud-controller/0
- MachineId: "28"
  Stdout: |
    Expected votes: 3
    Total votes: 3
  UnitId: nova-cloud-controller/1
- MachineId: "29"
  Stdout: |
    Expected votes: 3
    Total votes: 3
  UnitId: nova-cloud-controller/2

I changed the transport mode to UDP unicast (udpu) by setting:

$ juju set hacluster-ncc corosync_transport=udpu

After this, I moved to the primary node (the one that holds the virtual IP address) and applied the tc
rules multiple times while monitoring the memory usage of the corosync process:

root@juju-niedbalski-sec-machine-15:/home/ubuntu# tc qdisc add dev eth0 root netem delay 550ms
root@juju-niedbalski-sec-machine-15:/home/ubuntu# tc qdisc del dev eth0 root netem

Apr 6 17:57:37 juju-niedbalski-sec-machine-15 cib[14387]: warning: cib_process_request: Completed cib_apply_diff operation for section 'all': Application of an update diff failed (rc=-206, origin=local/cibadmin/2, version=0.27.1)
Apr 6 18:04:12 juju-niedbalski-sec-machine-15 corosync[14376]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 6 18:04:13 juju-niedbalski-sec-machine-15 corosync[18645]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 6 18:06:27 juju-niedbalski-sec-machine-15 corosync[18645]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 6 18:06:28 juju-niedbalski-sec-machine-15 corosync[19528]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 6 18:07:48 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 6 18:07:49 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 6 18:08:16 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 6 18:08:59 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 6 18:09:38 juju-niedbalski-sec-machine-15 corosync[19985]: [MAIN ] Completed service synchronization, ready to provide service.

After observing the corosync process for 5 minutes using:

 $ while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done

I did not see any substantial increase in memory usage.

root@juju-niedbalski-sec-machine-15:/home/ubuntu# more me...


Jorge Niedbalski (niedbalski) wrote :

Based on my latest comment, I am marking the Trusty version as verification-done-trusty

tags: added: verification-done-trusty verification-needed-wily
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package corosync - 2.3.3-1ubuntu3

---------------
corosync (2.3.3-1ubuntu3) trusty; urgency=medium

  * debian/patches/Totempg-Fix-memory-leak.patch: Fixes memory leak on
    Totempg. (LP: #1563089).

 -- Jorge Niedbalski <email address hidden> Tue, 05 Apr 2016 10:01:37 +0200

Changed in corosync (Ubuntu Trusty):
status: Fix Committed → Fix Released
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for corosync has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Martin Pitt (pitti) wrote :

The wily update was not verified in half a year, and wily is almost EOL. So I removed the -proposed package.

Changed in corosync (Ubuntu Wily):
status: Fix Committed → Won't Fix