Ubuntu
corosync package

Memory Leak when new cluster configuration is formed.

Bug #1563089 reported by Jorge Niedbalski on 2016-03-28

This bug affects 1 person

	Status	Importance	Assigned to
corosync (Ubuntu)	Fix Released	High	Jorge Niedbalski
Trusty	Fix Released	High	Jorge Niedbalski
Wily	Won't Fix	High	Jorge Niedbalski

Bug Description

[Environment]

Trusty 14.04.3

Packages:

ii corosync 2.3.3-1ubuntu1 amd64 Standards-based cluster framework (daemon and modules)
ii libcorosync-common4 2.3.3-1ubuntu1 amd64 Standards-based cluster framework, common library

[Reproducer]

1) I deployed an HA environment using this bundle (http://bazaar.launchpad.net/~ost-maintainers/openstack-charm-testing/trunk/view/head:/bundles/dev/next-ha.yaml)
with a 3-node installation of cinder related to an HACluster subordinate unit.

$ juju-deployer -c next-ha.yaml -w 600 trusty-kilo

2) I changed the default corosync transport mode to unicast.

$ juju set cinder-hacluster corosync_transport=udpu

3) I verified that the 3 units were quorate:

cinder/0# corosync-quorumtool
Votequorum information
----------------------
Expected votes: 3
Highest expected: 3
Total votes: 3
Quorum: 2
Flags: Quorate

Membership information
----------------------
    Nodeid Votes Name
      1002 1 10.5.1.57 (local)
      1001 1 10.5.1.58
      1000 1 10.5.1.59
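
The quorum check in step 3 can also be scripted. The following is a minimal sketch and not part of the original reproducer: the function name is_quorate is hypothetical, and it assumes the "Flags:" line format shown in the output above.

```shell
# Hypothetical helper: reads corosync-quorumtool output on stdin and
# succeeds only if the "Flags:" line contains the word "Quorate".
is_quorate() {
    awk '$1 == "Flags:" { for (i = 2; i <= NF; i++) if ($i == "Quorate") ok = 1 }
         END { exit !ok }'
}

# Usage on a cluster node (assumes corosync-quorumtool is installed):
#   corosync-quorumtool | is_quorate && echo "node is quorate"
```

Parsing the text output keeps the check independent of the tool's exit code, whose meaning is not described in this report.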

The primary unit was holding the VIP resource 10.5.105.1/16

root@juju-niedbalski-sec-machine-4:/home/ubuntu# ip addr
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc netem state UP group default qlen 1000
    link/ether fa:16:3e:d2:19:6f brd ff:ff:ff:ff:ff:ff
    inet 10.5.1.57/16 brd 10.5.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.5.105.1/16 brd 10.5.255.255 scope global secondary eth0
       valid_lft forever preferred_lft forever

4) I manually added a tc netem qdisc to the eth0 interface on the node holding the VIP resource, introducing a 350 ms delay.

$ sudo tc qdisc add dev eth0 root netem delay 350ms

5) Right after adding the 350 ms delay on the cinder/0 unit, the corosync process reports that one of the processors failed and that it is forming a new
cluster configuration.

Mar 28 21:57:41 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A processor failed, forming new configuration.
Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [QUORUM] Members[3]: 1002 1001 1000
Mar 28 22:00:48 juju-niedbalski-sec-machine-5 corosync[4584]: [MAIN ] Completed service synchronization, ready to provide service.

This happens on all of the units.

6) After receiving this message, I removed the queue from eth0:

$ sudo tc qdisc del dev eth0 root netem
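
Since steps 5 and 6 have to be repeated several times, the add/remove cycle can be wrapped in a small loop. This is only a sketch under stated assumptions: the function name toggle_netem, the RUN variable, and the hold-time argument are hypothetical and do not come from the original report. By default the commands are only printed (dry run).

```shell
# Hypothetical helper: add and then remove the netem delay N times.
# Arguments: device, delay, number of rounds, seconds to hold each state.
# RUN defaults to "echo" so the tc commands are only printed;
# set RUN=sudo to really execute them.
toggle_netem() {
    dev=$1; delay=$2; rounds=$3; hold=${4:-0}
    i=0
    while [ "$i" -lt "$rounds" ]; do
        ${RUN:-echo} tc qdisc add dev "$dev" root netem delay "$delay"
        sleep "$hold"
        ${RUN:-echo} tc qdisc del dev "$dev" root netem
        sleep "$hold"
        i=$((i + 1))
    done
}

# Dry run:  toggle_netem eth0 350ms 2
# Real run: RUN=sudo toggle_netem eth0 350ms 5 200
```

The hold time of 200 seconds in the usage line is a guess based on the roughly three minutes between the "processor failed" and "new membership" messages in the log above; it likely needs tuning per environment.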

Then the following messages are logged on the primary node:

Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [TOTEM ] A new membership (10.5.1.57:11628) was formed. Members
Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [QUORUM] Members[3]: 1002 1001 1000
Mar 28 22:00:48 juju-niedbalski-sec-machine-4 corosync[9630]: [MAIN ] Completed service synchronization, ready to provide service.
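
To correlate memory growth with the number of reconfigurations, the "new membership" messages can be counted. A minimal sketch, assuming the log line format shown above; the function name count_memberships is hypothetical, and the log path in the usage line (/var/log/syslog) is an assumption about where corosync logs on the node.

```shell
# Hypothetical helper: count how many times corosync reported that a
# new membership was formed in the given log file.
count_memberships() {
    grep -c '\[TOTEM *] A new membership' "$1"
}

# Usage (log path is an assumption):
#   count_memberships /var/log/syslog
```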

7) While repeating steps 5 and 6:

root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc add dev eth0 root netem delay 350ms
root@juju-niedbalski-sec-machine-4:/home/ubuntu# tc qdisc del dev eth0 root netem

I ran the following loop to track the VSZ and RSS memory usage of the corosync process:

$ while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done

The results show that both VSZ and RSS grow quickly over time:

25476 4036

... (after 5 minutes).

135644 10352
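
The growth between the first and last sample can be computed directly from memory-usage.log. This is a minimal sketch; the function name mem_growth is hypothetical, and it assumes each log line holds exactly two numeric columns (VSZ and RSS, which ps reports in KiB).

```shell
# Hypothetical helper: print the VSZ and RSS difference (in KiB) between
# the first and the last "vsz rss" sample in the given log file.
mem_growth() {
    awk 'NF == 2 { if (!seen) { v0 = $1; r0 = $2; seen = 1 } v1 = $1; r1 = $2 }
         END { if (seen) printf "vsz_delta_kib=%d rss_delta_kib=%d\n", v1 - v0, r1 - r0 }' "$1"
}

# Usage:
#   mem_growth memory-usage.log
```

For the two samples quoted above, this yields a VSZ delta of 110168 KiB and an RSS delta of 6316 KiB.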

[Fix]

Based on this reproducer, my preliminary conclusion is that this commit (https://github.com/corosync/corosync/commit/600fb4084adcbfe7678b44a83fa8f3d3550f48b9)
is a good candidate to be backported to Ubuntu Trusty.

[Test Case]

* See reproducer

[Backport Impact]

* Not identified

Jorge Niedbalski (niedbalski) on 2016-03-29

summary:	- Memory Leak when new configuration is formed.
	+ Memory Leak when new cluster configuration is formed.
tags:	added: sts-needs-review

Jorge Niedbalski (niedbalski) on 2016-03-30

description:	updated

Jorge Niedbalski (niedbalski) on 2016-03-30

Changed in corosync (Ubuntu):
status:	New → In Progress
Changed in corosync (Ubuntu Trusty):
status:	New → In Progress
Changed in corosync (Ubuntu):
importance:	Undecided → High
Changed in corosync (Ubuntu Trusty):
importance:	Undecided → High
Changed in corosync (Ubuntu):
assignee:	nobody → Jorge Niedbalski (niedbalski)
Changed in corosync (Ubuntu Trusty):
assignee:	nobody → Jorge Niedbalski (niedbalski)

Jorge Niedbalski (niedbalski) on 2016-03-30

Changed in corosync (Ubuntu Wily):
status:	New → In Progress
importance:	Undecided → High
assignee:	nobody → Jorge Niedbalski (niedbalski)

Ubuntu Foundations Team Bug Bot (crichton) wrote on 2016-03-30:

The attachment "Xenial Patch" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags:

added: patch

Jorge Niedbalski (niedbalski) wrote on 2016-03-31:

Xenial Patch (5.9 KiB, text/plain)

Jorge Niedbalski (niedbalski) wrote on 2016-04-01:

Wily Patch (5.0 KiB, text/plain)

Jorge Niedbalski (niedbalski) wrote on 2016-04-01:

Trusty Patch (5.2 KiB, text/plain)

Launchpad Janitor (janitor) wrote on 2016-04-04:

This bug was fixed in the package corosync - 2.3.5-3ubuntu1

---------------
corosync (2.3.5-3ubuntu1) xenial; urgency=high

* debian/patches/Totempg-Fix-memory-leak.patch: Fixes memory leak on
Totempg. (LP: #1563089).

-- Jorge Niedbalski <email address hidden> Fri, 01 Apr 2016 15:52:13 +0200

Changed in corosync (Ubuntu):
status:	In Progress → Fix Released

Louis Bouchard (louis) wrote on 2016-04-05:

Fix accepted to the development release.

Fix sponsored for Trusty and wily; unsubscribing the Ubuntu-sponsor team

Chris J Arges (arges) wrote on 2016-04-06: Please test proposed package

Hello Jorge, or anyone else affected,

Accepted corosync into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/corosync/2.3.3-1ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in corosync (Ubuntu Trusty):
status:	In Progress → Fix Committed
tags:	added: verification-needed
Changed in corosync (Ubuntu Wily):
status:	In Progress → Fix Committed

Chris J Arges (arges) wrote on 2016-04-06:

Hello Jorge, or anyone else affected,

Accepted corosync into wily-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/corosync/2.3.4-0ubuntu2 in a few hours, and then in the -proposed repository.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Jorge Niedbalski (niedbalski) wrote on 2016-04-06:

#10

Hello,

I ran the verification for the Trusty version.

root@juju-niedbalski-sec-machine-15:/home/ubuntu# dpkg -l|grep corosync
ii  corosync                         2.3.3-1ubuntu3                        amd64        Standards-based cluster framework (daemon and modules)
ii  libcorosync-common4              2.3.3-1ubuntu3                        amd64        Standards-based cluster framework, common library

I configured a 3-node nova-cloud-controller environment related to hacluster.

ubuntu@niedbalski-sec-bastion:~/openstack-charm-testing/bundles/dev$ juju run --service nova-cloud-controller "sudo corosync-quorumtool -s|grep votes"
- MachineId: "15"
  Stdout: |
    Expected votes:   3
    Total votes:      3
  UnitId: nova-cloud-controller/0
- MachineId: "28"
  Stdout: |
    Expected votes:   3
    Total votes:      3
  UnitId: nova-cloud-controller/1
- MachineId: "29"
  Stdout: |
    Expected votes:   3
    Total votes:      3
  UnitId: nova-cloud-controller/2

I changed the transport mode to UDP unicast by setting:

$ juju set hacluster-ncc corosync_transport=udpu

After this, I moved to the primary node (the one that holds the virtual IP address) and repeatedly applied and removed the tc
rule while monitoring the memory usage of the corosync process:

root@juju-niedbalski-sec-machine-15:/home/ubuntu# tc qdisc add dev eth0 root netem delay 550ms
root@juju-niedbalski-sec-machine-15:/home/ubuntu# tc qdisc del dev eth0 root netem

Apr  6 17:57:37 juju-niedbalski-sec-machine-15 cib[14387]:  warning: cib_process_request: Completed cib_apply_diff operation for section 'all': Application of an update diff failed (rc=-206, origin=local/cibadmin/2, version=0.27.1)
Apr  6 18:04:12 juju-niedbalski-sec-machine-15 corosync[14376]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:04:13 juju-niedbalski-sec-machine-15 corosync[18645]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:06:27 juju-niedbalski-sec-machine-15 corosync[18645]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:06:28 juju-niedbalski-sec-machine-15 corosync[19528]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:07:48 juju-niedbalski-sec-machine-15 corosync[19985]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:07:49 juju-niedbalski-sec-machine-15 corosync[19985]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:08:16 juju-niedbalski-sec-machine-15 corosync[19985]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:08:59 juju-niedbalski-sec-machine-15 corosync[19985]:  [MAIN  ] Completed service synchronization, ready to provide service.
Apr  6 18:09:38 juju-niedbalski-sec-machine-15 corosync[19985]:  [MAIN  ] Completed service synchronization, ready to provide service.

After 5 minutes of observing the corosync process with:

$ while true; do ps -o vsz,rss -p $(pgrep corosync) 2>&1 | grep -E '.*[0-9]+.*' | tee -a memory-usage.log && sleep 1; done

I don't see any substantial increase in memory usage.

root@juju-niedbalski-sec-machine-15:/home/ubuntu# more memory-usage.log 
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
135584  3928
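
A flat log like the one above can be confirmed without reading every line by counting the distinct samples. This is a minimal sketch; the function name distinct_samples is hypothetical, and it assumes two numeric columns per line as in memory-usage.log.

```shell
# Hypothetical helper: count the distinct "vsz rss" samples in a log.
# A result of 1 means memory usage never changed during the run.
distinct_samples() {
    awk 'NF == 2 { print $1, $2 }' "$1" | sort -u | wc -l | tr -d ' '
}

# Usage:
#   distinct_samples memory-usage.log
```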

Jorge Niedbalski (niedbalski) wrote on 2016-04-06:

#11

Based on my latest comment, I am marking the Trusty version as verification-done-trusty

tags:

added: verification-done-trusty verification-needed-wily
removed: verification-needed

Launchpad Janitor (janitor) wrote on 2016-04-13:

#12

This bug was fixed in the package corosync - 2.3.3-1ubuntu3

---------------
corosync (2.3.3-1ubuntu3) trusty; urgency=medium

* debian/patches/Totempg-Fix-memory-leak.patch: Fixes memory leak on
Totempg. (LP: #1563089).

-- Jorge Niedbalski <email address hidden> Tue, 05 Apr 2016 10:01:37 +0200

Changed in corosync (Ubuntu Trusty):
status:	Fix Committed → Fix Released

Chris J Arges (arges) wrote on 2016-04-13: Update Released

#13

The verification of the Stable Release Update for corosync has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Martin Pitt (pitti) wrote on 2016-07-06:

#14

The wily update was not verified in half a year, and wily is almost EOL. So I removed the -proposed package.

Changed in corosync (Ubuntu Wily):
status:	Fix Committed → Won't Fix
