pacemaker-controld crash on double free

Bug #2012740 reported by Scati Labs I+D
This bug affects 1 person
Affects               Status         Importance   Assigned to          Milestone
pacemaker (Ubuntu)    Fix Released   Undecided    Unassigned
Jammy                 Fix Released   Undecided    Michał Małoszewski

Bug Description

[Impact]

* pacemaker-controld is Pacemaker's coordinator: it maintains a consistent view of the cluster membership and orchestrates all the other components.

* Users of MySQL clusters migrating from bionic to jammy reported a crash.

* This crash is caused by lrmd_dispatch_internal(), which assigns the exit_reason string directly from an XML node to a new lrmd_event_data_t object without duplicating it, so the same string gets freed twice. The fix is to make a copy of event.exit_reason in lrmd_dispatch_internal() before the callback.

[Test Plan]

lxc launch ubuntu:22.04 node1
lxc shell node1
  apt update && apt dist-upgrade -y
  apt install pcs mysql-server resource-agents -y
  echo hacluster:hacluster | chpasswd
  mysql -e "CREATE USER 'replicator'@'localhost'"
  mysql -e "GRANT RELOAD, PROCESS, SUPER, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'replicator'@'localhost'"
  systemctl disable mysql.service
  systemctl stop mysql.service
  exit
lxc copy node1 node2
lxc start node2
lxc shell node1
  pcs host auth node1 node2 -u hacluster -p hacluster
  pcs cluster setup --force mysqlclx node1 node2 transport udpu
  pcs cluster enable --all
  pcs cluster start --all
  pcs property set stonith-enabled=false
  pcs property set no-quorum-policy=ignore
  pcs resource create p_mysql ocf:heartbeat:mysql \
    replication_user=replicator \
    test_user=root \
    op demote interval=0s timeout=120 monitor interval=20 timeout=30 monitor \
    interval=10 role=Master timeout=30 monitor interval=30 role=Slave timeout=30 \
    notify interval=0s timeout=90 promote interval=0s timeout=120 start \
    interval=0s timeout=120 stop interval=0s timeout=120 meta notify=true
  pcs resource promotable p_mysql p_mysql-master notify=true

Example of failed output:

There is a crash file in /var/crash/ on at least one of the nodes.

Example of successful output:

No crash file in /var/crash/ on any node.
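The pass/fail check above can be made explicit with a small script run on each node. This is only a minimal sketch: it assumes Apport writes its reports as *.crash files under /var/crash, and the function name has_crash_files is made up for this example.

```shell
#!/bin/sh
# has_crash_files DIR: succeed (exit status 0) if DIR contains at least one
# *.crash file. Hypothetical helper; /var/crash is the default only because
# the test plan names that directory.
has_crash_files() {
    dir="${1:-/var/crash}"
    for f in "$dir"/*.crash; do
        # If nothing matches, the pattern stays literal and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Report the state of the current node.
if has_crash_files /var/crash; then
    echo "FAIL: crash file(s) present in /var/crash"
    ls -l /var/crash
else
    echo "PASS: no crash file in /var/crash"
fi
```

Running it on both node1 and node2 covers the "on at least one of the nodes" wording of the failure case.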

[Where problems could occur]

* The patch itself modifies only the lrmd code, so regressions should be limited to the behavior of lrmd.

* Since the code changes affect event dispatching and memory allocation, potential regressions would most likely be related to those areas.

---------------------------------original report--------------------------

After migrating a mysql cluster from bionic to jammy (pacemaker 2.1.2-1ubuntu3), pacemaker started to malfunction because of pacemaker-controld crashes. It is easy to reproduce by putting the promoted node in standby.

The Apport crash view has been attached. It is the same bug that was reported to Red Hat: https://bugzilla.redhat.com/show_bug.cgi?id=2039675

It was fixed in this commit: https://github.com/ClusterLabs/pacemaker/commit/ed8b2c86ab77aaa3d7fd688c049ad5e1b922a9c6

Please provide an update for pacemaker, because it is unusable this way.

Scati Labs I+D (scatilabs) wrote :
Athos Ribeiro (athos-ribeiro) wrote :

Thanks for taking the time to report this bug.

The issue was introduced in 2.1.2 by https://github.com/ClusterLabs/pacemaker/commit/31c7fa8a3a9c72c05bafdac1841c1c0c5f003797.

As mentioned, it was fixed in 2.1.3 by https://github.com/ClusterLabs/pacemaker/commit/ed8b2c86ab77aaa3d7fd688c049ad5e1b922a9c6.

Therefore, only jammy is affected.

A workaround provided in the Red Hat bug is to remove the notify=true entry.

Changed in pacemaker (Ubuntu):
status: New → Triaged
tags: added: bitesize server-todo
Changed in pacemaker (Ubuntu Jammy):
status: New → Triaged
Changed in pacemaker (Ubuntu):
status: Triaged → Fix Released
Changed in pacemaker (Ubuntu Jammy):
assignee: nobody → Michał Małoszewski (michal-maloszewski99)
Michał Małoszewski (michal-maloszewski99) wrote :

Hello Scati Labs I+D!
Thank you for your report and the effort.
I will take care of that bug.
The SRU process requires an easy to follow test case be documented with the bug, to allow it to be easily validated. Could you please assist us with writing it? You can see an example of what we're looking for from the [Test Case] in this bug report:

https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1988224

Thank you in advance

Scati Labs I+D (scatilabs) wrote :

Hi!

Sure, here we go.

[Test Case]

lxc launch ubuntu:22.04 node1
lxc shell node1
  apt update && apt dist-upgrade -y
  apt install pcs mysql-server resource-agents -y
  echo hacluster:hacluster | chpasswd
  mysql -e "CREATE USER 'replicator'@'localhost'"
  mysql -e "GRANT RELOAD, PROCESS, SUPER, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'replicator'@'localhost'"
  systemctl disable mysql.service
  systemctl stop mysql.service
  exit
lxc copy node1 node2
lxc start node2
lxc shell node1
  pcs host auth node1 node2 -u hacluster -p hacluster
  pcs cluster setup --force mysqlclx node1 node2 transport udpu
  pcs cluster enable --all
  pcs cluster start --all
  pcs property set stonith-enabled=false
  pcs property set no-quorum-policy=ignore
  pcs resource create p_mysql ocf:heartbeat:mysql \
    replication_user=replicator \
    test_user=root \
    op demote interval=0s timeout=120 monitor interval=20 timeout=30 monitor \
    interval=10 role=Master timeout=30 monitor interval=30 role=Slave timeout=30 \
    notify interval=0s timeout=90 promote interval=0s timeout=120 start \
    interval=0s timeout=120 stop interval=0s timeout=120 meta notify=true
  pcs resource promotable p_mysql p_mysql-master notify=true

[/Test Case]

After that, there should be a crash file in /var/crash/ on at least one of the nodes.

In order to force the crash again:

* Check with crm_mon which node is promoted.
* Run: pcs node standby <promotednodename>
* Check crm_mon again until the node is in standby mode (without resources running).
** This will take a really long time.
* A crash file will be created on the other node.
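The steps above could be scripted roughly as follows, run on the node that stays active. This is a sketch under assumptions, not a tested procedure: the retry_until and crash_present helpers, the PROMOTED_NODE variable, the force-crash.sh file name and the 30-minute polling window are all invented for illustration, and the promoted node's name still has to be read from crm_mon by hand.

```shell
#!/bin/sh
# retry_until TRIES DELAY CMD...: run CMD until it succeeds, at most TRIES
# times, sleeping DELAY seconds after each failed attempt. Returns 0 on
# success, 1 if every attempt failed. Hypothetical helper, not part of pcs
# or Pacemaker.
retry_until() {
    tries="$1"
    delay="$2"
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# crash_present: succeed if /var/crash holds at least one *.crash file.
crash_present() {
    ls /var/crash/*.crash >/dev/null 2>&1
}

# Only act when the promoted node's name is supplied, for example:
#   PROMOTED_NODE=node1 sh force-crash.sh
# Look the name up first with: crm_mon -1
if [ -n "${PROMOTED_NODE:-}" ]; then
    pcs node standby "$PROMOTED_NODE"
    # The standby transition is slow, so poll for up to 30 minutes (assumed).
    if retry_until 180 10 crash_present; then
        echo "crash file found in /var/crash"
    else
        echo "no crash file appeared within the polling window"
    fi
fi
```

Crash reports left over from an earlier run would satisfy the check immediately, so move them out of /var/crash before starting.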

Hope it helps. Kind regards.

Michał Małoszewski (michal-maloszewski99) wrote :

Perfect! Thank you so much

Michał Małoszewski (michal-maloszewski99) wrote :

Reproduced the test case. Starting to fix the issue.

Christian Ehrhardt (paelzer) wrote :

Fixed some small issues in the MR and the SRU description, LGTM now - sponsoring.

description: updated
Christian Ehrhardt (paelzer) wrote :

thanks, sponsored

description: updated
Changed in pacemaker (Ubuntu Jammy):
status: Triaged → Fix Committed
status: Fix Committed → In Progress
Steve Langasek (vorlon) wrote : Please test proposed package

Hello Scati, or anyone else affected,

Accepted pacemaker into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/pacemaker/2.1.2-1ubuntu3.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in pacemaker (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Michał Małoszewski (michal-maloszewski99) wrote :

Fix works, 2.1.2-1ubuntu3.1 fixes the bug.

I've created the jammy container using steps from the [Test Plan] section listed above in the Bug Description and inside that container I typed in:

$ apt policy pacemaker

The output:

pacemaker:
  Installed: 2.1.2-1ubuntu3
  Candidate: 2.1.2-1ubuntu3.1
  Version table:
     2.1.2-1ubuntu3.1 500
        500 http://archive.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
 *** 2.1.2-1ubuntu3 500
        500 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages
        100 /var/lib/dpkg/status

Then I repeated the steps from the [Test Plan] section.

I've noticed that nothing has changed there, so the problem still existed, because, as the output above shows, the installed package version was not the one containing the fix.

Then I've upgraded pacemaker using:
$ apt install pacemaker=2.1.2-1ubuntu3.1

Later I typed in:

$ apt policy pacemaker
to check whether the installed version had changed. The output shows that the new version (with the fix) is installed:

pacemaker:
  Installed: 2.1.2-1ubuntu3.1
  Candidate: 2.1.2-1ubuntu3.1
  Version table:
 *** 2.1.2-1ubuntu3.1 500
        500 http://archive.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     2.1.2-1ubuntu3 500
        500 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages

Finally, when I repeated the steps from the [Test Plan], the problem no longer occurred: there was no crash file in /var/crash/. So the fix works.

tags: added: verification-done-jammy
removed: verification-needed-jammy
tags: added: verification-done
removed: verification-needed
Scati Labs I+D (scatilabs) wrote :

Hi, verified on our side too. Thanks!

Andreas Hasenack (ahasenack) wrote :

I verified the test results and am satisfied that they show the executed planned test case, and that the results are correct.

In such types of test plans, I would suggest that in the future you also add output showing whether the relevant artifact you are looking for is there or not. For example, in the case of crash files, show that they exist when expected, and don't exist when not expected. I would also have looked at log files, either from systemd, or the daemon itself, which should have some sort of indication that a crash happened.

That's better than just "I've noticed that nothing has changed there, so the problem still existed", for example.
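As a rough illustration of that kind of evidence, something like the following could be run before and after the upgrade, with the output pasted into the verification comment. It is only a sketch: record_evidence is a made-up helper name, and it assumes the systemd unit is called pacemaker and that crash reports land in /var/crash.

```shell
#!/bin/sh
# record_evidence DIR: print the installed pacemaker version and the contents
# of the crash directory. Hypothetical helper for collecting SRU evidence.
record_evidence() {
    dir="${1:-/var/crash}"
    echo "== pacemaker version =="
    dpkg-query -W -f='${Version}\n' pacemaker 2>/dev/null \
        || echo "(pacemaker not installed)"
    echo "== contents of $dir =="
    ls -l "$dir" 2>/dev/null || echo "(no such directory)"
}

record_evidence /var/crash

# Recent messages from the pacemaker unit, if systemd's journal is available.
# The one-hour window and 50-line limit are arbitrary choices.
if command -v journalctl >/dev/null 2>&1; then
    echo "== journal for the pacemaker unit (last hour) =="
    journalctl -u pacemaker --no-pager --since "1 hour ago" 2>/dev/null | tail -n 50
fi
```

Comparing the two runs shows the crash file present with 2.1.2-1ubuntu3 and absent with 2.1.2-1ubuntu3.1, rather than leaving the reader to infer it.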

The package built correctly in all architectures and Ubuntu releases it was meant for.

There are no DEP8 regressions.

There is no SRU freeze ongoing at the moment.

There is no halted phasing on the previous update.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 2.1.2-1ubuntu3.1

---------------
pacemaker (2.1.2-1ubuntu3.1) jammy; urgency=medium

  * d/p/jammy-avoid-double-free-during-notify-operation.patch:
    Fix a regression introduced by 31c7fa8, causing a double-free in
    notify operations (LP: #2012740)

 -- Michal Maloszewski <email address hidden> Fri, 31 Mar 2023 20:55:24 +0200

Changed in pacemaker (Ubuntu Jammy):
status: Fix Committed → Fix Released
Andreas Hasenack (ahasenack) wrote : Update Released

The verification of the Stable Release Update for pacemaker has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
