drbd-utils timeout issue with Pacemaker

Bug #2043817 reported by Josua
This bug affects 4 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| drbd-utils (Ubuntu) | Fix Released | Undecided | Unassigned | |
| Jammy | Fix Released | Undecided | Sergio Durigan Junior | |

Bug Description

[ Impact ]

When using drbd-utils in a cluster, and after corosync and pacemaker services are gracefully stopped on one node, it takes more than 2 minutes for the resources to migrate to one of the other nodes.

[ Test Plan ]

Testing this bug requires setting up a cluster with corosync/pacemaker, which can be somewhat complex to achieve. Instead, we will rely on the reporter's setup to test and verify that the fix works.

[ Where problems could occur ]

The patches being backported are fairly simple, but they do modify the logic behind the timeout calculation. Without the patch, a user would expect drbd-utils to always specify N *seconds* to crmadmin as the timeout, whereas with the patch drbd-utils specifies N *milliseconds* instead. A user who expects the timeout to be three orders of magnitude (1000x) higher than what this fix proposes may well have made decisions based on that timeout. On the other hand, it is hard to think of a scenario where having the resources migrate more quickly to a healthy node would cause problems for a setup.

[ Other Info ]

The problem may seem complex, but it's actually simple: drbd-utils tries to determine whether it can use the suffix "ms" when specifying the timeout to crmadmin. Versions of crmadmin <= 2.0.x expected the timeout to be specified as a number which was always interpreted as milliseconds, but that changed with crmadmin >= 2.1.0, which expects the timeout to be a timespec (meaning that the "ms" suffix must be present). When no time unit is specified, the timeout is interpreted as seconds:

https://github.com/ClusterLabs/pacemaker/blob/main/lib/common/utils.c#L271

For this reason, we need to patch drbd-utils and make it correctly append the "ms" suffix whenever specifying the timeout via crmadmin.
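
To make the unit mismatch concrete, here is a minimal shell sketch. It is not the actual drbd-utils patch: `effective_timeout_ms` is a simplified model of the newer crmadmin behaviour described above (bare number means seconds), `with_ms_suffix` is a hypothetical helper, and the value 200 is only an illustration, not the timeout drbd-utils really passes.

```shell
# Simplified model of how a newer crmadmin (>= 2.1.0, per the description
# above) reads a timeout argument. Only bare numbers, "ms" and "s" are
# handled here; the real parser accepts more units.
effective_timeout_ms() {
    spec=$1
    case $spec in
        *ms) printf '%s\n' "${spec%ms}" ;;               # explicit milliseconds
        *s)  printf '%s\n' "$(( ${spec%s} * 1000 ))" ;;  # explicit seconds
        *)   printf '%s\n' "$(( $spec * 1000 ))" ;;      # no unit: seconds
    esac
}

# Hypothetical helper: append "ms" to a bare number, leave anything else alone.
with_ms_suffix() {
    case $1 in
        ''|*[!0-9]*) printf '%s\n' "$1" ;;     # empty, or already carries a unit
        *)           printf '%s\n' "${1}ms" ;;
    esac
}

effective_timeout_ms 200                      # prints 200000 (200 read as seconds)
effective_timeout_ms "$(with_ms_suffix 200)"  # prints 200
```

The bare value ends up 1000 times larger than intended, which is consistent with a failover that should take seconds stretching past the resource timeout.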

[ Original Description ]

[Description]

We ran into an issue on an Ubuntu Jammy Failover setup with Corosync, Pacemaker and DRBD.

If the pacemaker and corosync services are gracefully stopped (systemctl stop pacemaker.service corosync.service) on one node, it takes more than 2 minutes for the resources to migrate to the second node.
There is a global default timeout for pacemaker resources of 20 seconds, and we always ran into that threshold, causing the failover to stall. After about 90 seconds, the resources are migrated.

We're seeing this issue only with Ubuntu Jammy, not on Bionic or Focal, where we use identical configurations and the graceful resource failover happens in about 2 seconds.

Note: this issue does not occur when migrating resources with "crm resource move".

During our debugging, we noticed in the changelog of the drbd-utils package the following entry:
https://github.com/LINBIT/drbd-utils/blob/fc49473cde48a9b2bb645ad042abfc56ce2e2e2f/ChangeLog#L89
```
  9.20.2
  -----------
  * crm-fence-peer: fix timeout with Pacemaker 2.0.5
```

Although Ubuntu Jammy ships a more recent version of Pacemaker, we still gave it a try and replaced the drbd-utils package with the newest version, 9.26.0, from this PPA: https://launchpad.net/~linbit/+archive/ubuntu/linbit-drbd9-stack.
With the new version, and without changing any settings, the failover works perfectly and nearly instantly, as we are used to on the Bionic and Focal setups. No timeouts occur.

[Test environment]

DistroRelease: Ubuntu 22.04.3 LTS
Uname: Linux 5.15.0-88-generic
Architecture: amd64

corosync:
  Installed: 3.1.6-1ubuntu1
  Candidate: 3.1.6-1ubuntu1

pacemaker:
  Installed: 2.1.2-1ubuntu3.1
  Candidate: 2.1.2-1ubuntu3.1

drbd-utils:
  Installed: 9.15.0-1build2
  Candidate: 9.15.0-1build2
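
The listing above has the shape of `apt-cache policy` output (an assumption; the report does not name the command used). A small sketch for pulling out just the installed version, with `installed_version` being a hypothetical helper name:

```shell
# Print the "Installed:" version from `apt-cache policy <pkg>`-style text on stdin.
installed_version() {
    awk '$1 == "Installed:" { print $2; exit }'
}

# Example using the drbd-utils entry from this report, fed in as text:
installed_version <<'EOF'
drbd-utils:
  Installed: 9.15.0-1build2
  Candidate: 9.15.0-1build2
EOF
# prints 9.15.0-1build2

# On a live system (not run here):
#   apt-cache policy drbd-utils | installed_version
```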

[Steps to reproduce]

* Create 2 node cluster running Ubuntu 22.04 with Corosync, DRBD and Pacemaker
* Create a replicated DRBD volume
* Create a CRM resource that uses the DRBD volume
* Stop Pacemaker and Corosync service on promoted node to trigger a failover
* Monitor the failover with `crm_mon -rf` and `tail -f /var/log/pacemaker/pacemaker.log`

The failover should be done in about 2 seconds, but because of the bug in drbd-utils 9.15, it fails as it runs into the timeout. If the timeout is increased, the failover takes a considerable amount of time, usually more than 90 seconds.
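
The last two steps can be sketched as a small script. The commands are the ones quoted in this report; the `run` wrapper, the `DRY_RUN` variable and the function names are hypothetical conveniences so the steps can be reviewed without touching a real cluster.

```shell
# Run a command, or only print it when DRY_RUN is set (hypothetical convention).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# On the promoted node: stop both services gracefully to trigger the failover.
trigger_failover() {
    run systemctl stop pacemaker.service corosync.service
}

# On the other node: watch the resources move.
watch_failover() {
    run crm_mon -rf
}

# Dry run, safe on any machine:
( DRY_RUN=1; trigger_failover )
# prints: + systemctl stop pacemaker.service corosync.service
```

Without `DRY_RUN`, `trigger_failover` really stops the services, so it should only be called on the node that is meant to give up its resources.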

[Solution]

Backport the fix for the Pacemaker fencing issue to drbd-utils:
https://github.com/LINBIT/drbd-utils/commit/3eec04bc65b39b04be21d2689568892e7788abb3

Alternatively, provide a newer version of drbd-utils that already includes this fix.


Josua (josua-bryner)
description: updated
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in drbd-utils (Ubuntu):
status: New → Confirmed
Sergio Durigan Junior (sergiodj) wrote :

Hello Josua,

Thank you for taking the time to report a bug and improve Ubuntu. Also, thank you for the excellent description of the bug!

Unfortunately, I am not able to set up a test environment to verify the bug here, so I would like to ask for your help in testing a PPA with the backported patches to see if the issue is resolved. If it is, then we can proceed with the SRU process, which will require your help once more, this time to verify that the new official package indeed fixes the issue.

The PPA is here: https://launchpad.net/~sergiodj/+archive/ubuntu/drbd-utils

I've also noticed that the upstream patches have been included in version 9.20.0, which probably means that only Jammy is affected (Lunar onwards have drbd-utils 9.22.0). I'll add the necessary tasks to this bug.

Please let me know how the test goes. Thank you very much!

Changed in drbd-utils (Ubuntu Jammy):
status: New → Confirmed
Josua (josua-bryner) wrote :

Hi Sergio,

Thanks a lot for your prompt help!

I just tested the package from your PPA and can confirm that this fixed the issue we encountered.
With your package, the failover happens instantly again.

Just let me know if you need any more information.

Paride Legovini (paride) wrote :

Thanks Josua for testing the package Sergio prepared in that PPA. I am updating the bug metadata so that this will be tracked in our work queues.

Changed in drbd-utils (Ubuntu):
status: Confirmed → Fix Released
Changed in drbd-utils (Ubuntu Jammy):
status: Confirmed → Triaged
tags: added: server-todo
description: updated
Changed in drbd-utils (Ubuntu Jammy):
assignee: nobody → Sergio Durigan Junior (sergiodj)
Changed in drbd-utils (Ubuntu Jammy):
status: Triaged → In Progress
Steve Langasek (vorlon) wrote : Please test proposed package

Hello Josua, or anyone else affected,

Accepted drbd-utils into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/drbd-utils/9.15.0-1ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in drbd-utils (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Josua (josua-bryner) wrote :

Hi Steve

Thanks for the update.

I tested the package from the jammy-proposed repository and can confirm that the bug is fixed with this version.

[PACKAGE VERSION]
drbd-utils:
  Installed: 9.15.0-1ubuntu0.1
  Candidate: 9.15.0-1ubuntu1~ppa1
  Version table:
     9.15.0-1ubuntu1~ppa1 500
        500 https://ppa.launchpadcontent.net/sergiodj/drbd-utils/ubuntu jammy/main amd64 Packages
 *** 9.15.0-1ubuntu0.1 500
        500 http://archive.ubuntu.com/ubuntu jammy-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     9.15.0-1build2 500
        500 https://mirror.nine.ch/ubuntu jammy/main amd64 Packages

[TEST CASE]
I used the same setup and procedure as described in the bug report above:

1. Monitor the failover with `crm_mon -rf` and `tail -f /var/log/pacemaker/pacemaker.log` on passive node (node02)
2. Stop Pacemaker and Corosync service on promoted node (node01) to trigger a failover to passive node (node02).
3. Restart the two services on the now stopped node (node01)

And then the same in the opposite direction:

4. Monitor the failover with `crm_mon -rf` and `tail -f /var/log/pacemaker/pacemaker.log` on passive node (node01)
5. Stop Pacemaker and Corosync service on promoted node (node02) to trigger a failover to passive node (node01).
6. Restart the two services on the now stopped node (node02)

[VERIFICATION DONE]

The failovers were triggered immediately and the migration of the services was successful.
The result was as expected.

tags: added: verification-done-jammy
removed: verification-needed-jammy
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package drbd-utils - 9.15.0-1ubuntu0.1

---------------
drbd-utils (9.15.0-1ubuntu0.1) jammy; urgency=medium

  * d/p/lp2043817-fix-timeout-pacemaker-jammy-*.patch: Fix timeout
    issue with Pacemaker. (LP: #2043817)

 -- Sergio Durigan Junior <email address hidden> Tue, 21 Nov 2023 14:16:59 -0500

Changed in drbd-utils (Ubuntu Jammy):
status: Fix Committed → Fix Released
Andreas Hasenack (ahasenack) wrote : Update Released

The verification of the Stable Release Update for drbd-utils has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
