drbd-utils timeout issue with Pacemaker
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
drbd-utils (Ubuntu) | Fix Released | Undecided | Unassigned |
Jammy | Fix Released | Undecided | Sergio Durigan Junior |
Bug Description
[ Impact ]
When using drbd-utils in a cluster, and after corosync and pacemaker services are gracefully stopped on one node, it takes more than 2 minutes for the resources to migrate to one of the other nodes.
[ Test Plan ]
Testing this bug requires setting up a cluster with corosync/pacemaker, which can be somewhat complex to achieve. Instead, we will rely on the reporter's setup to test and verify that the fix works.
[ Where problems could occur ]
The patches being backported are fairly simple, but they do modify the logic behind the timeout calculation. Without the patch, the value N that drbd-utils passes to crmadmin as the timeout is interpreted as N *seconds*, whereas with the patch drbd-utils explicitly specifies N *milliseconds*. A user who expects the timeout to be several orders of magnitude higher than what this fix sets may have made decisions based on the old behavior. On the other hand, it is hard to think of a scenario where having the resources migrate more quickly to a healthy node would cause problems for a setup.
[ Other Info ]
The problem may seem complex, but it's actually simple: drbd-utils tries to determine whether it can use the suffix "ms" when specifying the timeout to crmadmin. Versions of crmadmin <= 2.0.x expected the timeout to be specified as a number which was always interpreted as milliseconds, but that changed with crmadmin >= 2.1.0, which expects the timeout to be a timespec (meaning that the "ms" suffix must be present). When no time unit is specified, the timeout is interpreted as seconds:
https:/
For this reason, we need to patch drbd-utils and make it correctly append the "ms" suffix whenever specifying the timeout via crmadmin.
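The version-dependent formatting described above can be sketched as follows. This is only an illustration under the rule stated in this section (bare number for versions before 2.1.0, "ms" suffix from 2.1.0 on), not the actual patch; the function names `version_to_int` and `format_crmadmin_timeout` are hypothetical, and the version is assumed to be a plain dotted number such as 2.1.2.

```sh
#!/bin/sh
# Hypothetical helpers for illustration; not taken from drbd-utils.

# Turn "X.Y.Z" into one comparable integer.
# Assumes plain numeric fields below 1000 without leading zeros.
version_to_int() {
    IFS=. read -r major minor patch rest <<EOF
$1
EOF
    echo $(( ${major:-0} * 1000000 + ${minor:-0} * 1000 + ${patch:-0} ))
}

# Print the timeout argument for a given version and a timeout in milliseconds:
# "<N>ms" when the version is 2.1.0 or later, a bare "<N>" otherwise.
format_crmadmin_timeout() {
    if [ "$(version_to_int "$1")" -ge "$(version_to_int 2.1.0)" ]; then
        printf '%sms\n' "$2"
    else
        printf '%s\n' "$2"
    fi
}
```

With these definitions, `format_crmadmin_timeout 2.1.2 200` prints `200ms`, while `format_crmadmin_timeout 2.0.3 200` prints `200`. The value 200 is only a sample input, not the timeout drbd-utils actually uses.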
[ Original Description ]
[Description]
We ran into an issue on an Ubuntu Jammy Failover setup with Corosync, Pacemaker and DRBD.
If the pacemaker and corosync services are gracefully stopped (systemctl stop pacemaker.service corosync.service) on one node, it takes more than 2 minutes for the resources to migrate to the second node.
There is a global default timeout of 20 seconds for pacemaker resources, and we always ran into that threshold, causing the failover to stall. After about 90 seconds, the resources are migrated.
We're seeing this issue only with Ubuntu Jammy, not on Bionic or Focal, where we use identical configurations and the graceful resource failover happens in about 2 seconds.
Note: this issue does not occur when migrating resources with "crm resource move".
During our debugging, we noticed in the changelog of the drbd-utils package the following entry:
https:/
```
9.20.2
-----------
* crm-fence-peer: fix timeout with Pacemaker 2.0.5
```
Although Ubuntu Jammy comes with a more recent version of Pacemaker, we still gave it a try and replaced the drbd-utils package with the newest version 9.26.0 from this PPA https:/
Without changing any settings, with the new version, the failover works perfectly and nearly instantly, as we're used to on the Bionic and Focal setups. No timeouts occur.
[Test environment]
DistroRelease: Ubuntu 22.04.3 LTS
Uname: Linux 5.15.0-88-generic
Architecture: amd64
corosync:
Installed: 3.1.6-1ubuntu1
Candidate: 3.1.6-1ubuntu1
pacemaker:
Installed: 2.1.2-1ubuntu3.1
Candidate: 2.1.2-1ubuntu3.1
drbd-utils:
Installed: 9.15.0-1build2
Candidate: 9.15.0-1build2
[Steps to reproduce]
* Create a 2-node cluster running Ubuntu 22.04 with Corosync, DRBD and Pacemaker
* Create a replicated DRBD volume
* Create a CRM resource that uses the DRBD volume
* Stop the Pacemaker and Corosync services on the promoted node to trigger a failover
* Monitor the failover with `crm_mon -rf` and `tail -f /var/log/
The failover should be done in about 2 seconds, but because of the bug in drbd-utils 9.15, it fails as it runs into the timeout. If the timeout is increased, the failover takes a considerable amount of time, usually north of 90 seconds.
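To put a number on the failover time when following the steps above, a small polling helper can be used. This is a minimal sketch: `wait_until` is a hypothetical name, and the check command passed to it (for example, a script that succeeds once the resource is running on the surviving node) has to be supplied by the person testing, since the report does not prescribe one.

```sh
#!/bin/sh
# Hypothetical helper: run the given check command once per second until it
# succeeds, then print the number of seconds that elapsed.
wait_until() {
    start=$(date +%s)
    until "$@" >/dev/null 2>&1; do
        sleep 1
    done
    echo $(( $(date +%s) - start ))
}
```

Started on the surviving node right after the services are stopped on the promoted node, the printed value should be a few seconds on a fixed system and well above the resource timeout on an affected one.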
[Solution]
Backport the fix for the fencing issue with Pacemaker to drbd-utils:
https:/
or provide a newer version of drbd-utils that has this fix already included
Related branches
- git-ubuntu bot: Approve
- Andreas Hasenack: Approve
- Canonical Server Reporter: Pending requested
- Diff: 279 lines (+239/-1), 5 files modified
debian/changelog (+7/-0)
debian/control (+2/-1)
debian/patches/lp2043817-fix-timeout-pacemaker-jammy-01.patch (+138/-0)
debian/patches/lp2043817-fix-timeout-pacemaker-jammy-02.patch (+90/-0)
debian/patches/series (+2/-0)
description: updated
Changed in drbd-utils (Ubuntu Jammy):
assignee: nobody → Sergio Durigan Junior (sergiodj)
Changed in drbd-utils (Ubuntu Jammy):
status: Triaged → In Progress
tags: added: verification-done; removed: verification-needed
Status changed to 'Confirmed' because the bug affects multiple users.