pgsql resource agent incompatible with pacemaker
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
resource-agents (Ubuntu) | Fix Released | Undecided | Unassigned |
Trusty | Won't Fix | Undecided | Unassigned |
Bug Description
[Impact]
The pgsql Pacemaker resource agent (/usr/lib/
Due to version skew in 14.04 between Pacemaker (1.1.10) and resource-agents (3.9.3), the output of various Pacemaker status commands differs slightly from what the pgsql resource agent expects, so the agent parses it incorrectly. In particular, since Pacemaker 1.1.8, the so-called instance number is no longer appended to the resource name (as in pgsql:1) if the property globally-unique is set to false.
This leads to the following two problems:
- The call to crm_attribute fails because the agent appends the instance number to the resource name. The agent first tries to read the current score, but since the requested resource name:instance does not exist, it gets back an error message and subsequently leaves the score of the Standby at the default of 1000 instead of the 100 it should be.
To wit:
# crm_attribute -l reboot -N h1db2 -n "master-pgsql:1" -G -q
Error performing operation: No such device or address
# crm_attribute -l reboot -N h1db2 -n "master-pgsql" -G -q
100
- The output of crm_mon no longer includes a colon, so the resource agent on the Standby believes no Master is present and is not able to get the Master's transaction log position.
To wit:
# crm_mon -n1 | grep Master
pgsql (ocf::heartbeat
# crm_mon -n1 | tr -d "\t" | grep Master
pgsql(ocf:
The pgsql resource agent's original grep was running 'grep -q "^${RESOURCE_
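The two mismatches above can be illustrated with a minimal shell sketch. It is not the upstream fix: the helper names strip_instance, master_attr_name and has_master are hypothetical, and the pattern assumes that the Master line starts with the resource name once tabs are removed, as in the second transcript.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from the pgsql resource agent.

# Drop a trailing ":N" clone instance suffix, if there is one.
strip_instance() {
    printf '%s\n' "$1" | sed 's/:[0-9][0-9]*$//'
}

# Build the attribute name passed to crm_attribute -n, without the instance number.
master_attr_name() {
    printf 'master-%s\n' "$(strip_instance "$1")"
}

# Read crm_mon-style text on stdin and succeed if a line for resource $1,
# with or without a ":N" suffix, mentions Master. Tabs are removed first.
has_master() {
    tr -d '\t' | grep -Eq "^${1}(:[0-9]+)?[[:space:]]*\(.*Master"
}
```

On a cluster node this could be used as `crm_mon -n1 | has_master pgsql` and `crm_attribute -l reboot -N h1db2 -n "$(master_attr_name pgsql:1)" -G -q`, so that both the old and the new naming are accepted.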
This gives Pacemaker wrong input when it decides in failover situations, possibly resulting in spurious failovers. As Pacemaker is typically deployed in business-critical setups, any unneeded failover implies an unwanted (even if short) downtime. Fixing these problems will make it possible to use PostgreSQL streaming replication in a high-availability fashion on 14.04.
The problems have since been fixed in the upstream resource-agents repository (https:/
[Test Case]
Set up a two-node PostgreSQL Pacemaker cluster on 14.04, e.g. according to https:/
Note that gocardless ship a patched version of the pgsql resource agent as well, so revert commit https:/
After setup, the score of the Standby will be 1000 with the current resource-agents package, and after installation of the proposed SRU package, it will be 100.
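The check described above could be scripted roughly as follows. This is only a sketch: classify_standby_score is a hypothetical helper, the values 100 and 1000 are the ones given in this report, and the node name h1db2 and attribute name master-pgsql are taken from the transcript in the [Impact] section.

```shell
#!/bin/sh
# Hypothetical helper: interpret the standby's master score from this report.
classify_standby_score() {
    case "$1" in
        100)  echo "ok: standby score is 100" ;;
        1000) echo "bug: standby still has the master score 1000" ;;
        *)    echo "unexpected score: $1" ;;
    esac
}

# Usage on a cluster node (requires a running Pacemaker cluster):
#   classify_standby_score "$(crm_attribute -l reboot -N h1db2 -n master-pgsql -G -q)"
```

Any other value, such as the -INFINITY seen in the original logs, is reported as unexpected rather than interpreted.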
[Regression Potential]
As the commits are from upstream and fix currently broken behaviour in a localized fashion, there should be no regressions.
The patch has been deployed by our customer for three weeks now and they reported no problems.
[Other Info]
I am happy to answer further questions.
[Original Description]
We were debugging an unexpected failover of a PostgreSQL-9.3 Pacemaker cluster running on 14.04 LTS at a client. As a timeline, the client put the second node (h1db2) back into the cluster at around 9:10 AM, and the unexpected failover occurred at 10:32 AM.
What exactly led to the failover could not be determined, but two problems were apparent from the logs:
1. The standby monitor action thought there was no master running:
May 4 09:10:59 h1db2 pgsql(pgsql)[2460]: INFO: Master does not exist.
[Message repeats 175 times]
May 4 10:32:04 h1db2 pgsql(pgsql)[2611]: INFO: Master does not exist.
[...]
May 4 10:32:04 h1db2 pgsql(pgsql)[2611]: INFO: I have a master right.
At this point, Pacemaker decided to promote h1db2.
2. Between 9:10 and 10:32, the score of the standby was -INFINITY. At 10:32 it was then set to the same score as the master (1000), while it should be 100 for standbys.
Both problems were debugged and traced back to bugs in the pgsql resource agent version in trusty, which are due to output changes in newer pacemaker versions (including the one in trusty) and have since been fixed.
The following git commits from https:/
https:/
https:/
However, several other intermediate commits are required on top of the version in trusty, so the full list we are using is:
956244dd05f69bd
b7911abce27889b
ffc9c6444996144
404d205636ad02e
ff9f0ed32e64f9b
78ddf466e413d0c
b7911abce27 needs to be adjusted as it uses a function (exec_with_retry) which is not available yet, but it (and its first argument) can be safely removed. Commits 3-5 just keep changing the same line (as does the last) so the final patch isn't getting any bigger.
The attached patch makes the pgsql resource agent work much better for us. Would it be possible to apply it to the resource-agents package in trusty?
tags: added: patch
summary:
- pgsql RA has problems with pacemaker version
+ pgsql resource agent incompatible with pacemaker
Debdiff attached.