Activity log for bug #1688613

Date Who What changed Old value New value Message
2017-05-05 17:15:22 Michael Banck bug added bug
2017-05-05 17:15:22 Michael Banck attachment added Proposed patch https://bugs.launchpad.net/bugs/1688613/+attachment/4872336/+files/pgsql.diff
2017-05-05 17:33:01 Michael Banck attachment added Proposed debdiff https://bugs.launchpad.net/ubuntu/+source/resource-agents/+bug/1688613/+attachment/4872345/+files/resource-agents_3.9.3+git20121009-3ubuntu3.debdiff
2017-05-05 20:23:55 Ubuntu Foundations Team Bug Bot tags patch
2017-05-05 20:58:13 Michael Banck nominated for series Ubuntu Trusty
2017-05-23 21:49:46 Nish Aravamudan bug task added resource-agents (Ubuntu Trusty)
2017-05-23 21:49:58 Nish Aravamudan bug added subscriber Ubuntu Server Team
2017-05-23 21:50:01 Nish Aravamudan resource-agents (Ubuntu Trusty): milestone ubuntu-14.04.5
2017-05-23 21:50:04 Nish Aravamudan resource-agents (Ubuntu): milestone ubuntu-14.04.5
2017-05-23 21:50:17 Nish Aravamudan resource-agents (Ubuntu Trusty): status New Triaged
2017-05-30 06:24:08 Michael Banck attachment added updated debdiff with DEP3 https://bugs.launchpad.net/ubuntu/+source/resource-agents/+bug/1688613/+attachment/4886061/+files/resource-agents_3.9.3+git20121009-3ubuntu3.debdiff
2017-05-30 06:25:17 Michael Banck description We were debugging an unexpected failover of a PostgreSQL-9.3 Pacemaker cluster running on 14.04 LTS at a client. As a timeline, the client put back the second node (h1db2) into the cluster at around 9:10 AM, and the unexpected failover occurred at 10:32 AM. What exactly led to the failover could not be figured out exactly, but two problems were apparent from the logs: 1. The standby monitor action thought there was no master running: May 4 09:10:59 h1db2 pgsql(pgsql)[2460]: INFO: Master does not exist. [Message repeats 175 times] May 4 10:32:04 h1db2 pgsql(pgsql)[2611]: INFO: Master does not exist. [...] May 4 10:32:04 h1db2 pgsql(pgsql)[2611]: INFO: I have a master right. At this point, Pacemaker decided to promote h1db2. 2. Between 9:10 and 10:32, the score of the standby was -INFINITY; at 10:32 it was then set to the same score as the master (1000) while it should be 100 for standbys. Both problems were debugged and traced back to bugs in the pgsql resource agent version in trusty, which are due to output changes in newer pacemaker versions (including the one in trusty) and have since been fixed upstream. The following git commits from https://github.com/ClusterLabs/resource-agents are relevant: https://github.com/ClusterLabs/resource-agents/commit/78ddf466e413d0c1f18f7610cfbd63968b012ce0 fixes the first issue. https://github.com/ClusterLabs/resource-agents/commit/956244dd05f69bdad979b252a3e359855b88e6bd fixes the second issue.
However, several other intermediate commits are required on top of the version in trusty, so the full list we are using is: 956244dd05f69bdad979b252a3e359855b88e6bd b7911abce27889becc8a4637e003bfcf5ef1b15e (adjusted) ffc9c6444996144076ef2b4bc79a38569e05250a 404d205636ad02e09ddffdb9710dd660b8171c6b ff9f0ed32e64f9be9e57dc712ec241231b04d917 78ddf466e413d0c1f18f7610cfbd63968b012ce0 b7911abce27 needs to be adjusted as it uses a function (exec_with_retry) which is not available yet, but it (and its first argument) can be safely removed. Commits 3-5 just keep changing the same line (as does the last) so the final patch isn't getting any bigger. The attached patch makes the pgsql resource agent work much better for us; would it be possible to apply it to the resource-agents package in trusty? [Impact] The pgsql Pacemaker resource agent (/usr/lib/ocf/resource.d/heartbeat/pgsql from the resource-agents package) implements a Pacemaker Master/Slave set. Besides regular actions like starting/stopping/monitoring the resource, this also includes monitoring the transaction log position on each node, assigning a score to each node, and implementing promote/demote actions. If monitoring of the Master fails, Pacemaker may decide to fail over to a Slave based on the Slave's score. Due to version skew in 14.04 between the version of Pacemaker (1.1.10) and resource-agents (3.9.3), the Pacemaker output of various status commands differs slightly from what the pgsql resource agent expects, so the agent parses it incorrectly. In particular, since Pacemaker 1.1.8, the so-called instance number is no longer appended to the resource name (like pgsql:1) if the property globally-unique is set to false. This leads to the following two problems: - The call to crm_attribute fails because the agent appends the instance number to the resource name.
It first tries to read the current score, but as the requested resource name:instance does not exist, it gets back an error message and subsequently leaves the score of the Standby at the default of 1000 and not at 100 as it should be. To wit: # crm_attribute -l reboot -N h1db2 -n "master-pgsql:1" -G -q Error performing operation: No such device or address # crm_attribute -l reboot -N h1db2 -n "master-pgsql" -G -q 100 - The output of crm_mon no longer includes a colon after the resource name, so the resource agent on the Standby believes no Master is present and is not able to get the Master's transaction log position. To wit: # crm_mon -n1 | grep Master pgsql (ocf::heartbeat:pgsql): Master # crm_mon -n1 | tr -d "\t" | grep Master pgsql(ocf::heartbeat:pgsql):Master The pgsql resource agent's original grep was running 'grep -q "^${RESOURCE_NAME}:.* Master"' (where $RESOURCE_NAME=pgsql) on the last line, which turned up no hits (or rather, a non-zero exit status). These two problems give Pacemaker wrong input for failover decisions, possibly resulting in spurious failovers. As Pacemaker is typically deployed in business-critical setups, any unneeded failover implies a (possibly short but) unwanted downtime. Fixing these problems will make it possible to use PostgreSQL streaming replication in a high-availability fashion on 14.04. The problems have since been fixed in the upstream resource-agents repository (https://github.com/ClusterLabs/resource-agents/). The appropriate upstream commits have been squashed into a single patch. [Test Case] Set up a two-node PostgreSQL Pacemaker cluster on 14.04 according to e.g. https://github.com/gocardless/our-postgresql-setup/blob/master/postgresql-cluster-setup.sh Note that gocardless ship a patched version of the pgsql resource agent as well, so revert commit https://github.com/gocardless/our-postgresql-setup/commit/2511b9441d43996a3e45604080dedfac9a490c28 or comment out the deployment of the patched pgsql.
After setup, the score of the Standby will be 1000 with the current resource-agents package, and after installation of the proposed SRU package, it will be 100. [Regression Potential] As the commits are from upstream and fix currently broken behaviour in a localized fashion, there should be no regressions. The patch has been deployed by our customer for three weeks now and they reported no problems. [Other Info] I am happy to answer further questions. [Original Description] We were debugging an unexpected failover of a PostgreSQL-9.3 Pacemaker cluster running on 14.04 LTS at a client. As a timeline, the client put back the second node (h1db2) into the cluster at around 9:10 AM, and the unexpected failover occurred at 10:32 AM. What exactly led to the failover could not be figured out exactly, but two problems were apparent from the logs: 1. The standby monitor action thought there was no master running: May 4 09:10:59 h1db2 pgsql(pgsql)[2460]: INFO: Master does not exist. [Message repeats 175 times] May 4 10:32:04 h1db2 pgsql(pgsql)[2611]: INFO: Master does not exist. [...] May 4 10:32:04 h1db2 pgsql(pgsql)[2611]: INFO: I have a master right. At this point, Pacemaker decided to promote h1db2. 2. Between 9:10 and 10:32, the score of the standby was -INFINITY; at 10:32 it was then set to the same score as the master (1000) while it should be 100 for standbys. Both problems were debugged and traced back to bugs in the pgsql resource agent version in trusty, which are due to output changes in newer pacemaker versions (including the one in trusty) and have since been fixed upstream. The following git commits from https://github.com/ClusterLabs/resource-agents are relevant: https://github.com/ClusterLabs/resource-agents/commit/78ddf466e413d0c1f18f7610cfbd63968b012ce0 fixes the first issue. https://github.com/ClusterLabs/resource-agents/commit/956244dd05f69bdad979b252a3e359855b88e6bd fixes the second issue.
However, several other intermediate commits are required on top of the version in trusty, so the full list we are using is: 956244dd05f69bdad979b252a3e359855b88e6bd b7911abce27889becc8a4637e003bfcf5ef1b15e (adjusted) ffc9c6444996144076ef2b4bc79a38569e05250a 404d205636ad02e09ddffdb9710dd660b8171c6b ff9f0ed32e64f9be9e57dc712ec241231b04d917 78ddf466e413d0c1f18f7610cfbd63968b012ce0 b7911abce27 needs to be adjusted as it uses a function (exec_with_retry) which is not available yet, but it (and its first argument) can be safely removed. Commits 3-5 just keep changing the same line (as does the last) so the final patch isn't getting any bigger. The attached patch makes the pgsql resource agent work much better for us; would it be possible to apply it to the resource-agents package in trusty?
2017-06-23 08:19:06 Christian Ehrhardt tags patch patch server-next
2017-06-23 08:25:24 Christian Ehrhardt resource-agents (Ubuntu): status New Fix Released
2017-06-23 08:27:27 Christian Ehrhardt tags patch server-next bitesize patch server-next
2017-08-17 12:22:18 Michael Banck summary pgsql RA has problems with pacemaker version pgsql resource agent incompatible with pacemaker
2020-05-05 20:35:46 Bryce Harrington resource-agents (Ubuntu Trusty): status Triaged Won't Fix
2020-05-05 20:35:46 Bryce Harrington resource-agents (Ubuntu Trusty): milestone ubuntu-14.04.5
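
Note on the grep mismatch from the 2017-05-30 description update: it can be reproduced without a running cluster by feeding the quoted crm_mon line to the quoted pattern. The sketch below does that. The sample line and the first pattern are taken from the description; the second, relaxed pattern is only a hypothetical illustration of making the instance suffix optional and is not claimed to be what the upstream commits do.

```shell
#!/bin/sh
# Sample line copied from the tab-stripped crm_mon output quoted in the
# bug description (Pacemaker 1.1.10 on trusty, globally-unique=false).
RESOURCE_NAME=pgsql
sample='pgsql(ocf::heartbeat:pgsql):Master'

# Pattern quoted in the description: it expects a colon (the instance
# suffix, as in "pgsql:1") directly after the resource name.
if printf '%s\n' "$sample" | grep -q "^${RESOURCE_NAME}:.* Master"; then
    old_result=match
else
    old_result=no-match
fi

# Hypothetical relaxed pattern (an assumption for illustration, NOT the
# upstream fix): the ":<instance>" suffix is optional.
if printf '%s\n' "$sample" | grep -Eq "^${RESOURCE_NAME}(:[0-9]+)?\(.*Master"; then
    new_result=match
else
    new_result=no-match
fi

echo "old pattern: $old_result"
echo "relaxed pattern: $new_result"
```

The first grep exits non-zero on this sample because "pgsql" is followed by "(" rather than ":", which is consistent with the repeated "Master does not exist" messages in the log excerpt.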