[SRU] pgsql resource agent uses regexes for old crm_mon format, breaks pgsql-status and pgsql-data-status attributes

Bug #1900016 reported by Jason Hobbs
Affects                          Status        Importance  Assigned to
resource-agents (Ubuntu)         Fix Released  Critical    Bryce Harrington
resource-agents (Ubuntu Focal)   Fix Released  Critical    Bryce Harrington
resource-agents (Ubuntu Groovy)  Fix Released  Critical    Bryce Harrington

Bug Description

[Impact]

The resource agent uses crm_mon to determine node state; however, crm_mon's output format differs between bionic and focal, which results in invalid status reporting on focal hosts. This has caused, for example, failures when migrating a bionic pgsql node to focal.

[Test Case]

Set up a 4-node Focal Pacemaker/Corosync cluster with the following CIB:

https://paste.ubuntu.com/p/Mqcn7HMzng/

Check the XML cluster status output; the 'pgsql-status' and 'pgsql-data-status' attributes are not listed as node attributes:

ubuntu@ekans:~$ sudo crm_mon --as-xml | grep -A11 "<node_attributes>"
  <node_attributes>
    <node name="budew">
      <attribute name="master-pgsql" value="1000"/>
      <attribute name="pgsql-xlog-loc" value="0000000004000150"/>
    </node>
    <node name="ekans">
      <attribute name="master-pgsql" value="1000"/>
      <attribute name="pgsql-master-baseline" value="00000000040000A0"/>
    </node>
    <node name="tyrogue">
      <attribute name="master-pgsql" value="1000"/>
      <attribute name="pgsql-xlog-loc" value="0000000004000150"/>
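
The same check can be exercised offline against a saved copy of the status XML. This sketch is illustrative only (the sample fragment and the /tmp file path are hypothetical, modeled on the output above); it reports the attributes that an affected focal host fails to publish:

```shell
# Write a sample fragment mimicking an affected host's status XML
cat > /tmp/pcmk-status.xml <<'EOF'
<node_attributes>
  <node name="budew">
    <attribute name="master-pgsql" value="1000"/>
    <attribute name="pgsql-xlog-loc" value="0000000004000150"/>
  </node>
</node_attributes>
EOF

# On an affected host the pgsql-status/pgsql-data-status attributes
# never appear, so both are reported as missing here
for attr in pgsql-status pgsql-data-status; do
    grep -q "name=\"$attr\"" /tmp/pcmk-status.xml \
        || echo "$attr missing"
done
```

On a fixed host, the same grep against real `crm_mon --as-xml` output should find both attributes.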

[Regression Potential]

Since this changes node status reporting for resource-agents, watch for regressions in anything that depends on that status information for managing nodes, such as software upgrades, migrations to new Ubuntu releases, or web dashboards.

[Fix]

Upstream appears to have encountered and fixed the issue by adjusting the regex to cover the new line format. This corresponds to the following upstream commit:

https://github.com/ClusterLabs/resource-agents/commit/2a56d5b2

[Discussion]

In groovy's 4.6.1, the issue is fixed a bit differently, by switching to use of crm_mon's XML output format.
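
A minimal sketch of the XML-based approach (the function name and stdin handling here are illustrative, not the actual 4.6.1 code): querying the XML output sidesteps the format-sensitive text regex entirely.

```shell
# Illustrative only: check node existence against crm_mon XML output.
# Takes the XML on stdin so it can be exercised without a live cluster;
# a real agent would pipe in `crm_mon --as-xml` instead.
node_exist_xml() {
    grep -q "<node name=\"$1\""
}

echo '<node name="focal01" online="true"/>' | node_exist_xml focal01 \
    && echo "focal01 found"
```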

[Original Report]

There is a bug in the resource agent's node_exist function. It looks at crm_mon output, which has changed between bionic and focal.

The result is that the 'pgsql-status' and 'pgsql-data-status' attributes are missing from crm status --as-xml output on focal.

Here is the focal output:
http://paste.ubuntu.com/p/RrFnPJHWCS/

Here is the bionic output:
http://paste.ubuntu.com/p/NrvqtjJD5r/

This is the node_exist function:

node_exist() {
    print_crm_mon | tr '[A-Z]' '[a-z]' | grep -q "^node $1"
}

It's looking for a line starting with "Node <nodename>".

That works in bionic, but in focal, it's " * Node <nodename>".

is_node_online has the same problem:

is_node_online() {
    print_crm_mon | tr '[A-Z]' '[a-z]' | grep -e "^node $1 " -e "^node $1:" | grep -q -v "offline"
}
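
The mismatch can be reproduced without a cluster using two sample lines (made up, but modeled on each release's crm_mon output):

```shell
# Hypothetical one-line samples of crm_mon -n1 node lines per release
bionic_line="Node node1: online"
focal_line=" * Node node1: online"

# The start-of-line anchor matches only the bionic format
echo "$bionic_line" | tr '[A-Z]' '[a-z]' | grep -q "^node node1" \
    && echo "bionic: match"
echo "$focal_line" | tr '[A-Z]' '[a-z]' | grep -q "^node node1" \
    || echo "focal: no match"
```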

It looks like this is the upstream:
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql

It's fixed there; they look at crm_mon xml output instead.

I tested changing the regex to "node $1:" and it works fine. That could be tightened up a bit to match just "node <nodename>" or " * node <nodename>", but I'm not sure whether we should just pull in something from upstream, so I haven't spent time refining it.

This is on focal with resource-agents 1:4.5.0-2ubuntu2.


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Sub'd to field-high; this breaks our ability to validate postgres HA on focal.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here they are with regex that accepts either version:

is_node_online() {
    print_crm_mon | tr '[A-Z]' '[a-z]' | grep -e "^\( \* \)\?node $1 " -e "^\( \* \)\?node $1:" | grep -q -v "offline"
}

node_exist() {
    print_crm_mon | tr '[A-Z]' '[a-z]' | grep -q "^\( \* \)\?node $1"
}
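
A quick sanity check of the optional-prefix regex against both line formats (the sample lines are made up):

```shell
# Both the bionic-style and focal-style node lines should now match
# the regex with the optional " * " prefix group
for line in "node n1: online" " * node n1: online"; do
    echo "$line" | grep -q "^\( \* \)\?node n1" && echo "matched: $line"
done
```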

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Bumped to field crit as we don't have a good workaround for this. We could hotpatch the resource agent, but that only lasts until the package is updated again, and then crm status output for pgsql will be broken again.

Revision history for this message
Richard Harding (rharding) wrote :

The team has identified the upstream patch and will pull it down and update the package. We agree the current workaround will hold until the updated package can be provided with a test case and taken through the SRU process. I've added the tag blocking updates to the package in the meantime.

The team will provide a PPA with the fixed package tomorrow, if required, while the test case and SRU bug are processed.

tags: added: block-proposed-focal
Changed in resource-agents (Ubuntu):
assignee: nobody → Bryce Harrington (bryce)
status: New → Triaged
importance: Undecided → Critical
Bryce Harrington (bryce)
Changed in resource-agents (Ubuntu):
status: Triaged → In Progress
Changed in resource-agents (Ubuntu Focal):
importance: Undecided → Critical
status: New → In Progress
assignee: nobody → Bryce Harrington (bryce)
status: In Progress → Triaged
Revision history for this message
Bryce Harrington (bryce) wrote :

The upstream commit I'm backporting for this is sha 2a56d5b2, attached as a patch.

Revision history for this message
Bryce Harrington (bryce) wrote :

I confirmed the patch is included in the 4.6.0 release, and we're carrying 4.6.1 in groovy. Note that a subsequent change after this patch switched from parsing the raw output text to using the XML format, as it seems to be more version-stable with historical versions of the package; however we're just backporting the exact bug-fix to keep the SRU focused to the minimally effective change.

Changed in resource-agents (Ubuntu Groovy):
status: In Progress → Fix Released
Changed in resource-agents (Ubuntu Focal):
status: Triaged → In Progress
Bryce Harrington (bryce)
description: updated
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Dropped back to field-high since we can hotpatch as a workaround; with updates blocked, there will be no additional package updates that don't contain this fix.

Revision history for this message
Bryce Harrington (bryce) wrote :

Before this can go in for SRU, steps to reliably reproduce the issue need to be determined and a test case defined.

Revision history for this message
Canonical Solutions QA Bot (oil-ci-bot) wrote :

This bug is fixed with commit 7f088c69 to cpe-foundation on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cpe-foundation/commit/?id=7f088c69

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

@Jason, would it be possible for you to share your CIB file with us? Or even the output of the 'crm status' command? That would help us build a test case for the SRU.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Minor fix: use "crm configure show" instead; that gives us the CIB in a human-readable form so we can reproduce with the same parameters.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here's crm configure show:

https://paste.ubuntu.com/p/Mqcn7HMzng/

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

Hi Jason,

Thanks for providing your configuration, it was helpful. I spent yesterday trying to trigger this bug and for some reason I was not able to. I set up a 2-node Focal cluster, configured PostgreSQL with streaming replication, and tried swapping the master and slave a few times; it worked for me. FWIW, this is my cluster configuration:

node 1: focal01 \
 attributes pgsql-data-status="STREAMING|SYNC"
node 2: focal02 \
 attributes pgsql-data-status=LATEST
primitive postgresql pgsql \
 params pgctl="/usr/lib/postgresql/12/bin/pg_ctl" pgdata="/var/lib/postgresql/12/main" psql="/usr/bin/psql" config="/etc/postgresql/12/main/postgresql.conf" rep_mode=sync master_ip=192.168.3.3 repuser=replicator restart_on_promote=true check_wal_receiver=true node_list="focal01 focal02" \
 op monitor timeout=30 interval=2
primitive vip_public IPaddr2 \
 params ip=192.168.3.4 cidr_netmask=24 \
 op monitor interval=10s \
 meta target-role=Started
primitive vip_replica IPaddr2 \
 params ip=192.168.3.3 cidr_netmask=24 \
 op monitor interval=10s \
 meta target-role=Started
ms master_postgresql postgresql \
 meta notify=true master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 target-role=Started
location cli-prefer-vip_public vip_public role=Started inf: focal01
order order-vip_replica-psql_master-vip_public inf: vip_replica:start master_postgresql:promote vip_public:start
colocation psql_master_and_vips inf: master_postgresql vip_public vip_replica
property cib-bootstrap-options: \
 have-watchdog=false \
 dc-version=2.0.3-4b1f869f0f \
 cluster-infrastructure=corosync \
 cluster-name=cluster01 \
 stonith-enabled=false \
 last-lrm-refresh=1603195934
rsc_defaults rsc-options: \
 resource-stickiness=100

And here you can see some of the commands I ran and their respective output:

https://pastebin.ubuntu.com/p/ZBBByznQfT/

You can see the 'postgresql-receiver-status' error value in the XML output, but it should not affect the reproducibility of this bug; the 'pgsql-data-status' attribute is actually there. During this process I found an unrelated issue with the PostgreSQL resource and filed this bug:

https://bugs.launchpad.net/ubuntu/+source/resource-agents/+bug/1900613

However, after checking the code I can see your point, and it does indeed seem buggy; I do not know why my attempt did not trigger it. I'd rather not spend much more time on it, so I'd like to use your configuration as the test case for the SRU and, if possible, ask you to do the validation work when the SRU team requests it. Is that OK with you? Maybe you could try what I did (in the pastebin link); that would be good enough for our test case, I think.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I can use my configuration for the test case and do the validation, no problem. Do you need anything from me right now?

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

If you could run commands similar to the ones in my pastebin and report their output (like I did there), that would be great for defining a solid test case before uploading. After an ack from the SRU team, a validation will be required, and then the ball will be in your court again.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here's my commands and output:

https://paste.ubuntu.com/p/R4f5xX6QPq/

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

Thank you for providing the data Jason! I took a look at it and when you ran the following command to move the 'res_pgsql_vip' resource to the 'budew' node:

$ sudo crm_resource -M -r res_pgsql_vip -H budew

The resource did not move to the target node (and therefore the master pgsql resource was not moved either). Maybe it's something related to how you deployed it.

Since the operation I proposed did not work (the status of the cluster before and after is the same), I am going to assume your XML output is missing the attributes you mentioned right after you deploy your cluster (using the config you shared in #12). Please correct me if I am wrong here.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1900016] Re: pgsql resource agent uses regexes for old crm_mon format, breaks pgsql-status and pgsql-data-status attributes

We're not having any issues with the VIP moving when it's supposed to. I
don't really understand what the command you suggested does, but it's not
really relevant to our problem. As you say, the problem is that the XML is
missing the node attributes right after the deployment.

Jason


description: updated
Revision history for this message
Bryce Harrington (bryce) wrote :

Thanks Jason and Lucas.
The fix is now uploaded to focal-proposed for SRU team review.

summary: - pgsql resource agent uses regexes for old crm_mon format, breaks pgsql-
- status and pgsql-data-status attributes
+ [SRU] pgsql resource agent uses regexes for old crm_mon format, breaks
+ pgsql-status and pgsql-data-status attributes
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Jason, or anyone else affected,

Accepted resource-agents into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/resource-agents/1:4.5.0-2ubuntu2.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in resource-agents (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I tested the resource-agents package from focal-proposed and it fixed it for me. Marking verification-complete. Logs: http://paste.ubuntu.com/p/F5yDkV2wKS/

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Bryce Harrington (bryce)
tags: removed: block-proposed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package resource-agents - 1:4.5.0-2ubuntu2.1

---------------
resource-agents (1:4.5.0-2ubuntu2.1) focal; urgency=medium

  * d/p/crm-mon-format.patch: Support newer crm_mon output formats.
    The output of crm_mon -n1 changed to prefix node information with an
    asterisk, resulting in the node_exist() function failing to show
    correct information for nodes. This updates the code to accept the new
    node line format.
    (LP: #1900016)

 -- Bryce Harrington <email address hidden> Fri, 16 Oct 2020 00:11:13 +0000

Changed in resource-agents (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for resource-agents has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
