pgsql heartbeat does not support current postgreSQL version

Bug #2013084 reported by Keha
This bug affects 2 people
Affects                   Status        Importance  Assigned to         Milestone
resource-agents (Ubuntu)  Fix Released  Undecided   Unassigned
Jammy                     Fix Released  Undecided   Michał Małoszewski

Bug Description

[Impact]

* resource-agents version 4.7.0 does not work correctly with PostgreSQL 11 and above.
* The issue appears on 22.04 and the resource-agents package is affected.
* This issue is caused by the receiver_parent_pids variable, which is populated by searching for the wrong name of the wal receiver process.
* The fix is to search for the correct name of the wal receiver process, which makes the wal receiver check compatible with PostgreSQL >= 11.
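
To illustrate the mismatch: on PostgreSQL 11 and later the process title reads `walreceiver` (as in the `ps` output in the test plan below), whereas older releases used `wal receiver process`. The sketch below is not the agent's actual code; the sample titles, the `title_matches` helper, and both patterns are assumptions chosen only to show why a search tied to the old wording finds nothing on newer servers.

```shell
# Sample process titles (illustrative; the second one is from the test plan).
old_title='postgres: wal receiver process   streaming 0/3000060'
new_title='postgres: 14/main: walreceiver streaming 0/7000780'

# Hypothetical helper: succeeds when the title ($1) matches grep pattern ($2).
title_matches() {
    printf '%s\n' "$1" | grep -q -e "$2"
}

# A pattern tied to the old wording misses the PostgreSQL >= 11 title.
if title_matches "$new_title" 'wal receiver process'; then
    echo "old pattern: wal receiver found"
else
    echo "old pattern: wal receiver not found"
fi

# A pattern that allows optional spaces matches both spellings.
for title in "$old_title" "$new_title"; do
    title_matches "$title" 'wal *receiver' && echo "relaxed pattern: wal receiver found"
done
```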

[Test Plan]

Create a lxd container for the primary postgresql:
$ lxc launch ubuntu:jammy j1

Connect and install packages:
$ lxc shell j1
# apt update && apt install postgresql resource-agents-extra pacemaker-cli-utils -y

Configure postgresql.conf and pg_hba.conf:
# pg_conftool 14 main set listen_addresses '*'
# pg_conftool 14 main set wal_level replica
# echo "host replication replicator all scram-sha-256" >> /etc/postgresql/14/main/pg_hba.conf
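
For reference, the pg_hba.conf entry added above has five whitespace-separated fields: connection type, database, user, client address, and authentication method. The sketch below only splits the line to label those fields; `describe_hba_line` is a hypothetical helper, not part of PostgreSQL.

```shell
# Hypothetical helper: label the five fields of a host-type pg_hba.conf line.
describe_hba_line() {
    # Intentionally unquoted so the shell splits the line on whitespace.
    set -- $1
    printf 'type=%s database=%s user=%s address=%s method=%s\n' "$1" "$2" "$3" "$4" "$5"
}

describe_hba_line 'host replication replicator all scram-sha-256'
```

Here `replication` in the database field matches replication connections rather than a database of that name, and `all` in the address field accepts any client address, which is fine for a throwaway test container but broader than one would normally want.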

Create replication user (choose a password, and remember it, it will be needed again later):
# sudo -u postgres createuser --replication -P -e replicator

Restart the primary postgresql:
# systemctl restart postgresql

Back on the host, create lxd container for the secondary postgresql:
$ lxc launch ubuntu:jammy j2
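
The pg_basebackup step further down needs the IP address of the primary. One way to get it from the host is `lxc list j1 -c 4 --format csv`, which should print something like `10.0.0.5 (eth0)`. The sketch below assumes that output shape (the address shown is made up) and uses a hypothetical `first_ipv4` helper to strip the interface name.

```shell
# Hypothetical helper: read the IPv4 column of `lxc list ... --format csv`
# on stdin and print only the first address, without the "(eth0)" suffix.
first_ipv4() {
    awk 'NF { print $1; exit }'
}

# Made-up sample output; on a real host one would pipe the command instead:
#   lxc list j1 -c 4 --format csv | first_ipv4
printf '10.0.0.5 (eth0)\n' | first_ipv4
```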

Connect and install packages:
$ lxc shell j2
# apt update && apt install postgresql resource-agents-extra pacemaker-cli-utils -y

Stop postgresql:
# systemctl stop postgresql

Configure postgresql.conf:
# pg_conftool 14 main set listen_addresses '*'
# pg_conftool 14 main set hot_standby on

Cleanup data dir:
# rm -rf /var/lib/postgresql/*/main/*

Perform initial replication as "postgres" user. The pg_basebackup command will prompt for the "replicator" password created earlier on the primary:
# sudo -u postgres -i
$ pg_basebackup -h <IP-of-primary> -D /var/lib/postgresql/14/main -U replicator -P -v -R
$ exit
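
With PostgreSQL 12 and later, the `-R` option makes pg_basebackup create a `standby.signal` file in the target data directory and append the connection settings to `postgresql.auto.conf`, which is what turns the copy into a standby. A minimal sketch for confirming this before starting the secondary follows; `is_standby_datadir` is a hypothetical helper, and the demo uses a temporary directory in place of the real data directory.

```shell
# Hypothetical helper: succeeds if the data directory ($1) looks like a
# standby prepared by `pg_basebackup -R` on PostgreSQL >= 12.
is_standby_datadir() {
    [ -f "$1/standby.signal" ] &&
        grep -q '^primary_conninfo' "$1/postgresql.auto.conf" 2>/dev/null
}

# Demo on a scratch directory standing in for /var/lib/postgresql/14/main.
demo_dir=$(mktemp -d)
is_standby_datadir "$demo_dir" || echo "not a standby yet"
touch "$demo_dir/standby.signal"
echo "primary_conninfo = 'host=<IP-of-primary> user=replicator'" > "$demo_dir/postgresql.auto.conf"
is_standby_datadir "$demo_dir" && echo "standby markers present"
rm -rf "$demo_dir"
```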

Start the secondary:
# systemctl start postgresql

Verify replication: list of databases on the secondary does not have a "test" database:
# sudo -u postgres psql -l 2>/dev/null | grep test
#

On the primary, create a test database:
$ lxc shell j1
# sudo -u postgres createdb test
could not change directory to "/root": Permission denied
# sudo -u postgres psql -l 2>/dev/null | grep test
 test | postgres | UTF8 | C.UTF-8 | C.UTF-8 |

On the secondary, verify that the test database now exists:
$ lxc shell j2
# sudo -u postgres psql -l 2>/dev/null | grep test
 test | postgres | UTF8 | C.UTF-8 | C.UTF-8 |
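
Streaming replication is asynchronous by default, so the database may take a moment to show up on the secondary. A small retry loop avoids a false negative when scripting this check; `retry` below is a hypothetical helper, and the attempt count and one-second delay are arbitrary.

```shell
# Hypothetical helper: run a command up to $1 times, one second apart,
# returning success as soon as the command succeeds.
retry() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        "$@" && return 0
        attempts=$((attempts - 1))
        [ "$attempts" -gt 0 ] && sleep 1
    done
    return 1
}

# On the secondary one might run (not executed here):
#   retry 10 sh -c 'sudo -u postgres psql -l 2>/dev/null | grep -q test'
retry 3 true && echo "command succeeded"
```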

Check that the secondary does have a "walreceiver" process running:
$ lxc shell j2
# ps axw|grep -E "postgres:.*wal" | grep -v grep
   6001 ? Ss 0:06 postgres: 14/main: walreceiver streaming 0/7000780

Now run this long command, one line, on the secondary.

Actual result:

With the bug present, the command will complain that the walreceiver process is NOT running:
# OCF_RESKEY_check_wal_receiver=true OCF_RESKEY_socketdir=/run/postgresql OCF_RESKEY_config=/etc/postgresql/14/main/postgresql.conf OCF_RESKEY_pgctl=/usr/lib/postgresql/14/bin/pg_ctl OCF_RESKEY_pgdata=/var/lib/postgresql/14/main OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/pgsql monitor
INFO: Don't check /var/lib/postgresql/14/main during probe
WARNING: wal receiver process is not running

Expected result:

The warning is not present in the output.
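
The pass/fail criterion above can be sketched as a small filter over the agent's output. `wal_receiver_ok` is a hypothetical helper; it only looks for the warning text quoted in this report, does not interpret the agent's exit code, and assumes the agent's messages are piped in (for example with `2>&1`).

```shell
# Hypothetical helper: reads the monitor output on stdin and fails if it
# contains the warning quoted in the "Actual result" section.
wal_receiver_ok() {
    ! grep -q 'WARNING: wal receiver process is not running'
}

# Sample output copied from the buggy run.
buggy_output="INFO: Don't check /var/lib/postgresql/14/main during probe
WARNING: wal receiver process is not running"

if printf '%s\n' "$buggy_output" | wal_receiver_ok; then
    echo "PASS: warning absent"
else
    echo "FAIL: warning present"
fi
```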

[Where problems could occur]

* The patch itself modifies only the heartbeat/pgsql code, so regressions should be limited to the behavior of pgsql.
* Since the code changes affect the pgsql_wal_receiver_status() function, there might be a problem related to the status of the wal receiver and the status of the running processes.

----------------------------original bug report---------------------------

Hello!

Ubuntu 22.04.2 includes the resource-agents package at version 4.7.0 (released in 2020). This version does not work correctly with PostgreSQL 11 and above.

resource-agents received the 'WAL receiver process' fix in 4.8.0 rc1 (released in 2021).
See the link:
https://github.com/ClusterLabs/resource-agents/blob/baddb06c9720c8df3dadea42ad863d5948a6345f/ChangeLog

The current version of resource-agents is 4.12.0.


Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in resource-agents (Ubuntu):
status: New → Confirmed
Changed in resource-agents (Ubuntu):
status: Confirmed → Fix Released
Changed in resource-agents (Ubuntu Jammy):
status: New → Triaged
Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

Hi Keha, thanks for filing this bug.

This is the upstream patch addressing the issue:

https://github.com/ClusterLabs/resource-agents/commit/214149d28142e4889e27f6637de8a8508c2f4d27

The change seems straightforward.

Would you be willing to prepare a patch for this one and fill in the SRU paperwork for this bug?

I would be happy to guide you through the process in case you are not used to it, and sponsor your uploads.

Otherwise, I am adding this to the server team backlog, meaning someone will work on this one as the server team time allows.

tags: added: bitesize
summary: - deprecated package
+ pgsql heartbeat does not support current postgreSQL version
Revision history for this message
Keha (keha) wrote :

Why not just build a new package?
The HA team has already tested version 4.12
https://qa.debian.org/developer.php?login=debian-ha-maintainers%40alioth-lists.debian.net

Revision history for this message
Paride Legovini (paride) wrote :

Fixing the bug in Jammy requires following the Stable Release Updates (SRU) process:

  https://wiki.ubuntu.com/StableReleaseUpdates

which requires targeted patches to fix specific bugs. New upstream versions of packages are normally not uploaded to stable Ubuntu releases (there are exceptions, but this is the general rule).

Revision history for this message
Michał Małoszewski (michal-maloszewski99) wrote :

Hi Keha!

Could you please provide me with exact steps to reproduce the issue or at least test the package with changes when it lands in the -proposed?
I have prepared the PostgreSQL setup, and I wanted to call the specific function from the pgsql heartbeat file that triggers the wal receiver monitoring, but I bumped into "nested" issues along the way and could not reproduce the problem from the bug.
I'd appreciate your response.

Revision history for this message
Keha (keha) wrote :

I can test the package

Revision history for this message
Michał Małoszewski (michal-maloszewski99) wrote :

Thank you so much Keha. We will need more detailed test feedback, along the lines of "I did X and Y, and that is why it worked/failed", not just "worked/failed". Thank you in advance. I will let you know when it is ready to be tested.

description: updated
description: updated
description: updated
Changed in resource-agents (Ubuntu Jammy):
assignee: nobody → Michał Małoszewski (michal-maloszewski99)
status: Triaged → In Progress
description: updated
description: updated
Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

Hi Michal, thanks for the patch.

While the patch LGTM and the fix seems quite straightforward, the [Test Plan] seems to be missing a more substantial description of the affected and expected behaviors. I also understand testing this issue may not be a trivial task and I see the bug reporter agreed to test the potential fix.

Keha, would you be willing to come up with such a test plan for SRU documentation purposes?

Otherwise, I am leaving the test plan discussions/assessment for the SRU team and am proceeding with the fix upload here.

Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

Also, note that 4.7.0-1ubuntu7.1 is hanging in proposed (see LP: #1981598). It would be nice to ping that bug to ensure it is verified before this SRU is accepted. Otherwise, we will need to re-upload this to include both bug changelogs, and both verifications will need to be completed so this can land.

tags: added: server-todo
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

> Also, note that 4.7.0-1ubuntu7.1 is hanging in proposed (see LP: #1981598).

I released that one today.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The test plan is indeed quite vague, and I don't think it can be accepted as is. There are no steps, no indication of what the error looks like, and no indication of what good behavior is.

From my days with using juju, I do remember that the postgresql charm is quite good, and if you deploy two units of it, it will be configured with replication. That could be a starting point for the test case here, only missing configuring the pgsql heartbeat.

Changed in resource-agents (Ubuntu Jammy):
status: In Progress → Incomplete
tags: removed: bitesize
Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

> From my days with using juju, I do remember that the postgresql charm is quite good, and if you deploy two units of it, it will be configured with replication. That could be a starting point for the test case here, only missing configuring the pgsql heartbeat.

Leaving this here as an example of juju being used in a test plan of an HA related SRU:

https://bugs.launchpad.net/ubuntu/+source/crmsh/+bug/1972730/comments/30

Revision history for this message
Michał Małoszewski (michal-maloszewski99) wrote :

I've tried with a new setup and I used the steps from: https://juju.is/docs/olm/get-started-with-juju

I had trouble with both the lxd option and microk8s.

1) using lxd - despite having enough space, I got this message while bootstrapping a controller: no space on device
2) using microk8s -

root@jammy-pgsql-test:~# sudo juju bootstrap microk8s pgsql1

ERROR required addons not enabled for microk8s, run 'microk8s enable dns storage'
root@jammy-pgsql-test:~# sudo microk8s enable dns storage
Infer repository core for addon dns
Infer repository core for addon storage
Addon core/dns is already enabled
Addon core/storage is already enabled
root@jammy-pgsql-test:~# sudo juju bootstrap microk8s pgsql1
ERROR required addons not enabled for microk8s, run 'microk8s enable dns storage'
root@jammy-pgsql-test:~#

The same error repeats.

I would appreciate any feedback. In the end I can improve my test plan, which did not include the juju approach.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Ignore microk8s, it's not necessary for this.

These were my steps:
- create jammy VM. I used 2GB of RAM and 20GB of disk
ubuntu@j-pgsql:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           1.9Gi       175Mi       1.2Gi       1.0Mi       572Mi       1.6Gi
Swap:             0B          0B          0B
ubuntu@j-pgsql:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           198M  1.1M  197M   1% /run
/dev/vda1        20G  1.7G   18G   9% /
tmpfs           988M     0  988M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/vda15      105M  6.1M   99M   6% /boot/efi
tmpfs           198M  4.0K  198M   1% /run/user/1000

Then:
ubuntu@j-pgsql:~$ sudo lxd init --auto
ubuntu@j-pgsql:~$ lxc network set lxdbr0 ipv6.address none
ubuntu@j-pgsql:~$ sudo snap install juju --classic
juju (2.9/stable) 2.9.43 from Canonical✓ installed

$ juju clouds
Since Juju 2 is being run for the first time, it has downloaded the latest public cloud information.
Only clouds with registered credentials are shown.
There are more clouds, use --all to see them.
You can bootstrap a new controller using one of these clouds...

Clouds available on the client:
Cloud      Regions  Default    Type  Credentials  Source    Description
localhost  1        localhost  lxd   0            built-in  LXD Container Hypervisor

Now bootstrap (I didn't give the model a name, i.e., no "tutorial-controller". That doesn't matter):
(this takes a few minutes)

ubuntu@j-pgsql:~$ juju bootstrap localhost
Creating Juju controller "localhost-localhost" on localhost/localhost
Looking for packaged Juju agent version 2.9.43 for amd64
Located Juju agent version 2.9.43-ubuntu-amd64 at https://streams.canonical.com/juju/tools/agent/2.9.43/juju-2.9.43-linux-amd64.tgz
To configure your system to better support LXD containers, please see: https://linuxcontainers.org/lxd/docs/master/explanation/performance_tuning/
Launching controller instance(s) on localhost/localhost...
 - juju-3b04ae-0 (arch=amd64)
Installing Juju agent on bootstrap instance
Fetching Juju Dashboard 0.8.1
Waiting for address
Attempting to connect to 10.154.44.244:22
Connected to 10.154.44.244
Running machine configuration script...
Bootstrap agent now started
Contacting Juju controller at 10.154.44.244 to verify accessibility...

Bootstrap complete, controller "localhost-localhost" is now available
Controller machines are in the "controller" model
Initial model "default" added

Now we are ready to deploy apps. Let's deploy postgresql in HA mode:

$ juju deploy postgresql -n 2
Located charm "postgresql" in charm-hub, revision 288
Deploying "postgresql" from charm-hub charm "postgresql", revision 288 in channel 14/stable on jammy

And monitor its progress with:
ubuntu@j-pgsql:~$ juju status --watch 5s
Model    Controller           Cloud/Region         Version  SLA          Timestamp
default  localhost-localhost  localhost/localhost  2.9.43   unsupported  12:38:26Z

App         Version  Status   Scale  Charm       Channel    Rev  Exposed  Message
postgresql           waiting    0/2  postgresql  14/stable  288  no       waiting for machine

...


Revision history for this message
Michał Małoszewski (michal-maloszewski99) wrote :

I built the setup and configured pgsql, and I ran into the same trouble as mentioned in comment #5:
a nested, _weird_ problem with a function call.

ubuntu@juju-fc1c81-1:~$ /usr/lib/ocf/resource.d/heartbeat/pgsql
        usage: /usr/lib/ocf/resource.d/heartbeat/pgsql start|stop|status|monitor|promote|demote|notify|meta-data|validate-all|methods

        /usr/lib/ocf/resource.d/heartbeat/pgsql manages a PostgreSQL Server as an HA resource.

        The 'start' operation starts the PostgreSQL server.
        The 'stop' operation stops the PostgreSQL server.
        The 'status' operation reports whether the PostgreSQL is up.
        The 'monitor' operation reports whether the PostgreSQL is running.
        The 'promote' operation promotes the PostgreSQL server.
        The 'demote' operation demotes the PostgreSQL server.
        The 'validate-all' operation reports whether the parameters are valid.
        The 'methods' operation reports on the methods /usr/lib/ocf/resource.d/heartbeat/pgsql supports.

ubuntu@juju-fc1c81-1:~$ /usr/lib/ocf/resource.d/heartbeat/pgsql monitor
error: Could not connect to controller: Permission denied
ocf-exit-reason:Setup problem: couldn't find command:

ubuntu@juju-fc1c81-1:~$ sudo /usr/lib/ocf/resource.d/heartbeat/pgsql monitor
ocf-exit-reason:Setup problem: couldn't find command:

ubuntu@juju-fc1c81-1:~$

When I try to fix the problem, the next error/warning appears.

I think that in the end Keha can test it when it reaches -proposed. I will update the Test Plan to be less vague and provide as many details as I know about it. What do you think? @Keha, @ahasenack
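
The usage message and the "couldn't find command" error above appear consistent with the agent being run without its OCF environment: it reads its parameters from `OCF_RESKEY_*` variables, and `OCF_ROOT` tells it where the OCF tree lives. The sketch below prints the variable set used in the test plan of this report for a given PostgreSQL major version. `pgsql_agent_env` is a hypothetical helper, and the paths assume the Ubuntu packaging layout.

```shell
# Hypothetical helper: print the OCF variables for PostgreSQL major version
# $1, following the Ubuntu paths used in the test plan.
pgsql_agent_env() {
    ver=$1
    cat <<EOF
OCF_ROOT=/usr/lib/ocf
OCF_RESKEY_check_wal_receiver=true
OCF_RESKEY_socketdir=/run/postgresql
OCF_RESKEY_config=/etc/postgresql/$ver/main/postgresql.conf
OCF_RESKEY_pgctl=/usr/lib/postgresql/$ver/bin/pg_ctl
OCF_RESKEY_pgdata=/var/lib/postgresql/$ver/main
EOF
}

pgsql_agent_env 14
# Possible use as root on the secondary (not executed here):
#   env $(pgsql_agent_env 14) /usr/lib/ocf/resource.d/heartbeat/pgsql monitor
```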

description: updated
description: updated
description: updated
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I have a test case that does not involve setting up a pacemaker cluster, but should be enough to verify that the resource agent is now properly detecting the postgresql walreceiver process. Here it is. Could you please run it yourself and see if you get the same results as I did?

[Test plan]
Create a lxd container for the primary postgresql:
$ lxc launch ubuntu:jammy j1

Connect and install packages:
$ lxc shell j1
# apt update && apt install postgresql resource-agents-extra pacemaker-cli-utils -y

Configure postgresql.conf and pg_hba.conf:
# pg_conftool 14 main set listen_addresses '*'
# pg_conftool 14 main set wal_level replica
# echo "host replication replicator all scram-sha-256" >> /etc/postgresql/14/main/pg_hba.conf

Create replication user (choose a password, and remember it, it will be needed again later):
# sudo -u postgres createuser --replication -P -e replicator

Restart the primary postgresql:
# systemctl restart postgresql

Back on the host, create lxd container for the secondary postgresql:
$ lxc launch ubuntu:jammy j2

Connect and install packages:
$ lxc shell j2
# apt update && apt install postgresql resource-agents-extra pacemaker-cli-utils -y

Stop postgresql:
# systemctl stop postgresql

Configure postgresql.conf:
# pg_conftool 14 main set listen_addresses '*'
# pg_conftool 14 main set hot_standby on

Cleanup data dir:
# rm -rf /var/lib/postgresql/*/main/*

Perform initial replication as "postgres" user. The pg_basebackup command will prompt for the "replicator" password created earlier on the primary:
# sudo -u postgres -i
$ pg_basebackup -h <IP-of-primary> -D /var/lib/postgresql/14/main -U replicator -P -v -R
$ exit

Start the secondary:
# systemctl start postgresql

Verify replication: list of databases on the secondary does not have a "test" database:
# sudo -u postgres psql -l 2>/dev/null | grep test
#

On the primary, create a test database:
$ lxc shell j1
# sudo -u postgres createdb test
could not change directory to "/root": Permission denied
# sudo -u postgres psql -l 2>/dev/null | grep test
 test | postgres | UTF8 | C.UTF-8 | C.UTF-8 |

On the secondary, verify that the test database now exists:
$ lxc shell j2
# sudo -u postgres psql -l 2>/dev/null | grep test
 test | postgres | UTF8 | C.UTF-8 | C.UTF-8 |

Check that the secondary does have a "walreceiver" process running:
$ lxc shell j2
# ps axw|grep -E "postgres:.*wal" | grep -v grep
   6001 ? Ss 0:06 postgres: 14/main: walreceiver streaming 0/7000780

Now run this long command, one line, on the secondary. With the bug present, the command will complain that the walreceiver process is NOT running:
# OCF_RESKEY_check_wal_receiver=true OCF_RESKEY_socketdir=/run/postgresql OCF_RESKEY_config=/etc/postgresql/14/main/postgresql.conf OCF_RESKEY_pgctl=/usr/lib/postgresql/14/bin/pg_ctl OCF_RESKEY_pgdata=/var/lib/postgresql/14/main OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/pgsql monitor
INFO: Don't check /var/lib/postgresql/14/main during probe
WARNING: wal receiver process is not running

With the bug fixed, the warning will not be present in the output.

Revision history for this message
Michał Małoszewski (michal-maloszewski99) wrote :

Thank you so much Andreas, that helps a lot! And obviously it works fine.

If you don't mind, I will put that Test Plan that you created in the Test Plan section of the SRU template.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Yes, go ahead. In the end I couldn't use the juju postgresql deployment for this test because it uses a snap, and it turns out manually configuring postgresql replication wasn't so difficult.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Heads up that I had to update the "long command" from the test case to use OCF_RESKEY_pgctl=/usr/lib/postgresql/14/bin/pg_ctl, the path was incorrect before. I edited the comment already.

description: updated
Steve Langasek (vorlon)
Changed in resource-agents (Ubuntu Jammy):
status: Incomplete → In Progress
Revision history for this message
Steve Langasek (vorlon) wrote : Please test proposed package

Hello Keha, or anyone else affected,

Accepted resource-agents into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/resource-agents/1:4.7.0-1ubuntu7.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in resource-agents (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Revision history for this message
Michał Małoszewski (michal-maloszewski99) wrote :

Jammy SRU Verification:

The fix works, 1:4.7.0-1ubuntu7.2 fixes the bug.

I've created the jammy container using steps from the [Test Plan] section listed above in the Bug Description and inside that container:

I have installed resource-agents using:

$ apt install resource-agents

Then I typed in:

$ apt policy resource-agents

The output:

resource-agents:
  Installed: 1:4.7.0-1ubuntu7.1
  Candidate: 1:4.7.0-1ubuntu7.1
  Version table:
 *** 1:4.7.0-1ubuntu7.1 500
        500 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages
        100 /var/lib/dpkg/status
     1:4.7.0-1ubuntu7 500
        500 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages

Then I repeated the steps from the [Test Plan] section.

I've noticed that nothing has changed there, so the problem still exists because, as we could see in the output, the package version is not the one where the fix is.

Output:

INFO: Don't check /var/lib/postgresql/14/main during probe
WARNING: wal receiver process is not running

Then, I enabled proposed.

Then I upgraded resource-agents using:

$ apt install resource-agents=1:4.7.0-1ubuntu7.2

Later, I typed in:

$ apt policy resource-agents
to check if the installed version has changed, and we see that we have a new version installed (with a fix).

resource-agents:
  Installed: 1:4.7.0-1ubuntu7.2
  Candidate: 1:4.7.0-1ubuntu7.2
  Version table:
 *** 1:4.7.0-1ubuntu7.2 500
        500 http://archive.ubuntu.com/ubuntu jammy-proposed/universe amd64 Packages
        100 /var/lib/dpkg/status
     1:4.7.0-1ubuntu7.1 500
        500 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages

Finally, when I repeated the steps from the [Test Plan] the problem did not exist; the warning was not present in the output. So the fix works.
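
The before/after comparison in this verification hinges on the `Installed:` line of `apt policy`. A minimal sketch for extracting it in a script follows; `installed_version` is a hypothetical helper, and the sample text is abridged from the output above.

```shell
# Hypothetical helper: print the version from the "Installed:" line of
# `apt policy <package>` output read on stdin.
installed_version() {
    awk '$1 == "Installed:" { print $2; exit }'
}

# Abridged sample from the verification output.
sample_policy='resource-agents:
  Installed: 1:4.7.0-1ubuntu7.2
  Candidate: 1:4.7.0-1ubuntu7.2'

version=$(printf '%s\n' "$sample_policy" | installed_version)
if [ "$version" = '1:4.7.0-1ubuntu7.2' ]; then
    echo "fixed package installed"
else
    echo "still on $version"
fi
```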

tags: added: verification-done verification-done-jammy
removed: verification-needed verification-needed-jammy
Revision history for this message
Robie Basak (racb) wrote : Update Released

The verification of the Stable Release Update for resource-agents has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package resource-agents - 1:4.7.0-1ubuntu7.2

---------------
resource-agents (1:4.7.0-1ubuntu7.2) jammy; urgency=medium

  * d/p/pgsql-heartbeat-postgresql-issue-jammy.patch: walreceiver name
    is changed to make the check for running wal receiver process
    compatible with PostgreSQL >= 11 (LP: #2013084)

 -- Michal Maloszewski <email address hidden> Wed, 03 May 2023 10:44:56 +0200

Changed in resource-agents (Ubuntu Jammy):
status: Fix Committed → Fix Released