Switchover action may result in replication errors in some cases

Bug #1927541 reported by Paul Goins
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
PostgreSQL Charm
New
Undecided
Unassigned

Bug Description

Hi,

This was encountered on cs:postgresql-208. I'm not sure how frequent of an occurance this may be, or whether it might already be repaired by a newer version (I see rev 233 is available), but I wanted to share the finding just in case.

I recently had to restart some PostgreSQL units for applying package upgrades which required reboots. The cluster is a 2 node cluster for supporting Landscape.

I used the switchover action to change the master to whichever node was going to remain online. The first time I did this, things seemed to work fine. The second time I did this, apparently it did not go well.

I've analyzed logs, and if I'm reading this correctly, it looks like:
* The demoted box didn't restart into read only mode for about 45 minutes after the switchover action occurred.
* However, the promoted box restarted into read/write mode within about 18 seconds.
* This seems to indicate that both boxes may have been in read/write mode for some time, resulting in replication errors once the demoted box finally did restart.

I've provided my analysis below; I hope this is helpful.

Best Regards,
Paul Goins

----

First attempt: switchover from landscape-postgresql/0 to /1 (apparently successful)

juju: 2021-05-03T20:26:17Z: juju run -m lma landscape-postgresql/leader switchover master=landscape-postgresql/1

landscape-postgresql/0: 2021-05-03 20:26:45 UTC [1183]: [17026-1] db=,user= LOG: database system is shut down
                        2021-05-03 20:26:46 UTC [13363]: [4-1] db=,user= LOG: database system is ready to accept read only connections
landscape-postgresql/1: 2021-05-03 20:27:05 UTC [1507]: [8-1] db=,user= LOG: received promote request
                        2021-05-03 20:27:06 UTC [1396]: [16986-1] db=,user= LOG: database system is ready to accept connections
landscape-postgresql/0: Some later restarts occurred here that I was not aware of; not sure regarding the cause, but noting:
                        2021-05-03 22:10:10 UTC [13363]: [7-1] db=,user= LOG: database system is shut down
                        2021-05-03 22:11:44 UTC [4805]: [1-1] db=,user= LOG: database system was shut down in recovery at 2021-05-03 22:10:10 UTC
                        2021-05-03 22:26:20 UTC [4770]: [7-1] db=,user= LOG: database system is shut down
                        2021-05-03 22:29:53 UTC [1402]: [1-1] db=,user= LOG: database system was shut down in recovery at 2021-05-03 22:26:20 UTC

----------------------------------------------------------------------

Second attempt: switchover from landscape-postgresql/1 to /0 (not successful; /1's restart delayed, replication errors occurred)

juju: 2021-05-04T16:06:22Z: juju run -m lma landscape-postgresql/leader switchover master=landscape-postgresql/0

landscape-postgresql/0: 2021-05-04 16:06:40 UTC [1402]: [6-1] db=,user= LOG: received promote request
                        2021-05-04 16:06:40 UTC [1235]: [5-1] db=,user= LOG: database system is ready to accept connections
landscape-postgresql/1: 2021-05-04 16:06:40 UTC [16393]: [3-1] db=[unknown],user=_juju_repl LOG: disconnection: session time: 17:36:44.912 user=_juju_repl database= host=10.192.1.9 port=52284
                        [lots more disconnections]
                        2021-05-04 16:50:54 UTC [1396]: [17237-1] db=,user= LOG: database system is shut down
                        2021-05-04 16:52:03 UTC [19434]: [1-1] db=,user= LOG: database system was shut down at 2021-05-04 16:50:53 UTC
                        2021-05-04 16:52:03 UTC [19385]: [4-1] db=,user= LOG: database system is ready to accept read only connections

ERRORS START OCCURRING:

landscape-postgresql/0: 2021-05-04 16:52:03 UTC [26310]: [1-1] db=[unknown],user=[unknown] LOG: connection received: host=10.192.1.10 port=54244
                        2021-05-04 16:52:03 UTC [26310]: [2-1] db=[unknown],user=_juju_repl LOG: replication connection authorized: user=_juju_repl SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, compression=off)
                        2021-05-04 16:52:03 UTC [26310]: [3-1] db=[unknown],user=_juju_repl ERROR: requested starting point D8/D2000000 on timeline 2 is not in this server's history
                        2021-05-04 16:52:03 UTC [26310]: [4-1] db=[unknown],user=_juju_repl DETAIL: This server's history forked from timeline 2 at D8/D1270B50.
                        2021-05-04 16:52:03 UTC [26310]: [5-1] db=[unknown],user=_juju_repl LOG: disconnection: session time: 0:00:00.015 user=_juju_repl database= host=10.192.1.10 port=54244
                        (Repeats with different PIDs)

landscape-postgresql/1:
                        2021-05-04 16:52:03 UTC [19438]: [1-1] db=,user= LOG: fetching timeline history file for timeline 3 from primary server
                        2021-05-04 16:52:03 UTC [19438]: [2-1] db=,user= FATAL: could not start WAL streaming: ERROR: requested starting point D8/D2000000 on timeline 2 is not in this server's history
                                DETAIL: This server's history forked from timeline 2 at D8/D1270B50.
                        (repeats a lot)
                        2021-05-04 16:52:03 UTC [19385]: [4-1] db=,user= LOG: database system is ready to accept read only connections
                        2021-05-04 17:22:14 UTC [19385]: [7-1] db=,user= LOG: database system is shut down
                        2021-05-04 17:25:39 UTC [1451]: [1-1] db=,user= LOG: database system was shut down in recovery at 2021-05-04 17:22:14 UTC
                        (retries startup, keeps failing because of the same error though.)

Revision history for this message
Paul Goins (vultaire) wrote :

I'm not so sure about my judgment. Take it with a grain of salt.

Another likely useful bit of info is "juju show-status-log".

Here's output from landscape-postgresql/0: https://pastebin.ubuntu.com/p/6Cp49CwcWb/
And landscape-postgresql/1: https://pastebin.ubuntu.com/p/F926MCyjtH/

Still not sure how we ended up with the timeline errors though...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.