503 Service Unavailable after a postgresql failover, Landscape still trying to connect to an old postgres even after rendering a new config file

Bug #2076143 reported by Nobuto Murata
This bug affects 1 person
Affects          Status  Importance  Assigned to  Milestone
Landscape Charm  New     Undecided   Unassigned

Bug Description

How to reproduce:

1. Deploy the stable bundle

$ juju deploy landscape-scalable
Located bundle "landscape-scalable" in charm-hub, revision 33

2. Scale out postgresql to 3 units for HA

$ juju add-unit postgresql -n 2

3. Take down the primary postgresql unit by forcibly powering it off

After that, Landscape stops functioning until I manually run `lsctl restart` to force the Landscape server processes to reload the config.

$ curl -sLkv 192.168.151.106

...

< HTTP/1.1 503 Service Unavailable
< content-length: 107
< cache-control: no-cache
< content-type: text/html
<
{ [107 bytes data]
* Connection #1 to host 192.168.151.106 left intact

Aug 6 02:57:15 pingserver-2 CRIT Unhandled error in Deferred:
Aug 6 02:57:15 pingserver-2 CRIT #012Traceback (most recent call last):#012 File "/usr/lib/python3/dist-packages/storm/exceptions.py", line 165, in wrap_exceptions#012 yield#012 File "/usr/lib/python3/dist-packages/storm/databases/postgres.py", line 438, in raw_connect#012 return self._raw_connect()#012 File "/usr/lib/python3/dist-packages/storm/databases/postgres.py", line 419, in _raw_connect#012 raw_connection = ConnectionWrapper(psycopg2.connect(self._dsn), self)#012 File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect#012 conn = _connect(dsn, connection_factory=connection_factory, **kwasync)#012psycopg2.OperationalError: connection to server at "192.168.151.108", port 5432 failed: No route to host#012#011Is the server running on that host and accepting TCP/IP connections?#012#012#012The above exception was the direct cause of the following exception:#012#012Traceback (most recent call last):#012 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 244, in inContext#012 result = inContext.theWork() # type: ignore[attr-defined]#012 File "/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 260, in <lambda>#012 inContext.theWork = lambda: context.call( # type: ignore[attr-defined]#012 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 117, in callWithContext#012 return self.currentContext().callWithContext(ctx, func, *args, **kw)#012 File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 82, in callWithContext#012 return func(*args, **kw)#012 File "/usr/lib/python3/dist-packages/storm/twisted/transact.py", line 78, in _wrap#012 result = function(*args, **kwargs)#012 File "/opt/canonical/landscape/canonical/landscape/pingserver/pingserver.py", line 131, in check_for_messages#012 for name, account_store in get_account_stores():#012 File "/opt/canonical/landscape/canonical/landscape/model/account/store.py", line 21, in get_account_stores#012 yield (name, 
zstorm.get(name))#012 File "/usr/lib/python3/dist-packages/storm/zope/zstorm.py", line 179, in get#012 return self.create(name, default_uri)#012 File "/usr/lib/python3/dist-packages/storm/zope/zstorm.py", line 155, in create#012 store = Store(database)#012 File "/usr/lib/python3/dist-packages/storm/store.py", line 83, in __init__#012 self._connection = database.connect(self._event)#012 File "/usr/lib/python3/dist-packages/storm/database.py", line 584, in connect#012 return self.connection_factory(self, event)#012 File "/usr/lib/python3/dist-packages/storm/database.py", line 270, in __init__#012 self._raw_connection = self._database.raw_connect()#012 File "/usr/lib/python3/dist-packages/storm/databases/postgres.py", line 437, in raw_connect#012 with wrap_exceptions(self):#012 File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__#012 self.gen.throw(typ, value, traceback)#012 File "/usr/lib/python3/dist-packages/storm/exceptions.py", line 184, in wrap_exceptions#012 six.raise_from(wrapped.with_traceback(tb), e)#012 File "<string>", line 3, in raise_from#012 File "/usr/lib/python3/dist-packages/storm/exceptions.py", line 165, in wrap_exceptions#012 yield#012 File "/usr/lib/python3/dist-packages/storm/databases/postgres.py", line 438, in raw_connect#012 return self._raw_connect()#012 File "/usr/lib/python3/dist-packages/storm/databases/postgres.py", line 419, in _raw_connect#012 raw_connection = ConnectionWrapper(psycopg2.connect(self._dsn), self)#012 File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect#012 conn = _connect(dsn, connection_factory=connection_factory, **kwasync)#012storm.database.OperationalError: connection to server at "192.168.151.108", port 5432 failed: No route to host#012#011Is the server running on that host and accepting TCP/IP connections
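
The `#012` and `#011` sequences in the log above look like syslog's escaping of control characters as `#` followed by a three-digit octal code (012 is a newline, 011 a tab). A minimal Python sketch, assuming that escaping, to make such a traceback readable again. `unescape_syslog` is a hypothetical helper name and not part of Landscape:

```python
import re

# "#" followed by exactly three octal digits, e.g. "#012".
_ESCAPE = re.compile(r"#([0-7]{3})")


def unescape_syslog(line: str) -> str:
    """Turn syslog-style octal escapes back into control characters.

    Assumption: only control characters were escaped, so sequences that
    decode to printable characters are left untouched.
    """

    def _decode(match: re.Match) -> str:
        code = int(match.group(1), 8)
        if code < 0x20 or code == 0x7F:
            return chr(code)
        return match.group(0)

    return _ESCAPE.sub(_decode, line)


if __name__ == "__main__":
    sample = "failed: No route to host#012#011Is the server running on that host"
    print(unescape_syslog(sample))
```

Piping the pingserver log lines through this function should print each traceback frame on its own line.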

[initial juju status]

$ juju status
Model Controller Cloud/Region Version SLA Timestamp
landscape maas-controller maas/default 3.5.3 unsupported 02:44:44Z

App Version Status Scale Charm Channel Rev Exposed Message
haproxy active 1 haproxy latest/stable 75 yes Unit is ready
landscape-server active 1 landscape-server latest/stable 111 no Unit is ready
postgresql 14.11 active 3 postgresql 14/stable 429 no
rabbitmq-server 3.9.13 active 1 rabbitmq-server 3.9/stable 188 no Unit is ready

Unit Workload Agent Machine Public address Ports Message
haproxy/0* active idle 0 192.168.151.106 80,443/tcp Unit is ready
landscape-server/0* active idle 1 192.168.151.109 Unit is ready
postgresql/0 active idle 2 192.168.151.107 5432/tcp
postgresql/1 active idle 4 192.168.151.110 5432/tcp
postgresql/2* active idle 5 192.168.151.108 5432/tcp Primary
rabbitmq-server/0* active idle 3 192.168.151.111 5672,15672/tcp Unit is ready

Machine State Address Inst id Base AZ Message
0 started 192.168.151.106 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.109 machine-3 ubuntu@22.04 default Deployed
2 started 192.168.151.107 machine-4 ubuntu@22.04 default Deployed
3 started 192.168.151.111 machine-5 ubuntu@22.04 default Deployed
4 started 192.168.151.110 machine-7 ubuntu@22.04 default Deployed
5 started 192.168.151.108 machine-6 ubuntu@22.04 default Deployed

[initial landscape config]

$ juju exec -u landscape-server/0 -- head -n 12 /etc/landscape/service.conf
[stores]
user = landscape
password = 7YyGp99aOAKBdkCG
host = 192.168.151.108:5432
main = landscape-standalone-main
account-1 = landscape-standalone-account-1
resource-1 = landscape-standalone-resource-1
package = landscape-standalone-package
session = landscape-standalone-session
session-autocommit = landscape-standalone-session?isolation=autocommit
knowledge = landscape-standalone-knowledge

^^^ 192.168.151.108, the current primary postgresql host, is written as the host

[initial curl status]

$ curl -sLkv -o/dev/null 192.168.151.106

...

* Issue another request to this URL: 'https://192.168.151.106/new-standalone-user'
* Found bundle for host 192.168.151.106: 0x63ba18e1a050 [serially]
* Can not multiplex, even if we wanted to!
* Re-using existing connection! (#1) with host 192.168.151.106
* Connected to 192.168.151.106 (192.168.151.106) port 443 (#1)
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
} [5 bytes data]
> GET /new-standalone-user HTTP/1.1
> Host: 192.168.151.106
> User-Agent: curl/7.81.0
> Accept: */*
>
* TLSv1.2 (IN), TLS header, Supplemental data (23):
{ [5 bytes data]
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 Ok
< server: TwistedWeb/22.1.0

-> 200

[juju status after a postgresql failover]

$ juju status
Model Controller Cloud/Region Version SLA Timestamp
landscape maas-controller maas/default 3.5.3 unsupported 02:52:45Z

App Version Status Scale Charm Channel Rev Exposed Message
haproxy active 1 haproxy latest/stable 75 yes Unit is ready
landscape-server active 1 landscape-server latest/stable 111 no Unit is ready
postgresql 14.11 active 2/3 postgresql 14/stable 429 no
rabbitmq-server 3.9.13 active 1 rabbitmq-server 3.9/stable 188 no Unit is ready

Unit Workload Agent Machine Public address Ports Message
haproxy/0* active idle 0 192.168.151.106 80,443/tcp Unit is ready
landscape-server/0* active idle 1 192.168.151.109 Unit is ready
postgresql/0* active idle 2 192.168.151.107 5432/tcp Primary
postgresql/1 active idle 4 192.168.151.110 5432/tcp
postgresql/2 unknown lost 5 192.168.151.108 5432/tcp agent lost, see 'juju show-status-log postgresql/2'
rabbitmq-server/0* active idle 3 192.168.151.111 5672,15672/tcp Unit is ready

Machine State Address Inst id Base AZ Message
0 started 192.168.151.106 machine-2 ubuntu@22.04 default Deployed
1 started 192.168.151.109 machine-3 ubuntu@22.04 default Deployed
2 started 192.168.151.107 machine-4 ubuntu@22.04 default Deployed
3 started 192.168.151.111 machine-5 ubuntu@22.04 default Deployed
4 started 192.168.151.110 machine-7 ubuntu@22.04 default Deployed
5 down 192.168.151.108 machine-6 ubuntu@22.04 default Deployed

[landscape config after the failover]

$ juju exec -u landscape-server/0 -- head -n 12 /etc/landscape/service.conf
[stores]
user = landscape
password = 7YyGp99aOAKBdkCG
host = 192.168.151.107:5432
main = landscape-standalone-main
account-1 = landscape-standalone-account-1
resource-1 = landscape-standalone-resource-1
package = landscape-standalone-package
session = landscape-standalone-session
session-autocommit = landscape-standalone-session?isolation=autocommit
knowledge = landscape-standalone-knowledge

^^^ the charm (probably the db-relation-changed hook) has written the new host, 192.168.151.107, in place of the previous primary node

But the Landscape server processes still try to connect to the previous primary node.

> storm.database.OperationalError: connection to server at "192.168.151.108", port 5432 failed: No route to host#012#011Is the server running on that host and accepting TCP/IP connections
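
The mismatch between the rendered config and the host the processes actually dial can be checked mechanically. A minimal sketch, assuming the `[stores]` layout and the error message format shown in this report; the function names are hypothetical and not part of Landscape or the charm:

```python
import configparser
import re
from typing import Optional

# Matches the psycopg2/storm error text quoted in this report.
_ERROR = re.compile(r'connection to server at "([^"]+)", port (\d+)')


def configured_db_host(service_conf_text: str) -> str:
    """Return the host:port value from the [stores] section."""
    # interpolation=None so a "%" in the password is not treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(service_conf_text)
    return parser["stores"]["host"]


def host_in_error(log_line: str) -> Optional[str]:
    """Return host:port from a connection error line, or None if absent."""
    match = _ERROR.search(log_line)
    if match is None:
        return None
    return f"{match.group(1)}:{match.group(2)}"


def is_stale(service_conf_text: str, log_line: str) -> bool:
    """True when the process tried a host other than the configured one."""
    attempted = host_in_error(log_line)
    return attempted is not None and attempted != configured_db_host(service_conf_text)
```

With the post-failover config (192.168.151.107:5432) and the error line above (192.168.151.108, port 5432), `is_stale` returns `True`, which is the symptom described here.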

Nobuto Murata (nobuto) wrote :

Subscribing ~field-high. HA of a Landscape deployment doesn't survive a failover without human intervention.

Nobuto Murata (nobuto) wrote :

I don't see anything that restarts or reloads the server processes after update_db_conf().
https://git.launchpad.net/landscape-charm/tree/src/charm.py?h=main#n498

_update_ready_status() is also called *without* restart_services=True.
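
The behaviour I would expect can be sketched as "restart only when the rendered config actually changed". This is an illustration under assumptions, not the charm's code and not the attached patch: `apply_db_conf`, `read_conf`, `write_conf` and `restart_landscape` are hypothetical names, and the only command taken from this report is `lsctl restart`.

```python
import subprocess
from typing import Callable


def restart_landscape() -> None:
    """Run the manual workaround from this report."""
    subprocess.run(["lsctl", "restart"], check=True)


def apply_db_conf(
    read_conf: Callable[[], str],
    write_conf: Callable[[], None],
    restart: Callable[[], None] = restart_landscape,
) -> bool:
    """Rewrite the config and restart only if its content changed.

    read_conf returns the current config text; write_conf re-renders it
    (the role update_db_conf() plays in the charm, by assumption).
    Returns True when a restart was triggered.
    """
    before = read_conf()
    write_conf()
    changed = read_conf() != before
    if changed:
        restart()
    return changed
```

Comparing the content before and after avoids restarting the services on every hook run when the primary has not moved.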

Nobuto Murata (nobuto) wrote :

For some reason I cannot propose a merge (https://bugs.launchpad.net/launchpad/+bug/2076198), but the attached patch should fix the issue.
