[2.5] MAAS does not guarantee DNS record updates propagation and returns 200 prematurely

Bug #1817484 reported by Dmitrii Shcherbakov on 2019-02-24
This bug affects 2 people
Affects: MAAS | Importance: Critical | Assigned to: Alberto Donato

Bug Description

MAAS currently returns 200 for DNS record update (PUT) requests even if the record has not yet been propagated to all bind servers. In short: clients get 200 but /etc/bind/maas/zone.<domain> still contains the old contents under certain conditions.

ii maas-region-api 2.5.0-7442-gdf68e30a5-0ubuntu1~18.04.1 all Region controller API service for MAAS

This is problematic for dns-ha type solutions which rely on MAAS to update a DNS record across all bind9 servers it manages. If clients receive 200 but MAAS fails to actually update the database and the bind server configuration, the clustering software cannot proceed properly.

In particular, in the case where a MAAS node hosting the DB master dies and the database fails over to a slave, MAAS seems to fail to ensure that the record is updated in the DB AND all bind servers are reloaded. The client then assumes it can resolve the record - in our case this is a pacemaker resource agent resolving the record in a loop and comparing the result to the desired value, which is a start condition of the resource.

In our scenario, a DNS-HA record resource agent is started on one of 3 MAAS hosts (2 of the 3 nodes have a master/slave postgres setup and the third is a standby DB node), and systemd-resolved on all 3 hosts points to the 3 MAAS servers in the deployment, so they are tried in turn until one returns either a value or NXDOMAIN:

root@maas-vhost3:~# systemd-resolve --status | grep 'DNS Servers' -A3
         DNS Servers: 10.100.1.2
                      10.100.2.2
                      10.100.3.2
          DNS Domain: maas

The resource agent sets up a "maas-region.test" record (the --fqdn parameter of the resource agent) that points to one of the MAAS nodes, as decided by Pacemaker and the resource agent logic.

From the crm resource trace it can be seen that the resource agent updates the DNS record from 10.100.2.2 to 10.100.3.2, gets 200 as a response, and then repeatedly tries to resolve it to see if it matches 10.100.3.2 - it never does, because 10.100.2.2 keeps being returned. Eventually the resource fails to start due to a timeout.
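The start condition described above amounts to a resolve-until-match loop. A minimal sketch of that logic (hypothetical names; `resolve` stands in for whatever lookup the real agent performs against the MAAS-managed bind servers):

```python
import time

def wait_for_record(expected_ip, resolve, timeout=60.0, interval=1.0):
    """Poll DNS until the record resolves to expected_ip or timeout expires.

    `resolve` is any callable returning the current A record value (or
    None on NXDOMAIN); in the real agent this would query the bind
    servers managed by MAAS.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if resolve() == expected_ip:
            return True  # record propagated; the resource may start
        time.sleep(interval)
    return False  # start fails with a timeout, as seen in the trace

# With a stale server that keeps returning the old address, the loop
# times out even though the PUT returned 200:
assert wait_for_record("10.100.3.2", lambda: "10.100.2.2",
                       timeout=0.05, interval=0.01) is False
```

This is why a premature 200 is fatal here: the agent has no other signal to wait on besides the resolution eventually matching.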

/var/lib/heartbeat/trace_ra/dns/res_maas_region_hostname.start.2019-02-24.13:35:29:
https://paste.ubuntu.com/p/vkwTHkWPw2/
https://paste.ubuntu.com/p/pTjS5bvwSR/ (crm status after timeout)

2019-02-24 13:35:31 [INFO] Update the dnsresource with IP: 10.100.3.2 # <--- this is when the resource agent updates the record

As for MAAS logs for the same time period:

2019-02-24 13:35:29 maasserver: [error] ################################ Exception: server closed the connection unexpectedly

2019-02-24 13:35:29 regiond: [info] 10.100.3.2 GET /MAAS/rpc/ HTTP/1.1 --> 500 INTERNAL_SERVER_ERROR (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)

2019-02-24 13:35:31 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)

2019-02-24 13:35:31 provisioningserver.rpc.common: [critical] Unhandled failure dispatching AMP command. This is probably a bug. Please ensure that this error is handled within application code or declared in the signature of the b'GetControllerType' command. [maas-vhost3:pid=32363:cmd=GetControllerType:ask=44cc]

# <------- PUT request for the update of the record to 10.100.3.2 ---------->
2019-02-24 13:35:32 regiond: [info] 127.0.0.1 PUT /MAAS/api/2.0/dnsresources/1/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)

2019-02-24 13:35:47 provisioningserver.rpc.common: [critical] Unhandled failure dispatching AMP command. This is probably a bug. Please ensure that this error is handled within application code or declared in the signature of the b'GetTimeConfiguration' command. [maas-vhost3:pid=32366:cmd=GetTimeConfiguration:ask=5e4]

# <------- A bunch of get requests by the resource agent: ------------->

2019-02-24 13:36:16 regiond: [info] 10.10.101.75 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2019-02-24 13:37:12 maasserver.regiondservices.active_discovery: [info] Active network discovery: Active scanning is not enabled on any subnet. Skipping periodic scan.
2019-02-24 13:37:29 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
2019-02-24 13:39:29 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
2019-02-24 13:42:12 maasserver.regiondservices.active_discovery: [info] Active network discovery: Active scanning is not enabled on any subnet. Skipping periodic scan.
2019-02-24 13:44:23 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/users/?op=whoami HTTP/1.1 --> 200 OK (referrer: -; agent: Python-httplib2/0.9.2 (gzip))
2019-02-24 13:44:51 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-httplib2/0.9.2 (gzip))
2019-02-24 13:45:30 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)

I would expect to either see a failure or something like this:

2019-02-24 xx:xx:xx maasserver.region_controller: [info] Reloaded DNS configuration:
  * ip 10.100.3.2 linked to resource maas-region on zone test
  * ip 10.100.2.2 unlinked from resource maas-region on zone test

But there is nothing like this, and to make it work a restart of all regionds is required.

As far as I can tell, MAAS processes DNS updates asynchronously with respect to the result of a DNS record change in the database.

src/maasserver/region_controller.py|101| self.processing = LoopingCall(self.process)

src/maasserver/region_controller.py
    def process(self):
        """Process the DNS and/or proxy update."""
        # ...
        if self.needsDNSUpdate:
            self.needsDNSUpdate = False
            d = deferToDatabase(transactional(dns_update_all_zones))
            d.addCallback(self._checkSerial)
            d.addCallback(self._logDNSReload)
            d.addErrback(_onFailureRetry, 'needsDNSUpdate')
            d.addErrback(
                log.err,
                "Failed configuring DNS.")

I am quite reliably getting to a point where an entry is updated in the DB but none of the region controllers that remain alive update the zone file (no bind restarts are done as well).
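The decoupling quoted above - the API handler commits to the DB and returns, while a separate LoopingCall later pushes the change to bind - can be illustrated with a toy sketch (hypothetical names, not the actual MAAS code):

```python
class RegionControllerSketch:
    """Toy model of the flag-plus-looping-call pattern quoted above."""

    def __init__(self, reload_zones):
        self.needsDNSUpdate = False
        self.reload_zones = reload_zones  # stands in for dns_update_all_zones

    def handle_put(self, db, name, ip):
        db[name] = ip               # synchronous DB write...
        self.needsDNSUpdate = True  # ...only schedules the bind reload
        return 200                  # the client already sees success here

    def process(self):
        # Runs later on its own timer; if this step fails (e.g. the DB
        # connection died during failover), the 200 was already sent.
        if self.needsDNSUpdate:
            self.needsDNSUpdate = False
            self.reload_zones()

db, reloaded = {}, []
rc = RegionControllerSketch(lambda: reloaded.append(dict(db)))
status = rc.handle_put(db, "maas-region.test", "10.100.3.2")
assert status == 200 and reloaded == []   # 200 returned before any reload
rc.process()
assert reloaded == [{"maas-region.test": "10.100.3.2"}]
```

The window between `handle_put` returning and `process` running is exactly where the failover in this bug lands.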


description: updated
Changed in maas:
status: New → Incomplete
Andres Rodriguez (andreserl) wrote :

Hi there,

Missing information:

1. Please attach maas.log and regiond.log
2. Attach syslog & bind journal
3. Enable debug logging and attach regiond.log

That said, the database executes triggers once new records are saved in the DB; those cause the regions to write the DNS config and restart bind. The actual issues could be:

1. Database connection issues
2. RPC connection issues
3. bind9 issues when restarting.
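The trigger path described here is PostgreSQL LISTEN/NOTIFY: row changes fire triggers that call pg_notify, and each regiond holds a connection listening on those channels. A rough sketch of the consumer side (the connection is duck-typed here as an assumption; with psycopg2 you would pass a real autocommit connection after executing LISTEN on the channels):

```python
def consume_notifies(conn, handlers):
    """Drain pending notifications from a LISTENing connection.

    `conn` only needs psycopg2's notification interface: a .poll()
    method and a .notifies list of objects with .channel and .payload.
    `handlers` maps channel name -> callable(payload).
    """
    conn.poll()  # pull any pending NOTIFY messages off the socket
    while conn.notifies:
        note = conn.notifies.pop(0)
        handler = handlers.get(note.channel)
        if handler is not None:
            handler(note.payload)

# A fake connection shows the flow without a live database:
class FakeNote:
    def __init__(self, channel, payload):
        self.channel, self.payload = channel, payload

class FakeConn:
    def __init__(self, notes):
        self.notifies = list(notes)
    def poll(self):
        pass

seen = []
conn = FakeConn([FakeNote("domain_update", "1")])
consume_notifies(conn, {"domain_update": seen.append})
assert seen == ["1"]  # a NOTIFY on domain_update would mark DNS dirty
```

If the listening connection dies silently during failover - the crux of this bug - nothing ever reaches this loop, so the regions never rewrite the DNS config or restart bind.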

Dmitrii Shcherbakov (dmitriis) wrote :

Hi Andres,

I will generate a reproducer with debug logging enabled and upload.

Dmitrii Shcherbakov (dmitriis) wrote :

Andres,

Uploaded /etc/ and /var from the node which processed the DNS update:
https://private-fileshare.canonical.com/~dima/maas-dumps/2019-02-25-maas-vhost3-etc-var-log.tar.gz

2019-02-25 10:28:39 regiond: [info] 127.0.0.1 PUT /MAAS/api/2.0/dnsresources/1/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)

Additional outputs to illustrate what happened:
https://pastebin.canonical.com/p/zqKSvHxJrT/
https://pastebin.canonical.com/p/Y6Vjftxdg6/ (resource agent log)

The forced stop of the maas-vhost1 node happened a little before 10:28:39. Then the DB failed over to maas-vhost2, and maas-vhost3 was the node selected by pacemaker to execute the DNS update from 10.100.1.2 -> 10.100.3.2, which succeeded. However, the zone files on the 2 surviving MAAS servers (maas-vhost2 and maas-vhost3) were not updated, so this reproducer is identical to the original one:

    name | name | ip
-------------+------+------------
 maas-region | test | 10.100.3.2

Non-authoritative answer:
Name: maas-region.test
Address: 10.100.1.2

for i in 1 2 3 ; do ssh ubuntu@10.100.$i.2 nslookup maas-region.test ; done
ssh: connect to host 10.100.1.2 port 22: No route to host
Server: 127.0.0.53
Address: 127.0.0.53#53

Non-authoritative answer:
Name: maas-region.test
Address: 10.100.1.2

Server: 127.0.0.53
Address: 127.0.0.53#53

Non-authoritative answer:
Name: maas-region.test
Address: 10.100.1.2

Changed in maas:
status: Incomplete → New
Changed in maas:
importance: Undecided → Critical
assignee: nobody → Alberto Donato (ack)
Andres Rodriguez (andreserl) wrote :

Hi Dmitrii,

Could you please attach logs from all your regions and not only one?

That said (please correct me if I'm wrong), what's happening is something like:

1. A record is being created over the API.
2. The DB where all machines are connected to fails, which causes a fail over.
3. The other machines did not get DNS updates because the DB failed over.

So, I have a quick question - When the failover happens, do the region controllers get restarted or do they simply wait for the database to be available again?

Changed in maas:
status: New → Incomplete
Dmitrii Shcherbakov (dmitriis) wrote :

Andres,

Logs from maas-vhost2 (current db master)
https://private-fileshare.canonical.com/~dima/maas-dumps/2019-02-25-maas-vhost2-etc-var-log.tar.gz

Logs from maas-vhost1 (failed master) - I used libguestfs to extract the logs from the killed machine (it is offline, and you can see garbage at the end of the region log file there because the page cache was obviously not flushed before the forced shutoff)
https://private-fileshare.canonical.com/~dima/maas-dumps/2019-02-25-maas-vhost1-etc-var-log.tar.gz

1. yes, it gets created at first (successfully), then a pacemaker resource agent updates it on failover of the node the record points to;

In this case, the record points to maas-vhost1 which also happens to be a postgres DB master

When maas-vhost1 is killed, the record is updated from a node where pacemaker decides to start the "dns record resource" (called res_maas_region_hostname) - from maas-vhost3 in this case.

2. DB fails over to maas-vhost2 and there is a resource constraint on res_maas_region_hostname such that pacemaker does not attempt to start it (on maas-vhost3 or maas-vhost2) until the DB VIP resource starts. And DB VIP only starts after a postgres master is available.

So: (1) db failover completes (2) DNS update is issued from maas-vhost3

https://pastebin.canonical.com/p/ZZKMvRWDQN/
order ord_promote inf: ms_pgsql:promote res_pgsql_vip:start symmetrical=false
order ord_hostname_start inf: res_pgsql_vip:start res_maas_region_hostname:start symmetrical=false

3. the DB contains the new record, so either the notifications from postgres somehow do not get delivered from the new master, or MAAS does not process them correctly.

> do the region controllers get restarted

No, we do not restart them - we just expect that DB notifications will arrive causing a bind9 reload.

Changed in maas:
status: Incomplete → New
Vern Hart (vhart) wrote :

Andres,

I'll let Dmitrii respond with the logs from the other region server. Note that one is down, so there will only be two sets of logs.

The scenario is part of a failure test for a customer. The three MAAS nodes are on different subnets, so they use dns-ha instead of a VIP. MAAS is accessed via a hostname, maas-region.test, that gets moved by pacemaker when the node holding the name fails.

To be clear, this is a non-clean loss of one server -- the postgres master.

So a specific scenario is (I might have the hosts different than Dima's reports):

Postgresql master is maas-vhost1
Postgresql slave is maas-vhost2
maas-region.test is on maas-vhost1

1. Remove power from maas-vhost1
2. Pacemaker promotes maas-vhost2 to pgsql master and makes maas-vhost3 the new slave
3. Pacemaker moves (starts) the hostname on maas-vhost2 (or 3)

Pacemaker is not monitoring regiond so it would not be restarting those services. They have to wait for the database to be available again.

At step 3, the first one or two starts may fail (500 server error) as the new master starts up and becomes available to MAAS. But eventually pacemaker succeeds in running the script that talks to MAAS to assign the new host's IP to maas-region.test. When this happens, querying MAAS (or pgsql directly) shows the new IP but querying DNS (or the zone files directly) shows the old IP address. This holds true no matter which MAAS server we're looking at.

summary: - [2.5] MAAS does not guarantee DNS dns record updates propagation and
- returns 200 prematurely
+ [2.5] MAAS does not guarantee DNS record updates propagation and returns
+ 200 prematurely
Andres Rodriguez (andreserl) wrote :

Hi guys,

Looking at the logs on #3, and based on the times specified there, I see the following:

1. At 10:28:38 I see a traceback, that would lead me to believe that there's DB connection issues:

2019-02-25 10:28:38 maasserver: [error] ################################ Exception: server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.
 ################################
2019-02-25 10:28:38 maasserver: [error] Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/db/backends/postgresql/base.py", line 236, in _set_autocommit
    self.connection.autocommit = autocommit
psycopg2.OperationalError: server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 235, in handle_uncaught_exception
    raise exc from exc.__cause__
  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 297, in get_response
    with transaction.atomic():
  File "/usr/lib/python3/dist-packages/django/db/transaction.py", line 184, in __enter__
    connection.set_autocommit(False, force_begin_transaction_with_broken_autocommit=True)
  File "/usr/lib/python3/dist-packages/django/db/backends/base/base.py", line 411, in set_autocommit
    self._set_autocommit(autocommit)
  File "/usr/lib/python3/dist-packages/django/db/backends/postgresql/base.py", line 236, in _set_autocommit
    self.connection.autocommit = autocommit
  File "/usr/lib/python3/dist-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/lib/python3/dist-packages/django/utils/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/django/db/backends/postgresql/base.py", line 236, in _set_autocommit
    self.connection.autocommit = autocommit
django.db.utils.OperationalError: server closed the connection unexpectedly
 This probably means the server terminated abnormally
 before or while processing the request.

2019-02-25 10:28:38 regiond: [info] 10.100.2.2 GET /MAAS/rpc/ HTTP/1.1 --> 500 INTERNAL_SERVER_ERROR (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2019-02-25 10:28:38 regiond: [info] 10.10.101.74 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)

2. At 10:28:39, I see the API PUT for DNS, as per #3 I assume this is the update for 10.100.3.2. However, also look at the following failure for ListNodePowerParameters. This leads me to believe that there are still DB connection issues.

2019-02-25 10:28:39 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
2019-02-25 10:28:39 regiond: [info] 127.0.0.1 PUT /MAAS/api/2.0/dnsresources/1/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
2019-02-25 10:28:39 provisioningserver.rpc.common: [critical] Unhandled ...


Changed in maas:
status: New → Incomplete
Andres Rodriguez (andreserl) wrote :

Actually, the answer to the last question may be here in comment #5:

"So: (1) db failover completes (2) DNS update is issued from maas-vhost3"

If this is the case, it seems to imply that you wait only for the postgresql failover to complete before executing the DNS update, instead of waiting for MAAS to be fully reconnected to postgresql first.

It would seem that you should wait not only for the DB to have fully failed over, but also for MAAS to be fully reconnected to the DB before attempting the DNS update.

Dmitrii Shcherbakov (dmitriis) wrote :

Andres,

Shouldn't MAAS return something like "500 Internal Server Error" if it does not have proper connectivity to the database?

We can retry the update request based on that; however, if 200 is returned for the DNS update and the entry is in the DB, then as a client we consider the operation successful and proceed to checking the resolution against the desired result.

> in what situation does res_maas_region_hostname execute the update

We wait until the DB VIP comes up which means that the master is up and then start the dns resource which sends an update to MAAS.

In general, the resource agent that updates the record is no different from any other API client - it cannot track MAAS's internal DB connectivity state and issue requests based on that.

Could the issue arise because the DB update is executed in a different thread (or threads) from the one that updates bind9 and listens for postgres notifications? Maybe different sockets are used for these, causing them to be closed separately, which results in multiple failures? (I can see that many connections are maintained to the DB: https://pastebin.canonical.com/p/z3KRMmMCRC/.)

From what I can see DNS update is done asynchronously:

https://github.com/maas/maas/blob/2.5.0/src/maasserver/region_controller.py#L160-L189

Changed in maas:
status: Incomplete → New
Andres Rodriguez (andreserl) wrote :

Hi guys.

Please correct me if I'm wrong as I don't know all the details, but as I understand the configuration of HA is as follows:

1. Pacemaker handles the fail-over of postgresql as a resource, and also handles updating the DNS entry in MAAS as another resource.
2. Once pacemaker has confirmed the failover of the DB, pacemaker immediately runs the DNS updating resource.
3. Pacemaker never ensures that MAAS has fully recovered from the fail-over before attempting to run a DNS update.

From that perspective, it would seem to me that, given that pacemaker is handling the fail-over of the database and that MAAS has not fully recovered from it, pacemaker shouldn't be attempting the DNS update until it can ensure that MAAS is up, running, and fully recovered.

I would recommend that either your DNS update resource/script ensure that MAAS is fully connected before attempting to update the DNS record, *or* that you add a new pacemaker resource that ensures MAAS has fully reconnected in the event of a fail-over.

On the other hand, what does seem to be an issue on the MAAS side is that it accepts API requests while there are DB connection issues. MAAS shouldn't be allowing API requests until it can ensure it is fully connected to the database. That said, this is separate from the fact that pacemaker shouldn't really assume it can execute a resource against a service (maas) that it does not track, but which depends on one that it does track (postgresql).

Again, we shouldn't assume that because pacemaker has recovered the database, we can execute actions in MAAS, when we cannot even ensure that MAAS itself has recovered from the failover (especially since regiond is not even managed by pacemaker, so pacemaker cannot ensure it has fully recovered).

Changed in maas:
status: New → Incomplete

How can we determine and observe externally that MAAS is fully functional after a database failover?

In this case Dmitrii got 200 from MAAS. I personally have no idea how that kind of pacemaker resource for checking MAAS health externally can be created.

Dmitrii Shcherbakov (dmitriis) wrote :

Andres,

1 - yes
2 - yes (master is up -> pg vip goes up -> dns update issued)
3 - correct, the resource agent we have does not do that.

As a temporary workaround, we could check whether MAAS has finished settling, but we'd need to know how to do that very reliably.

Changed in maas:
status: Incomplete → New
Alberto Donato (ack) wrote :

Dmitrii, could you please try this http://paste.ubuntu.com/p/QXQWdBsdwY/ and see if it solves your issue?
This will return a 500 if the region is not fully connected to the DB.

Alberto Donato (ack) wrote :

Apologies, the previous paste had an error in it, please try http://paste.ubuntu.com/p/SXhPsPqz8w/

Dmitrii Shcherbakov (dmitriis) wrote :

For #15:

# time before failover
ubuntu@maas-vhost3:~$ date
Mon Feb 25 18:12:47 UTC 2019

# PUT request
2019-02-25 18:13:16 regiond: [info] 127.0.0.1 PUT /MAAS/api/2.0/dnsresources/1/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)

Dmitrii Shcherbakov (dmitriis) wrote :

I looked at the situation from a different angle:

1) MAAS was up the whole night without regiond restarts (had enough time to settle) on two surviving nodes;
2) Stored procedures that issue pg_notify are present on the new master: https://pastebin.canonical.com/p/SfYWjpzc2v/

https://github.com/maas/maas/blob/2.5.0/src/maasserver/triggers/websocket.py#L937-L955 (rrset_sipaddress_link_notify)
https://github.com/maas/maas/blob/2.5.0/src/maasserver/triggers/websocket.py#L1268-L1270 (register_websocket_triggers)

3) When I manually execute updates via `maasdb# update maasserver_dnsresource_ip_addresses set staticipaddress_id=<new_staticip_id> where id=<dnsresource_id>` the surviving region controllers do not get any notifications.

4) There is no way to list the current listeners via pg_listening_channels across all postgres sessions (it only works for the current session):
https://pastebin.canonical.com/p/Kc63bJr23T/

So I cannot say if there are any registered listeners.

5) `maas <profile> dnsresources read` returns the same value as stored in the DB at any given moment (so the connectivity to the DB is fine).

6) SELECT pg_notify('domain_update', "<domain_id>"); does not trigger updates.

So, the patched MAAS region still does not get notifications.

Dmitrii Shcherbakov (dmitriis) wrote :

Also to #15, #16 and #17:

It appears that the regiond db listener only noticed the connection loss at 19:02:30 (almost an hour after the failover, which happened at ~18:12:50):

https://pastebin.canonical.com/p/g8FDBCtst5/

2019-02-25 18:02:24 maasserver.listener: [info] Listening for database notifications.
2019-02-25 19:02:27 maasserver.listener: [debug] Connection lost.
2019-02-25 19:02:30 maasserver.listener: [info] Listening for database notifications.
2019-02-25 20:03:54 maasserver.listener: [debug] Connection lost.
2019-02-25 20:03:54 maasserver.listener: [debug] Connection lost.
2019-02-25 20:03:54 maasserver.listener: [debug] Connection lost.
2019-02-25 20:03:57 maasserver.listener: [info] Listening for database notifications.
2019-02-25 20:03:57 maasserver.listener: [info] Listening for database notifications.
2019-02-25 20:03:57 maasserver.listener: [info] Listening for database notifications.
2019-02-25 20:10:19 maasserver.listener: [debug] Connection lost.
2019-02-25 20:10:22 maasserver.listener: [info] Listening for database notifications.

There were some 500 errors returned to client requests (because of the patched code):
https://pastebin.canonical.com/p/Q47fpgTfkf/

Then listener reported that it started listening for notifications a couple of times:
https://pastebin.canonical.com/p/Q47fpgTfkf/

The log also periodically gets messages like this:

2019-02-25 21:02:26 maasserver.bootresources: [critical] Importing boot resources failed.

        Traceback (most recent call last):
# ...
            listener.register("sys_stop_import", stop_import)
          File "/usr/lib/python3/dist-packages/maasserver/listener.py", line 223, in register
            "System channel '%s' has already been registered." % channel)
        maasserver.listener.PostgresListenerRegistrationError: System channel 'sys_stop_import' has already been registered.

This behavior started at 2019-02-25 19:02:27, when the listener first reported that it had lost the connection to the DB, and it continues now:

https://pastebin.canonical.com/p/KTHVjYGXHg/

The registration error comes from here: https://github.com/maas/maas/blob/2.5.0/src/maasserver/listener.py#L209-L223

          File "/usr/lib/python3/dist-packages/maasserver/listener.py", line 223, in register
            "System channel '%s' has already been registered." % channel)
        maasserver.listener.PostgresListenerRegistrationError: System channel 'sys_stop_import' has already been registered.

Dmitrii Shcherbakov (dmitriis) wrote :

An update after using this patch suggested by Alberto: http://paste.ubuntu.com/p/fDbrrsBmdP/

This was applied onto:

dpkg -l maas-region-api | grep ii
ii maas-region-api 2.5.1-7508-gb70537f3d-0ubuntu1~18.04.1 all Region controller API service for MAAS

(2.5.1 as I rebuilt the env from scratch and got the newly released version)

Also, I applied pretty aggressive system-wide TCP keepalive settings so that the sockets MAAS creates to connect to postgres time out much faster (1 + 3 * 1 = 4 seconds), because MAAS does not tune them itself.

# sysctl -a --pattern keepalive
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 1

The default kernel settings were not adequate (7200 seconds, ~2 hours, before keepalives are even sent):
https://paste.ubuntu.com/p/qy38pQHGRr/
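The worst-case dead-peer detection time implied by these sysctls is keepalive_time plus probes times intvl. A quick check of both the tuned values above and the stock Linux defaults (7200/9/75, assumed here):

```python
def tcp_keepalive_detection_seconds(keepalive_time, probes, intvl):
    """Idle time before probing, plus the probes that must all go unanswered."""
    return keepalive_time + probes * intvl

# Tuned settings from above: 1 + 3 * 1 = 4 seconds.
assert tcp_keepalive_detection_seconds(1, 3, 1) == 4

# Stock Linux defaults: over two hours before a dead postgres peer is
# noticed on an otherwise idle connection.
assert tcp_keepalive_detection_seconds(7200, 9, 75) == 7875
```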

# time before failover
root@maas-vhost1:~# date
Tue Feb 26 21:20:28 UTC 2019

https://paste.ubuntu.com/p/mwCMwzY2sT/

I managed to get 500 for both start actions on the 2 surviving nodes, which led to:

Failed Actions:
* res_maas_region_hostname_start_0 on maas-vhost1 'unknown error' (1): call=86, status=complete, exitreason='',
    last-rc-change='Tue Feb 26 21:20:51 2019', queued=0ms, exec=688ms
* res_maas_region_hostname_start_0 on maas-vhost3 'unknown error' (1): call=127, status=complete, exitreason='',
    last-rc-change='Tue Feb 26 21:20:50 2019', queued=0ms, exec=1030ms

This is expected and I can make a couple of retry attempts in the RA before failing.

Since I had not tuned failure-timeout, the failover happened only after a few minutes:
https://paste.ubuntu.com/p/kJRnYjjKQg/

When I tuned meta failure-timeout to 3 seconds for the DNS resource, the failover happened much faster, because pacemaker did a stop/start on that resource sooner after the start failure.

So, I believe the suggested patch and TCP keepalive tuning combined give a much better result. E.g. I started seeing OperationalErrors from psycopg2, which means that the keepalive settings did work.

The timeline with the last test I performed with failure-timeout=5:
https://paste.ubuntu.com/p/b6Nn3KvgfD/

Considering that there's a 4-second gap for TCP timeouts, a 5-second monitor on res_maas_region_hostname, some time before the new PG VIP and master come up, and some time for MAAS to recover, a ~20 second failover time seems reasonable and can be tweaked further.

crm configure show | grep cluster-recheck-interval
 cluster-recheck-interval=10s

The good side is that patching MAAS and tweaking keepalive settings worked.

I will upload more logs in the morning to illustrate how the patched version behaved.

Stuart Bishop (stub) wrote :

It is possible to set keepalives specifically on the PostgreSQL connection, per https://www.postgresql.org/docs/10/libpq-connect.html#LIBPQ-PARAMKEYWORDS (psycopg2 accepts these as keyword parameters, IIRC). It should be functionally the same as setting system-wide TCP settings, but might be easier to roll out as a fix (just a software update, no need to change system settings).
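A sketch of this suggestion, assuming psycopg2 (the libpq keywords are documented at the link above; the specific values here are illustrative, not from the MAAS patch):

```python
# Per-connection TCP keepalives; no system-wide sysctl changes needed.
KEEPALIVE_PARAMS = {
    "keepalives": 1,            # enable TCP keepalives on this socket
    "keepalives_idle": 15,      # seconds of idle before the first probe
    "keepalives_interval": 15,  # seconds between probes
    "keepalives_count": 2,      # unanswered probes before the conn is dead
}

# With a live server this would be passed straight through, e.g.:
#   import psycopg2
#   conn = psycopg2.connect(dbname="maasdb", host="10.100.0.254",
#                           user="maas", **KEEPALIVE_PARAMS)

# Worst case with these values: 15 + 2 * 15 = 45 seconds to notice a
# dead peer, regardless of kernel defaults.
assert KEEPALIVE_PARAMS["keepalives_idle"] + (
    KEEPALIVE_PARAMS["keepalives_count"]
    * KEEPALIVE_PARAMS["keepalives_interval"]) == 45
```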

Alexander Litvinov (alitvinov) wrote :

Tested in an identical scenario to #19 (maas 2.5.1 with patches).

Out of 4 failovers, 3 worked fine.
In the last case one of the regions didn't have its API sockets present.
Here is the regiond log from the node where that happened (unfortunately debug=true was not enabled):

http://paste.ubuntu.com/p/QncM3Ybkxc/

root@maas-rack3:/var/lib/heartbeat/trace_ra/dns# pgrep -af regiond
997 /bin/sh -c exec /usr/sbin/regiond 2>&1 | tee -a $LOGFILE
1003 /usr/bin/python3 /usr/sbin/regiond
1004 tee -a /var/log/maas/regiond.log
1775 /usr/bin/python3 /usr/sbin/regiond
1783 /usr/bin/python3 /usr/sbin/regiond
1784 /usr/bin/python3 /usr/sbin/regiond
1788 /usr/bin/python3 /usr/sbin/regiond

root@maas-rack3:/var/lib/heartbeat/trace_ra/dns# ss -tulpan 'sport = 5240'
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port
udp UNCONN 0 0 *:5240 *:* users:(("rackd",pid=1066,fd=20))

Changed in maas:
status: New → In Progress
Dmitrii Shcherbakov (dmitriis) wrote :

Reverted to 2.5.0 and tried the patch with that:

https://private-fileshare.canonical.com/~dima/maas-dumps/2019-02-28-maas-vhost1-2-3-etc-var-log.tar.gz

https://pastebin.canonical.com/p/VQ55DTSTYt/

So far so good.

I went through the failover scenario several times. Logs after 3 or so consecutive successful failovers (no manual or automatic maas service restarts):

https://private-fileshare.canonical.com/~dima/maas-dumps/2019-02-28-2-maas-vhost1-2-3-etc-var-log.tar.gz

Debug logging was enabled for both regiond and rackd so you might find useful info on what was logged during failovers.

Alberto Donato (ack) wrote :

Dmitrii,

I revised the previous patch to also set keepalives on the connection, as suggested by Stuart.
Could you please try the diff from https://code.launchpad.net/~ack/maas/+git/maas/+merge/363768 and see if it improves the situation?

Vern Hart (vhart) wrote :

The patch, even with the keepalive settings, was not working for me on-site at the customer, so I added crm configuration to restart maas-regiond whenever pgsql gets promoted to master.

  primitive res_maas-regiond systemd:maas-regiond \
    op start interval=0 timeout=120s \
    op monitor interval=10s timeout=120s \
    op stop interval=0 timeout=120s

  clone cl_maas-regiond res_maas-regiond

  order ord_maas_restart Mandatory: \
    ms_pgsql:promote \
    cl_maas-regiond:start \
    symmetrical=false

This resource is cloned to all nodes and then tied to the promotion of any ms_pgsql node. If a node in ms_pgsql is promoted, it will then restart maas-regiond (via systemd) on all nodes.

I feel this is solving the problem with a hammer but it works.

Blake Rouse (blake-rouse) wrote :

Can you describe what you mean by "not working"? Is the code erroring? Is DNS still missing? How long did you wait?

1. Can you confirm that you are running with the patch?
2. You should get a log entry saying DNS was reloaded. Can you check the logs to see if that occurs when the fail-over happens?

Dmitrii Shcherbakov (dmitriis) wrote :

Alberto,

It seems that with the patch from #24 (applied on top of 2.5.0) your settings get applied to sockets created by regiond and override the system ones as intended:

+ 'keepalives': 1,
+ 'keepalives_idle': 15,
+ 'keepalives_interval': 15,
+ 'keepalives_count': 2

I built knetstat with this patch https://github.com/veithen/knetstat/pull/17 to test:

sysctl -a --pattern keepalive
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 200

sudo cat /proc/net/tcpstat | grep 5432 | tail -n2
     0 0 100.64.0.3:56968 100.64.0.254:5432 ESTB SO_REUSEADDR=0,SO_REUSEPORT=0,SO_KEEPALIVE=1,TCP_KEEPIDLE=15,TCP_KEEPCNT=2,TCP_KEEPINTVL=15,TCP_NODELAY=1,TCP_DEFER_ACCEPT=0
     0 0 100.64.0.3:56870 100.64.0.254:5432 ESTB SO_REUSEADDR=0,SO_REUSEPORT=0,SO_KEEPALIVE=1,TCP_KEEPIDLE=15,TCP_KEEPCNT=2,TCP_KEEPINTVL=15,TCP_NODELAY=1,TCP_DEFER_ACCEPT=0
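For reference, the four values in that diff map to the standard Linux TCP keepalive socket options, which is what libpq applies under the hood for its keepalives* connection parameters. A minimal stdlib sketch (illustrative only, not MAAS code) that sets and reads back the same values knetstat reports:

```python
import socket

# Apply the same keepalive parameters as the patch to a plain TCP socket.
# TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT are Linux-specific constants.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # keepalives=1
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 15)   # keepalives_idle
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 15)  # keepalives_interval
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 2)     # keepalives_count

# Reading the options back gives the values knetstat shows in the
# output above (TCP_KEEPIDLE=15, TCP_KEEPCNT=2, TCP_KEEPINTVL=15).
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))
```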

I also tested the DB failover scenario a few times - it seems to work reliably with our DNS notification use-case.

What I would advise considering:

1) making the keepalive parameters tunable via MAAS settings (for region and rack controllers), so that more aggressive keepalives can be applied at the socket level if needed;

2) checking whether similar patching is needed for rack controllers (I'm not sure whether they have listeners like regiond does).

Alberto Donato (ack) wrote :

Dmitrii,

Thanks for testing it. I'll make the parameters configurable, keeping the current values as defaults.
As for 2), no change should be needed in the rack controllers, as they don't have database connections.

Dmitrii Shcherbakov (dmitriis) wrote :

Another update: I managed to reproduce the issue with missing regiond sockets.

See maas-vhost2 logs:
https://private-fileshare.canonical.com/~dima/maas-dumps/2019-03-01-maas-vhost1-2-3-etc-var-log.tar.gz

# no listening sockets
ubuntu@maas-vhost2:~$ sudo ss -tlpna 'sport = 5240'
State Recv-Q Send-Q Local Address:Port Peer Address:Port

# however, the processes are running
ubuntu@maas-vhost2:~$ pgrep -af regiond
1002 /bin/sh -c exec /usr/sbin/regiond 2>&1 | tee -a $LOGFILE
1004 /usr/bin/python3 /usr/sbin/regiond
1005 tee -a /var/log/maas/regiond.log
1967 /usr/bin/python3 /usr/sbin/regiond
1969 /usr/bin/python3 /usr/sbin/regiond
1972 /usr/bin/python3 /usr/sbin/regiond
1973 /usr/bin/python3 /usr/sbin/regiond

The zone file definitely has the right content, so the node is receiving DNS updates; however, no API sockets are listening on port 5240:

ubuntu@maas-vhost2:~$ cat /etc/bind/maas/zone.test
; Zone file modified: 2019-02-28 23:26:43.245638.
$TTL 30
#...
@ 30 IN NS maas.
maas-region 0 IN A 10.100.1.2

ubuntu@maas-vhost1:~$ cat /etc/bind/maas/zone.test
; Zone file modified: 2019-02-28 23:26:32.619195.
$TTL 30
#...
@ 30 IN NS maas.
maas-region 0 IN A 10.100.1.2
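The dns-ha start condition described in the bug report boils down to polling the resolver until the record matches the desired address, which only succeeds once zone files like the ones above are in sync across the bind servers. A rough sketch of that loop (names and timings are illustrative, not the actual resource agent code):

```python
import socket
import time

def record_matches(fqdn, expected_ip, attempts=10, delay=3.0):
    """Poll the system resolver until fqdn resolves to expected_ip."""
    for _ in range(attempts):
        try:
            if socket.gethostbyname(fqdn) == expected_ip:
                return True
        except socket.gaierror:
            pass  # NXDOMAIN, or no DNS server reachable yet
        time.sleep(delay)
    return False
```

If MAAS returns 200 before all bind servers are reloaded, this kind of loop can keep failing (or flapping) even though the API call "succeeded", which is exactly the behaviour reported here.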

Since our resource agent uses http://localhost:5240/MAAS for DNS update API calls, the relevant timestamps can be seen in the "connection refused" errors it reports:

root@maas-vhost2:~# grep -B1 -A1 -a refused /var/log/pacemaker.log
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: notice: operation_finished: res_maas_region_hostname_start_0:2891:stderr [ sock.connect(sa) ]
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: notice: operation_finished: res_maas_region_hostname_start_0:2891:stderr [ ConnectionRefusedError: [Errno 111] Connection refused ]
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: notice: operation_finished: res_maas_region_hostname_start_0:2891:stderr [ ]
--
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: notice: operation_finished: res_maas_region_hostname_start_0:2891:stderr [ raise URLError(err) ]
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: notice: operation_finished: res_maas_region_hostname_start_0:2891:stderr [ urllib.error.URLError: <urlopen error [Errno 111] Connection refused> ]
Feb 28 23:26:30 [1401] maas-vhost2 lrmd: info: log_finished: finished - rsc:res_maas_region_hostname action:start call_id:25 pid:2891 exit-code:1 exec-time:9651ms queue-time:0ms

The sockets may have been gone even before that, since I can see "request to http://127.0.0.1:5240/MAAS/metadata/2012-03-01/ failed" messages earlier in regiond.log.
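For what it's worth, the agent's failure mode above (ConnectionRefusedError surfacing as urllib.error.URLError) can be handled with a retry loop like the following sketch; the endpoint, attempt count, and delay are illustrative, not the actual agent code:

```python
import time
import urllib.error
import urllib.request

def wait_for_api(url, attempts=5, delay=2.0):
    """Return True once the endpoint answers, False if it never does."""
    for _ in range(attempts):
        try:
            urllib.request.urlopen(url, timeout=5)
            return True
        except urllib.error.URLError:
            # Connection refused: regiond may still be (re)binding
            # its API sockets after a restart or DB failover.
            time.sleep(delay)
    return False
```

Of course, retries only help if the sockets eventually come back; in the case above they never did, so the start operation failed after its timeout.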

Changed in maas:
milestone: none → next
status: In Progress → Fix Committed
Dmitrii Shcherbakov (dmitriis) wrote :

Has anybody looked at the logs mentioned in https://bugs.launchpad.net/maas/+bug/1817484/comments/29 with regard to the API sockets disappearing?

Blake Rouse (blake-rouse) wrote :

@dmitriis

Please file a separate bug, as this is a different issue than the one this bug is highlighting. It is not a side-effect of the change; the change just re-writes the zone files on re-connect. In no way does it affect the listening socket of regiond.

Are you seeing any stack traces in the regiond logs? Please add those to the new bug.

Vern Hart (vhart) wrote :

In response to #26 from Blake Rouse: when I say it was not working, at first the dnsresource update was getting 500 server errors (as expected), but when it succeeded, the new IP was set in the postgres DB while the zone files on each node still reflected the old IP.

I confirmed with Dima that the md5sums of the patched files matched his.

Since this is an environment where I don't have hands on keyboard, digging into the log files is a little challenging. Also, I left the customer location on Thursday, so I needed to have a solution in place. That is why I spent the remainder of my on-site time implementing roaksoax's suggestion of restarting maas-regiond whenever the PostgreSQL master changes.

Dmitrii Shcherbakov (dmitriis) wrote :

Blake,

Created a separate bug for the API sockets case https://bugs.launchpad.net/maas/+bug/1818703
