Comment 0 for bug 1817484

Dmitrii Shcherbakov (dmitriis) wrote : [2.5] MAAS does not guarantee propagation of DNS record updates and returns 200 prematurely

MAAS currently returns 200 for DNS record update (PUT) requests even if the record has not yet been propagated to all bind servers. In short: clients get 200 but, under certain conditions, /etc/bind/maas/zone.<domain> still contains the old contents.

ii maas-region-api 2.5.0-7442-gdf68e30a5-0ubuntu1~18.04.1 all Region controller API service for MAAS

This is problematic for DNS-HA-type solutions, which rely on MAAS to update a DNS record across all bind9 servers it manages. If they receive 200 but MAAS fails to actually update the database and the bind server configuration, the clustering software cannot proceed properly.

In particular, when a MAAS node hosting the DB master dies and the database fails over to a slave, MAAS seems to fail to ensure that the record is updated in the DB AND that all bind servers are reloaded. The client then assumes it can resolve the record. In our case the client is a pacemaker resource agent that resolves the record in a loop and compares the result with the desired value; a match is the start condition of the resource.

In our scenario, a DNS-HA record resource agent is started on one of 3 MAAS hosts (2 of the 3 nodes have a master/slave postgres setup and the third is a standby db node). systemd-resolved on all 3 hosts points to the 3 MAAS servers in the deployment, so they are tried in turn until one of them returns either some value or NXDOMAIN:

root@maas-vhost3:~# systemd-resolve --status | grep 'DNS Servers' -A3
         DNS Servers: 10.100.1.2
                      10.100.2.2
                      10.100.3.2
          DNS Domain: maas
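
With three servers tried in turn, a lookup through systemd-resolved does not show which individual bind instance still serves the old value. A minimal sketch of asking each server directly is below. The helper names `dig_a_records` and `stale_servers` are hypothetical and not part of MAAS or the resource agent, and it assumes `dig` is installed and Python 3.7+:

```python
import subprocess


def dig_a_records(server, fqdn):
    """Ask one DNS server directly for the A records of fqdn via dig."""
    out = subprocess.run(
        ["dig", "+short", "@" + server, fqdn, "A"],
        capture_output=True, text=True, timeout=10, check=False,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def stale_servers(servers, fqdn, expected_ip, query=dig_a_records):
    """Return the servers whose answer does not include expected_ip.

    `query` is injectable so the check can be exercised without a
    real DNS server.
    """
    return [s for s in servers if expected_ip not in query(s, fqdn)]
```

For example, `stale_servers(["10.100.1.2", "10.100.2.2", "10.100.3.2"], "maas-region.test", "10.100.3.2")` would list the servers that have not picked up the update.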

The resource agent sets up a "maas-region.test" record (the --fqdn parameter of the resource agent) that points to one of the MAAS nodes, chosen by Pacemaker and the resource agent logic.

The crm resource trace shows that the resource agent updates the DNS record from 10.100.2.2 to 10.100.3.2 and gets 200 as a response. It then tries to resolve the record and checks whether the result matches 10.100.3.2. It never does, because 10.100.2.2 is still returned. Eventually the resource fails to start due to a timeout.
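
The update-then-verify behaviour described above can be sketched roughly as follows. This is not the actual resource agent code: `wait_for_record` and its parameters are hypothetical, and the resolver, sleep and clock are injectable only so that the loop can be tested without real DNS:

```python
import socket
import time


def wait_for_record(fqdn, expected_ip, timeout=60.0, interval=2.0,
                    resolve=socket.gethostbyname, sleep=time.sleep,
                    clock=time.monotonic):
    """Poll until fqdn resolves to expected_ip or the timeout expires.

    Returns True on a match, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        try:
            if resolve(fqdn) == expected_ip:
                return True
        except OSError:
            # NXDOMAIN or a resolver error: treat as "not there yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

If the zone file is never rewritten, a loop of this kind keeps seeing the old address and can only end in a timeout, which matches the failed start above.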

/var/lib/heartbeat/trace_ra/dns/res_maas_region_hostname.start.2019-02-24.13:35:29:
https://paste.ubuntu.com/p/vkwTHkWPw2/
https://paste.ubuntu.com/p/pTjS5bvwSR/ (crm status after timeout)

2019-02-24 13:35:31 [INFO] Update the dnsresource with IP: 10.100.3.2 # <--- this is when the resource agent updates the record

As for MAAS logs for the same time period:

2019-02-24 13:35:29 maasserver: [error] ################################ Exception: server closed the connection unexpectedly

2019-02-24 13:35:29 regiond: [info] 10.100.3.2 GET /MAAS/rpc/ HTTP/1.1 --> 500 INTERNAL_SERVER_ERROR (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)

2019-02-24 13:35:31 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)

2019-02-24 13:35:31 provisioningserver.rpc.common: [critical] Unhandled failure dispatching AMP command. This is probably a bug. Please ensure that this error is handled within application code or declared in the signature of the b'GetControllerType' command. [maas-vhost3:pid=32363:cmd=GetControllerType:ask=44cc]

# <------- PUT request for the update of the record to 10.100.3.2 ---------->
2019-02-24 13:35:32 regiond: [info] 127.0.0.1 PUT /MAAS/api/2.0/dnsresources/1/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)

2019-02-24 13:35:47 provisioningserver.rpc.common: [critical] Unhandled failure dispatching AMP command. This is probably a bug. Please ensure that this error is handled within application code or declared in the signature of the b'GetTimeConfiguration' command. [maas-vhost3:pid=32366:cmd=GetTimeConfiguration:ask=5e4]

# <------- A bunch of get requests by the resource agent: ------------->

2019-02-24 13:36:16 regiond: [info] 10.10.101.75 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2019-02-24 13:37:12 maasserver.regiondservices.active_discovery: [info] Active network discovery: Active scanning is not enabled on any subnet. Skipping periodic scan.
2019-02-24 13:37:29 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
2019-02-24 13:39:29 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
2019-02-24 13:42:12 maasserver.regiondservices.active_discovery: [info] Active network discovery: Active scanning is not enabled on any subnet. Skipping periodic scan.
2019-02-24 13:44:23 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/users/?op=whoami HTTP/1.1 --> 200 OK (referrer: -; agent: Python-httplib2/0.9.2 (gzip))
2019-02-24 13:44:51 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-httplib2/0.9.2 (gzip))
2019-02-24 13:45:30 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)

I would expect to either see a failure or something like this:

2019-02-24 xx:xx:xx maasserver.region_controller: [info] Reloaded DNS configuration:
  * ip 10.100.3.2 linked to resource maas-region on zone test
  * ip 10.100.2.2 unlinked from resource maas-region on zone test

But there is nothing like this, and a restart of all regionds is required to make it work.

As far as I can tell, MAAS processes DNS updates asynchronously with respect to the DNS record change in the database, so the API response does not wait for the zone files to be rewritten.

src/maasserver/region_controller.py|101| self.processing = LoopingCall(self.process)

src/maasserver/region_controller.py
    def process(self):
        """Process the DNS and/or proxy update."""
        # ...
        if self.needsDNSUpdate:
            self.needsDNSUpdate = False
            d = deferToDatabase(transactional(dns_update_all_zones))
            d.addCallback(self._checkSerial)
            d.addCallback(self._logDNSReload)
            d.addErrback(_onFailureRetry, 'needsDNSUpdate')
            d.addErrback(
                log.err,
                "Failed configuring DNS.")

I am quite reliably getting to a point where an entry is updated in the DB but none of the region controllers that remain alive update the zone file.
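
The gap between the 200 response and the zone reload can be illustrated with a toy model. This is not MAAS code: `RegionControllerModel`, `handle_put` and `reload_zones` are hypothetical names, and the model only assumes what the snippet above suggests, namely that a periodic `process()` acts on a `needsDNSUpdate`-style flag:

```python
class RegionControllerModel:
    """Toy model of a flag-driven periodic DNS update (not MAAS code)."""

    def __init__(self, reload_zones):
        self.needs_dns_update = False
        self.reload_zones = reload_zones  # hypothetical callable
        self.db_record = None    # what the database holds
        self.zone_record = None  # what the bind zone file holds

    def handle_put(self, ip):
        # API handler: write to the "database", mark DNS dirty, return 200.
        # The zone file is not touched here.
        self.db_record = ip
        self.needs_dns_update = True
        return 200

    def process(self):
        # Periodic tick, analogous to the LoopingCall target.
        if not self.needs_dns_update:
            return
        self.needs_dns_update = False
        self.zone_record = self.reload_zones(self.db_record)
```

In this model the client sees 200 while `zone_record` still holds the old address. If `process()` never runs, or the flag is never set on the surviving region controllers, the zone stays stale indefinitely, which is the state I end up in.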