MAAS currently returns 200 for DNS record update (PUT) requests even if the record has not yet been propagated to all bind servers. In short: under certain conditions clients get 200 but /etc/bind/maas/zone.<domain> still contains the old contents.
ii maas-region-api 2.5.0-7442-gdf68e30a5-0ubuntu1~18.04.1 all Region controller API service for MAAS
This is problematic for dns-ha type of solutions which rely on MAAS to be responsible for updating a DNS record across all bind9 servers managed by MAAS. If they receive 200 but MAAS fails to actually update the database and bind server configuration, the clustering software will not be able to proceed properly.
In particular, when a MAAS node hosting the DB master dies and the database fails over to a slave, MAAS seems to fail to ensure that the record is updated in the DB AND that all bind servers are reloaded. The client then assumes it can resolve the record. In our case the client is a pacemaker resource agent that resolves the record in a loop and compares the result to the desired value, which is a start condition of the resource.
In our scenario, a DNS-HA record resource agent is started on one of 3 MAAS hosts (2/3 nodes have a master/slave postgres setup and the third is a standby db node) and systemd-resolved on all 3 hosts is pointed to 3 MAAS servers in the deployment so they are tried in turn until somebody returns either some value or NXDOMAIN:
root@maas-vhost3:~# systemd-resolve --status | grep 'DNS Servers' -A3
DNS Servers: 10.100.1.2
             10.100.2.2
             10.100.3.2
DNS Domain: maas
The resource agent sets up a "maas-region.test" record (--fqdn parameter of a resource agent) that points to one of MAAS nodes which is decided by Pacemaker and the resource agent logic.
According to the crm resource trace, the resource agent updates the DNS record from 10.100.2.2 to 10.100.3.2, gets 200 as a response, and then repeatedly tries to resolve the record to check that it matches 10.100.3.2. It never does, because 10.100.2.2 is still returned. Eventually the resource fails to start due to a timeout.
/var/lib/heartbeat/trace_ra/dns/res_maas_region_hostname.start.2019-02-24.13:35:29:
https://paste.ubuntu.com/p/vkwTHkWPw2/
https://paste.ubuntu.com/p/pTjS5bvwSR/ (crm status after timeout)
2019-02-24 13:35:31 [INFO] Update the dnsresource with IP: 10.100.3.2 # <--- this is when the resource agent updates the record
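For reference, the resolve-and-compare loop described above can be sketched roughly as follows. This is not the actual resource agent code: the function name wait_for_record and its timeout, interval, resolve, sleep and clock parameters are assumptions made for illustration, and the lookup goes through the system resolver via socket.gethostbyname.

```python
import socket
import time


def wait_for_record(fqdn, expected_ip, timeout=60.0, interval=2.0,
                    resolve=socket.gethostbyname, sleep=time.sleep,
                    clock=time.monotonic):
    """Poll until fqdn resolves to expected_ip.

    Returns True on a match and False once the timeout has elapsed.
    The resolve, sleep and clock arguments are injectable (hypothetical
    hooks, not part of any real agent) so the loop can be exercised
    without real DNS.
    """
    deadline = clock() + timeout
    while True:
        try:
            if resolve(fqdn) == expected_ip:
                return True
        except OSError:
            # socket.gaierror (e.g. NXDOMAIN) is an OSError: keep polling.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

With a stale zone file the resolver keeps answering 10.100.2.2, so a call such as wait_for_record("maas-region.test", "10.100.3.2") returns False after the timeout, which corresponds to the failed resource start seen here.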
As for MAAS logs for the same time period:
2019-02-24 13:35:29 maasserver: [error] ################################ Exception: server closed the connection unexpectedly
2019-02-24 13:35:29 regiond: [info] 10.100.3.2 GET /MAAS/rpc/ HTTP/1.1 --> 500 INTERNAL_SERVER_ERROR (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2019-02-24 13:35:31 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
2019-02-24 13:35:31 provisioningserver.rpc.common: [critical] Unhandled failure dispatching AMP command. This is probably a bug. Please ensure that this error is handled within application code or declared in the signature of the b'GetControllerType' command. [maas-vhost3:pid=32363:cmd=GetControllerType:ask=44cc]
# <------- PUT request for the update of the record to 10.100.3.2 ---------->
2019-02-24 13:35:32 regiond: [info] 127.0.0.1 PUT /MAAS/api/2.0/dnsresources/1/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
2019-02-24 13:35:47 provisioningserver.rpc.common: [critical] Unhandled failure dispatching AMP command. This is probably a bug. Please ensure that this error is handled within application code or declared in the signature of the b'GetTimeConfiguration' command. [maas-vhost3:pid=32366:cmd=GetTimeConfiguration:ask=5e4]
# <------- A bunch of get requests by the resource agent: ------------->
2019-02-24 13:36:16 regiond: [info] 10.10.101.75 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2019-02-24 13:37:12 maasserver.regiondservices.active_discovery: [info] Active network discovery: Active scanning is not enabled on any subnet. Skipping periodic scan.
2019-02-24 13:37:29 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
2019-02-24 13:39:29 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
2019-02-24 13:42:12 maasserver.regiondservices.active_discovery: [info] Active network discovery: Active scanning is not enabled on any subnet. Skipping periodic scan.
2019-02-24 13:44:23 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/users/?op=whoami HTTP/1.1 --> 200 OK (referrer: -; agent: Python-httplib2/0.9.2 (gzip))
2019-02-24 13:44:51 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-httplib2/0.9.2 (gzip))
2019-02-24 13:45:30 regiond: [info] 127.0.0.1 GET /MAAS/api/2.0/dnsresources/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
I would expect to either see a failure or something like this:
2019-02-24 xx:xx:xx maasserver.region_controller: [info] Reloaded DNS configuration:
* ip 10.100.3.2 linked to resource maas-region on zone test
* ip 10.100.2.2 unlinked from resource maas-region on zone test
But there is nothing like this and to make it work a restart of all regionds is required.
As far as I can tell MAAS processes DNS updates asynchronously to the result of a DNS record change in the database.
src/maasserver/region_controller.py|101| self.processing = LoopingCall(self.process)
src/maasserver/region_controller.py
    def process(self):
        """Process the DNS and/or proxy update."""
        # ...
        if self.needsDNSUpdate:
            self.needsDNSUpdate = False
            d = deferToDatabase(transactional(dns_update_all_zones))
            d.addCallback(self._checkSerial)
            d.addCallback(self._logDNSReload)
            d.addErrback(_onFailureRetry, 'needsDNSUpdate')
            d.addErrback(
                log.err,
                "Failed configuring DNS.")
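To illustrate why a 200 response says nothing about the zone file in a design like the excerpt above, here is a deliberately simplified model. It is not MAAS code: the class DeferredDNSUpdater, its methods and the write_zone callable are hypothetical names, and the periodic LoopingCall is reduced to an explicit process() call.

```python
class DeferredDNSUpdater:
    """Toy model of a flag-driven DNS update processed on a later tick."""

    def __init__(self, write_zone):
        self.needs_dns_update = False
        # Hypothetical callable standing in for zone file regeneration.
        self._write_zone = write_zone

    def handle_put(self, db, name, ip):
        """Store the record and answer 200 without touching the zone."""
        db[name] = ip
        self.needs_dns_update = True
        return 200

    def process(self, db):
        """One tick of the periodic loop: write the zone if flagged."""
        if not self.needs_dns_update:
            return
        self.needs_dns_update = False
        try:
            self._write_zone(dict(db))
        except Exception:
            # Re-arm the flag so the next tick retries the write.
            self.needs_dns_update = True
```

In this model handle_put() returns 200 as soon as the database row changes, and the zone only catches up when process() later runs with the flag set. If that tick never happens, or the flag is never raised on the surviving controllers, the client sees exactly the situation reported here: a successful PUT followed by stale answers.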
I am quite reliably getting to a point where an entry is updated in the DB but none of the region controllers that remain alive update the zone file.