[2.5] MAAS does not guarantee DNS record updates propagation and returns 200 prematurely
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
MAAS | Fix Released | Critical | Alberto Donato | | |
Bug Description
MAAS currently returns 200 for DNS record update (PUT) requests even if the record has not yet been propagated to all bind servers. In short: clients get 200 but /etc/bind/
ii maas-region-api 2.5.0-7442-
This is problematic for DNS-HA solutions, which rely on MAAS to update a DNS record across all bind9 servers it manages. If they receive 200 but MAAS fails to update the database and the bind server configuration, the clustering software cannot proceed properly.
In particular, when a MAAS node hosting the DB master dies and the database fails over to a slave, MAAS seems to fail to ensure that the record is updated in the DB AND that all bind servers are reloaded. The client then assumes it can resolve the record. In our case the client is a pacemaker resource agent that resolves the record in a loop and compares the result to the desired value, which is a start condition of the resource.
In our scenario, a DNS-HA record resource agent is started on one of 3 MAAS hosts (2 of the 3 nodes have a master/slave postgres setup and the third is a standby db node). systemd-resolved on all 3 hosts points to the 3 MAAS servers in the deployment, so they are tried in turn until one of them returns either a value or NXDOMAIN:
root@maas-vhost3:~# systemd-resolve --status | grep 'DNS Servers' -A3
DNS Servers: 10.100.1.2
DNS Domain: maas
The resource agent sets up a "maas-region.test" record (the --fqdn parameter of the resource agent) that points to one of the MAAS nodes, chosen by Pacemaker and the resource agent logic.
The crm resource trace shows that the resource agent updates the DNS record from 10.100.2.2 to 10.100.3.2, gets 200 as a response, and then tries to resolve the record to check that it matches 10.100.3.2. It never does, because 10.100.2.2 is still returned. Eventually the resource fails to start due to a timeout.
/var/lib/
https:/
https:/
2019-02-24 13:35:31 [INFO] Update the dnsresource with IP: 10.100.3.2 # <--- this is when the resource agent updates the record
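The update-then-resolve loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual resource agent code: the function name `wait_for_record`, its parameters, and the injected `resolve(server, fqdn)` callable are all hypothetical. The idea is to ask every DNS server individually, so that a single stale bind instance is detected instead of being masked by whichever server answers first.

```python
import time


def wait_for_record(resolve, fqdn, expected_ip, servers,
                    timeout=60.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll each DNS server until all of them return expected_ip.

    `resolve(server, fqdn)` is a hypothetical callable supplied by the
    caller; it must query one specific server and return the answer
    (or None if there is no answer).

    Returns a dict mapping each stale server to the answer it last gave.
    An empty dict means every server returned expected_ip in time.
    """
    deadline = clock() + timeout
    while True:
        stale = {}
        for server in servers:
            answer = resolve(server, fqdn)
            if answer != expected_ip:
                stale[server] = answer
        # Stop on success, or once the deadline has passed.
        if not stale or clock() >= deadline:
            return stale
        sleep(interval)
```

In practice `resolve` would need to be backed by something that targets one server at a time (for example a `dig @server` invocation or a DNS library); that part is deliberately left out here. With the behaviour reported in this bug, the returned dict would still contain the old address 10.100.2.2 when the timeout expires.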
As for MAAS logs for the same time period:
2019-02-24 13:35:29 maasserver: [error] #######
2019-02-24 13:35:29 regiond: [info] 10.100.3.2 GET /MAAS/rpc/ HTTP/1.1 --> 500 INTERNAL_
2019-02-24 13:35:31 regiond: [info] 127.0.0.1 GET /MAAS/api/
2019-02-24 13:35:31 provisioningser
# <------- PUT request for the update of the record to 10.100.3.2 ---------->
2019-02-24 13:35:32 regiond: [info] 127.0.0.1 PUT /MAAS/api/
2019-02-24 13:35:47 provisioningser
# <------- A bunch of get requests by the resource agent: ------------->
2019-02-24 13:36:16 regiond: [info] 10.10.101.75 GET /MAAS/rpc/ HTTP/1.1 --> 200 OK (referrer: -; agent: provisioningser
2019-02-24 13:37:12 maasserver.
2019-02-24 13:37:29 regiond: [info] 127.0.0.1 GET /MAAS/api/
2019-02-24 13:39:29 regiond: [info] 127.0.0.1 GET /MAAS/api/
2019-02-24 13:42:12 maasserver.
2019-02-24 13:44:23 regiond: [info] 127.0.0.1 GET /MAAS/api/
2019-02-24 13:44:51 regiond: [info] 127.0.0.1 GET /MAAS/api/
2019-02-24 13:45:30 regiond: [info] 127.0.0.1 GET /MAAS/api/
I would expect to either see a failure or something like this:
2019-02-24 xx:xx:xx maasserver.
* ip 10.100.3.2 linked to resource maas-region on zone test
* ip 10.100.2.2 unlinked from resource maas-region on zone test
But there is nothing like this, and a restart of all regionds is required to make it work.
As far as I can tell, MAAS processes DNS updates asynchronously with respect to the DNS record change in the database.
src/maasserver/
src/maasserver/
def process(self):
"""Process the DNS and/or proxy update."""
# ...
if self.needsDNSUp
d = deferToDatabase
I can quite reliably reach a point where an entry is updated in the DB but none of the region controllers that remain alive updates the zone file (and no bind restarts are done either).
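The gap between the 200 response and the zone file update can be illustrated with a toy model. This is not MAAS code: the class `ToyRegion`, its attributes, and the `demo` helper are hypothetical, and asyncio merely stands in for deferred processing. The sketch only shows that when the database write and the DNS propagation are decoupled, the caller gets 200 before propagation has run, and gets the same 200 if propagation silently never happens.

```python
import asyncio


class ToyRegion:
    """Toy model: the DB write is synchronous, DNS propagation is deferred."""

    def __init__(self, propagate_ok=True):
        self.db = {}          # stands in for the database
        self.zone_file = {}   # stands in for the bind zone file
        self.propagate_ok = propagate_ok
        self.pending = []

    async def _propagate(self, fqdn):
        await asyncio.sleep(0)  # stands in for deferred work
        if not self.propagate_ok:
            return  # silent failure: nothing is reported to the client
        self.zone_file[fqdn] = self.db[fqdn]

    async def put_record(self, fqdn, ip):
        self.db[fqdn] = ip
        # Schedule propagation, but do not wait for it.
        self.pending.append(asyncio.create_task(self._propagate(fqdn)))
        return 200


async def demo(propagate_ok):
    region = ToyRegion(propagate_ok)
    region.db["maas-region.test"] = "10.100.2.2"
    region.zone_file["maas-region.test"] = "10.100.2.2"
    status = await region.put_record("maas-region.test", "10.100.3.2")
    at_response = region.zone_file["maas-region.test"]
    await asyncio.gather(*region.pending)
    eventually = region.zone_file["maas-region.test"]
    return status, at_response, eventually
```

Calling `asyncio.run(demo(False))` models the situation described in this report: the status is 200, yet the zone file still holds 10.100.2.2 both at response time and afterwards.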
Related branches
- Alberto Donato (community): Approve
-
Diff: 265 lines (+81/-36)
8 files modified
src/maasserver/config.py (+16/-0)
src/maasserver/djangosettings/settings.py (+6/-0)
src/maasserver/listener.py (+4/-0)
src/maasserver/management/commands/tests/test_config.py (+7/-2)
src/maasserver/region_controller.py (+9/-5)
src/maasserver/tests/test_config.py (+12/-1)
src/maasserver/tests/test_listener.py (+17/-0)
src/maasserver/tests/test_region_controller.py (+10/-28)
- MAAS Lander: Needs Fixing
- Blake Rouse (community): Approve
- Newell Jensen (community): Approve
-
Diff: 265 lines (+81/-36)
8 files modified
src/maasserver/config.py (+16/-0)
src/maasserver/djangosettings/settings.py (+6/-0)
src/maasserver/listener.py (+4/-0)
src/maasserver/management/commands/tests/test_config.py (+7/-2)
src/maasserver/region_controller.py (+9/-5)
src/maasserver/tests/test_config.py (+12/-1)
src/maasserver/tests/test_listener.py (+17/-0)
src/maasserver/tests/test_region_controller.py (+10/-28)
description: | updated |
Changed in maas: | |
status: | New → Incomplete |
Changed in maas: | |
importance: | Undecided → Critical |
assignee: | nobody → Alberto Donato (ack) |
summary: |
- [2.5] MAAS does not guarantee DNS dns record updates propagation and
- returns 200 prematurely
+ [2.5] MAAS does not guarantee DNS record updates propagation and returns
+ 200 prematurely |
Changed in maas: | |
status: | New → Incomplete |
Changed in maas: | |
status: | New → In Progress |
Changed in maas: | |
milestone: | none → next |
status: | In Progress → Fix Committed |
Changed in maas: | |
milestone: | next → 2.6.0 |
Changed in maas: | |
status: | Fix Committed → Fix Released |
Hi there,
Missing information:
1. Please attach maas.log and regiond.log
2. Attach syslog & bind journal
3. Enable debug logging and attach regiond.log
That said, the database executes triggers once new records are saved in the DB; those cause the regions to write the DNS config and restart bind. The issue could actually be one of:
1. Database connection issues
2. RPC connection issues
3. bind9 issues when restarting.