DNS updates are consumed concurrently, leading to an incorrect nsupdate payload

Bug #2045228 reported by Jacopo Rota
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
High
Jacopo Rota
3.3
Fix Released
High
Jacopo Rota
3.4
Fix Released
High
Jacopo Rota
3.5
Fix Released
High
Jacopo Rota

Bug Description

SOLQA run https://solutions.qa.canonical.com/testruns/f94cb5ae-652f-455d-9b3f-5b15228b5db9.

The 3 regions
10.246.64.5
10.246.64.6
10.246.64.7
Have a different record set for the zone `zone.64.246.10.in-addr.arpa`:
10.246.64.5 (looks good):
212 PTR eth0.meowth.silo2.lab0.solutionsqa.
      PTR meowth.silo2.lab0.solutionsqa.
213 PTR eth0.duosion.silo2.lab0.solutionsqa.
      PTR duosion.silo2.lab0.solutionsqa.
214 PTR eth0.mudkip.silo2.lab0.solutionsqa.
      PTR mudkip.silo2.lab0.solutionsqa.
10.246.64.6 (NOT good):
211 PTR eth0.taillow.silo2.lab0.solutionsqa.
      PTR taillow.silo2.lab0.solutionsqa.
212 PTR eth0.meowth.silo2.lab0.solutionsqa.
      PTR meowth.silo2.lab0.solutionsqa.
214 PTR eth0.mudkip.silo2.lab0.solutionsqa.
      PTR mudkip.silo2.lab0.solutionsqa.
10.246.64.7 (looks good):
212 PTR eth0.meowth.silo2.lab0.solutionsqa.
      PTR meowth.silo2.lab0.solutionsqa.
213 PTR eth0.duosion.silo2.lab0.solutionsqa.
      PTR duosion.silo2.lab0.solutionsqa.
214 PTR eth0.mudkip.silo2.lab0.solutionsqa.
      PTR mudkip.silo2.lab0.solutionsqa.

All of 3 send the same nsupdate payload, but with a different order of records.
The 10.246.64.5 sends

server localhost
zone 64.246.10.in-addr.arpa
...
update delete 213.64.246.10.in-addr.arpa. PTR duosion.silo2.lab0.solutionsqa
update delete 213.64.246.10.in-addr.arpa. PTR eth0.duosion.silo2.lab0.solutionsqa
update add 213.64.246.10.in-addr.arpa. 30 PTR duosion.silo2.lab0.solutionsqa
update add 213.64.246.10.in-addr.arpa. 30 PTR eth0.duosion.silo2.lab0.solutionsqa
...

and looking at the bind logs, the records are first deleted and then added

Nov 25 20:24:17 anahuac named[1232742]: client @0x7fa57002dab8 ::1#36021/key maas: updating zone '64.246.10.in-addr.arpa/IN': deleting an RR at 213.64.246.10.in-addr.arpa PTR
Nov 25 20:24:17 anahuac named[1232742]: client @0x7fa57002dab8 ::1#36021/key maas: updating zone '64.246.10.in-addr.arpa/IN': deleting an RR at 213.64.246.10.in-addr.arpa PTR
Nov 25 20:24:17 anahuac named[1232742]: client @0x7fa57002dab8 ::1#36021/key maas: updating zone '64.246.10.in-addr.arpa/IN': adding an RR at '213.64.246.10.in-addr.arpa' PTR duosion.silo2.lab0.solutionsqa.
Nov 25 20:24:17 anahuac named[1232742]: client @0x7fa57002dab8 ::1#36021/key maas: updating zone '64.246.10.in-addr.arpa/IN': adding an RR at '213.64.246.10.in-addr.arpa' PTR eth0.duosion.silo2.lab0.solutionsqa.
so this region looks good and it's actually ok.

The 10.246.64.6, the faulty one, sends

update add 213.64.246.10.in-addr.arpa. 30 PTR duosion.silo2.lab0.solutionsqa
update add 213.64.246.10.in-addr.arpa. 30 PTR eth0.duosion.silo2.lab0.solutionsqa
update delete 213.64.246.10.in-addr.arpa. PTR duosion.silo2.lab0.solutionsqa
update delete 213.64.246.10.in-addr.arpa. PTR eth0.duosion.silo2.lab0.solutionsqa
i.e. first add and then delete. Looking at the bind logs,

Nov 25 20:24:18 noma named[2789281]: client @0x7f193801e098 ::1#35153/key maas: updating zone '64.246.10.in-addr.arpa/IN': adding an RR at '213.64.246.10.in-addr.arpa' PTR duosion.silo2.lab0.solutionsqa.
Nov 25 20:24:18 noma named[2789281]: client @0x7f193801e098 ::1#35153/key maas: updating zone '64.246.10.in-addr.arpa/IN': adding an RR at '213.64.246.10.in-addr.arpa' PTR eth0.duosion.silo2.lab0.solutionsqa.
Nov 25 20:24:18 noma named[2789281]: client @0x7f193801e098 ::1#35153/key maas: updating zone '64.246.10.in-addr.arpa/IN': deleting an RR at 213.64.246.10.in-addr.arpa PTR
Nov 25 20:24:18 noma named[2789281]: client @0x7f193801e098 ::1#35153/key maas: updating zone '64.246.10.in-addr.arpa/IN': deleting an RR at 213.64.246.10.in-addr.arpa PTR
it is adding and then deleting.

The region 10.246.64.7 is exactly doing the same as 10.246.64.5 and it works fine.

One possible issue is that in the RegionControllerService the queueDynamicDNSUpdate handler that is passed to the Postgres listener service is asynchronous, so it might end up being run concurrently with multiple updates, which might lead to race conditions updating the list of dynamic updates.
The handler should be a synchronous function.
By patching the function with

def queueDynamicDNSUpdate(self, channel, message):
   """
   Called when the `sys_dns_update` message is received
   and queues updates for existing domains
   """
   print("XXX: " + str(message))
   if message == "":
       return

   (new_updates, need_reload) = yield deferToDatabase(
       process_dns_update_notify, message
   )

   self._dns_requires_full_reload = (
       self._dns_requires_full_reload or need_reload
   )
   if self._dns_update_in_progress:
       self._queued_updates += new_updates
   else:
       self._dns_updates += new_updates
   import time
   print("XXX: " + str(message) + " SLEEPING")
   time.sleep(2)
   print("XXX: " + str(message) + " RETURNING")

As you can see the execution is interleaved

2023-11-29 15:26:27 stdout: [info] XXX: DELETE maas maas-edge A 10.0.1.18
2023-11-29 15:26:27 stdout: [info] XXX: DELETE maas eth0.maas-edge A 10.0.1.18
2023-11-29 15:26:27 stdout: [info] XXX: INSERT maas maas-edge A 0 10.0.1.18
2023-11-29 15:26:27 stdout: [info] XXX: INSERT maas eth0.maas-edge A 0 10.0.1.18
2023-11-29 15:26:27 stdout: [info] XXX: DELETE maas maas-edge A 10.0.2.254
2023-11-29 15:26:27 stdout: [info] XXX: DELETE maas eth1.maas-edge A 10.0.2.254
2023-11-29 15:26:27 stdout: [info] XXX: INSERT maas maas-edge A 0 10.0.2.254
2023-11-29 15:26:27 stdout: [info] XXX: INSERT maas eth1.maas-edge A 0 10.0.2.254
2023-11-29 15:26:27 stdout: [info] XXX: DELETE maas maas-edge A 10.115.195.1
2023-11-29 15:26:27 stdout: [info] XXX: DELETE maas lxdbr0.maas-edge A 10.115.195.1
2023-11-29 15:26:27 stdout: [info] XXX: INSERT maas maas-edge A 0 10.115.195.1
2023-11-29 15:26:27 stdout: [info] XXX: INSERT maas lxdbr0.maas-edge A 0 10.115.195.1
2023-11-29 15:26:27 stdout: [info] XXX: DELETE maas maas-edge A fd42:271e:c21e:36bb::1
2023-11-29 15:26:27 stdout: [info] XXX: DELETE maas lxdbr0.maas-edge A fd42:271e:c21e:36bb::1
2023-11-29 15:26:27 stdout: [info] XXX: INSERT maas maas-edge A 0 fd42:271e:c21e:36bb::1
2023-11-29 15:26:27 stdout: [info] XXX: INSERT maas lxdbr0.maas-edge A 0 fd42:271e:c21e:36bb::1
2023-11-29 15:26:27 stdout: [info] XXX: RELOAD
2023-11-29 15:26:27 stdout: [info] XXX: DELETE maas maas-edge A 10.0.2.254 SLEEPING
2023-11-29 15:26:29 stdout: [info] XXX: DELETE maas maas-edge A 10.0.2.254 RETURNING
2023-11-29 15:26:29 stdout: [info] XXX: DELETE maas eth0.maas-edge A 10.0.1.18 SLEEPING
2023-11-29 15:26:31 stdout: [info] XXX: DELETE maas eth0.maas-edge A 10.0.1.18 RETURNING
2023-11-29 15:26:31 stdout: [info] XXX: DELETE maas eth1.maas-edge A 10.0.2.254 SLEEPING
2023-11-29 15:26:33 stdout: [info] XXX: DELETE maas eth1.maas-edge A 10.0.2.254 RETURNING
2023-11-29 15:26:33 stdout: [info] XXX: DELETE maas maas-edge A 10.0.1.18 SLEEPING
2023-11-29 15:26:35 stdout: [info] XXX: DELETE maas maas-edge A 10.0.1.18 RETURNING
2023-11-29 15:26:35 stdout: [info] XXX: DELETE maas lxdbr0.maas-edge A 10.115.195.1 SLEEPING
2023-11-29 15:26:37 stdout: [info] XXX: DELETE maas lxdbr0.maas-edge A 10.115.195.1 RETURNING

Related branches

Revision history for this message
Jacopo Rota (r00ta) wrote :
Download full text (3.5 KiB)

According to the failed run with verbose logging here https://solutions.qa.canonical.com/testruns/312e80e0-b17a-466d-be30-e03788fbfd6c

we are still consuming the Postgres notifications concurrently

2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update e2abeef3-46e1-4d89-8725-6177713b4181] start processing message DELETE silo2.lab0.solutionsqa landscapeamqp-3-4 A 10.246.64.192
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update 3ef5ea26-48c6-460a-a765-4b01945b2321] start processing message DELETE silo2.lab0.solutionsqa ubuntu A 10.246.64.192
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update d606a8ad-6288-4b3c-af21-5cff388a9dbd] start processing message INSERT silo2.lab0.solutionsqa landscapeamqp-3-4 A 0 10.246.64.192
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update 6c557e22-b621-4031-baf1-d77136f3e94e] start processing message
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update 3ef5ea26-48c6-460a-a765-4b01945b2321] _dns_requires_full_reload = False, need_reload = False
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update 3ef5ea26-48c6-460a-a765-4b01945b2321] requires full reload False
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update 3ef5ea26-48c6-460a-a765-4b01945b2321] dns update is NOT in progress. The updates are [] and the new updates are [DynamicDNSUpdate(operation='DELETE', name='ubuntu.silo2.lab0.solutionsqa', zone='silo2.lab0.solutionsqa', rectype='A', rev_zone=None, ttl=None, subnet=None, answer='10.246.64.192', ip=None)]
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update e2abeef3-46e1-4d89-8725-6177713b4181] _dns_requires_full_reload = False, need_reload = False
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update e2abeef3-46e1-4d89-8725-6177713b4181] requires full reload False
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update e2abeef3-46e1-4d89-8725-6177713b4181] dns update is NOT in progress. The updates are [DynamicDNSUpdate(operation='DELETE', name='ubuntu.silo2.lab0.solutionsqa', zone='silo2.lab0.solutionsqa', rectype='A', rev_zone=None, ttl=None, subnet=None, answer='10.246.64.192', ip=None)] and the new updates are [DynamicDNSUpdate(operation='DELETE', name='landscapeamqp-3-4.silo2.lab0.solutionsqa', zone='silo2.lab0.solutionsqa', rectype='A', rev_zone=None, ttl=None, subnet=None, answer='10.246.64.192', ip=None)]
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update d606a8ad-6288-4b3c-af21-5cff388a9dbd] _dns_requires_full_reload = False, need_reload = False
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update d606a8ad-6288-4b3c-af21-5cff388a9dbd] requires full reload False
2023-12-07 20:03:26 maasserver.region_controller: [info] [Dynamic update d606a8ad-6288-4b3c-af21-5cff388a9dbd] dns update is NOT in progress. The updates are [DynamicDNSUpdate(operation='DELETE', name='ubuntu.silo2.lab0.solutionsqa', zone='silo2.lab0.solutionsqa', rectype='A', rev_zone=None, ttl=None, subnet=None, answer='10.246.64.192', ip=None), DynamicDNSUpdate(operation='DELETE', name='landscapeam...

Read more...

Revision history for this message
Jacopo Rota (r00ta) wrote :

I confirm that we are still not processing the postgres notifications in sequence. The reason is that the `asynchronous` wrapper we used in the first fix
```
+ @asynchronous(timeout=FOREVER)
     @inlineCallbacks
     def queueDynamicDNSUpdate(self, channel, message):
```

is not awaiting the result if the function is called from the reactor thread like in this case.

Revision history for this message
Jacopo Rota (r00ta) wrote :

Proposed a fix to offload the notifications to a DatabaseTaskService to ensure the processing is performed in the strict sequence

Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.