Comment 23 for bug 1710278

Sam Lee (sjwl) wrote :

OK - I was able to repro again, and this time with MAAS 2.6.

Here are the steps:

PREP WORK
1) Have 50 machines in Ready state, each with one enabled interface configured as 'Autoassign' on the default VLAN PXE subnet (auto assign so that every deploy/release causes MAAS to reload DNS)
2) Clear out any DNS entries in the PXE subnet (this forces nodes to send DNS queries to MAAS)
3) Settings -> Network Services -> DNS -> Upstream DNS -> enter a valid upstream DNS IP
4) Settings -> Network Services -> DNS -> DNSSEC -> Automatic (for some reason this breaks Upstream DNS)
5) Verify that Upstream DNS is broken:
a) Put one machine into Rescue Mode
b) ssh to the rescue machine
c) dig www.google.com
d) (dig should time out or fail)
e) MAAS -> Settings -> Network Services -> DNS -> DNSSEC -> Disable
f) dig www.google.com
g) (dig should succeed)
h) MAAS -> Settings -> Network Services -> DNS -> DNSSEC -> Automatic
i) Release Rescue machine
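
If dig is not handy on the rescue machine, the check in steps c) through g) can be roughly approximated with a small stdlib-only Python sketch. This is not part of repro.py; the function names, the fixed query ID and the 5 second timeout are my own assumptions. It sends one A-record query over UDP and treats a timeout, a non-zero RCODE (e.g. SERVFAIL) or an empty answer section as "broken".

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary fixed ID, only used to match the reply to the query


def build_query(name, query_id=QUERY_ID):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def resolves(name, server, port=53, timeout_s=5.0):
    """Return True if server answers with RCODE 0 and at least one answer record."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # includes socket timeouts
            return False
    if len(reply) < 12:
        return False
    reply_id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    rcode = flags & 0x000F
    return reply_id == QUERY_ID and rcode == 0 and ancount > 0
```

Calling `resolves("www.google.com", "<your-rack-controller-ip>")` from the rescue machine should then return False with DNSSEC set to Automatic and True after switching it to Disable, assuming the rack controller is the DNS server the node queries.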

REPRO
1) Run repro.py (attached; WARNING: this code will use all machines available to MAAS)
2) Wait up to 3 hours, checking whether bind9 is hung by regularly running `sudo rndc status` on MAAS
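
The periodic `sudo rndc status` check in step 2 could be automated with something like the sketch below. It is separate from repro.py, and the helper names, the 60 second polling interval and the 10 second per-call timeout are assumptions of mine. It treats either a timeout or a non-zero exit status as "not responding", so a stopped bind9 will also trip it, not only a hung one.

```python
import subprocess
import time


def command_responds(cmd, timeout_s=10.0):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def wait_for_hang(cmd, interval_s=60.0, max_wait_s=3 * 60 * 60, timeout_s=10.0):
    """Poll cmd until it stops responding.

    Returns the number of seconds elapsed when the first failure was seen,
    or None if cmd kept responding for max_wait_s seconds.
    """
    start = time.monotonic()
    while time.monotonic() - start < max_wait_s:
        if not command_responds(cmd, timeout_s):
            return time.monotonic() - start
        time.sleep(interval_s)
    return None
```

On the MAAS host this would be called as `wait_for_hang(["sudo", "rndc", "status"])`; running the script as root and dropping `sudo` from the command avoids a password prompt eating the timeout.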

MONITORING STEPS (optional)
(See DNS query activity)
In one ssh window to MAAS, run:
sudo tcpdump -i ens3 dst <your-rack-controller-ip> and dst port 53
(See DNS reloads, and why)
In another ssh window to MAAS, run:
sudo tail -f /var/log/maas/regiond.log | grep -A 3 Reloaded