Machines fail to commission using the 3.0 snap due to a possible DNS issue

Bug #1929576 reported by Michael Skalka
Affects     Status         Importance   Assigned to            Milestone
MAAS        Fix Released   Critical     Christian Grabowski
MAAS 3.0    Fix Released   Critical     Björn Tillenius

Bug Description

As seen during this release test run: https://solutions.qa.canonical.com/testruns/testRun/c6b754fd-0c7d-45b2-9a0e-7e5a7e3ffe21

Maas logs: https://oil-jenkins.canonical.com/artifacts/c6b754fd-0c7d-45b2-9a0e-7e5a7e3ffe21/generated/generated/maas/logs-2021-05-25-14.06.37.tar

Machines are set to commission but never get recognized by maas:

2021-05-25-13:35:32 foundationcloudengine.layers.configuremaas DEBUG looking for nodes: ['quilava', 'armaldo', 'ralts', 'spearow', 'elgyem', 'beartic']
2021-05-25-13:35:32 root DEBUG [localhost]: maas root machines read
2021-05-25-13:35:43 foundationcloudengine.layers.configuremaas DEBUG Current states: {'quilava': 'unknown', 'armaldo': 'unknown', 'ralts': 'unknown', 'spearow': 'unknown', 'elgyem': 'unknown', 'beartic': 'unknown'}
...
...
2021-05-25-14:05:20 root DEBUG [localhost]: maas root machines read
2021-05-25-14:05:24 foundationcloudengine.layers.configuremaas DEBUG Current states: {'quilava': 'unknown', 'armaldo': 'unknown', 'ralts': 'unknown', 'spearow': 'unknown', 'elgyem': 'unknown', 'beartic': 'unknown'}

Looking into the MAAS regiond logs, the only thing that leaps out is regiond complaining about failing to reload DNS shortly before the nodes were set to commission:

2021-05-25 13:33:43 maasserver.region_controller: [critical] Failed configuring DNS; killing and restarting
        Traceback (most recent call last):
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
            self._runCallbacks()
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
            return _cancellableInlineCallbacks(gen)
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
            _inlineCallbacks(None, g, status)
        --- <exception caught here> ---
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            result = g.send(result)
          File "/snap/maas/14411/lib/python3.8/site-packages/maasserver/region_controller.py", line 222, in _checkSerial
            raise DNSReloadError(
        maasserver.region_controller.DNSReloadError: Failed to reload DNS; timeout or rdnc command failed.

Then it goes into a loop of failing and restarting:

2021-05-25 13:34:13 maasserver.region_controller: [critical] Failed to kill and restart DNS.
        Traceback (most recent call last):
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in errback
            self._startRunCallbacks(fail)
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
            self._runCallbacks()
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
            _inlineCallbacks(r, g, status)
        --- <exception caught here> ---
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/snap/maas/14411/lib/python3.8/site-packages/provisioningserver/utils/service_monitor.py", line 432, in killService
            state = yield self.ensureService(name)
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            result = g.send(result)
          File "/snap/maas/14411/lib/python3.8/site-packages/provisioningserver/utils/service_monitor.py", line 741, in _ensureService
            raise ServiceActionError(error_msg)
        provisioningserver.utils.service_monitor.ServiceActionError: Service 'bind9' failed to start. Its current state is 'dead' and 'Result: exit-code'.

2021-05-25 13:34:14 maasserver.region_controller: [critical] Failed configuring DNS; killing and restarting
        Traceback (most recent call last):
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
            self._runCallbacks()
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
            return _cancellableInlineCallbacks(gen)
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
            _inlineCallbacks(None, g, status)
        --- <exception caught here> ---
          File "/snap/maas/14411/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            result = g.send(result)
          File "/snap/maas/14411/lib/python3.8/site-packages/maasserver/region_controller.py", line 222, in _checkSerial
            raise DNSReloadError(
        maasserver.region_controller.DNSReloadError: Failed to reload DNS; timeout or rdnc command failed.

This persisted until the test run was killed by the 30-minute commissioning timeout.

From named.log:

25-May-2021 14:07:03.124 starting BIND 9.16.1-Ubuntu (Stable Release) <id:d497c32>
25-May-2021 14:07:03.124 running on Linux x86_64 5.4.0-56-generic #62-Ubuntu SMP Mon Nov 23 19:20:19 UTC 2020
25-May-2021 14:07:03.124 built with '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=/usr/include' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-silent-rules' '--libdir=/usr/lib/x86_64-linux-gnu' '--runstatedir=/run' '--disable-maintainer-mode' '--disable-dependency-tracking' '--libdir=/usr/lib/x86_64-linux-gnu' '--sysconfdir=/etc/bind' '--with-python=python3' '--localstatedir=/' '--enable-threads' '--enable-largefile' '--with-libtool' '--enable-shared' '--enable-static' '--with-gost=no' '--with-openssl=/usr' '--with-gssapi=/usr' '--with-libidn2' '--with-json-c' '--with-lmdb=/usr' '--with-gnu-ld' '--with-maxminddb' '--with-atf=no' '--enable-ipv6' '--enable-rrl' '--enable-filter-aaaa' '--disable-native-pkcs11' '--disable-isc-spnego' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/build/bind9-QF7jf9/bind9-9.16.1=. -fstack-protector-strong -Wformat -Werror=format-security -fno-strict-aliasing -fno-delete-null-pointer-checks -DNO_VERSION_DATE -DDIG_SIGCHASE' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2'
25-May-2021 14:07:03.124 running as: named -c /var/snap/maas/14411/bind/named.conf -g
25-May-2021 14:07:03.124 compiled by GCC 9.3.0
25-May-2021 14:07:03.124 compiled with OpenSSL version: OpenSSL 1.1.1f 31 Mar 2020
25-May-2021 14:07:03.124 linked to OpenSSL version: OpenSSL 1.1.1f 31 Mar 2020
25-May-2021 14:07:03.124 compiled with libxml2 version: 2.9.10
25-May-2021 14:07:03.124 linked to libxml2 version: 20910
25-May-2021 14:07:03.124 compiled with json-c version: 0.13.1
25-May-2021 14:07:03.124 linked to json-c version: 0.13.1
25-May-2021 14:07:03.124 compiled with zlib version: 1.2.11
25-May-2021 14:07:03.124 linked to zlib version: 1.2.11
25-May-2021 14:07:03.124 ----------------------------------------------------
25-May-2021 14:07:03.124 BIND 9 is maintained by Internet Systems Consortium,
25-May-2021 14:07:03.124 Inc. (ISC), a non-profit 501(c)(3) public-benefit
25-May-2021 14:07:03.124 corporation. Support and training for BIND 9 are
25-May-2021 14:07:03.124 available at https://www.isc.org/support
25-May-2021 14:07:03.124 ----------------------------------------------------
25-May-2021 14:07:03.124 found 20 CPUs, using 20 worker threads
25-May-2021 14:07:03.124 using 20 UDP listeners per interface
25-May-2021 14:07:03.136 using up to 21000 sockets
25-May-2021 14:07:03.140 loading configuration from '/var/snap/maas/14411/bind/named.conf'
25-May-2021 14:07:03.144 /var/snap/maas/current/bind/named.conf.maas:180: zone '37.168.192.in-addr.arpa': already exists previous definition: /var/snap/maas/current/bind/named.conf.maas:176
25-May-2021 14:07:03.144 loading configuration: failure
25-May-2021 14:07:03.144 exiting (due to fatal error)


description: updated
Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 3.0.0-rc2
tags: added: cdo-qa cdo-release-blocker foundations-engine
Revision history for this message
Björn Tillenius (bjornt) wrote :

I would suspect the issue is in ZoneGenerator._gen_reverse_zones(). If we grab the subnets and domains from the db dump, hopefully we can trigger the error (duplicate zones) in a unit test.

See also dns_update_all_zones(), which is the one that ultimately calls ZoneGenerator.as_list().
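
As an editorial sketch of the kind of check Björn describes (the 'domain' attribute and the way the zone list is obtained are assumptions, not the actual MAAS API), one could assert that no reverse zone name appears twice in the generated list:

    from collections import Counter

    def find_duplicate_zone_names(zones):
        # 'domain' is assumed to hold the zone name, e.g. '37.168.192.in-addr.arpa'.
        counts = Counter(zone.domain for zone in zones)
        return sorted(name for name, count in counts.items() if count > 1)

    def assert_no_duplicate_zones(zones):
        duplicates = find_duplicate_zone_names(zones)
        assert not duplicates, f"duplicate reverse zones generated: {duplicates}"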

Revision history for this message
Björn Tillenius (bjornt) wrote :

I didn't find a dump of the database in the linked failed run. Could you please provide a link to a failed run that has a dump of the database? It's going to be hard to reproduce this locally without it.

Changed in maas:
status: Triaged → Incomplete
Changed in maas:
assignee: nobody → Björn Tillenius (bjornt)
Revision history for this message
Björn Tillenius (bjornt) wrote :

I managed to reproduce the bug with this test:

  https://pastebin.canonical.com/p/M4vG869BYm/

It currently fails, since the generated output has two 37.168.192.in-addr.arpa zones.

It could be that one of the existing tests triggers this bug as well, but they all use set() to compare the zones, which means they won't catch duplicate zones.

I'm still not sure why there is a 37.168.192.in-addr.arpa zone at all, though, considering there's no such subnet.
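
For illustration, a tiny example of why a set()-based comparison masks the duplicate (zone names taken from the log above, list contents hypothetical):

    generated = [
        "0.168.192.in-addr.arpa",
        "37.168.192.in-addr.arpa",
        "37.168.192.in-addr.arpa",   # duplicate produced by the bug
    ]
    expected = [
        "0.168.192.in-addr.arpa",
        "37.168.192.in-addr.arpa",
    ]

    assert set(generated) == set(expected)        # passes: the duplicate is masked
    assert sorted(generated) != sorted(expected)  # a plain list comparison would catch it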

Changed in maas:
status: Incomplete → In Progress
Revision history for this message
Björn Tillenius (bjornt) wrote :

I've spotted one bug in ZoneGenerator._gen_reverse_zones().

At the end, there's this code:

        # Now provide any remaining rfc2317 glue networks.
        for network, ranges in rfc2317_glue.items():
            import pdb; pdb.set_trace()
            yield DNSReverseZoneConfig(
                ns_host_name,
                serial=serial,
                default_ttl=default_ttl,
                network=network,
                ns_host_name=ns_host_name,
                rfc2317_ranges=ranges,
                exclude=set(
                    [IPNetwork(s.cidr) for s in subnets if s is not subnet]
                ),
            )

'subnet' there is left over from the previous for loop, so it's always the same value. I think we should check whether s is a subnet of the glue network instead, or something like that.
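
A minimal sketch of that direction (illustrative only, not the actual fix; the helper name is made up, and netaddr is used because the surrounding code already uses it): compute the exclude set from the glue network itself rather than from the stale 'subnet' variable.

    from netaddr import IPNetwork

    def rfc2317_exclude_set(network, subnets):
        # Exclude the CIDRs of subnets that do not fall inside this glue network.
        glue = IPNetwork(str(network))
        return {
            IPNetwork(s.cidr)
            for s in subnets
            if IPNetwork(s.cidr) not in glue
        }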

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

Found the issue. When DNSReverseZoneConfig generates the reverse zones, the IP address being iterated can be incremented past the upper bound of the subnet: if the iteration steps over an overlapping subnet, it can land on an address beyond the last IP of the actual subnet before the while loop exits. I'll put up an MP with a bounds check that returns early in that case, along with the fix for the bug Björn found.

Revision history for this message
Christian Grabowski (cgrabowski) wrote :

To clarify: that out-of-bounds IP address has no corresponding subnet to check against, so there is no way to tell whether it will generate a duplicate zone.
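
A rough sketch of the bounds check being described (names and structure are assumptions, not the actual merge proposal): stop iterating as soon as an increment leaves the subnet, so no entry is generated for an address that has no corresponding subnet.

    from netaddr import IPAddress, IPNetwork

    def iter_reverse_entries(subnet_cidr, step=1):
        # Walk the addresses of a subnet, returning early if an increment
        # (e.g. one that jumps over an overlapping subnet) would leave it.
        subnet = IPNetwork(subnet_cidr)
        ip = IPAddress(subnet.first)
        while True:
            yield ip
            ip = ip + step
            if ip not in subnet:   # bounds check: don't emit addresses past the subnet
                return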

Changed in maas:
assignee: Björn Tillenius (bjornt) → Christian Grabowski (cgrabowski)
Revision history for this message
Björn Tillenius (bjornt) wrote :

BTW, Michael, I assume you recently changed your network topology? The affected code is actually from 2016, so I don't think this is a regression. It's triggered by having subnets that are smaller than /24, together with some race condition.

We'll fix this for rc2, but we'll probably have to backport it to 2.9, and possibly 2.8, as well.

Changed in maas:
milestone: 3.0.0-rc2 → none
Revision history for this message
Michael Skalka (mskalka) wrote :

@bjorn

We have not changed the topology of our lab network for ~2 years actually, though we do have a /25 and a /26 in use here. That said, it's interesting that we had not hit this before now.
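
For illustration (the addresses below are hypothetical; the lab's real CIDRs are not in this report), a /25 and a /26 carved out of the same /24 both map to a single in-addr.arpa zone, which is exactly the situation the rfc2317 glue code has to handle:

    import ipaddress

    for cidr in ("192.168.37.0/25", "192.168.37.128/26"):
        net = ipaddress.ip_network(cidr)
        parent = net.supernet(new_prefix=24)     # the enclosing /24 decides the zone name
        octets = str(parent.network_address).split(".")[:3]
        zone = ".".join(reversed(octets)) + ".in-addr.arpa"
        print(cidr, "->", zone)                  # both print 37.168.192.in-addr.arpa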

Changed in maas:
milestone: none → next
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
milestone: next → none
Revision history for this message
David A. Desrosiers (setuid) wrote :

I can confirm this is still affecting 2.9.2, 3.0, and 3.1 under Focal as recently as today, installed from the PPA rather than the snap (I will never run MAAS from a snap).

I have 60 machines that I create and import into MAAS as a single virsh chassis, then commission all at once. On every release newer than 2.8.7 on Bionic, commissioning of these nodes fails to complete, and I have to do them in small batches of 12 or fewer; if I select more than 12, every node beyond the twelfth fails.

I see several thousand of these messages within about 60 seconds of all machines being powered on for commissioning, as rndc is (unsuccessfully) restarted roughly 500 times per minute.

==> /var/log/maas/regiond.log <==
2022-03-07 20:48:45 maasserver.region_controller: [critical] Failed to kill and restart DNS.
        Traceback (most recent call last):
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in errback
            self._startRunCallbacks(fail)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
            self._runCallbacks()
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
            _inlineCallbacks(r, g, status)
        --- <exception caught here> ---
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python3/dist-packages/provisioningserver/utils/service_monitor.py", line 432, in killService
            state = yield self.ensureService(name)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/usr/lib/python3/dist-packages/provisioningserver/utils/service_monitor.py", line 722, in _ensureService
            yield self._performServiceAction(service, action)
          File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
            result = g.send(result)
          File "/usr/lib/python3/dist-packages/provisioningserver/utils/service_monitor.py", line 561, in _performServiceAction
            raise ServiceActionError(error_msg)
        provisioningserver.utils.service_monitor.ServiceActionError: Service 'bind9' failed to start: Job for named.service failed because a fatal signal was delivered to the control process.
        See "systemctl status named.service" and "journalctl -xe" for details.

Once this happens, every failed commissioned machine will start throwing "Ca...


Revision history for this message
Björn Tillenius (bjornt) wrote :

Hi David,

could you please file a new bug and attach all the logs?

At first glance, this doesn't look like the same bug, so it's better to start afresh.
