[MAAS 3.2.9] Adding subnet sends named into crash loop [rdns zones]

Bug #2041276 reported by Peter Jose De Sousa
This bug affects 4 people
Affects  Status        Importance  Assigned to          Milestone
MAAS     Fix Released  High        Christian Grabowski
3.2      New           Undecided   Unassigned
3.3      Fix Released  High        Christian Grabowski
3.4      Fix Released  High        Christian Grabowski

Bug Description

Hello,

When I add the subnet 10.33.56.0/24 to my MAAS installation, named starts to crash, which in turn breaks commissioning and deployment.

Named logs show the following:

tf=no' '--enable-ipv6' '--enable-rrl' '--enable-filter-aaaa' '--disable-native-pkcs11' '--disable-isc-spnego' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -fdebug-prefix-map=/build/bind9-TyRg0u/bind9-9.16.1=. -fstack-protector-strong
 -Wformat -Werror=format-security -fno-strict-aliasing -fno-delete-null-pointer-checks -DNO_VERSION_DATE -DDIG_SIGCHASE' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2'
26-Oct-2023 16:26:47.364 running as: named -c /var/snap/maas/31022/bind/named.conf -S 524288 -g
26-Oct-2023 16:26:47.364 compiled by GCC 9.4.0
26-Oct-2023 16:26:47.364 compiled with OpenSSL version: OpenSSL 1.1.1f 31 Mar 2020
26-Oct-2023 16:26:47.364 linked to OpenSSL version: OpenSSL 1.1.1f 31 Mar 2020
26-Oct-2023 16:26:47.364 compiled with libxml2 version: 2.9.10
26-Oct-2023 16:26:47.364 linked to libxml2 version: 20910
26-Oct-2023 16:26:47.364 compiled with json-c version: 0.13.1
26-Oct-2023 16:26:47.364 linked to json-c version: 0.13.1
26-Oct-2023 16:26:47.364 compiled with zlib version: 1.2.11
26-Oct-2023 16:26:47.364 linked to zlib version: 1.2.11
26-Oct-2023 16:26:47.364 ----------------------------------------------------
26-Oct-2023 16:26:47.364 BIND 9 is maintained by Internet Systems Consortium,
26-Oct-2023 16:26:47.364 Inc. (ISC), a non-profit 501(c)(3) public-benefit
26-Oct-2023 16:26:47.364 corporation. Support and training for BIND 9 are
26-Oct-2023 16:26:47.364 available at https://www.isc.org/support
26-Oct-2023 16:26:47.364 ----------------------------------------------------
26-Oct-2023 16:26:47.364 found 40 CPUs, using 40 worker threads
26-Oct-2023 16:26:47.364 using 40 UDP listeners per interface
26-Oct-2023 16:26:47.632 using up to 524288 sockets
26-Oct-2023 16:26:47.636 loading configuration from '/var/snap/maas/31022/bind/named.conf'
26-Oct-2023 16:26:47.640 /var/snap/maas/current/bind/named.conf.maas:348: zone '56.33.10.in-addr.arpa': already exists previous definition: /var/snap/maas/current/bind/named.conf.maas:328
26-Oct-2023 16:26:47.644 loading configuration: failure
26-Oct-2023 16:26:47.644 exiting (due to fatal error)
^C

[Steps to reproduce]

1. Add subnet 10.33.56.0/24 to MAAS

Observe the named logs: named fails to load its configuration and MAAS status goes unhealthy.

[Additional note]

- Deleting the subnet resolves the issue, but prevents configuration of nodes
- Re-adding the subnet causes the issue to arise again

[Workaround]

Re-add the subnet with rDNS disabled, e.g.: maas bego-root subnets create cidr='10.33.56.0/24' name='prod-net-6' rdns_mode=0

SOS, Subnet info and named.log included here: https://drive.google.com/file/d/1h10ko79IoQfFKEki6pR0eL2tFd6PMevC/view

Thank you,
Peter

Peter Jose De Sousa (pjds) wrote :

Subscribing field-critical, as it's blocking a deployment of the prod environment.

Jerzy Husakowski (jhusakowski) wrote :

Peter,

in /var/snap/maas/current/bind/named.conf.maas

section:

zone "56.33.10.in-addr.arpa" {
    type master;
    file "/var/snap/maas/31022/bind/zone.56.33.10.in-addr.arpa";
};

appears twice (lines 328 and 348). As a workaround, could you try deleting one of these sections and restarting MAAS?
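
To confirm which zones are declared more than once without reading the file by eye, a small script can scan it. The following is a minimal sketch, assuming the zone declarations follow the `zone "name" {` layout quoted above; the function name `find_duplicate_zones` and the 57.33.10 zone in the sample are made up for illustration and do not come from the attached files.

```python
import re
from collections import defaultdict

# Matches the opening line of a zone block, e.g.: zone "56.33.10.in-addr.arpa" {
ZONE_RE = re.compile(r'^\s*zone\s+"([^"]+)"')


def find_duplicate_zones(config_text):
    """Return {zone_name: [line numbers]} for zones declared more than once."""
    seen = defaultdict(list)
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        match = ZONE_RE.match(line)
        if match:
            seen[match.group(1)].append(lineno)
    return {name: lines for name, lines in seen.items() if len(lines) > 1}


# Illustrative sample only; the 57.33.10 zone is hypothetical.
sample = '''zone "56.33.10.in-addr.arpa" {
    type master;
    file "/var/snap/maas/31022/bind/zone.56.33.10.in-addr.arpa";
};
zone "57.33.10.in-addr.arpa" {
    type master;
    file "/var/snap/maas/31022/bind/zone.57.33.10.in-addr.arpa";
};
zone "56.33.10.in-addr.arpa" {
    type master;
    file "/var/snap/maas/31022/bind/zone.56.33.10.in-addr.arpa";
};
'''

for name, lines in find_duplicate_zones(sample).items():
    print(f"{name} declared on lines {lines}")
```

Against the real file, the same function could be fed the contents of /var/snap/maas/current/bind/named.conf.maas.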

What were the steps to trigger this issue? maas.log is full of bind restarts and there are no meaningful interactions shown there.

Peter Jose De Sousa (pjds) wrote :

Hello Jerzy

Thank you for the response. My remote access has expired for the day, so I cannot attempt the workaround right now. Sorry about that.

Regarding meaningful steps: after adding several subnets, including this one, we noticed that MAAS was failing to commission any node. After spotting the logging, we removed this subnet and one other, which allowed named to start again.

The installation has been running for roughly two weeks now, and the end user wants to understand why this breaks now. Happy to probe for more information if needed.

Will be OOO until Monday

Thank you,
Peter

Peter Jose De Sousa (pjds) wrote :

Hello @Jerzy, the workaround does not work: the file is immediately rewritten.

Peter Jose De Sousa (pjds) wrote :

WORKAROUND:

Re-add the subnet with rDNS disabled, e.g.: maas bego-root subnets create cidr='10.33.56.0/24' name='prod-net-6' rdns_mode=0

description: updated
summary: - [MAAS 3.2.9] Adding subnet sends named into crash loop
+ [MAAS 3.2.9] Adding subnet sends named into crash loop [rdns zones]
Peter Jose De Sousa (pjds) wrote :

Moving down to field-high, as the issue is worked around by disabling rDNS. Might be able to move to field-medium depending on the discovered impact of the workaround.

Peter Jose De Sousa (pjds) wrote :

Notes on workaround:

In the ZoneGenerator there is this logic:

https://git.launchpad.net/maas/tree/src/maasserver/dns/zonegenerator.py#n376

It appears to create reverse zones, moving the netmask up to /24 if needed. I suspect that with my subnet configuration this causes a duplicate zone to be generated, but I didn't have time to confirm this definitively. I need to continue with the deployment for now.
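
To illustrate the suspicion, here is a simplified sketch of how rounding subnets to /24 boundaries can make two different subnets resolve to the same reverse zone name. This is not the ZoneGenerator code; `reverse_zones_24` is a hypothetical helper, and the /23 and /25 subnets are made-up examples rather than the subnets from this deployment.

```python
import ipaddress


def reverse_zones_24(cidr):
    """List the /24 in-addr.arpa zone names that cover an IPv4 subnet."""
    network = ipaddress.ip_network(cidr)
    if network.version != 4:
        raise ValueError("this sketch only handles IPv4")
    if network.prefixlen > 24:
        # A subnet smaller than a /24 is rounded up to its enclosing /24.
        network = network.supernet(new_prefix=24)
    zones = []
    for block in network.subnets(new_prefix=24):
        a, b, c, _ = str(block.network_address).split(".")
        zones.append(f"{c}.{b}.{a}.in-addr.arpa")
    return zones


# The subnet from this report, plus two hypothetical neighbours.
for cidr in ("10.33.56.0/24", "10.33.56.0/23", "10.33.56.128/25"):
    print(cidr, reverse_zones_24(cidr))
```

All three inputs include 56.33.10.in-addr.arpa in their result, so a generator that emits one zone stanza per subnet would declare that zone more than once.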

Changed in maas:
assignee: nobody → Christian Grabowski (cgrabowski)
importance: Undecided → High
status: New → Triaged
Christian Grabowski (cgrabowski) wrote :

Some notes on the issue:

This stems from two possible issues: MAAS allows overlapping subnets (i.e., we can create both 10.0.0.0/22 and 10.0.1.0/24), and glue zones (reverse zones for subnets with a prefix length < /24 or < /124 for IPv6) can also generate a zonefile for a /24 (or /124 for IPv6).
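
The overlap condition can be checked with Python's standard `ipaddress` module. The sketch below is only an illustration of the check, not MAAS validation code; `overlapping_pairs` is a hypothetical helper, and 192.168.0.0/24 is added as a non-overlapping control.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(cidrs):
    """Return every pair of subnets that share at least one address."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        # Only compare networks of the same IP version.
        if a.version == b.version and a.overlaps(b)
    ]


print(overlapping_pairs(["10.0.0.0/22", "10.0.1.0/24", "192.168.0.0/24"]))
```

Only the 10.0.0.0/22 and 10.0.1.0/24 pair is reported, which is the example given above.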

The former is also an issue for our DHCP: as it turns out, you can configure MAAS with two overlapping subnets and attempt to allocate the same IP twice. We only guard against this through the uniqueness of StaticIPAddress.

It is also worth noting that BIND does not crash in 3.3 or newer in this case; however, the newer subnet's DNS config will overwrite the older subnet's.

We could solve this by validating against overlapping subnets, which also solves the DHCP issue, but this could potentially break existing deployments. We also can't allow overlapping subnets in separate VLANs due to the way we track IPs in the database, though from a networking perspective, that would work.

Alternatively, we could merge the DNS changes on full reloads (this is already a non-issue for dynamic updates), but the potential DHCP issue would persist.
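
As a rough illustration of the merge idea, the sketch below collapses zone definitions that share a name into a single entry, so each zone would be emitted once with the records of every contributing subnet. It is an assumption about the general shape of such a merge, not the change that was committed; `merge_zone_definitions` and the PTR records are hypothetical.

```python
def merge_zone_definitions(definitions):
    """Merge (zone_name, records) pairs so each zone name appears once."""
    merged = {}
    for zone_name, records in definitions:
        bucket = merged.setdefault(zone_name, [])
        for record in records:
            # Skip records already contributed by an earlier subnet.
            if record not in bucket:
                bucket.append(record)
    return merged


# Hypothetical input: two subnets both contribute to the 56.33.10 zone.
definitions = [
    ("56.33.10.in-addr.arpa", ["10 PTR host-a.maas."]),
    ("57.33.10.in-addr.arpa", ["20 PTR host-b.maas."]),
    ("56.33.10.in-addr.arpa", ["30 PTR host-c.maas."]),
]

for zone_name, records in merge_zone_definitions(definitions).items():
    print(zone_name, records)
```

With this shape, the later subnet adds to the earlier subnet's records instead of overwriting them or producing a second stanza with the same zone name.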

Peter Jose De Sousa (pjds) wrote :

Thank you for your inputs MAAS team.

Peter Jose De Sousa (pjds) wrote :

Let me know if I can help further.

tags: added: bug-council
Jerzy Husakowski (jhusakowski) wrote :

Let's fix the DNS zone file generation in the way Christian suggests.

tags: removed: bug-council
Changed in maas:
milestone: none → 3.5.0
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 3.5.0 → 3.5.0-beta1
Changed in maas:
status: Fix Committed → Fix Released