Adding overlapping subnets in fabric breaks deployments

Bug #1964644 reported by Alan Baghumian
This bug affects 1 person
Affects                Status        Importance  Assigned to  Milestone
MAAS (status tracked in 3.6)
  3.5                  Won't Fix     Medium      Unassigned
  3.6                  Triaged       Medium      Unassigned
MAAS documentation     Fix Released  High        Bill Wear

Bug Description

Here is how to reproduce the issue:

1) MAAS 3.1 Snap with an existing 10.1.8.0/22 subnet configured + DHCP Provided by MAAS. Currently installed version on all region and rack controllers is 3.1.0-10901-g.f1f8f1505 (channel latest/stable Jan 19, 2022).

2) Performed a test commissioning and deployment prior to the experiment. Everything worked.

3) Added a new subnet 10.1.10.0/24 via WebUI to fabric-0 which already includes an existing overlapping subnet 10.1.8.0/22. MAAS did not stop me from adding the overlapping network.

4) Tested deploying an already commissioned machine:

     - Edited the network interface and put it under the 10.1.10.0/24 as well as 10.1.8.0/22 subnets.
     - Tried DHCP and static IP addresses.
     - Tried Focal (20.04) and Groovy (20.10)
     - In all scenarios the machine performed PXE boot then went into a boot loop causing the deployments to fail.

5) Removed the overlapping subnet and re-tested deployments; they still failed.

6) Rebooted all region (2) and rack (2) controllers.

7) Tested deployments again and they started working again.

Suggested Solution: Do not allow users to add overlapping subnets. This should be possible by validating new subnets against the existing ones at creation time.
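As a rough illustration of the kind of validation suggested above, here is a minimal sketch using only Python's standard `ipaddress` module. The function name `find_overlaps` is hypothetical and is not part of MAAS; how MAAS would actually store and check subnets is not shown in this report.

```python
import ipaddress


def find_overlaps(new_cidr, existing_cidrs):
    """Return the existing subnets whose address range overlaps new_cidr.

    Hypothetical helper for illustration only; not a MAAS API.
    """
    new_net = ipaddress.ip_network(new_cidr)
    conflicts = []
    for cidr in existing_cidrs:
        net = ipaddress.ip_network(cidr)
        # Only compare networks of the same IP version (IPv4 vs IPv6).
        if net.version == new_net.version and net.overlaps(new_net):
            conflicts.append(cidr)
    return conflicts


# The subnets from the reproduction steps above.
conflicts = find_overlaps("10.1.10.0/24", ["10.1.8.0/22"])
if conflicts:
    print("Refusing to add 10.1.10.0/24; it overlaps:", ", ".join(conflicts))
```

A check like this could either reject the new subnet outright or only raise a warning, depending on how much freedom users should keep for edge cases.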

Bill Wear (billwear) wrote :

Triaging because I have already seen this weirdness. Subnets aren't intended to overlap. The IP range of one subnet should be unique compared to every other subnet on the same segment. This is mainly because routers can't reliably determine which subnet should get a packet destined for one of the overlapping addresses. That might be what's gumming up the rack controller in this instance, dunno.

That said, I'm not sure if MAAS should prevent you from doing it, that is, I'm not sure if it's a doc/troubleshooting bug or a code bug. Either way, we should talk about it.
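To make the ambiguity described above concrete, the following sketch lists which of the two subnets from this report contain a given address. It uses only the standard `ipaddress` module; the helper name `matching_subnets` and the sample addresses are assumptions chosen for illustration.

```python
import ipaddress


def matching_subnets(address, cidrs):
    """Return every subnet in cidrs that contains the given address.

    Hypothetical helper for illustration only; not a MAAS API.
    """
    addr = ipaddress.ip_address(address)
    matches = []
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)
        # Skip networks of a different IP version than the address.
        if net.version == addr.version and addr in net:
            matches.append(cidr)
    return matches


subnets = ["10.1.8.0/22", "10.1.10.0/24"]
# An address inside the /24 also falls inside the /22, so it matches both.
print(matching_subnets("10.1.10.5", subnets))
# An address in the /22 but outside the /24 matches only one subnet.
print(matching_subnets("10.1.9.5", subnets))
```

Any address in 10.1.10.0/24 belongs to both subnets, which is the situation in which a component has to pick one of them.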

Changed in maas:
status: New → Triaged
Bill Wear (billwear) wrote :

After discussion with field, this bug needs clarification as to how the /22 and /24 subnets were overlapped. Need more info to understand how to classify this problem.

Changed in maas:
status: Triaged → Incomplete
Alan Baghumian (alanbach) wrote :

As discussed on the channel today, I deployed a brand new machine with MAAS 3.1 Snap (3.1.0-10901-g.f1f8f1505), PostgreSQL 12 outside of snap and initialized it as region+rack controller. The configured subnet was 10.1.10.0/24 (IP: 10.1.10.254/24).

The test involved two scenarios:

Scenario 1)

- Used a blank installed 20.04 LTS + PostgreSQL 12 + MAAS 3.1 Snap (3.1.0-10901-g.f1f8f1505)
- Went through MAAS initial setup screen.
- Changed network discovery interval to 10 minutes.
- Edited the machine's netplan configuration, switching the subnet from /24 to /22
- Rebooted the machine, a new 10.1.8.0/22 subnet was added under subnets/fabric-0 next to the existing 10.1.10.0/24 (See logs package for screenshots)

Scenario 2)

- Used a blank installed 20.04 LTS + PostgreSQL 12 + MAAS 3.1 Snap (3.1.0-10901-g.f1f8f1505)
- Went through MAAS initial setup screen.
- Changed network discovery interval to 10 minutes.
- From the MAAS WebUI, Subnets tab, changed the 10.1.10.0/24 subnet to 10.1.8.0/22
- No new subnets were added besides 10.1.8.0/22

The logs package includes:

- Logs for scenario 1 and 2 captured from /var/snap/maas/common/log/
- Screenshots from scenario 1

This process was repeated twice with the exact same results.

Hope this helps to shed a bit of light on the issue.

Bill Wear (billwear) wrote :

Triaging this now. Not clear to me, personally, whether MAAS should do extensive sanity checks on network inputs, as this might restrict user freedom to handle edge cases, but (1) clearly there is a path here to introduce a non-working configuration, whether that's by design or by mistake, (2) the MAAS PM Anton Smith has stated a (weight-bearing) preference that the MAAS networking model not tolerate cross-wiring like this without at least a warning, and (3) MAAS developer Alexsander Sousa has indicated, wisely, that MAAS should recover from these kinds of connection errors without resorting to controller restarts.

Setting importance to Medium, and adding a doc track at importance level High, so this can be added to troubleshooting information, at least.

Changed in maas:
status: Incomplete → Triaged
importance: Undecided → Medium
Changed in maas-offline-docs:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Bill Wear (billwear)
Bill Wear (billwear)
Changed in maas-offline-docs:
status: Triaged → Fix Committed
status: Fix Committed → Fix Released
summary: - [3.1] Adding overlapping subnets in fabric breaks deployments
+ Adding overlapping subnets in fabric breaks deployments
Jerzy Husakowski (jhusakowski) wrote :

Let's try to reproduce this in a MAAS version newer than 3.1. There have been multiple changes in the networking code, and MAAS is intended to propagate network changes gracefully, without restarts.

Changed in maas:
milestone: none → 3.5.0
Changed in maas:
milestone: 3.5.0 → 3.5.x
Alan Baghumian (alanbach) wrote :

By the way, this is still happening with newer versions of MAAS, such as 3.3.x.
