MAAS aggressively de-configures network interface

Bug #1915176 reported by Rod Smith
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
MAAS      Invalid   High         Unassigned

Bug Description

We've recently upgraded the Server Certification lab from MAAS 2.6 to MAAS 2.9 (version 2.9.1 (9153-g.66318f531), to be precise), and we've encountered a new problem with MAAS 2.9. Our lab configuration has two IP blocks (10.245.128.0/22, the "external network"; and 10.1.10.0/23, the "internal network") on the same VLAN. The MAAS server has two physical NICs, one for each range of IP addresses. MAAS manages DHCP for the internal network (10.1.10.0/23), but not for the external one (10.245.128.0/22). This configuration enables us to easily move individual nodes' IP addresses from one IP range to another, and it worked fine with MAAS 2.6; however, in MAAS 2.9, a problem can arise:

If a node's network device is configured to be on the external network but the node PXE-boots from that device, MAAS 2.9 aggressively unconfigures it: its settings disappear from the MAAS "Network" tab, and the deployed node has no network options set. The /etc/netplan/50-cloud-init.yaml entry for the device looks something like this:

        eno2:
            match:
                macaddress: 4c:52:62:9e:b9:99
            mtu: 1500
            set-name: eno2
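As a rough illustration of why an entry like this leaves the node without networking, the sketch below checks a parsed netplan ethernet entry for any addressing configuration. It assumes the YAML has already been loaded into a Python dict; the helper name `is_unconfigured` and the example address 10.245.128.10/22 are hypothetical and do not come from the report.

```python
def is_unconfigured(entry: dict) -> bool:
    """Return True if a netplan ethernet entry has neither static
    addresses nor DHCP enabled (hypothetical helper, for illustration)."""
    has_static = bool(entry.get("addresses"))
    has_dhcp = bool(entry.get("dhcp4")) or bool(entry.get("dhcp6"))
    return not (has_static or has_dhcp)


# The entry from the report: only match, mtu, and set-name are present.
eno2 = {
    "match": {"macaddress": "4c:52:62:9e:b9:99"},
    "mtu": 1500,
    "set-name": "eno2",
}

# The same entry with a made-up static address on the external network.
configured = dict(eno2, addresses=["10.245.128.10/22"])

print(is_unconfigured(eno2))        # no addressing keys at all
print(is_unconfigured(configured))  # has a static address
```

The entry still renames and matches the NIC, but with no `addresses` or DHCP keys nothing assigns it an IP address.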

Although it's possible to control the PXE-boot device on most servers and thus avoid this problem by ensuring the server PXE-boots from a device configured for the internal network, this isn't always 100% reliable. The server might fail over to another device if the first attempt times out or otherwise fails. Some servers have configuration options that are difficult, and perhaps impossible, to set correctly.

Reverting to the behavior of MAAS 2.6, which would PXE-boot on the internal network but do a final network configuration on the external network, is desirable.

Jeff Lane (bladernr)
tags: added: hwcert-server

Rod Smith (rodsmith) wrote :

I'm attaching a tarball containing several files showing the state of the MAAS server's network and the before, during, and after states of a node (polari) configured as described. (There's no "during" screenshot, since the network device went to an unconfigured state about a minute after I began the deployment, and nothing else interesting happened. The "during" text file was created after the network interface went to an unconfigured state.)

Björn Tillenius (bjornt) wrote :

Ok. I see what's going on there. You're not only using different subnets, but different fabrics and VLANs as well. That goes against the model a bit, since you actually have the same VLAN.

I agree that it's a bit odd that this gets changed when deploying, though. I'm going to have to look at the code to see why we do that.

Changed in maas:
assignee: nobody → Björn Tillenius (bjornt)

Björn Tillenius (bjornt) wrote :

Ok. I see that there's code in src/maasserver/rpc/boot.py to update the VLAN of the boot interface in get_config().

That specific piece of code was added in commit 7d699a58e49f7c312953aac471bde3550d51e07c, which is from 2016, so it should be present in 2.6 as well.

I still have to investigate why we have that code in there. But at the very least, we should only update the info if the machine is in a commissioning state. But even then, the commissioning scripts should be able to detect the correct fabric and vlan anyway.

Björn Tillenius (bjornt) wrote :

The workaround would be to move both subnets to be on the same VLAN. Would that break your workflow?

Changed in maas:
status: New → Triaged
importance: Undecided → High

Rod Smith (rodsmith) wrote :

Thanks for looking into this so quickly!

In the MAAS web UI, both subnets are showing as being on the same "untagged" VLAN. I'm willing to try making changes to this, but I'm not sure what to do. Or are you suggesting moving both subnets to the same fabric? I think I see how to do that in the web UI, and I'm willing to give it a try, but I'll wait for confirmation before trying that.

Björn Tillenius (bjornt) wrote :

Yes, I meant having both subnets on the same fabric.

Even though the two VLANs are both named "untagged", they are actually different VLANs, since they are on their own fabric. Each fabric has its own set of VLANs, and they can't be shared across fabrics.
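To make the distinction concrete, here is a minimal sketch of the idea that a VLAN's identity includes its fabric. This is not MAAS's actual data model; the `Vlan` class and the fabric names "fabric-0" and "fabric-1" are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vlan:
    """Illustrative stand-in for a VLAN that belongs to exactly one fabric."""
    fabric: str             # owning fabric (hypothetical names below)
    vid: int                # 0 is used here to mean "untagged"
    name: str = "untagged"


# Two "untagged" VLANs that live on different fabrics.
external_vlan = Vlan(fabric="fabric-0", vid=0)
internal_vlan = Vlan(fabric="fabric-1", vid=0)

print(external_vlan.name == internal_vlan.name)  # same display name
print(external_vlan == internal_vlan)            # but not the same VLAN
```

Under this reading, putting both subnets on one fabric is what makes them share a single untagged VLAN.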

Adam Collard (adam-collard) wrote :

Is this still an issue in a recent version of MAAS (3.3 onwards)?

Changed in maas:
status: Triaged → Incomplete
assignee: Björn Tillenius (bjornt) → nobody

Rod Smith (rodsmith) wrote :

I haven't seen it in quite a while, but I don't know if that's because it's been fixed or because of changes in the way we've configured our servers.

Jerzy Husakowski (jhusakowski) wrote :

Environment configuration may have contributed to the observed behaviour. Closing the issue for now as it's not reproducible in recent MAAS versions.

Changed in maas:
status: Incomplete → Invalid