MAAS

MAAS aggressively de-configures network interface

Bug #1915176 reported by Rod Smith on 2021-02-09

This bug affects 1 person

Affects   Status    Importance   Assigned to   Milestone
MAAS      Invalid   High         Unassigned

Bug Description

We've recently upgraded the Server Certification lab from MAAS 2.6 to MAAS 2.9 (version 2.9.1 (9153-g.66318f531), to be precise), and we've encountered a new problem with MAAS 2.9. Our lab configuration has two IP blocks (10.245.128.0/22, the "external network"; and 10.1.10.0/23, the "internal network") on the same VLAN. The MAAS server has two physical NICs, one for each range of IP addresses. MAAS manages DHCP for the internal network (10.1.10.0/23), but not for the external one (10.245.128.0/22). This configuration enables us to easily move individual nodes' IP addresses from one IP range to another, and it worked fine with MAAS 2.6; however, in MAAS 2.9, a problem can arise:

If a node's network device is configured on the external network but the node PXE-boots from that device, MAAS 2.9 aggressively unconfigures the device: its settings disappear from the MAAS "Network" tab, and the deployed node has no network options set. The /etc/netplan/50-cloud-init.yaml entry for the device looks something like this:

        eno2:
            match:
                macaddress: 4c:52:62:9e:b9:99
            mtu: 1500
            set-name: eno2
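
As a rough illustration of what "no network options set" means for this stanza, the sketch below checks whether a netplan interface entry asks for any addressing. The helper name `has_addressing` is hypothetical, and the stanza is transcribed by hand as a Python dict rather than parsed from the YAML file; `addresses`, `dhcp4`, and `dhcp6` are the standard netplan keys for static and DHCP addressing.

```python
def has_addressing(stanza: dict) -> bool:
    """Return True if the stanza requests static or DHCP addressing."""
    if stanza.get("addresses"):
        return True
    return bool(stanza.get("dhcp4")) or bool(stanza.get("dhcp6"))


# The eno2 stanza from the report, transcribed by hand as a dict.
eno2 = {
    "match": {"macaddress": "4c:52:62:9e:b9:99"},
    "mtu": 1500,
    "set-name": "eno2",
}

print(has_addressing(eno2))  # False: the interface is matched and renamed only
```

A stanza like this one only matches and renames the device, so the deployed node comes up without an address on it.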

Although it's possible to control the PXE-boot device on most servers and thus avoid this problem by ensuring the server PXE-boots from a device configured for the internal network, this isn't always 100% reliable. The server might fail over to another device if the first attempt times out or otherwise fails. Some servers have configuration options that are difficult, and perhaps impossible, to set correctly.

Reverting to the behavior of MAAS 2.6, which would PXE-boot on the internal network but do a final network configuration on the external network, is desirable.
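
For reference, the two-subnet layout described above can be sketched with Python's standard `ipaddress` module. This is only an illustration of which range an address falls in and whether MAAS manages DHCP for it; the `classify` helper and the sample addresses are hypothetical and are not taken from the lab.

```python
import ipaddress

# The two IP blocks from the report, both on the same physical VLAN.
SUBNETS = {
    "external": ipaddress.ip_network("10.245.128.0/22"),
    "internal": ipaddress.ip_network("10.1.10.0/23"),
}

# Per the report, MAAS manages DHCP for the internal network only.
DHCP_MANAGED = {"internal"}


def classify(address: str):
    """Return the name of the subnet containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in SUBNETS.items():
        if ip in network:
            return name
    return None


# Sample addresses are made up for illustration.
for sample in ("10.1.11.20", "10.245.130.7"):
    name = classify(sample)
    print(sample, name, "DHCP by MAAS" if name in DHCP_MANAGED else "no MAAS DHCP")
```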

Jeff Lane  (bladernr) on 2021-02-09

tags:

added: hwcert-server

Rod Smith (rodsmith) wrote on 2021-02-09:

#1

Tarball containing diagnostic data (164.7 KiB, application/x-tar)

I'm attaching a tarball containing several files showing the state of the MAAS server's network and the before, during, and after states of a node (polari) configured as described. (There's no "during" screenshot, since the network device went to an unconfigured state about a minute after I began the deployment, and nothing else interesting happened. The "during" text file was created after the network interface went to an unconfigured state.)

Björn Tillenius (bjornt) wrote on 2021-02-10:

#2

OK, I see what's going on there. You're not only using different subnets, but different fabrics and VLANs as well. That goes against the model a bit, since you actually have the same VLAN.

I agree that it's a bit odd that this gets changed when deploying, though. I'm going to have to look at the code to see why we do that.

Changed in maas:
assignee:	nobody → Björn Tillenius (bjornt)

Björn Tillenius (bjornt) wrote on 2021-02-10:

#3

Ok. I see that there's code in src/maasserver/rpc/boot.py to update the VLAN of the boot interface in get_config().

That specific piece of code was added in commit 7d699a58e49f7c312953aac471bde3550d51e07c, which is from 2016, so it should be present in 2.6 as well.

I still have to investigate why we have that code in there. But at the very least, we should only update the info if the machine is in a commissioning state. But even then, the commissioning scripts should be able to detect the correct fabric and vlan anyway.
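
The guard suggested here could look roughly like the sketch below. It is not the MAAS code in src/maasserver/rpc/boot.py; every name in it (`Status`, `BootInterface`, `maybe_update_boot_vlan`) is hypothetical and exists only to show the idea of leaving the boot interface's VLAN alone unless the machine is commissioning.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical subset of machine states, for illustration only.
    COMMISSIONING = "commissioning"
    DEPLOYING = "deploying"


@dataclass
class BootInterface:
    # Hypothetical stand-in for the interface the machine PXE-booted from.
    vlan: str


def maybe_update_boot_vlan(iface: BootInterface, status: Status, observed_vlan: str) -> bool:
    """Update the interface's VLAN only while commissioning.

    Returns True if the VLAN was changed, False otherwise.
    """
    if status is not Status.COMMISSIONING:
        return False  # keep the operator's configuration during deployment
    if iface.vlan == observed_vlan:
        return False  # nothing to change
    iface.vlan = observed_vlan
    return True


iface = BootInterface(vlan="external-untagged")
print(maybe_update_boot_vlan(iface, Status.DEPLOYING, "internal-untagged"))  # False
print(iface.vlan)  # external-untagged
```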

Björn Tillenius (bjornt) wrote on 2021-02-10:

#4

The workaround would be to move both subnets to be on the same VLAN. Would that break your workflow?

Björn Tillenius (bjornt) on 2021-02-10

Changed in maas:
status:	New → Triaged
importance:	Undecided → High

Rod Smith (rodsmith) wrote on 2021-02-10:

#5

Thanks for looking into this so quickly!

In the MAAS web UI, both subnets are showing as being on the same "untagged" VLAN. I'm willing to try making changes to this, but I'm not sure what to do. Or are you suggesting moving both subnets to the same fabric? I think I see how to do that in the web UI, and I'm willing to give it a try, but I'll wait for confirmation before trying that.

Björn Tillenius (bjornt) wrote on 2021-02-15:

#6

Yes, I meant having both subnets on the same fabric.

Even though the two VLANs are both named "untagged", they are actually different VLANs, since they are on their own fabric. Each fabric has its own set of VLANs, and they can't be shared across fabrics.
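
A minimal sketch of that model, assuming a VLAN is identified by its fabric together with its VLAN ID rather than by its display name. The `Vlan` class and the fabric names "fabric-0" and "fabric-1" are hypothetical, not MAAS's own data model; VID 0 is used here to stand for the untagged VLAN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vlan:
    fabric: str             # the fabric that owns this VLAN
    vid: int                # 0 stands for the untagged VLAN in this sketch
    name: str = "untagged"  # display name; not unique across fabrics


vlan_a = Vlan(fabric="fabric-0", vid=0)
vlan_b = Vlan(fabric="fabric-1", vid=0)

print(vlan_a.name == vlan_b.name)  # True: both show up as "untagged"
print(vlan_a == vlan_b)            # False: different fabrics, so different VLANs
```

Moving both subnets onto one fabric, as suggested, would put them on the same `Vlan` object in this picture.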

Adam Collard (adam-collard) wrote on 2023-08-24:

#7

Is this still an issue in a recent version of MAAS (3.3 onwards)?

Changed in maas:
status:	Triaged → Incomplete
assignee:	Björn Tillenius (bjornt) → nobody

Rod Smith (rodsmith) wrote on 2023-08-24:

#8

I haven't seen it in quite a while, but I don't know if that's because it's been fixed or because of changes in the way we've configured our servers.

Jerzy Husakowski (jhusakowski) wrote on 2023-09-07:

#9

Environment configuration may have contributed to the observed behaviour. Closing the issue for now as it's not reproducible in recent MAAS versions.

Changed in maas:
status:	Incomplete → Invalid
