[2.2] MAAS Updating the VLAN of the boot interface during PXE breaks deployment under DHCP relaying

Bug #1685306 reported by Evan Sikorski
This bug affects 2 people
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Blake Rouse

Bug Description

* I am using MAAS 2.2 rc1
* We are not using VLANs (all untagged)
* We are separating L2 domains using MAAS fabrics
* We have DHCP relay turned on between fabrics.
* Rack controllers are NOT in the same L2 as the nodes we are building
* All traffic is going through a single network interface on the MAAS systems.
* All DHCP traffic is forwarded by the gateway for that subnet, to the MAAS Rack Controller.

What worked: Enlisting and Commissioning

    After enlisting a node and commissioning it, the network is successfully discovered for each active interface.

What didn't work: Deploying

    When the node is deployed, its network config changes from auto-assign in the correct subnet and fabric to unconfigured in the rack controller's fabric (screenshots attached).

Evan Sikorski (evan.sikorski) wrote :
summary: - [2.2] Subnet changes during deployment
+ [2.2] Subnet changes fabrics during deployment
Evan Sikorski (evan.sikorski) wrote :

Important clarification after talking with Corey:

When deployment ends, the MAAS web UI shows Unmanaged, but if you log in to the node (via its OOB controller) you can see that it does get a valid IP address.

However, MAAS seems to have no idea that an address was assigned and only shows that IP address as 'observed' in the UI.

Example output from a system that deployed, with the UI showing 'Unmanaged':

[cloud-user@03-23W7RV1-DEMO ~]$ ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 10.63.20.22 netmask 255.255.255.128 broadcast 10.63.20.127

(attached screenshot of matching Observed output)

Evan Sikorski (evan.sikorski) wrote :
summary: - [2.2] Subnet changes fabrics during deployment
+ [2.2] MAAS-UI loses networking config after deployment
Evan Sikorski (evan.sikorski) wrote : Re: [2.2] MAAS-UI loses networking config after deployment

I received a little more information that I was not given before.

The IP address the node is assigned during deployment is in the DHCP range, not the range from which it should get its final (static DHCP via auto-assign) IP address.
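For anyone who wants to check this on their own setup, here is a rough Python sketch of the comparison. The address and netmask come from the ifconfig output above. The dynamic range bounds are HYPOTHETICAL placeholders (the real range is not stated in this report), as is the helper name `in_dynamic_range`.

```python
import ipaddress

# Address and netmask as reported by ifconfig on the deployed node.
iface = ipaddress.ip_interface("10.63.20.22/255.255.255.128")
subnet = iface.network
print(subnet)                    # 10.63.20.0/25
print(subnet.broadcast_address)  # 10.63.20.127

# HYPOTHETICAL dynamic (DHCP) range. Replace these with the bounds
# configured for the subnet in your own MAAS; they are not from this report.
DYNAMIC_START = ipaddress.ip_address("10.63.20.10")
DYNAMIC_END = ipaddress.ip_address("10.63.20.50")


def in_dynamic_range(ip: str) -> bool:
    """Return True if the address falls inside the assumed dynamic range."""
    return DYNAMIC_START <= ipaddress.ip_address(ip) <= DYNAMIC_END


# With the placeholder bounds above, the observed address is a dynamic lease.
print(in_dynamic_range(str(iface.ip)))
```

If the observed address lands in the dynamic range, the node only has its PXE/DHCP lease and never received the auto-assigned address.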

Mike Pontillo (mpontillo) wrote :

To get to the root cause of this, we need to find out why the interface suddenly becomes unconfigured when the machine is deployed.

There isn't anything obvious in the deployment process itself that would do this, but I think I figured out what might be causing this: a post-save handler that attempts to clean up interfaces after they are saved to ensure consistency.

In maasserver/models/signals/interfaces.py we have some code as follows:

    def interface_vlan_update(...):
        ....
            # Interface VLAN was changed on a machine or device. Remove all its
            # links except the DISCOVERED ones.
            instance.ip_addresses.exclude(
                alloc_type=IPADDRESS_TYPE.DISCOVERED).delete()

This seems to match the behavior you're seeing. My guess is that when the interface is updated during deployment, MAAS realizes that the interface isn't on the VLAN it thought it was on. So we assign the IP address and update the interface, but then this code runs and deletes everything.
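To make the suspected mechanism concrete, here is a minimal, self-contained Python sketch of it. This is not MAAS code: `Interface`, `IPAddress`, `AllocType`, `set_vlan` and `on_vlan_changed` are hypothetical stand-ins for the real models and signal handler, and the fabric names and addresses are made-up examples. It only illustrates how a handler that fires on a VLAN change would drop every address that is not DISCOVERED.

```python
from dataclasses import dataclass, field
from enum import Enum


class AllocType(Enum):
    """Stand-in for the allocation types; not the real MAAS enum."""
    AUTO = "auto"
    DISCOVERED = "discovered"


@dataclass
class IPAddress:
    ip: str
    alloc_type: AllocType


@dataclass
class Interface:
    name: str
    vlan: str
    ip_addresses: list = field(default_factory=list)

    def set_vlan(self, new_vlan: str) -> None:
        """Change the VLAN and run the cleanup hook if it actually changed."""
        changed = new_vlan != self.vlan
        self.vlan = new_vlan
        if changed:
            on_vlan_changed(self)


def on_vlan_changed(interface: Interface) -> None:
    """Mimic the quoted handler: keep only DISCOVERED addresses."""
    interface.ip_addresses = [
        addr for addr in interface.ip_addresses
        if addr.alloc_type is AllocType.DISCOVERED
    ]


# Example addresses below are illustrative only.
eno1 = Interface("eno1", vlan="node-fabric", ip_addresses=[
    IPAddress("10.63.20.90", AllocType.AUTO),
    IPAddress("10.63.20.22", AllocType.DISCOVERED),
])

# Something moves the interface to a different VLAN during deployment.
eno1.set_vlan("rack-fabric")

# Only the DISCOVERED address survives.
remaining = [(a.ip, a.alloc_type.name) for a in eno1.ip_addresses]
print(remaining)
```

In this toy model the auto-assigned address is gone after the VLAN change and only the discovered one is left, which lines up with the UI showing the address as merely observed.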

After you reconfigure all the fabrics to separate out each Layer 2 network, you may need to recommission each node in order to prevent this.

Also, once the VLANs "settle", I would think that a subsequent deployment would not have this problem. So I wonder if this is repeatable.

I'll run some more tests locally to see if I can replicate this.

Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.2.0rc3
assignee: nobody → Mike Pontillo (mpontillo)
Evan Sikorski (evan.sikorski) wrote :

Attached (sanitized) results of the following:

root@hostname:~# sudo maas-region dbshell
psql (9.5.6)
Type "help" for help.

maasdb=# \pset pager off
Pager usage is off.
maasdb=# select * from maas_support__node_networking;

Mike Pontillo (mpontillo) wrote :

Can we see if we can isolate this to an issue that is specific to HA?

If the nodes can reach the rack controller on XXXmaas01, let's test disabling XXXmaas02 and using XXXmaas01 for region and rack services. (Reason: I know you tried shutting down the rack controller on XXXmaas01, but in addition to the rack controller, the region controller will also periodically refresh its network configuration, which may cause the network model changes that I suspect may be erroneously clearing out your IP addresses.)

Mike Pontillo (mpontillo) wrote :

I have a new theory about how this might happen. It looks like when a machine PXE boots, we might update its VLAN. In a DHCP relay scenario, this would cause the symptom you're seeing. Stay tuned... I will look more into this tomorrow.
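A rough sketch of that theory, in plain Python rather than MAAS code. `Vlan`, `relay_to`, `naive_boot_vlan` and `relay_aware_boot_vlan` are hypothetical names invented for illustration, and the relay-aware variant is just one conceivable guard, not a description of any actual fix. The assumption is that a relayed DHCP request reaches the rack controller on the rack's own VLAN, so logic that trusts the arrival VLAN would move the boot interface there.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vlan:
    name: str
    # Hypothetical field: the VLAN this VLAN's DHCP traffic is relayed to.
    relay_to: Optional["Vlan"] = None


def naive_boot_vlan(boot_vlan: Vlan, rack_vlan: Vlan) -> Vlan:
    """Always trust the VLAN on which the rack controller saw the request."""
    return rack_vlan


def relay_aware_boot_vlan(boot_vlan: Vlan, rack_vlan: Vlan) -> Vlan:
    """Keep the existing VLAN when a mismatch is explained by DHCP relay."""
    if boot_vlan is rack_vlan or boot_vlan.relay_to is rack_vlan:
        return boot_vlan
    return rack_vlan


rack = Vlan("rack-fabric")
node = Vlan("node-fabric", relay_to=rack)

# Naive logic moves the boot interface into the rack controller's fabric.
print(naive_boot_vlan(node, rack).name)        # rack-fabric

# Relay-aware logic leaves the boot interface where commissioning put it.
print(relay_aware_boot_vlan(node, rack).name)  # node-fabric
```

Combined with the VLAN-change cleanup handler quoted earlier, the naive behavior would be enough to wipe the auto-assigned address on every PXE boot behind a relay.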

Changed in maas:
importance: High → Critical
assignee: Mike Pontillo (mpontillo) → Blake Rouse (blake-rouse)
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Evan Sikorski (evan.sikorski) wrote :

Can confirm that after applying http://bazaar.launchpad.net/~blake-rouse/maas/fix-1685306/revision/6017 to our RC2 deployment we no longer observed the symptoms of this bug.

summary: - [2.2] MAAS-UI loses networking config after deployment
+ [2.2] MAAS Updating the VLAN of the boot interface during PXE breaks
+ deployment under DHCP relaying
Changed in maas:
status: Fix Committed → Fix Released
Jim Tilander (p-jim-8) wrote :

We are running MAAS 2.2.1 (6078-g2a6d96e-0ubuntu1~16.04.1), and it's unclear whether this bug is supposed to be fixed in that release.

We are *definitely* running into this issue on the above MAAS version.

What version should we go to for a fix?

Evan Sikorski (evan.sikorski) wrote :

As the original reporter of the issue, I can assure you it is gone in our own environment.

I would advise opening a new bug and referencing this one as a possible source.
