Machine interface configuration changes when preseed is requested

Bug #1741279 reported by Blake Rouse
This bug affects 4 people
Affects: MAAS
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

When performing a deployment, the interface configuration of the machine is automatically changed when curtin requests the installation preseed.

Before curtin preseed request:

enp3s0f0 - fabric0 - untagged - 12.12.8.0/24 - DHCP
enp3s0f1 - Disconnected

During/after curtin preseed request:

enp3s0f0 - fabric6 - untagged - 192.168.17.0/24 - Unconfigured
enp3s0f1 - Disconnected

This breaks the deployment of the machine, as the network interface configuration written to the disk is not correct. Once the machine reboots into the local disk, cloud-init fails because the network configuration is incorrect.

network:
  config:
  - id: enp3s0f0
    mac_address: 0c:c4:7a:df:23:80
    mtu: 1500
    name: enp3s0f0
    subnets:
    - type: manual
    type: physical
  - id: enp3s0f1
    mac_address: 0c:c4:7a:df:23:81
    mtu: 1500
    name: enp3s0f1
    subnets:
    - type: manual
    type: physical
  - address:
    - 172.19.11.12
    search:
    - maas
    type: nameserver
  version: 1

curtin-install.log - https://paste.ubuntu.com/26319894/
curtin-install.yaml - https://paste.ubuntu.com/26319885/
cloud-init.log - https://paste.ubuntu.com/26319901/

In the ephemeral environment the machine PXE booted with the IP 12.12.1.143, as you can see in the regiond log snippet below. Once it requested the data, the network configuration changed as described above.

2018-01-03 09:55:32 regiond: [info] 12.12.1.143 GET /MAAS/metadata/curtin/2012-03-01/meta-data/instance-id HTTP/1.1 --> 200 OK (referrer: -; agent: Cloud-Init/17.1)
2018-01-03 09:55:32 regiond: [info] 12.12.1.143 GET /MAAS/metadata/curtin/2012-03-01/meta-data/instance-id HTTP/1.1 --> 200 OK (referrer: -; agent: python-requests/2.9.1)
2018-01-03 09:55:32 regiond: [info] 12.12.1.143 GET /MAAS/metadata/curtin/2012-03-01/meta-data/local-hostname HTTP/1.1 --> 200 OK (referrer: -; agent: python-requests/2.9.1)
2018-01-03 09:55:32 regiond: [info] 12.12.1.143 GET /MAAS/metadata/curtin/2012-03-01/meta-data/public-keys HTTP/1.1 --> 200 OK (referrer: -; agent: python-requests/2.9.1)
2018-01-03 09:55:33 regiond: [info] 12.12.1.143 GET /MAAS/metadata/curtin/2012-03-01/meta-data/vendor-data HTTP/1.1 --> 200 OK (referrer: -; agent: python-requests/2.9.1)
2018-01-03 09:55:33 regiond: [info] ::1 GET /MAAS/rpc/ HTTP/1.0 --> 200 OK (referrer: -; agent: provisioningserver.rpc.clusterservice.ClusterClientService)
2018-01-03 09:55:33 maasserver.preseed: [warn] WARNING: '/etc/maas/preseeds/curtin_userdata_centos' contains deprecated preseed variables. Please remove: main_archive_directory, ports_archive_directory, http_proxy
2018-01-03 09:55:33 regiond: [info] 12.12.1.143 GET /MAAS/metadata/curtin/2012-03-01/user-data HTTP/1.1 --> 200 OK (referrer: -; agent: python-requests/2.9.1)

Changed in maas:
milestone: 2.3.1 → 2.4.0alpha1
Revision history for this message
Mike Pontillo (mpontillo) wrote :

So far it's been difficult to reproduce this issue; I set up a similar topology (including an unmanaged network) and hit a different issue[1], though so far it looks unrelated.

https://paste.ubuntu.com/26348273/

More information about which subnets and IP ranges exist in MAAS, and how they are configured, would help to get to the bottom of this. I'll keep looking.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

I think we should really be looking into whether it is even possible for a get-curtin-config request to actually change the interface configuration. It seems like it shouldn't be possible, but something is changing it.

Maybe device discovery? We might have him turn that off to see if that solves the problem.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

I think it's worth a try; we need to do more to narrow down the cause. I've looked at the code path which returns the preseed, and didn't see anything that looked like it could cause something like this.

We do save() the Node object in the metadata server, and related objects such as scripts, in a few places during status reporting.

Looking at the device discovery code (report_vid() in models/interface.py), neighbour discovery will only update the VLAN for an interface if it isn't already set. That does not seem to be the case in the above example, because your example shows it being changed from a *configured* VLAN to a completely different VLAN.
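To make the guard concrete, here is a minimal Python sketch of the behaviour described above. This is a hypothetical simplification; the real report_vid() operates on Django model instances and MAAS VLAN objects, and the Interface class here is invented for illustration:

```python
class Interface:
    """Toy stand-in for a MAAS interface model (hypothetical)."""

    def __init__(self, name, vlan=None):
        self.name = name
        self.vlan = vlan  # None means "VLAN not yet set"

    def report_vid(self, observed_vlan):
        # Neighbour discovery only adopts the observed VLAN when no
        # VLAN is configured; a configured VLAN is never overwritten.
        if self.vlan is None:
            self.vlan = observed_vlan
        return self.vlan

nic = Interface("enp3s0f0", vlan="fabric0.untagged")
nic.report_vid("fabric6.untagged")
print(nic.vlan)  # the configured VLAN is preserved: fabric0.untagged
```

Under this guard, the change reported in the bug (fabric0 to fabric6 on an already-configured interface) could not come from neighbour discovery, which is the point being made.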

The only way I can think to explain this bizarre behavior is if there is a duplicate physical MAC on the network, such as if the interface is suddenly moved to a different node. But we would see evidence in the log for that.

So this is still somewhat of a mystery, but anything that can be done to narrow down the problem further would help.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

I found a section of code I'm suspicious about.

In maasserver/rpc/boot.py we update the VLAN of the boot interface depending on where the rack controller saw the PXE request come in.[1]

This information isn't in the bug, but I'm guessing this logic doesn't account for DHCP relay.
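A rough illustration of the suspected behaviour follows. This is a hypothetical sketch, not the actual MAAS code: the function name, the dict-based interface, and the relay set are all invented for illustration:

```python
def update_boot_interface_vlan(iface, pxe_vlan, relay_destinations):
    """Hypothetical sketch: on a PXE request, move the boot interface
    onto the VLAN the rack controller saw the request on, unless that
    VLAN is a DHCP relay destination."""
    if pxe_vlan not in relay_destinations:
        iface["vlan"] = pxe_vlan
    return iface

# With no relay configured, the PXE request relocates the interface,
# matching the fabric0 -> fabric6 change reported in the bug.
iface = {"name": "enp3s0f0", "vlan": "fabric0.untagged"}
update_boot_interface_vlan(iface, "fabric6.untagged", relay_destinations=set())
print(iface["vlan"])  # fabric6.untagged
```

If the logic really does skip the update only for relay destinations, a forwarded-PXE-only setup (no true DHCP relay configured in MAAS) would take the "move the interface" branch every time, which fits the observed change.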

I tested this scenario on an unmanaged subnet (in my tests, the DHCP server simply sets the next server to MAAS, so that MAAS can handle PXE booting) without DHCP relay, and it works in that situation.

I'll need to take a closer look at the logic and compare it to the situation outlined in the bug.

[1]: https://paste.ubuntu.com/26350152/

Revision history for this message
Mike Pontillo (mpontillo) wrote :

The way to test the theory that the problem is the aforementioned code in boot.py would be to specify the relay VLAN (using the API, so that you don't have to define the dynamic range). This can be done using the CLI as follows:

    maas $PROFILE vlan update $SOURCE_FABRIC $SOURCE_VID relay_vlan=$RACK_DHCP_VLAN_DBID

In this example, $PROFILE is the CLI profile, $SOURCE_FABRIC would be the name of the fabric the machine is configured for DHCP on, $SOURCE_VID would be the VID in that fabric (0 if untagged), and $RACK_DHCP_VLAN_DBID is the database ID of the VLAN the rack sees the PXE request on. (The database ID can be found in the URL when you browse to the VLAN, or via the API.)

If this prevents the issue, then I think it's a viable workaround. I tested the workaround locally and it does not seem to cause any issues (even though I'm relaying a VLAN without any IP ranges defined, which I thought might cause issues).

Revision history for this message
Blake Rouse (blake-rouse) wrote :

* He is not using a DHCP relay at all.

* I thought about the path in rpc/boot.py, but that would occur when the machine does its first TFTP request. I specifically watched for that case and it doesn't happen then. It happens once it's in the ephemeral environment, before curtin starts. What is really weird is that it changes to a subnet the interface would never get an IP on.

Don't forget this is consistent; it happens at the same place every time. If it were a duplicate MAC on the network, I would expect it to occur at random.

I think it could be an event in the metadata API that is causing it to change, if it's not the preseed generation.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

If an external DHCP server is setting options to allow machines to PXE boot from MAAS, this will look very similar to a DHCP relay - except the MAAS DHCP server will not actually see the DHCP request, only the PXE request. Your example shows that the network configuration changes as follows:

    12.12.8.0/24 - DHCP --> 192.168.17.0/24 - Unconfigured

So my question is, how are those VLANs related to the rack controller, and what is the difference between any other ephemeral environment (commissioning, testing, etc) and deploying the machine? Does the network configuration change between commissioning/testing and deployment, or is everything static?

In auditing the code, I didn't find anything in the metadata server that could cause this (though I could have missed it). There are three main things (besides user/API interaction) that could cause the IP configuration to change:

(1) PXE boot (sets the boot interface to the interface the rack controller saw the PXE request on, *unless* that VLAN is a DHCP relay destination.)

(2) Commissioning (sets interfaces to networks seen by DHCP - should have already happened)

(3) Deployment (allocates AUTO IPs - should not be happening here)

In `signals/interfaces.py` we also clear out all the subnet links except the DISCOVERED links. So in this situation, my best remaining theory is that (1) occurs, which triggers the post-save interface update, which deletes all the IP addresses associated with the interface except the DISCOVERED address (the one seen during the last commissioning).
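That clearing step can be sketched as follows. This is a hypothetical simplification; the real handler is a Django post-save signal operating on model instances, and `links_after_vlan_change` is an invented name:

```python
def links_after_vlan_change(links):
    """Hypothetical sketch of the post-save cleanup: when an interface
    is re-saved onto a different VLAN, drop every IP link except the
    DISCOVERED ones (addresses seen during the last commissioning)."""
    return [link for link in links if link["mode"] == "discovered"]

links = [
    {"mode": "dhcp", "subnet": "12.12.8.0/24"},
    {"mode": "discovered", "subnet": "192.168.17.0/24"},
]
# Only the DISCOVERED link on 192.168.17.0/24 survives, which would
# leave the interface "Unconfigured" on that subnet - matching the bug.
print(links_after_vlan_change(links))
```

This chain (PXE-triggered save, then signal-driven link cleanup) would explain ending up with 192.168.17.0/24 - Unconfigured even though no user or API action touched the interface.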

Besides enabling DHCP relay, this could be easily verified by running testing on the node after it has been configured as desired, to see if a PXE boot changes the interface configuration independently of commissioning or deployment.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

If it were the PXE booting problem, why does it only happen during deployment? This doesn't happen during commissioning at all.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Commissioning could initially change the VLAN, but would later reset the interfaces to what was found via DHCP. That's why I think running 'Testing' would be a good test.

Revision history for this message
Mike Pontillo (mpontillo) wrote :

Marking Incomplete since we could not determine the cause of the issue after another hour-long triage session. (I still believe it has to do with the PXE request, but that has not been proven nor disproven.)

Workaround was to combine the subnets in MAAS into a single fabric on a single VLAN, so that the PXE request could not have any possible reason to assume the machine's interface moved. This workaround is required because of the not-fully-supported nature of the configuration (MAAS supports full DHCP relay, but in this situation only the PXE request was being forwarded to MAAS.)

Changed in maas:
status: Triaged → Incomplete
importance: Critical → Medium
Changed in maas:
milestone: 2.4.0alpha1 → 2.4.0alpha2
Changed in maas:
milestone: 2.4.0alpha2 → 2.4.0beta1
Changed in maas:
milestone: 2.4.0beta1 → 2.4.0beta2
Changed in maas:
milestone: 2.4.0beta2 → 2.4.0beta3
Changed in maas:
milestone: 2.4.0beta3 → 2.4.0beta4
Changed in maas:
milestone: 2.4.0beta4 → 2.4.x
Changed in maas:
milestone: 2.4.x → 2.5.x
milestone: 2.5.x → 2.5.0alpha2
Changed in maas:
milestone: 2.5.0alpha2 → 2.5.0beta1
Changed in maas:
milestone: 2.5.0beta1 → 2.5.0beta2
Changed in maas:
milestone: 2.5.0beta2 → 2.5.0rc1
Changed in maas:
milestone: 2.5.0rc1 → 2.5.x
no longer affects: maas/2.3
Revision history for this message
Adam Collard (adam-collard) wrote :

This bug has not seen any activity in the last 6 months, so it is being automatically closed.

If you are still experiencing this issue, please feel free to re-open.

MAAS Team

Changed in maas:
status: Incomplete → Expired