[1.9] Static/Automatic IP addresses inside the dynamic range conflict with DHCP lease uploads

Bug #1635735 reported by David Lawson
This bug affects 2 people
Affects   Status      Importance   Assigned to   Milestone
MAAS      Won't Fix   Undecided    Unassigned
1.9       Won't Fix   Wishlist     Unassigned

Bug Description

We've been seeing intermittent issues at a customer site where after a period of time (presumably the lease time for the auto-assigned IP address) if a node is rebooted it will come back up with a dynamic IP that doesn't match the one MaaS has auto-assigned. After digging for a while, it appears that MaaS is discovering the DHCP IP issued while commissioning the node and isn't properly deleting that from the database or clearing out the leases for it.

https://pastebin.canonical.com/168664/ <- relevant snippet of node configuration output from MaaS.

Is there something we can do to remove those IPs from the node configs? Is there specific information we can provide to help diagnose why this is happening?

Mike Pontillo (mpontillo) wrote :

First, I assume that the customer is using a separate static and dynamic range on their cluster interfaces?

Second, I suspect the discovered IP address in this case is just a symptom of the overall problem; it looks like MAAS /may/ be out of sync with the DHCP server in this case.

Since the node has an automatic IP address, when MAAS goes to deploy the node, MAAS (via curtin) will set up /etc/network/interfaces to use the static IP address (assuming it is not a custom image without this capability). So when this node is deployed, it should properly use the static IP address.

However, when the machine performs a network boot, commissions, or re-deploys, at those times it will not be able to configure its IP address statically, so what we also do is inform the DHCP server that we have leased out the IP address to that node. If /that/ communication is somehow interrupted, MAAS might get into a bad state, because now the state that MAAS believes is true will be out of sync with what DHCP believes is true.

I would check the DHCP lease database to see if the lease for the node is in the expected state. MAAS will normally use omshell to write the static lease to the DHCP server (using omapi). The fact that you see a discovered IP address in there you don't expect might indicate that the DHCP server has forgotten about the static lease.
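One way to carry out that check is to scan the ISC dhcpd leases file for every block that mentions the node's MAC address. The following is a minimal sketch, not a MAAS tool: the function name, the MAC address, and the host entry name in the sample are hypothetical, the sample only reuses the two addresses mentioned in this report for illustration, and the parser assumes the usual flat "lease ... { ... }" and "host ... { ... }" layout. On a MAAS 1.9 cluster controller the file is assumed to live at /var/lib/maas/dhcp/dhcpd.leases; verify the path on your system.

```python
import re

# Matches top-level "lease <ip> { ... }" and "host <name> { ... }" blocks.
# Assumes blocks are not nested and that each closing brace starts a line.
BLOCK_RE = re.compile(r"^(lease|host)\s+(\S+)\s*\{(.*?)^\}", re.S | re.M)
MAC_RE = re.compile(r"hardware\s+ethernet\s+([0-9A-Fa-f:]+)\s*;")
FIXED_RE = re.compile(r"fixed-address\s+(\S+?)\s*;")


def entries_for_mac(leases_text, mac):
    """Return (kind, ip) pairs for every block that mentions `mac`."""
    found = []
    for kind, name, body in BLOCK_RE.findall(leases_text):
        match = MAC_RE.search(body)
        if not match or match.group(1).lower() != mac.lower():
            continue
        if kind == "lease":
            # For a dynamic lease the block name is the leased address.
            found.append(("dynamic lease", name))
        else:
            # For a host entry the address is in the fixed-address statement.
            fixed = FIXED_RE.search(body)
            found.append(("host entry", fixed.group(1) if fixed else name))
    return found


# Illustrative data only; the MAC address and host name are made up.
SAMPLE = """\
lease 10.116.11.165 {
  binding state active;
  hardware ethernet 00:16:3e:aa:bb:cc;
}
host example-node {
  dynamic;
  hardware ethernet 00:16:3e:aa:bb:cc;
  fixed-address 10.116.8.62;
}
"""

for kind, ip in entries_for_mac(SAMPLE, "00:16:3e:aa:bb:cc"):
    print(kind, ip)
```

If a node that MAAS believes has an automatically assigned address shows only a dynamic lease and no host entry, that would be consistent with the DHCP server having lost the static mapping.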

Is there something about the DHCP server that could cause this "forgetfulness"? Maybe it was restored from a backup or snapshot (if it's a VM), or was otherwise offline for a little while? You might try releasing and re-assigning the IP address to see if that resolves the issue.

MAAS 2.x fixes this issue by design, in that we no longer rely on parsing the lease file; we write out a static configuration for the DHCP server. (Though the lingering "discovered" DHCP leases may still be seen at times.)

tags: added: canonical-bootstack
David Lawson (deej) wrote :

The dynamic and auto-assigned ranges are the same at this site. Note that the discovered IP is in the same CIDR block as the assigned IP; the problem we're seeing is nodes coming up on that discovered IP rather than the one that MaaS has assigned and is using in DNS, etc.

The DHCP server is on the same machine as MaaS and, as far as I know, hasn't been restored from backup or been unavailable for any period of time during which MaaS was available. The example I pulled that node config from was a machine I'd redeployed early in the day. I believe the DHCP leases file was edited by hand a couple of times in a desperate attempt to make the dynamic addresses go away, but the issue was manifesting prior to that, at least as far as I know. What other situations could have caused the MaaS server to get out of sync with DHCP? We've been deleting and re-adding nodes relatively regularly for various reasons (particularly MAC address updates); could that have been a cause? Is there a way we can force MaaS to write the DHCP config out fresh so we know we're at a known good state?

Mike Pontillo (mpontillo) wrote :

Ah. Well, it's not recommended to use MAAS in this way; that is why "static ranges" are in MAAS 1.x (and why MAAS 2.x introduces "reserved ranges", so you can tell MAAS which IP addresses should be used dynamically, and which addresses should be reserved for non-MAAS-managed nodes). If it's possible, I would try to narrow the dynamic range to something smaller, and add a static range to cover the automatically-assigned IP addresses.

Which version of MAAS is this causing you trouble on? I can think of some ways to hack the code to prevent the discovered addresses from showing up (which would have the side effect of preventing you from seeing DHCP IP addresses assigned to commissioning or deploying nodes). But it would depend on what version is being used. From the pastebin you sent, I think I can assume it's ~1.9.x; do you know the exact version?

Changed in maas:
status: New → Incomplete
Mike Pontillo (mpontillo) wrote :

*slaps forehead*; I see the bug title mentions MAAS 1.9, I forgot to scroll that far up. ;-) Knowing the exact release would still be nice, but I can check into where to hack the code to make this stop for you.

Changed in maas:
status: Incomplete → Opinion
status: Opinion → Triaged
importance: Undecided → High
milestone: none → 2.1.1
Mike Pontillo (mpontillo) wrote :

OK. So here's how to hack the code to disable the DHCP lease parsing and uploading from each rack (cluster) controller. (That is what starts the process that eventually causes the discovered addresses to show up.)

This change would mean that MAAS would not be able to determine which IP address the DHCP server assigned to a node set to DHCP, so the MAAS team cannot recommend or support this.

https://paste.ubuntu.com/23375728/

You would need to reboot the rack (cluster) after applying this patch to each machine it's running on.

To remove the discovered addresses from the region, you would need to do something like this:

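sudo maas-region shell
# The remaining lines are typed at the Python prompt that the command above opens.
from maasserver.models import StaticIPAddress
from maasserver.enum import IPADDRESS_TYPE
# Delete every address record whose allocation type is DISCOVERED.
StaticIPAddress.objects.filter(alloc_type=IPADDRESS_TYPE.DISCOVERED).delete()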

I cannot overstate how completely unsupported these hacks are. I have not even tested them myself on a MAAS 1.9 installation. Proceed with extreme caution.

Also note that in MAAS 2.0 this configuration is not supported. In 1.9, the dynamic range must not come from the same allocation pool as the static range. (I assume the way this was set up in 1.x is by not specifying a static range at all - just leaving it blank?)

David Lawson (deej) wrote :

It's 1.9.4+bzr4592-0ubuntu1~trusty1. I'll have to talk with the BootStack folks and see if there's a way we can split the range to limit dynamic assignments to the high end of the assigned range and leave the automatic assignments on the low end. I know they're planning to do a re-deploy for this customer, so it's possible we could do that without too much disruption.

Would those smaller network blocks be configured as subnets in MaaS then?

Mike Pontillo (mpontillo) wrote :

MAAS does not require a static or dynamic range to fill an entire subnet CIDR. Each cluster (rack) interface is configured with a non-overlapping range of IP addresses to use for the static and dynamic ranges. That way, the DHCP server cannot be confused by static leases happening within its dynamic range, which leads to "interesting" race conditions like what you see here. In MAAS 1.9 you don't have the flexibility to configure more than one subnet with a range, but you can do that in MAAS 2.x since we enable DHCP on a per-VLAN basis, and look at what subnets are inside the VLAN to generate the configuration.
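The non-overlap requirement described above can be checked before changing the cluster interface configuration. This is only a sketch using Python's standard ipaddress module; the function name is hypothetical, and the 10.116.8.0/22 subnet and the example ranges are assumptions chosen so that both addresses mentioned in this report fall inside them. They are not the customer's actual configuration.

```python
from ipaddress import ip_address, ip_network


def check_ranges(cidr, static_range, dynamic_range):
    """Return a list of problems; an empty list means the layout looks sane."""
    net = ip_network(cidr)
    s_lo, s_hi = (ip_address(a) for a in static_range)
    d_lo, d_hi = (ip_address(a) for a in dynamic_range)
    problems = []
    for label, lo, hi in (("static", s_lo, s_hi), ("dynamic", d_lo, d_hi)):
        if lo > hi:
            problems.append(f"{label} range is reversed")
        if lo not in net or hi not in net:
            problems.append(f"{label} range falls outside {net}")
    # Two inclusive ranges overlap when each one starts before the other ends.
    if s_lo <= d_hi and d_lo <= s_hi:
        problems.append("static and dynamic ranges overlap")
    return problems


# Assumed layout: automatic (static) addresses low, dynamic addresses high.
split = check_ranges(
    "10.116.8.0/22",
    static_range=("10.116.8.10", "10.116.9.255"),
    dynamic_range=("10.116.10.0", "10.116.11.254"),
)

# The layout described in this report: one shared range for both purposes.
shared = check_ranges(
    "10.116.8.0/22",
    static_range=("10.116.8.10", "10.116.11.254"),
    dynamic_range=("10.116.8.10", "10.116.11.254"),
)

print(split)
print(shared)
```

The first call reports no problems, while the second reports the overlap that lets static leases land inside the dynamic range.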

Changed in maas:
status: Triaged → Invalid
no longer affects: maas/2.0
Changed in maas:
importance: High → Undecided
Mike Pontillo (mpontillo) wrote :

Marked this Invalid for MAAS 2.x since MAAS 2.x does not allow this configuration.

summary: - MaaS 1.9 not deleting discovered addresses from commissioning
+ [1.9] Static/Automatic IP addresses inside the dynamic range conflict
+ with DHCP lease uploads
Andres Rodriguez (andreserl) wrote :

@David,

Let me understand this correctly:

1. There is *no* static range configured.
2. The machine interface is set to "Auto Assign" and an IP from the /dynamic/ range is being selected?
3. e/n/i is configured to 'dhcp' instead of 'static' ?

Changed in maas:
status: Invalid → Won't Fix
David Lawson (deej) wrote :

@Andres, yeah, though it's a little more complex than that. These are OpenStack compute nodes that need bridges and since MaaS 1.9 doesn't support configuring bridges directly, the deployment scripts pull an interfaces file onto the machine post-deployment to set them up.

There is no static range configured. The interface is DHCPing post-deployment (as it should, since it's configured to do so via the /e/n/i pulled down at the end of the deploy), but it's being leased the same IP it had when it commissioned rather than the one assigned to it at deployment time and placed in DNS.

So, in the case of the pastebin I posted in the original comment: when that machine commissioned, it got the IP 10.116.11.165 from the DHCP server for the commissioning process. When it deployed, it was auto-assigned 10.116.8.62; that's the IP it has in DNS, what shows up as assigned to eth2 in the MaaS UI, etc. When that machine was next rebooted, a day or so after it was deployed, it came up on 10.116.11.165 while DNS and the MaaS UI still had it listed as 10.116.8.62.

Does that clarify the situation somewhat?

Mike Pontillo (mpontillo) wrote :

Just a few theories:

DHCP servers tend to assign the same address to a node that they assigned previously. The DHCP server may think the node has still leased the address it used for commissioning, assuming the lease has not expired, and hand that back out. Also, DHCP clients tend to request the same IP address that they had previously before trying to lease a new IP address (and I'm not sure what the behavior of the ISC client is here), so even if the lease has expired, if the client tries to renew an address it was previously assigned, the DHCP server might allow it.
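The theory above can be illustrated with a toy model. This is a deliberately simplified sketch and not how ISC dhcpd is implemented: the class, its methods, and the MAC address are hypothetical, and the only behaviors modeled are the ones described in the paragraph (a host entry wins if the server knows about it; otherwise the server honors the address the client requests or remembers what the client had before, even after that lease has expired).

```python
class ToyDhcpServer:
    """Simplified illustration of address selection; not ISC dhcpd."""

    def __init__(self, pool):
        self.pool = list(pool)    # addresses available for dynamic leases
        self.reservations = {}    # mac -> fixed address (host entries)
        self.last_lease = {}      # mac -> address handed out previously
        self.holder = {}          # address -> mac currently holding it

    def _lease(self, mac, address):
        self.holder[address] = mac
        self.last_lease[mac] = address
        return address

    def request(self, mac, requested=None):
        # A host entry wins, provided the server actually has one.
        if mac in self.reservations:
            return self.reservations[mac]
        # Otherwise prefer what the client asks for, then what it had before.
        for candidate in (requested, self.last_lease.get(mac)):
            if candidate in self.pool and self.holder.get(candidate, mac) == mac:
                return self._lease(mac, candidate)
        # Fall back to the first free address in the pool.
        for candidate in self.pool:
            if candidate not in self.holder:
                return self._lease(mac, candidate)
        return None  # pool exhausted

    def expire(self, mac):
        """Free the client's address but keep the memory of what it had."""
        address = self.last_lease.get(mac)
        if address is not None and self.holder.get(address) == mac:
            del self.holder[address]


mac = "00:16:3e:aa:bb:cc"  # made-up MAC address
server = ToyDhcpServer(["10.116.11.165", "10.116.11.166"])

commissioning_ip = server.request(mac)  # first free address in the pool
server.expire(mac)                      # the commissioning lease runs out

# No host entry reached the server: the old address comes back.
without_entry = server.request(mac, requested=commissioning_ip)

# With the host entry in place, the assigned address is used instead.
server.reservations[mac] = "10.116.8.62"
with_entry = server.request(mac, requested=commissioning_ip)

print(commissioning_ip, without_entry, with_entry)
```

In this model the node only returns to its commissioning address when the host entry is missing from the server, which matches the suspicion that the static mapping is being lost somewhere between MAAS and the DHCP server.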

I think the above subtleties, plus the fact that MAAS 1.x uses a runtime mechanism to inform the DHCP server of new leases, lead to a "perfect storm" situation where the IP address assigned isn't the one the DHCP client decides to use.

Not sure it's possible, but if you can move to MAAS 2.x, I would recommend that. Many of these issues were addressed (by design, not by a bugfix that can be backported) in 2.x.

David Lawson (deej) wrote :

Yeah, the weird part about this is that these nodes tend to end up on their commissioning IP again well after the lease should have expired. The situations we've seen it in have been ones where the lease of the auto-assigned IP has expired, which should by all rights be happening well after the lease on the commissioning IP has expired as well.

We certainly have aspirations to move to MaaS2 but I'm not sure that's going to be feasible with this site for this customer.

Changed in maas:
milestone: 2.1.1 → none
Andres Rodriguez (andreserl) wrote :

We believe that this is no longer an issue in the latest releases of MAAS. If you believe this is still an issue, please re-open this bug report and target it accordingly.
