IP exhaustion during deployment of machine

Bug #2050200 reported by Jacopo Rota
This bug affects 2 people
Affects: MAAS
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

There is an environment running MAAS 3.3.4, and there is a /21 subnet with more than 400 IPs available.

During some deployments the machines are not getting the `addresses` part of the netplan configuration populated.
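
A quick way to spot this symptom in a rendered config is to look for interfaces that carry a gateway but no `addresses` key. The following is only a minimal sketch, assuming the netplan YAML has already been parsed into a Python dict; the helper name `interfaces_missing_addresses` and the example interface names and addresses are hypothetical and not part of MAAS or netplan.

```python
def interfaces_missing_addresses(config):
    """Return names of ethernet interfaces that have a gateway but no addresses."""
    ethernets = config.get("network", {}).get("ethernets", {})
    missing = []
    for name, settings in ethernets.items():
        has_gateway = "gateway4" in settings or "gateway6" in settings
        # Only the interface-level "addresses" key counts here; the one nested
        # under "nameservers" is the DNS server list, not the interface IPs.
        if has_gateway and not settings.get("addresses"):
            missing.append(name)
    return missing


# Hypothetical, trimmed-down config showing the symptom on eth0.
example = {
    "network": {
        "version": 2,
        "ethernets": {
            "eth0": {
                "gateway4": "192.0.2.254",
                "nameservers": {"addresses": ["192.0.2.53"]},
            },
            "eth1": {"mtu": 1500},
        },
    }
}
print(interfaces_missing_addresses(example))
```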

summary: - IP exaustion during deployment of machine
+ IP exhaustion during deployment of machine
Jacopo Rota (r00ta)
Changed in maas:
status: New → Triaged
Alan Baghumian (alanbach) wrote :

Hello,

This is the Netplan file from the installation output:

writing to file /tmp/tmpy_etcicq/state/network_config with network config: network:
  ethernets:
    enp66s0:
      gateway4: 10.194.175.254
      match:
        macaddress: e8:80:88:08:17:76
      mtu: 1500
      nameservers:
        addresses:
        - 10.194.175.49
        - 10.194.175.92
        - 10.229.9.99
        - 10.231.1.11
        search:
        - spse-maas
      set-name: enp66s0
    enx6ae46da5ca06:
      match:
        macaddress: 6a:e4:6d:a5:ca:06
      mtu: 1500
      set-name: enx6ae46da5ca06
  version: 2

I'm also attaching the full installation log to the bug report.

Please let me know if anything else is needed.

Best,
Alan

Jacopo Rota (r00ta) wrote :

Just to summarise my investigations so far:

1) When a deployment is started, the machine is powered on and MAAS tries to allocate an IP for the interfaces with AUTO IP.
2) There might be some cases in which MAAS fails to allocate the IP, and this is how we end up with no address in the Netplan config.

To reproduce this, you can force this function https://github.com/maas/maas/blob/deab73792a4fe839a2e84a926a6c728d510fc9ad/src/maasserver/models/node.py#L4626 to return as it does in the cases where no IP is allocated or the max retry limit is reached.

A hypothetical situation that can lead to this bug: MAAS picks an IP 3 times, and each time the address is already in use in the subnet by another device.

To triage this bug, I think we need to understand whether the customer is in a situation like the one I just described, or a similar one.

We can try to find the bug statically by looking at the code, but that might be time-consuming and we might not find the customer's issue.
IMO the best thing here is to add extra logging and extract an sos report when the issue is hit.
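
The hypothetical situation above can be sketched as a bounded retry loop. This is an illustration only, not the MAAS implementation: `allocate_auto_ip`, the `in_use` set, and the retry limit of 3 are assumptions taken from the description in this comment, and the addresses come from a documentation range.

```python
import ipaddress
from itertools import islice


def allocate_auto_ip(candidates, in_use, max_retries=3):
    """Try at most max_retries candidates; return None if all are taken."""
    for ip in islice(candidates, max_retries):
        if ip not in in_use:
            return ip
    # Retry limit reached: the caller ends up with no address to render.
    return None


subnet = ipaddress.ip_network("192.0.2.0/24")
hosts = list(subnet.hosts())

# The first three candidates are all used by other devices: allocation fails,
# even though the rest of the subnet is free.
print(allocate_auto_ip(hosts, in_use=set(hosts[:3])))

# Only the first two are used: the third attempt succeeds.
print(allocate_auto_ip(hosts, in_use=set(hosts[:2])))
```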

Jacopo Rota (r00ta) wrote :

I suspect we have a bug here https://github.com/maas/maas/blob/deab73792a4fe839a2e84a926a6c728d510fc9ad/src/maasserver/models/subnet.py#L814 . In short, at every retry we add the IP to an "excluded" array to ensure that we don't pick the same address again. However, we are not excluding such addresses when we pick an address from the neighbours, and from the customer's logs I see that we pick such addresses many times.

The next step is to confirm my assumption by reproducing this in a properly crafted environment and hardcoding some preconditions.
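
To make the suspicion concrete, here is a toy model of it. It is not the code at the link: `pick_address`, `retry`, and the `apply_exclusion_to_neighbours` flag are hypothetical names, and every attempt is assumed to fail the in-use check so that the retry path is exercised.

```python
def pick_address(free, neighbours, excluded, apply_exclusion_to_neighbours):
    """Pick from the free list first; otherwise fall back to the neighbours."""
    for ip in free:
        if ip not in excluded:
            return ip
    for ip in neighbours:
        if apply_exclusion_to_neighbours and ip in excluded:
            continue
        return ip
    return None


def retry(free, neighbours, apply_exclusion_to_neighbours, max_retries=3):
    """Return the addresses picked across retries when every attempt fails."""
    excluded, picks = [], []
    for _ in range(max_retries):
        ip = pick_address(free, neighbours, excluded, apply_exclusion_to_neighbours)
        if ip is None:
            break
        picks.append(ip)
        excluded.append(ip)  # remember the failed pick for the next retry
    return picks


neighbours = ["10.0.0.5", "10.0.0.6", "10.0.0.7"]
# Suspected behaviour: the exclusion is ignored on the neighbour path,
# so the same address is picked on every retry.
print(retry([], neighbours, apply_exclusion_to_neighbours=False))
# With the exclusion applied, each retry moves on to a different address.
print(retry([], neighbours, apply_exclusion_to_neighbours=True))
```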

Jacopo Rota (r00ta) wrote :

Looks like we have a bug here https://github.com/maas/maas/blob/dbd701455fa1045d7fbab45e4fc1daa139e4c6cb/src/maasserver/models/subnet.py#L792

```
>>> s.get_ipranges_not_in_use(with_neighbours=True)
MAASIPSet([])

>>> s.get_ipranges_not_in_use(with_neighbours=False)
MAASIPSet([MAASIPRange('10.245.208.255', '10.245.209.13' purpose={'unused'}), MAASIPRange('10.245.209.15', '10.245.210.7' purpose={'unused'}), MAASIPRange('10.245.210.9', '10.245.211.255' purpose={'unused'}), MAASIPRange('10.245.212.255', '10.245.217.106' purpose={'unused'}), MAASIPRange('10.245.217.108', '10.245.218.43' purpose={'unused'}), MAASIPRange('10.245.218.45', '10.245.221.125' purpose={'unused'}), MAASIPRange('10.245.221.127', '10.245.222.169' purpose={'unused'}), MAASIPRange('10.245.222.171', '10.245.222.181' purpose={'unused'}), MAASIPRange('10.245.222.183', '10.245.222.185' purpose={'unused'}), MAASIPRange('10.245.222.187', '10.245.222.188' purpose={'unused'}), MAASIPRange('10.245.222.191', '10.245.222.195' purpose={'unused'}), MAASIPRange('10.245.222.200', '10.245.222.200' purpose={'unused'}), MAASIPRange('10.245.222.203', '10.245.222.203' purpose={'unused'}), MAASIPRange('10.245.222.206', '10.245.222.206' purpose={'unused'}), MAASIPRange('10.245.222.208', '10.245.222.210' purpose={'unused'}), MAASIPRange('10.245.222.212', '10.245.222.212' purpose={'unused'}), MAASIPRange('10.245.222.214', '10.245.222.214' purpose={'unused'}), MAASIPRange('10.245.222.216', '10.245.222.223' purpose={'unused'}), MAASIPRange('10.245.222.225', '10.245.223.254' purpose={'unused'})])
>>>
```

This output was extracted from an internal lab.

Jacopo Rota (r00ta) wrote :

It turns out that get_ipranges_not_in_use(with_neighbours=True) is intended to treat neighbours as used addresses. The problem is that in some environments many IPs in the subnet end up in the neighbour table, which means that get_ipranges_not_in_use(with_neighbours=True) returns no available IPs.

In this case, the internal logic has MAAS call get_ipranges_not_in_use again with with_neighbours=False and pick the least recently seen neighbour address. This logic is far from optimal and can lead to other issues.
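
The behaviour described above can be modelled in a few lines. This is a simplified sketch, not the MAAS code: `unused_ips` and `next_ip` are hypothetical stand-ins, neighbours are assumed to be a mapping from address to a last-seen timestamp, and the subnet is a tiny documentation range.

```python
import ipaddress


def unused_ips(subnet, allocated, neighbours, with_neighbours=True):
    """List host addresses not in use; optionally count neighbours as used."""
    used = set(allocated)
    if with_neighbours:
        used |= set(neighbours)
    return [ip for ip in subnet.hosts() if ip not in used]


def next_ip(subnet, allocated, neighbours):
    """Prefer a truly unused IP; otherwise reuse the least recently seen neighbour."""
    free = unused_ips(subnet, allocated, neighbours, with_neighbours=True)
    if free:
        return free[0]
    # Fallback: ignore the neighbour table and take the stalest observation.
    free = set(unused_ips(subnet, allocated, neighbours, with_neighbours=False))
    stale = sorted((seen, ip) for ip, seen in neighbours.items() if ip in free)
    return stale[0][1] if stale else None


subnet = ipaddress.ip_network("192.0.2.0/30")
first, second = subnet.hosts()
neighbours = {first: 200, second: 100}  # address -> last-seen timestamp

# Every host address is a known neighbour, so nothing looks available.
print(unused_ips(subnet, [], neighbours, with_neighbours=True))
# The fallback hands out the neighbour that was seen longest ago.
print(next_ip(subnet, [], neighbours))
```

In this model the fallback address may still be live on the network, which is the kind of follow-on issue mentioned above.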

The quick workaround is to clean up the discoveries.

Here is my opinion on the real fix.
Hosts not managed by MAAS can get an IP from the MAAS DHCP server. That's fine, and we don't identify them as "neighbours".
If a user configures a static IP on a machine and MAAS spots it on the network, then it's a "neighbour".

Now, MAAS assumes it owns the network except for the reserved ranges. This means that we should not care about "discoveries" in the ranges we own: if, during the preflight check of the IP allocation, we spot that the address is in use, then we should consider it temporarily in use (and also tell the user that there is a host using an IP in the MAAS ranges) and move forward with another one. In other words, a host using an IP that is supposed to belong to MAAS should not be considered a "neighbour", and we should reconsider that address after a timeout.
If the user wants to set static IP addresses on hosts not managed by MAAS, a reserved range should be used.
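
A rough sketch of the proposal, to show the intended behaviour rather than an actual patch: the `TemporarilyInUse` class, the `allocate` function, the `probe_in_use` callback, and the 60-second timeout are all hypothetical.

```python
import time


class TemporarilyInUse:
    """Remember addresses that failed the preflight check, for a limited time."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self._seen = {}

    def mark(self, ip, now=None):
        self._seen[ip] = time.monotonic() if now is None else now

    def is_blocked(self, ip, now=None):
        now = time.monotonic() if now is None else now
        seen = self._seen.get(ip)
        if seen is None:
            return False
        if now - seen >= self.timeout:
            del self._seen[ip]  # timeout expired: reconsider the address
            return False
        return True


def allocate(candidates, probe_in_use, blocked, now=None):
    """Return the first candidate that is neither blocked nor found in use."""
    for ip in candidates:
        if blocked.is_blocked(ip, now):
            continue
        if probe_in_use(ip):
            blocked.mark(ip, now)
            print(f"warning: {ip} is in a MAAS-owned range but already in use")
            continue
        return ip
    return None


blocked = TemporarilyInUse(timeout_seconds=60)
candidates = ["10.0.0.1", "10.0.0.2"]
# 10.0.0.1 answers the preflight check, so it is skipped and flagged.
print(allocate(candidates, lambda ip: ip == "10.0.0.1", blocked, now=0))
print(blocked.is_blocked("10.0.0.1", now=10))  # still within the timeout
print(blocked.is_blocked("10.0.0.1", now=61))  # eligible again
```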
