observed IP is not released if dynamic range is modified before lease expires

Bug #1896292 reported by Rodrigo Barbieri on 2020-09-18
Affects: MAAS
Importance: High
Assigned to: Unassigned

Bug Description

Normal behavior:

When commissioning a machine, IPs from the reserved dynamic range are used. During commissioning, the IPs are listed in the subnet page as "Observed". After commissioning is complete, the IPs remain in the list for 10 minutes (the default lease time) until their leases expire, and are then released back to the reserved dynamic range pool.

Bug:

If, at any time between the start of commissioning and the lease expiry, the reserved dynamic range is modified so that it no longer includes the dynamic IP used by the machine, the IP is not released when the lease expires. The IP then stays listed as "Observed" in the subnet list indefinitely and cannot be used anywhere else (for example, since it is no longer in the dynamic range, it should be usable as a static IP), because MaaS rejects it, saying the IP is already in use.

It has been validated that the problem persists even when the machine is powered off, in "Ready" status, and 10 minutes have passed.

How it should behave:

The IP should get released when its lease expires since it was allocated from the dynamic range. Therefore it would not be shown in the subnet list anymore and could be used as a static IP afterwards.

Ideally, the modification to the reserved dynamic range should not cause the issue.

Alternatively, modification of the reserved dynamic range could be disallowed while there are IPs still waiting for their leases to expire.

Workaround:

The only way to address the problem is either re-commissioning the machine, or deleting it. Then the IP gets removed from the list and freed so it can be re-used.

The workaround is acceptable if the machine's role is not important, but it is very unpleasant if it is part of a production cloud.

Tested on MaaS versions:

- 2.7.3-8291-g.384e521e6-0ubuntu1~18.04.2
- 2.9.0~beta2-8933-g.5b0d5b3d7

Steps to reproduce:

1) Commission the machine
2) Wait for the machine to obtain dynamic IP or be successfully commissioned
3) Within 10 minutes after commissioning has started/completed, change the dynamic range to not include the selected IP
4) Wait 10 minutes (since machine commissioning is complete) for lease to expire
5) IP remains instead of being released

Tags: sts

Investigating this further, part of the reason the bug happens is that MaaS keeps track of the leases in the file /var/lib/maas/dhcp/dhcpd.leases, which is read by the dhcpd process.

In that file there are instructions for what to run when the lease expires, such as:

lease 192.168.105.222 {
  starts 5 2020/09/25 19:30:28;
  ends 5 2020/09/25 19:40:28;
  cltt 5 2020/09/25 19:30:28;
  binding state active;
  next binding state free;
  rewind binding state free;
  hardware ethernet 52:54:00:13:2c:8d;
  set clht = "ubuntu";
  set cllt = "600";
  set clip = "192.168.105.222";
  set clhw = "52:54:0:13:2c:8d";
  set vendor-class-identifier = "Linux ipconfig";
  client-hostname "ubuntu";
  on expiry {
    set clhw =
       binary-to-ascii (16, 8, ":",
                        substring (hardware, 1, 6)) ;
    set clip =
       binary-to-ascii (10, 8, ".", leased-address) ;
    execute ("/usr/sbin/maas-dhcp-helper", "notify", "--action", "expiry",
        "--mac", clhw, "--ip-family", "ipv4", "--ip", clip, "--socket", "/var/lib/maas/dhcpd.sock");
  }
  on release {
    set clhw =
       binary-to-ascii (16, 8, ":",
                        substring (hardware, 1, 6)) ;
    set clip =
       binary-to-ascii (10, 8, ".", leased-address) ;
    execute ("/usr/sbin/maas-dhcp-helper", "notify", "--action", "release",
        "--mac", clhw, "--ip-family", "ipv4", "--ip", clip, "--socket", "/var/lib/maas/dhcpd.sock");
  }
}

So, it invokes /usr/sbin/maas-dhcp-helper when it expires, triggering an RPC action.
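As a rough illustration of what "the lease expires" means in terms of the entry above, here is a minimal Python sketch that reads the "ends" timestamp of each lease block and reports which IPs are past it. The function names parse_lease_ends and expired_ips are hypothetical and not part of MaaS, and the sketch assumes the timestamps are written in UTC in the layout shown above.

```python
from datetime import datetime, timezone


def parse_lease_ends(text):
    """Map each leased IP to the 'ends' time of its lease entry.

    Hypothetical helper, not MaaS code. Assumes timestamps are in UTC and
    formatted as "ends <weekday> YYYY/MM/DD HH:MM:SS;". If an IP appears
    more than once, the last entry in the text wins.
    """
    ends = {}
    current_ip = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("lease ") and line.endswith("{"):
            current_ip = line.split()[1]
        elif line.startswith("ends ") and current_ip is not None:
            parts = line.rstrip(";").split()
            if len(parts) != 4:
                # Skip forms this sketch does not handle, e.g. "ends never;".
                continue
            stamp = datetime.strptime(
                parts[2] + " " + parts[3], "%Y/%m/%d %H:%M:%S"
            )
            ends[current_ip] = stamp.replace(tzinfo=timezone.utc)
    return ends


def expired_ips(ends, now):
    """Return the IPs whose lease end time is at or before 'now'."""
    return sorted(ip for ip, end in ends.items() if end <= now)
```

With the entry shown above, the lease for 192.168.105.222 would count as expired for any "now" at or after 2020/09/25 19:40:28 UTC.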

However, when the reserved dynamic range is modified, the dhcpd process is reloaded and the file /var/lib/maas/dhcp/dhcpd.leases is cleared. Therefore the actions defined in the file will not be run.

Also, after modifying the reserved dynamic range, I tried running the expiry and release actions manually, but they still do not clear the observed IP. There is something else in MaaS preventing it from being cleared.

Problem was in [1]. For the normal scenario, the expiry message triggers [2] and then [3]. By removing the check in [1], manually running the expiry trigger results in the same code path, and resolves the issue.

However, there is no trigger for that message, as /var/lib/maas/dhcp/dhcpd.leases was previously cleared. A new routine would need to be created (perhaps a periodic one, to clean up those IPs). Its logic would be pretty much the same as the code path followed by the expiry message.

[1] https://github.com/maas/maas/blob/04a456af79ebdfd4858bd317f48e42959cc97670/src/maasserver/rpc/leases.py#L120

[2] https://github.com/maas/maas/blob/04a456af79ebdfd4858bd317f48e42959cc97670/src/maasserver/rpc/leases.py#L152

[3] https://github.com/maas/maas/blob/04a456af79ebdfd4858bd317f48e42959cc97670/src/maasserver/rpc/leases.py#L212

Re-tested, this time modifying the reserved dynamic range (which causes a restart and a cleanup of the leases file) without removing the in-use IPs from the range. The result is that the leases file keeps the entries for the in-use IPs; only the entries for IPs removed from the range are cleaned up.

Digging further, I found that the cause is actually the range defined in /var/lib/maas/dhcpd.conf, which MaaS rewrites whenever the reserved dynamic range is set, after which maas-dhcpd is restarted. DHCPD then removes the lease entries that are not within the defined range, causing the issue.
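To make the effect concrete, the following minimal Python sketch lists which currently leased IPs would fall outside a new dynamic range, and would therefore lose their lease entries (and expiry hooks) on restart. The function name outside_range and the range bounds in the usage note are made up for illustration; this is not how MaaS or dhcpd implement the check.

```python
from ipaddress import ip_address


def outside_range(leased_ips, range_start, range_end):
    """Return the leased IPs that are not within [range_start, range_end].

    Hypothetical helper for illustration only.
    """
    start, end = ip_address(range_start), ip_address(range_end)
    return [ip for ip in leased_ips if not (start <= ip_address(ip) <= end)]
```

For example, if the range were shrunk to 192.168.105.100 through 192.168.105.200 (hypothetical bounds), the lease for 192.168.105.222 from the earlier comment would be reported as outside the range.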

Considering now that both DHCPD and MaaS are doing the right thing, there doesn't seem to be a way to avoid the issue. The only alternative is to clean up the IPs left behind in MaaS so they can be reused.

The "Last seen" timestamp needs to be validated before cleaning up those IPs (perhaps by doing a ping test or attempting active discovery), because if those timestamps are derived directly from the leases, they cannot be trusted.

Igor Gnip (igorgnip) wrote :

Could MAAS analyse modifications to dynamic ranges as they happen and do the cleanup then? Even if the IP is still in use, an admin should be able to remove the range and trigger cleanup of the observed IPs.

Another idea: the leases are not just cleared, the DHCP config is altered. Can a hook be configured in dhcpd to also invoke the DHCP helper for all IPs that fall outside the range under the new configuration?

Just throwing some ideas. The suggested ones are also fine.
Most of all, I would also ask about static/dynamic IPs hiding behind a bond or bridge. Can you please spend some time checking whether those get cleared? If I remember correctly, we found one instance hanging on a secondary physical interface that is part of a bond.

Regards,
Igor

Björn Tillenius (bjornt) wrote :

I don't think that we should remove the old IP addresses when the dynamic range is changed, since the dhcpd service did hand those out with a specified lease time. Until that lease expires, the IP address may be used by the client.

But we do store the lease time in the database. I think we need a service that runs and cleans up any IP where the lease has expired.

In addition to that, we should probably have the new service clean up OBSERVED/DISCOVERED addresses without a lease that were last seen more than X minutes ago, where X can be configured by the user.
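The two criteria described above could be sketched roughly as follows. This is only an illustration in plain Python: the ObservedIP class, its fields, and should_clean_up are hypothetical names, not the MAAS data model, and max_unseen stands in for the user-configurable "X minutes".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ObservedIP:
    """Hypothetical record for an OBSERVED/DISCOVERED address."""
    ip: str
    lease_expires: Optional[datetime]  # None when no lease is stored
    last_seen: datetime


def should_clean_up(entry: ObservedIP, now: datetime, max_unseen: timedelta) -> bool:
    """Apply the two proposed criteria to one address."""
    if entry.lease_expires is not None:
        # Criterion 1: the stored lease has expired.
        return entry.lease_expires <= now
    # Criterion 2: no lease, and not seen for at least max_unseen.
    return now - entry.last_seen >= max_unseen
```

A periodic service would then run this check over all observed addresses and release the ones for which it returns True.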

Currently we don't clear out OBSERVED/DISCOVERED IP addresses, but if we run out of free IPs, we start using the OBSERVED/DISCOVERED IPs that haven't been seen in a while, i.e. use oldest first.

Changed in maas:
status: New → Triaged
importance: Undecided → Medium
importance: Medium → High
Björn Tillenius (bjornt) wrote :

Another, maybe complementary, fix to solve your problem would be to still allow OBSERVED addresses to be used as static addresses, but with a warning. Or maybe even add a way to manually remove the OBSERVED address.

Discussed today with the MaaS team. I personally lean more towards having an automated service that periodically checks for and removes OBSERVED IPs, with or without leases, according to their last seen date (plus a ping check if it makes sense).

Waiting for more evaluation from the MaaS team on the best approach and best place to add the service/check/hook.

Bill Wear (billwear) wrote :

based on conversations today, i don't think there is any documentation action that should take place at the moment. i will monitor this bug and be ready to document the eventual solutions, whatever they may be.

Discussed today with the MaaS team. The ideal fix is agreed to be the service that monitors the observed IPs and cleans them up based on specific criteria. Right now, the quickest solution (albeit not the ideal) is assumed to be to use the Release API, but that is not working due to bug https://bugs.launchpad.net/maas/+bug/1898122

Fixing that bug is expected to be simpler and quicker than implementing the service, for the purpose of having an interim solution for this issue available as soon as possible.

Igor Gnip (igorgnip) wrote :

I still believe that allowing static IP assignment to override the observed IP is required.
Otherwise, anyone assigning a static IP who lands on an observed IP in MAAS has to handle the error, perform the release IP call (which is broken), and then retry the operation,
as opposed to simply relying on MAAS to check the observed IP, see that it meets the criteria defined for the 'service cleanup', and clean it up immediately.
