Subnet changed to wrong fabric, impacting DHCP

Bug #2031482 reported by Gregory Orange
This bug affects 4 people
Affects   Status         Importance   Assigned to        Milestone
MAAS      Fix Committed  High         Jacopo Rota
3.2       Fix Committed  High         Jacopo Rota
3.3       In Progress    High         Björn Tillenius
3.4       Fix Committed  High         Jacopo Rota
3.5       Fix Committed  High         Jacopo Rota

Bug Description

Context:
2x rackd, 1x regiond
Ubuntu 22.04, packages, 1:3.3.3-13184-g.3e9972c19-0ubuntu1~22.04.1

https://discourse.maas.io/t/why-did-a-subnet-change-to-a-different-fabric/7327 describes the issue, and therein we were asked to submit this bug report.

After DHCP stopped working on a subnet served by our second rackd controller, we discovered that that subnet had somehow been moved into the wrong fabric. This meant that `dhcpd-interfaces` no longer had the relevant interface (a bridge named admin), and dhcpd.conf had no reference to that subnet.

When we moved the subnet back to the correct fabric, it all started working again as expected.

We did not look at dhcpd.conf on the first rackd controller, which houses that other fabric. The subnet in that other fabric continued to work fine during this time.

We do not know how the subnet was reconfigured. Only three people here know how to do that, and all know enough not to make the 6 or so clicks required to make the change in the web UI. We do not easily know how to do it via the CLI.

Soon I will attach logs from the times in question. From the virsh errors in the logs we believe that the problem began at or very close to 2023-08-09T11:17:31.421297+08:00. Note that the second rackd controller and the regiond controller use UTC+8, while the first rackd controller uses UTC.


Gregory Orange (gregoryo2017) wrote (last edit ):

We resolved the issue at around 2023-08-14T09:21:36.043193+08:00, when the first virsh state change `Power state has changed from error to on.` was logged.

This (and the errors that preceded it) is because the KVM Pod nodes hosting these machines lost the IP on their admin interface as part of the DHCP outage, then regained it once it was fixed.

Gregory Orange (gregoryo2017) wrote :

The log files covering the time period are 85 MB gzipped, so I have uploaded them to Google Drive. I gave each file a prefix indicating whether it came from rackd1, rackd2, or the region controller.

https://drive.google.com/file/d/1BRVhrKQLsaxR8rhk1ne3lfQ5feBfEYr7/view?usp=sharing

Let me know if you would prefer that I upload the file(s) directly here.

Gregory Orange (gregoryo2017) wrote :

Is it somehow relevant that it was fabric 0 that the subnet was erroneously moved to? Some kind of error and default selection to the first one?

Christian Grabowski (cgrabowski) wrote :

This exception is repeating in the regiond logs:

2023-08-13 00:00:47 maasserver: [error] Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 293, in view_atomic_with_post_commit_savepoint
    return view_atomic(*args, **kwargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 62, in __call__
    response = super().__call__(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 20, in inner_func
    response = func(*args, **kwargs)
  File "/usr/lib/python3.10/dist-packages/piston3/resource.py", line 197, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/usr/lib/python3.10/dist-packages/piston3/resource.py", line 195, in __call__
    result = meth(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 370, in dispatch
    return function(self, request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 860, in signal
    target_status = process(node, request, status)
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 682, in _process_commissioning
    self._store_results(
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 565, in _store_results
    script_result.store_result(
  File "/usr/lib/python3/dist-packages/metadataserver/models/scriptresult.py", line 270, in store_result
    assert self.status in SCRIPT_STATUS_RUNNING_OR_PENDING
AssertionError

This is likely due to an error when fetching hardware info for the region; if the region controller has an IP within the subnet in question, the subnet can be erroneously moved for this reason. Can you confirm whether the region controller does in fact have an IP in this subnet?

Changed in maas:
status: New → Incomplete
Gregory Orange (gregoryo2017) wrote :

No, the region controller has only one IP, in subnet 28, fabric 20.

For comparison, the two rack controllers have IPs in a range of subnets they serve, but not in subnet 28.

Gregory Orange (gregoryo2017) wrote :

It happened again, so here is some more context. DHCP leases on the subnet in question are set to 1 hour.

2023-09-06 AWST

11:39:39 We restarted maas-rackd daemon on the 2nd (acacia) rackd server
13:17:21 DHCP leases started to drop and we went about discovering the problem
13:45 We set subnet 63 from fabric 0 back to fabric 55 where it was before
13:47:41 DHCP leases began to be reissued

Here are the logs. I trimmed syslog a little to only show more recent entries.
https://drive.google.com/file/d/1_b7KueXlTPuayFv7u8KDbLbIpNDH6JLX/view?usp=drive_link
Remember that nimbus rackd1 logs are in UTC, while the others are in AWST UTC+8.

Gregory Orange (gregoryo2017) wrote :

It has just happened again, with no restarts to any daemons.

Gregory Orange (gregoryo2017) wrote :

We increased our DHCP lease (with a snippet) to 2 weeks while we are experiencing this issue, to reduce its impact. I have just discovered that it has occurred again.

Changed in maas:
status: Incomplete → New
Bill Wear (billwear)
Changed in maas:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 3.5.0
Changed in maas:
milestone: 3.5.0 → 3.5.x
David Torrey (dotj) wrote :

I have a customer reporting very similar behavior, having upgraded stepwise from 2.9.3 to 3.2.10.

Happy to provide more troubleshooting information from the support case out of band.

Jorge Merlino (jorge-merlino) wrote :

Hi all, I've been reviewing the possible causes for this error and I've noticed logs like these in the syslog for some machines (see for example logs on comment #6):

IP address [146.118.58.161] subnet 146.118.58.0/23: VLAN updated (vlan=5001)

This log comes from this function in src/maasserver/models/signals/interfaces.py:

def interface_vlan_update(instance, old_values, **kwargs):
    """When an interfaces VLAN is changed we need to do the following.

    * If its a controller move all assigned subnets to the new VLAN. This
      is done because the subnets defined are discovered on the
      controller and an administrator cannot change them.
    * If its a machine or device then we need to remove all links if the
      VLAN is different.
    """

This function behaves differently depending on whether the node is a controller or not, but this log line:

log.msg(
   "%s: IP address [%s] subnet %s: "
   "VLAN updated (vlan=%s)." ...

only occurs on the controller path, and we are seeing it on non-controller machines. The two cases are separated by this check:

if not node.is_commissioning():

which supposedly indicates that the node is a controller when it is true. It is not intuitive that `not is_commissioning()` means the node is a controller, so it is possible that this does not always hold, and the controller code path is sometimes taken for non-controller nodes.
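
To make the concern concrete, here is a minimal, self-contained sketch. The classes are hypothetical stand-ins that only mirror the quoted is_commissioning() logic, not the real MAAS models: a machine that is already DEPLOYED also satisfies `not node.is_commissioning()`, so it takes the same branch as a controller.

from enum import Enum


class NodeStatus(Enum):
    COMMISSIONING = 1
    DEPLOYING = 2
    DEPLOYED = 3


class FakeNode:
    """Hypothetical stand-in for the MAAS Node model; it mirrors only the
    quoted is_commissioning() logic."""

    def __init__(self, node_type, status):
        self.node_type = node_type
        self.status = status

    def is_commissioning(self):
        return self.status not in (NodeStatus.DEPLOYED, NodeStatus.DEPLOYING)


machine = FakeNode("MACHINE", NodeStatus.DEPLOYED)
rack = FakeNode("RACK_CONTROLLER", NodeStatus.DEPLOYED)

for node in (machine, rack):
    if not node.is_commissioning():
        # Both the deployed machine and the controller land here, so the
        # "move all assigned subnets to the new VLAN" path is not limited
        # to controllers.
        print(node.node_type, "takes the controller path")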

Björn Tillenius (bjornt) wrote :

> if not node.is_commissioning():
>
> which supposedly indicates that the node is a controller when it is true. It is not intuitive that
> `not is_commissioning()` means the node is a controller, so it is possible that this does not always
> hold, and the controller code path is sometimes taken for non-controller nodes.

This logic is also executed for hardware-sync. Do you have any nodes deployed with hardware-sync enabled?

Igor Gnip (igorgnip) wrote :

We do not use hardware sync and it is not enabled.
The logic works like this:
DHCP serves PXE boot or DHCP requests to a machine.
Whatever VLAN ID is configured on the switch port for that machine determines from which VLAN the DHCP server receives the request and which IP address it assigns (no IP overlap limitation).
The DHCP server politely informs MAAS what was done, i.e. which IP and VLAN were assigned/observed.
In interfaces.py, the logic that 'fixes' things makes global adjustments to VLANs and subnets. In our case a subnet was shifted to a different VLAN, but the exact outcome probably depends on topology.

I am 99% sure that `if not node.is_commissioning()` is incorrect for determining whether a server is a rack or region controller, and it is triggering this global shuffle of subnets.

Igor Gnip (igorgnip) wrote :

    @property
    def is_controller(self):
        return self.node_type in [
            NODE_TYPE.REGION_CONTROLLER,
            NODE_TYPE.REGION_AND_RACK_CONTROLLER,
            NODE_TYPE.RACK_CONTROLLER,
        ]

    def is_commissioning(self):
        return self.status not in (NODE_STATUS.DEPLOYED, NODE_STATUS.DEPLOYING)

Are there any obstacles to replacing the check with `is_controller` and testing the behavior?
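
For illustration only, this is roughly what such a guard could look like inside the VLAN-update handler. The node lookup and the helper names (move_assigned_subnets_to_new_vlan, remove_links_on_vlan_change) are hypothetical placeholders, not the actual MAAS code:

def interface_vlan_update(instance, old_values, **kwargs):
    # Placeholder for however the handler resolves the owning node.
    node = instance.get_node()
    if node.is_controller:
        # Only controllers should drag their assigned subnets along to the
        # new VLAN, because their subnets are discovered rather than set by
        # an administrator.
        move_assigned_subnets_to_new_vlan(instance)  # hypothetical helper
    else:
        # Machines and devices just drop links that no longer match the VLAN.
        remove_links_on_vlan_change(instance)  # hypothetical helper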

Jacopo Rota (r00ta)
tags: added: bug-council
Changed in maas:
importance: Medium → High
Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
Jacopo Rota (r00ta)
Changed in maas:
status: Triaged → In Progress
Jacopo Rota (r00ta) wrote :

I confirm I'm able to reproduce this. I'm investigating the issue.

Jacopo Rota (r00ta)
Changed in maas:
milestone: 3.5.x → 3.6.0
Jacopo Rota (r00ta) wrote :

The bug was introduced here: https://code.launchpad.net/~ack/maas/+git/maas/+merge/406360
In short, in MAAS < 3.3 we moved the subnets to the new VLAN only when the VLAN of an interface of a CONTROLLER was changed. From MAAS >= 3.3, whenever the VLAN of an interface is updated, we move the subnet to the new VLAN.

In order to fix this, we first have to fix https://bugs.launchpad.net/maas/+bug/2064281 in 3.5 and backport it to 3.4. Once the backport has landed, we can fix this bug.

tags: removed: bug-council
Jorge Merlino (jorge-merlino) wrote :

I just wanted to mention that this bug is also present in MAAS 3.2 as the mentioned patch is also present on the 3.2 branch.

Changed in maas:
status: In Progress → Fix Committed
Björn Tillenius (bjornt) wrote :

I'm reverting the backport for 3.3 for now. We've released 3.4.2 without this fix, and now it turns out that we will land some more fixes to 3.3 in order to release 3.3.7.

We don't want to land a fix in 3.3.7 that isn't in the latest 3.4 release.

I'll re-land the backport after 3.3.7 is released.
