Subnet changed to wrong fabric, impacting DHCP

Bug #2031482 reported by Gregory Orange
This bug affects 4 people
Affects   Status         Importance   Assigned to        Milestone
MAAS      Fix Committed  High         Jacopo Rota
3.2       Fix Committed  High         Jacopo Rota
3.3       In Progress    High         Björn Tillenius
3.4       Fix Committed  High         Jacopo Rota
3.5       Fix Committed  High         Jacopo Rota

Bug Description

Context:
2x rackd, 1x regiond
Ubuntu 22.04, packages, 1:3.3.3-13184-g.3e9972c19-0ubuntu1~22.04.1

https://discourse.maas.io/t/why-did-a-subnet-change-to-a-different-fabric/7327 describes the issue, and therein we were asked to submit this bug report.

After DHCP stopped working on a subnet served by our second rackd controller, we discovered that that subnet had somehow been moved into the wrong fabric. This meant that `dhcpd-interfaces` no longer had the relevant interface (a bridge named admin), and dhcpd.conf had no reference to that subnet.

When we moved the subnet back to the correct fabric, it all started working again as expected.

We did not look at dhcpd.conf on the first rackd controller, which houses that other fabric. The subnet in that other fabric continued to work fine during this time.

We do not know how the subnet was reconfigured. Only three people here know how to do that, and all know enough not to make the 6 or so clicks required to make the change in the web UI. We do not easily know how to do it via the CLI.

Soon I will attach logs from the times in question. From the virsh errors in the logs we believe that the problem began at or very close to 2023-08-09T11:17:31.421297+08:00. Note that the second rackd controller and the regiond controller use UTC+8, while the first rackd controller uses UTC.


Gregory Orange (gregoryo2017) wrote (last edit ):

We resolved the issue at around 2023-08-14T09:21:36.043193+08:00, when the first virsh state change `Power state has changed from error to on.` was logged.

This (and the errors that preceded it) is because the KVM Pod nodes hosting these machines lost the IP on their admin interface as part of the DHCP outage, then regained it once it was fixed.

Gregory Orange (gregoryo2017) wrote :

The log files covering the time period are 85 MB gzipped, so I have uploaded them to Google Drive. I gave each file a prefix indicating whether it came from rackd1, rackd2, or the region controller.

https://drive.google.com/file/d/1BRVhrKQLsaxR8rhk1ne3lfQ5feBfEYr7/view?usp=sharing

Let me know if you would prefer that I upload the file(s) directly here.

Gregory Orange (gregoryo2017) wrote :

Is it somehow relevant that it was fabric 0 that the subnet was erroneously moved to? Some kind of error and default selection to the first one?

Christian Grabowski (cgrabowski) wrote :

This exception is repeating in the regiond logs:

2023-08-13 00:00:47 maasserver: [error] Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 293, in view_atomic_with_post_commit_savepoint
    return view_atomic(*args, **kwargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 62, in __call__
    response = super().__call__(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 20, in inner_func
    response = func(*args, **kwargs)
  File "/usr/lib/python3.10/dist-packages/piston3/resource.py", line 197, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/usr/lib/python3.10/dist-packages/piston3/resource.py", line 195, in __call__
    result = meth(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 370, in dispatch
    return function(self, request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 860, in signal
    target_status = process(node, request, status)
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 682, in _process_commissioning
    self._store_results(
  File "/usr/lib/python3/dist-packages/metadataserver/api.py", line 565, in _store_results
    script_result.store_result(
  File "/usr/lib/python3/dist-packages/metadataserver/models/scriptresult.py", line 270, in store_result
    assert self.status in SCRIPT_STATUS_RUNNING_OR_PENDING
AssertionError

This is likely due to an error when fetching hardware info for the region; if the region controller has an IP within the subnet in question, the subnet can be erroneously moved for this reason. Can you confirm whether the region controller does in fact have an IP in this subnet?

Changed in maas:
status: New → Incomplete
Gregory Orange (gregoryo2017) wrote :

No, the region controller has only one IP, in subnet 28, fabric 20.

For comparison, the two rack controllers have IPs in a range of subnets they serve, but not in subnet 28.

Gregory Orange (gregoryo2017) wrote :

It happened again, so here is some more context. DHCP leases on the subnet in question are set to 1 hour.

2023-09-06 AWST

11:39:39 We restarted maas-rackd daemon on the 2nd (acacia) rackd server
13:17:21 DHCP leases started to drop and we went about discovering the problem
13:45 We set subnet 63 from fabric 0 back to fabric 55 where it was before
13:47:41 DHCP leases began to be reissued

Here are the logs. I trimmed syslog a little to only show more recent entries.
https://drive.google.com/file/d/1_b7KueXlTPuayFv7u8KDbLbIpNDH6JLX/view?usp=drive_link
Remember that nimbus rackd1 logs are in UTC, while the others are in AWST UTC+8.

Gregory Orange (gregoryo2017) wrote :

It has just happened again, with no restarts to any daemons.

Gregory Orange (gregoryo2017) wrote :

We increased our DHCP lease (with a snippet) to 2 weeks while we are experiencing this issue, to reduce its impact. I have just discovered that it has occurred again.

Changed in maas:
status: Incomplete → New
Bill Wear (billwear)
Changed in maas:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 3.5.0
Changed in maas:
milestone: 3.5.0 → 3.5.x
David Torrey (dotj) wrote :

I have a customer reporting very similar behavior, having upgraded stepwise from 2.9.3 to 3.2.10.

Happy to provide more troubleshooting information from the support case out of band.

Jorge Merlino (jorge-merlino) wrote :

Hi all, I've been reviewing the possible causes for this error and I've noticed logs like these in the syslog for some machines (see for example logs on comment #6):

IP address [146.118.58.161] subnet 146.118.58.0/23: VLAN updated (vlan=5001)

This log comes from this function in src/maasserver/models/signals/interfaces.py:

def interface_vlan_update(instance, old_values, **kwargs):
    """When an interfaces VLAN is changed we need to do the following.

    * If its a controller move all assigned subnets to the new VLAN. This
      is done because the subnets defined are discovered on the
      controller and an administrator cannot change them.
    * If its a machine or device then we need to remove all links if the
      VLAN is different.
    """

This function behaves differently depending on whether the node is a controller or not, but this log line:

log.msg(
   "%s: IP address [%s] subnet %s: "
   "VLAN updated (vlan=%s)." ...

only occurs on the controller path, and we are seeing it on non-controller machines. The two cases are separated by this check:

if not node.is_commissioning():

which supposedly indicates that the node is a controller when it is true. It is not intuitive that `not is_commissioning()` means the node is a controller, so it is possible that this does not always hold, and the controller code path is sometimes taken for non-controller nodes.
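
To make the concern concrete, here is a minimal, self-contained sketch. The classes are hypothetical stand-ins that only mirror the quoted is_commissioning() logic, not the real MAAS models: a machine that is already DEPLOYED also satisfies `not node.is_commissioning()`, so it takes the same branch as a controller.

from enum import Enum


class NodeStatus(Enum):
    COMMISSIONING = 1
    DEPLOYING = 2
    DEPLOYED = 3


class FakeNode:
    """Hypothetical stand-in for the MAAS Node model; it mirrors only the
    quoted is_commissioning() logic."""

    def __init__(self, node_type, status):
        self.node_type = node_type
        self.status = status

    def is_commissioning(self):
        return self.status not in (NodeStatus.DEPLOYED, NodeStatus.DEPLOYING)


machine = FakeNode("MACHINE", NodeStatus.DEPLOYED)
rack = FakeNode("RACK_CONTROLLER", NodeStatus.DEPLOYED)

for node in (machine, rack):
    if not node.is_commissioning():
        # Both the deployed machine and the controller land here, so the
        # "move all assigned subnets to the new VLAN" path is not limited
        # to controllers.
        print(node.node_type, "takes the controller path")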

Björn Tillenius (bjornt) wrote :

> if not node.is_commissioning():
>
> which supposedly indicates that the node is a controller when it is true. It is not intuitive that
> `not is_commissioning()` means the node is a controller, so it is possible that this does not always
> hold, and the controller code path is sometimes taken for non-controller nodes.

This logic is also executed for hardware-sync. Do you have any nodes deployed with hardware-sync enabled?

Igor Gnip (igorgnip) wrote :

We do not use hardware sync and it is not enabled.
The logic works like this:
DHCP serves PXE boot or DHCP requests to a machine.
Whatever VLAN ID is configured on the switch port for that machine determines from which VLAN the DHCP server receives the request and which IP address it assigns (no IP overlap limitation).
The DHCP server politely informs MAAS what was done, i.e. which IP and VLAN were assigned/observed.
In interfaces.py, the logic that 'fixes' things makes global adjustments to VLANs and subnets. In our case a subnet was shifted to a different VLAN, but the exact outcome probably depends on topology.

I am 99% sure that `if not node.is_commissioning()` is incorrect for determining whether a server is a rack or region controller, and it is triggering this global shuffle of subnets.

Igor Gnip (igorgnip) wrote :

    @property
    def is_controller(self):
        return self.node_type in [
            NODE_TYPE.REGION_CONTROLLER,
            NODE_TYPE.REGION_AND_RACK_CONTROLLER,
            NODE_TYPE.RACK_CONTROLLER,
        ]

    def is_commissioning(self):
        return self.status not in (NODE_STATUS.DEPLOYED, NODE_STATUS.DEPLOYING)

Are there any obstacles to replacing the check with `is_controller` and testing the behavior?
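
For illustration only, this is roughly what such a guard could look like inside the VLAN-update handler. The node lookup and the helper names (move_assigned_subnets_to_new_vlan, remove_links_on_vlan_change) are hypothetical placeholders, not the actual MAAS code:

def interface_vlan_update(instance, old_values, **kwargs):
    # Placeholder for however the handler resolves the owning node.
    node = instance.get_node()
    if node.is_controller:
        # Only controllers should drag their assigned subnets along to the
        # new VLAN, because their subnets are discovered rather than set by
        # an administrator.
        move_assigned_subnets_to_new_vlan(instance)  # hypothetical helper
    else:
        # Machines and devices just drop links that no longer match the VLAN.
        remove_links_on_vlan_change(instance)  # hypothetical helper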

Jacopo Rota (r00ta)
tags: added: bug-council
Changed in maas:
importance: Medium → High
Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
Jacopo Rota (r00ta)
Changed in maas:
status: Triaged → In Progress
Jacopo Rota (r00ta) wrote :

I confirm I'm able to reproduce this. I'm investigating the issue.

Jacopo Rota (r00ta)
Changed in maas:
milestone: 3.5.x → 3.6.0
Jacopo Rota (r00ta) wrote :

The bug was introduced here: https://code.launchpad.net/~ack/maas/+git/maas/+merge/406360
In short, in MAAS < 3.3 we moved the subnets to the new VLAN only when the VLAN of an interface of a CONTROLLER was changed. From MAAS >= 3.3, whenever the VLAN of an interface is updated, we move the subnet to the new VLAN.

In order to fix this, we first have to fix https://bugs.launchpad.net/maas/+bug/2064281 in 3.5 and backport it to 3.4. Once the backport has landed, we can fix this bug.

tags: removed: bug-council
Jorge Merlino (jorge-merlino) wrote :

I just wanted to mention that this bug is also present in MAAS 3.2 as the mentioned patch is also present on the 3.2 branch.

Changed in maas:
status: In Progress → Fix Committed
Björn Tillenius (bjornt) wrote :

I'm reverting the backport for 3.3 for now. We've released 3.4.2 without this fix, and now it turns out that we will land some more fixes to 3.3 in order to release 3.3.7.

We don't want to land a fix in 3.3.7 that isn't in the latest 3.4 release.

I'll re-land the backport after 3.3.7 is released.
