MAAS 3.2.9 creates 80,000 fabrics for Calico interfaces

Bug #2043970 reported by Erik Dyka
This bug affects 3 people
Affects   Status         Importance  Assigned to      Milestone
MAAS      Fix Committed  High        Björn Tillenius  -
MAAS 3.3  Won't Fix      High        Björn Tillenius  -
MAAS 3.4  Fix Committed  High        Björn Tillenius  -
MAAS 3.5  Fix Committed  High        Björn Tillenius  -

Bug Description

I could no longer customise the network configuration of individual servers or display the list of fabrics in the UI. After some searching I found that fabrics were being created in the database every 15 minutes. As a result, the fabrics table now contains about 80,000 records and the MAAS UI crashes on loading. The event table, by the way, has 14,000,000 entries.
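
For reference, these counts can be re-checked from a region Django shell. A minimal sketch, assuming the maasserver model names (the shell entry point differs between deb and snap installs):

```
# Hedged sketch: verify the runaway row counts from a MAAS region Django shell.
# Model names are taken from the maasserver app and may vary across versions.
from maasserver.models import Event, Fabric

print("fabrics:", Fabric.objects.count())  # ~80,000 in this report
print("events:", Event.objects.count())    # ~14,000,000 in this report
```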

MAAS manages four VMs running a K8s cluster, with a master and a worker node running on each of two machines. Apparently the hardware sync detects the Calico interfaces from the K8s cluster every 15 minutes and creates a new fabric for each interface.

Is it possible to avoid this behaviour?

Name  Version                  Rev    Tracking    Publisher   Notes
maas  3.2.9-12055-g.c3d5597a7  31022  3.2/stable  canonical✓  -

MAAS runs with regiond/rackd on Ubuntu 22.04.

Erik Dyka (erikdy)
information type: Public → Public Security
information type: Public Security → Public
Erik Dyka (erikdy)
description: updated
Revision history for this message
Alberto Donato (ack) wrote :

Could you please describe your network setup (where are Calico and MAAS running, and how are the networks defined)? Also, please attach regiond.log and rackd.log from MAAS.

Changed in maas:
status: New → Incomplete
Revision history for this message
Erik Dyka (erikdy) wrote (last edit ):

We have a head node and a storage server. VMs run on both; see the attached picture. The MAAS, with rackd and regiond, runs as a VM on the head node. All other VMs are registered in MAAS via qemu+ssh. In addition to the two machines, we also have 5 nodes that are simply provisioned via MAAS. However, these are not configured and are currently irrelevant.
We also have two switches. One is for the management network, which is where the MAAS system operates.
The second switch is for the internal network, which is used to access the storage at 25 Gbit and for the machines to communicate with each other.

Unfortunately, I have inherited this cluster and the documentation is incomplete.

Unfortunately, I don't know exactly how the K8s cluster is configured. I can only see the Calico, vxlan, and kube-ipvs interfaces in MAAS. I know that the K8s cluster is supposed to be used with Kubeflow and a DGX, but I can't say to what extent the DGX and Kubeflow have already been integrated.

P.S.: ilka is the project name

Revision history for this message
Alberto Donato (ack) wrote :

There is a case where MAAS can create empty fabrics when an interface doesn't have an IP (so it's not possible to tell if it's connected to a known fabric through its VLAN/subnet).

If the MAAS server, or a machine with periodic hardware reporting enabled, has such a case, where new interfaces are created and removed but never get an IP, it's possible that MAAS creates a fabric for each of those and then leaves them around.

Could you please confirm whether this could be your case (e.g. Calico creating/removing interfaces)?
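
For reference, leftover empty fabrics can be spotted from a region Django shell. A minimal sketch, assuming the maasserver model names and the interface-to-VLAN-to-fabric relation:

```
# Hedged sketch: list fabrics that no interface is attached to via its VLAN.
# Model names and relations are assumptions based on the maasserver app.
from maasserver.models import Fabric, Interface

empty_fabrics = [
    fabric
    for fabric in Fabric.objects.all()
    if not Interface.objects.filter(vlan__fabric=fabric).exists()
]
print(f"{len(empty_fabrics)} fabrics have no interface attached")
```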

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Erik Dyka (erikdy) wrote :

I can confirm this assumption.
It describes the behaviour pretty accurately.

Revision history for this message
Alberto Donato (ack) wrote :

When removing interfaces, MAAS should check if that interface is the last one connected to a fabric, in which case the fabric should also be removed.

We could also consider allowing interfaces not to be connected to any fabric, although that might be a bigger change.
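
A minimal sketch of that cleanup idea, assuming the maasserver model names (this is not the actual MAAS patch):

```
# Hedged sketch of the proposed cleanup, not MAAS's real implementation:
# when an interface is removed, drop its fabric if nothing else uses it.
from maasserver.models import Interface

def remove_interface(interface: Interface) -> None:
    vlan = interface.vlan
    interface.delete()
    if vlan is None:
        return
    fabric = vlan.fabric
    if not Interface.objects.filter(vlan__fabric=fabric).exists():
        # Assumption: MAAS's default fabric (fabric-0) would need to be
        # spared; the real guard for that is not shown here.
        fabric.delete()
```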

Changed in maas:
status: Incomplete → Triaged
importance: Undecided → High
milestone: none → 3.5.0
Revision history for this message
Erik Dyka (erikdy) wrote :

Yesterday I got the okay from my supervisor to publish the logs.

Update: We have two separate clusters running, both with a MAAS and both showing the same fault pattern. The only difference is that the other cluster does not have the 5 nodes; see Network_ilka.png.

Revision history for this message
SCORE Lab (score) wrote :

This affects us too: some 30 machines and 14,000 fabrics.
(Half the nets at our location cannot be managed by MAAS and thus might trigger this.)

Revision history for this message
Richard Elling (richardelling) wrote (last edit ):

Seeing this with some nodes, but not all. The regiond.log shows messages like:
```
2024-01-22 00:30:22 maasserver.region_controller: [info] Reloaded DNS configuration:
  * node noide-1 renamed interface calif1636cdb238 to calif66d4835c51
  * node noide-1 renamed interface calie23afdd6c7a to calif1636cdb238
```

This is coming from `/usr/bin/maas-run-scripts`, which runs from a systemd timer on the node. As a workaround, the timer can be disabled (the exact unit name varies by MAAS version; `systemctl list-timers` on the node will show it).

tags: added: bug-council
Revision history for this message
Björn Tillenius (bjornt) wrote (last edit ):

I think we should change things so that we don't automatically create a new fabric/vlan if we can't detect which fabric/vlan the interface is connected to. It doesn't really provide any value: if MAAS sees ten interfaces that might be connected to the same fabric/vlan, MAAS would create ten new fabrics for them. The user then has no way of knowing what they can trust. Did MAAS detect the right fabric/vlan, or did MAAS not know and create a new one?

MAAS should set the vlan of a newly discovered interface only if it has an IP, or if there's some other hint as to what the right vlan/fabric would be. If MAAS can't autodetect the right vlan/fabric, it's better to leave it empty, so that the user can see that they need to take action to make it right. MAAS shouldn't lie and make something up. It's better to be honest and admit that the vlan/fabric couldn't be deduced.

Now, this might be a backwards-incompatible change. The model itself actually allows this. But in the documentation and the UI, we say that an interface is "Disconnected" if the vlan/fabric is None. However, we also have another way of saying that the interface is disconnected, through the link_connected attribute. This in itself is already confusing, since the two concepts are not tied together. For example, in the UI you have the option to mark an interface as "disconnected". That action touches only the link_connected attribute, not the vlan/fabric.

So I would argue that this is more of a bug fix. We clarify that the link_connected attribute indicates whether the interface is connected to anything, and if the vlan/fabric is None, it means that MAAS doesn't know which vlan/fabric the interface is connected to.

Both before and after this fix, it would hold that if the interface doesn't have a vlan/fabric, it's not directly usable by MAAS. The only change would be that for unknown interfaces that you want to use, you would have to explicitly set a vlan and/or fabric. So it's a slight change in behavior.

The alternative is that we do this change in 4.0 and for 3.X we implement the change to delete empty vlans/fabrics as the last interface get removed.

Although, we might get away with applying the no-new-fabrics change only to hardware sync, and keeping the current behavior for commissioning until 4.0. That way, if someone depends on this behavior, it would still work. Interfaces that are created through hardware sync can't be configured anyway, since they get removed when the machine is released, so they're there for informational purposes only.
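
As a rough, self-contained illustration of that rule (plain Python, not the MAAS code; the subnet map and names are made up):

```
# Hedged sketch of the proposed behaviour: attach a discovered interface to a
# VLAN only when there is real evidence, otherwise return None instead of
# inventing a new fabric/VLAN.
import ipaddress
from typing import Optional

def vlan_for_discovered_interface(ips: list[str],
                                  known_subnets: dict[str, str]) -> Optional[str]:
    """Map an interface's IPs to a VLAN via known subnets; None means unknown."""
    for ip in ips:
        for cidr, vlan in known_subnets.items():
            if ipaddress.ip_address(ip) in ipaddress.ip_network(cidr):
                return vlan  # evidence: the IP falls inside a known subnet
    return None  # no IP and no other hint: do NOT create a new fabric/VLAN

# An addressless Calico interface yields None instead of a fresh fabric:
print(vlan_for_discovered_interface([], {"10.0.0.0/24": "fabric-1.untagged"}))
print(vlan_for_discovered_interface(["10.0.0.7"],
                                    {"10.0.0.0/24": "fabric-1.untagged"}))
```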

Revision history for this message
Jerzy Husakowski (jhusakowski) wrote (last edit ):

MAAS tries to detect the VLAN for a discovered interface and associate it with the appropriate fabric. When the interface has no IP address, MAAS tries to put the interface into a VLAN and a fabric before it has all the information, so it creates a new default VLAN and a new fabric. It makes more sense to delay that action until the operator indicates their choice.

We will disable the creation of new VLANs and fabrics during hardware sync.
A similar situation happens during commissioning, when MAAS discovers new interfaces. As we are unsure whether there are users who rely on that behavior, we will not disable the automatic creation of VLANs and fabrics at that stage. If there's community feedback that this function is not really useful, we will remove it in a subsequent major version of MAAS.

tags: added: networking
removed: bug-council
tags: added: data-model
Revision history for this message
SCORE Lab (score) wrote :

This sounds reasonable! Thanks for looking into it.

Changed in maas:
milestone: 3.5.0 → 3.6.0
assignee: nobody → Alexsander de Souza (alexsander-souza)
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Björn Tillenius (bjornt) wrote :

I'm reopening this bug, since it turned out that the fix was a bit too simple. With the current patch, if you have a setup like this for a deployed machine, MAAS breaks:

```
eth0:
  no address
eth0.123:
  10.10.10.10/24
```

This will fail in different ways, depending on whether MAAS already knows about the subnet or not.

I'm adding a couple of test cases and will work on a fix.
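
For context, a self-contained sketch (plain Python, not MAAS code) of why the "no IP, no fabric" shortcut breaks on this layout:

```
# Hedged sketch: eth0 has no address, but its VLAN child eth0.123 does, so
# skipping fabric assignment for addressless interfaces misplaces the child.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Iface:
    name: str
    addresses: list = field(default_factory=list)
    parent: Optional["Iface"] = None

eth0 = Iface("eth0")                                   # no address
vlan123 = Iface("eth0.123", ["10.10.10.10/24"], eth0)  # address on the child

def needs_fabric(iface: Iface, all_ifaces: list) -> bool:
    """An addressless parent still needs a fabric if a child carries an IP."""
    if iface.addresses:
        return True
    return any(child.parent is iface and needs_fabric(child, all_ifaces)
               for child in all_ifaces)

print(needs_fabric(eth0, [eth0, vlan123]))  # True, despite eth0 having no IP
```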

Changed in maas:
status: Fix Committed → In Progress
assignee: Alexsander de Souza (alexsander-souza) → Björn Tillenius (bjornt)
Revision history for this message
Björn Tillenius (bjornt) wrote :

I put up a new MP to fix this again, now taking more cases into consideration. All tests pass, but the fix is a bit more involved, so we should test it properly before backporting and releasing it.

Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

The solution for 3.3 requires more significant changes, so unless it's critical for certain deployments that cannot be upgraded to 3.4, we're marking this as "Won't Fix" for 3.3.
