MAAS

Deployment fails when node doesn't have an interface on the first subnet in the fabric

Bug #2022006 reported by Alexander Balderson on 2023-05-31

This bug affects 2 people

	Status	Importance	Assigned to	Milestone
MAAS	Status tracked in 3.6
3.5	Won't Fix	Medium	Unassigned
3.6	Triaged	Medium	Unassigned	MAAS 3.6.x

Bug Description

I have a MAAS host on two subnets on a single fabric. One subnet is the "internal" subnet; the other is a "public" VLAN. Most nodes in this MAAS have interfaces on both the internal and public subnets. Using one such node, I stood up a KVM host and then added a handful of KVMs on it, each with only one interface, on the public subnet. These KVMs (and any other node without an interface on the internal subnet) fail to deploy because they cannot report to MAAS that the deployment has completed. This is because the report script called by cloud-init chooses to report to MAAS over the internal subnet instead of the public one.

The deployment of the node does finish, and since it is a KVM I was able to connect to the virtual console, import my SSH key to the node, and extract the logs. cloud-init-output.log contains many instances of:

2023-05-23 14:45:48,836 - handlers.py[WARNING]: Failed posting event: {"name": "init-network/check-cache", "description": "attempting to read from cache [trust]", "event_type": "start", "origin": "cloudinit", "timestamp": 1684853019.4502654}. This was caused by: HTTPConnectionPool(host='10.1.10.3', port=5248): Max retries exceeded with url: /MAAS/metadata/status/dhdrmt (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f637aa64790>, 'Connection to 10.1.10.3 timed out. (connect timeout=None)'))
or
2023-05-23 15:01:08,388 - handlers.py[WARNING]: Failed posting event: {"name": "modules-final/config-reset_rmc", "description": "running config-reset_rmc with frequency once-per-instance", "event_type": "start", "origin": "cloudinit", "timestamp": 1684853938.31702}. This was caused by: HTTPConnectionPool(host='10.1.10.3', port=5248): Max retries exceeded with url: /MAAS/metadata/status/dhdrmt (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f1ef8699cf0>, 'Connection to 10.1.10.3 timed out. (connect timeout=None)'))

Here cloud-init is trying to post the data back to MAAS on the internal subnet (10.1.10.0) instead of on the public subnet, which is the only one the node has access to.

I think the report jobs should always ensure that they report back on a subnet that both the node and the rack controller have access to. MAAS could also block a deployment when the node has no way of reporting back that it succeeded.
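
To make the suggestion concrete, here is a minimal sketch of the selection logic I have in mind. It is not MAAS code: the function name pick_report_address, the /24 prefix lengths, and the public range 192.0.2.0/24 (a documentation range standing in for my real public VLAN) are all assumptions for illustration. Only 10.1.10.3 comes from the logs above.

```python
import ipaddress


def pick_report_address(rack_addresses, node_subnets):
    """Return the first rack controller address that lies in a subnet the node is on.

    Hypothetical helper, not part of MAAS. Raises ValueError when the node
    shares no subnet with the rack controller, which is the point at which
    a deployment could be blocked up front.
    """
    networks = [ipaddress.ip_network(subnet) for subnet in node_subnets]
    for address in rack_addresses:
        ip = ipaddress.ip_address(address)
        if any(ip in network for network in networks):
            return address
    raise ValueError(
        "node shares no subnet with the rack controller; "
        "it could not report deployment status"
    )


# Rack controller addresses: internal first (from the logs), then an assumed public one.
rack_addresses = ["10.1.10.3", "192.0.2.3"]

# A KVM with a single interface on the (assumed) public subnet.
print(pick_report_address(rack_addresses, ["192.0.2.0/24"]))
```

With these inputs the function skips the internal address and returns the public one. A node with no subnet in common raises instead of deploying and then timing out.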

I did some digging into why this is happening and found that when building the report task[1] for cloud-init, MAAS elects to use the first[2] network on the rack controller, even if the node doesn't have an interface on that network.

1) https://git.launchpad.net/maas/tree/src/maasserver/compose_preseed.py#n326
2) https://git.launchpad.net/maas/tree/src/maasserver/models/node.py#n5380
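
The failure mode described above can be reproduced in isolation with a small model. This is only an illustration of "take the first rack address", not the code at the links above: the helper names, the /24 prefix lengths, and the 192.0.2.0/24 public range are assumptions, and "reachable" here simply means "on a subnet the node has an interface on", since there is no working route between my two subnets.

```python
import ipaddress


def first_rack_address(rack_addresses):
    """Model of the reported behaviour: always take the first address."""
    return rack_addresses[0]


def node_is_on_subnet_of(address, node_subnets):
    """True if the address falls inside one of the node's own subnets."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(subnet) for subnet in node_subnets)


rack_addresses = ["10.1.10.3", "192.0.2.3"]  # internal (from the logs), assumed public
dual_homed_node = ["10.1.10.0/24", "192.0.2.0/24"]
public_only_node = ["192.0.2.0/24"]

chosen = first_rack_address(rack_addresses)
print(chosen, node_is_on_subnet_of(chosen, dual_homed_node))
print(chosen, node_is_on_subnet_of(chosen, public_only_node))
```

The dual-homed node can use the chosen address; the public-only node cannot, which matches the connect timeouts to 10.1.10.3 in cloud-init-output.log.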