MAAS 3.5 fails to boot machines because the rack is timing out retrieving the images

Bug #2063220 reported by Jacopo Rota
This bug affects 1 person
Affects   Status          Importance   Assigned to      Milestone
MAAS      Fix Committed   Critical     Anton Troyanov
3.5       Fix Committed   Critical     Anton Troyanov

Bug Description

In MAAS 3.5 with the following setup:

1) r00ta-ThinkPad-T420: region+rack with access to the following subnets:
   172.0.2.0/24
   10.14.10.0/24
   fd42:7bdd:adc0:71f4::/64
   10.194.168.0/21

2) novel-mantis: rack with access to the following subnets:
   172.0.2.0/24
   10.35.173.0/24
   fd42:147b:a187:ce53::/64
   20.0.1.0/24

I'm trying to boot some machines on the 20.0.1.0/24 subnet. Most of the time they fail to boot because they never receive the bootloader (see screenshot).

Looking at the rack logs on novel-mantis, I see:

Apr 23 15:37:38 novel-mantis maas-rackd[1761]: provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 20.0.1.191
Apr 23 15:37:38 novel-mantis maas-rackd[1761]: provisioningserver.rackdservices.http: [info] /images/bootx64.efi requested by ::1

and in the maas-agent logs on novel-mantis I see:

Apr 23 15:37:38 novel-mantis maas-agent[13651]: 2024/04/23 15:37:38 http: proxy error: dial tcp [fd42:7bdd:adc0:71f4::1]:5240: connect: network is unreachable

However, novel-mantis does NOT have access to fd42:7bdd:adc0:71f4::/64. It looks like the agent is trying to fetch the bootloader through subnets it cannot reach, and as a consequence the machines do not boot.
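
The mismatch described above can be checked mechanically. Below is a minimal sketch, not MAAS code: it takes the subnets listed for novel-mantis and tests which candidate region addresses fall inside one of them. `split_by_reachability` is a hypothetical helper, `172.0.2.1` is an assumed address on the shared subnet (the report does not say which address the region uses there), and "not in a rack subnet" is only a proxy for "unreachable", since a route could still exist.

```python
import ipaddress

# Subnets listed in this report for the rack novel-mantis.
RACK_SUBNETS = [
    "172.0.2.0/24",
    "10.35.173.0/24",
    "fd42:147b:a187:ce53::/64",
    "20.0.1.0/24",
]

# Region addresses to test. The first two appear in the agent logs of this
# report; 172.0.2.1 is an ASSUMED address on the shared subnet.
CANDIDATES = [
    "fd42:7bdd:adc0:71f4::1",
    "10.194.168.1",
    "172.0.2.1",
]


def split_by_reachability(rack_subnets, candidates):
    """Partition candidates by whether they fall inside a rack subnet."""
    networks = [ipaddress.ip_network(subnet) for subnet in rack_subnets]
    on_link, off_link = [], []
    for candidate in candidates:
        address = ipaddress.ip_address(candidate)
        # An IPv4 address is never "in" an IPv6 network (and vice versa),
        # so mixing address families in one list is safe.
        if any(address in network for network in networks):
            on_link.append(candidate)
        else:
            off_link.append(candidate)
    return on_link, off_link


on_link, off_link = split_by_reachability(RACK_SUBNETS, CANDIDATES)
print("inside a rack subnet: ", on_link)
print("outside rack subnets:", off_link)
```

Under these assumptions, both addresses the agent dialled in the logs land in the "outside" list, and only the address on the shared 172.0.2.0/24 subnet is directly reachable from the rack.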

Jacopo Rota (r00ta) wrote:

I enabled debug logging and extracted the following.

From the rack:
Apr 23 16:00:02 novel-mantis maas-rackd[36406]: tftp.protocol: [debug] Datagram received from ('20.0.1.191', 1845): <RRQDatagram(filename=b'bootx64.efi', mode=b'octet', options=OrderedDict([(b'tsize', b'0'), (b'blksize', b'1468'), (b'windowsize', b'4')]))>
Apr 23 16:00:02 novel-mantis maas-rackd[36406]: provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 20.0.1.191
Apr 23 16:00:02 novel-mantis maas-rackd[36406]: provisioningserver.rackdservices.http: [info] /images/bootx64.efi requested by ::1
Apr 23 16:00:02 novel-mantis maas-rackd[36406]: tftp.bootstrap: [debug] Got error: <tftp.datagram.ERRORDatagram object at 0x7f2a2837b2e0>
Apr 23 16:00:12 novel-mantis maas-rackd[36406]: tftp.bootstrap: [debug] Timed out during option negotiation process

From the agent:
Apr 23 16:00:02 novel-mantis maas-agent[37170]: 2024/04/23 16:00:02 http: proxy error: dial tcp [fd42:7bdd:adc0:71f4::1]:5240: connect: network is unreachable
Apr 23 16:00:06 novel-mantis maas-agent[37170]: 2024/04/23 16:00:06 http: proxy error: dial tcp 10.194.168.1:5240: i/o timeout
Apr 23 16:01:15 novel-mantis maas-agent[37170]: 2024/04/23 16:01:15 http: proxy error: context canceled
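
Read together, the two excerpts suggest a timing problem, which is easier to see on a single timeline. The following is a rough sketch for lining the entries up, assuming the syslog-style prefix shown above and the year 2024 (taken from the agent's own timestamps); `seconds_since_first` is a hypothetical helper, not part of MAAS.

```python
from datetime import datetime

# Entries taken from the rack and agent debug logs above, in time order.
LOG_LINES = [
    "Apr 23 16:00:02 novel-mantis maas-rackd[36406]: provisioningserver.rackdservices.tftp: [info] bootx64.efi requested by 20.0.1.191",
    "Apr 23 16:00:02 novel-mantis maas-agent[37170]: 2024/04/23 16:00:02 http: proxy error: dial tcp [fd42:7bdd:adc0:71f4::1]:5240: connect: network is unreachable",
    "Apr 23 16:00:06 novel-mantis maas-agent[37170]: 2024/04/23 16:00:06 http: proxy error: dial tcp 10.194.168.1:5240: i/o timeout",
    "Apr 23 16:00:12 novel-mantis maas-rackd[36406]: tftp.bootstrap: [debug] Timed out during option negotiation process",
    "Apr 23 16:01:15 novel-mantis maas-agent[37170]: 2024/04/23 16:01:15 http: proxy error: context canceled",
]


def seconds_since_first(lines, year=2024):
    """Return (offset_seconds, message) pairs relative to the first line."""
    events = []
    for line in lines:
        # Syslog prefix: "Mon DD HH:MM:SS host unit[pid]: message".
        stamp = " ".join(line.split()[:3])
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        # Everything after the first ": " is the message itself.
        message = line.split(": ", 1)[1]
        events.append((when, message))
    start = events[0][0]
    return [
        (int((when - start).total_seconds()), message)
        for when, message in events
    ]


timeline = seconds_since_first(LOG_LINES)
for offset, message in timeline:
    print(f"+{offset:>3}s  {message}")
```

With these five entries the offsets are 0, 0, 4, 10 and 73 seconds: the rack logs the TFTP option negotiation timeout 10 seconds after the request, while the agent's last proxy error is logged 73 seconds after it.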

Jacopo Rota (r00ta)
Changed in maas:
milestone: 3.5.0 → 3.6.0
Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Anton Troyanov (troyanov)
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed