systemd-resolved failure when commissioning machine with 12 NICs

Bug #1947052 reported by Jeff Lane 
This bug affects 1 person
Affects: MAAS
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

I've got a machine with 12 network devices that we're trying to add to MAAS. The machine successfully completes the initial enlistment, but cannot commission. On digging through the logs (attached), I noticed that the machine PXE boots and DNS is working at the beginning:

2021-10-13T18:43:42+00:00 maas-enlisting-node cloud-init[2402]: Downloading and extracting http://10-1-10-0--23.maas-internal:5248/MAAS/metadata/2012-03-01/maas-scripts to /tmp/user_data.sh.OqsWki/scripts

As you can see, the node succeeds in obtaining the scripts from the internal URL. From there it begins running the commissioning scripts and eventually brings up the remaining 11 NICs. Each NIC is successfully brought up and obtains a DHCP address from MAAS.

However, at this point, DNS stops working and commissioning fails because the node cannot talk back to the MAAS API:

2021-10-13T18:45:02+00:00 maas-enlisting-node cloud-init[2402]: request to http://10-1-10-0--23.maas-internal:5248/MAAS/api/2.0/machines/ failed. sleeping 1.: <urlopen error [Errno -3] Temporary failure in name resolution>

After looking a bit deeper in the logs, I noticed that after each NIC comes up, systemd-resolved is reloaded. At the end of the sequence where the NICs are all brought up, that final reload of systemd-resolved fails:

2021-10-13T18:43:52+00:00 maas-enlisting-node systemd[1]: systemd-resolved.service: Succeeded.
2021-10-13T18:43:52+00:00 maas-enlisting-node systemd[1]: Stopped Network Name Resolution.
2021-10-13T18:43:52+00:00 maas-enlisting-node systemd[1]: systemd-resolved.service: Start request repeated too quickly.
2021-10-13T18:43:52+00:00 maas-enlisting-node systemd[1]: systemd-resolved.service: Failed with result 'start-limit-hit'.
2021-10-13T18:43:52+00:00 maas-enlisting-node systemd[1]: Failed to start Network Name Resolution.
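
For anyone checking their own commissioning logs for the same symptom, here is a minimal sketch that scans syslog-style lines for units that failed with 'start-limit-hit'. The function name find_start_limit_hits and the regular expression are illustrative assumptions based only on the line format shown above; other log formats would need a different pattern.

```python
import re

# Assumed line shape, taken from the excerpt in this report:
# "<timestamp> <host> systemd[<pid>]: <unit>: Failed with result 'start-limit-hit'."
START_LIMIT_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+\S+\s+systemd\[\d+\]:\s+"
    r"(?P<unit>\S+): Failed with result 'start-limit-hit'\.$"
)


def find_start_limit_hits(lines):
    """Return (timestamp, unit) pairs for every start-limit-hit failure."""
    hits = []
    for line in lines:
        match = START_LIMIT_RE.match(line.strip())
        if match:
            hits.append((match.group("timestamp"), match.group("unit")))
    return hits


# Sample input copied from the log excerpt in this report.
SAMPLE = [
    "2021-10-13T18:43:52+00:00 maas-enlisting-node systemd[1]: "
    "systemd-resolved.service: Start request repeated too quickly.",
    "2021-10-13T18:43:52+00:00 maas-enlisting-node systemd[1]: "
    "systemd-resolved.service: Failed with result 'start-limit-hit'.",
    "2021-10-13T18:43:52+00:00 maas-enlisting-node systemd[1]: "
    "Failed to start Network Name Resolution.",
]

if __name__ == "__main__":
    for timestamp, unit in find_start_limit_hits(SAMPLE):
        print(f"{timestamp}: {unit} hit its start limit")
```

To run it against a real log, pass an open file object in place of SAMPLE, since the function only iterates over lines.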

My hypothesis is that the sheer number of NICs coming up at one time causes that start limit to be hit, due to too many rapid-fire reloads of systemd-resolved.

This is at least partly validated: we disconnected 4 of the ports, leaving only 8, and commissioning immediately succeeded.

Jeff Lane  (bladernr) wrote :
Christian Grabowski (cgrabowski) wrote :
Jeff Lane  (bladernr) wrote :

Interesting... it's Fix Released in Focal. Maybe this is just that, but ... more of it :D

Jeff Lane  (bladernr) wrote :

Also, as a small follow-up: as noted in the summary, removing the cables for 4 of the ports (I just asked them to unplug the 4-port 10GbE NIC) allowed the machine to commission properly.

After verifying that, I had them re-attach those four cables and the machine immediately failed to commission again with the same cause (systemd-resolved start limit).

I wonder if something as simple as a 5-second sleep between device bringups would be enough to stop this, giving things a chance to settle before firing off systemd-resolved again.

Christian Grabowski (cgrabowski) wrote :

So it's not an issue of systemd-resolved being started too rapidly. Rather, systemd keeps a counter of how many times a service is started and enforces a limit to keep a service from crash looping, but in this case the restarts are valid. It's possible the commissioning image has not been updated yet, though.
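
To make the start counter concrete, here is a simplified model of a per-unit start limit: a counter that allows at most `burst` starts per `interval_sec` window. This is a sketch, not systemd's code. The class StartRateLimit and the helper refused_starts are hypothetical names, and the values 10 seconds and 5 starts are systemd's documented defaults (DefaultStartLimitIntervalSec and DefaultStartLimitBurst), not settings confirmed for the commissioning image in this report. The timings below are invented for illustration, so the model is not expected to reproduce the 8-versus-12 NIC behaviour seen here.

```python
class StartRateLimit:
    """Simplified fixed-window start counter (an approximation, not systemd's code)."""

    def __init__(self, interval_sec=10.0, burst=5):
        # Assumed values: systemd's documented defaults, not confirmed for this image.
        self.interval_sec = interval_sec
        self.burst = burst
        self.window_start = None
        self.count = 0

    def allow(self, now):
        """Return True if a start at time `now` (in seconds) is permitted."""
        # Open a new window on the first start or once the old window has elapsed.
        if self.window_start is None or now - self.window_start > self.interval_sec:
            self.window_start = now
            self.count = 1
            return True
        # Inside the window: permit starts until the burst is used up.
        if self.count < self.burst:
            self.count += 1
            return True
        return False


def refused_starts(start_times, interval_sec=10.0, burst=5):
    """Return the start times that the modelled limit would refuse."""
    limiter = StartRateLimit(interval_sec, burst)
    return [t for t in start_times if not limiter.allow(t)]


if __name__ == "__main__":
    # Hypothetical timings: 12 restarts half a second apart.
    rapid = [i * 0.5 for i in range(12)]
    # Hypothetical timings: 12 restarts five seconds apart.
    spaced = [i * 5.0 for i in range(12)]
    print("rapid restarts refused:", len(refused_starts(rapid)))
    print("spaced restarts refused:", len(refused_starts(spaced)))
```

Under these assumed defaults, restarts spaced five seconds apart never exhaust the burst, which is consistent with the sleep suggested earlier in the thread, while a rapid burst of valid restarts is refused in the same way a crash loop would be.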

Changed in maas:
status: New → Triaged
Jerzy Husakowski (jhusakowski) wrote :

Is this still an issue? The commissioning images and systemd have both been updated with bug fixes for related issues.

Changed in maas:
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired
Jeff Lane  (bladernr) wrote :

Just to provide closure: yes, the other updates seem to have resolved it. All 12 NICs were properly probed and no failure occurred.
