Nodes stuck in Failed Disk Erasing due to wrong ipxe boot file

Bug #2013529 reported by Kevin Reeuwijk
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: High
Assigned to: Igor Brovtsin
Milestone: 3.4.0-beta3

Bug Description

Environment: MAAS 3.3.1, deploying servers with custom (Ubuntu 20.04-based) images.

In our larger-scale MAAS environment, when we enable the "Erase nodes' disks prior to releasing" option, nodes often end up in the "Failed disk erasing" state. Each time this happens, the server console shows the following (see also the attached screenshot):

Loading http://<maas-ip>:5248/images/ubuntu/amd64/ga-22.04/focal/no-such-image/boot-kernel... failed: No such file or directory

The path mixes Ubuntu 22.04 and Ubuntu 20.04 components: a ga-22.04 kernel with the focal (20.04) series. I have seen the opposite too: when I change the settings to use the 22.04 image for Commissioning and Deployment, the path contains "ga-20.04/jammy". Something is clearly wrong here.

I'm not quite sure what causes this, but we have to disable the "Erase nodes' disks prior to releasing" option to prevent this issue from occurring.

I'd like to get to the bottom of this issue; let me know if there is any information I can gather for you.

Revision history for this message
Kevin Reeuwijk (kreeuwijk) wrote :
Changed in maas:
status: New → Triaged
Igor Brovtsin (igor-brovtsin) wrote :

Relevant code: https://git.launchpad.net/maas/tree/src/maasserver/rpc/boot.py#n195, probable root cause: https://git.launchpad.net/maas/tree/src/maasserver/rpc/boot.py#n272

For erasing and rescue mode, we use `default_osystem` and `default_distro_series`, but the subarch is populated from `machine.hwe_kernel`, so the kernel in the boot path can belong to a different release than the series.
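
To illustrate the mismatch described above, here is a minimal sketch. It is not the MAAS implementation: the function names are hypothetical, and the path layout (osystem/arch/subarch/series/label) is only inferred from the URL in the bug description, including the "no-such-image" label seen there.

```python
# Illustrative sketch only; none of these names come from the MAAS codebase.

# Ubuntu series to release number (focal is 20.04, jammy is 22.04).
SERIES_RELEASE = {"focal": "20.04", "jammy": "22.04"}


def ephemeral_kernel_path(default_osystem, default_series, arch, hwe_kernel,
                          label="no-such-image"):
    """Build a boot-kernel path the way the report suggests it is composed:
    OS and series come from the global defaults, while the subarch comes
    from the machine's own hwe_kernel."""
    return (f"/images/{default_osystem}/{arch}/{hwe_kernel}/"
            f"{default_series}/{label}/boot-kernel")


def kernel_matches_series(hwe_kernel, series):
    """Return True if the kernel name refers to the same release as the series."""
    release = SERIES_RELEASE.get(series)
    return release is not None and release in hwe_kernel


# Defaults say focal, but the machine carries a 22.04 kernel.
path = ephemeral_kernel_path("ubuntu", "focal", "amd64", "ga-22.04")
print(path)
print(kernel_matches_series("ga-22.04", "focal"))
```

With these inputs the sketch reproduces the mixed path from the console output, and the consistency check reports that the kernel and series do not belong together.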

I am currently working on some of the relevant code, so I'll test against this case as well.

Changed in maas:
importance: Undecided → Medium
assignee: nobody → Igor Brovtsin (igor-brovtsin)
importance: Medium → High
Changed in maas:
milestone: none → 3.4.0
Igor Brovtsin (igor-brovtsin) wrote :

Kevin, just to confirm my findings, could you please provide the output of the following MAAS CLI command for one of the affected machines?

maas $PROFILE machine read $MACHINE_ID | grep -E '(osystem|distro_series|hwe_kernel|architecture")'

If you haven't used the MAAS CLI before, https://maas.io/docs/try-out-the-maas-cli might be helpful. As for $MACHINE_ID, you can get it from the URL of the machine details page in the UI (/MAAS/r/machine/<MACHINE_ID>).
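
As an alternative to grep, the same fields could be pulled out with a few lines of Python. This is a rough sketch that assumes the CLI prints a single JSON object containing these keys; the helper name `boot_fields` and the sample values are hypothetical.

```python
import json

# Keys matched by the grep pattern above.
FIELDS = ("osystem", "distro_series", "hwe_kernel", "min_hwe_kernel", "architecture")


def boot_fields(machine_json):
    """Parse the JSON text of one machine and keep only the boot-related keys.
    Keys that are absent from the input map to None."""
    machine = json.loads(machine_json)
    return {key: machine.get(key) for key in FIELDS}


# Hypothetical sample input, not taken from a real machine.
sample = json.dumps({
    "hostname": "example",
    "osystem": "ubuntu",
    "distro_series": "focal",
    "hwe_kernel": "ga-20.04",
    "min_hwe_kernel": "",
    "architecture": "amd64/generic",
})
print(boot_fields(sample))
```

In practice the text passed to `boot_fields` could be read from standard input (for example with `sys.stdin.read()`) when piping the CLI output into the script.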

Changed in maas:
status: Triaged → Incomplete
Kevin Reeuwijk (kreeuwijk) wrote :

Igor, here you go:

For a server running our custom images:

$ maas admin machine read tpdw4g | grep -E '(osystem|distro_series|hwe_kernel|architecture")'
    "hwe_kernel": "ga-20.04",
    "distro_series": "u-2004-0-k-12410-0",
    "osystem": "custom",
    "min_hwe_kernel": "",
    "architecture": "amd64/generic",

For a server running the built-in Ubuntu 20.04 image from MAAS:

$ maas admin machine read 47dqan | grep -E '(osystem|distro_series|hwe_kernel|architecture")'
    "hwe_kernel": "ga-20.04",
    "distro_series": "focal",
    "osystem": "ubuntu",
    "min_hwe_kernel": "",
    "architecture": "amd64/generic",

For a server running the built-in Ubuntu 22.04 image from MAAS:

$ maas admin machine read rnnknh | grep -E '(osystem|distro_series|hwe_kernel|architecture")'
    "hwe_kernel": "ga-22.04",
    "distro_series": "jammy",
    "osystem": "ubuntu",
    "min_hwe_kernel": "",
    "architecture": "amd64/generic",

Igor Brovtsin (igor-brovtsin) wrote :

Kevin, are the second and the third machines affected by this issue as well? If so, could you please describe the exact steps to reproduce this bug with them? E.g. select this commissioning image as default, deploy the machine with that particular OS and release and so on.

I see the issue with the first one, but if the second and the third are affected as well, we might have another bug somewhere nearby.

Kevin Reeuwijk (kreeuwijk) wrote :

Hi Igor,

It only happens with our custom images, not with the built-in images, so the second and third machines are not affected. In practice we only use custom images.

Igor Brovtsin (igor-brovtsin) wrote :

Great, thanks Kevin!

Changed in maas:
status: Incomplete → Triaged
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.0-beta3
Alberto Donato (ack)
Changed in maas:
status: Fix Committed → Fix Released