Nodes stuck in Failed Disk Erasing due to wrong ipxe boot file
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
MAAS | Fix Released | High | Igor Brovtsin | | |
Bug Description
Environment: MAAS 3.3.1
We deploy servers with custom (Ubuntu 20.04-based) images.
In our larger-scale MAAS environment, when we enable the "Erase nodes' disks prior to releasing" option, nodes often end up in the "Failed disk erasing" state. Every time this happens, the server console shows the following (see also the screenshot):
Loading http://<maas-ip>
As you can see, it is mixing Ubuntu 22.04 with Ubuntu 20.04 paths. I have seen the opposite happen too, when I change the settings to use the 22.04 image for Commissioning and Deployment. Then I see "ga-20.04/jammy" in the path. Something is clearly awry here.
I'm not sure what causes this, but we have to disable the "Erase nodes' disks prior to releasing" option to prevent the issue from occurring.
I'd like to get to the bottom of this issue; let me know if there is any information I can gather for you.
Related branches
- Björn Tillenius: Approve
- MAAS Lander: Approve
Diff: 1082 lines (+474/-115)
12 files modified
src/maasserver/clusterrpc/boot_images.py (+6/-0)
src/maasserver/clusterrpc/tests/test_boot_images.py (+27/-12)
src/maasserver/forms/tests/test_helpers.py (+9/-3)
src/maasserver/models/bootresource.py (+23/-2)
src/maasserver/models/tests/test_bootresource.py (+23/-1)
src/maasserver/node_action.py (+2/-0)
src/maasserver/preseed.py (+67/-30)
src/maasserver/rpc/boot.py (+51/-20)
src/maasserver/rpc/tests/test_boot.py (+205/-29)
src/maasserver/tests/test_preseed.py (+52/-10)
src/maasserver/websockets/handlers/tests/test_general.py (+6/-6)
src/provisioningserver/testing/boot_images.py (+3/-2)
Changed in maas:
status: New → Triaged
Changed in maas:
milestone: none → 3.4.0
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 3.4.0 → 3.4.0-beta3
Changed in maas:
status: Fix Committed → Fix Released
Relevant code: https://git.launchpad.net/maas/tree/src/maasserver/rpc/boot.py#n195, probable root cause: https://git.launchpad.net/maas/tree/src/maasserver/rpc/boot.py#n272
For erasing and rescue mode, we use `default_osystem` and `default_distro_series`, but subarch is populated from `machine.hwe_kernel`.
I am currently working on some of the relevant code, so I'll test against this case as well.
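To make the suspected mismatch concrete, here is a minimal Python sketch of the selection logic described above. It is a simplified model, not the MAAS code: `Machine`, `boot_params_buggy`, `kernel_matches_series`, `boot_params_checked`, the status strings and the `SERIES_VERSION` table are all hypothetical names introduced for illustration. The fallback to a `ga-<version>` kernel is only one possible guard and is not necessarily what the merged branch does.

```python
from dataclasses import dataclass

# Assumed mapping from Ubuntu series name to release version (illustration only).
SERIES_VERSION = {"focal": "20.04", "jammy": "22.04"}

# Hypothetical status names for the modes that boot an ephemeral image.
EPHEMERAL_STATUSES = {"DISK_ERASING", "RESCUE_MODE"}


@dataclass
class Machine:
    status: str         # e.g. "DEPLOYING", "DISK_ERASING", "RESCUE_MODE"
    distro_series: str  # series the machine was deployed with
    hwe_kernel: str     # kernel recorded on the machine, e.g. "ga-20.04"


def boot_params_buggy(machine: Machine, default_distro_series: str) -> tuple[str, str]:
    """Model of the reported behaviour: series and subarch come from different sources."""
    if machine.status in EPHEMERAL_STATUSES:
        series = default_distro_series   # taken from the global default
    else:
        series = machine.distro_series
    subarch = machine.hwe_kernel         # always taken from the machine
    return series, subarch


def kernel_matches_series(series: str, subarch: str) -> bool:
    """Return True if a kernel such as 'ga-20.04' belongs to the given series."""
    parts = subarch.split("-")
    return len(parts) >= 2 and parts[1] == SERIES_VERSION.get(series)


def boot_params_checked(machine: Machine, default_distro_series: str) -> tuple[str, str]:
    """One possible guard: fall back to the series' own GA kernel on a mismatch."""
    series, subarch = boot_params_buggy(machine, default_distro_series)
    if series in SERIES_VERSION and not kernel_matches_series(series, subarch):
        subarch = f"ga-{SERIES_VERSION[series]}"
    return series, subarch


# A machine deployed with a 20.04-based image, erased while the default series is jammy.
machine = Machine(status="DISK_ERASING", distro_series="focal", hwe_kernel="ga-20.04")
print(boot_params_buggy(machine, "jammy"))    # ('jammy', 'ga-20.04')
print(boot_params_checked(machine, "jammy"))  # ('jammy', 'ga-22.04')
```

The first result pairs `ga-20.04` with `jammy`, which is the same combination as the "ga-20.04/jammy" fragment seen in the boot path in the bug description.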