Bug #2044169 “invalid magic number after custom image deployment failure” : Bugs : MAAS

Alan Baghumian (alanbach) wrote on 2023-11-21:

#1

Thank you Joao for putting this together.

I'm adding an extra note: in BIOS mode, the machine goes into a kernel panic after the reboot instead of showing the GRUB screen mentioned above, and the "invalid magic number" errors are not observed.

Best,
Alan

Alberto Donato (ack) wrote on 2023-11-22:

#2

Could you please try uploading the custom image again, adding the `base_image=ubuntu/<codename>` parameter (depending on which Ubuntu image your custom image is based on)?

Changed in maas:
status:	New → Incomplete

Joao Andre Simioni (jasimioni) wrote on 2023-11-22:

#3

Here is the link for the custom image I used:

https://drive.google.com/file/d/1S-X7jzQOthCNuYwPZFANvqf64jlfoTsO/view?usp=drive_link

maas admin boot-resources create name=custom/my-ubuntu architecture=amd64/generic filetype=ddgz content@=custom-ubuntu-lvm.dd.gz

It was created using the default options from the MAAS docs.

Let me know if you need anything else.

Joao Andre Simioni (jasimioni) wrote on 2023-11-23:

#4

I uploaded the image using this command:

maas admin boot-resources create name=ubuntu/my-ubuntu-2044169 base_image=ubuntu/jammy architecture=amd64/generic filetype=ddgz content@=custom-ubuntu-lvm.dd.gz

Now, when I try to deploy it, I get the following error in regiond.log:

2023-11-23 14:21:46 maasserver.websockets.handlers.machine: [error] Bulk action (deploy) for yw3r3p failed: my-ubuntu-2044169 has no kernels available.

Is ubuntu/jammy the expected codename?

Note that I see the additional images:

And the image does not appear in the Custom list, but under Ubuntu (when using the WebUI).

However, the production environment where the issue is faced does not use a Custom Ubuntu image - I used Ubuntu to make the reproducer easier. So, even if this additional parameter fixes the issue for Ubuntu, it won't fix the customer issue.

I tried again with the shipped CentOS image, and I see the same behavior. I interrupted the installation by shutting down the VM, and now I get:

Fetching Netboot Image
Booting under MAAS direction...
error: invalid magic number.
error: you need to load the kernel first.

Press any key to continue...

GNU GRUB version 2.06

 ┌────────────────────────────────────────────────────────────────────────────┐
 │setparams 'Ephemeral'                                                       │
 │                                                                            │
 │    echo   'Booting under MAAS direction...'                                │
 │    linux  (http,192.168.50.1:5248)/images/centos/amd64/ga-20.04/centos70/n\│
 │o-such-image/boot-kernel nomodeset ro root=squash:http://192.168.50.1:5248/\│
 │images/centos/amd64/ga-20.04/centos70/no-such-image/squashfs ip=::::client:\│
 │BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:\{'datasou\│
 │rce_list': ['MAAS']\}end_cc cloud-config-url=http://192-168-50-0--24.maas-i\│
 │nternal:5248/MAAS/metadata/latest/by-id/yw3r3p/?op=get_preseed apparmor=0 l\│
 │og_host=192.168.50.1 log_port=5247 ---   BOOTIF=01-${net_default_mac}       │
 │    initrd (http,192.168.50.1:5248)/images/centos/amd64/ga-20.04/centos70/n\│
 │o-such-image/boot-initrd                                                    │
 │                                                                            │
 └────────────────────────────────────────────────────────────────────────────┘

And the reason is the same - only the xinstall is available for centos:

Image: centos | centos70 | amd64 | xinstall

So when purpose=commissioning, it fails to find a suitable image and falls back to the no-such-image path.
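
A minimal sketch of the fallback described above, not the actual MAAS code. It assumes the lookup matches on OS, release, architecture and purpose, and that a miss yields the "no-such-image" placeholder. The path layout is copied from the GRUB entry; `BootImage`, `find_label`, `kernel_path` and the "release" label are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class BootImage(NamedTuple):
    osystem: str
    release: str
    arch: str
    purpose: str
    label: str


# Placeholder that shows up in the GRUB entry when no image matches.
FALLBACK_LABEL = "no-such-image"


def find_label(images, osystem, release, arch, purpose):
    """Return the label of the first matching image, or the fallback."""
    wanted = (osystem, release, arch, purpose)
    for image in images:
        if (image.osystem, image.release, image.arch, image.purpose) == wanted:
            return image.label
    return FALLBACK_LABEL


def kernel_path(osystem, arch, subarch, release, label):
    # Layout as seen in the GRUB entry:
    # images/<os>/<arch>/<subarch>/<release>/<label>/boot-kernel
    return f"images/{osystem}/{arch}/{subarch}/{release}/{label}/boot-kernel"


# Only an xinstall image exists for CentOS 7 (the label value is an assumption).
images = [BootImage("centos", "centos70", "amd64", "xinstall", "release")]

# The next boot asks for a commissioning image, which does not exist.
label = find_label(images, "centos", "centos70", "amd64", "commissioning")
path = kernel_path("centos", "amd64", "ga-20.04", "centos70", label)
print(path)
```

With only an xinstall image present, the commissioning lookup misses and the composed path matches the broken one in the GRUB entry.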

Alberto Donato (ack) wrote on 2023-11-23:

#5

The base_image parameter should match the name of the official image from which the custom one is derived.

Can you please confirm that the issue doesn't happen when you deploy the official image?

Changed in maas:
status:	Incomplete → New
status:	New → Incomplete

Joao Andre Simioni (jasimioni) wrote on 2023-11-23:

#6

Deployment of Ubuntu / Ubuntu 22.04 LTS "Jammy Jellyfish" with "No minimum kernel"
works without any issue. Here is the last event:

        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 8212,
            "level": "INFO",
            "created": "Thu, 23 Nov. 2023 15:17:06",
            "type": "Deployed",
            "description": ""
        },

And if I interrupt the installation the same way I do with the custom / CentOS images,
I don't see the magic number error.

When deploying the CentOS image, I can see the following events:

        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 7974,
            "level": "ERROR",
            "created": "Thu, 23 Nov. 2023 14:36:33",
            "type": "Marking node failed",
            "description": "Missing boot image centos/amd64/ga-20.04/centos70."
        },
        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 7972,
            "level": "INFO",
            "created": "Thu, 23 Nov. 2023 14:36:33",
            "type": "Performing PXE boot",
            "description": ""
        },
        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 7961,
            "level": "ERROR",
            "created": "Thu, 23 Nov. 2023 14:35:20",
            "type": "Marking node failed",
            "description": "Node operation 'Deploying' timed out after 5 minutes."
        },

Notice the deployment failure and the additional message with the Missing boot image.

Now, with a standard Ubuntu image, these are the events:

        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 8357,
            "level": "INFO",
            "created": "Thu, 23 Nov. 2023 15:33:41",
            "type": "Loading ephemeral",
            "description": ""
        },
        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 8353,
            "level": "INFO",
            "created": "Thu, 23 Nov. 2023 15:33:29",
            "type": "Performing PXE boot",
            "description": ""
        },
        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 8339,
            "level": "ERROR",
            "created": "Thu, 23 Nov. 2023 15:26:20",
            "type": "Marking node failed",
            "description": "Node operation 'Deploying' timed out after 5 minutes."
        },

After the deployment failure, the boot happens and I can see the Loading ephemeral event.
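
To spot the difference between the two event lists quickly, something like the following could be used. It is only a sketch: the JSON is a trimmed copy of the CentOS events quoted above (only the fields used here), and `missing_boot_image_events` is a hypothetical helper, not part of MAAS.

```python
import json

# Trimmed copies of the CentOS events quoted above.
events_json = """
[
  {"id": 7974, "level": "ERROR", "type": "Marking node failed",
   "description": "Missing boot image centos/amd64/ga-20.04/centos70."},
  {"id": 7972, "level": "INFO", "type": "Performing PXE boot",
   "description": ""},
  {"id": 7961, "level": "ERROR", "type": "Marking node failed",
   "description": "Node operation 'Deploying' timed out after 5 minutes."}
]
"""

MARKER = "Missing boot image"


def missing_boot_image_events(events):
    """Return the events whose description starts with the marker text."""
    return [e for e in events if e.get("description", "").startswith(MARKER)]


events = json.loads(events_json)
hits = missing_boot_image_events(events)
for event in hits:
    print(event["id"], event["description"])
```

Running the same filter over the Ubuntu events returns nothing, since none of them carries the "Missing boot image" description.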

Alberto Donato (ack) wrote on 2023-11-24:

#7

I see a "Missing boot image centos/amd64/ga-20.04/centos70." message. Do you have the official CentOS 7 image downloaded as well?

Also can you please provide the exact MAAS version you're using?

Changed in maas:
status:	Incomplete → New
status:	New → Incomplete

Joao Andre Simioni (jasimioni) wrote on 2023-11-24 (last edit on 2023-11-24):

#8

The CentOS image was downloaded using the WebUI options and can be deployed. It's the official one.

If I interrupt the installation, the machine goes into the "Failed deployment" state,
and subsequent boots show the "invalid magic number" error; the "Missing boot image" event is generated just before that.

Here is the output of maas $PROFILE boot-resources read:

https://paste.ubuntu.com/p/z6jY9kBgZR/

I believe that message is generated by the tftp daemon here:

https://github.com/maas/maas/blob/master/src/provisioningserver/rackdservices/tftp.py#L211

That is because CentOS and custom images don't have a commissioning image, only an xinstall one,
and the purpose on the next boot is set to commissioning:

https://github.com/maas/maas/blob/master/src/provisioningserver/rackdservices/tftp.py#L289

This problem was reported in MAAS 3.2.9/debs and my reproducer is using MAAS 3.3.4/debs.

I just launched a new instance with MAAS 3.4.0~rc2 using snaps and I can reproduce the issue.

Alberto Donato (ack) wrote on 2023-11-24:

#9

Could you please provide the exact steps used for the CentOS image?
There's a bit of a mix of CentOS- and Ubuntu-related commands in the thread.

Basically:
- which command was used to upload the custom centos image?
- what's the output from the boot where the issue happens?

Changed in maas:
status:	Incomplete → New
status:	New → Incomplete

Joao Andre Simioni (jasimioni) wrote on 2023-11-24:

#10

Hi Alberto,

I initially flagged the bug with a custom image because this is how it was reported in the production environment (using a custom image).

While working to provide you with the extra data requested, I noticed the same issue is seen with the official CentOS images. There is no upload involved in this scenario.

Steps to reproduce with CentOS image:

- Install a fresh MAAS environment
- Using the WebUI go to Images, Other Images, and Enable the CentOS 7 Image
- Enlist and commission a machine. I'm using a UEFI x86 QEMU VM in the reproducer.
- Start deploying the machine using the CentOS image, but interrupt the installation outside MAAS (I run `virsh destroy` on the VM)
- After the Deployment timeout (usually 30 minutes, as defined by node_timeout), the machine will go to "Failed deployment" state
- Power on the VM - it'll get stuck in the "invalid magic number" screen, and the ephemeral grub menu will show the wrong path with the no-such-image URL

Alberto Donato (ack) wrote on 2023-11-24:

#11

So, given that the deployment was interrupted, MAAS is correctly setting the machine as "deployment failed" after the timeout.

The GRUB configuration that's provided after that is indeed quite weird (it seems to be broken for any non-Ubuntu distro), but the behaviour in this case is to be considered unspecified.

If you would like to propose a different behaviour for this case, could you please open a discussion in our discourse (https://discourse.maas.io/c/features/) as a feature request?

Changed in maas:
status:	Incomplete → Invalid

Joao Andre Simioni (jasimioni) wrote on 2023-11-24:

#12

No worries,

I've raised:

https://discourse.maas.io/t/improve-grub-menu-after-failed-deployments-for-non-ubuntu-images-invalid-magic-number/7638

To track it as a feature request.
