invalid magic number after custom image deployment failure

Bug #2044169 reported by Joao Andre Simioni
This bug affects 2 people
Affects   Status    Importance   Assigned to   Milestone
MAAS      Invalid   Undecided    Unassigned

Bug Description

[Problem]

When a deployment using a custom image fails, subsequent boots of the server show:

Fetching Netboot Image
Booting under MAAS direction...
error: invalid magic number.
error: you need to load the kernel first.

Press any key to continue...

Inspecting the GRUB menu after the key press shows URLs with a /no-such-image/ path:

 linux (http,192.168.50.1:5248)/images/custom/amd64/ga-20.04/my-ubuntu/\
no-such-image/boot-kernel nomodeset ro root=squash:http://192.168.50.1:5248\
/images/custom/amd64/ga-20.04/my-ubuntu/no-such-image/squashfs ip=::::clien\
t:BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:\{'datas\
ource_list': ['MAAS']\}end_cc cloud-config-url=http://192-168-50-0--24.maas\
-internal:5248/MAAS/metadata/latest/by-id/yw3r3p/?op=get_preseed apparmor=0\
 log_host=192.168.50.1 log_port=5247 --- BOOTIF=01-${net_default_mac}
 initrd (http,192.168.50.1:5248)/images/custom/amd64/ga-20.04/my-ubuntu/\
no-such-image/boot-initrd

Investigating the issue, I found that the GRUB menu is fetched over TFTP and is
generated dynamically based on the MAC address:

tftp get /grub/grub.cfg-<mac address>

Based on the MAC address, MAAS tries to find a boot image that matches:

image["osystem"] == params["osystem"]
and image["release"] == params["release"]
and image["architecture"] == params["arch"]
and image["purpose"] == purpose

where purpose == commissioning. These are the values used for the failed machine:

'arch': 'amd64'
'osystem': 'custom'
'release': 'my-ubuntu'
'purpose': 'commissioning'

But since it's a custom image, there is no commissioning image available, only an xinstall one:

Image: custom | my-ubuntu | amd64 | xinstall

And it falls back to the no-such-image situation.
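
The lookup described above can be sketched as follows. This is a minimal illustration, not the MAAS implementation: the function names, the "label" key and the sample image list are assumptions made for the example. Only the four comparisons and the no-such-image fallback come from this report.

```python
# Hypothetical sketch of the image lookup described in this report.
FALLBACK_LABEL = "no-such-image"  # placeholder seen in the GRUB URLs


def find_image(images, params, purpose):
    """Return the first image matching all four fields, or None."""
    for image in images:
        if (
            image["osystem"] == params["osystem"]
            and image["release"] == params["release"]
            and image["architecture"] == params["arch"]
            and image["purpose"] == purpose
        ):
            return image
    return None


def image_label(images, params, purpose):
    """Return the matching image's label, or the fallback placeholder."""
    image = find_image(images, params, purpose)
    return image["label"] if image is not None else FALLBACK_LABEL


# Assumed sample data: the custom image only exists with purpose "xinstall".
images = [
    {
        "osystem": "custom",
        "release": "my-ubuntu",
        "architecture": "amd64",
        "purpose": "xinstall",
        "label": "uploaded",  # hypothetical label value
    },
]
params = {"arch": "amd64", "osystem": "custom", "release": "my-ubuntu"}

print(image_label(images, params, "commissioning"))  # no match: fallback
print(image_label(images, params, "xinstall"))       # match: image label
```

With purpose set to commissioning nothing matches, so the placeholder ends up in the kernel and initrd paths.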

[Workaround]

The machine is marked as "Failed deployment". The Release action in the Web UI
brings the machine back to a bootable state, where the expected commissioning boot process happens.

[Reproducer]

- Create a custom image [https://maas.io/docs/create-custom-images]
- Deploy a server using the custom image - my reproducer uses a VM with UEFI boot
- Interrupt the installation (I shut down the VM during the image extraction)
- Wait for the deployment to be marked as failed
- Restart the server and follow the boot process

[Alternatives]

- Can this issue be handled in a more informative way, instead of falling into this condition? The
message "Booting under MAAS direction..." could be changed to give some hint about the problem.

- Instead of picking a no-such-image, a regular Ubuntu commissioning image could be used. This way
the server hostname appears in the boot process and it's easy to relate to an already enlisted machine.
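
The second alternative could look roughly like the sketch below. It only illustrates the proposed behaviour and is not existing MAAS code: find_image_with_fallback, the default_params argument and the sample data (including the "focal" release name) are hypothetical.

```python
# Hypothetical sketch: fall back to a default commissioning image
# instead of producing a no-such-image path.


def matches(image, params, purpose):
    """Same four-field comparison as described in this report."""
    return (
        image["osystem"] == params["osystem"]
        and image["release"] == params["release"]
        and image["architecture"] == params["arch"]
        and image["purpose"] == purpose
    )


def find_image_with_fallback(images, params, purpose, default_params):
    """Try the machine's own OS first, then the default OS."""
    for candidate in (params, default_params):
        for image in images:
            if matches(image, candidate, purpose):
                return image
    return None  # a last-resort error path is still needed


# Assumed sample data.
images = [
    {"osystem": "custom", "release": "my-ubuntu",
     "architecture": "amd64", "purpose": "xinstall"},
    {"osystem": "ubuntu", "release": "focal",
     "architecture": "amd64", "purpose": "commissioning"},
]
params = {"osystem": "custom", "release": "my-ubuntu", "arch": "amd64"}
default_params = {"osystem": "ubuntu", "release": "focal", "arch": "amd64"}

chosen = find_image_with_fallback(
    images, params, "commissioning", default_params
)
print(chosen["osystem"], chosen["release"])
```

Trying the machine's own parameters first keeps the current behaviour for images that do ship a commissioning purpose.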

Alan Baghumian (alanbach) wrote :

Thank you Joao for putting this together.

One extra note: in BIOS mode, the machine hits a kernel panic after the reboot instead of showing the GRUB screen mentioned above, and the "invalid magic number" errors are not observed.

Best,
Alan

Alberto Donato (ack) wrote :

Could you please try uploading the custom image again, adding the `base_image=ubuntu/<codename>` parameter (the codename depends on which Ubuntu image your custom image is based on)?

Changed in maas:
status: New → Incomplete
Joao Andre Simioni (jasimioni) wrote :

Here is the link for the custom image I used:

https://drive.google.com/file/d/1S-X7jzQOthCNuYwPZFANvqf64jlfoTsO/view?usp=drive_link

maas admin boot-resources create name=custom/my-ubuntu architecture=amd64/generic filetype=ddgz content@=custom-ubuntu-lvm.dd.gz

It was created using the default options from the MAAS docs.

Let me know if you need anything else.

Joao Andre Simioni (jasimioni) wrote :

I uploaded the image using this command:

maas admin boot-resources create name=ubuntu/my-ubuntu-2044169 base_image=ubuntu/jammy architecture=amd64/generic filetype=ddgz content@=custom-ubuntu-lvm.dd.gz

Now, when I try to deploy it, I get the following error in regiond.log:

2023-11-23 14:21:46 maasserver.websockets.handlers.machine: [error] Bulk action (deploy) for yw3r3p failed: my-ubuntu-2044169 has no kernels available.

Is ubuntu/jammy the expected codename?

Note that I see the additional images:

Image: ubuntu | my-ubuntu-2044169 | amd64 | commissioning
Image: ubuntu | my-ubuntu-2044169 | amd64 | install
Image: ubuntu | my-ubuntu-2044169 | amd64 | xinstall
Image: ubuntu | my-ubuntu-2044169 | amd64 | diskless

And the image does not appear in the Custom list, but under Ubuntu (when using the WebUI).

However, the production environment where the issue was seen does not use a custom Ubuntu image; I used Ubuntu only to make the reproducer easier. So, even if this additional parameter fixes the issue for Ubuntu, it won't fix the customer's issue.

I tried again with the shipped CentOS image, and I see the same behavior. I interrupted the installation process by shutting down the VM, and now I get:

Fetching Netboot Image
Booting under MAAS direction...
error: invalid magic number.
error: you need to load the kernel first.

Press any key to continue...

                            GNU GRUB version 2.06

 setparams 'Ephemeral'

 echo 'Booting under MAAS direction...'
 linux (http,192.168.50.1:5248)/images/centos/amd64/ga-20.04/centos70/n\
o-such-image/boot-kernel nomodeset ro root=squash:http://192.168.50.1:5248/\
images/centos/amd64/ga-20.04/centos70/no-such-image/squashfs ip=::::client:\
BOOTIF ip6=off overlayroot=tmpfs overlayroot_cfgdisk=disabled cc:\{'datasou\
rce_list': ['MAAS']\}end_cc cloud-config-url=http://192-168-50-0--24.maas-i\
nternal:5248/MAAS/metadata/latest/by-id/yw3r3p/?op=get_preseed apparmor=0 l\
og_host=192.168.50.1 log_port=5247 --- BOOTIF=01-${net_default_mac}
 initrd (http,192.168.50.1:5248)/images/centos/amd64/ga-20.04/centos70/n\
o-such-image/boot-initrd

The reason is the same, since only the xinstall image is available for CentOS:

Image: centos | centos70 | amd64 | xinstall

So when purpose=commissioning, it fails to find a suitable image and falls back to no-such-image.
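
A quick way to spot which images are affected is to group a listing by image and check which ones lack a commissioning purpose. The sketch below assumes input lines in the "Image: osystem | release | arch | purpose" form quoted in this report. How such a listing is produced is not covered here, and the function names are made up for the example.

```python
from collections import defaultdict


def purposes_by_image(lines):
    """Map (osystem, release, arch) to the set of purposes listed."""
    found = defaultdict(set)
    for line in lines:
        if not line.startswith("Image:"):
            continue  # ignore unrelated lines
        fields = [f.strip() for f in line[len("Image:"):].split("|")]
        if len(fields) != 4:
            continue  # not in the expected four-column form
        osystem, release, arch, purpose = fields
        found[(osystem, release, arch)].add(purpose)
    return found


def images_missing(lines, purpose="commissioning"):
    """Return the images that have no entry for the given purpose."""
    return sorted(
        key
        for key, purposes in purposes_by_image(lines).items()
        if purpose not in purposes
    )


# Lines copied from the listings quoted in this report.
listing = [
    "Image: ubuntu | my-ubuntu-2044169 | amd64 | commissioning",
    "Image: ubuntu | my-ubuntu-2044169 | amd64 | xinstall",
    "Image: centos | centos70 | amd64 | xinstall",
    "Image: custom | my-ubuntu | amd64 | xinstall",
]

for osystem, release, arch in images_missing(listing):
    print(f"{osystem}/{release}/{arch}: no commissioning image")
```

For the sample listing this reports the centos and custom entries, which are the two cases that ended up with a no-such-image path.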

Alberto Donato (ack) wrote :

The base_image parameter should match the name of the official image from which the custom one is derived.

Can you please confirm that the issue doesn't happen when you deploy the official image?

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Joao Andre Simioni (jasimioni) wrote :

Deployment of Ubuntu / Ubuntu 22.04 LTS "Jammy Jellyfish" with "No minimum kernel"
works without any issue. Here is the last event:

        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 8212,
            "level": "INFO",
            "created": "Thu, 23 Nov. 2023 15:17:06",
            "type": "Deployed",
            "description": ""
        },

If I interrupt the installation the same way I do with the custom and CentOS images,
I don't see the magic number error.

When deploying the CentOS image, I can see the following events:

        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 7974,
            "level": "ERROR",
            "created": "Thu, 23 Nov. 2023 14:36:33",
            "type": "Marking node failed",
            "description": "Missing boot image centos/amd64/ga-20.04/centos70."
        },
        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 7972,
            "level": "INFO",
            "created": "Thu, 23 Nov. 2023 14:36:33",
            "type": "Performing PXE boot",
            "description": ""
        },
        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 7961,
            "level": "ERROR",
            "created": "Thu, 23 Nov. 2023 14:35:20",
            "type": "Marking node failed",
            "description": "Node operation 'Deploying' timed out after 5 minutes."
        },

Notice the deployment failure and the additional "Missing boot image" message.

Now, with a standard Ubuntu image, these are the events:

        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 8357,
            "level": "INFO",
            "created": "Thu, 23 Nov. 2023 15:33:41",
            "type": "Loading ephemeral",
            "description": ""
        },
        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 8353,
            "level": "INFO",
            "created": "Thu, 23 Nov. 2023 15:33:29",
            "type": "Performing PXE boot",
            "description": ""
        },
        {
            "username": "unknown",
            "node": "yw3r3p",
            "hostname": "client",
            "id": 8339,
            "level": "ERROR",
            "created": "Thu, 23 Nov. 2023 15:26:20",
            "type": "Marking node failed",
            "description": "Node operation 'Deploying' timed out after 5 minutes."
        },

After the deployment failure, the boot happens and I can see the Loading ephemeral event.
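
To separate the two cases in an event dump like the ones above, the events can be filtered for "Marking node failed" entries whose description mentions a missing boot image. This is a small sketch that assumes the events are already available as a JSON list. How they were retrieved is not shown in this thread, and the function name is made up for the example.

```python
import json


def missing_boot_image_events(events):
    """Return failure events whose description reports a missing boot image."""
    return [
        event
        for event in events
        if event["type"] == "Marking node failed"
        and event["description"].startswith("Missing boot image")
    ]


# Events copied from the CentOS run quoted above (newest first).
events = json.loads("""
[
  {"username": "unknown", "node": "yw3r3p", "hostname": "client",
   "id": 7974, "level": "ERROR", "created": "Thu, 23 Nov. 2023 14:36:33",
   "type": "Marking node failed",
   "description": "Missing boot image centos/amd64/ga-20.04/centos70."},
  {"username": "unknown", "node": "yw3r3p", "hostname": "client",
   "id": 7972, "level": "INFO", "created": "Thu, 23 Nov. 2023 14:36:33",
   "type": "Performing PXE boot", "description": ""},
  {"username": "unknown", "node": "yw3r3p", "hostname": "client",
   "id": 7961, "level": "ERROR", "created": "Thu, 23 Nov. 2023 14:35:20",
   "type": "Marking node failed",
   "description": "Node operation 'Deploying' timed out after 5 minutes."}
]
""")

for event in missing_boot_image_events(events):
    print(event["id"], event["created"], event["description"])
```

Only event 7974 is selected here. The timeout event 7961 is also a "Marking node failed" entry, but it is the expected result of interrupting the deployment.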

Alberto Donato (ack) wrote :

I see a "Missing boot image centos/amd64/ga-20.04/centos70." message. Do you have the official CentOS 7 image downloaded as well?

Also can you please provide the exact MAAS version you're using?

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Joao Andre Simioni (jasimioni) wrote :

The CentOS image was downloaded using the WebUI options and can be deployed. It's the official one.

If I interrupt the installation, the machine goes into the "Failed deployment" state,
and subsequent boots show the "invalid magic number" error. The "Missing boot image" event is generated before that error appears.

Here is the output of maas $PROFILE boot-resources read:

https://paste.ubuntu.com/p/z6jY9kBgZR/

I believe that message is generated by the tftp daemon here:

https://github.com/maas/maas/blob/master/src/provisioningserver/rackdservices/tftp.py#L211

Because centos and custom images don't have a commissioning image, only an xinstall one,
and purpose on the next boot is set to commissioning.

https://github.com/maas/maas/blob/master/src/provisioningserver/rackdservices/tftp.py#L289

This problem was reported in MAAS 3.2.9/debs and my reproducer is using MAAS 3.3.4/debs.

I just launched a new instance with MAAS 3.4.0~rc2 using snaps and I can reproduce the issue.

Alberto Donato (ack) wrote :

Could you please provide the exact steps used for the CentOS image?
There's a bit of a mixture of CentOS- and Ubuntu-related commands in the thread.

Basically:
- which command was used to upload the custom centos image?
- what's the output from the boot where the issue happens?

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Joao Andre Simioni (jasimioni) wrote :

Hi Alberto,

I initially flagged the bug with a custom image because this is how it was reported in the production environment (using a custom image).

While working to provide you with the extra data requested, I noticed the same issue is seen with the official CentOS images. There is no upload involved in this scenario.

Steps to reproduce with CentOS image:

- Install a fresh MAAS environment
- Using the WebUI go to Images, Other Images, and Enable the CentOS 7 Image
- Enlist and commission a machine. I'm using a UEFI x86 QEMU VM in the reproducer.
- Start deploying the machine with the CentOS image, but interrupt the installation outside MAAS (I run virsh destroy on the VM)
- After the Deployment timeout (usually 30 minutes, as defined by node_timeout), the machine will go to "Failed deployment" state
- Power on the VM - it'll get stuck in the "invalid magic number" screen, and the ephemeral grub menu will show the wrong path with the no-such-image URL

Alberto Donato (ack) wrote :

So, given that the deployment was interrupted, MAAS is correctly setting the machine as "deployment failed" after the timeout.

The GRUB configuration that's provided after that is indeed quite weird (it seems to be broken for any non-Ubuntu distro), but the behaviour in this case is to be considered unspecified.

If you would like to propose a different behaviour for this case, could you please open a discussion on our Discourse (https://discourse.maas.io/c/features/) as a feature request?

Changed in maas:
status: Incomplete → Invalid