Unbootable system after installation

Bug #1671605 reported by Ante Karamatić on 2017-03-09
This bug affects 6 people
Affects   Status   Importance   Assigned to   Milestone
MAAS               High         Unassigned
curtin             Undecided    Unassigned

Bug Description

On 2.2b2, installing nodes sometimes ends with an unbootable system. Commissioning goes just fine, the installation finishes, and then on the next boot (into the installed system) the console shows:

Intel(R) Boot Agent XE v2.3.11
Copyright (C) 1997-2013, Intel Corporation

CLIENT MAC ADDR: 2C 60 0C CD 0F CF GUID: 7F82C921 3D4A 11E5 A482 2C600CCD0FD1
CLIENT IP: 172.16.7.22 MASK: 255.255.255.0 DHCP IP: 172.16.7.2
GATEWAY IP: 172.16.7.1

PXELINUX 6.03 PXE 20151222 Copyright (C) 1994-2014 H. Peter Anvin et al
Booting local disk ...
WARN: No MBR magic, treating disk as raw.
Booting...

Rebooting the system and booting it from the disk boots the installed system just fine.

It seems that the PXE image is handing off booting to one of the disks on the machine that doesn't have a system installed. Installation in this case is done on sdf (MAAS selected this disk automatically, without user interaction), and it seems that PXE hands off booting to sda or some other disk.
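
The "No MBR magic" warning is printed when the first sector of the disk being chain-loaded does not end in the 0x55 0xAA boot signature. A minimal sketch for checking which disks carry that signature, run from the installed system or a rescue environment, could look like the following. The function name has_mbr_magic and the /dev/sd? glob are illustrative assumptions, not part of MAAS or curtin.

```shell
#!/bin/sh
# has_mbr_magic DEVICE_OR_IMAGE  (hypothetical helper)
# Succeeds (exit 0) if bytes 510-511 of the first sector are 0x55 0xAA.
has_mbr_magic() {
    # Read exactly two bytes at offset 510 and render them as hex.
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Example loop (reading real block devices needs root):
#   for d in /dev/sd?; do
#       if has_mbr_magic "$d"; then
#           echo "$d: MBR signature present"
#       else
#           echo "$d: no MBR signature"
#       fi
#   done
```

A disk that reports no signature here is a plausible candidate for the one PXELINUX ended up treating as raw, although the BIOS disk order still has to be confirmed separately.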

Ante Karamatić (ivoks) wrote :
Andres Rodriguez (andreserl) wrote :

Hi Ante,

This seems like the root device is different from the boot device in the BIOS. In the MAAS UI, did you select the 'boot' option for the disk that is the default disk in the BIOS?

Changed in maas:
status: New → Incomplete
Ante Karamatić (ivoks) wrote :

I did not select anything. I commissioned the nodes and then hit deploy. MAAS automatically selected the root device.

When I try to change root device in UI, MAAS creates GPT partition table and I can't tell it to create MBR. This then again results in unbootable system, of course.

Chris Gregan (cgregan) wrote :

We've been tracking what seems like the same bug here: https://bugs.launchpad.net/juju-core/+bug/1670499

tags: added: cdo-qa-blocker
Chris Gregan (cgregan) wrote :

Please note that this issue may be fixed in

MAAS Version 2.2.0 (beta3+bzr5815)

Andres Rodriguez (andreserl) wrote :

@Ante,

As stated before, it sounds like the 'Boot' device selected in the BIOS and the first disk identified by the OS are different. In other words:

1. The OS identifies hdX as sda, which is /not/ set as the boot device in the BIOS.
2. The BIOS has hdY as the boot device, identified as sd[b...x].

So, while MAAS installs in 'sda', the BIOS attempts to boot from 'sd[b...x]'. For that, you can select in MAAS which device is the *boot* device.

Ante, in the UI, can you please set the 'Boot' option to a disk other than 'sda', namely the disk that the BIOS is booting from?
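
Kernel names such as sda or sdf are assigned in detection order and need not match the BIOS boot order, so it can help to identify disks by serial number instead. The sketch below is an illustration only: disk_for_serial is a hypothetical helper that reads "NAME SERIAL" pairs (the format produced by `lsblk -d -n -o NAME,SERIAL`) and prints the kernel name that currently carries a given serial.

```shell
#!/bin/sh
# disk_for_serial SERIAL  (hypothetical helper)
# Reads "NAME SERIAL" pairs on stdin and prints the kernel name
# whose serial matches the argument.
disk_for_serial() {
    awk -v want="$1" '$2 == want { print $1 }'
}

# Typical use on a live system (lsblk is part of util-linux);
# replace <serial> with the serial of the disk the BIOS boots from:
#   lsblk -d -n -o NAME,SERIAL | disk_for_serial <serial>
```

The printed name is only valid for the current boot, since the kernel may enumerate the disks differently next time.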

Andres

If you reread the report, you'll notice that MAAS installs to sdf (BIOS boots
from sda), and selecting anything else is impossible because MAAS decides
to use GPT instead of MBR.

The problem doesn't exist on 2.1. This is a regression.

--
Ante Karamatić
<email address hidden>
Canonical

Ante Karamatić (ivoks) wrote :

Just noticed I didn't write anywhere that MAAS installs to sdf. It creates
an MBR there.

Selecting other disks as boot creates GPT and makes the system unbootable.


Ante Karamatić (ivoks) wrote :

Selecting sda as boot device, while leaving sdf to be root device (as selected by MAAS) also doesn't work:

 + grub-install /dev/sda
 Installing for i386-pc platform.
 grub-install: error: unable to identify a filesystem in hostdisk//dev/sda; safety check can't be performed.
 + exit
 failed to install grub!

Andres Rodriguez (andreserl) wrote :

Looking at the config I see the following, which seems to be the correct configuration.

  - grub_device: true
    id: sda
    model: INTEL SSDSC2BX48
    name: sda
    ptable: msdos
    serial: BTHC63800AK2480MGN
    type: disk
    wipe: superblock
  - device: sdf
    id: sdf-part1
    name: sdf-part1
    number: 1
    offset: 4194304B
    size: 1000198897664B
    type: partition
    uuid: c4fbf7b8-58fe-44a7-b7dc-697d9968ae16
    wipe: superblock
  - fstype: ext4
    id: sdf-part1_format
    label: root
    type: format
    uuid: a5940775-aaec-4b0c-a10f-eb98f65122be
    volume: sdf-part1
  - device: sdf-part1_format
    id: sdf-part1_mount
    path: /
    type: mount

@Ante, please also attach the full installation log for the curtin developers.

Ante Karamatić (ivoks) wrote :

In this run I set sdb as root device. I kept sda as boot device.

Blake Rouse (blake-rouse) wrote :

Ante,

Change the boot device in the UI to be the correct boot disk. Then over the API reset the storage layout.

maas admin machine set-storage-layout <node-id> storage_layout=flat

Ryan Harper (raharper) wrote :

There's no boot partition; something has to hold /boot.

On Tue, Mar 14, 2017 at 5:57 AM, Ante Karamatić <
<email address hidden>> wrote:

> In this run I set sdb as root device. I kept sda as boot device.
>
> ** Attachment added: "curtin.log"
> https://bugs.launchpad.net/maas/+bug/1671605/+attachment/4837513/+files/curtin.log

Scott Moser (smoser) wrote :

The config shown in comment 10 and in the 'curtin data' attachment is bad.
It says that 'sda' (BTHC63800AK2480MGN) is the grub drive (boot disk), but also that it should be wiped and not partitioned at all.
The installation is done to 'sdf' (1000198897664B), which is partitioned with a DOS partition table.

Note that these letters don't mean anything; what matters is the serial on the device.

I'm pretty sure that is just a busted configuration. Grub cannot install to an unpartitioned disk, so grub installation fails, and curtin reports this failure.
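
The condition described here (a disk marked grub_device: true that no partition entry refers to) can be spotted mechanically. The following is a rough sketch, not something curtin or MAAS does: grub_disks_without_partitions is a hypothetical helper, and the awk code is not a YAML parser. It assumes the flat "- key: value" list layout of the excerpt shown earlier in this thread.

```shell
#!/bin/sh
# grub_disks_without_partitions CONFIG_FILE  (hypothetical helper)
# Prints the id of every "type: disk" entry with "grub_device: true"
# that no "type: partition" entry names in its "device:" field.
grub_disks_without_partitions() {
    awk '
        # Record the entry collected so far, then reset for the next one.
        function commit() {
            if (etype == "disk" && grub == "true") grubdisk[id] = 1
            if (etype == "partition") haspart[device] = 1
            id = etype = grub = device = ""
        }
        # A leading "- " starts a new list entry.
        /^[ \t]*-[ \t]/ { commit(); sub(/^[ \t]*-[ \t]+/, "") }
        {
            sub(/^[ \t]+/, "")
            n = index($0, ":")
            if (n == 0) next
            key = substr($0, 1, n - 1)
            val = substr($0, n + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == "id") id = val
            else if (key == "type") etype = val
            else if (key == "grub_device") grub = tolower(val)
            else if (key == "device") device = val
        }
        END {
            commit()
            for (d in grubdisk) if (!(d in haspart)) print d
        }
    ' "$1"
}
```

Run against the excerpt from this report, the helper would print sda, which matches the grub-install failure on that disk.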

Andres Rodriguez (andreserl) wrote :

Hey Ante,

Scott looked at the issue and confirmed there's a missing piece here. It seems I also misled you.

For 'sda' to work as a 'boot' device, it needs a partition. In the meantime can you do:

1. create an empty partition in 'sda'.
2. select 'sda' as 'boot'.

That should allow grub to install on 'sda'. I'll check with my team whether we should be handling this automatically, although that may not be desirable in cases where other partitions are created on the 'boot' device.

Scott Moser (smoser) wrote :

To demonstrate the issue, I booted an OpenStack VM with 2 disks (vda, vdb). It is booted via BIOS (not UEFI).

## wipe the disk
$ disk="/dev/vdb"
$ sudo umount $disk
$ sudo dd if=/dev/zero of=$disk bs=1M count=1
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00783073 s, 134 MB/s

$ sudo udevadm settle

## attempt installation of grub to /dev/vdb
$ sudo grub-install "$disk"
Installing for i386-pc platform.
grub-install: error: unable to identify a filesystem in hostdisk//dev/vdb; safety check can't be performed.

## now partition it and try
$ (echo unit: sectors; echo label: dos; echo 2048,) | sudo sfdisk --force $disk
Checking that no-one is using this disk right now ... OK

Disk /dev/vdb: 40 GiB, 42949672960 bytes, 83886080 sectors
...
Device Boot Start End Sectors Size Id Type
/dev/vdb1 2048 83886079 83884032 40G 83 Linux
...

$ sudo udevadm settle
$ sudo grub-install $disk
Installing for i386-pc platform.
Installation finished. No error reported.

Ante Karamatić (ivoks) wrote :

Uhm... I'm quite sure it doesn't need a partition, but it does need an MBR.

Changed in maas:
milestone: none → 2.2.0
importance: Undecided → High
Chris Gregan (cgregan) wrote :

I'd like to bump this to Critical as it blocks deploys in our CI and is now a Customer blocker. This will gate the release if it is not fixed.

Scott Moser (smoser) wrote :

I would have thought this would work, but grub is definitely insisting on there being a partition on the disk that you grub-install to, not just a partition table.

See example here... 'parted' is what curtin uses for partitioning in msdos.

## wipe the disk (first 2M)
$ disk="/dev/vdb"
$ sudo dd if=/dev/zero of=$disk bs=1M count=2
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00783073 s, 134 MB/s
$ sudo udevadm settle

## show failure of installation with no partition table
$ sudo grub-install "$disk"
Installing for i386-pc platform.
grub-install: error: unable to identify a filesystem in hostdisk//dev/vdb; safety check can't be performed.

## put a partition table on disk, but no partitions
$ sudo parted $disk --script mklabel msdos
$ sudo blkid $disk
/dev/vdb: PTUUID="502420fa" PTTYPE="dos"

## attempt grub install
$ sudo grub-install "$disk" ; echo $?
Installing for i386-pc platform.
grub-install: error: unable to identify a filesystem in hostdisk//dev/vdb; safety check can't be performed.
1

## Try harder
$ sudo grub-install --skip-fs-probe $disk ; echo $?
Installing for i386-pc platform.
grub-install: warning: Attempting to install GRUB to a partitionless disk or to a partition. This is a BAD idea..
grub-install: error: embedding is not possible, but this is required for cross-disk install.
1

Ryan Harper (raharper) wrote :

In particular, unless one creates a real partition, the 'MBR gap' that grub
uses cannot be calculated. The definition of the gap is the space between
the end of the MBR and the start of the first partition.

Without a first partition, grub cannot calculate this space.
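
As a worked example of that definition, the sketch below computes the gap from the start sector of the first partition. The helper name mbr_gap_bytes is hypothetical, and a 512-byte sector size is assumed unless a second argument says otherwise.

```shell
#!/bin/sh
# mbr_gap_bytes FIRST_PARTITION_START_SECTOR [SECTOR_SIZE]  (hypothetical helper)
# Sector 0 holds the MBR itself, so the gap covers sectors 1 .. start-1.
mbr_gap_bytes() {
    start=$1
    sector_size=${2:-512}   # assumption: 512-byte sectors by default
    echo $(( (start - 1) * sector_size ))
}

# A first partition starting at sector 2048 leaves 2047 sectors:
#   mbr_gap_bytes 2048    # prints 1048064
```

With no partition on the disk there is no start sector to plug in, which is the situation described above.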

Ryan Harper (raharper) wrote :

Marking curtin task invalid at this time.

grub requires at least one partition to determine the MBR gap, or a bios_boot partition for GPT (UEFI systems use the /boot/efi partition to hold grub data). Grub upstream does not support using blocklists, as any updates to the filesystem may leave the system unbootable again. curtin is doing as told; please re-open if you find new information indicating that curtin needs to do something different.

Changed in curtin:
status: New → Invalid
Ante Karamatić (ivoks) wrote :

With MAAS 2.1, I do not have this problem. Whichever disk I select as boot/root, it boots just fine. I've attached outputs of get-curtin-config and 07-block-devices.out.

Ryan Harper (raharper) wrote :

The curtin config in this case consistently uses sdb; I assume that's the
disk you selected as 'boot/root'.

sdb is marked with grub_device: True and uses an msdos partition table, and
there's at least one partition on sdb (sdb-part1, which is also root).

On Fri, Mar 17, 2017 at 5:44 AM, Ante Karamatić <
<email address hidden>> wrote:

> With MAAS 2.1, I do not have this problem. Whichever disk I select as
> boot/root, it boots just fine. I've attached outputs of get-curtin-
> config and 07-block-devices.out.
>
> ** Attachment added: "2.1.tar"
> https://bugs.launchpad.net/maas/+bug/1671605/+attachment/4839337/+files/2.1.tar

Andres Rodriguez (andreserl) wrote :

I discussed this further with Ante, and he did suggest we mark this bug as 'Won't Fix', provided that it is confirmed that the install disk is different from the boot disk. That is not an issue with MAAS itself, but rather a configuration issue.

That, however, doesn't change the fact that there's still a bug for the selection of a boot device.

I have the same issue on MAAS 2.1.5. It affects all my machines with more than one disk (that is, after hardware RAID configuration). I don't understand how to work around it.

Zoltan Arnold Nagy (zoltan) wrote :

I'm hitting the same issue, although in my case even the deploy fails. This happens on a system with a single SATA SSD and NVMe drives.

Removing the NVMe drives physically fixes the issue.

The boot disk and the install disk would be the same but please note that I don't even get to deploy a system in my case.

Daniel Souza (danielsouzasp) wrote :

Hello guys,

I'm running MAAS version 2.3.5 (ubuntu1~16.04.1), and I am facing the same issue with 1 NVMe drive + 2x SATA SSD. MAAS installs the OS on nvme0n1-part1 without error, but when it tries to boot from the local disk after the PXE load, I see the error "No MBR magic, treating disk as raw". If I boot manually from the NVMe disk, the deploy process finishes normally, but it won't work on the next reboot.

Is this a MAAS bug?

Daniel Souza (danielsouzasp) wrote :

Additional info: I can see "APPEND hd0" in
/usr/lib/python3/dist-packages/provisioningserver/templates/pxe/config.local.amd64.template
Maybe we need a conditional here for these cases.
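
To illustrate what such a conditional might produce, here is a hedged sketch that prints a PXELINUX local-boot stanza with a configurable BIOS disk index instead of a fixed hd0. It is not the MAAS template: local_boot_stanza is a hypothetical helper, and the KERNEL chain.c32 line is an assumption about how the local boot is chain-loaded. Only the "Booting local disk ..." message and "APPEND hd0" come from this report.

```shell
#!/bin/sh
# local_boot_stanza [BIOS_DISK_INDEX]  (hypothetical helper)
# Prints a local-boot stanza that chain-loads BIOS disk hd<index>.
local_boot_stanza() {
    idx=${1:-0}   # default to hd0, the value seen in the template
    case $idx in
        *[!0-9]*)
            echo "disk index must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf 'LABEL local\n  SAY Booting local disk ...\n  KERNEL chain.c32\n  APPEND hd%s\n' "$idx"
}

# Example: boot the second disk in BIOS order
#   local_boot_stanza 1
```

The index counts disks in BIOS order, which need not match the order in which the installed OS names them, so choosing the right value would still require knowing the BIOS view of the disks.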

Michael Cowart (evtmcowart) wrote :

I'm seeing this same issue on 2.4. I have a server with a single NVMe drive + 8 SATA SSDs configured in software RAID. On initial commissioning, MAAS wanted to install / to one of the SATA drives. When the root/boot partitions are installed to the NVMe drive, the system will not boot after installation.

Dylan Wang (hyuwang) wrote :

I have the same issue on 2.5; it only happens on one specific server among hundreds.

When I do enlist -> commission -> config -> deploy, everything goes well.

But on that one, I did enlist -> config disk & network -> failed commission, then I tried mark broken/rescue/exit rescue/... and eventually fixed it by re-commissioning, then deploying.

Deploy works; the machine just never finishes booting from the local disk.
