Ubuntu
syslinux package

Nodes cannot boot after a storage disk replacement

Bug #1488594 reported by james beedy on 2015-08-25

This bug affects 2 people

Affects            Status   Importance  Assigned to  Milestone
MAAS               Invalid  Undecided   Unassigned
syslinux (Ubuntu)  New      Undecided   Unassigned

Bug Description

I'm experiencing this issue when I replace any OSD disk on any Ceph storage node and then reboot it. Immediately after the node PXE boots, it hangs at a "booting local disk" message and neither times out nor boots. A workaround I've found is to momentarily disable MAAS from managing the network after powering on the node whose disk has been replaced. Once the node's PXE boot times out and it resorts to booting from local disk into the OS, I re-enable MAAS management on that network so the node gets an IP, continues the boot process, and eventually boots successfully.

It would be nice to get some feedback on what is going on here, and also a best practice for what/how to proceed in the case when you need to swap storage disks.

Thanks!

maas.log <-- http://paste.ubuntu.com/12193844/

clusterd.log <-- http://paste.ubuntu.com/12193842/

maas - 1.8.0+bzr4001-0ubuntu2~trusty1
trusty - 14.04.3

james beedy (jamesbeedy) wrote on 2015-08-25:

IMG_6175.jpg (1.9 MiB, image/jpeg)

Here is a shot of the console of a node experiencing the issue.

description: updated

Blake Rouse (blake-rouse) wrote on 2015-08-25:

Does it just sit at the console prompt? Or does an error appear?

It looks like the BIOS, or PXELINUX for that matter, might enumerate the block devices in a different order, so that the first disk is no longer the boot disk.

Changed in maas:
status:	New → Incomplete

james beedy (jamesbeedy) wrote on 2015-08-25:

It just sits.....I let her sit overnight even....no timeout....nothing.

Blake Rouse (blake-rouse) on 2015-08-25

Changed in maas:
status:	Incomplete → Confirmed
milestone:	none → next

james beedy (jamesbeedy) wrote on 2015-08-25:

Update:

After a few reboots and swapping back and fourth of storage disks....the node I'm experimenting on now neglects to boot with the original disk too.

james beedy (jamesbeedy) wrote on 2015-08-25:

sp *forth

james beedy (jamesbeedy) wrote on 2015-08-25:

From what I can gather, this issue seems to exist because of stale entries in maasserver_physicalblockdevice and/or maasserver_blockdevice that are inconsistent with the current resources/state of the node.

Might I enquire if/where maas might verify node resources upon power on?

Andres Rodriguez (andreserl) wrote on 2015-08-25:

Hi James,

Have you tried re-commissioning your node? Re-commissioning should update the storage model. MAAS does not yet provide the ability to update the information about disks/NICs of currently deployed devices; however, if you were to re-commission and re-deploy, this would potentially be fixed.

Blake Rouse (blake-rouse) wrote on 2015-08-25:

MAAS does not affect the boot process at all. It just tells PXELINUX to boot from the first disk. MAAS does not identify which disk is the first disk; that is done by the BIOS at boot time.

james beedy (jamesbeedy) wrote on 2015-08-25:

Andres -

Yeah....a re-commissioning will solve the issue....to the extent that I could essentially get my node back and re-deploy ceph-osd and nova-compute IF juju would properly destroy the associated services and machine....but unfortunately no amount or combination of {service, unit, machine}-destroy commands will get rid of the unit, services or machine (see http://paste.ubuntu.com/12194988/).

This is all beside the point that I only need to replace a single disk. It is a far greater task to evacuate the host, re-commission, and redeploy and configure all services, when essentially all I should need to do is swap a disk and run a series of < 5 ceph commands to be back up from a disk failure.

Blake - The disks' positions in the BIOS and on the HBA card do not change.

james beedy (jamesbeedy) wrote on 2015-08-25:

#10

How might this functionality be implemented? Possibly a resource diff upon power-on, followed by some kind of conditional/partial commissioning so that a node's recorded resources stay current? Should I file a feature request for this? Hmmmm, I have a feeling I'm barking up the wrong tree...per^^, but I can't seem to make sense of this any other way.

Thanks
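
A minimal sketch of what such a "resource diff upon power-on" might look like, assuming disks are identified by serial number. The `BlockDevice` class, the `diff_block_devices` helper, and the sample inventories are all hypothetical illustrations; this is not how MAAS stores or compares block devices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockDevice:
    """One disk, as recorded at commissioning or as seen at power-on (hypothetical)."""
    name: str    # kernel name, e.g. "sdb"
    serial: str  # hardware serial number, used here as the identity key
    size: int    # capacity in bytes


def diff_block_devices(recorded, detected):
    """Compare two disk inventories by serial number.

    Returns (removed, added): disks in the record that are no longer
    present, and disks that are present but missing from the record.
    """
    recorded_by_serial = {d.serial: d for d in recorded}
    detected_by_serial = {d.serial: d for d in detected}
    removed = [d for s, d in recorded_by_serial.items() if s not in detected_by_serial]
    added = [d for s, d in detected_by_serial.items() if s not in recorded_by_serial]
    return removed, added


# Hypothetical inventories: the OSD disk "sdb" was swapped for a new one.
recorded = [
    BlockDevice("sda", "BOOT-0001", 120_000_000_000),
    BlockDevice("sdb", "OSD-OLD-01", 4_000_000_000_000),
]
detected = [
    BlockDevice("sda", "BOOT-0001", 120_000_000_000),
    BlockDevice("sdb", "OSD-NEW-07", 4_000_000_000_000),
]

removed, added = diff_block_devices(recorded, detected)
print("stale records:", [d.serial for d in removed])
print("unrecorded disks:", [d.serial for d in added])
```

Keying on the serial number rather than the device name means a swapped disk in the same slot still shows up as one stale record plus one unrecorded disk, which is the situation described in this report.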

james beedy (jamesbeedy) wrote on 2015-08-25:

#11

Update

I was able to bring my juju env current and finally delete the machine and services from the environment with a combination of "juju resolved <unit>" and "juju destroy-{unit, machine, service} --force <service,machine,unit>".

Gavin Panella (allenap) wrote on 2016-01-25:

#12

Even when a node has been deployed, the node still attempts to PXE boot
from MAAS each time it's rebooted. MAAS knows it should boot locally and
gives the following configuration to PXELINUX:

  DEFAULT local

  LABEL local
    LOCALBOOT 0

It appears that this does not do the right thing for your hardware. Put
another way, it does not do the same thing as your machine's BIOS does
when the network is unavailable.

I suspect this is a bug in PXELINUX and/or your hardware. There may be
something that MAAS can do to help, but I don't think it's the cause, so
I'll target this bug at PXELINUX and mark it Invalid in MAAS for now.
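
For anyone wanting to reproduce the configuration outside MAAS, the snippet quoted above can be generated with a few lines of Python. This is only a sketch: `render_local_boot_config` is a hypothetical helper, not MAAS code, and the `localboot_type` parameter exists only so that other values documented by SYSLINUX could be tried. Whether any other value behaves better on the affected hardware is unknown.

```python
def render_local_boot_config(localboot_type=0):
    """Return a PXELINUX config that hands control to a local boot.

    With the default type of 0 this reproduces the configuration
    quoted in this comment.
    """
    lines = [
        "DEFAULT local",
        "",
        "LABEL local",
        f"  LOCALBOOT {localboot_type}",
    ]
    return "\n".join(lines) + "\n"


config = render_local_boot_config()
print(config, end="")
```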

Changed in maas:
status:	Confirmed → Invalid

james beedy (jamesbeedy) wrote on 2016-01-25:

#14

I really appreciate the input, everyone. I guess I was a little overwhelmed
dealing with a few different issues at once ... I didn't mean to place the
blame on MAAS. That being said, node disk replacement under the direction
of MAAS is still a rough process for me. I understand that PXELINUX/BIOS
may be the root cause of my issue ... I guess I felt like MAAS had more to
do with this because MAAS cannot recognize a new disk after replacement
without recommissioning. Despite the boot issue, I would still need to
take the node down and recommission it for MAAS to take inventory of the
new disk after a replacement. Is this being looked into for 1.9?

Thanks again,

James

Blake Rouse (blake-rouse) wrote on 2016-01-25:

#15

If you know which is the old disk and you have the full information for the new disk, you could use the API to update that disk with all the new disk information. You would need to be very sure about the data or the deployment would fail, which is why it's recommended to re-commission.

maas my-maas-session block-device update 1 model= serial= size= block_size=
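
Because a wrong or empty field could break a later deployment, it may help to assemble the command above programmatically and refuse to proceed when a value is missing. The sketch below only builds the argument list in the same shape as the command in this comment; it does not run it. The `build_block_device_update` helper and the sample values (model, serial, size) are hypothetical, and the exact arguments accepted by the CLI should be checked against the MAAS version in use.

```python
def build_block_device_update(profile, device_id, model, serial, size, block_size):
    """Build the argv for the block-device update command shown above.

    Raises ValueError if any field is empty, since incomplete disk data
    is exactly what the comment warns against.
    """
    fields = {
        "model": model,
        "serial": serial,
        "size": size,
        "block_size": block_size,
    }
    missing = [key for key, value in fields.items() if value in (None, "")]
    if missing:
        raise ValueError("missing disk fields: " + ", ".join(missing))
    argv = ["maas", profile, "block-device", "update", str(device_id)]
    argv += [f"{key}={value}" for key, value in fields.items()]
    return argv


# Hypothetical values for the replacement disk.
argv = build_block_device_update(
    "my-maas-session", 1,
    model="EXAMPLE-MODEL", serial="OSD-NEW-07",
    size=4000787030016, block_size=512,
)
print(" ".join(argv))
```

The resulting list could be passed to `subprocess.run(argv, check=True)` once the values have been verified against the physical disk.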

Björn Tillenius (bjornt) on 2021-08-24

Changed in maas:
milestone:	next → none
