MAAS

cloud-init sometimes fails to run the part-001 script

Bug #1273296 reported by Björn Tillenius on 2014-01-27

This bug report is a duplicate of: Bug #1237215: maas and curtin do not indicate failure reasonably.

This bug affects 1 person

Affects   Status       Importance   Assigned to   Milestone
MAAS      Incomplete   Undecided    Unassigned

Bug Description

I deployed 12 machines using the maas juju provider, and one of those machines failed to start juju. Looking in the cloud-init log for that machine, it said that the part-001 script failed to run, error code 3. Unfortunately, I don't have the log file anymore, but I'll attach one when I run into this next time. There was no other error message indicating what failed.

I tried to run /var/lib/cloud/instance/scripts/part-001 manually to find out how it failed, but that time it succeeded, rebooting the machine and deploying the service.

rvba asked me to attach the result of 'sudo maas dumpdata metadataserver.NodeCommissionResult'. The node that failed was maas-1-09.

ii maas 1.4+bzr1820+ all Ubuntu MAAS Server
ii maas-cli 1.4+bzr1820+ all Ubuntu MAAS Client Tool
ii maas-cluster-c 1.4+bzr1820+ all Ubuntu MAAS Cluster Controller
ii maas-commissio 0.4+bzr36-0u all MAAS commissioning tools.
ii maas-common 1.4+bzr1820+ all Ubuntu MAAS Server
ii maas-dhcp 1.4+bzr1820+ all Ubuntu MAAS Server - DHCP Configu
ii maas-dns 1.4+bzr1820+ all Ubuntu MAAS Server - DNS configur
ii maas-enlist 0.4+bzr36-0u amd64 MAAS enlistment tool
ii maas-region-co 1.4+bzr1820+ all Ubuntu MAAS Server
ii maas-region-co 1.4+bzr1820+ all Ubuntu MAAS Server
ii python-django- 1.4+bzr1820+ all Ubuntu MAAS Server - (django file
ii python-maas-cl 1.4+bzr1820+ all Ubuntu MAAS API Client - (python
ii python-maas-pr 1.4+bzr1820+ all Ubuntu MAAS Server

Revision history for this message

Björn Tillenius (bjornt) wrote on 2014-01-27:

node-commmision-result (1.3 MiB, text/html)

Julian Edwards (julian-edwards) wrote on 2014-01-28:

cloud-init is nothing to do with the commissioning cycle. Can you please attach the logs that show the part-001 failure? I need to ascertain whether it's a cloud-init bug or a juju preseed bug.

Either way it's not a maas bug.

Changed in maas:
status:	New → Incomplete

Björn Tillenius (bjornt) wrote on 2014-01-28:

cloud-init.log (54.0 KiB, text/plain)

Björn Tillenius (bjornt) wrote on 2014-01-28:

part-001 (32.6 KiB, text/plain)

Björn Tillenius (bjornt) wrote on 2014-01-28:

maas.log (2.3 KiB, text/plain)

Björn Tillenius (bjornt) wrote on 2014-01-28:

access.log (4.5 MiB, text/plain)

Björn Tillenius (bjornt) wrote on 2014-01-28:

error.log (305 bytes, text/plain)

Raphaël Badin (rvb) wrote on 2014-01-28:

The attached script "part-001" is the curtin script responsible for installing the node. I'm going to ping smoser as he might have an idea on how to debug this further.

Scott Moser (smoser) wrote on 2014-01-28:

I looked at the console log of the system that failed installation and found the output below. It would also have appeared in /var/log/cloud-init-output.log.

I think the issue here is that something had the first disk busy (possibly mounted), so the BLKRRPART ioctl could not complete.

Cloud-init v. 0.7 running 'modules:final' at Tue, 28 Jan 2014 08:34:12 +0000. Up 16.87 seconds.
failed to partition /dev/sda [
Disk /dev/sda: 364801 cylinders, 255 heads, 63 sectors/track

sfdisk: ERROR: sector 0 does not have an msdos signature
/dev/sda: unrecognized partition table type
Old situation:
No partitions found
New situation:
Units = sectors of 512 bytes, counting from 0

Device Boot Start End #sectors Id System
/dev/sda1 * 2048 5860533167 5860531120 83 Linux
/dev/sda2 0 - 0 0 Empty
/dev/sda3 0 - 0 0 Empty
/dev/sda4 0 - 0 0 Empty
Successfully wrote the new partition table

Re-reading the partition table ...
BLKRRPART: Device or resource busy
The command to re-read the partition table failed.
Run partprobe(8), kpartx(8) or reboot your system now,
before using mkfs
If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)]
Unexpected error while running command.
Command: ('partition', '/dev/sda')
Exit code: 1
Reason: -
Stdout: ''
Stderr: ''
Unexpected error while running command.
Command: ['curtin', 'block-meta', 'simple']
Exit code: 3
Reason: -
Stdout: ''
Stderr: ''
2014-01-28 08:34:13,528 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [3]
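
One way to check the "possibly mounted" guess on a node is to look for the disk or its partitions in /proc/mounts. The following is only a sketch: the function name `mounts_using` is made up for this example, it only recognises partition names of the form device plus digits (such as /dev/sda1), and a mount is just one of several things that can keep a disk busy (swap, LVM or md holders are not covered).

```python
def mounts_using(device, mounts_text):
    """Return (source, mountpoint) pairs for the device or its partitions.

    mounts_text is the content of /proc/mounts (one mount per line,
    whitespace-separated fields: source, mountpoint, fstype, ...).
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        source, mountpoint = fields[0], fields[1]
        suffix = source[len(device):]
        # Match the whole disk, or the disk name followed only by digits,
        # so /dev/sda matches /dev/sda1 but not /dev/sdab1.
        if source == device or (source.startswith(device) and suffix.isdigit()):
            hits.append((source, mountpoint))
    return hits
```

On a live system this could be called as `mounts_using("/dev/sda", open("/proc/mounts").read())`; an empty list means no mount was found, not that the disk is free.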

Julian Edwards (julian-edwards) wrote on 2014-01-29:

#10

Based on this feedback I've duped to a generic "not reporting curtin failures" bug.

Björn Tillenius (bjornt) wrote on 2014-01-29: Re: [Bug 1273296] Re: cloud-init sometimes fails to run the part-001 script

#11

On Wed, Jan 29, 2014 at 01:17:38AM -0000, Julian Edwards wrote:
> *** This bug is a duplicate of bug 1237215 ***
> https://bugs.launchpad.net/bugs/1237215
>
> Based on this feedback I've duped to a generic "not reporting curtin
> failures" bug.

Considering the fix is to ssh to the node and re-run the script, I would
much rather have MAAS/curtin do that for me, than telling me that
something went wrong.

Julian Edwards (julian-edwards) wrote on 2014-01-29:

#12

On Wednesday 29 Jan 2014 06:23:14 you wrote:
> Considering the fix is to ssh to the node and re-run the script, I would
> much rather have MAAS/curtin do that for me, than telling me that
> something went wrong.

I think that's the wrong thing to do - if a script went wrong it needs to be
reported, not blindly retried. The report could include the option to retry
for you though.

Björn Tillenius (bjornt) wrote on 2014-01-29:

#13

On Wed, Jan 29, 2014 at 06:43:25AM -0000, Julian Edwards wrote:
> *** This bug is a duplicate of bug 1237215 ***
> https://bugs.launchpad.net/bugs/1237215
>
> On Wednesday 29 Jan 2014 06:23:14 you wrote:
> > Considering the fix is to ssh to the node and re-run the script, I would
> > much rather have MAAS/curtin do that for me, than telling me that
> > something went wrong.
>
> I think that's the wrong thing to do - if a script went wrong it needs to be
> reported, not blindly retried. The report could include the option to retry
> for you though.

Agreed, it shouldn't just blindly retry. I'm not proposing re-running
the whole script. The fix should go to the part-001 script, to retry the
specific part that failed in this case. I.e. if the disk is busy, wait a
bit and retry.
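
The "wait a bit and retry" idea could look roughly like the sketch below. This is not curtin's actual code: the helper names `retry_when_busy` and `reread_partition_table`, the attempt count and the delay are all assumptions made for illustration. The only external command used is `blockdev --rereadpt` from util-linux, which asks the kernel to re-read the partition table. Note that the sketch treats any non-zero exit status as "busy", which a real fix would need to narrow down.

```python
import subprocess
import time


def retry_when_busy(action, attempts=5, delay=1.0, sleep=time.sleep):
    """Call action() until it returns True or the attempts run out.

    Returns the number of attempts used. Raises RuntimeError if every
    attempt fails, so the failure is still reported rather than hidden.
    """
    for attempt in range(1, attempts + 1):
        if action():
            return attempt
        if attempt < attempts:
            sleep(delay)  # give whatever holds the disk time to let go
    raise RuntimeError("device still busy after %d attempts" % attempts)


def reread_partition_table(device):
    """Ask the kernel to re-read the partition table of device."""
    result = subprocess.run(["blockdev", "--rereadpt", device])
    return result.returncode == 0
```

A caller would use it as `retry_when_busy(lambda: reread_partition_table("/dev/sda"))`. Only the one step that failed is retried, and a persistent failure still surfaces as an error, which fits both positions in this thread.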

Julian Edwards (julian-edwards) wrote on 2014-01-29:

#14

On Wednesday 29 Jan 2014 06:58:26 you wrote:
> Agreed, it shouldn't just blindly retry. I'm not proposing re-running
> the whole script. The fix should go to the part-001 script, to retry the
> specific part that failed in this case. I.e. if the disk is busy, wait a
> bit and retry.

Yep, agree. That is all buried in cloud-init/curtin so I suspect that action
needs to be filed as a bug there, in this case.

Mark Duncan (eattheapple) wrote on 2014-04-08:

#15

I've been testing in 14.04 and this still seems to be a problem. I have the exact same issue with the part-001 script failing. I have tried deploying both saucy and trusty and both fail at the same spot.

Mark Duncan (eattheapple) wrote on 2014-04-08:

#16

I'm sorry, perhaps mine is not failing at the same spot. Mine actually partitions the drive and installs the system, but something else fails, leaving it stuck at a login prompt. I can SSH into the system and reboot it, and as long as I have it boot from the hard disk first, it does boot, although GRUB is improperly configured. I have already started reinstalling with the default installer, but I will try to post a log tomorrow when I try again.
