commissioning fails silently if a node can't reach the region controller

Bug #1303925 reported by James Troup on 2014-04-07
Affects     Status        Importance  Assigned to     Milestone
MAAS        Fix Released  High        Julian Edwards  1.7.1
cloud-init  Triaged       Low         Unassigned      —

Bug Description

We recently had a node which completely refused to commission in MAAS.
After (literally) several man days of debugging, we figured out that
it was because the node couldn't talk to the region controller over
HTTP.

Obviously, that's ultimately our mistake/problem, but MAAS could have
been a lot better at helping us to help ourselves; currently, there's
absolutely no indication from the boot process that the HTTP
connection to the region controller is the problem.

Attached is the serial console output (from the point of boot) for the
node that was failing to commission. 91.189.94.35 is the MAAS region
controller and 91.189.88.20 is the MAAS cluster controller.

James Troup (elmo) wrote :
tags: added: canonical-is
Graham Binns (gmb) wrote :

Calling this critical since it's a costly failure state to get into, and targeting it for 14.10.

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 14.10
Changed in maas:
importance: Critical → High
Julian Edwards (julian-edwards) wrote :

James, was it hanging or shutting down after that error in the log?

Julian Edwards (julian-edwards) wrote :

MAAS is not in direct control at this point, I think cloud-init needs to do better here and have a last-ditch catch of exceptions before running a piece of failsafe code that would report something back to MAAS.

tags: added: provisioning
Scott Moser (smoser) wrote :

cloud-init is executing code that maas told it to execute.
so maas needs to tell it to execute code that has some "last ditch catch".

to be clear, cloud-init got data from maas (via kernel cmdline) that told it to get some code from the metadata server and execute it. It then executed it. That code failed. *that* is the code that needs to be more resilient. cloud-init is, by design, very much doing exactly what maas tells it to do.
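The "last ditch catch" being discussed could be sketched roughly like this: the script MAAS serves wraps the real commissioning payload and reports any exception back before dying. The REPORT_URL, function names, and payload format here are invented for illustration; this is not MAAS's actual preseed code.

```python
"""Hypothetical failsafe wrapper for a MAAS-served commissioning script."""

import json
import traceback
import urllib.request

# Assumed endpoint; a real script would take this from the preseed.
REPORT_URL = "http://region.example/MAAS/metadata/status"


def default_report(exc):
    """POST the traceback back to MAAS; swallow network errors so a
    broken report path (the very failure in this bug) can't hang us."""
    body = json.dumps({
        "result": "FAILED",
        "error": "".join(traceback.format_exception(
            type(exc), exc, exc.__traceback__)),
    }).encode()
    req = urllib.request.Request(
        REPORT_URL, data=body,
        headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=10)
    except OSError:
        # Even the report can fail; at least leave a trace on the console.
        print("could not report failure:", exc)


def run_with_failsafe(payload, report=default_report):
    """Run payload(); on any exception, report it and return False."""
    try:
        payload()
        return True
    except Exception as exc:
        report(exc)
        return False
```

The point of the design is that the outermost layer never raises: whatever the payload does, something gets reported (or at least printed) instead of the node hanging silently.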

no longer affects: cloud-init
Gavin Panella (allenap) wrote :

I assume cloud-init doesn't crash if the code it downloads from MAAS breaks... so the reason it's hanging is because the instructions about what to do next were in that downloaded, crashy piece? If so, I further assume that we therefore need to get some fail-safe command into the first user-data file that cloud-init processes; is that right?

Gavin Panella (allenap) wrote :

I made a mistake: it couldn't actually download the code. However, the question stands: what does cloud-init do if it can't download from a data source? Does it process the next directive in the user-data that it does have?

Julian Edwards <email address hidden> writes:

> James, was it hanging or shutting down after that error in the log?

It hung.

--
James

Julian Edwards (julian-edwards) wrote :

Thanks James.

Scott, cloud-init is hanging without getting any data from MAAS. It seems to me that there should be at least a last-ditch way of reporting the failure back somewhere?

This is possibly a dupe of bug 1237215

no longer affects: curtin

On 09/04/14 00:11, Gavin Panella wrote:
> I made a mistake: it couldn't actually download the code. However, the
> question stands: what does cloud-init do if it can't download from a
> data source? Does it process the next directive in the user-data that it
> does have?

There is no user data at this point though is there? It's trying to get
it from the metadata server, as MAAS just passes the URL to that on the
kernel cmd line.

I don't know if there's room on the kernel command line to add much
more. If there's something simple MAAS can do here then great, but I'm
concerned that cloud-init hangs.

Scott Moser (smoser) wrote :

[ 0.000000] Command line: nomodeset iscsi_target_name=iqn.2004-05.com.ubuntu:maas:maas-precise-12.04-amd64-20131010 iscsi_target_ip=91.189.88.20 iscsi_target_port=3260 iscsi_initiator=rubay ip=::::rubay:BOOTIF ro root=/dev/disk/by-path/ip-91.189.88.20:3260-iscsi-iqn.2004-05.com.ubuntu:maas:maas-precise-12.04-amd64-20131010-lun-1 overlayroot=tmpfs cloud-config-url=http://91.189.94.35/MAAS/metadata/latest/by-id/node-0d287828-be5e-11e3-a0d3-0019bbccd75c/?op=get_preseed log_host=91.189.94.35 log_port=514 console=tty0 console=ttyS1,38400 nosplash initrd=amd64/generic/precise/commissioning/initrd.gz BOOT_IMAGE=amd64/generic/precise/commissioning/linux BOOTIF=01-2c-44-fd-81-23-e8

I did wrongly diagnose this previously. cloud-init could / should warn more loudly that it couldn't get the url for 'cloud-config-url'.

However, here is what happened as I understand it:

1. The maas cluster controller sent the above kernel command line to a commissioning node (enlistment or commissioning wouldn't really matter; the ephemeral environment is the key).
2. The node was unable to reach the maas region controller at 91.189.94.35.
   In the happy path, cloud-init would have fetched that url and stored its content in /etc/cloud/cloud.cfg.d/ . The content of that url would then have told cloud-init to:
   a. only enable the maas datasource (disabling the ec2 datasource)
   b. attempt to get data from the maas datasource on the region controller.
3. Since cloud-init got no configuration from the kernel cmdline, it went on its way looking for all configured datasources, which included the EC2 datasource.
   Note that the timeout on the EC2 datasource is quite annoying, but was at least historically required, as the EC2 datasource might just not have been there for some time, so polling and retry was necessary. Anyway, that wouldn't have changed anything; the failure path was inevitable given '2' above.

cloud-init probably should have cried more loudly when the request in '2' failed. It is possible that even if it had, the warning would have been lost due to other bugs like bug 1235231. But it should at least WARN, and I'll make sure it does that.
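A minimal sketch of the step that failed silently, with the loud WARN Scott proposes adding. The function names are illustrative and are not cloud-init's actual internals; only the cloud-config-url= parameter name comes from the command line above.

```python
"""Sketch: read cloud-config-url from the kernel cmdline and fetch it,
logging a WARNING on failure instead of silently moving on."""

import logging
import urllib.request

LOG = logging.getLogger("cloud-config-url")


def parse_cloud_config_url(cmdline):
    """Return the value of cloud-config-url= from a kernel cmdline, or None."""
    for tok in cmdline.split():
        if tok.startswith("cloud-config-url="):
            return tok.split("=", 1)[1]
    return None


def fetch_cloud_config(cmdline, timeout=10):
    url = parse_cloud_config_url(cmdline)
    if url is None:
        return None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError as e:
        # The silent path in this bug: without this WARNING, the only
        # clue was a hung node on the serial console.
        LOG.warning("failed to fetch cloud-config-url %s: %s", url, e)
        return None
```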

To me the most general problem here is the requirement for a node's boot to contact the region controller, and the lack of documentation of that requirement (or the failure of the user to know it; I'm not sure whether or not it is documented).

Early in its process, cloud-init probably tried and failed to get the url at http://91.189.94.35/MAAS/metadata/...

Thanks for the analysis Scott, I concur.

MAAS in general at the moment is a "fire and forget" model which is pretty naive, and we're going to work on making this stuff more robust in the coming weeks.

It seems that cloud-init could help a little if we could provide some other way, via the kernel params, of a "failure" API point which it could do a POST on (with data about the failure) if there is any problem. Is this something you'd consider implementing in cloud-init?
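Julian's proposed "failure" API point might look roughly like this on the cloud-init side. The failure-url= parameter name and the JSON payload are invented for illustration; nothing like this existed at the time of this bug.

```python
"""Sketch: if MAAS passed a failure-url= kernel parameter, cloud-init
could POST failure details there when anything goes wrong."""

import json
import urllib.request


def failure_url_from_cmdline(cmdline):
    """Return the value of the hypothetical failure-url= parameter."""
    for tok in cmdline.split():
        if tok.startswith("failure-url="):
            return tok.split("=", 1)[1]
    return None


def post_failure(cmdline, message, timeout=10):
    """POST a failure report; return True only if the POST succeeded."""
    url = failure_url_from_cmdline(cmdline)
    if url is None:
        return False
    body = json.dumps({"status": "FAILED", "message": message}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=timeout)
        return True
    except OSError:
        return False
```

As Scott notes below, the cost is another kernel parameter on an already crowded command line.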

Mark Shuttleworth (sabdfl) wrote :

Let's rather think about how MAAS itself could make this feel more like a managed experience.

 * MAAS could let the user know that the node came and asked for PXE config, and what was sent
   => avoids having to get the serial output from the node, which is a pain involving SOL
 * MAAS could let the user know that the node asked for cloud-init data (and what was passed, with fingerprint)
   => tells us that PXE happened and cloud-init is getting data from MAAS
 * cloud-init could report that it successfully retrieved that data (and the fingerprint) before processing it
   => confirms the above, from the node perspective
 * cloud-init could report back to MAAS that it successfully completed.
   => suggests that things are working OK

All of the above could be held by the cluster controller and only fed to the region controller on demand (i.e. when debugging). That avoids DOS'ing the region controller when PXE-booting the DC.
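The cluster-held checkpoint log described above could be sketched as follows; the class name, event names, and API are assumptions for illustration, not MAAS's actual implementation.

```python
"""Sketch: per-node boot checkpoints held on the cluster controller,
forwarded to the region controller only on demand (when debugging)."""

from collections import defaultdict
from datetime import datetime, timezone


class BootEventLog:
    """Records boot checkpoints per node, keyed by system id."""

    def __init__(self):
        self._events = defaultdict(list)

    def record(self, system_id, event, detail=""):
        self._events[system_id].append(
            (datetime.now(timezone.utc), event, detail))

    def events_for(self, system_id):
        """Fed to the region controller only when someone asks."""
        return list(self._events[system_id])


log = BootEventLog()
log.record("node-0d28", "PXE_CONFIG_SENT", "amd64/precise/commissioning")
log.record("node-0d28", "CLOUD_CONFIG_REQUESTED", "fingerprint=abc123")
print(len(log.events_for("node-0d28")))  # 2 checkpoints recorded so far
```

Keeping the log on the cluster controller, as Mark suggests, means a rack full of PXE-booting nodes generates no traffic to the region controller until a human actually goes debugging.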

Mark, this is pretty much the plan and I've already asked Gavin to look into the changes required to support a more granular boot reporting mechanism like this.

Scott Moser (smoser) wrote :

regarding failure post path, we could look at that. it really seems like overloading the kernel cmdline though.

On Friday 11 Apr 2014 17:10:13 you wrote:
> regarding failure post path, we could look at that. it really seems like
> overloading the kernel cmdline though.

Well ultimately MAAS will time out the node and try elsewhere, but if the
node is able to pre-empt this while providing valuable debug info, then it's
worth it.

tags: added: node-lifecycle
Raphaël Badin (rvb) on 2014-06-05
tags: added: robustness
Raphaël Badin (rvb) on 2014-06-06
tags: removed: node-lifecycle
Changed in maas:
milestone: 1.6.0 → none
Scott Moser (smoser) wrote :

I'm marking this triaged for cloud-init; at least one solution is understood.

Changed in cloud-init:
importance: Undecided → Low
status: New → Triaged
Christian Reis (kiko) on 2014-10-18
Changed in maas:
milestone: none → next
Christian Reis (kiko) on 2014-10-30
Changed in maas:
milestone: next → 1.7.1
Changed in maas:
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released