commissioning fails silently if a node can't reach the region controller

Bug #1303925 reported by James Troup on 2014-04-07
Affects     Status        Importance  Assigned to     Milestone
MAAS        Fix Released  High        Julian Edwards  1.7.1
cloud-init  Triaged       Low         Unassigned      —

Bug Description

We recently had a node which completely refused to commission in MAAS.
After (literally) several man days of debugging, we figured out that
it was because the node couldn't talk to the region controller over
HTTP.

Obviously, that's ultimately our mistake/problem, but MAAS could have
been a lot better at helping us to help ourselves; currently, there's
absolutely no indication from the boot process that the HTTP
connection to the region controller is the problem.

Attached is the serial console output (from the point of boot) for the
node that was failing to commission. 91.189.94.35 is the MAAS region
controller and 91.189.88.20 is the MAAS cluster controller.

James Troup (elmo) wrote :
tags: added: canonical-is
Graham Binns (gmb) wrote :

Calling this critical since it's a costly failure state to get into, and targeting it for 14.10.

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 14.10
Changed in maas:
importance: Critical → High
Julian Edwards (julian-edwards) wrote :

James, was it hanging or shutting down after that error in the log?

Julian Edwards (julian-edwards) wrote :

MAAS is not in direct control at this point, I think cloud-init needs to do better here and have a last-ditch catch of exceptions before running a piece of failsafe code that would report something back to MAAS.

tags: added: provisioning
Scott Moser (smoser) wrote :

cloud-init is executing code that maas told it to execute.
so maas needs to tell it to execute code that has some "last ditch catch".

to be clear, cloud-init got data from maas (via kernel cmdline) that told it to get some code from the metadata server and execute it. It then executed it. That code failed. *that* is the code that needs to be more resilient. cloud-init is, by design, very much doing exactly what maas tells it to do.
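The "last ditch catch" being discussed could be sketched roughly like this: the script MAAS serves wraps the real commissioning payload and reports any exception back before dying. The REPORT_URL, function names, and payload format here are invented for illustration; this is not MAAS's actual preseed code.

```python
"""Hypothetical failsafe wrapper for a MAAS-served commissioning script."""

import json
import traceback
import urllib.request

# Assumed endpoint; a real script would take this from the preseed.
REPORT_URL = "http://region.example/MAAS/metadata/status"


def default_report(exc):
    """POST the traceback back to MAAS; swallow network errors so a
    broken report path (the very failure in this bug) can't hang us."""
    body = json.dumps({
        "result": "FAILED",
        "error": "".join(traceback.format_exception(
            type(exc), exc, exc.__traceback__)),
    }).encode()
    req = urllib.request.Request(
        REPORT_URL, data=body,
        headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=10)
    except OSError:
        # Even the report can fail; at least leave a trace on the console.
        print("could not report failure:", exc)


def run_with_failsafe(payload, report=default_report):
    """Run payload(); on any exception, report it and return False."""
    try:
        payload()
        return True
    except Exception as exc:
        report(exc)
        return False
```

The point of the design is that the outermost layer never raises: whatever the payload does, something gets reported (or at least printed) instead of the node hanging silently.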

no longer affects: cloud-init
Gavin Panella (allenap) wrote :

I assume cloud-init doesn't crash if the code it downloads from MAAS breaks... so the reason it's hanging is because the instructions about what to do next were in that downloaded, crashy piece? If so, I further assume that we therefore need to get some fail-safe command into the first user-data file that cloud-init processes; is that right?

Gavin Panella (allenap) wrote :

I made a mistake: it couldn't actually download the code. However, the question stands: what does cloud-init do if it can't download from a data source? Does it process the next directive in the user-data that it does have?

Julian Edwards <email address hidden> writes:

> James, was it hanging or shutting down after that error in the log?

It hung.

--
James

Julian Edwards (julian-edwards) wrote :

Thanks James.

Scott, cloud-init is hanging without getting any data from MAAS. It seems to me that there should be at least a last-ditch way of reporting the failure back somewhere?

This is possibly a dupe of bug 1237215

no longer affects: curtin

On 09/04/14 00:11, Gavin Panella wrote:
> I made a mistake: it couldn't actually download the code. However, the
> question stands: what does cloud-init do if it can't download from a
> data source? Does it process the next directive in the user-data that it
> does have?

There is no user data at this point though is there? It's trying to get
it from the metadata server, as MAAS just passes the URL to that on the
kernel cmd line.

I don't know if there's room on the kernel command line to add much
more. If there's something simple MAAS can do here then great, but I'm
concerned that cloud-init hangs.

Scott Moser (smoser) wrote :

[ 0.000000] Command line: nomodeset iscsi_target_name=iqn.2004-05.com.ubuntu:maas:maas-precise-12.04-amd64-20131010 iscsi_target_ip=91.189.88.20 iscsi_target_port=3260 iscsi_initiator=rubay ip=::::rubay:BOOTIF ro root=/dev/disk/by-path/ip-91.189.88.20:3260-iscsi-iqn.2004-05.com.ubuntu:maas:maas-precise-12.04-amd64-20131010-lun-1 overlayroot=tmpfs cloud-config-url=http://91.189.94.35/MAAS/metadata/latest/by-id/node-0d287828-be5e-11e3-a0d3-0019bbccd75c/?op=get_preseed log_host=91.189.94.35 log_port=514 console=tty0 console=ttyS1,38400 nosplash initrd=amd64/generic/precise/commissioning/initrd.gz BOOT_IMAGE=amd64/generic/precise/commissioning/linux BOOTIF=01-2c-44-fd-81-23-e8

I did wrongly diagnose this previously. cloud-init could / should warn more loudly that it couldn't get the url for 'cloud-config-url'.

However, here is what happened as I understand it:

1. The maas cluster controller sent the above kernel command line to a commissioning node (enlistment or commissioning wouldn't really matter; the ephemeral environment is the key).
2. The node was unable to reach the maas region controller at 91.189.94.35.
   In the happy path, cloud-init would have fetched that url and stored its content in /etc/cloud/cloud.cfg.d/ . The content of that url would then have told cloud-init to:
   a. only enable the maas datasource (disabling the ec2 datasource)
   b. attempt to get data from the maas datasource on the region controller.
3. Since cloud-init got no configuration from the kernel cmdline, it went on its way looking for all configured datasources, which included the EC2 datasource.
   Note that the timeout on the EC2 datasource is quite annoying, but was at least historically required, as the EC2 datasource might just not have been there for some time, so polling and retry was necessary. Anyway, that wouldn't have changed anything; the failure path was inevitable given '2' above.

cloud-init probably should have cried more loudly when the request in '2' failed. It is possible that even if it had, the warning would have been lost due to other bugs like bug 1235231. But it should at least WARN, and I'll make sure it does that.
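A minimal sketch of the step that failed silently, with the loud WARN Scott proposes adding. The function names are illustrative and are not cloud-init's actual internals; only the cloud-config-url= parameter name comes from the command line above.

```python
"""Sketch: read cloud-config-url from the kernel cmdline and fetch it,
logging a WARNING on failure instead of silently moving on."""

import logging
import urllib.request

LOG = logging.getLogger("cloud-config-url")


def parse_cloud_config_url(cmdline):
    """Return the value of cloud-config-url= from a kernel cmdline, or None."""
    for tok in cmdline.split():
        if tok.startswith("cloud-config-url="):
            return tok.split("=", 1)[1]
    return None


def fetch_cloud_config(cmdline, timeout=10):
    url = parse_cloud_config_url(cmdline)
    if url is None:
        return None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError as e:
        # The silent path in this bug: without this WARNING, the only
        # clue was a hung node on the serial console.
        LOG.warning("failed to fetch cloud-config-url %s: %s", url, e)
        return None
```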

To me the most general problem here is the requirement for a node's boot to contact the region controller, and the lack of documentation of that requirement (or the failure of the user to know it; I'm not sure whether or not it is documented).

Early in its process, cloud-init probably tried and failed to get the url at http://91.189.94.35/MAAS/metadata/...

Thanks for the analysis Scott, I concur.

MAAS in general at the moment is a "fire and forget" model which is pretty naive, and we're going to work on making this stuff more robust in the coming weeks.

It seems that cloud-init could help a little if we could provide some other way, via the kernel params, of a "failure" API point which it could do a POST on (with data about the failure) if there is any problem. Is this something you'd consider implementing in cloud-init?
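Julian's proposed "failure" API point might look roughly like this on the cloud-init side. The failure-url= parameter name and the JSON payload are invented for illustration; nothing like this existed at the time of this bug.

```python
"""Sketch: if MAAS passed a failure-url= kernel parameter, cloud-init
could POST failure details there when anything goes wrong."""

import json
import urllib.request


def failure_url_from_cmdline(cmdline):
    """Return the value of the hypothetical failure-url= parameter."""
    for tok in cmdline.split():
        if tok.startswith("failure-url="):
            return tok.split("=", 1)[1]
    return None


def post_failure(cmdline, message, timeout=10):
    """POST a failure report; return True only if the POST succeeded."""
    url = failure_url_from_cmdline(cmdline)
    if url is None:
        return False
    body = json.dumps({"status": "FAILED", "message": message}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=timeout)
        return True
    except OSError:
        return False
```

As Scott notes below, the cost is another kernel parameter on an already crowded command line.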

Mark Shuttleworth (sabdfl) wrote :

Let's rather think about how MAAS itself could make this feel more like a managed experience.

 * MAAS could let the user know that the node came and asked for PXE config, and what was sent
   => avoids having to get the serial output from the node, which is a pain involving SOL
 * MAAS could let the user know that the node asked for cloud-init data (and what was passed, with fingerprint)
   => tells us that PXE happened and cloud-init is getting data from MAAS
 * cloud-init could report that it successfully retrieved that data (and the fingerprint) before processing it
   => confirms the above, from the node perspective
 * cloud-init could report back to MAAS that it successfully completed.
   => suggests that things are working OK

All of the above could be held by the cluster controller and only fed to the region controller on demand (i.e. when debugging). That avoids DOS'ing the region controller when PXE-booting the DC.
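The cluster-held checkpoint log described above could be sketched as follows; the class name, event names, and API are assumptions for illustration, not MAAS's actual implementation.

```python
"""Sketch: per-node boot checkpoints held on the cluster controller,
forwarded to the region controller only on demand (when debugging)."""

from collections import defaultdict
from datetime import datetime, timezone


class BootEventLog:
    """Records boot checkpoints per node, keyed by system id."""

    def __init__(self):
        self._events = defaultdict(list)

    def record(self, system_id, event, detail=""):
        self._events[system_id].append(
            (datetime.now(timezone.utc), event, detail))

    def events_for(self, system_id):
        """Fed to the region controller only when someone asks."""
        return list(self._events[system_id])


log = BootEventLog()
log.record("node-0d28", "PXE_CONFIG_SENT", "amd64/precise/commissioning")
log.record("node-0d28", "CLOUD_CONFIG_REQUESTED", "fingerprint=abc123")
print(len(log.events_for("node-0d28")))  # 2 checkpoints recorded so far
```

Keeping the log on the cluster controller, as Mark suggests, means a rack full of PXE-booting nodes generates no traffic to the region controller until a human actually goes debugging.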

Mark, this is pretty much the plan and I've already asked Gavin to look into the changes required to support a more granular boot reporting mechanism like this.

Scott Moser (smoser) wrote :

regarding failure post path, we could look at that. it really seems like overloading the kernel cmdline though.

On Friday 11 Apr 2014 17:10:13 you wrote:
> regarding failure post path, we could look at that. it really seems like
> overloading the kernel cmdline though.

Well ultimately MAAS will time out the node and try elsewhere, but if the
node is able to pre-empt this while providing valuable debug info, then it's
worth it.

tags: added: node-lifecycle
Raphaël Badin (rvb) on 2014-06-05
tags: added: robustness
Raphaël Badin (rvb) on 2014-06-06
tags: removed: node-lifecycle
Changed in maas:
milestone: 1.6.0 → none
Scott Moser (smoser) wrote :

I'm marking this triaged for cloud-init; at least one solution is understood.

Changed in cloud-init:
importance: Undecided → Low
status: New → Triaged
Christian Reis (kiko) on 2014-10-18
Changed in maas:
milestone: none → next
Christian Reis (kiko) on 2014-10-30
Changed in maas:
milestone: next → 1.7.1
Changed in maas:
status: Triaged → In Progress
assignee: nobody → Julian Edwards (julian-edwards)
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released