[2.2] MAAS doesn't report cloud-init failures post-deployment

Bug #1707850 reported by Nobuto Murata
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: Medium
Assigned to: Andres Rodriguez
Milestone: 2.3.0alpha3

Bug Description

A node has been marked as "Deployed" in MAAS, but it actually failed.

1. Deploy -> curtin succeeded with no error
2. Reboot at the end of curtin process
3. Boot up with installed OS, then cloud-init will be run

At this point MAAS has marked the node as Deployed, but cloud-init is still running and hits some failures (in my case, a failure in cloudinit.config.cc_apt_configure because an external http proxy was temporarily unavailable). A deeper status check on cloud-init would be nice to have here.

Cloud-init output from step 3 above:

[ 26.808179] cloud-init[4574]: 2017-08-01 06:56:06,869 - util.py[WARNING]: Running module apt-configure (<module 'cloudinit.config.cc_apt_configure' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_apt_configure.py'>) failed
[ 49.828301] cloud-init[4574]: Ign:1 http://archive.ubuntu.com/ubuntu xenial InRelease
[ 49.828437] cloud-init[4574]: Ign:2 http://archive.ubuntu.com/ubuntu xenial-updates InRelease
[ 49.828522] cloud-init[4574]: Ign:3 http://archive.ubuntu.com/ubuntu xenial-backports InRelease
[ 49.828605] cloud-init[4574]: Ign:4 http://archive.ubuntu.com/ubuntu xenial-security InRelease
[ 49.829045] cloud-init[4574]: Ign:5 http://archive.ubuntu.com/ubuntu xenial Release
[ 49.829565] cloud-init[4574]: Ign:6 http://archive.ubuntu.com/ubuntu xenial-updates Release
[ 49.829922] cloud-init[4574]: Ign:7 http://archive.ubuntu.com/ubuntu xenial-backports Release
[ 49.830477] cloud-init[4574]: Ign:8 http://archive.ubuntu.com/ubuntu xenial-security Release
...
[ 52.823275] cloud-init[4574]: Reading package lists...
[ 52.823343] cloud-init[4574]: W: The repository 'http://archive.ubuntu.com/ubuntu xenial Release' does not have a Release file.
[ 52.823412] cloud-init[4574]: W: The repository 'http://archive.ubuntu.com/ubuntu xenial-updates Release' does not have a Release file.
[ 52.823481] cloud-init[4574]: W: The repository 'http://archive.ubuntu.com/ubuntu xenial-backports Release' does not have a Release file.
[ 53.194853] cloud-init[4574]: W: The repository 'http://archive.ubuntu.com/ubuntu xenial-security Release' does not have a Release file.
[ 53.194961] cloud-init[4574]: E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-backports/restricted/binary-amd64/Packages Unable to connect to 10.10.X.Y:8080:
[ 53.195039] cloud-init[4574]: E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial/main/binary-amd64/Packages Unable to connect to 10.10.X.Y:8080:
[ 53.195113] cloud-init[4574]: E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-updates/main/binary-amd64/Packages Unable to connect to 10.10.X.Y:8080:
[ 53.195185] cloud-init[4574]: E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/xenial-security/main/binary-amd64/Packages Unable to connect to 10.10.X.Y:8080:
[ 53.195272] cloud-init[4574]: E: Some index files failed to download. They have been ignored, or old ones used instead.
[ 70.444562] cloud-init[4574]: 2017-08-01 06:56:50,523 - handlers.py[WARNING]: failed posting event: finish: modules-config/config-ntp: FAIL: running config-ntp with frequency once-per-instance

As a result, ntp was not set up against MAAS.
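
For anyone who wants to spot these failures without reading the whole console log, here is a minimal sketch that pulls the failed module and event names out of the two WARNING line shapes shown above. The function name and the regular expressions are my own assumptions derived only from this log excerpt; they are not part of MAAS or cloud-init, and other cloud-init versions may word these lines differently.

```python
import re

# Assumption: these two patterns mirror the WARNING lines in the log excerpt
# above; they are not an official cloud-init log format specification.
MODULE_FAILED = re.compile(r"Running module (\S+) \(.*\) failed")
EVENT_FAILED = re.compile(r"finish: (\S+): FAIL:")


def failed_stages(log_text):
    """Return failed module/event names in order of appearance, without duplicates."""
    found = []
    for line in log_text.splitlines():
        for pattern in (MODULE_FAILED, EVENT_FAILED):
            match = pattern.search(line)
            if match and match.group(1) not in found:
                found.append(match.group(1))
    return found


# Two lines copied from the log excerpt above, plus one harmless line.
SAMPLE = """\
[ 26.808179] cloud-init[4574]: 2017-08-01 06:56:06,869 - util.py[WARNING]: Running module apt-configure (<module 'cloudinit.config.cc_apt_configure' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_apt_configure.py'>) failed
[ 52.823275] cloud-init[4574]: Reading package lists...
[ 70.444562] cloud-init[4574]: 2017-08-01 06:56:50,523 - handlers.py[WARNING]: failed posting event: finish: modules-config/config-ntp: FAIL: running config-ntp with frequency once-per-instance
"""

if __name__ == "__main__":
    for name in failed_stages(SAMPLE):
        print(name)
```

Run against the excerpt, this lists apt-configure and modules-config/config-ntp, which are the two failures discussed in this report.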

Nobuto Murata (nobuto)
summary: - MAAS should not mark a node as "Deployed" when cloud-init has some
+ [2.2] MAAS should not mark a node as "Deployed" when cloud-init has some
failures
Andres Rodriguez (andreserl) wrote : Re: [2.2] MAAS should not mark a node as "Deployed" when cloud-init has some failures

The machine is marked deployed once it accesses the metadata, which ensures the SSH key gets imported. If other cloud-init failures happen, like yours, that is OK because, as you said, they could just be random, transient issues.

Changed in maas:
status: New → Won't Fix
Ante Karamatić (ivoks) wrote :

Shouldn't MAAS call back home on successful cloud-init finish? That would help with observability. The fact that metadata was served doesn't mean the deployment was successful. What if the requested network config cannot be implemented? Surely MAAS needs to keep an eye on the deployment until the very end of cloud-config?

Christian Reis (kiko) wrote :

Hang on, from our perspective (trying to deploy an application using MAAS) it's not OK -- the box is not correctly deployed.

It /should/ have failed deployment, and we will then waste a ton of time as Juju tries to deploy, and fails somewhere random because the node picked doesn't fully work.

Dean Henrichsmeyer (dean) wrote :

The challenge becomes how MAAS can know what is an expected cloud-init "failure" versus an unexpected failure. Those change with different use cases, hardware, software, etc.

Surfacing information up to the user is important, but I don't see how MAAS itself can be in a position to interpret that data as intended success and/or failure.

Changed in maas:
status: Won't Fix → Incomplete
Christian Reis (kiko) wrote :

So to take an example from this specific case, we consider modules-config/config-ntp critical, and perhaps more broadly, the ability to pull packages, which fails in the case above.

Could we get MAAS to track that and tell us if it succeeded?

Christian Reis (kiko) wrote :

Can you clarify, relative to cloud-init failures, what sort of false positives occur regularly, so we can understand what the challenge is from the MAAS side?

Changed in maas:
status: Incomplete → New
Nobuto Murata (nobuto) wrote :

In my opinion, MAAS could handle the cloud-init result as follows:

1. No error -> All green of course

$ cat /run/cloud-init/result.json
{
 "v1": {
  "datasource": "DataSourceMAAS [http://10.10.20.X:Y/MAAS/metadata/]",
  "errors": []
 }
}

2. Any error in the modules that vendor-data uses -> Marked as red

Currently, with 2.2, only the ntp module is used by default. But I guess MAAS 2.3 will use network-related modules as well.

$ cat /var/lib/cloud/instance/vendor-data.txt
#cloud-config
ntp:
  pools: []
  servers: [10.10.X.Y]

3. Any error in the modules that user-data uses -> Arguable whether to mark it as red or just a warning

Given the cases above, I don't think MAAS needs to distinguish a real error from an intermittent (unexpected) failure. Both can simply be marked as errors, so users can try "retry-provisioning" to see whether the failure is reproducible.
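
The three cases above could be sketched roughly as follows. This is only an illustration of the idea, not how MAAS works: the function name, the VENDOR_MODULES tuple, and the substring match on error strings are all assumptions of mine, and the error strings in the usage example are made up. Only the result.json layout ("v1" containing an "errors" list) is taken from the output shown above.

```python
import json

# Hypothetical: names of the modules MAAS drives through vendor-data.
# Per the comment above, only ntp is used by default in 2.2.
VENDOR_MODULES = ("ntp",)


def classify(result_json_text, vendor_modules=VENDOR_MODULES):
    """Map the contents of /run/cloud-init/result.json to green, red or warning."""
    errors = json.loads(result_json_text)["v1"]["errors"]
    if not errors:
        return "green"  # case 1: no error
    for error in errors:
        # Assumption: an error string mentions the failing module's name.
        # This is a plain substring heuristic, nothing cloud-init guarantees.
        if any(name in str(error) for name in vendor_modules):
            return "red"  # case 2: a vendor-data module failed
    return "warning"  # case 3: some other (e.g. user-data) module failed


if __name__ == "__main__":
    clean = '{"v1": {"datasource": "DataSourceMAAS", "errors": []}}'
    # Made-up error strings, for illustration only.
    ntp_broken = '{"v1": {"datasource": "DataSourceMAAS", "errors": ["ntp failed"]}}'
    apt_broken = '{"v1": {"datasource": "DataSourceMAAS", "errors": ["apt-configure failed"]}}'
    print(classify(clean), classify(ntp_broken), classify(apt_broken))
```

On a deployed node the input would come from reading /run/cloud-init/result.json. Whether a non-vendor failure should be "warning" or "red" is the arguable part of case 3; here it is just a return value that is easy to change.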

Blake Rouse (blake-rouse) wrote :

Nobuto,

I think we need to take a close look at the messages cloud-init sends to MAAS. At the moment, I do not believe the event messages that MAAS receives from cloud-init say whether the module being run comes from vendor-data or user-data.

In any case, vendor-data should be processed before user-data, so cloud-init should report such a failure before it requests user-data.

One thing of note: with MAAS running the ntp server, this should never fail. So even if we handle cloud-init errors better, ntp configuration should never fail on the machine (unless it's a completely external ntp server).

Andres Rodriguez (andreserl) wrote :

Following discussion with Ante, MAAS won't mark a node 'failed deployment' if the vendor data fails to run. It should, however, surface that there was an error.

MAAS should already be surfacing errors from cloud-init in the node event log.
There are planned UX improvements to display these extra events in better places in the UI.

As such, we will address this then. In the meantime, I'm targeting this to 2.3.

Changed in maas:
milestone: none → 2.3.0
importance: Undecided → Medium
status: New → Triaged
Changed in maas:
assignee: nobody → Andres Rodriguez (andreserl)
summary: - [2.2] MAAS should not mark a node as "Deployed" when cloud-init has some
- failures
+ [2.2] MAAS doesn't report cloud-init failures post-deployment
Andres Rodriguez (andreserl) wrote :

FWIW, I'm fixing this bug to show failed events from cloud-init (as cloud-init reports it) in the 'Events' tab on the node details page.

As discussed before, the UX team is working on a better solution for displaying messages and events in different places, which will highlight this more clearly. So, while the fix for now is as described above, in the near future these events will be displayed more prominently around the UI.

Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
milestone: 2.3.0 → 2.3.0alpha3
Changed in maas:
status: Fix Committed → Fix Released