[2.8] Juju fails to connect to instance after juju-clean-shutdown.service timeout in cloud-init

Bug #1878639 reported by Joshua Genet
This bug affects 1 person
Affects          Status         Importance   Assigned to       Milestone
Canonical Juju   Fix Released   High         Joseph Phillips
cloud-init       Expired        Undecided    Unassigned

Bug Description

AWS does spin up an instance and assigns an IP, but Juju stays stuck in Pending.
There's a bunch of EC2RoleRequest EC2Metadata Errors in the controller logs.

Here's a link to the logs/artifacts:
https://oil-jenkins.canonical.com/artifacts/5e61db53-50f0-4b82-9bb1-957bd0085d46/index.html

Joshua Genet (genet022)
tags: added: cdo-release-blocker
removed: cdo-qa-blocker
tags: added: foundations-engine
removed: cpe-foundation
Heather Lanigan (hmlanigan) wrote :

I spun up a happy aws controller and deployed a unit. Nothing like a k8s config, but I see the same EC2 errors in the /var/log/amazon/ssm files.

Heather Lanigan (hmlanigan) wrote :

Machine 18 is the one stuck in pending.

Pen Gale (pengale) wrote :

@genet022: can you get us logs from the machine that Juju was having a hard time talking to? Its logs didn't make it into the crash dump, and that's the most interesting machine, from a troubleshooting standpoint.

Joshua Genet (genet022) wrote :

@petevg Unfortunately because this was an automated run in our CI, the crashdump is all we have. And like you said, the machine 18 logs are empty.

Tim Penhey (thumper) wrote :

Is this a one off? Or is it happening every time?

Tim Penhey (thumper) wrote :

Grabbed the logs from the crashdumps. As mentioned by @petevg there is nothing we can use here for diagnosis.

The problem is on the machine that we have no information for. The controller logs show that machine-18 in the kubernetes model never tried to connect. This normally indicates a networking or cloud-init issue on the started instance.

Without access to the instance that has had the problem, there is nothing we can do.

Changed in juju:
status: New → Incomplete
John George (jog) wrote :

We hit something similar on vsphere, and were able to get cloud-init-output.log.
It's available in the artifacts of this run:
https://solutions.qa.canonical.com/#/qa/testRun/0a3705fe-3357-486e-a61f-01abfffe3c58

There is a failure from juju-clean-shutdown.service

+ /bin/systemctl enable /etc/systemd/system/juju-clean-shutdown.service
Failed to enable unit: Connection timed out
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 running 'modules:final' at Thu, 14 May 2020 16:30:15 +0000. Up 226.09 seconds.
2020-05-14 16:43:15,256 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/runcmd [1]
2020-05-14 16:43:18,676 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2020-05-14 16:43:18,677 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Cloud-init v. 19.4-33-gbb4131a2-0ubuntu1~18.04.1 finished at Thu, 14 May 2020 16:43:18 +0000. Datasource DataSourceOVF [seed=iso]. Up 1009.50 seconds

John George (jog)
Changed in juju:
status: Incomplete → New
summary: - [2.8] Juju fails to connect to AWS instance
+ [2.8] Juju fails to connect to instance after juju-clean-shutdown.service timeout in cloud-init
Heather Lanigan (hmlanigan) wrote :

With the vsphere config machine-4:

May 14 16:43:18 juju-60990d-4 cloud-init[1562]: + /bin/systemctl enable /etc/systemd/system/juju-clean-shutdown.service
May 14 16:43:18 juju-60990d-4 cloud-init[1562]: Failed to enable unit: Connection timed out

The systemctl command failed, causing cloud-init to fail and exit before jujud-machine-4.service could be enabled on that machine.
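
To spot this pattern in other runs, here is a minimal sketch that scans cloud-init output for a traced shell command (a line starting with "+ ") followed by a "Failed ..." line, as in the excerpts above. The function name find_failed_commands and the syslog-prefix regex are illustrative assumptions, not part of Juju or cloud-init, and the matching is tuned only to the two log formats quoted in this bug.

```python
import re

# Optional syslog prefix such as "May 14 16:43:18 juju-60990d-4 cloud-init[1562]: ".
# Assumption: this shape is taken from the excerpt in this report only.
SYSLOG_PREFIX = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+\S+\[\d+\]:\s")


def find_failed_commands(log_text):
    """Return (command, error) pairs for traced commands followed by a 'Failed' line.

    A traced command is a line starting with "+ " (shell xtrace output).
    The most recent traced command is paired with the next line that
    starts with "Failed".
    """
    results = []
    last_command = None
    for raw_line in log_text.splitlines():
        line = SYSLOG_PREFIX.sub("", raw_line)
        if line.startswith("+ "):
            last_command = line[2:]
        elif line.startswith("Failed") and last_command is not None:
            results.append((last_command, line))
            # Reset so later "Failed" lines are not blamed on the same command.
            last_command = None
    return results
```

Feeding it the text of cloud-init-output.log (or the matching syslog lines) from the affected machine should surface the systemctl enable call as the first failing step.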

Pen Gale (pengale) wrote :

Per convo w/ Juju team, this is likely a systemd bug. Juju can't cleanly do much about it -- if a machine fails to cloud-init, it's never going to get to the point where it can talk to Juju. The "correct" next steps would involve further investigation on the failed machine, and a bug filed against systemd.

That said, there is some investigation that we might do from the Juju end of things. For example, we could queue up the service to be started, rather than blocking on start. This might cause other issues later on in the unit's life cycle, however.
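
One way to read that idea is to treat a non-essential step as best-effort, so a timeout in it does not stop the steps that follow. The sketch below only illustrates that control flow; run_best_effort, run_steps, and the required flag are hypothetical names, and this is not how Juju's cloud-init script is actually generated.

```python
import subprocess


def run_best_effort(cmd, timeout_s):
    """Run cmd and report success as a bool instead of raising.

    Returns False on a non-zero exit status, on timeout, or if the
    executable cannot be started.
    """
    try:
        completed = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return completed.returncode == 0


def run_steps(steps, timeout_s=30.0):
    """Run (cmd, required) steps in order and return a list of (cmd, ok).

    A failed optional step is recorded and skipped over; a failed
    required step stops the sequence, like a fail-fast script would.
    """
    outcomes = []
    for cmd, required in steps:
        ok = run_best_effort(cmd, timeout_s)
        outcomes.append((cmd, ok))
        if required and not ok:
            break
    return outcomes
```

With the clean-shutdown step marked optional and the machine agent step marked required, a timeout in the former would be logged but would not prevent the latter from running. As noted, that trade-off could hide problems that show up later in the unit's life cycle.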

Tim Penhey (thumper) wrote :

Isn't this a cloud-init issue? Not a Juju issue?

Pen Gale (pengale) wrote :

This is not a regression, and isn't a bug with the Juju service being started. There might be some longer term work to make Juju behave better when a piece of the pipeline fails like this. But this doesn't make sense as a release blocker -- any fixes we did in the release window would be partial, and wouldn't address the underlying bug in cloud-init.

Tim Penhey (thumper)
Changed in juju:
status: New → Invalid
Paride Legovini (paride) wrote :

Hi,

I think this is unlikely to be a bug in cloud-init, as the cloud-init failure is a consequence of the failure to enable the juju-clean-shutdown service, as noted already. We could get a better understanding of what happens on the cloud-init side from the logs tarball generated by running

  cloud-init collect-logs

on the failed machine. For the moment I'm marking the cloud-init task as Incomplete.

Changed in cloud-init:
status: New → Incomplete
Changed in juju:
status: Invalid → New
Changed in juju:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Joseph Phillips (manadart)
milestone: none → 2.8.1
Joseph Phillips (manadart) wrote :

This service is no longer created on machines using systemd.
https://github.com/juju/juju/pull/11717

Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Michael Skalka (mskalka)
tags: removed: cdo-release-blocker
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Incomplete → Expired