juju 2.3.4: vsphere units automatically remove jujud across reboot when cloud-init /var/lib/cloud is removed
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Triaged | Medium | Unassigned |
juju-core | Won't Fix | Undecided | Unassigned |
Bug Description
If /var/lib/cloud disappears in vsphere deployed units, subsequent reboots cause jujud to uninstall itself.
The relevant issue seems to be that jujud can't access the vsphere endpoint because a configuration file is absent under /var/lib/cloud, and as a result it queues up its own uninstallation.
This occurs during testing of cloud-init development packages. I need to remove /var/lib/cloud so cloud-init thinks the system is 'greenfield' or 'new' and re-runs through all cloud-init boot stages. The issue appears to be that the user-data juju provides writes out systemd unit and service files which don't wait on the completion of cloud-init. So, in cloud-init's 'fresh boot' scenario, jujud beats cloud-init setup and is missing some user-data it needs to properly access the vsphere API. As a result, it thinks it's improperly configured and removes itself.
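One way to check whether a given unit file waits on cloud-init is to look for an After= line that names a cloud-init unit. The helper below is only a sketch: the function name is made up for illustration, and the pattern assumes cloud-init's usual unit names (cloud-init*, cloud-config*, cloud-final*).

```shell
# unit_waits_for_cloud_init FILE
# Hypothetical helper (not part of juju or cloud-init).
# Returns 0 if FILE has an After= line naming a cloud-init unit, 1 otherwise.
# Assumption: cloud-init units are named cloud-init*, cloud-config* or cloud-final*.
unit_waits_for_cloud_init() {
    grep -Eq '^After=.*cloud-(init|config|final)' "$1"
}
```

This only inspects a single file and does not see drop-ins. On a live unit, `systemctl show -p After <unit>` reports the effective ordering after all drop-ins are merged.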
Steps to reproduce:
juju bootstrap yourvspherecloud
juju add-unit ubuntu
juju ssh ubuntu/1 'sudo rm -rf /var/lib/cloud; sudo reboot'
juju status # node ubuntu/1 jujud agent will never report active again
juju ssh ubuntu/1 # won't ever connect
juju's user-data link lines 198-203: https:/
If jujud has a dependency on files in /var/lib/cloud, the fix would be to order jujud systemd services/units after cloud-init completes. This would prevent jujud from running before cloud-init has written user-data/metadata artifacts in /var/lib/cloud.
I was able to validate that juju units don't remove themselves across clean-reboot by adding
"After=
/var/lib/
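The ordering fix can be sketched as a systemd drop-in. The exact After= value and file path used above are cut off in this report, so the sketch below does not reproduce them. It assumes cloud-final.service (cloud-init's last boot stage) as the ordering target, and the function name and drop-in file name are hypothetical.

```shell
# write_jujud_dropin UNIT [SYSTEMD_DIR]
# Hypothetical helper: writes a drop-in that orders UNIT after cloud-init.
# ASSUMPTION: cloud-final.service is used as the ordering target; the value
# actually tested in this report is truncated and may differ.
write_jujud_dropin() {
    unit="$1"
    systemd_dir="${2:-/etc/systemd/system}"
    dropin_dir="$systemd_dir/$unit.d"

    # systemd reads drop-ins from <unit>.d/*.conf next to the unit file.
    mkdir -p "$dropin_dir" || return 1
    printf '[Unit]\nAfter=cloud-final.service\n' \
        > "$dropin_dir/10-after-cloud-init.conf"
}
```

For example, `write_jujud_dropin jujud-machine-1.service` followed by `systemctl daemon-reload`, run as root, where the unit name is a guess at how the agent service is named on the affected machine. After= only affects ordering. It does not pull cloud-init into the boot transaction, which should be fine on images where cloud-init is already enabled.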
Excerpt of logs obtained via vsphere system console of ubuntu unit in error:
2018-02-22 17:28:21 DEBUG juju.api apiclient.go:843 successfully dialed "wss://
2018-02-22 17:28:21 INFO juju.api apiclient.go:597 connection established to "wss://
2018-02-22 17:28:21 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 INFO juju.agent uninstall.go:36 marking agent ready for uninstall
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 INFO juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 INFO juju.worker runner.go:483 stopped "engine", err: agent should be terminated
2018-02-22 17:28:22 DEBUG juju.cmd.jujud introspection.go:64 engine stopped, stopping introspection
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.worker.
2018-02-22 17:28:22 DEBUG juju.cmd.jujud introspection.go:67 introspection stopped
2018-02-22 17:28:22 DEBUG juju.worker runner.go:332 "engine" done: agent should be terminated
2018-02-22 17:28:22 ERROR juju.worker runner.go:381 fatal "engine": agent should be terminated
2018-02-22 17:28:22 INFO juju.agent uninstall.go:47 agent already marked ready for uninstall
2018-02-22 17:28:22 INFO juju.cmd.jujud machine.go:1712 uninstalling agent
2018-02-22 17:28:22 DEBUG juju.service discovery.go:115 discovered init system "systemd" from local host
2018-02-22 17:28:22 DEBUG juju.service.
2018-02-22 17:28:22 DEBUG juju.service discovery.go:115 discovered init system "systemd" from local host
2018-02-22 17:28:22 DEBUG juju.service.
2018-02-22 17:28:22 DEBUG juju.service.
ERROR uninstall failed: [remove /var/lib/
2018-02-22 17:28:22 DEBUG cmd supercommand.go:459 error stack:
github.
2018-02-22 17:28:22 DEBUG juju.cmd.jujud main.go:187 jujud complete, code 0, err <nil>
2018-02-22 17:28:31 INFO juju.cmd supercommand.go:56 running jujud [2.3.4 gc go1.9.2]
Full logs:
machine http://
unit http://
BTW: using the lxd provider for juju won't reproduce this issue, as it goes down a slightly different path. The lxd provider uses cloud-init's NoCloud datasource, so it pulls user-data/meta-data earlier in cloud-init and does not exhibit this race condition.