Comment 0 for bug 1751153

Revision history for this message
Chad Smith (chad.smith) wrote :

If /var/lib/cloud disappears in vsphere deployed units, subsequent reboots cause jujud to uninstall itself.

The relevant issue seems to be that jujud can't access the vsphere endpoint because of an absent configuration file in under /var/lib/cloud and as a result is queues up it's on uninstallation.

This occurs during testing of cloud-init development packages. I need to remove /var/lib/cloud so cloud-init thinks the system is 'greenfield' or 'new' and re-runs through all cloud-init boot stages. The issue appears to be the user-data that juju provides writes out systemd unit and service files which don't wait on the completion of cloud-init. So, in cloud-init's 'fresh boot' scenario, jujud beats cloud-init setup and is missing some user-data it needs to properly get access vsphere api. As a result, it thinks it's improperly configured and removes itself.

Steps to reproduce:
  juju bootstrap yourvspherecloud
  juju add-unit ubuntu
  juju ssh ubuntu/1 'sudo rm -rf /var/lib/cloud; sudo reboot'
  juju status # node ubuntu/1 jujud agent will never report active again
  juju ssh ubuntu/1 # won't ever connect

juju's user-data link lines 198-203: https://paste.ubuntu.com/p/r3TjJ6Ssvs/

If jujud has a dependency on files in /var/lib/cloud, the fix would be to order jujud systemd services/units after cloud-init completes this would prevent jujud from running before cloud-init has written user-data/metadata artifacts in /var/lib/cloud.

I was able to validate that juju units don't remove themselves across clean-reboot by adding
"After=cloud-final.service" to the [Unit] section of the systemd service/unit files
 /var/lib/juju/init/jujud-machine-11/jujud-machine-11.service.

Excerpt of logs obtained via vsphere system console of ubuntu unit in error:

2018-02-22 17:28:21 DEBUG juju.api apiclient.go:843 successfully dialed "wss://10.245.201.51:17070/model/c559be05-e095-4eab-8cf9-bd904e587896/api"
2018-02-22 17:28:21 INFO juju.api apiclient.go:597 connection established to "wss://10.245.201.51:17070/model/c559be05-e095-4eab-8cf9-bd904e587896/api"
2018-02-22 17:28:21 DEBUG juju.worker.dependency engine.go:504 "migration-fortress" manifold worker stopped: "upgrade-check-flag" not set: dependency not available
2018-02-22 17:28:22 DEBUG juju.worker.apicaller connect.go:152 failed to connect
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "api-caller" manifold worker stopped: cannot open api: connection permanently impossible
2018-02-22 17:28:22 INFO juju.agent uninstall.go:36 marking agent ready for uninstall
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "api-config-watcher" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "clock" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "upgrade-check-flag" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "upgrade-check-gate" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "upgrade-steps-gate" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "agent" manifold worker stopped: <nil>
2018-02-22 17:28:22 INFO juju.worker.stateconfigwatcher manifold.go:122 tomb dying
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "state-config-watcher" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "upgrade-steps-flag" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "termination-signal-handler" manifold worker stopped: <nil>
2018-02-22 17:28:22 INFO juju.worker runner.go:483 stopped "engine", err: agent should be terminated
2018-02-22 17:28:22 DEBUG juju.cmd.jujud introspection.go:64 engine stopped, stopping introspection
2018-02-22 17:28:22 DEBUG juju.worker.introspection socket.go:125 stats worker closing listener
2018-02-22 17:28:22 DEBUG juju.worker.introspection socket.go:119 stats worker serving finished
2018-02-22 17:28:22 DEBUG juju.worker.introspection socket.go:129 stats worker finished
2018-02-22 17:28:22 DEBUG juju.cmd.jujud introspection.go:67 introspection stopped
2018-02-22 17:28:22 DEBUG juju.worker runner.go:332 "engine" done: agent should be terminated
2018-02-22 17:28:22 ERROR juju.worker runner.go:381 fatal "engine": agent should be terminated
2018-02-22 17:28:22 INFO juju.agent uninstall.go:47 agent already marked ready for uninstall
2018-02-22 17:28:22 INFO juju.cmd.jujud machine.go:1712 uninstalling agent
2018-02-22 17:28:22 DEBUG juju.service discovery.go:115 discovered init system "systemd" from local host
2018-02-22 17:28:22 DEBUG juju.service.systemd service.go:377 service "jujud-machine-11" successfully removed
2018-02-22 17:28:22 DEBUG juju.service discovery.go:115 discovered init system "systemd" from local host
2018-02-22 17:28:22 DEBUG juju.service.systemd service.go:329 service "juju-db" not running
2018-02-22 17:28:22 DEBUG juju.service.systemd service.go:371 service "juju-db" not installed
ERROR uninstall failed: [remove /var/lib/juju/agents/unit-ubuntu-14: directory not empty]
2018-02-22 17:28:22 DEBUG cmd supercommand.go:459 error stack:
github.com/juju/juju/cmd/jujud/agent/machine.go:1755: uninstall failed: [remove /var/lib/juju/agents/unit-ubuntu-14: directory not empty]
2018-02-22 17:28:22 DEBUG juju.cmd.jujud main.go:187 jujud complete, code 0, err <nil>
2018-02-22 17:28:31 INFO juju.cmd supercommand.go:56 running jujud [2.3.4 gc go1.9.2]

Full logs:
machine http://paste.ubuntu.com/p/2HWR94JFhh/
unit http://paste.ubuntu.com/p/rDXqz2ZWNS/