juju 2.3.4: vsphere units automatically remove jujud across reboot when cloud-init /var/lib/cloud is removed

Bug #1751153 reported by Chad Smith on 2018-02-22
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju
Medium
Unassigned
juju-core
Undecided
Unassigned

Bug Description

If /var/lib/cloud disappears in vsphere deployed units, subsequent reboots cause jujud to uninstall itself.

The relevant issue seems to be that jujud can't access the vsphere endpoint because of an absent configuration file in under /var/lib/cloud and as a result is queues up its own uninstallation.

This occurs during testing of cloud-init development packages. I need to remove /var/lib/cloud so cloud-init thinks the system is 'greenfield' or 'new' and re-runs through all cloud-init boot stages. The issue appears to be the user-data that juju provides writes out systemd unit and service files which don't wait on the completion of cloud-init. So, in cloud-init's 'fresh boot' scenario, jujud beats cloud-init setup and is missing some user-data it needs to properly get access vsphere api. As a result, it thinks it's improperly configured and removes itself.

Steps to reproduce:
  juju bootstrap yourvspherecloud
  juju add-unit ubuntu
  juju ssh ubuntu/1 'sudo rm -rf /var/lib/cloud; sudo reboot'
  juju status # node ubuntu/1 jujud agent will never report active again
  juju ssh ubuntu/1 # won't ever connect

juju's user-data link lines 198-203: https://paste.ubuntu.com/p/r3TjJ6Ssvs/

If jujud has a dependency on files in /var/lib/cloud, the fix would be to order jujud systemd services/units after cloud-init completes this would prevent jujud from running before cloud-init has written user-data/metadata artifacts in /var/lib/cloud.

I was able to validate that juju units don't remove themselves across clean-reboot by adding
"After=cloud-final.service" to the [Unit] section of the systemd service/unit files
 /var/lib/juju/init/jujud-machine-11/jujud-machine-11.service.

Excerpt of logs obtained via vsphere system console of ubuntu unit in error:

2018-02-22 17:28:21 DEBUG juju.api apiclient.go:843 successfully dialed "wss://10.245.201.51:17070/model/c559be05-e095-4eab-8cf9-bd904e587896/api"
2018-02-22 17:28:21 INFO juju.api apiclient.go:597 connection established to "wss://10.245.201.51:17070/model/c559be05-e095-4eab-8cf9-bd904e587896/api"
2018-02-22 17:28:21 DEBUG juju.worker.dependency engine.go:504 "migration-fortress" manifold worker stopped: "upgrade-check-flag" not set: dependency not available
2018-02-22 17:28:22 DEBUG juju.worker.apicaller connect.go:152 failed to connect
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "api-caller" manifold worker stopped: cannot open api: connection permanently impossible
2018-02-22 17:28:22 INFO juju.agent uninstall.go:36 marking agent ready for uninstall
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "api-config-watcher" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "clock" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "upgrade-check-flag" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "upgrade-check-gate" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "upgrade-steps-gate" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "agent" manifold worker stopped: <nil>
2018-02-22 17:28:22 INFO juju.worker.stateconfigwatcher manifold.go:122 tomb dying
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "state-config-watcher" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "upgrade-steps-flag" manifold worker stopped: <nil>
2018-02-22 17:28:22 DEBUG juju.worker.dependency engine.go:504 "termination-signal-handler" manifold worker stopped: <nil>
2018-02-22 17:28:22 INFO juju.worker runner.go:483 stopped "engine", err: agent should be terminated
2018-02-22 17:28:22 DEBUG juju.cmd.jujud introspection.go:64 engine stopped, stopping introspection
2018-02-22 17:28:22 DEBUG juju.worker.introspection socket.go:125 stats worker closing listener
2018-02-22 17:28:22 DEBUG juju.worker.introspection socket.go:119 stats worker serving finished
2018-02-22 17:28:22 DEBUG juju.worker.introspection socket.go:129 stats worker finished
2018-02-22 17:28:22 DEBUG juju.cmd.jujud introspection.go:67 introspection stopped
2018-02-22 17:28:22 DEBUG juju.worker runner.go:332 "engine" done: agent should be terminated
2018-02-22 17:28:22 ERROR juju.worker runner.go:381 fatal "engine": agent should be terminated
2018-02-22 17:28:22 INFO juju.agent uninstall.go:47 agent already marked ready for uninstall
2018-02-22 17:28:22 INFO juju.cmd.jujud machine.go:1712 uninstalling agent
2018-02-22 17:28:22 DEBUG juju.service discovery.go:115 discovered init system "systemd" from local host
2018-02-22 17:28:22 DEBUG juju.service.systemd service.go:377 service "jujud-machine-11" successfully removed
2018-02-22 17:28:22 DEBUG juju.service discovery.go:115 discovered init system "systemd" from local host
2018-02-22 17:28:22 DEBUG juju.service.systemd service.go:329 service "juju-db" not running
2018-02-22 17:28:22 DEBUG juju.service.systemd service.go:371 service "juju-db" not installed
ERROR uninstall failed: [remove /var/lib/juju/agents/unit-ubuntu-14: directory not empty]
2018-02-22 17:28:22 DEBUG cmd supercommand.go:459 error stack:
github.com/juju/juju/cmd/jujud/agent/machine.go:1755: uninstall failed: [remove /var/lib/juju/agents/unit-ubuntu-14: directory not empty]
2018-02-22 17:28:22 DEBUG juju.cmd.jujud main.go:187 jujud complete, code 0, err <nil>
2018-02-22 17:28:31 INFO juju.cmd supercommand.go:56 running jujud [2.3.4 gc go1.9.2]

Full logs:
machine http://paste.ubuntu.com/p/2HWR94JFhh/
unit http://paste.ubuntu.com/p/rDXqz2ZWNS/

Chad Smith (chad.smith) wrote :

BTW: using lxd provider for juju won't reproduce this issue as it goes down a slightly different path. lxd provider uses cloud-init's NoCloud datasource so it pools user-data/meta-data earlier in cloud-init which will not exhibit this race condition.

John A Meinel (jameinel) wrote :

we are unlikely to make a change in juju-1.X to address this.

Changed in juju-core:
status: New → Won't Fix
description: updated
John A Meinel (jameinel) wrote :

We seem to be doing:
    defer func() {
        cause := errors.Cause(err)
        switch {
        case cause == apiagent.ErrDenied:
        case cause == errAgentEntityDead:
        case params.IsCodeUnauthorized(cause):
        case params.IsCodeNotProvisioned(cause):
        default:
            return
        }
        err = ErrConnectImpossible
    }()

So if we get an explicit known reason why we were denied access to the controller, we treat it as fatal and flag ourselves for removal. (ErrConnectImpossible is supposed to mean that we know there is an API server out there, but we don't have any way to log in to it. The expectation is probably that we've been configured, someone else has destroyed our original controller, but somehow this machine was left behind.)
Unfortunately, we also seem to not debug the original error, so it is hard to know *why* we have decided that this is impossible.

I don't know what data we would need from /var/lib/cloud, as we should only need /var/lib/juju/*
So it seems surprising that this would cause us to fail to connect.
I'll at least propose a patch that includes logging of what the original error is, before we decide that it is a permanently fatal error.

John A Meinel (jameinel) wrote :

Is there a reason you want to delete /var/lib/cloud instead of just provisioning a new instance? It seems like you're already in a virtual machine, and it might be cleaner to just stand up another one.

Changed in juju:
importance: Undecided → Medium
status: New → Triaged
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers