Destroying a model with k8s apps hangs

Bug #1772179 reported by Cory Johns on 2018-05-19
This bug affects 1 person
Affects: juju
Status: Fix Released
Importance: High
Assigned to: Ian Booth
Milestone: 2.4-beta3

Bug Description

I juju deployed a k8s cluster on GCE, created a k8s model with that cluster, and deployed two caas charms into that model. I then tried to destroy the controller without first destroying the k8s model or removing the caas applications. It seems to have torn down the k8s cluster before cleaning up the k8s model or applications and now appears to be in a stuck state waiting for those applications.

The root cause is that the model destruction fails, and hence so does the controller teardown.

Ian Booth (wallyworld) on 2018-05-20
Changed in juju:
milestone: none → 2.4-beta3
status: New → Triaged
importance: Undecided → High
Ian Booth (wallyworld) wrote :

Could you provide more detail, e.g. which charms you deployed? Even juju status output would help.
I tried to reproduce on a lxd model without luck. I'll try on GCE next.

I deployed a CAAS mysql charm:

$ juju status
Model  Controller  Cloud/Region  Version      SLA
caas   ian         mycaas        2.4-beta3.1  unsupported

App    Version           Status  Scale  Charm  Store  Rev  OS          Address        Notes
mysql  mysql-server:5.7  active      1  mysql  local    0  kubernetes  10.152.183.56

Unit      Workload  Agent  Address  Ports     Message
mysql/0*  waiting   idle            3306/TCP  waiting for container

$ juju destroy-controller --destroy-all-models ian -y
Destroying controller
Waiting for hosted model resources to be reclaimed
Waiting on 2 models, 3 machines, 6 applications
Waiting on 2 models, 3 machines, 6 applications
Waiting on 2 models, 3 machines, 5 applications
Waiting on 1 model, 3 machines, 5 applications
Waiting on 1 model, 3 machines, 5 applications
Waiting on 1 model, 3 machines, 5 applications
Waiting on 1 model, 3 machines, 5 applications
Waiting on 1 model, 3 machines, 4 applications
Waiting on 1 model, 3 machines, 1 application
Waiting on 1 model, 3 machines
Waiting on 1 model, 2 machines
Waiting on 1 model, 2 machines
Waiting on 1 model, 2 machines
Waiting on 1 model, 2 machines
Waiting on 1 model, 1 machine
Waiting on 1 model, 1 machine
All hosted models reclaimed, cleaning up controller machines
$

Ian Booth (wallyworld) wrote :

Note that Juju model destruction will, in general, not complete if there are any units in error state. This is not CAAS specific. If a unit is in error, its parent application can't/won't be destroyed and manual action is needed to unblock. There's no --force option currently.

Ian Booth (wallyworld) wrote :

I managed to reproduce this on a caas model with 2 related applications:

$ juju destroy-controller ian --destroy-all-models -y
Destroying controller
Waiting for hosted model resources to be reclaimed
Waiting on 2 models, 3 machines, 7 applications
Waiting on 2 models, 3 machines, 7 applications
Waiting on 2 models, 3 machines, 6 applications
Waiting on 2 models, 3 machines, 2 applications
Waiting on 2 models, 3 machines, 2 applications
Waiting on 2 models, 3 machines, 2 applications
Waiting on 2 models, 3 machines, 2 applications
...
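
A hang like this one shows up in the output as counts that stop decreasing. The following is a minimal sketch for spotting that, assuming the progress lines have exactly the form shown in this report. The helper names parse_waiting and is_stalled are hypothetical and not part of Juju.

```python
import re

# Matches fragments such as "2 models", "1 machine", "7 applications".
_COUNT_RE = re.compile(r"(\d+) (model|machine|application)s?")


def parse_waiting(line):
    """Return a dict of counts parsed from a 'Waiting on ...' line, or None."""
    if not line.startswith("Waiting on "):
        return None
    return {kind: int(num) for num, kind in _COUNT_RE.findall(line)}


def is_stalled(lines, repeats=3):
    """True if the last `repeats` progress lines report identical counts."""
    counts = [c for c in map(parse_waiting, lines) if c is not None]
    if len(counts) < repeats:
        return False
    tail = counts[-repeats:]
    return all(c == tail[0] for c in tail)
```

This is only a heuristic: the successful run earlier in this report also repeats the same line several times before moving on, so a larger value for repeats (or a timeout) is likely needed in practice.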

Ian Booth (wallyworld) wrote :

Part of the issue appears to be an error running the relation-departed hook, which means the relation-broken hook is never run. As a result, the unit is never recorded as leaving scope, so it is not cleaned up from the model, and it prevents the relation, and hence the application, from being removed.
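
The chain of events can be illustrated with a toy model. This is a sketch of the logic as described in this comment, not Juju's actual implementation. The Relation class, the depart function and the run_hook callback are all hypothetical, and treating a relation-broken failure as blocking is an assumption.

```python
class Relation:
    """Toy relation that can only be removed once no unit is in scope."""

    def __init__(self, units):
        self.in_scope = set(units)

    def leave_scope(self, unit):
        self.in_scope.discard(unit)

    def removable(self):
        return not self.in_scope


def depart(relation, unit, run_hook):
    """Run the departure hooks for `unit`; return True if it left scope.

    `run_hook(name)` returns True on success. If relation-departed fails,
    relation-broken is never run and the unit stays in scope.
    """
    if not run_hook("relation-departed"):
        return False
    if not run_hook("relation-broken"):  # assumption: this failure blocks too
        return False
    relation.leave_scope(unit)
    return True


# A failing relation-departed hook leaves the relation stuck.
rel = Relation({"mysql/0"})
left = depart(rel, "mysql/0", lambda name: name != "relation-departed")
stuck = not rel.removable()
```

With the failing hook, left is False and stuck is True, so the relation cannot be removed and the application destruction never completes.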

Probably a separate bug: status should show the units as being in error, but they have already been cleaned up, so nothing shows except the now orphaned relations and applications. The log below shows the uniter recording the relation-departed hook error and awaiting resolution, but that resolution never comes because the unit is already removed and the user can't run juju resolved.

DEBUG juju.worker.uniter runlistener.go:139 juju-run listener stopped
DEBUG server-relation-departed Traceback (most recent call last):
DEBUG server-relation-departed File "/var/lib/juju/agents/unit-mysql-0/charm/hooks/server-relation-departed", line 8, in <module>
DEBUG server-relation-departed basic.init_config_states()
DEBUG server-relation-departed File "lib/charms/layer/basic.py", line 33, in init_config_states
DEBUG server-relation-departed config = hookenv.config()
DEBUG server-relation-departed File "/usr/local/lib/python3.6/dist-packages/charmhelpers/core/hookenv.py", line 386, in config
DEBUG server-relation-departed subprocess.check_output(config_cmd_line).decode('UTF-8'))
DEBUG server-relation-departed File "/usr/lib/python3.6/subprocess.py", line 336, in check_output
DEBUG server-relation-departed **kwargs).stdout
DEBUG server-relation-departed File "/usr/lib/python3.6/subprocess.py", line 418, in run
DEBUG server-relation-departed output=stdout, stderr=stderr)
DEBUG server-relation-departed subprocess.CalledProcessError: Command '['config-get', '--all', '--format=json']' returned non-zero exit status 1.
DEBUG juju.worker.caasoperator.remotestate watcher.go:124 got application change
DEBUG juju.worker runner.go:324 stop "mysql/0"
DEBUG juju.worker runner.go:456 killing "mysql/0"
ERROR juju.worker.uniter.operation runhook.go:114 hook "server-relation-departed" failed: exit status 1
DEBUG juju.worker.uniter.operation executor.go:75 lock released
INFO juju.worker.uniter resolver.go:102 awaiting error resolution for "relation-departed" hook

Ian Booth (wallyworld) wrote :

The relation-departed hook looks to be failing because it calls config-get, which fails with a permission denied error:

2018-05-21 02:28:29 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] executing: running server-relation-departed hook
2018-05-21 02:28:29 DEBUG worker.uniter.jujuc server.go:181 running hook tool "config-get"
2018-05-21 02:28:29 DEBUG server-relation-departed ERROR permission denied
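
The traceback earlier in this report shows charmhelpers calling subprocess.check_output on config-get. check_output raises CalledProcessError whenever the command exits non-zero, so any hook tool failure aborts the hook. Below is a minimal sketch of that behaviour. It uses a stand-in child process (the Python interpreter printing the same error and exiting with status 1) because config-get only exists inside a hook context, and run_tool is a hypothetical helper.

```python
import subprocess
import sys


def run_tool(cmd):
    """Run cmd; return (ok, text) instead of letting the error propagate."""
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return True, out.decode("UTF-8")
    except subprocess.CalledProcessError as err:
        message = err.output.decode("UTF-8").strip()
        return False, "exit status %d: %s" % (err.returncode, message)


# Stand-in for a failing hook tool (assumption: not the real config-get).
FAILING = [
    sys.executable,
    "-c",
    "import sys; print('ERROR permission denied'); sys.exit(1)",
]
ok, detail = run_tool(FAILING)
```

Here ok is False and detail carries the exit status and the tool's error text. A hook that does not catch the exception, as in the traceback, exits with status 1 itself.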

Ian Booth (wallyworld) wrote :

It looks like Juju is removing the unit too early, before it has left scope and the relation has been cleaned up. The fix should hopefully be simple.

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Ian Booth (wallyworld)
summary: - Destroying a controller with k8s apps hangs
+ Destroying a model with k8s apps hangs
description: updated
Ian Booth (wallyworld) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released