kill-controller should always work to bring down a controller

Bug #1566426 reported by Cheryl Jennings
This bug affects 5 people
Affects         Status        Importance  Assigned to         Milestone
Canonical Juju  Fix Released  High        Tim Penhey
juju-ci-tools   Fix Released  Undecided   Andrew James Beach

Bug Description

There may be cases where kill-controller is used to bring down a broken controller and it never completes. In my case, I had a service which never had a machine provisioned for it (see bug #1566420). I tried to kill the controller, but it got stuck in a loop, waiting for the service to be removed (which would never happen):

ubuntu@ip-172-31-18-222:~$ juju2 kill-controller lxd
WARNING! This command will destroy the "local.lxd" controller.
This includes all machines, services, data and other resources.

Continue [y/N]? y
Destroying controller "local.lxd"
Waiting for resources to be reclaimed
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
<repeated until I Ctrl+C>

It would be nice if kill-controller had some sort of timeout when destroying through the API and would fall back to destroying through the provider if the "nice" way didn't complete in a timely manner.
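
For illustration, a minimal Go sketch of that suggested behaviour follows: wait for the API-driven destroy under a deadline, and fall back to the provider if it never completes. waitForAPIDestroy and destroyViaProvider are hypothetical stand-ins, not real Juju functions, and the timings are made up.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForAPIDestroy stands in for asking the controller to destroy everything
// via the API and waiting for it to report that resources are reclaimed.
func waitForAPIDestroy(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // the "nice" path never completed
	case <-time.After(10 * time.Second):
		return nil // pretend the controller finished cleanly
	}
}

// destroyViaProvider stands in for tearing down the controller's instances
// directly through the cloud provider.
func destroyViaProvider(ctx context.Context) error {
	fmt.Println("terminating controller instances through the provider")
	return nil
}

// killController tries the API first, but gives up after apiTimeout and
// falls back to the provider, so a wedged model cannot block it forever.
func killController(ctx context.Context, apiTimeout time.Duration) error {
	apiCtx, cancel := context.WithTimeout(ctx, apiTimeout)
	defer cancel()

	err := waitForAPIDestroy(apiCtx)
	if err == nil || !errors.Is(err, context.DeadlineExceeded) {
		return err
	}
	fmt.Println("API destroy timed out; destroying through provider")
	return destroyViaProvider(ctx)
}

func main() {
	if err := killController(context.Background(), 5*time.Minute); err != nil {
		fmt.Println("kill-controller failed:", err)
	}
}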

Revision history for this message
Cheryl Jennings (cherylj) wrote :

For others hitting this issue: a workaround is to terminate the controller instance through the provider, then run kill-controller again.

Example with lxd:
# Find the instance ID for the controller machine:
ubuntu@ip-172-31-18-222:~$ juju2 status -m lxd:admin
[Services]
NAME    STATUS   EXPOSED  CHARM
ubuntu  unknown  false    cs:trusty/ubuntu-7

[Units]
ID        WORKLOAD-STATUS  JUJU-STATUS  VERSION  MACHINE  PORTS  PUBLIC-ADDRESS  MESSAGE
ubuntu/0  unknown          allocating                                            Waiting for agent initialization to finish

[Machines]
ID  STATE    DNS         INS-ID                                               SERIES  AZ
0   started  10.0.3.189  juju-7ff01eeb-9bfd-4d01-8be6-34d9bbd089b8-machine-0  xenial

# Stop the instance
$ lxc stop juju-7ff01eeb-9bfd-4d01-8be6-34d9bbd089b8-machine-0

# Retry kill-controller
$ juju2 kill-controller lxd
WARNING! This command will destroy the "local.lxd" controller.
This includes all machines, services, data and other resources.

Continue [y/N]? y
Unable to open API: open connection timed out
Unable to connect to the API server. Destroying through provider.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

A timeout may also be in order, but I think it's more important to resolve the cause (lp:1566420) of this particular bug. If you run "kill-controller", it should remove that service if the machine couldn't be provisioned. That, or we need to find a way to destroy hosted models without relying on a working controller.

Changed in juju-core:
milestone: 2.0-beta4 → 2.0.1
tags: added: rc1 usability
Changed in juju-core:
milestone: 2.0.1 → 2.0-beta7
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta7 → 2.0-beta8
Changed in juju-core:
milestone: 2.0-beta8 → 2.0-beta9
Revision history for this message
Matt Bruzek (mbruzek) wrote :

I also found a situation where I could not destroy or kill controller: https://bugs.launchpad.net/juju-core/+bug/1588898

A timeout or some similar retry mechanism would be desirable here so this command does not loop forever. Please consider the use case: operations people love to script things. The predecessor command "destroy-environment" could be forced, so a script could safely attempt to destroy an environment without having to worry about looping forever. That is not the case with "destroy-controller" in 2.0-beta7, which retries forever and so makes the command impossible to script.

Regardless of whether you think these commands _should_ be scripted, we should not rule out the possibility by retrying forever.
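
For what it's worth, here is a minimal sketch (in Go, assuming a Go-based harness) of how automation can protect itself today: run the command under a hard deadline and answer the confirmation prompt on stdin. The controller name "lxd" is just the example from this bug; the deadline value is arbitrary.

package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

func main() {
	// Give the command a hard deadline so the automation never hangs forever.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// Run kill-controller and answer the "Continue [y/N]?" prompt on stdin.
	cmd := exec.CommandContext(ctx, "juju", "kill-controller", "lxd")
	cmd.Stdin = strings.NewReader("y\n")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			fmt.Fprintln(os.Stderr, "kill-controller hung; falling back to provider-level cleanup")
		}
		os.Exit(1)
	}
}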

Changed in juju-core:
assignee: nobody → Anastasia (anastasia-macmood)
assignee: Anastasia (anastasia-macmood) → nobody
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta9 → 2.0-beta10
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta10 → 2.0-beta11
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta11 → 2.0-beta12
Changed in juju-core:
milestone: 2.0-beta12 → 2.0-beta13
tags: added: oil
tags: added: oil-2.0
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We're hitting this about once a day in OIL, against the MAAS provider.

At the end of an otherwise successful run, we hit this and then hang until the controller node is released manually in MAAS.

It's a real problem for us because it hangs our test automation.

tags: added: 2.0
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta13 → 2.0-beta14
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this on 2.0 beta 13. I've attached a tarball with /var/log from the controller machine.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta14 → 2.0-beta15
Changed in juju-core:
milestone: 2.0-beta15 → 2.0-beta16
Changed in juju-core:
milestone: 2.0-beta16 → 2.0-beta17
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta17 → none
milestone: none → 2.0-beta17
Changed in juju:
milestone: 2.0-beta17 → 2.0-beta18
Revision history for this message
Larry Michel (lmic) wrote :

We have been able to recreate this with beta15.

Changed in juju:
importance: High → Critical
Changed in juju:
assignee: nobody → Tim Penhey (thumper)
Tim Penhey (thumper)
Changed in juju:
importance: Critical → High
Changed in juju:
milestone: 2.0-beta18 → 2.0-rc1
Revision history for this message
Tim Penhey (thumper) wrote :

The problem with papering over this issue is that there is clearly a bug somewhere, and if we don't fix the bug it is likely to bite us in other places, like when trying to upgrade or migrate an existing model.

I'd much rather work out where the blockage is and fix it than paper over the problem and just kill the models with a big hammer.

Revision history for this message
Tim Penhey (thumper) wrote :

Jason, can I get some of your time to go over this please?

Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Larry Michel (lmic) wrote :

I think I have recreated a scenario where the model can be destroyed but the controller still thinks it exists. This happened while deploying with vSphere as the provider. The original issue was that the vSphere provider had been returning 503. The first attempt at destroy-controller failed with a 503. I then tried to remove the model, which seemed to work, but destroying the controller then got stuck waiting on a model that no longer existed!

Step 1:
===========================================================================================
jenkins@lmic-s9-instance:~$ juju controllers
CONTROLLER                 MODEL    USER         ACCESS+    CLOUD/REGION  MODELS+  MACHINES+  VERSION+
mycontroller               default  admin@local  superuser  larry         2        1          2.0-beta18
vspherecontroller-beta18*  default  admin@local  superuser  vsphere/dc0   2        1          2.0-beta18

+ these are the last known values, run with --refresh to see the latest information.
===========================================================================================

Step 2 (503 error):
===========================================================================================

jenkins@lmic-s9-instance:~$ juju destroy-controller vspherecontroller-beta18
WARNING! This command will destroy the "vspherecontroller-beta18" controller.
This includes all machines, applications, data and other resources.

Continue? (y/N):y
ERROR getting controller environ: getting environ using bootstrap config from client store: failed to create new client: 503 Service Unavailable
jenkins@lmic-s9-instance:~$ juju status
MODEL    CONTROLLER                CLOUD/REGION  VERSION
default  vspherecontroller-beta18  vsphere/dc0   2.0-beta18

APP                    VERSION  STATUS   SCALE  CHARM                  STORE       REV  OS      NOTES
cinder                          error    1      cinder                 jujucharms  255  ubuntu
glance                          waiting  0/1    glance                 jujucharms  251  ubuntu
keystone                        waiting  1      keystone               jujucharms  256  ubuntu
mongodb                         unknown  1      mongodb                jujucharms  37   ubuntu
neutron-api                     error    1      neutron-api            jujucharms  0    ubuntu
neutron-gateway                 active   1      neutron-gateway        jujucharms  230  ubuntu
neutron-openvswitch             unknown  0      neutron-openvswitch    jujucharms  236  ubuntu
nova-cloud-controller           error    1      nova-cloud-controller  jujucharms  290  ubuntu
nova-vmware                     unknown  1      nova-compute-vmware    jujucharms  1    ubuntu
nsx-transport-node              unknown  1      nsx-transport-node     jujucharms  0    ubuntu
openstack-dashboard             waiting  1      openstack-dashboard    jujucharms  241  ubuntu
percona-cluster                 active   0/1    percona-cluster        jujucharms  244  ubuntu
rabbitmq-server                 active   1      rabbitmq-server        jujucharms  49   ubuntu
swift-proxy                     blocked  0/1    swift-proxy            jujucharms  54...

Revision history for this message
Larry Michel (lmic) wrote :

Forgot to add the output of juju models:

jenkins@lmic-s9-instance:~$ juju models
CONTROLLER: vspherecontroller-beta18

MODEL       OWNER        STATUS      MACHINES  CORES  ACCESS  LAST CONNECTION
controller  admin@local  destroying  1         2      admin   6 seconds ago
default     admin@local  destroying  0         -      admin   33 minutes ago

jenkins@lmic-s9-instance:~$ juju models
CONTROLLER: vspherecontroller-beta18

MODEL       OWNER        STATUS      MACHINES  CORES  ACCESS  LAST CONNECTION
controller  admin@local  destroying  1         2      admin   6 seconds ago
default     admin@local  destroying  0         -      admin   34 minutes ago

jenkins@lmic-s9-instance:~$ juju kill-controller vspherecontroller-beta18
WARNING! This command will destroy the "vspherecontroller-beta18" controller.
This includes all machines, applications, data and other resources.

Continue? (y/N):y
Destroying controller "vspherecontroller-beta18"
Waiting for resources to be reclaimed
Waiting on 1 model, 4 machines
Waiting on 1 model, 4 machines
Waiting on 1 model, 4 machines
Waiting on 1 model, 4 machines
^C

Revision history for this message
Larry Michel (lmic) wrote :

logs

Revision history for this message
Larry Michel (lmic) wrote :

I was finally able to kill the controller by powering it off manually. The first kill-controller after that failed, saying that the machine was powered off, but the next kill-controller seemed to go through.

jenkins@lmic-s9-instance:~$ juju status
ERROR no model in focus

Please use "juju models" to see models available to you.
You can set current model by running "juju switch"
or specify any other model on the command line using the "-m" flag.

jenkins@lmic-s9-instance:~$ juju switch
vspherecontroller-beta18
jenkins@lmic-s9-instance:~$ juju kill-controller vspherecontroller-beta18
WARNING! This command will destroy the "vspherecontroller-beta18" controller.
This includes all machines, applications, data and other resources.

Continue? (y/N):y
Unable to open API: open connection timed out
Unable to connect to the API server. Destroying through provider.
ERROR The attempted operation cannot be performed in the current state (Powered off).
ERROR destroying instances: failed to remowe instances: The attempted operation cannot be performed in the current state (Powered off).
jenkins@lmic-s9-instance:~$ juju kill-controller vspherecontroller-beta18
WARNING! This command will destroy the "vspherecontroller-beta18" controller.
This includes all machines, applications, data and other resources.

Continue? (y/N):y
Unable to open API: open connection timed out
Unable to connect to the API server. Destroying through provider.
jenkins@lmic-s9-instance:~$ juju models
error: no controller

Please either create your own new controller using "juju bootstrap" or
connect to another controller that you have been given access to using "juju register".

jenkins@lmic-s9-instance:~$ juju controllers
CONTROLLER    MODEL    USER         ACCESS+    CLOUD/REGION  MODELS+  MACHINES+  VERSION+
mycontroller  default  admin@local  superuser  larry         2        1          2.0-beta18

+ these are the last known values, run with --refresh to see the latest information.

jenkins@lmic-s9-instance:~$

Changed in juju:
status: Incomplete → New
Changed in juju:
status: New → Triaged
Changed in juju:
milestone: 2.0-rc1 → 2.0-rc2
Revision history for this message
Larry Michel (lmic) wrote :

I hit a scenario where my model could not be destroyed for over a day. I thought it was a recreate of this bug since the model was stuck in destroying, but kill-controller worked in that case.

I am including the data since it could still be a useful data point. Note that I had waited to try kill-controller because of another model I was still using.

The controller was bootstrapped following the beta18 upgrade, so both controller and model were at the same level. Before hitting the destroy-model issue, I had been cycling through destroy-model; add-model; deploy bundle.yaml and back to destroy-model a number of times while troubleshooting a deployment issue. Once I could not destroy the model, I had to add a new model.

ubuntu@lmic-s9-instance:~$ sudo su - jenkins
sudo: unable to resolve host lmic-s9-instance
jenkins@lmic-s9-instance:~$ juju models
CONTROLLER: mycontroller

MODEL       OWNER        STATUS      MACHINES  CORES  ACCESS  LAST CONNECTION
controller  admin@local  available   1         8      admin   18 seconds ago
default     admin@local  destroying  0         -      admin   2016-09-18
nova*       admin@local  available   11        24     admin   14 hours ago

The errors I saw from the time the destroy-model "default" was started were:

0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:31:54 ERROR juju.rpc server.go:510 error writing response: write tcp 10.245.0.183:17070->10.245.0.189:51791: write: broken pipe
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:31:54 ERROR juju.rpc server.go:510 error writing response: write tcp 10.245.0.183:17070->10.245.0.189:51791: write: broken pipe
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:06 ERROR juju.worker.dependency engine.go:539 "undertaker" manifold worker returned unexpected error: cannot remove model: an error occurred, unable to remove model
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:10 ERROR juju.worker.dependency engine.go:539 "undertaker" manifold worker returned unexpected error: cannot remove model: an error occurred, unable to remove model
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:11 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {settings ce6d85b2-b433-48cc-8395-d4062779c1e0:r#1#peer#cinder/0} for applying transaction 57dedda6b2f95d160ae6e16e_77c34c68
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:14 ERROR juju.worker.dependency engine.go:539 "undertaker" manifold worker returned unexpected error: cannot remove model: an error occurred, unable to remove model
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:14 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {settings ce6d85b2-b433-48cc-8395-d4062779c1e0:r#1#peer#cinder/0} for applying transaction 57dedda6b2f95d160ae6e16e_77c34c68
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:17 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resu...


Revision history for this message
Tom Barber (spicule) wrote :

Here you go, folks: I have a completely wedged set of boxes in EC2 that I can't switch off.

Tim Penhey (thumper)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Tim Penhey (thumper) wrote :

Unfortunately the logs attached were not able to shed any light on the underlying problem.

The fact that some models get "wedged" is annoying and something that we need to endeavour to fix over time; however, having that block kill-controller is also sub-optimal.

After discussions with some of the team, I think we have come up with a good enough solution.

I'm going to add a --timeout flag (-t for short) that accepts a duration. It defaults to 5m (five minutes), but can be overridden to any valid duration.

While watching the model summary as the models come down, we'll reset a timer every time the summary changes. If there is no change in the summary after the timeout, kill-controller switches modes and attempts to kill the models in a much more direct way: it downloads the model config and then uses provider calls to destroy the models.

Initially I had a concern about sending all of the model configuration to the client so that the client could kill the models, but since it is a controller admin doing the destruction, that user already has the authority and ability to get onto the controller machines, and we are giving them no more access to data than they already have.

For the first 30s of any timer reset, we will not show any additional output. After 30s of no change from the controller, additional output will be shown. For example:

Waiting on 1 model, 4 machines
Waiting on 1 model, 4 machines
Waiting on 1 model, 4 machines (direct destruction in 4m30s)
Waiting on 1 model, 4 machines (direct destruction in 4m25s)
Waiting on 1 model, 4 machines (direct destruction in 4m20s)

and so on. If there is a change, this message is removed until there is no change for 30s.

Waiting on 1 model, 4 machines (direct destruction in 4m25s)
Waiting on 1 model, 4 machines (direct destruction in 4m20s)
Waiting on 1 model, 2 machines
Waiting on 1 model, 2 machines

etc.

This is the approach I'm going to start working on now.
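
For illustration, the following is a rough, self-contained Go sketch of the countdown loop described in the comment above. The summary feed, the 5-second print interval and the direct-destroy step are simulated stand-ins, not code from the Juju client, and the demo timeout is shortened from the proposed 5m default.

package main

import (
	"fmt"
	"time"
)

// summary mirrors the "Waiting on N models, M machines" line.
type summary struct {
	models, machines int
}

// watchAndKill resets a countdown every time the summary changes. After 30s
// with no change it starts printing the "direct destruction in ..." suffix,
// and once the full timeout passes with no change it switches to destroying
// the models directly through the provider (simulated here by returning).
func watchAndKill(updates <-chan summary, timeout time.Duration) {
	const quiet = 30 * time.Second

	var current summary
	lastChange := time.Now()
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case s, ok := <-updates:
			if !ok {
				return // everything was reclaimed via the API
			}
			if s != current {
				current = s
				lastChange = time.Now() // progress seen: reset the countdown
			}
		case <-ticker.C:
			idle := time.Since(lastChange)
			if idle >= timeout {
				fmt.Println("No progress; destroying models directly through the provider")
				return
			}
			line := fmt.Sprintf("Waiting on %d model, %d machines", current.models, current.machines)
			if idle >= quiet {
				line += fmt.Sprintf(" (direct destruction in %v)", (timeout - idle).Round(5*time.Second))
			}
			fmt.Println(line)
		}
	}
}

func main() {
	updates := make(chan summary, 1)
	updates <- summary{models: 1, machines: 4}
	go func() {
		time.Sleep(40 * time.Second)
		updates <- summary{models: 1, machines: 2} // a change resets the timer
	}()
	// Short timeout for the demo; the proposed default is 5m.
	watchAndKill(updates, 2*time.Minute)
}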

Tim Penhey (thumper)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Christian Muirhead (2-xtian) wrote :

I think the underlying cause here is the same as for http://pad.lv/1611093 and http://pad.lv/1611159
PR with a fix for that here: https://github.com/juju/juju/pull/6351

Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
Curtis Hovey (sinzui)
tags: added: gap
tags: added: eda
Curtis Hovey (sinzui)
Changed in juju-ci-tools:
assignee: nobody → Andrew James Beach (andrewjbeach)
status: New → Fix Released
Revision history for this message
Katja (katja-decuir) wrote :

Hey, I submitted a bug a while ago (about a month ago, in 2017) kinda similar to this and was told it's already solved.

My issue is that I have a controller I just can't connect to anymore. It reports some random IP that doesn't exist and that it can't connect to, because the DHCP server can't resolve it or anything; i.e. it's using 10.0.4.xx while the other controllers are using 192.168.0.xx. It can't connect, it can't be destroyed because it times out, and it can't be killed because that also just times out.

How do I delete this, either through Juju, by literally sudo rm'ing some file, or from LXD itself?

Revision history for this message
John A Meinel (jameinel) wrote :

'juju unregister' will remove the client's records of the controller, but it will not clean up any resources that are still running, if anything was left.

John
=:->

Revision history for this message
Katja (katja-decuir) wrote :

If I unregister the controllers and then reboot, will that stop LXD from resuming whatever virtual machines it was using for those controllers? I.e., will Juju not tell LXD to start them on reboot, or do I have to comb through LXD and figure out which devices need to be manually deleted? Thank you for a straight answer.

Revision history for this message
Katja (katja-decuir) wrote :

Never mind, I fixed it manually. I went into /var/lib/lxd and unmounted everything (apparently the stuff that cannot be deleted isn't even unmounted), then removed everything except what was running in juju status, and then unregistered the controllers. I think that's all I needed to do to clean up all the resources.

thanks!
