kill-controller should always work to bring down a controller

Bug #1566426 reported by Cheryl Jennings
This bug affects 5 people
Affects         Status        Importance  Assigned to         Milestone
Canonical Juju  Fix Released  High        Tim Penhey
juju-ci-tools   Fix Released  Undecided   Andrew James Beach

Bug Description

There may be cases where kill-controller is used to bring down a broken controller and it never completes. In my case, I had a service which never had a machine provisioned for it (see bug #1566420). I tried to kill the controller, but it got stuck in a loop, waiting for the service to be removed (which would never happen):

ubuntu@ip-172-31-18-222:~$ juju2 kill-controller lxd
WARNING! This command will destroy the "local.lxd" controller.
This includes all machines, services, data and other resources.

Continue [y/N]? y
Destroying controller "local.lxd"
Waiting for resources to be reclaimed
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
Waiting on 1 model, 2 services
<repeated until I Ctrl+C>

It would be nice if kill-controller had some sort of timeout when destroying through the API and would fall back to destroying through the provider if the "nice" way didn't complete in a timely manner.
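
For illustration, a minimal Go sketch of that suggested behaviour follows: wait for the API-driven destroy under a deadline, and fall back to the provider if it never completes. waitForAPIDestroy and destroyViaProvider are hypothetical stand-ins, not real Juju functions, and the timings are made up.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForAPIDestroy stands in for asking the controller to destroy everything
// via the API and waiting for it to report that resources are reclaimed.
func waitForAPIDestroy(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // the "nice" path never completed
	case <-time.After(10 * time.Second):
		return nil // pretend the controller finished cleanly
	}
}

// destroyViaProvider stands in for tearing down the controller's instances
// directly through the cloud provider.
func destroyViaProvider(ctx context.Context) error {
	fmt.Println("terminating controller instances through the provider")
	return nil
}

// killController tries the API first, but gives up after apiTimeout and
// falls back to the provider, so a wedged model cannot block it forever.
func killController(ctx context.Context, apiTimeout time.Duration) error {
	apiCtx, cancel := context.WithTimeout(ctx, apiTimeout)
	defer cancel()

	err := waitForAPIDestroy(apiCtx)
	if err == nil || !errors.Is(err, context.DeadlineExceeded) {
		return err
	}
	fmt.Println("API destroy timed out; destroying through provider")
	return destroyViaProvider(ctx)
}

func main() {
	if err := killController(context.Background(), 5*time.Minute); err != nil {
		fmt.Println("kill-controller failed:", err)
	}
}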

Revision history for this message
Cheryl Jennings (cherylj) wrote :

For others hitting this issue: a workaround is to terminate the controller instance through the provider, then run kill-controller again.

Example with lxd:
# Find the instance ID for the controller machine:
ubuntu@ip-172-31-18-222:~$ juju2 status -m lxd:admin
[Services]
NAME    STATUS   EXPOSED  CHARM
ubuntu  unknown  false    cs:trusty/ubuntu-7

[Units]
ID        WORKLOAD-STATUS  JUJU-STATUS  VERSION  MACHINE  PORTS  PUBLIC-ADDRESS  MESSAGE
ubuntu/0  unknown          allocating                                            Waiting for agent initialization to finish

[Machines]
ID  STATE    DNS         INS-ID                                               SERIES  AZ
0   started  10.0.3.189  juju-7ff01eeb-9bfd-4d01-8be6-34d9bbd089b8-machine-0  xenial

# Stop the instance
$ lxc stop juju-7ff01eeb-9bfd-4d01-8be6-34d9bbd089b8-machine-0

# Retry kill-controller
$ juju2 kill-controller lxd
WARNING! This command will destroy the "local.lxd" controller.
This includes all machines, services, data and other resources.

Continue [y/N]? y
Unable to open API: open connection timed out
Unable to connect to the API server. Destroying through provider.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

A timeout may also be in order, but I think it's more important to resolve the cause (lp:1566420) of this particular bug. If you run "kill-controller", it should remove that service if the machine couldn't be provisioned. That, or we need to find a way to destroy hosted models without relying on a working controller.

Changed in juju-core:
milestone: 2.0-beta4 → 2.0.1
tags: added: rc1 usability
Changed in juju-core:
milestone: 2.0.1 → 2.0-beta7
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta7 → 2.0-beta8
Changed in juju-core:
milestone: 2.0-beta8 → 2.0-beta9
Revision history for this message
Matt Bruzek (mbruzek) wrote :

I also found a situation where I could not destroy or kill controller: https://bugs.launchpad.net/juju-core/+bug/1588898

A timeout or some similar retry mechanism would be desirable here so this command does not loop forever. Please consider the use case: operations people love to script things. The predecessor command "destroy-environment" could be forced, so a script could safely attempt to destroy an environment without having to worry about looping forever. That is not the case with "destroy-controller" in 2.0-beta7, which retries forever and so makes the command impossible to script.

Regardless of whether you think these commands _should_ be scripted, we should not rule out the possibility by retrying forever.
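
For what it's worth, here is a minimal sketch (in Go, assuming a Go-based harness) of how automation can protect itself today: run the command under a hard deadline and answer the confirmation prompt on stdin. The controller name "lxd" is just the example from this bug; the deadline value is arbitrary.

package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

func main() {
	// Give the command a hard deadline so the automation never hangs forever.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	// Run kill-controller and answer the "Continue [y/N]?" prompt on stdin.
	cmd := exec.CommandContext(ctx, "juju", "kill-controller", "lxd")
	cmd.Stdin = strings.NewReader("y\n")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			fmt.Fprintln(os.Stderr, "kill-controller hung; falling back to provider-level cleanup")
		}
		os.Exit(1)
	}
}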

Changed in juju-core:
assignee: nobody → Anastasia (anastasia-macmood)
assignee: Anastasia (anastasia-macmood) → nobody
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta9 → 2.0-beta10
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta10 → 2.0-beta11
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta11 → 2.0-beta12
Changed in juju-core:
milestone: 2.0-beta12 → 2.0-beta13
tags: added: oil
tags: added: oil-2.0
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We're hitting this about once a day in OIL, against the MAAS provider.

At the end of an otherwise successful run, we hit this and then hang until the controller node is released manually in MAAS.

It's a real problem for us because it hangs our test automation.

tags: added: 2.0
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta13 → 2.0-beta14
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this on 2.0 beta 13. I've attached a tarball with /var/log from the controller machine.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta14 → 2.0-beta15
Changed in juju-core:
milestone: 2.0-beta15 → 2.0-beta16
Changed in juju-core:
milestone: 2.0-beta16 → 2.0-beta17
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta17 → none
milestone: none → 2.0-beta17
Changed in juju:
milestone: 2.0-beta17 → 2.0-beta18
Revision history for this message
Larry Michel (lmic) wrote :

We have been able to recreate this with beta15.

Changed in juju:
importance: High → Critical
Changed in juju:
assignee: nobody → Tim Penhey (thumper)
Tim Penhey (thumper)
Changed in juju:
importance: Critical → High
Changed in juju:
milestone: 2.0-beta18 → 2.0-rc1
Revision history for this message
Tim Penhey (thumper) wrote :

The problem with papering over this issue is that there is clearly a bug somewhere, and if we don't fix the bug it is likely to bite us in other places, like when trying to upgrade or migrate an existing model.

I'd much rather work out where the blockage is and fix it than paper over the problem and just kill the models with a big hammer.

Revision history for this message
Tim Penhey (thumper) wrote :

Jason, can I get some of your time to go over this please?

Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Larry Michel (lmic) wrote :

I think I have recreated a scenario where the model can be destroyed but the controller still thinks it exists. This happened while deploying with vSphere as the provider. The original issue was that the vSphere provider had been returning 503. The first attempt at destroy-controller failed with a 503. I then tried to remove the model, which seemed to work, but destroying the controller then got stuck waiting on a model that no longer existed!

Step 1:
===========================================================================================
jenkins@lmic-s9-instance:~$ juju controllers
CONTROLLER                 MODEL    USER         ACCESS+    CLOUD/REGION  MODELS+  MACHINES+  VERSION+
mycontroller               default  admin@local  superuser  larry         2        1          2.0-beta18
vspherecontroller-beta18*  default  admin@local  superuser  vsphere/dc0   2        1          2.0-beta18

+ these are the last known values, run with --refresh to see the latest information.
===========================================================================================

Step 2 (503 error):
===========================================================================================

jenkins@lmic-s9-instance:~$ juju destroy-controller vspherecontroller-beta18
WARNING! This command will destroy the "vspherecontroller-beta18" controller.
This includes all machines, applications, data and other resources.

Continue? (y/N):y
ERROR getting controller environ: getting environ using bootstrap config from client store: failed to create new client: 503 Service Unavailable
jenkins@lmic-s9-instance:~$ juju status
MODEL    CONTROLLER                CLOUD/REGION  VERSION
default  vspherecontroller-beta18  vsphere/dc0   2.0-beta18

APP                    VERSION  STATUS   SCALE  CHARM                  STORE       REV  OS      NOTES
cinder                          error    1      cinder                 jujucharms  255  ubuntu
glance                          waiting  0/1    glance                 jujucharms  251  ubuntu
keystone                        waiting  1      keystone               jujucharms  256  ubuntu
mongodb                         unknown  1      mongodb                jujucharms  37   ubuntu
neutron-api                     error    1      neutron-api            jujucharms  0    ubuntu
neutron-gateway                 active   1      neutron-gateway        jujucharms  230  ubuntu
neutron-openvswitch             unknown  0      neutron-openvswitch    jujucharms  236  ubuntu
nova-cloud-controller           error    1      nova-cloud-controller  jujucharms  290  ubuntu
nova-vmware                     unknown  1      nova-compute-vmware    jujucharms  1    ubuntu
nsx-transport-node              unknown  1      nsx-transport-node     jujucharms  0    ubuntu
openstack-dashboard             waiting  1      openstack-dashboard    jujucharms  241  ubuntu
percona-cluster                 active   0/1    percona-cluster        jujucharms  244  ubuntu
rabbitmq-server                 active   1      rabbitmq-server        jujucharms  49   ubuntu
swift-proxy                     blocked  0/1    swift-proxy            jujucharms  54...

Revision history for this message
Larry Michel (lmic) wrote :

Forgot to add the output of juju models:

jenkins@lmic-s9-instance:~$ juju models
CONTROLLER: vspherecontroller-beta18

MODEL       OWNER        STATUS      MACHINES  CORES  ACCESS  LAST CONNECTION
controller  admin@local  destroying  1         2      admin   6 seconds ago
default     admin@local  destroying  0         -      admin   33 minutes ago

jenkins@lmic-s9-instance:~$ juju models
CONTROLLER: vspherecontroller-beta18

MODEL       OWNER        STATUS      MACHINES  CORES  ACCESS  LAST CONNECTION
controller  admin@local  destroying  1         2      admin   6 seconds ago
default     admin@local  destroying  0         -      admin   34 minutes ago

jenkins@lmic-s9-instance:~$ juju kill-controller vspherecontroller-beta18
WARNING! This command will destroy the "vspherecontroller-beta18" controller.
This includes all machines, applications, data and other resources.

Continue? (y/N):y
Destroying controller "vspherecontroller-beta18"
Waiting for resources to be reclaimed
Waiting on 1 model, 4 machines
Waiting on 1 model, 4 machines
Waiting on 1 model, 4 machines
Waiting on 1 model, 4 machines
^C

Revision history for this message
Larry Michel (lmic) wrote :

logs

Revision history for this message
Larry Michel (lmic) wrote :

I was finally able to kill the controller by powering it off manually. The first kill-controller after that failed, saying that the machine was powered off, but the next kill-controller seemed to go through.

jenkins@lmic-s9-instance:~$ juju status
ERROR no model in focus

Please use "juju models" to see models available to you.
You can set current model by running "juju switch"
or specify any other model on the command line using the "-m" flag.

jenkins@lmic-s9-instance:~$ juju switch
vspherecontroller-beta18
jenkins@lmic-s9-instance:~$ juju kill-controller vspherecontroller-beta18
WARNING! This command will destroy the "vspherecontroller-beta18" controller.
This includes all machines, applications, data and other resources.

Continue? (y/N):y
Unable to open API: open connection timed out
Unable to connect to the API server. Destroying through provider.
ERROR The attempted operation cannot be performed in the current state (Powered off).
ERROR destroying instances: failed to remowe instances: The attempted operation cannot be performed in the current state (Powered off).
jenkins@lmic-s9-instance:~$ juju kill-controller vspherecontroller-beta18
WARNING! This command will destroy the "vspherecontroller-beta18" controller.
This includes all machines, applications, data and other resources.

Continue? (y/N):y
Unable to open API: open connection timed out
Unable to connect to the API server. Destroying through provider.
jenkins@lmic-s9-instance:~$ juju models
error: no controller

Please either create your own new controller using "juju bootstrap" or
connect to another controller that you have been given access to using "juju register".

jenkins@lmic-s9-instance:~$ juju controllers
CONTROLLER    MODEL    USER         ACCESS+    CLOUD/REGION  MODELS+  MACHINES+  VERSION+
mycontroller  default  admin@local  superuser  larry         2        1          2.0-beta18

+ these are the last known values, run with --refresh to see the latest information.

jenkins@lmic-s9-instance:~$

Changed in juju:
status: Incomplete → New
Changed in juju:
status: New → Triaged
Changed in juju:
milestone: 2.0-rc1 → 2.0-rc2
Revision history for this message
Larry Michel (lmic) wrote :

I hit a scenario where my model could not be destroyed for over a day. I thought it was a recreate of this bug since the model was stuck in destroying, but kill-controller worked in that case.

I am including the data since it could still be a useful data point. Note that I had waited to try kill-controller because of another model I was still using.

The controller was bootstrapped following the beta18 upgrade, so both controller and model were at the same level. Before hitting the destroy-model issue, I had been cycling through destroy-model; add-model; deploy bundle.yaml and back to destroy-model a number of times while troubleshooting a deployment issue. Once I could not destroy the model, I had to add a new model.

ubuntu@lmic-s9-instance:~$ sudo su - jenkins
sudo: unable to resolve host lmic-s9-instance
jenkins@lmic-s9-instance:~$ juju models
CONTROLLER: mycontroller

MODEL       OWNER        STATUS      MACHINES  CORES  ACCESS  LAST CONNECTION
controller  admin@local  available   1         8      admin   18 seconds ago
default     admin@local  destroying  0         -      admin   2016-09-18
nova*       admin@local  available   11        24     admin   14 hours ago

The errors I saw from the time the destroy-model "default" was started were:

0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:31:54 ERROR juju.rpc server.go:510 error writing response: write tcp 10.245.0.183:17070->10.245.0.189:51791: write: broken pipe
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:31:54 ERROR juju.rpc server.go:510 error writing response: write tcp 10.245.0.183:17070->10.245.0.189:51791: write: broken pipe
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:06 ERROR juju.worker.dependency engine.go:539 "undertaker" manifold worker returned unexpected error: cannot remove model: an error occurred, unable to remove model
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:10 ERROR juju.worker.dependency engine.go:539 "undertaker" manifold worker returned unexpected error: cannot remove model: an error occurred, unable to remove model
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:11 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {settings ce6d85b2-b433-48cc-8395-d4062779c1e0:r#1#peer#cinder/0} for applying transaction 57dedda6b2f95d160ae6e16e_77c34c68
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:14 ERROR juju.worker.dependency engine.go:539 "undertaker" manifold worker returned unexpected error: cannot remove model: an error occurred, unable to remove model
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:14 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resume transactions: cannot find document {settings ce6d85b2-b433-48cc-8395-d4062779c1e0:r#1#peer#cinder/0} for applying transaction 57dedda6b2f95d160ae6e16e_77c34c68
0435a4a6-2429-47c2-8eab-28c494656332 machine-0: 2016-09-18 18:32:17 ERROR juju.worker.dependency engine.go:539 "mgo-txn-resumer" manifold worker returned unexpected error: cannot resu...


Revision history for this message
Tom Barber (spicule) wrote :

Here you go, folks: I have a completely wedged set of boxes in EC2 that I can't switch off.

Tim Penhey (thumper)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Tim Penhey (thumper) wrote :

Unfortunately the logs attached were not able to shed any light on the underlying problem.

The fact that some models get "wedged" is annoying and something that we need to endeavour to fix over time; however, having that block kill-controller is also sub-optimal.

After discussions with some of the team, I think we have come up with a good enough solution.

I'm going to add a --timeout flag (-t for short) that accepts a duration. It defaults to 5m (five minutes), but can be overridden to any valid duration.

While watching the model summary as the models come down, we'll reset a timer every time the summary changes. If there is no change in the summary after the timeout, kill-controller switches modes and attempts to kill the models in a much more direct way: it downloads the model config and then uses provider calls to destroy the models.

Initially I had a concern about sending all of the model configuration to the client so that the client could kill the models, but since it is a controller admin doing the destruction, that user already has the authority and ability to get onto the controller machines, and we are giving them no more access to data than they already have.

For the first 30s of any timer reset, we will not show any additional output. After 30s of no change from the controller, additional output will be shown. For example:

Waiting on 1 model, 4 machines
Waiting on 1 model, 4 machines
Waiting on 1 model, 4 machines (direct destruction in 4m30s)
Waiting on 1 model, 4 machines (direct destruction in 4m25s)
Waiting on 1 model, 4 machines (direct destruction in 4m20s)

and so on. If there is a change, this message is removed until there is no change for 30s.

Waiting on 1 model, 4 machines (direct destruction in 4m25s)
Waiting on 1 model, 4 machines (direct destruction in 4m20s)
Waiting on 1 model, 2 machines
Waiting on 1 model, 2 machines

etc.

This is the approach I'm going to start working on now.
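
For illustration, the following is a rough, self-contained Go sketch of the countdown loop described in the comment above. The summary feed, the 5-second print interval and the direct-destroy step are simulated stand-ins, not code from the Juju client, and the demo timeout is shortened from the proposed 5m default.

package main

import (
	"fmt"
	"time"
)

// summary mirrors the "Waiting on N models, M machines" line.
type summary struct {
	models, machines int
}

// watchAndKill resets a countdown every time the summary changes. After 30s
// with no change it starts printing the "direct destruction in ..." suffix,
// and once the full timeout passes with no change it switches to destroying
// the models directly through the provider (simulated here by returning).
func watchAndKill(updates <-chan summary, timeout time.Duration) {
	const quiet = 30 * time.Second

	var current summary
	lastChange := time.Now()
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case s, ok := <-updates:
			if !ok {
				return // everything was reclaimed via the API
			}
			if s != current {
				current = s
				lastChange = time.Now() // progress seen: reset the countdown
			}
		case <-ticker.C:
			idle := time.Since(lastChange)
			if idle >= timeout {
				fmt.Println("No progress; destroying models directly through the provider")
				return
			}
			line := fmt.Sprintf("Waiting on %d model, %d machines", current.models, current.machines)
			if idle >= quiet {
				line += fmt.Sprintf(" (direct destruction in %v)", (timeout - idle).Round(5*time.Second))
			}
			fmt.Println(line)
		}
	}
}

func main() {
	updates := make(chan summary, 1)
	updates <- summary{models: 1, machines: 4}
	go func() {
		time.Sleep(40 * time.Second)
		updates <- summary{models: 1, machines: 2} // a change resets the timer
	}()
	// Short timeout for the demo; the proposed default is 5m.
	watchAndKill(updates, 2*time.Minute)
}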

Tim Penhey (thumper)
Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Christian Muirhead (2-xtian) wrote :

I think the underlying cause here is the same as for http://pad.lv/1611093 and http://pad.lv/1611159
PR with a fix for that here: https://github.com/juju/juju/pull/6351

Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
Curtis Hovey (sinzui)
tags: added: gap
tags: added: eda
Curtis Hovey (sinzui)
Changed in juju-ci-tools:
assignee: nobody → Andrew James Beach (andrewjbeach)
status: New → Fix Released
Revision history for this message
Katja (katja-decuir) wrote :

Hey, I submitted a bug a while ago (about a month ago, in 2017) kinda similar to this and was told it's already solved.

My issue is that I have a controller I just can't connect to anymore. It reports some random IP that doesn't exist and that it can't connect to, because the DHCP server can't resolve it or anything; i.e. it's using 10.0.4.xx while the other controllers are using 192.168.0.xx. It can't connect, it can't be destroyed because it times out, and it can't be killed because that also just times out.

How do I delete this, either through Juju, by literally sudo rm'ing some file, or from LXD itself?

Revision history for this message
John A Meinel (jameinel) wrote :

'juju unregister' will remove the client's records of the controller, but it will not clean up any resources that are still running, if anything was left.

John
=:->

Revision history for this message
Katja (katja-decuir) wrote :

If I unregister the controllers and then reboot, will that stop LXD from resuming whatever virtual machines it was using for those controllers? I.e., will Juju not tell LXD to start them on reboot, or do I have to comb through LXD and figure out which devices need to be manually deleted? Thank you for a straight answer.

Revision history for this message
Katja (katja-decuir) wrote :

Never mind, I fixed it manually. I went into /var/lib/lxd and unmounted everything (apparently the stuff that cannot be deleted isn't even unmounted), then removed everything except what was running in juju status, and then unregistered the controllers. I think that's all I needed to do to clean up all the resources.

thanks!
