killing instance outside of juju, doesn't get noticed

Bug #1205451 reported by Sidnei da Silva on 2013-07-26
This bug affects 4 people
Affects: juju-core
Status: Fix Released
Importance: High
Assigned to: Frank Mueller
Milestone: 1.17.0

Bug Description

One frequent problem that I have with canonistack is that sometimes an instance will be stuck in the 'BUILD' state, waiting for a resource that is not available (e.g. a fixed IP address when the DHCP pool is exhausted).

In pyjuju I would just kill that instance outside of juju (e.g. `nova delete`) and juju would promptly launch a new one.

In gojuju, I killed the instance and `juju status` still lists it as pending; it hasn't tried to fire up a new one.
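Noticing an out-of-band termination amounts to periodically reconciling the provider's list of live instances against Juju's machine records. A minimal illustrative sketch of that reconciliation (plain Python, not juju-core code; all names here are hypothetical):

```python
# Illustrative sketch: compare Juju's machine records against the set of
# instance ids the provider still reports, and flag machines whose
# backing instance has disappeared (e.g. killed via `nova delete`).
# All names are hypothetical, not from the juju-core codebase.

def find_missing_instances(machine_records, provider_instance_ids):
    """Return machine ids whose recorded instance no longer exists."""
    alive = set(provider_instance_ids)
    return [
        machine_id
        for machine_id, instance_id in machine_records.items()
        if instance_id not in alive
    ]

# Machine "1" was terminated behind Juju's back, so only i-abc survives.
records = {"0": "i-abc", "1": "i-def"}
print(find_missing_instances(records, ["i-abc"]))  # → ['1']
```

A loop like this, run by the state server, is what would let Juju mark the machine as gone and reprovision instead of reporting it as pending forever.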

Curtis Hovey (sinzui) wrote :

In some ways, this issue overlaps with bug 1089291 -- juju couldn't be used to terminate the machine. I think this bug needs to remain separate since users can always use their provider to terminate an instance, and we expect Juju to notice.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Curtis Hovey (sinzui) on 2013-10-12
tags: added: state-server
Curtis Hovey (sinzui) wrote :

From @Dave's duplicate:

The current Juju model does not cope with machines being removed manually behind the scenes.

In our EC2/HP Cloud world this is rarely a problem, but it is more common in the MAAS world, where people are predisposed to percussive maintenance. See LP #1206532.

Another more subtle effect of this is relation-departed/broken and peer relation hooks do not fire to signal that a backend service unit has gone away.

Obviously, the charm should make an attempt at coping with this; but if we made the charm authors do everything, there would be nothing left for us to do :) More realistically, charm authors can only work within the constraints of the software they are charming; memcached, for instance, has no heartbeat mechanism.

One solution is to hook the agent presence notification system into the relation system. This would not solve the phantom machine problem from LP #1206532, but it would allow some level of automagic healing for Juju environments.
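The wiring suggested above can be sketched as a presence tracker that reports units whose heartbeat has lapsed, so relation-departed hooks could be queued for their peers. This is an illustrative simulation only; the class and method names are hypothetical, not juju-core API:

```python
import time

# Illustrative sketch: track per-unit heartbeats and report units whose
# presence has lapsed, which is the signal that would drive
# relation-departed hooks for the peers of a vanished unit.
# All names are hypothetical, not from the juju-core codebase.

class PresenceTracker:
    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before "departed"
        self.last_seen = {}      # unit name -> timestamp of last ping

    def ping(self, unit, now=None):
        """Record a heartbeat from a unit agent."""
        self.last_seen[unit] = time.time() if now is None else now

    def departed_units(self, now=None):
        """Units silent for longer than the timeout."""
        now = time.time() if now is None else now
        return [u for u, t in self.last_seen.items()
                if now - t > self.timeout]

tracker = PresenceTracker(timeout=30)
tracker.ping("memcached/0", now=0)
tracker.ping("memcached/1", now=25)
# At t=40, memcached/0 has been silent for 40s (> 30s), so it would
# trigger relation-departed; memcached/1 is still within the window.
print(tracker.departed_units(now=40))  # → ['memcached/0']
```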

tags: added: cts papercut
Curtis Hovey (sinzui) wrote :

From @Dave's other duplicate:

Scenario:

Customer has used `juju add-machine` to enroll all their MAAS nodes* into their juju environment.

During this process one or more of the machines crashed, but a machine record exists for each, so it appears as 'unallocated' in juju status.

That dead machine will attract units created with `add-unit` / `deploy` yet will never be able to provision them, nor can units that land on these zombie machines be removed; see LP #1206532.

Proposal:

Juju should consider the agent presence status at the point in time it tries to find a free machine to allocate the unit to. Machines which are currently failing presence would be excluded.

* In the exact scenario, this was done so they could use `add-unit --to` to work around the lack of MAAS tags, but the underlying problem applies without this fact.
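The placement rule proposed above can be sketched as a filter over candidate machines: skip any machine that is failing presence at assignment time. An illustrative sketch (the record layout and function name are hypothetical, not juju-core's):

```python
# Illustrative sketch of the proposal: when picking a clean machine for
# add-unit/deploy, exclude machines whose agent is failing presence
# checks so units never land on zombie machines.
# The record layout and function name are hypothetical.

def pick_free_machine(machines):
    """Return the id of the first unassigned machine with a live agent,
    or None if no such machine exists."""
    for m in machines:
        if not m["units"] and m["agent_alive"]:
            return m["id"]
    return None

machines = [
    {"id": "1", "units": [], "agent_alive": False},  # crashed at enlistment
    {"id": "2", "units": ["db/0"], "agent_alive": True},
    {"id": "3", "units": [], "agent_alive": True},
]
print(pick_free_machine(machines))  # → 3
```

Under this rule, machine "1" (empty but dead) is passed over instead of attracting units it can never provision.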

tags: added: canonistack maas
Kapil Thangavelu (hazmat) wrote :

This is pretty similar to bug #1176961.

Kapil Thangavelu (hazmat) wrote :

Also related: bug #1227450 (lack of retries against transient provider errors).

Curtis Hovey (sinzui) on 2013-10-17
tags: added: cts-cloud-review
removed: cts
Curtis Hovey (sinzui) on 2013-11-12
Changed in juju-core:
milestone: none → 1.17.0
assignee: nobody → Frank Mueller (themue)
status: Triaged → In Progress
Frank Mueller (themue) on 2013-11-20
Changed in juju-core:
status: In Progress → Fix Committed
John A Meinel (jameinel) wrote :

There are also some bits of bug #1089291 here. At least, if you have a machine that is in a "dead and will never be alive" state, you can use `juju destroy-machine --force` to get that machine out of state (it will both purge units that were assigned to it and not try to assign things there in the future, because it isn't there anymore :)

John A Meinel (jameinel) wrote :

FWIW I can reproduce the bug as outlined using the local provider.

Doing the following:
$ juju bootstrap -e local
$ juju add-machine -e local
# Wait for the machine to finish starting up
$ juju status
machines:
  "0":
    agent-state: started
    agent-version: 1.16.4.1
    dns-name: 10.0.3.1
    instance-id: localhost
    series: precise
  "1":
    agent-state: started
    agent-version: 1.16.4.1
    instance-id: jameinel-local-machine-1
    instance-state: missing
    series: precise
$ sudo lxc-ls --fancy # to get the IP address
$ ssh ubuntu@10.0.3.37 # depends on IP
lxc$ sudo ifdown eth0
# exit the SSH session which is now dead
$ juju status
machines:
  "0":
    agent-state: started
    agent-version: 1.16.4.1
    dns-name: 10.0.3.1
    instance-id: localhost
    series: precise
  "1":
    agent-state: started
    agent-version: 1.16.4.1
    instance-id: jameinel-local-machine-1
    instance-state: missing
    series: precise
# it stays at 'started' for an unknown length of time
# when I do:
$ sudo lxc-stop -n jameinel-local-machine-1
# Then the instance still shows up as started
# Interestingly only when I do:
$ sudo lxc-start --daemon -n jameinel-local-machine-1
# then within a second or two I see:
machines:
  "0":
    agent-state: started
    agent-version: 1.16.4.1
    dns-name: 10.0.3.1
    instance-id: localhost
    series: precise
  "1":
    agent-state: down
    agent-state-info: (started)
    agent-version: 1.16.4.1
    instance-id: jameinel-local-machine-1
    instance-state: missing
    series: precise

# And then once the machine finishes starting back up I can see:
2013-12-11 13:35:33 INFO juju.worker.machiner machiner.go:52 "machine-2" started

# And after almost exactly 1 minute it shows up as back up and running.

Now, if I don't stop eth0 before stopping the LXC, I see the machine show up as stopped immediately (and it comes back up as started just as quickly).

Note that even after upgrading to trunk (r2131), if I do `sudo ifdown eth0`, the agent still shows as 'started' 2 minutes later.

So I'm not sure if we've actually fixed the bug.

machines:
  "0":
    agent-state: started
    agent-version: 1.17.0.1
    dns-name: 10.0.3.1
    instance-id: localhost
    series: precise
  "2":
    agent-state: started
    agent-version: 1.17.0.1
    instance-id: jameinel-local-machine-2
    instance-state: missing
    series: precise
services: {}

John A Meinel (jameinel) wrote :

I did eventually see the status go to Down, but it was something like 5 min later.

John A Meinel (jameinel) wrote :

The part of this bug about not assigning new units to machines which are down was split into bug #1260247.

Curtis Hovey (sinzui) on 2013-12-20
Changed in juju-core:
status: Fix Committed → Fix Released