killing instance outside of juju, doesn't get noticed

Bug #1205451 reported by Sidnei da Silva on 2013-07-26
This bug affects 4 people
Affects: juju-core
Status: Fix Released
Importance: High
Assigned to: Frank Mueller
Milestone: 1.17.0

Bug Description

One frequent problem that I have with canonistack is that sometimes an instance will be stuck in the 'BUILD' state, waiting for a resource that is not available (e.g. a fixed IP address when the DHCP pool is exhausted).

In pyjuju I would just kill that instance outside of juju (e.g. `nova delete`) and juju would promptly launch a new one.

In gojuju, I killed the instance and `juju status` still lists it as pending; it hasn't tried to fire up a new one.
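Noticing an out-of-band termination amounts to periodically reconciling the provider's list of live instances against Juju's machine records. A minimal illustrative sketch of that reconciliation (plain Python, not juju-core code; all names here are hypothetical):

```python
# Illustrative sketch: compare Juju's machine records against the set of
# instance ids the provider still reports, and flag machines whose
# backing instance has disappeared (e.g. killed via `nova delete`).
# All names are hypothetical, not from the juju-core codebase.

def find_missing_instances(machine_records, provider_instance_ids):
    """Return machine ids whose recorded instance no longer exists."""
    alive = set(provider_instance_ids)
    return [
        machine_id
        for machine_id, instance_id in machine_records.items()
        if instance_id not in alive
    ]

# Machine "1" was terminated behind Juju's back, so only i-abc survives.
records = {"0": "i-abc", "1": "i-def"}
print(find_missing_instances(records, ["i-abc"]))  # → ['1']
```

A loop like this, run by the state server, is what would let Juju mark the machine as gone and reprovision instead of reporting it as pending forever.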

Curtis Hovey (sinzui) wrote :

In some ways, this issue overlaps with bug 1089291 -- juju couldn't be used to terminate the machine. I think this bug needs to remain separate since users can always use their provider to terminate an instance, and we expect Juju to notice.

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Curtis Hovey (sinzui) on 2013-10-12
tags: added: state-server
Curtis Hovey (sinzui) wrote :

From @Dave's duplicate:

The current Juju model does not cope with machines being removed manually behind the scenes.

In our EC2/HP Cloud world this is rarely a problem, but it is more common in the MAAS world, where people are predisposed to percussive maintenance. See LP #1206532.

Another more subtle effect of this is relation-departed/broken and peer relation hooks do not fire to signal that a backend service unit has gone away.

Obviously, the charm should make an attempt at coping with this; but if we made the charm authors do everything, there would be nothing left for us to do :) More realistically, charm authors can only work within the constraints of the software they are charming; memcached, for instance, has no heartbeat mechanism.

One solution is to hook the agent presence notification system into the relation system. This would not solve the phantom machine problem from LP #1206532, but it would allow some level of automagic healing for Juju environments.
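The wiring suggested above can be sketched as a presence tracker that reports units whose heartbeat has lapsed, so relation-departed hooks could be queued for their peers. This is an illustrative simulation only; the class and method names are hypothetical, not juju-core API:

```python
import time

# Illustrative sketch: track per-unit heartbeats and report units whose
# presence has lapsed, which is the signal that would drive
# relation-departed hooks for the peers of a vanished unit.
# All names are hypothetical, not from the juju-core codebase.

class PresenceTracker:
    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before "departed"
        self.last_seen = {}      # unit name -> timestamp of last ping

    def ping(self, unit, now=None):
        """Record a heartbeat from a unit agent."""
        self.last_seen[unit] = time.time() if now is None else now

    def departed_units(self, now=None):
        """Units silent for longer than the timeout."""
        now = time.time() if now is None else now
        return [u for u, t in self.last_seen.items()
                if now - t > self.timeout]

tracker = PresenceTracker(timeout=30)
tracker.ping("memcached/0", now=0)
tracker.ping("memcached/1", now=25)
# At t=40, memcached/0 has been silent for 40s (> 30s), so it would
# trigger relation-departed; memcached/1 is still within the window.
print(tracker.departed_units(now=40))  # → ['memcached/0']
```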

tags: added: cts papercut
Curtis Hovey (sinzui) wrote :

From @Dave's other duplicate:

Scenario:

Customer has used `juju add-machine` to enroll all their MAAS nodes* into their juju environment.

During this process one or more of the machines crashed, but a machine record exists for each, so it appears as 'unallocated' in juju status.

That dead machine will attract units created with `add-unit` / `deploy` yet will never be able to provision them, nor can units that land on these zombie machines be removed; see LP #1206532.

Proposal:

Juju should consider the agent presence status at the point in time it tries to find a free machine to allocate the unit to. Machines which are currently failing presence would be excluded.

* In the exact scenario, this was done so they could use `add-unit --to` to work around the lack of MAAS tags, but the underlying problem applies without this fact.
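The placement rule proposed above can be sketched as a filter over candidate machines: skip any machine that is failing presence at assignment time. An illustrative sketch (the record layout and function name are hypothetical, not juju-core's):

```python
# Illustrative sketch of the proposal: when picking a clean machine for
# add-unit/deploy, exclude machines whose agent is failing presence
# checks so units never land on zombie machines.
# The record layout and function name are hypothetical.

def pick_free_machine(machines):
    """Return the id of the first unassigned machine with a live agent,
    or None if no such machine exists."""
    for m in machines:
        if not m["units"] and m["agent_alive"]:
            return m["id"]
    return None

machines = [
    {"id": "1", "units": [], "agent_alive": False},  # crashed at enlistment
    {"id": "2", "units": ["db/0"], "agent_alive": True},
    {"id": "3", "units": [], "agent_alive": True},
]
print(pick_free_machine(machines))  # → 3
```

Under this rule, machine "1" (empty but dead) is passed over instead of attracting units it can never provision.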

tags: added: canonistack maas
Kapil Thangavelu (hazmat) wrote :

This is pretty similar to bug #1176961.

Kapil Thangavelu (hazmat) wrote :

Also related: bug #1227450 (lack of retries against transient provider errors).

Curtis Hovey (sinzui) on 2013-10-17
tags: added: cts-cloud-review
removed: cts
Curtis Hovey (sinzui) on 2013-11-12
Changed in juju-core:
milestone: none → 1.17.0
assignee: nobody → Frank Mueller (themue)
status: Triaged → In Progress
Frank Mueller (themue) on 2013-11-20
Changed in juju-core:
status: In Progress → Fix Committed
John A Meinel (jameinel) wrote :

There are also some bits of bug #1089291 here. At least, if you have a machine that is in a "dead and will never be alive" state, you can use `juju destroy-machine --force` to get that machine out of state (it will both purge units that were assigned to it and not try to assign things there in the future, because it isn't there anymore :)

John A Meinel (jameinel) wrote :

FWIW I can reproduce the bug as outlined using the local provider.

Doing the following:
$ juju bootstrap -e local
$ juju add-machine -e local
# Wait for the machine to finish starting up
$ juju status
machines:
  "0":
    agent-state: started
    agent-version: 1.16.4.1
    dns-name: 10.0.3.1
    instance-id: localhost
    series: precise
  "1":
    agent-state: started
    agent-version: 1.16.4.1
    instance-id: jameinel-local-machine-1
    instance-state: missing
    series: precise
$ sudo lxc-ls --fancy # to get the IP address
$ ssh ubuntu@10.0.3.37 # depends on IP
lxc$ sudo ifdown eth0
# exit the SSH session which is now dead
$ juju status
machines:
  "0":
    agent-state: started
    agent-version: 1.16.4.1
    dns-name: 10.0.3.1
    instance-id: localhost
    series: precise
  "1":
    agent-state: started
    agent-version: 1.16.4.1
    instance-id: jameinel-local-machine-1
    instance-state: missing
    series: precise
# it stays at 'started' for an unknown length of time
# when I do:
$ sudo lxc-stop -n jameinel-local-machine-1
# Then the instance still shows up as started
# Interestingly only when I do:
$ sudo lxc-start --daemon -n jameinel-local-machine-1
# then within a second or two I see:
machines:
  "0":
    agent-state: started
    agent-version: 1.16.4.1
    dns-name: 10.0.3.1
    instance-id: localhost
    series: precise
  "1":
    agent-state: down
    agent-state-info: (started)
    agent-version: 1.16.4.1
    instance-id: jameinel-local-machine-1
    instance-state: missing
    series: precise

# And then once the machine finishes starting back up I can see:
2013-12-11 13:35:33 INFO juju.worker.machiner machiner.go:52 "machine-2" started

# And after almost exactly 1 minute it shows up as back up and running.

Now, if I don't stop eth0 before stopping the LXC, I see the machine show up as stopped immediately (and it comes back up as started just as quickly).

Note that even after upgrading to trunk (r2131), if I do `sudo ifdown eth0`, the agent still shows as 'started' 2 minutes later.

So I'm not sure if we've actually fixed the bug.

machines:
  "0":
    agent-state: started
    agent-version: 1.17.0.1
    dns-name: 10.0.3.1
    instance-id: localhost
    series: precise
  "2":
    agent-state: started
    agent-version: 1.17.0.1
    instance-id: jameinel-local-machine-2
    instance-state: missing
    series: precise
services: {}

John A Meinel (jameinel) wrote :

I did eventually see the status go to Down, but it was something like 5 min later.

John A Meinel (jameinel) wrote :

The part of this bug about not assigning new units to machines which are down was split into bug #1260247.

Curtis Hovey (sinzui) on 2013-12-20
Changed in juju-core:
status: Fix Committed → Fix Released