Automatically terminate machines that do not register with ZK

Bug #900873 reported by Jim Baker on 2011-12-06
Affects: pyjuju
Importance: Low
Assigned to: Unassigned

Bug Description

Machines that fail to come up after being provisioned should be automatically terminated. This seems to be rare, but can potentially happen. See this blog post: http://www.outflux.net/blog/archives/2011/12/05/ec2-instances-in-support-of-a-bsp/

We will need to define some sort of reasonable heuristic for this, given eventuality and the fact that this sort of automation can readily cascade into other issues. For providers like EC2, this could also rapidly increase costs as machines are repeatedly brought up and then terminated.

Tread carefully, in other words.
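
One possible shape for such a heuristic, as a minimal Python sketch. The names `PendingMachine` and `should_terminate` and the 15-minute grace period are assumptions made for illustration; none of them come from juju itself. The idea is only that a machine becomes a termination candidate if it has never registered with ZK and a grace period has elapsed since launch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PendingMachine:
    """Hypothetical record of a provisioned machine (not a juju type)."""
    machine_id: int
    launched_at: datetime
    registered: bool  # True once the machine agent has registered with ZK


def should_terminate(machine, now, grace=timedelta(minutes=15)):
    """Return True if the machine never registered and the grace period passed.

    The 15-minute default is an arbitrary placeholder; a real value would
    depend on the provider and on how long instances normally take to boot.
    """
    if machine.registered:
        # A machine that has checked in is never terminated by this rule.
        return False
    return now - machine.launched_at > grace
```

A rule like this only identifies candidates; whether to act on them automatically is the part that calls for caution.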

Kapil Thangavelu (hazmat) wrote :

It's not clear that this bug report captures what is actually described in the linked post, which is more of an error in the provisioning agent.

Clint Byrum (clint-fewbar) wrote :

There is definitely a scenario where we start an instance and it fails to start the agent for uncontrollable reasons. It could get expensive if it's a terminal problem (such as the AMI the user has chosen not supporting cloud-init properly, or some such thing). Start it, pay $0.08, it fails; start a new one, pay $0.08, it fails... We MUST guard against that scenario.
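
To make the start/fail/restart loop above concrete, here is a small Python sketch of a launch budget that caps how many times a replacement is attempted. `LaunchBudget` and its methods are hypothetical names, the cap of three launches is an arbitrary assumption, and the $0.08 figure is simply the per-start cost quoted above; nothing here reflects how juju's provisioner actually works.

```python
class LaunchBudget:
    """Hypothetical guard that stops relaunching after a fixed number of tries."""

    def __init__(self, max_launches=3, cost_per_launch=0.08):
        self.max_launches = max_launches        # assumed cap, not a juju setting
        self.cost_per_launch = cost_per_launch  # per-start cost from the comment
        self.launches = 0

    def may_launch(self):
        """True while another launch attempt is still within the budget."""
        return self.launches < self.max_launches

    def record_launch(self):
        """Count one launch attempt, refusing to exceed the cap."""
        if not self.may_launch():
            raise RuntimeError("launch budget exhausted; needs user attention")
        self.launches += 1

    def spent(self):
        """Total cost of all launch attempts so far."""
        return self.launches * self.cost_per_launch
```

Once the budget is exhausted, the sketch raises instead of launching again, which leaves the decision to the user rather than to further automation.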

I think the current status that is shown in juju status is enough to allow users to decide for themselves when to give up on an instance. What would be useful would be the ability to use 'juju terminate-machine' while it is still "pending", which right now will fail because it is "not available". By enabling that, you allow the user to say "I've waited long enough, kill it" and the provisioner will then select a new destination for the unit. Of course, the user could just work around the issue by adding a new unit, and removing the failing one.

I think ultimately, we can only automate what happens after the agent has checked in. Until then, we give up control to whatever the provider does, and so we cannot do *anything* intelligently in juju except tell the user that we're in that state.

Leaving New/Undecided for now, but I'd suggest that this be changed to propose a documentation item for troubleshooting: "What do I do if my instance does not come up?"

Maybe add a --force to juju terminate-machine that will override whatever
error/warning is currently associated with the "machine not available"
scenario.

My two cents ...

Thanks,

Juan


Changed in juju:
importance: Undecided → Wishlist
Changed in juju:
status: New → Triaged
Curtis Hovey (sinzui) on 2013-10-12
Changed in juju:
importance: Wishlist → Low
tags: added: improvement