pyjuju

Automatically terminate machines that do not register with ZK

Bug #900873 reported by Jim Baker on 2011-12-06

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
pyjuju	Triaged	Low	Unassigned

Bug Description

Machines that fail to come up after being provisioned should be automatically terminated. This seems to be rare, but can potentially happen. See this blog post: http://www.outflux.net/blog/archives/2011/12/05/ec2-instances-in-support-of-a-bsp/

We will need to define a reasonable heuristic for this, given eventuality and the fact that this sort of automation can readily cascade into other issues. For providers like EC2, it could also rapidly increase costs as machines are repeatedly brought up and then terminated.

Tread carefully, in other words.
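
A minimal sketch of what such a heuristic might look like, assuming a grace period before giving up and a cap on how many machines may be terminated automatically. The function name, parameters, and default values are all hypothetical and are not existing juju settings:

```python
from datetime import datetime, timedelta


def should_auto_terminate(launched_at, now, registered,
                          grace_period=timedelta(minutes=30),
                          auto_terminations=0,
                          max_auto_terminations=2):
    """Decide whether an unregistered machine may be terminated automatically.

    Hypothetical helper: the names and default values are assumptions
    for illustration, not part of juju.
    """
    if registered:
        # The agent checked in with ZK, so there is nothing to clean up.
        return False
    if auto_terminations >= max_auto_terminations:
        # Stop automating and leave the decision to the user, so that a
        # persistent failure cannot cascade into endless relaunches.
        return False
    # Only give up once the machine has had the full grace period.
    return now - launched_at >= grace_period
```

The cap is the "tread carefully" part: once it is reached, the machine is left alone and the problem is surfaced to the user instead.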

Kapil Thangavelu (hazmat) wrote on 2011-12-07:

It's not clear that this bug report captures what is actually described in the linked post, which is more of an error in the provisioning agent.

Clint Byrum (clint-fewbar) wrote on 2011-12-07:

There is definitely a scenario where we start an instance and it fails to start the agent for reasons outside our control. It could get expensive if it's a terminal problem (such as the AMI the user has chosen not supporting cloud-init properly, or some such thing). Start it, pay $0.08, it fails; start a new one, pay $0.08, it fails... We MUST guard against that scenario.

I think the current status that is shown in juju status is enough to allow users to decide for themselves when to give up on an instance. What would be useful would be the ability to use 'juju terminate-machine' while it is still "pending", which right now will fail because it is "not available". By enabling that, you allow the user to say "I've waited long enough, kill it" and the provisioner will then select a new destination for the unit. Of course, the user could just work around the issue by adding a new unit, and removing the failing one.

I think ultimately, we can only automate what happens after the agent has checked in. Until then, we give up control to whatever the provider does, and so we cannot do *anything* intelligently in juju except tell the user that we're in that state.

Leaving New/Undecided for now, but I'd suggest that this be changed into a documentation item for troubleshooting: "What do I do if my instance does not come up?"
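
To put rough numbers on the relaunch loop described in this comment, here is a small sketch. It uses the $0.08 per start figure quoted above; the spending cap is a hypothetical knob, and amounts are kept in whole cents to avoid floating-point rounding:

```python
PRICE_PER_START_CENTS = 8  # the $0.08 figure quoted in the comment above


def wasted_cents(failed_starts, price_cents=PRICE_PER_START_CENTS):
    """Total spent on instances that never started their agent."""
    return failed_starts * price_cents


def starts_within_budget(budget_cents, price_cents=PRICE_PER_START_CENTS):
    """How many starts fit under a spending cap (hypothetical budget knob)."""
    return budget_cents // price_cents
```

For example, 25 failed starts would cost 200 cents ($2.00), and a 100 cent cap would allow at most 12 starts before automation has to stop.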

Juan L. Negron (negronjl) wrote on 2011-12-08: Re: [Bug 900873] Re: Automatically terminate machines that do not register with ZK

Maybe add a --force to juju terminate-machine that will override whatever
error/warning is currently associated with the "machine not available"
scenario.

My two cents ...

Thanks,

Juan
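
A rough sketch of how a --force option like the one suggested above could be parsed and applied. This is an illustration built on Python's standard argparse module, not the actual juju command code; build_parser, may_terminate, and the "pending" state string are hypothetical names:

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the real command's argument parsing.
    parser = argparse.ArgumentParser(prog="juju terminate-machine")
    parser.add_argument("machine_ids", nargs="+", type=int,
                        help="IDs of the machines to terminate")
    parser.add_argument("--force", action="store_true",
                        help="terminate even if the machine is not available")
    return parser


def may_terminate(machine_state, force):
    # Pending machines are refused unless the user explicitly forces it.
    return machine_state != "pending" or force
```

With this shape, `build_parser().parse_args(["--force", "3"])` yields `force=True` and `machine_ids=[3]`, and the decision stays with the user rather than the provisioner.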

Kapil Thangavelu (hazmat) on 2011-12-09

Changed in juju:
importance:	Undecided → Wishlist

Clint Byrum (clint-fewbar) on 2012-10-19

Changed in juju:
status:	New → Triaged

Curtis Hovey (sinzui) on 2013-10-12

Changed in juju:
importance:	Wishlist → Low
tags:	added: improvement
