unit destruction depends on unit agents

Bug #1190715 reported by Andreas Hasenack on 2013-06-13
This bug affects 6 people
Affects: juju-core
Importance: Critical
Assigned to: William Reade

Bug Description

The command sequence:

$ juju deploy foo
$ juju destroy-service foo

...can have surprising consequences, as follows:

* foo/0 will persist, apparently "alive", until it's deployed; only at that point will it
  be destroyed (because the unit agent checks the service).
* if a new machine were created to hold foo/0, and provisioning failed for that machine,
  the unit will never be destroyed (except manually, via `destroy-unit`).
* if the unit agent was previously running, but the machine agent went away unexpectedly,
  the unit can never be destroyed at all (lp:1089289).

In all these cases, the impact is that the foo service gets "stuck" for longer than it should, waiting on the unit agent. By slightly tweaking service destruction, we can automatically destroy all units; this will trigger existing short-circuit paths and resolve the first two consequences, and move us a step towards a simple fix for the third.
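
The proposed tweak can be illustrated with a toy model. This is only a sketch and not juju-core's actual code: the `Life` states, the `agent_started` flag, and the `destroy_old`, `destroy_new`, and `removable` names are all hypothetical, chosen to mirror the behaviour described above. It assumes that a unit whose agent has never run can be short-circuited straight to dead, while a unit with a started agent still has to wait for that agent.

```python
from dataclasses import dataclass, field
from enum import Enum


class Life(Enum):
    ALIVE = "alive"
    DYING = "dying"
    DEAD = "dead"


@dataclass
class Unit:
    name: str
    agent_started: bool = False  # hypothetical flag: has the unit agent ever run?
    life: Life = Life.ALIVE

    def destroy(self):
        """Mark the unit for destruction, short-circuiting where possible."""
        if self.life is not Life.ALIVE:
            return  # already dying or dead; nothing to do
        if not self.agent_started:
            # No agent ever ran, so there is nothing to tear down:
            # skip the wait and go straight to DEAD.
            self.life = Life.DEAD
        else:
            # A started agent is responsible for advancing DYING to DEAD.
            self.life = Life.DYING


@dataclass
class Service:
    name: str
    units: list = field(default_factory=list)
    life: Life = Life.ALIVE

    def destroy_old(self):
        """Behaviour described in this bug: only the service is flagged,
        and each unit agent is expected to notice and destroy its unit."""
        self.life = Life.DYING

    def destroy_new(self):
        """Proposed tweak: flag the service and destroy every unit too."""
        self.life = Life.DYING
        for unit in self.units:
            unit.destroy()

    def removable(self):
        """Assumption: a dying service can go away once all units are dead."""
        return self.life is Life.DYING and all(
            u.life is Life.DEAD for u in self.units
        )
```

In this model, a never-deployed `foo/0` stays alive after `destroy_old()` but is dead immediately after `destroy_new()`. A unit whose agent had started is left dying, which corresponds to the third case above and still needs a separate fix.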

description: updated
William Reade (fwereade) wrote :

This is partly a communication issue -- it's intending to say something like "I didn't do anything, because the flag I'd be setting is already set"; and the problem is that the unit agent, because it's not running, can't respond to that flag and advance the lifecycle.

So, that's definitely a problem, and we need --force flags on destroy-machine and destroy-unit (lp:1089291 and lp:1089289), that will cause some other part of the system to take over the appropriate responsibilities and tidy up the entities correctly.

Longer-term, this issue emphasizes the value of a storage management system that could let us migrate unit and machine state onto fresh hardware; but that's not on the cards in the immediate future.

It is correct that, once the instance is unrecoverable (what happened to it, btw?), the only way to remove that machine and unit (and the unit's service, and any relations the unit had joined...) is to destroy the whole environment. But in practice the *environment* itself should not be in trouble -- unless you lose the bootstrap instance, ofc -- and you should be able to continue to interact with other entities without difficulty. I presume the biggest problem is being unable to reuse service names, but I may be misunderstanding your use case... or unaware of additional problems triggered by this situation?

Changed in juju-core:
status: New → Incomplete
Andreas Hasenack (ahasenack) wrote :

"To reproduce, destroy the instance while the unit is in 'pending' state." It's what I did, because I realised it was launching an instance of the wrong type for my workload.

The biggest problem was not being able to reuse service names, right.

David Britton (davidpbritton) wrote :

Could I get someone else to look at this? It seems to happen consistently when you just do:

  juju deploy cs:ubuntu
  juju destroy-service cs:ubuntu

Regardless of being able to --force your way out of this situation, getting into that state with a common workflow seems wrong.

Changed in juju-core:
status: Incomplete → New
status: New → Confirmed
William Reade (fwereade) wrote :

@ahasenack, juju expects that you'll use juju to manipulate the environment. Deliberately terminating an instance out-of-band is not supported behaviour at the moment; so I think the precise expression of the original bug is "invalid".

@davidpbritton, just to be clear (with no implication that it's not-a-bug): I think that use case *will* recover; it will just take a while to do so. I absolutely agree that it could and should be done faster when the unit agent's not running. If that's not the case, please post a status demonstrating the stuckness of deploy/destroy-service (or just point me to the equivalent existing bug ofc).

Either way, I think this bug can be actionably characterized as "units of destroyed services are not destroyed if their agents are not running"; and fixing that will clearly improve the experience in both the described use cases. Sensible?

William Reade (fwereade) wrote :

(note that, in this bug, I'm explicitly not including the --force stuff that'd be needed to deal with a started unit agent whose machine was terminated -- this is just about redistributing responsibility for handling service death to be nicer in common use cases.)

William Reade (fwereade) wrote :

...oh, hell, I had mixed state with another bug in my mind. To restate:

1) Tools for dealing with unexpectedly missing instances are on their way (--force), but at the moment we just trust that the user won't sabotage their own system. So please don't terminate instances out of band.

2) If a unit's never been run, service destroy should definitely short-circuit, so we don't need to wait for the unit agents to kill themselves.

3) In the meantime, manually destroying the units will induce the shortcut anyway, so there's a workaround for the cases that don't involve --force.

4) (inferred) What you'd *really* like is to be able to `juju destroy-machine 1` and have all its units automatically destroyed. If that's correct, I think it's a distinct bug...

I'm renaming the bug in light of (2).

summary: - Unit in error, yet juju resolved claims it's fixed
+ unit destruction depends on unit agents
William Reade (fwereade) on 2013-06-19
description: updated
Changed in juju-core:
status: Confirmed → Triaged
importance: Undecided → Critical
assignee: nobody → William Reade (fwereade)

William, let me introduce you to canonistack :) It has its own chaos monkey
that terminates instances when you least expect it :)


description: updated
William Reade (fwereade) on 2013-06-21
Changed in juju-core:
status: Triaged → In Progress
William Reade (fwereade) on 2013-07-10
Changed in juju-core:
milestone: none → dev-docs
William Reade (fwereade) on 2013-07-11
Changed in juju-core:
milestone: dev-docs → 1.11.3
Changed in juju-core:
milestone: 1.11.3 → 1.11.4
Changed in juju-core:
milestone: 1.11.4 → 1.11.5
Jorge Castro (jorge) on 2013-08-06
Changed in juju-core:
status: In Progress → Fix Committed
John A Meinel (jameinel) on 2013-09-03
Changed in juju-core:
status: Fix Committed → Fix Released
tags: added: landscape