remove-unit and destroy-service don't work when agent state is error
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
juju-core | New | Undecided | Unassigned |
Bug Description
In this scenario, the unit's agent-state is error:
$ juju status
machines:
  "0":
    agent-state: started
    agent-version: 1.10.0.1
    dns-name: 10.55.63.219
    instance-id: c05e2388-
    series: precise
  "2":
    agent-state: started
    agent-version: 1.10.0.1
    dns-name: 10.55.63.220
    instance-id: 44095fed-
    series: precise
services:
  cinder:
    charm: local:precise/
    exposed: false
    relations:
      cluster:
      - cinder
    units:
      cinder/0:
        machine: "2"
I then try, in turn, to destroy the service and to remove the unit, but neither works. Nor can I terminate the machine:
$ juju destroy-service cinder
$ juju terminate-machine 2
error: no machines were destroyed: machine 2 has unit "cinder/0" assigned
$ juju remove-unit cinder/0
$ juju status
machines:
  "0":
    agent-state: started
    agent-version: 1.10.0.1
    dns-name: 10.55.63.219
    instance-id: c05e2388-
    series: precise
  "2":
    agent-state: started
    agent-version: 1.10.0.1
    dns-name: 10.55.63.220
    instance-id: 44095fed-
    series: precise
services:
  cinder:
    charm: local:precise/
    exposed: false
    life: dying
    units:
      cinder/0:
        life: dying
        machine: "2"
As far as I can see, the cinder service and the cinder/0 unit are stuck.
The current behaviour is in fact as intended: error states exist to prevent a unit from doing anything until a human has solved a problem that juju considers intractable. You signal that with `juju resolved`, which tells juju that you have yourself completed the task that juju failed to do. In this case that would of course be a bare-faced lie, and it wouldn't help the next hook's chances of success much, but by repeatedly resolving errors without looking you can assist a dying unit to its eventual suicide.
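As a rough illustration of that workaround, the loop below keeps calling `juju resolved` on a unit until it no longer shows up in `juju status`. It is only a sketch: the function name `resolve_until_gone`, the `JUJU` and `SLEEP_SECS` variables, and the retry cap are assumptions made for this example, and the status match is a plain text grep rather than anything juju guarantees.

```shell
# Sketch only: repeatedly mark a dying unit's errors as resolved.
# JUJU, SLEEP_SECS and resolve_until_gone are names invented for this example.
JUJU="${JUJU:-juju}"

resolve_until_gone() {
    unit="$1"
    max_tries="${2:-20}"
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        # Stop once the unit no longer appears in the status output.
        if ! "$JUJU" status | grep -q "^ *$unit:"; then
            echo "$unit is gone after $i resolve attempts"
            return 0
        fi
        # resolved may fail if the unit is not currently in an error state;
        # that is ignored here and the loop simply tries again later.
        "$JUJU" resolved "$unit" >/dev/null 2>&1 || true
        sleep "${SLEEP_SECS:-5}"
        i=$((i + 1))
    done
    echo "$unit still present after $max_tries attempts" >&2
    return 1
}
```

For the unit in this report the call would be `resolve_until_gone cinder/0`. Note that this resolves every error blindly, which is exactly the "bare-faced lie" described above, so it only makes sense for a unit that is already dying.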
In this specific case -- a failure on install -- I think it would be reasonable for the unit to be removed directly when it is destroyed, and it is probably reasonable to do so at any point up to the successful completion of the start hook. Once that has run, though, we really ought to run a stop hook before shutting the unit down, and once the unit has joined relations the question is harder still. So we err on the side of safety and ask for intervention whenever we're unsure. I have two proposals to address the near and far terms:
1) destroy-unit on a unit that has not run its "start" hook should remove the unit directly regardless of error state.
2) destroy-unit --force on a unit that has run its "start" hook should cause it to run all hooks necessary for it to disengage, but to ignore error states and continue blindly on through "stop" to death.
Would either, or both, address your needs?
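To make the two proposals concrete, here is a small model of the decision they describe. It is purely hypothetical: `destroy_plan` and the three outcome labels are invented for this sketch and do not correspond to anything in juju-core. The third branch stands for the current behaviour, where an error blocks progress until someone runs `juju resolved`.

```shell
# Hypothetical model of the proposals; not juju's actual implementation.
# Arguments: started (yes|no) = has the "start" hook run, force (yes|no).
destroy_plan() {
    started="$1"
    force="$2"
    if [ "$started" = "no" ]; then
        # Proposal 1: never started, so remove directly regardless of errors.
        echo "remove-directly"
    elif [ "$force" = "yes" ]; then
        # Proposal 2: run the departure hooks but push on through any errors.
        echo "run-hooks-ignore-errors"
    else
        # Current behaviour: run the hooks and wait on error for a human.
        echo "run-hooks-block-on-error"
    fi
}
```

Under this model the cinder/0 unit from the report, which failed on install and so never ran its start hook, would fall into the first branch: `destroy_plan no no` yields `remove-directly`.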