uniter-hook-execution error prevents "resolve" unit.
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Canonical Juju | Fix Released | High | Tim Penhey | |
juju-core | Fix Released | High | Tim Penhey | |
1.25 | Fix Released | High | Tim Penhey | |
Bug Description
[Environment]
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty
1.24.2.
[Description]
It is not possible to resolve a failed unit after a config-changed error (juju resolved -r unit): the unit remains in the error state and the failed hook is never executed again.
The unit claims to be in the resolved state, but the failed hook cannot be re-triggered.
ERROR cannot set resolved mode for unit "keystone/0": already resolved
However, running juju status keystone/0 still reported the hook as being in the error state.
After manually removing the lock /var/lib/
This is not reproducible in all situations, so I think there is a race condition between the code responsible for marking a hook as failed and the routine responsible for removing the lock. In some situations the lock is left held, which prevents the failed hook from being executed again.
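To illustrate the symptom reported here, the following is a minimal sketch of a generic directory-based lock. It is not Juju's implementation: the DirLock class, the run_hook helper, and the "hook-execution" directory name are all hypothetical, and the lock lives in a temporary directory. It only shows how a lock directory that is never removed keeps a retried hook from running until the directory is deleted by hand.

```python
import os
import tempfile


class DirLock:
    """Hypothetical directory-based lock: the lock is held while the directory exists."""

    def __init__(self, path):
        self.path = path

    def try_acquire(self):
        try:
            os.mkdir(self.path)  # fails if the directory already exists
            return True
        except FileExistsError:
            return False

    def release(self):
        os.rmdir(self.path)


def run_hook(lock, hook):
    """Run the hook only if the lock can be taken; return True if the hook ran."""
    if not lock.try_acquire():
        return False
    try:
        hook()
    finally:
        lock.release()  # always release, even if the hook raises
    return True


def demo():
    base = tempfile.mkdtemp()
    lock = DirLock(os.path.join(base, "hook-execution"))  # hypothetical name
    ran = []

    # Simulate a holder that took the lock and never released it.
    lock.try_acquire()

    # The retried hook cannot run while the stale lock directory exists.
    blocked = run_hook(lock, lambda: ran.append("config-changed"))

    # Manual cleanup, analogous to removing the lock directory by hand.
    os.rmdir(lock.path)

    # With the stale lock gone, the hook runs.
    retried = run_hook(lock, lambda: ran.append("config-changed"))
    return blocked, retried, ran
```

In this sketch run_hook releases the lock in a finally clause, so a failing hook cannot leave the directory behind; a stale lock can then only come from a holder that exits without reaching its release step.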
tags: | added: sts |
tags: | added: hooks race-condition |
Changed in juju-core: | |
status: | New → Triaged |
importance: | Undecided → Medium |
tags: | added: sts-needs-review |
Changed in juju-core: | |
importance: | Medium → High |
tags: | added: canonical-bootstack |
summary: |
- Race on uniter-hook-execution, prevents to resolve unit.
+ uniter-hook-execution error prevents "resolve" unit. |
Changed in juju-core: | |
status: | Fix Committed → Fix Released |
tags: | added: canonical-is |
affects: | juju-core → juju |
Changed in juju: | |
milestone: | 2.0-beta10 → none |
milestone: | none → 2.0-beta10 |
Changed in juju-core: | |
assignee: | nobody → Tim Penhey (thumper) |
importance: | Undecided → High |
status: | New → Fix Released |
I'm pretty sure that (1) it was correct to reject the second resolved, because there's no point telling the unit again when it hasn't handled the first resolve; but (2) the unit was not handling the first resolved *because* something else was holding the machine lock.
I don't suppose you know what was in the lock dir that you removed? That would help us figure out why it was locked in the first place -- might be a good reason, might not, not enough info to tell yet.