uniter-hook-execution error prevents "resolve" unit.

Bug #1486712 reported by Jorge Niedbalski
This bug affects 8 people
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Tim Penhey
juju-core       Fix Released  High        Tim Penhey
1.25            Fix Released  High        Tim Penhey

Bug Description

[Environment]

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty

1.24.2.1-trusty-amd64

[Description]

It's not possible to resolve a failed unit after a config-changed error (juju resolved -r unit); the unit remains in error state and the failed hook is never executed again.

The unit claims to be in resolved state, however it's not possible to re-trigger the failed hook.

ERROR cannot set resolved mode for unit "keystone/0": already resolved 

However running juju status keystone/0 reported that the hook was in error state.

After manually removing the lock directory /var/lib/juju/locks/uniter-hook-execution/, the command juju resolved -r unit succeeded.

This is not reproducible in all situations, so I think there is a race condition between the code responsible for marking a hook as failed and the routine responsible for releasing/removing the lock. In some situations the lock is left held, which prevents the failed hook from being executed again.

tags: added: sts
Curtis Hovey (sinzui)
tags: added: hooks race-condition
Changed in juju-core:
status: New → Triaged
importance: Undecided → Medium
William Reade (fwereade) wrote :

I'm pretty sure that (1) it was correct to reject the second resolved, because there's no point telling the unit again when it hasn't handled the first resolve; but (2) the unit was not handling the first resolved *because* something else was holding the machine lock.

 I don't suppose you know what was in the lock dir that you removed? That would help us figure out why it was locked in the first place -- might be a good reason, might not, not enough info to tell yet.

Junien F (axino) wrote :

Hi,

I think I just hit the same bug on a server. There were several services with subordinates deployed on LXCs on this server, which went down abruptly after a power outage.

In /var/lib/juju/locks/uniter-hook-execution/ on each impacted unit there were 2 empty files:
/var/lib/juju/locks/uniter-hook-execution# ls -l
total 0
-rwxr-xr-x 1 root root 0 Nov 13 10:39 held
-rwxr-xr-x 1 root root 0 Nov 13 10:39 message

sudo lsof -n|grep W|grep juju didn't return anything.

Renaming the directory made the units in error resolve themselves, without needing to run resolved --retry.
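
For anyone wanting to script that check and rename, here is a minimal sketch in Python. Only the lock path comes from this report; describe_lock and move_lock_aside are hypothetical helper names, not part of Juju. It renames the directory rather than deleting it, so the contents stay available for debugging. It does not verify that no live process still holds the lock, so it should only be run after confirming that by hand.

```python
import os
from pathlib import Path

# Path taken from this report; the helpers below are hypothetical, not Juju code.
LOCK_DIR = Path("/var/lib/juju/locks/uniter-hook-execution")


def describe_lock(lock_dir: Path) -> dict:
    """Return {file name: size in bytes} for the lock directory, or {} if it is absent."""
    if not lock_dir.is_dir():
        return {}
    return {entry.name: entry.stat().st_size for entry in sorted(lock_dir.iterdir())}


def move_lock_aside(lock_dir: Path, suffix: str = ".stale") -> Path:
    """Rename the lock directory instead of deleting it, so it can still be inspected."""
    target = lock_dir.with_name(lock_dir.name + suffix)
    os.rename(lock_dir, target)  # raises OSError if the rename is not possible
    return target


if __name__ == "__main__":
    # Read-only by default: just show what is in the lock directory.
    print(describe_lock(LOCK_DIR))
```

Calling move_lock_aside(LOCK_DIR) as root would then perform the rename described above.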

juju version 1.24.5, on trusty.

Cheryl Jennings (cherylj) wrote :

axino, do you have logs you could upload? Specifically, logs for:
- state server machine logs
- machine log for machine hosting the units in question
- unit logs

Manuel Seelaus (seelaman) wrote :

hit the same issue, renaming the directory immediately unblocked the config-changed hooks (no resolved -r needed as mentioned before), attaching logs

tags: added: sts-needs-review
JuanJo Ciarlante (jjo) wrote :

FYI still seeing this at 1.25.5 - bootstack deploy, mostly (but not only)
on smooshed units - it does require manual intervention to let units
make progress (else they effectively stall forever).

JuanJo Ciarlante (jjo) wrote :

forgot to add: not only blocks "resolved -r", but any further hooks
execution (obviously including eg. juju run ).

Changed in juju-core:
importance: Medium → High
tags: added: canonical-bootstack
summary: - Race on uniter-hook-execution, prevents to resolve unit.
+ uniter-hook-execution error prevents "resolve" unit.
Anastasia (anastasia-macmood) wrote :

Without an easily reproducible scenario, I dived into the code and observed:

1. We optimistically Unlock held locks, i.e. we do not check whether the unlock succeeds;
2. Unlocking may fail if we are trying to unlock a lock that (a) is not held by us or (b) we could not move to a temporary file/directory.

I think that in this bug we are observing the consequences of 2(b) and, because of 1, we will not see anything useful in the logs: we are not logging failures to unlock.
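
To illustrate the two failure modes and what a checked unlock could look like, here is a simplified model in Python. It is not Juju's fslock code: the DirLock class, the nonce stored in the "held" file, and the ".released" temporary name are all assumptions made for the sketch. The point is only that unlock reports and logs both cases instead of failing silently.

```python
import logging
import os
import uuid
from pathlib import Path

logger = logging.getLogger("lock-sketch")


class DirLock:
    """Simplified directory lock; a model of the failure modes, NOT Juju's fslock."""

    def __init__(self, path):
        self.path = Path(path)
        self.nonce = uuid.uuid4().hex  # hypothetical holder identity

    def lock(self) -> bool:
        try:
            os.mkdir(self.path)  # atomic: fails if the lock directory already exists
        except FileExistsError:
            return False
        (self.path / "held").write_text(self.nonce)
        return True

    def unlock(self) -> None:
        """Release the lock, logging and raising instead of failing silently."""
        try:
            owner = (self.path / "held").read_text()
        except FileNotFoundError:
            owner = None
        if owner != self.nonce:
            # Case (a): the lock is not held by us.
            logger.error("unlock of %s refused: not held by us", self.path)
            raise PermissionError(f"{self.path} is not held by this holder")
        tmp = self.path.with_name(f"{self.path.name}.{self.nonce}.released")
        try:
            os.rename(self.path, tmp)  # case (b): the move to a temporary name can fail
        except OSError as err:
            logger.error("unlock of %s failed: %s", self.path, err)
            raise
        for entry in tmp.iterdir():
            entry.unlink()
        tmp.rmdir()
```

With this shape, a caller that ignores the result of unlock still leaves a trace in the logs, which is the information missing from the reports above.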

So, in conclusion, the current workaround (removing the lock's directory as above) is necessary in the rare cases where the lock is not released after running/resolving hooks. However, this should be used with care and under advisement to avoid removing valid locks.

We need to redesign the lock mechanism to be more robust, to address this and other scenarios.

An easily reproducible scenario could help us ensure that the new design/implementation will address this case. Please submit one if you have it readily available \o/

Cheryl Jennings (cherylj) wrote :

The fslock implementation has been redesigned. 1.25 PR: https://github.com/juju/juju/pull/5663

Cheryl Jennings (cherylj) wrote :
Changed in juju-core:
status: Triaged → Fix Committed
assignee: nobody → Tim Penhey (thumper)
milestone: none → 2.0-beta10
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Paul Gear (paulgear)
tags: added: canonical-is
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta10 → none
milestone: none → 2.0-beta10
Changed in juju-core:
assignee: nobody → Tim Penhey (thumper)
importance: Undecided → High
status: New → Fix Released