configstore should break fslock if time > few seconds
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | | High | Tim Penhey | |
| 1.24 | | Critical | Tim Penhey | |
| 1.25 | | High | Tim Penhey | |
Bug Description
The filesystem lock that is used to serialize access to the config store should only ever be kept for a very short time.
However, evidence shows that the client sometimes aborts and leaves the lock behind.
If the lock is held and has been held for more than a few seconds, it should be forcibly broken, and the break should be logged.
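For illustration only, here is a minimal sketch of the age-based breaking proposed above. It is not juju's fslock code (juju is written in Go). It assumes the lock is a directory whose modification time marks when it was taken. The name `break_if_stale` and the 5-second default are hypothetical.

```python
import logging
import os
import shutil
import time

logger = logging.getLogger("fslock-sketch")


def break_if_stale(lock_dir, max_age_seconds=5.0, now=None):
    """Remove lock_dir if it is older than max_age_seconds.

    Returns True if the lock was broken, False if it was absent or still fresh.
    """
    try:
        mtime = os.stat(lock_dir).st_mtime
    except FileNotFoundError:
        return False  # nobody holds the lock

    age = (time.time() if now is None else now) - mtime
    if age <= max_age_seconds:
        return False  # lock is recent; leave it to its holder

    # Log before breaking so a forced break is always traceable.
    logger.warning("breaking stale lock %s (held for %.1fs)", lock_dir, age)
    shutil.rmtree(lock_dir, ignore_errors=True)
    return True
```

A caller would run this before trying to take the lock. The threshold has to stay well above the longest legitimate hold time, otherwise a slow but live writer could lose its lock.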
| Ryan Beisner (1chb1n) wrote : | #2 |
ps fyi
juju:
Installed: 1.24.6-
| Ryan Beisner (1chb1n) wrote : | #3 |
With 1.24.6, we are observing increased false test failures in OpenStack charm testing due to this lock issue.
Unfortunately, we are not in a position to easily do juju lock file checking and cleanup with the way we run tests. Each test run is 5 to 7 or more [bootstrap/
FWIW - Only one user, from one machine, regarding one deployment, is ever interacting with any given juju environment at any given time, and we still hit this issue.
WARNING configstore lock held, lock dir: /var/lib/
WARNING lock holder message: pid: 12576, operation: writing
ERROR Unable to connect to environment "osci-sv02".
Please check your credentials or use 'juju bootstrap' to create a new environment.
Error details:
cannot read info: lock timeout exceeded
| Ryan Beisner (1chb1n) wrote : | #4 |
Can this please be addressed in 1.24.x? We started having this issue after moving from 1.24.5 to 1.24.6, and it is causing many false test failures, consuming additional lab time to attempt to get clean runs.
Our WARNING configstore lock held, lock dir: /var/lib/
WARNING lock holder message: pid: 32447, operation: writing
ERROR Unable to connect to environment "osci-sv13".
Please check your credentials or use 'juju bootstrap' to create a new environment.
Error details:
cannot read info: lock timeout exceeded
| tags: | removed: tech-debt |
| Cheryl Jennings (cherylj) wrote : | #5 |
It appears that James is currently working on this, so assigning to him.
| James Tunnicliffe (dooferlad) wrote : | #6 |
The logic I have implemented for Linux is based on breaking locks that have been left behind by processes that no longer exist. I haven't put any logic in for locks held for longer than a timeout. I am currently looking at how to replicate my changes on Windows. I suggest we don't implement a timeout yet so we can see if one is needed after these changes.
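A rough sketch of the "holder no longer exists" check, for readers following along. This is not the actual patch. It assumes a POSIX system and a holder message in the `pid: N, operation: ...` form seen in the warnings earlier in this report. The function names are hypothetical.

```python
import os
import re


def holder_pid(message):
    """Extract the pid from a holder message such as 'pid: 12576, operation: writing'."""
    match = re.search(r"pid:\s*(\d+)", message)
    return int(match.group(1)) if match else None


def process_exists(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def lock_is_orphaned(message):
    """True when the message names a pid and that process is gone."""
    pid = holder_pid(message)
    return pid is not None and not process_exists(pid)
```

One caveat with this approach: pids are reused, so an unrelated process that inherits the old pid would make an orphaned lock look live. That is an argument for keeping a timeout as a fallback.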
| Tim Penhey (thumper) wrote : | #7 |
There was no error checking on the unlock. New solution is to watch and retry unlocking, along with a longer lock time-out, automatic breaking and better logging.
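To make the "check the unlock error and retry" idea concrete, here is a small sketch. It is not the committed fix. It assumes the lock is an empty directory released with `os.rmdir`. The name `unlock_with_retry`, the attempt count, the delay, and the injectable `remove` parameter are hypothetical.

```python
import logging
import os
import time

logger = logging.getLogger("fslock-sketch")


def unlock_with_retry(lock_dir, attempts=5, delay=0.1, remove=os.rmdir):
    """Release the lock, retrying on failure instead of ignoring the error.

    Returns True once the lock directory is gone, False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            remove(lock_dir)
            return True
        except FileNotFoundError:
            return True  # already released
        except OSError as err:
            logger.warning("unlock attempt %d/%d for %s failed: %s",
                           attempt, attempts, lock_dir, err)
            if attempt < attempts:
                time.sleep(delay)  # back off before the next try
    logger.error("could not release lock %s", lock_dir)
    return False
```

The `remove` argument exists only so the retry path can be exercised in a test without a real filesystem failure.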
| Changed in juju-core: | |
| status: | Triaged → In Progress |
| assignee: | nobody → Tim Penhey (thumper) |
| Changed in juju-core: | |
| status: | In Progress → Fix Committed |
| Changed in juju-core: | |
| status: | Fix Committed → Fix Released |


While trying to reproduce a separate, unrelated bug, I hit this lock issue.
After encountering the lock error, I cannot juju stat or juju destroy-environment (with or without --force).
Reproducer and --debug output @: paste.ubuntu.com/12621534/