configstore should break fslock if time > few seconds

Bug #1500613 reported by Tim Penhey
This bug affects 3 people
Affects    Status        Importance  Assigned to  Milestone
juju-core  Fix Released  High        Tim Penhey
1.24       Fix Released  Critical    Tim Penhey
1.25       Fix Released  High        Tim Penhey

Bug Description

The filesystem lock that is used to serialize access to the config store should only ever be kept for a very short time.

However, evidence shows that the client sometimes aborts and leaves the lock behind.

If the lock is held and has been held for more than a few seconds, it should be forcibly broken, and the break should be logged.
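The age-based breaking requested above could look roughly like the following. This is only an illustrative sketch in Python, not juju's Go implementation: the function name `break_if_stale`, the five-second default, and the use of the lock directory's modification time as the "time of the lock" are all assumptions made for the example.

```python
import logging
import os
import shutil
import time

logger = logging.getLogger("configstore")


def break_if_stale(lock_dir, max_age_seconds=5.0, now=None):
    """Forcibly remove lock_dir if it is older than max_age_seconds.

    Hypothetical helper: returns True if the lock was broken, False if
    there was no lock or the lock is still considered fresh.
    """
    now = time.time() if now is None else now
    try:
        # Assumption: the directory mtime approximates when the lock was taken.
        age = now - os.stat(lock_dir).st_mtime
    except FileNotFoundError:
        return False  # nobody holds the lock
    if age <= max_age_seconds:
        return False  # lock is recent; leave it alone
    # Log the break, as the report asks, before removing the lock.
    logger.warning("breaking stale lock %s held for %.1fs", lock_dir, age)
    shutil.rmtree(lock_dir, ignore_errors=True)
    return True
```

A caller would try this once before giving up with a "lock timeout exceeded" error, then retry the normal acquire.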

Ryan Beisner (1chb1n) wrote :

While trying to reproduce a separate, unrelated bug, I hit this lock issue.

After encountering the lock error, I cannot juju stat or juju destroy-environment (with or without --force).

Reproducer and --debug output @:
http://paste.ubuntu.com/12621534/

tags: added: amulet openstack-provider uosci
Ryan Beisner (1chb1n) wrote :

ps fyi

juju:
  Installed: 1.24.6-0ubuntu1~14.04.1~juju1

Ryan Beisner (1chb1n) wrote :

With 1.24.6, we are observing increased false test failures in OpenStack charm testing due to this lock issue.

Unfortunately, we are not in a position to easily do juju lock file checking and cleanup with the way we run tests. Each test run is 5 to 7 or more [bootstrap/deploy/test/destroy] iterations, back-to-back, run by the Amulet (and juju test) runners.

FWIW - Only one user, from one machine, regarding one deployment, is ever interacting with any given juju environment at any given time, and we still hit this issue.

WARNING configstore lock held, lock dir: /var/lib/jenkins/.juju/environments/env.lock
WARNING lock holder message: pid: 12576, operation: writing
ERROR Unable to connect to environment "osci-sv02".
Please check your credentials or use 'juju bootstrap' to create a new environment.

Error details:
cannot read info: lock timeout exceeded

Ryan Beisner (1chb1n) wrote :

Can this please be addressed in 1.24.x? We started having this issue after moving from 1.24.5 to 1.24.6, and it is causing many false test failures, consuming additional lab time to attempt to get clean runs.

WARNING configstore lock held, lock dir: /var/lib/jenkins/.juju/environments/env.lock
WARNING lock holder message: pid: 32447, operation: writing
ERROR Unable to connect to environment "osci-sv13".
Please check your credentials or use 'juju bootstrap' to create a new environment.

Error details:
cannot read info: lock timeout exceeded

tags: removed: tech-debt
Cheryl Jennings (cherylj) wrote :

It appears that James is currently working on this, so assigning to him.

James Tunnicliffe (dooferlad) wrote :

The logic I have implemented for Linux is based on breaking locks that have been left behind by processes that no longer exist. I haven't put any logic in for locks held for longer than a timeout. I am currently looking at how to replicate my changes on Windows. I suggest we don't implement a timeout yet so we can see if one is needed after these changes.
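As a rough illustration of the "holder process no longer exists" check described above, here is a POSIX-only Python sketch. It is not the actual juju change: the helper names are hypothetical, and parsing the pid out of a holder message of the form `pid: 12576, operation: writing` (as seen in the warnings earlier in this report) is an assumption about how the holder would be identified.

```python
import os
import re


def holder_pid(message):
    """Extract the pid from a lock holder message, or None if absent."""
    match = re.search(r"pid:\s*(\d+)", message)
    return int(match.group(1)) if match else None


def process_exists(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check only
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def lock_is_orphaned(message):
    """True if the message names a holder pid that is no longer running."""
    pid = holder_pid(message)
    return pid is not None and not process_exists(pid)
```

Note that pids can be reused, so a check like this can only say a lock is definitely orphaned, not that a live pid is really the original holder.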

Tim Penhey (thumper) wrote :

There was no error checking on the unlock. New solution is to watch and retry unlocking, along with a longer lock time-out, automatic breaking and better logging.
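The "check the unlock error and retry" idea can be sketched as below. This is a hedged Python illustration only, not the fix that landed in juju: `unlock_with_retry`, its parameters, and the choice of `os.rmdir` as the default removal step are assumptions for the example.

```python
import logging
import os
import time

logger = logging.getLogger("configstore")


def unlock_with_retry(lock_dir, attempts=5, delay=0.1, remove=os.rmdir):
    """Try to release the lock, retrying on failure instead of ignoring errors.

    Hypothetical helper: returns True once the lock directory is gone,
    False if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            remove(lock_dir)
            return True
        except FileNotFoundError:
            return True  # already released
        except OSError as err:
            # Surface the failure rather than silently leaving the lock behind.
            logger.warning("unlock attempt %d/%d for %s failed: %s",
                           attempt, attempts, lock_dir, err)
            if attempt < attempts:
                time.sleep(delay)
    logger.error("giving up unlocking %s after %d attempts", lock_dir, attempts)
    return False
```

The `remove` parameter exists only so the retry behaviour can be exercised without a real filesystem failure.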

Tim Penhey (thumper)
Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Tim Penhey (thumper)
Tim Penhey (thumper)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released