error during juju-wait ERROR cannot load cookies: file locked for too long; giving up: cannot acquire lock: resource temporarily unavailable

Bug #1632362 reported by Larry Michel
This bug affects 1 person

Affects             Status        Importance  Assigned to     Milestone
Canonical Juju      Fix Released  Critical    Michael Foord
  2.0               Fix Released  Critical    Michael Foord
Juju Wait Plugin    Invalid       Undecided   Unassigned

Bug Description

This is the pipeline: http://10.25.9.231:8080/job/pipeline_deploy/606505/console

While systems were deploying, we hit the error below while doing a juju-wait.

Version of juju wait is 2.3.8-0~43~ubuntu16.04.1

and this was with juju 2.0 beta18: 2.0-beta18-0ubuntu1~16.04.1~juju1

From console:

Related "nexentaedge-mgmt:swift-gw" and "nexentaedge-swift-gw:swift-gw"
Related "nexentaedge-mgmt:nedge" and "nexentaedge-data:nedge"
Bundle has one or more containers specified as lxc. lxc containers are deprecated in Juju 2.0. lxd containers will be deployed instead.
Deploy of bundle completed.
2016-10-11 09:12:40,273 [INFO] oil_ci.juju.juju2: Waiting up to two hours for deployment to stabilize
ERROR:root:ERROR cannot load cookies: file locked for too long; giving up: cannot acquire lock: resource temporarily unavailable

ERROR:root:juju status --format=json failed: 1
2016-10-11 09:15:31,711 [ERROR] oil_ci.deploy.oil_deployer: Deployment failed: Command '['timeout', '7200', 'juju-2.0', 'wait', '-e', u'ci-oil-slave12:default']' returned non-zero exit status 1
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/oil_ci/deploy/oil_deployer.py", line 210, in deploy
    rc = juju2.deploy(self.env_name, deployment_yaml=self.yaml_file)
  File "/usr/lib/python2.7/dist-packages/oil_ci/juju/juju2.py", line 76, in deploy
    check_call(['timeout', '7200', JUJU, 'wait', '-e', model])
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['timeout', '7200', 'juju-2.0', 'wait', '-e', u'ci-oil-slave12:default']' returned non-zero exit status 1

Including juju status file and log file from the controller node.

Tags: eda oil
Revision history for this message
Larry Michel (lmic) wrote :
Revision history for this message
Anastasia (anastasia-macmood) wrote :

If "wait" plugin logs out, but we have just fixed clearing cookies on logout as part of bug # 1621375 on master tip.

Is there a chance that this can be re-tested with master tip?
It would be great to have this part of our QA testing...

tags: added: eda
Changed in juju:
status: New → Incomplete
Revision history for this message
Greg Lutostanski (lutostag) wrote :

The wait plugin does not log out, nor login. It essentially just does busy-looping on juju-status for the given model (where the credentials.yaml is fully specified).

It only happens when we have lots of contention for that lock. We are seeing it when we use the juju-wait plugin (which does pretty quick looping on juju status) on about 5 different envs at the same time.

Seems like that lock might be being held for more than just the critical section of reading. In addition we are only doing reads at this time (at least I think that is what should be happening for a bunch of 'juju status' calls), so it might be better to have a 1-write-many-read lock for this lock in particular, but don't know if that is entirely necessary for this bug.

Revision history for this message
Greg Lutostanski (lutostag) wrote :

juju wait does have a 4 second sleep between calls to juju status. Our basic setup is we have x number of environments running in different shells where we:
 $ juju bootstrap -m modelx # against maas
 $ juju deploy -m modelx ./bundle.yaml
 $ juju wait -m modelx

when x is around 4 or 5 we often get these errors. Is there any way to get more debug info for you guys?

As a side-note... when I run 8 'while true; do juju status || break; done' in parallel I do get errors runtime/cgo: pthread_create failed: Resource temporarily unavailable, which are similar but not the same.

Revision history for this message
Greg Lutostanski (lutostag) wrote :

And still happening on rc3, we will update to 2.0 soon, but I don't think there was a large delta there either.

Changed in juju:
status: Incomplete → New
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Greg

"wait" is juju plugin. It is maintain by a different group. The fix need to happen there not in juju codebase. I am removing this bug from juju project.

no longer affects: juju
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Greg

"juju wait" maintainers believe that the problem occurs when plugin is calling "juju status".
So it may well be a juju problem :D

Could you please provide logs? We are hoping that there will be more information there.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 2.1.0
summary: - [2.0 beta18] error during juju-wait ERROR cannot load cookies: file
- locked for too long; giving up: cannot acquire lock: resource
- temporarily unavailable
+ error during juju-wait ERROR cannot load cookies: file locked for too
+ long; giving up: cannot acquire lock: resource temporarily unavailable
description: updated
Revision history for this message
Greg Lutostanski (lutostag) wrote :

That is all the information provided. There is no more information in the logs, nor any indication of where it is coming from.

My digging pointed a few clues though...
https://bugs.launchpad.net/juju/+bug/1534643 which links to https://github.com/juju/persistent-cookiejar/issues/11

https://github.com/juju/persistent-cookiejar/blob/b48f5b9290d63455d10de0c0e4c26e06e6e74842/jar.go#L113 is where we are getting spit out.

The upstream lock library juju forked from has had an update, but it is documentation changes only.

I don't know of any flag I can pass to juju client to get any more debug information. Let me know if there is anything else I can provide.

Revision history for this message
Greg Lutostanski (lutostag) wrote :

https://github.com/juju/persistent-cookiejar/blob/master/serialize.go#L145 is where the error is emitted.

If a many-read-one-write lock is employed, rather than only letting one reader at a time, I think it would mitigate the contention enough for us to not hit this any more.

https://github.com/juju/persistent-cookiejar/blob/master/serialize.go#L59 (should allow other readers through if the lock is held)

Otherwise we could increase the timeout on waiting for the lock (at the end of the serialize file) -- but that might actually increase lock contention rather than reduce it, by having more processes wait longer before failing.
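For illustration only (this is not the persistent-cookiejar code, and flock(2) is Linux-specific), the many-reader/one-writer idea could be sketched roughly like this: readers take a shared lock, so concurrent "juju status" calls don't queue behind each other, and only a writer takes the exclusive lock.

    // Illustrative sketch of a many-reader/one-writer file lock using
    // flock(2). Not the persistent-cookiejar implementation.
    package cookielock

    import (
        "os"
        "syscall"
    )

    // LockShared blocks until a shared (read) lock is held on path.
    // Any number of readers can hold it at the same time.
    func LockShared(path string) (*os.File, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        if err := syscall.Flock(int(f.Fd()), syscall.LOCK_SH); err != nil {
            f.Close()
            return nil, err
        }
        return f, nil
    }

    // LockExclusive blocks until an exclusive (write) lock is held on
    // path; it excludes both readers and other writers.
    func LockExclusive(path string) (*os.File, error) {
        f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0600)
        if err != nil {
            return nil, err
        }
        if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
            f.Close()
            return nil, err
        }
        return f, nil
    }

    // Unlock releases the lock and closes the file.
    func Unlock(f *os.File) error {
        defer f.Close()
        return syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
    }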

Changed in juju:
importance: Medium → High
Changed in juju:
assignee: nobody → Richard Harding (rharding)
Revision history for this message
Stuart Bishop (stub) wrote :

2016-10-11 09:12:40,273 [INFO] oil_ci.juju.juju2: Waiting up to two hours for deployment to stabilize
ERROR:root:ERROR cannot load cookies: file locked for too long; giving up: cannot acquire lock: resource temporarily unavailable

ERROR:root:juju status --format=json failed: 1

juju-wait, as previously mentioned, just calls 'juju status' with Juju 2.0. Here it is shown informing the user of what it is doing and exposing the failure message from Juju with all the information it has. So I'm flagging this as invalid for juju-wait.

Changed in juju-wait:
status: New → Invalid
Revision history for this message
Larry Michel (lmic) wrote :

This is with 2.0.1:

Controller Model User Access Cloud/Region Models Machines HA Version
 maas* ci-oil-slave12 admin superuser maas 2 1 none 2.0.1

2016-11-03 16:49:05,775 [INFO] oil_ci.juju.juju2: Waiting up to two hours for deployment to stabilize
ERROR:root:ERROR cannot load cookies: file locked for too long; giving up: cannot acquire lock: resource temporarily unavailable

ERROR:root:juju status --format=json failed: 1

Controller logs attached.

Revision history for this message
Michael Foord (mfoord) wrote :

Switching to an RWMutex, instead of a Mutex, within persistent-cookiejar is simple; however, a robust multiple-reader-one-writer file lock may be problematic to get right. The code that attempts to do the locking waits for at least a second, which should be more than enough; however, due to the exponential back-off it only polls about four times during that second.

With multiple models the chance of getting four collisions in a row increases (assuming a single cookiejar file for multiple models). Increasing the timeout slightly and also increasing polling frequency could solve this if this is genuinely the cause of the problem.

I'm still digging in.
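For a sense of the arithmetic, here is a rough sketch (the constants are assumptions, not the exact values persistent-cookiejar uses) of how exponential back-off caps the number of attempts that fit in a one-second budget:

    // Illustrative only: counts how many lock attempts fit in a fixed
    // budget when the retry delay doubles each time.
    package main

    import (
        "fmt"
        "time"
    )

    // tryLock stands in for the real lock attempt; it always fails here
    // so we can count attempts under contention.
    func tryLock() bool { return false }

    func main() {
        deadline := time.Now().Add(1 * time.Second) // overall budget
        delay := 100 * time.Millisecond            // assumed initial delay
        attempts := 0
        for time.Now().Before(deadline) {
            attempts++
            if tryLock() {
                fmt.Println("acquired after", attempts, "attempts")
                return
            }
            time.Sleep(delay)
            delay *= 2 // back-off: 100ms, 200ms, 400ms, 800ms, ...
        }
        // 100+200+400+800 ms already exceeds the budget, so only about
        // four attempts happen before giving up.
        fmt.Println("gave up after", attempts, "attempts")
    }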

Revision history for this message
Michael Foord (mfoord) wrote :

This is the kind of error that caused us to move away from file-based locking to using juju/mutex within juju. If the problem is a stale lockfile, then switching persistent-cookiejar to juju/mutex will fix it. If the problem is lock contention due to the low number of retries, then removing the exponential backoff (which isn't really needed here) and increasing the timeout will fix it.

I'll do both and provide binaries for testing.
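A rough sketch of the second option (the helper name and constants are illustrative, not juju's actual code): polling at a fixed short interval over a longer timeout gives a briefly-held lock far more chances than roughly four.

    // Illustrative alternative: fixed-interval polling with a longer
    // overall timeout instead of exponential back-off.
    package lockretry

    import "time"

    // AcquireWithRetry calls try at a fixed interval until it succeeds
    // or the timeout elapses; it reports whether the lock was acquired.
    func AcquireWithRetry(try func() bool, interval, timeout time.Duration) bool {
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            if try() {
                return true
            }
            time.Sleep(interval)
        }
        return false
    }

With, for example, a 10ms interval and a 10 second timeout, a contended lock gets on the order of a thousand attempts instead of about four.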

Revision history for this message
Larry Michel (lmic) wrote :

I think this bug should be critical. This is the number of occurrences for the last 24 hours:

start_date:(=24 Hours Ago) finish_date:(=Now)
 Submit
1 matching bug

Number   Project  Summary                                  Occurrences  Last Seen (UTC)
1632362  juju     ERROR cannot load cookies. file locked   48           10-Nov-16 at 14:44:53

The controller was restarted a few hours ago, but we are still hitting it:

2016-11-10 14:34:43,477 [INFO] oil_ci.juju.juju2: Waiting up to two hours for deployment to stabilize
ERROR:root:ERROR cannot load cookies: file locked for too long; giving up: cannot acquire lock: resource temporarily unavailable

ERROR:root:juju status --format=json failed: 1
2016-11-10 14:35:12,497 [ERROR] oil_ci.deploy.oil_deployer: Deployment failed: Command '['timeout', '7200', 'juju-2.0', 'wait', '-e', u'maas:ci-oil-slave10']' returned non-zero exit status 1
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/oil_ci/deploy/oil_deployer.py", line 209, in deploy
    rc = juju2.deploy(self.env_name, deployment_yaml=self.yaml_file)
  File "/usr/lib/python2.7/dist-packages/oil_ci/juju/juju2.py", line 137, in deploy
    check_call(['timeout', '7200', JUJU, 'wait', '-e', model])
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['timeout', '7200', 'juju-2.0', 'wait', '-e', u'maas:ci-oil-slave10']' returned non-zero exit status 1

Revision history for this message
Larry Michel (lmic) wrote :

Latest logs from a controller that's only been up for a couple of hours:

Revision history for this message
Larry Michel (lmic) wrote :

LP not processing the tabs makes it difficult to see what the number of occurrences is in my previous comments. The number of occurrences is 48 for the last 24 hours.

Changed in juju:
importance: High → Critical
assignee: Richard Harding (rharding) → Katherine Cox-Buday (cox-katherine-e)
Revision history for this message
Larry Michel (lmic) wrote :

This is the juju version:

+ echo 'JUJU_VERSION =' 1:2.0.1-0ubuntu1~16.04.4~juju1 1:2.0.1-0ubuntu1~16.04.4~juju1 2.3.8-0~43~ubuntu16.04.1
JUJU_VERSION = 1:2.0.1-0ubuntu1~16.04.4~juju1 1:2.0.1-0ubuntu1~16.04.4~juju1 2.3.8-0~43~ubuntu16.04.1

Revision history for this message
Michael Foord (mfoord) wrote :

This version of juju uses a better locking mechanism (juju/mutex) instead of file-based locking. It also has a longer timeout and no exponential back-off, so it should be less liable to lock contention.

Changed in juju:
status: Triaged → In Progress
assignee: Katherine Cox-Buday (cox-katherine-e) → Michael Foord (mfoord)
Revision history for this message
Michael Foord (mfoord) wrote :

Please try the custom build of juju attached to this bug. On Monday I'll snap it up (unless there's feedback by then) to make it easier to try.

Revision history for this message
Michael Foord (mfoord) wrote :

This version is based off the 2.1 branch. Porting the fix to 2.0.x is trivial as it's just changing the dependency to a newer version of juju/persistent-cookiejar where the change is made, so if it fixes the issue we should do it for 2.0.x and 2.1.

Revision history for this message
Greg Lutostanski (lutostag) wrote :

How can we deploy this? A PPA build is the best option for us to deploy currently, if at all possible.

Revision history for this message
Michael Foord (mfoord) wrote :

You can just unzip this onto whichever machine you normally deploy *from* and deploy as normal with it. I'll see if getting a custom build into a ppa is feasible.

Revision history for this message
Richard Harding (rharding) wrote :

We won't have a PPA of this until it lands in devel, and then we'll have a dev snap available.

Michael can provide a one-off snap build to test with from his machine; that'll be the best path to testing apart from the binary provided.

Revision history for this message
Michael Foord (mfoord) wrote :

Hmmm... I was wrong that persistent-cookiejar only polls 4 times; the initial retry delay is 100us, not 100ms. The juju wait plugin itself can call juju concurrently many times (once per service) whilst doing the leadership check. This will cause lock contention if there are many services. See the leadership_poll function here: https://git.launchpad.net/juju-wait/tree/juju_wait/__init__.py

Revision history for this message
Michael Foord (mfoord) wrote :

Although note that with the old persistent-cookiejar code, as soon as the wait gets to 100ms it does make only four further polls. A new PR is changing this machinery in a different way. It *won't* deal with stale lock files (if that's the underlying issue), but it will deal with lock contention:

https://github.com/juju/persistent-cookiejar/pull/21

We'll land this in develop today.

Revision history for this message
Michael Foord (mfoord) wrote :
Revision history for this message
Michael Foord (mfoord) wrote :

Looks like the devel ppa only covers 2.0. There will be a dev snap of this available shortly however. You can obtain it with:

    snap install juju --edge --devmode

Revision history for this message
Michael Foord (mfoord) wrote :

I filed bug #1642007 against the juju-wait plugin, as it shouldn't need to poll units concurrently (possibly causing this lock contention) to determine leader.

Revision history for this message
Michael Foord (mfoord) wrote :

Committed to 2.0 and 2.1 branches.

Changed in juju:
status: In Progress → Fix Committed
Michael Foord (mfoord)
Changed in juju:
milestone: 2.1.0 → 2.0.3
milestone: 2.0.3 → 2.1.0
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.1.0 → 2.1-beta1
milestone: 2.1-beta1 → 2.1-beta2
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released