Error causes deployments to fail

Bug #1635664 reported by Aaron Bentley
This bug affects 3 people
Affects                  Status         Importance  Assigned to         Milestone
Autopilot Log Analyser   Fix Committed  Undecided   Francis Ginther
Canonical Juju           Fix Released   High        Christian Muirhead
Landscape Server         New            Undecided   Francis Ginther

Bug Description

As seen here:
http://reports.vapour.ws/releases/issue/580a3a58749a565438026730

The test fails trying to acquire a lock.

util_test.go:136:
    c.Assert(err, jc.ErrorIsNil)
... value *errors.Err = &errors.Err{message:"", cause:(*errors.Err)(0xc8201cd400), previous:(*errors.Err)(0xc820846dc0), file:"github.com/juju/juju/worker/uniter/resolver/loop.go", line:66} ("could not acquire lock: cancelled acquiring mutex")
... error stack:
 github.com/juju/mutex/errors.go:12: cancelled acquiring mutex
 github.com/juju/juju/worker/uniter/uniter.go:543:
 github.com/juju/juju/worker/uniter/operation/executor.go:74: could not acquire lock
 github.com/juju/juju/worker/uniter/resolver/loop.go:66:

[LOG] 0:08.862 DEBUG juju.api RPC connection died

Francis Ginther (fginther) wrote :

I'm seeing instances of this error in our automated testing of Landscape autopilot openstack deployments. The most recent example was with juju 1:2.1~rc1, but I also have runs with 1:2.1~beta4. I first found lp:1604915, which is marked as a duplicate of this bug. I've seen this error four times between beta4 and rc1.

The error message seen in one of the juju unit logs is:
[from build 5173 landscape-0-inner-logs/ceilometer-1/var/log/juju/unit-ceilometer-1.log]
2017-02-02 05:12:48 ERROR juju.worker.dependency engine.go:547 "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped
2017-02-02 05:12:48 ERROR juju.worker.uniter agent.go:28 resolver loop error: could not acquire lock: cancelled acquiring mutex

The juju-status message for that unit is also set to "resolver loop error". Each time I've hit this, a different application appears to be affected. The error causes the entire deployment to fail.

Builds associated with this failure:
 - https://ci.lscape.net/job/landscape-system-tests/5263
 - https://ci.lscape.net/job/landscape-system-tests/5173
 - https://ci.lscape.net/job/landscape-system-tests/5139
 - https://ci.lscape.net/job/landscape-system-tests/5132

Francis Ginther (fginther) wrote :
Changed in autopilot-log-analyser:
status: New → Fix Committed
tags: added: oil
Andreas Hasenack (ahasenack) wrote :

Speaking of leadership issues, it could be related to https://bugs.launchpad.net/juju/+bug/1654116 (just a remark)

summary: + Deployment fails and affects landscape ~15-20% of the time,
TestUniterSteadyStateUpgradeRelations: could not acquire lock
Anastasia (anastasia-macmood) wrote : Re: Deployment fails and affects landscape ~15-20% of the time, TestUniterSteadyStateUpgradeRelations: could not acquire lock

We'll address this in 2.1.1, due early March. Raising the importance based on feedback from interested parties.

Changed in juju:
importance: Medium → High
summary: - Deployment fails and affects landscape ~15-20% of the time,
- TestUniterSteadyStateUpgradeRelations: could not acquire lock
+ Error causes deployments to fail
Changed in juju:
milestone: none → 2.1.1
assignee: nobody → Ian Booth (wallyworld)
Torsten Baumann (torbaumann) wrote :

Affects landscape ~15-20% of the time, TestUniterSteadyStateUpgradeRelations: could not acquire lock

Ian Booth (wallyworld)
Changed in juju:
assignee: Ian Booth (wallyworld) → nobody
Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Christian Muirhead (2-xtian)
tags: added: cdo-qa-blocker
tags: added: landscape
Chad Smith (chad.smith)
Changed in autopilot-log-analyser:
assignee: nobody → Francis Ginther (fginther)
Christian Muirhead (2-xtian) wrote :

This is happening because a mutex acquire is being cancelled while delaying. The uniter tries to grab the hook lock, but is blocked by the test (in waitHooks). It delays 250ms and while it's delayed the test kills the uniter, which makes the delayed acquire call return ErrCancelled.

That is returned by the worker.Stop call in util_test.go, and so the test fails.

I'm changing the test code so that it won't fail if it gets ErrCancelled from the stop call.

Christian Muirhead (2-xtian) wrote :
Changed in juju:
status: Triaged → Fix Committed
Changed in landscape:
assignee: nobody → Francis Ginther (fginther)
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
Chris Gregan (cgregan)
tags: removed: cdo-qa-blocker