leadership dependency failures in CI deploy tests

Bug #1539656 reported by Cheryl Jennings
This bug affects 15 people
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  Critical    Ian Booth
juju-core       Fix Released  Critical    Ian Booth
1.25            Fix Released  Critical    Ian Booth

Bug Description

As seen in:
http://reports.vapour.ws/releases/3555/job/maas-1_9-deploy-trusty-amd64/attempt/1874

The dummy charms never come up because of this repeated loop:
2016-01-29 10:57:18 INFO juju.worker.uniter uniter.go:169 unit "dummy-sink/0" started
2016-01-29 10:57:18 INFO juju.worker.uniter uniter.go:177 resuming charm install
2016-01-29 10:57:18 DEBUG juju.worker.uniter.operation executor.go:68 running operation install local:trusty/dummy-sink-0
2016-01-29 10:57:18 DEBUG juju.worker.uniter.operation executor.go:108 preparing operation "install local:trusty/dummy-sink-0"
2016-01-29 10:57:18 DEBUG juju.worker.uniter runlistener.go:105 juju-run listener running
2016-01-29 10:57:18 INFO juju.worker.uniter.charm bundles.go:60 downloading local:trusty/dummy-sink-0 from https://10.0.30.17:17070/model/feeb0377-897c-4734-83c5-78d048b3f830/charms?file=%2A&url=local%3Atrusty%2Fdummy-sink-0
2016-01-29 10:57:18 INFO juju.worker.uniter.charm bundles.go:103 download aborted
2016-01-29 10:57:18 DEBUG juju.worker.uniter runlistener.go:118 juju-run listener stopping
2016-01-29 10:57:18 DEBUG juju.worker.uniter runlistener.go:137 juju-run listener stopped
2016-01-29 10:57:18 DEBUG juju.worker.dependency engine.go:464 "uniter" manifold worker stopped: preparing operation "install local:trusty/dummy-sink-0": failed to download charm "local:trusty/dummy-sink-0" from ["https://10.0.30.17:17070/model/feeb0377-897c-4734-83c5-78d048b3f830/charms?file=%2A&url=local%3Atrusty%2Fdummy-sink-0"]: aborted
2016-01-29 10:57:18 DEBUG juju.worker.dependency engine.go:464 "uniter" manifold worker stopped: dependency not available
2016-01-29 10:57:18 INFO juju.worker.upgrader upgrader.go:156 desired tool version: 2.0-alpha3
2016-01-29 10:57:21 DEBUG juju.worker.leadership tracker.go:123 dummy-sink/0 making initial claim for dummy-sink leadership
2016-01-29 10:57:21 DEBUG juju.worker.leadership tracker.go:165 checking dummy-sink/0 for dummy-sink leadership
2016-01-29 10:57:21 DEBUG juju.worker.dependency engine.go:450 "leadership-tracker" manifold worker started
2016-01-29 10:57:21 DEBUG juju.worker.dependency engine.go:464 "leadership-tracker" manifold worker stopped: leadership failure: leadership manager stopped
2016-01-29 10:57:21 ERROR juju.worker.dependency engine.go:505 "leadership-tracker" manifold worker returned unexpected error: leadership failure: leadership manager stopped
2016-01-29 10:57:21 INFO worker.uniter.jujuc tools.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-sink-0
2016-01-29 10:57:21 INFO worker.uniter.jujuc tools.go:40 was a symlink, now looking at /var/lib/juju/tools/2.0-alpha3-trusty-amd64
2016-01-29 10:57:21 DEBUG worker.uniter.jujuc tools.go:44 jujud path /var/lib/juju/tools/2.0-alpha3-trusty-amd64/jujud
2016-01-29 10:57:21 DEBUG juju.worker.uniter uniter.go:489 starting juju-run listener on unix:/var/lib/juju/agents/unit-dummy-sink-0/run.socket

Changed in juju-core:
status: New → Incomplete
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Also seen in this test: http://reports.vapour.ws/releases/3555/job/maas-1_7-deploy-trusty-amd64/attempt/1836

The log here has a tighter loop:
2016-01-29 11:05:17 INFO juju.worker.uniter uniter.go:169 unit "dummy-sink/0" started
2016-01-29 11:05:17 INFO juju.worker.uniter uniter.go:177 resuming charm install
2016-01-29 11:05:17 INFO juju.worker.uniter.charm bundles.go:60 downloading local:trusty/dummy-sink-0 from https://10.0.60.120:17070/model/c6314e14-750c-449d-8c2d-333cc2deea2c/charms?file=%2A&url=local%3Atrusty%2Fdummy-sink-0
2016-01-29 11:05:17 INFO juju.worker.uniter.charm bundles.go:103 download aborted
2016-01-29 11:05:17 INFO juju.worker.upgrader upgrader.go:156 desired tool version: 2.0-alpha3
2016-01-29 11:05:20 ERROR juju.worker.dependency engine.go:505 "leadership-tracker" manifold worker returned unexpected error: leadership failure: leadership manager stopped
2016-01-29 11:05:20 INFO worker.uniter.jujuc tools.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-sink-0
2016-01-29 11:05:20 INFO worker.uniter.jujuc tools.go:40 was a symlink, now looking at /var/lib/juju/tools/2.0-alpha3-trusty-amd64

Revision history for this message
Ian Booth (wallyworld) wrote :

The logs:

bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-source-0: 2016-01-29 20:23:46 DEBUG juju.worker.leadership tracker.go:164 checking dummy-source/0 for dummy-source leadership
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-source-0: 2016-01-29 20:23:46 DEBUG juju.worker.dependency engine.go:450 "meter-status" manifold worker started
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-source-0: 2016-01-29 20:23:46 DEBUG juju.worker.dependency engine.go:464 "leadership-tracker" manifold worker stopped: leadership failure: lease manager stopped
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-source-0: 2016-01-29 20:23:46 ERROR juju.worker.dependency engine.go:508 "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped

appear to indicate that something has called Kill() on the leadership manager. Once that happens, all subsequent attempts to claim leadership fail. The reason for the Kill() is not apparent from the logs so far.
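
For context, this is the shape of the failure in miniature (a self-contained sketch, not the real Juju leadership code; LeaseManager, Claim, and errStopped are illustrative stand-ins): once Kill() has been called, every later claim is answered with the stopped error instead of being serviced.

package main

import (
	"errors"
	"fmt"
)

// errStopped stands in for the "leadership manager stopped" error seen in the logs.
var errStopped = errors.New("leadership manager stopped")

// LeaseManager is an illustrative stand-in only, not Juju's real lease manager:
// a worker that services claims until someone kills it.
type LeaseManager struct {
	stop chan struct{}
}

func NewLeaseManager() *LeaseManager {
	return &LeaseManager{stop: make(chan struct{})}
}

// Kill stops the manager; after this, no request is ever serviced again.
func (m *LeaseManager) Kill() { close(m.stop) }

// Claim succeeds only while the manager is alive.
func (m *LeaseManager) Claim(unit, service string) error {
	select {
	case <-m.stop:
		return errStopped
	default:
		fmt.Printf("%s claims %s leadership\n", unit, service)
		return nil
	}
}

func main() {
	m := NewLeaseManager()
	_ = m.Claim("dummy-sink/0", "dummy-sink") // works while the manager is alive
	m.Kill()                                  // something kills the manager...
	fmt.Println(m.Claim("dummy-sink/0", "dummy-sink")) // ...and every later claim fails
}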

Revision history for this message
Ian Booth (wallyworld) wrote :

We can see that all of the following happen at the same time (within the 1-second granularity of the log timestamps):

- "leadership-tracker" manifold worker started
- dummy-sink/0 making initial claim for dummy-sink leadership
- checking dummy-sink/0 for dummy-sink leadership
- "leadership-tracker" manifold worker stopped: leadership failure: lease manager stopped
- "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped

It looks as if there was an error starting the leadership-tracker worker.

bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-sink-0: 2016-01-29 20:23:59 DEBUG juju.worker.dependency engine.go:450 "leadership-tracker" manifold worker started
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-sink-0: 2016-01-29 20:23:59 DEBUG juju.worker.leadership tracker.go:122 dummy-sink/0 making initial claim for dummy-sink leadership
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-sink-0: 2016-01-29 20:23:59 DEBUG juju.worker.leadership tracker.go:164 checking dummy-sink/0 for dummy-sink leadership
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-sink-0: 2016-01-29 20:23:59 DEBUG juju.worker.dependency engine.go:464 "leadership-tracker" manifold worker stopped: leadership failure: lease manager stopped
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-sink-0: 2016-01-29 20:23:59 ERROR juju.worker.dependency engine.go:508 "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped

Revision history for this message
Cheryl Jennings (cherylj) wrote :

I was able to recreate this using the api-command-rename branch on my vMAAS, but couldn't once I turned the logging level up to TRACE.

I wasn't able to reproduce on master.

Changed in juju-core:
status: Incomplete → Triaged
no longer affects: juju-core/api-command-rename
Changed in juju-core:
importance: Undecided → High
milestone: none → 2.0-alpha2
importance: High → Critical
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Here's the general flow of what's happening to cause this problem.

- The machine agent starts up on machine-0 and starts the "state" worker, which opens a state connection (let's call it "state connection 1").

- As part of opening state connection 1, a new lease manager is created for this state connection (let's call it lease manager A).

- Other workers start, including the API server (which uses state connection 1, passed in from the machine agent).

- The mongo replicaset is updated, causing mongo to drop all its connections.

- The API server dies as a result of the mongoPinger stopping when mongo drops its connections.

- On its way out the door, the API server calls HackLeadership(), which kills all leadership workers (including lease manager A).

- Workers restart.

- The "apiserver" worker is restarted, but uses the SAME state connection 1 from when the "state" worker first started. This state connection still points to lease manager A, which has already been Kill()ed and will not service any requests (it will return "leadership manager stopped").

So, when we have to restart the "apiserver" worker, it needs to get a fresh state connection with new lease / leadership managers.
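
A minimal sketch of that conclusion (illustrative only; State, openState, killLeadership, and ClaimLeadership are hypothetical stand-ins, not Juju's actual APIs): reusing the shared connection after its lease manager has been killed reproduces the "leadership manager stopped" error, while opening a fresh connection does not.

package main

import (
	"errors"
	"fmt"
)

var errManagerStopped = errors.New("leadership manager stopped")

// State is a hypothetical stand-in for a state connection; it owns a lease manager
// whose liveness is modelled by a single flag.
type State struct{ leaseStopped bool }

// openState models opening a state connection, which creates a fresh lease manager.
func openState() *State { return &State{} }

// ClaimLeadership fails once the embedded lease manager has been killed.
func (s *State) ClaimLeadership() error {
	if s.leaseStopped {
		return errManagerStopped
	}
	return nil
}

// killLeadership models the API server killing the lease manager on its way
// down; the *State it was handed lives on.
func killLeadership(s *State) { s.leaseStopped = true }

func main() {
	shared := openState() // "state connection 1", opened once by the machine agent

	// Buggy restart path: reuse the shared connection after its lease manager died.
	killLeadership(shared)
	fmt.Println("restart with shared state:", shared.ClaimLeadership()) // leadership manager stopped

	// Fixed restart path: give the restarted apiserver a fresh connection,
	// which comes with a fresh (running) lease manager.
	fresh := openState()
	fmt.Println("restart with fresh state:", fresh.ClaimLeadership()) // <nil>
}

In terms of the flow above, the "fresh" path corresponds to the restarted "apiserver" worker being given a new state connection (and with it new lease / leadership managers) rather than state connection 1.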

Ian Booth (wallyworld)
Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Ian Booth (wallyworld)
Revision history for this message
Ian Booth (wallyworld) wrote :
Revision history for this message
Cheryl Jennings (cherylj) wrote :

I can verify that the above patch fixes the problem in my testing.

Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-alpha2 → none
milestone: none → 2.0-alpha2
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
importance: Undecided → Critical
status: New → Fix Released