leadership dependency failures in CI deploy tests

Bug #1539656 reported by Cheryl Jennings
This bug affects 15 people
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  Critical    Ian Booth
juju-core       Fix Released  Critical    Ian Booth
1.25            Fix Released  Critical    Ian Booth

Bug Description

As seen in:
http://reports.vapour.ws/releases/3555/job/maas-1_9-deploy-trusty-amd64/attempt/1874

The dummy charms never come up because of this repeated loop:
2016-01-29 10:57:18 INFO juju.worker.uniter uniter.go:169 unit "dummy-sink/0" started
2016-01-29 10:57:18 INFO juju.worker.uniter uniter.go:177 resuming charm install
2016-01-29 10:57:18 DEBUG juju.worker.uniter.operation executor.go:68 running operation install local:trusty/dummy-sink-0
2016-01-29 10:57:18 DEBUG juju.worker.uniter.operation executor.go:108 preparing operation "install local:trusty/dummy-sink-0"
2016-01-29 10:57:18 DEBUG juju.worker.uniter runlistener.go:105 juju-run listener running
2016-01-29 10:57:18 INFO juju.worker.uniter.charm bundles.go:60 downloading local:trusty/dummy-sink-0 from https://10.0.30.17:17070/model/feeb0377-897c-4734-83c5-78d048b3f830/charms?file=%2A&url=local%3Atrusty%2Fdummy-sink-0
2016-01-29 10:57:18 INFO juju.worker.uniter.charm bundles.go:103 download aborted
2016-01-29 10:57:18 DEBUG juju.worker.uniter runlistener.go:118 juju-run listener stopping
2016-01-29 10:57:18 DEBUG juju.worker.uniter runlistener.go:137 juju-run listener stopped
2016-01-29 10:57:18 DEBUG juju.worker.dependency engine.go:464 "uniter" manifold worker stopped: preparing operation "install local:trusty/dummy-sink-0": failed to download charm "local:trusty/dummy-sink-0" from ["https://10.0.30.17:17070/model/feeb0377-897c-4734-83c5-78d048b3f830/charms?file=%2A&url=local%3Atrusty%2Fdummy-sink-0"]: aborted
2016-01-29 10:57:18 DEBUG juju.worker.dependency engine.go:464 "uniter" manifold worker stopped: dependency not available
2016-01-29 10:57:18 INFO juju.worker.upgrader upgrader.go:156 desired tool version: 2.0-alpha3
2016-01-29 10:57:21 DEBUG juju.worker.leadership tracker.go:123 dummy-sink/0 making initial claim for dummy-sink leadership
2016-01-29 10:57:21 DEBUG juju.worker.leadership tracker.go:165 checking dummy-sink/0 for dummy-sink leadership
2016-01-29 10:57:21 DEBUG juju.worker.dependency engine.go:450 "leadership-tracker" manifold worker started
2016-01-29 10:57:21 DEBUG juju.worker.dependency engine.go:464 "leadership-tracker" manifold worker stopped: leadership failure: leadership manager stopped
2016-01-29 10:57:21 ERROR juju.worker.dependency engine.go:505 "leadership-tracker" manifold worker returned unexpected error: leadership failure: leadership manager stopped
2016-01-29 10:57:21 INFO worker.uniter.jujuc tools.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-sink-0
2016-01-29 10:57:21 INFO worker.uniter.jujuc tools.go:40 was a symlink, now looking at /var/lib/juju/tools/2.0-alpha3-trusty-amd64
2016-01-29 10:57:21 DEBUG worker.uniter.jujuc tools.go:44 jujud path /var/lib/juju/tools/2.0-alpha3-trusty-amd64/jujud
2016-01-29 10:57:21 DEBUG juju.worker.uniter uniter.go:489 starting juju-run listener on unix:/var/lib/juju/agents/unit-dummy-sink-0/run.socket

Changed in juju-core:
status: New → Incomplete
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Also seen in this test: http://reports.vapour.ws/releases/3555/job/maas-1_7-deploy-trusty-amd64/attempt/1836

The log here has a tighter loop:
2016-01-29 11:05:17 INFO juju.worker.uniter uniter.go:169 unit "dummy-sink/0" started
2016-01-29 11:05:17 INFO juju.worker.uniter uniter.go:177 resuming charm install
2016-01-29 11:05:17 INFO juju.worker.uniter.charm bundles.go:60 downloading local:trusty/dummy-sink-0 from https://10.0.60.120:17070/model/c6314e14-750c-449d-8c2d-333cc2deea2c/charms?file=%2A&url=local%3Atrusty%2Fdummy-sink-0
2016-01-29 11:05:17 INFO juju.worker.uniter.charm bundles.go:103 download aborted
2016-01-29 11:05:17 INFO juju.worker.upgrader upgrader.go:156 desired tool version: 2.0-alpha3
2016-01-29 11:05:20 ERROR juju.worker.dependency engine.go:505 "leadership-tracker" manifold worker returned unexpected error: leadership failure: leadership manager stopped
2016-01-29 11:05:20 INFO worker.uniter.jujuc tools.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-dummy-sink-0
2016-01-29 11:05:20 INFO worker.uniter.jujuc tools.go:40 was a symlink, now looking at /var/lib/juju/tools/2.0-alpha3-trusty-amd64

Revision history for this message
Ian Booth (wallyworld) wrote :

The logs:

bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-source-0: 2016-01-29 20:23:46 DEBUG juju.worker.leadership tracker.go:164 checking dummy-source/0 for dummy-source leadership
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-source-0: 2016-01-29 20:23:46 DEBUG juju.worker.dependency engine.go:450 "meter-status" manifold worker started
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-source-0: 2016-01-29 20:23:46 DEBUG juju.worker.dependency engine.go:464 "leadership-tracker" manifold worker stopped: leadership failure: lease manager stopped
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-source-0: 2016-01-29 20:23:46 ERROR juju.worker.dependency engine.go:508 "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped

appear to indicate that something has called Kill() on the leadership manager. Once that happens, all subsequent attempts to claim leadership fail. The reason for the Kill() is not apparent from the logs so far.
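
For context, this is the shape of the failure in miniature (a self-contained sketch, not the real Juju leadership code; LeaseManager, Claim, and errStopped are illustrative stand-ins): once Kill() has been called, every later claim is answered with the stopped error instead of being serviced.

package main

import (
	"errors"
	"fmt"
)

// errStopped stands in for the "leadership manager stopped" error seen in the logs.
var errStopped = errors.New("leadership manager stopped")

// LeaseManager is an illustrative stand-in only, not Juju's real lease manager:
// a worker that services claims until someone kills it.
type LeaseManager struct {
	stop chan struct{}
}

func NewLeaseManager() *LeaseManager {
	return &LeaseManager{stop: make(chan struct{})}
}

// Kill stops the manager; after this, no request is ever serviced again.
func (m *LeaseManager) Kill() { close(m.stop) }

// Claim succeeds only while the manager is alive.
func (m *LeaseManager) Claim(unit, service string) error {
	select {
	case <-m.stop:
		return errStopped
	default:
		fmt.Printf("%s claims %s leadership\n", unit, service)
		return nil
	}
}

func main() {
	m := NewLeaseManager()
	_ = m.Claim("dummy-sink/0", "dummy-sink") // works while the manager is alive
	m.Kill()                                  // something kills the manager...
	fmt.Println(m.Claim("dummy-sink/0", "dummy-sink")) // ...and every later claim fails
}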

Revision history for this message
Ian Booth (wallyworld) wrote :

We can see that all of the following happen at the same time (within the 1-second granularity of the log timestamps):

- "leadership-tracker" manifold worker started
- dummy-sink/0 making initial claim for dummy-sink leadership
- checking dummy-sink/0 for dummy-sink leadership
- "leadership-tracker" manifold worker stopped: leadership failure: lease manager stopped
- "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped

It looks as if there was an error starting the leadership-tracker worker.

bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-sink-0: 2016-01-29 20:23:59 DEBUG juju.worker.dependency engine.go:450 "leadership-tracker" manifold worker started
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-sink-0: 2016-01-29 20:23:59 DEBUG juju.worker.leadership tracker.go:122 dummy-sink/0 making initial claim for dummy-sink leadership
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-sink-0: 2016-01-29 20:23:59 DEBUG juju.worker.leadership tracker.go:164 checking dummy-sink/0 for dummy-sink leadership
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-sink-0: 2016-01-29 20:23:59 DEBUG juju.worker.dependency engine.go:464 "leadership-tracker" manifold worker stopped: leadership failure: lease manager stopped
bdbf38fa-9cb1-49ab-8285-39eb68fde43c unit-dummy-sink-0: 2016-01-29 20:23:59 ERROR juju.worker.dependency engine.go:508 "leadership-tracker" manifold worker returned unexpected error: leadership failure: lease manager stopped

Revision history for this message
Cheryl Jennings (cherylj) wrote :

I was able to recreate this using the api-command-rename branch on my vMAAS, but couldn't once I turned the logging level up to TRACE.

I wasn't able to reproduce on master.

Changed in juju-core:
status: Incomplete → Triaged
no longer affects: juju-core/api-command-rename
Changed in juju-core:
importance: Undecided → High
milestone: none → 2.0-alpha2
importance: High → Critical
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Here's the general flow of what's happening to cause this problem.

- The machine agent starts up on machine-0 and starts the "state" worker, which opens a state connection (let's call it "state connection 1").

- As part of opening state connection 1, a new lease manager is created for this state connection (let's call it lease manager A).

- Other workers start, including the API server (which uses state connection 1, passed in from the machine agent).

- The mongo replicaset is updated, causing mongo to drop all its connections.

- The API server dies as a result of the mongoPinger stopping when mongo drops its connections.

- On its way out the door, the API server calls HackLeadership(), which kills all leadership workers (including lease manager A).

- Workers restart.

- The "apiserver" worker is restarted, but uses the SAME state connection 1 from when the "state" worker first started. This state connection still points to lease manager A, which has already been Kill()ed and will not service any requests (it will return "leadership manager stopped").

So, when we have to restart the "apiserver" worker, it needs to get a fresh state connection with new lease / leadership managers.
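
A minimal sketch of that conclusion (illustrative only; State, openState, killLeadership, and ClaimLeadership are hypothetical stand-ins, not Juju's actual APIs): reusing the shared connection after its lease manager has been killed reproduces the "leadership manager stopped" error, while opening a fresh connection does not.

package main

import (
	"errors"
	"fmt"
)

var errManagerStopped = errors.New("leadership manager stopped")

// State is a hypothetical stand-in for a state connection; it owns a lease manager
// whose liveness is modelled by a single flag.
type State struct{ leaseStopped bool }

// openState models opening a state connection, which creates a fresh lease manager.
func openState() *State { return &State{} }

// ClaimLeadership fails once the embedded lease manager has been killed.
func (s *State) ClaimLeadership() error {
	if s.leaseStopped {
		return errManagerStopped
	}
	return nil
}

// killLeadership models the API server killing the lease manager on its way
// down; the *State it was handed lives on.
func killLeadership(s *State) { s.leaseStopped = true }

func main() {
	shared := openState() // "state connection 1", opened once by the machine agent

	// Buggy restart path: reuse the shared connection after its lease manager died.
	killLeadership(shared)
	fmt.Println("restart with shared state:", shared.ClaimLeadership()) // leadership manager stopped

	// Fixed restart path: give the restarted apiserver a fresh connection,
	// which comes with a fresh (running) lease manager.
	fresh := openState()
	fmt.Println("restart with fresh state:", fresh.ClaimLeadership()) // <nil>
}

In terms of the flow above, the "fresh" path corresponds to the restarted "apiserver" worker being given a new state connection (and with it new lease / leadership managers) rather than state connection 1.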

Ian Booth (wallyworld)
Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Ian Booth (wallyworld)
Revision history for this message
Ian Booth (wallyworld) wrote :
Revision history for this message
Cheryl Jennings (cherylj) wrote :

I can verify that the above patch fixes the problem in my testing.

Ian Booth (wallyworld)
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-alpha2 → none
milestone: none → 2.0-alpha2
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
importance: Undecided → Critical
status: New → Fix Released