juju-core

ensure-availability brings down the state-server

Bug #1541473 reported by Curtis Hovey on 2016-02-03

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
Canonical Juju	Fix Released	Critical	Cheryl Jennings	Canonical Juju 2.0-alpha2
juju-core	Fix Released	Critical	Cheryl Jennings
1.25	Fix Released	Critical	Cheryl Jennings	juju-core 1.25.4

Bug Description

As seen in
http://reports.vapour.ws/releases/issue/56b24585749a56173940e824

Juju 1.25.4 cannot bring up the additional state-servers when calling ensure-availability.

The suspect commits are:
    https://github.com/juju/juju/commit/e0fc2ca179881c658dc931e15e848707901bd97f
    https://github.com/juju/juju/commit/9f5ca7ef8d0bc60f438e9f44813e1b9701a8a654
    https://github.com/juju/juju/commit/76ca2e226388be4afbd5f40f66ab86e7fafc21de

Curtis Hovey (sinzui) on 2016-02-03

Changed in juju-core:
status:	New → Incomplete

Curtis Hovey (sinzui) wrote on 2016-02-03:

The logs are missing. We can see that the test was using status to poll for other state-servers to come up when the client lost its connection to state-server machine-0. The script then tried to query the state-server to learn the addresses of the machines in the env so it could fetch the logs. Since the state-server was down, no addresses were known.

This is a bug in the test script because the bootstrap server's address was known and should have been used as a fallback.
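
The fallback described above could look roughly like the sketch below. It is not the actual test script: `addresses_for_log_collection`, `query_status`, and the example address are hypothetical names and values chosen for illustration, and the real script may be structured quite differently.

```python
def addresses_for_log_collection(query_status, bootstrap_address):
    """Return a mapping of machine id -> address to collect logs from.

    query_status is a hypothetical callable that asks the state-server for
    the machine addresses and raises if the state-server is unreachable.
    """
    try:
        addresses = dict(query_status())
    except (ConnectionError, TimeoutError):
        # State-server is down: nothing could be learned from it.
        addresses = {}
    # The bootstrap server's address was known before any query was made,
    # so always include it as a fallback for machine 0.
    addresses.setdefault("0", bootstrap_address)
    return addresses


def _unreachable():
    # Stand-in for a status query against a state-server that is down.
    raise ConnectionError("state-server is down")


# Illustrative address only.
fallback_only = addresses_for_log_collection(_unreachable, "10.0.0.1")
```

With this shape, a dead state-server still leaves machine 0 in the result, so its logs can be retrieved.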

Curtis Hovey (sinzui) on 2016-02-03

summary:

- ensure-availability timesout in 1.25
+ ensure-availability brings down the state-server in 1.25
description:	updated

Cheryl Jennings (cherylj) wrote on 2016-02-03: Re: ensure-availability brings down the state-server in 1.25

I was able to recreate this in my own AWS account.

I see in syslog this message over and over:
[rsHealthPoll] replset info 172.31.18.140:37017 heartbeat failed, retrying
[rsHealthPoll] couldn't connect to 172.31.18.140:37017: couldn't connect to server 172.31.18.140:37017

This is machine "2", which, when I looked at the AWS console to get the IP for machine-0, still showed as "pending". Maybe it hadn't come up yet before we changed the replicaset?

We can't connect to machine-0 for a `juju status` because it's busy trying to reconnect to mongo, which in turn is busy trying to connect to the third machine.

Cheryl Jennings (cherylj) wrote on 2016-02-03:

We have a deadlock in this bug, which happens as follows:

1 - We add machines using ensure-availability, and AWS begins provisioning

2 - The instancepoller updates the IPs of the new machines before they have been completely provisioned (cloud-init not yet completed)

3 - machine-0 changes the replicaset to add the two machines

4 - mongo drops connections (as it does during replicaset changes), and blocks, waiting to contact the two machines added to the replica set.

5 - workers on machine-0 die because mongo has dropped the connections, including the apiserver

6 - machine-2 is coming up and is trying to get tools from machine-0, but it can't because machine-0 is waiting on mongo, mongo is waiting on machine-2, and machine-2 is waiting on machine-0

Round and round we go.
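
To make the circular wait in step 6 concrete, here is a small illustrative model of it as a "waits on" graph. This is only a sketch for reasoning about the deadlock, not Juju code; `find_wait_cycle` and `WAITS_ON` are hypothetical names.

```python
def find_wait_cycle(waits_on, start):
    """Follow 'waits on' edges from start and return the cycle reached, if any.

    waits_on maps each component to the single component it is blocked on.
    Returns an empty list when the chain ends without looping back.
    """
    path = []
    seen = {}
    node = start
    while node is not None and node not in seen:
        seen[node] = len(path)
        path.append(node)
        node = waits_on.get(node)
    if node is None:
        return []
    # node was visited before: everything from its first visit is the cycle.
    return path[seen[node]:]


# The three dependencies named in step 6 of the description above.
WAITS_ON = {
    "machine-0": "mongo",      # apiserver cannot serve until mongo recovers
    "mongo": "machine-2",      # replicaset change blocks on the new member
    "machine-2": "machine-0",  # new machine needs tools from machine-0
}

cycle = find_wait_cycle(WAITS_ON, "machine-2")
```

Starting from any of the three components leads back to itself, which is why none of them can make progress.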

We really shouldn't update API hostports until the machine is "started".
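
As a rough sketch of that idea (and not the change that actually landed), publishing could be gated on machine status along these lines. The `Machine` type, `publishable_hostports`, the port number, and all addresses except the one from the syslog excerpt are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    id: str
    address: str
    status: str  # e.g. "pending" or "started"


def publishable_hostports(machines, port):
    """Return host:port strings only for machines that report "started".

    Machines that have an address but have not finished provisioning are
    skipped, so nothing tries to contact them too early.
    """
    return [f"{m.address}:{port}" for m in machines if m.status == "started"]


# Illustrative data: machine 2 has an IP from the instance poller but
# cloud-init has not completed, so it is still "pending".
machines = [
    Machine("0", "172.31.0.10", "started"),
    Machine("1", "172.31.0.11", "pending"),
    Machine("2", "172.31.18.140", "pending"),
]

hostports = publishable_hostports(machines, 17070)  # port is an assumption
```

Here only machine 0 would be published until the other two report "started".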

Cheryl Jennings (cherylj) on 2016-02-04

Changed in juju-core:
status:	Incomplete → Triaged
importance:	Undecided → Critical
milestone:	none → 2.0-alpha2

Aaron Bentley (abentley) on 2016-02-04

summary:

- ensure-availability brings down the state-server in 1.25
+ ensure-availability brings down the state-server

Cheryl Jennings (cherylj) on 2016-02-04

Changed in juju-core:
assignee:	nobody → Cheryl Jennings (cherylj)
status:	Triaged → In Progress

Cheryl Jennings (cherylj) on 2016-02-04

tags:

removed: blocker

Cheryl Jennings (cherylj) wrote on 2016-02-09:

Master PR: https://github.com/juju/juju/pull/4338

Cheryl Jennings (cherylj) on 2016-02-09

Changed in juju-core:
status:	In Progress → Fix Committed

Curtis Hovey (sinzui) on 2016-02-11

tags:	added: tech-debt
Changed in juju-core:
status:	Fix Committed → Fix Released

Canonical Juju QA Bot (juju-qa-bot) on 2016-08-23

affects:	juju-core → juju
Changed in juju:
milestone:	2.0-alpha2 → none
milestone:	none → 2.0-alpha2

Canonical Juju QA Bot (juju-qa-bot) on 2016-08-23

Changed in juju-core:
assignee:	nobody → Cheryl Jennings (cherylj)
importance:	Undecided → Critical
status:	New → Fix Released
