ensure-availability brings down the state-server

Bug #1541473 reported by Curtis Hovey
This bug affects 1 person
Affects         Status        Importance  Assigned to      Milestone
Canonical Juju  Fix Released  Critical    Cheryl Jennings
juju-core       Fix Released  Critical    Cheryl Jennings
1.25            Fix Released  Critical    Cheryl Jennings
Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Incomplete
Curtis Hovey (sinzui) wrote :

The logs are missing. We can see that the test was using status to poll for other state-servers to come up when the client lost its connection to state-server machine-0. The script then tried to query the state-server to learn the addresses of the machines in the env so it could collect the logs. Since the state-server was down, no addresses were known.

This is a bug in the test script because the bootstrap server's address was known and should have been used as a fallback.
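
A minimal sketch of the fallback described above, not the actual test script. The function name, the `query_status` callable and the `{machine_id: address}` return shape are all assumptions made for illustration:

```python
from typing import Callable, Dict, List


def addresses_for_log_collection(
    query_status: Callable[[], Dict[str, str]],
    bootstrap_address: str,
) -> List[str]:
    """Return the addresses to collect logs from.

    query_status is a hypothetical callable that returns
    {machine_id: address} and raises when the state-server is unreachable.
    """
    try:
        addresses = list(query_status().values())
    except Exception:
        # The state-server is down, so status cannot tell us anything.
        addresses = []
    # The bootstrap server's address was known from the start, so always
    # include it, even when status returned nothing.
    if bootstrap_address not in addresses:
        addresses.insert(0, bootstrap_address)
    return addresses
```

With a `query_status` that raises, this still returns a one-element list holding the bootstrap address, so the logs from machine-0 could be fetched.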

Curtis Hovey (sinzui)
summary: - ensure-availability timesout in 1.25
+ ensure-availability brings down the state-server in 1.25
description: updated
Cheryl Jennings (cherylj) wrote : Re: ensure-availability brings down the state-server in 1.25

I was able to recreate this in my own AWS account.

I see in syslog this message over and over:
[rsHealthPoll] replset info 172.31.18.140:37017 heartbeat failed, retrying
[rsHealthPoll] couldn't connect to 172.31.18.140:37017: couldn't connect to server 172.31.18.140:37017

That address belongs to machine "2", which still showed as "pending" when I looked at the AWS console to get the IP for machine-0. Maybe it hadn't come up yet before we changed the replicaset?

We can't connect to machine-0 for a `juju status` because it's busy trying to reconnect to mongo, which in turn is busy trying to connect to the third machine.

Cheryl Jennings (cherylj) wrote :

We have a deadlock in this bug, which happens as follows:

1 - We add machines using ensure-availability, and AWS begins provisioning

2 - The instancepoller updates the IPs of the new machines before they have been completely provisioned (cloud-init not yet completed)

3 - machine-0 changes the replicaset to add the two machines

4 - mongo drops connections (as it does during replicaset changes), and blocks, waiting to contact the two machines added to the replica set.

5 - workers on machine-0 die because mongo has dropped the connections, including the apiserver

6 - machine-2 is coming up and is trying to get tools from machine-0, but it can't because machine-0 is waiting on mongo, mongo is waiting on machine-2, and machine-2 is waiting on machine-0

Round and round we go.
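
To make the cycle in step 6 explicit, here is a small illustrative sketch that models who is waiting on whom as a wait-for graph and walks it. This is only a model of the description above, not Juju code; the `WAITS_ON` mapping and `find_cycle` are hypothetical names:

```python
from typing import Dict, List, Optional

# Who is blocked on whom, as described in step 6.
WAITS_ON: Dict[str, str] = {
    "machine-0": "mongo",
    "mongo": "machine-2",
    "machine-2": "machine-0",
}


def find_cycle(waits_on: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow wait-for edges from start; return the cycle if one is reached."""
    path: List[str] = []
    node: Optional[str] = start
    while node is not None and node not in path:
        path.append(node)
        node = waits_on.get(node)
    if node is None:
        return None  # The chain ended, so nothing is deadlocked.
    # node was seen before: everything from its first visit on is the cycle.
    return path[path.index(node):]
```

Starting from `"machine-2"`, `find_cycle(WAITS_ON, "machine-2")` returns all three participants, which is the deadlock. Removing any one edge makes it return `None`.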

We really shouldn't update API hostports until the machine is "started".
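
As a rough sketch of that idea (not the actual fix), the hostports could be filtered on machine status before they are published. The `Machine` type, its fields, and `publishable_hostports` are hypothetical, as are the status strings other than "pending" and "started", which are the ones mentioned in this report:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Machine:
    id: str
    address: str
    agent_status: str  # e.g. "pending" or "started"


def publishable_hostports(machines: List[Machine], port: int) -> List[str]:
    """Return host:port strings only for machines whose agent has started."""
    return [
        f"{m.address}:{port}"
        for m in machines
        if m.agent_status == "started"
    ]
```

A machine that already has an IP from the instancepoller but is still "pending" is left out, so nothing is told to contact it until it has finished coming up.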

Changed in juju-core:
status: Incomplete → Triaged
importance: Undecided → Critical
milestone: none → 2.0-alpha2
Aaron Bentley (abentley)
summary: - ensure-availability brings down the state-server in 1.25
+ ensure-availability brings down the state-server
Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
status: Triaged → In Progress
tags: removed: blocker
Cheryl Jennings (cherylj) wrote :
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
tags: added: tech-debt
Changed in juju-core:
status: Fix Committed → Fix Released
affects: juju-core → juju
Changed in juju:
milestone: 2.0-alpha2 → none
milestone: none → 2.0-alpha2
Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
importance: Undecided → Critical
status: New → Fix Released