Agents are "lost" after terminating a state server in an HA env

Bug #1510651 reported by Cheryl Jennings
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Unassigned
juju-core       Won't Fix     High        Unassigned
1.25            Fix Released  Critical    Unassigned

Bug Description

Recreate steps:

1 - Bootstrap (ec2 in my case)
2 - Deploy services and wait for completion
3 - Run ensure-availability -n 3 and wait for completion
4 - Stop the initial state server (machine 0) from the ec2 console

Some services will show as "agent lost".

Killing jujud on "lost" machines will trigger them to reconnect.
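
The steps above can be driven end to end from a shell. A rough sketch, assuming an EC2 environment and the AWS CLI for the "stop machine 0" step (the instance ID below is a placeholder for whatever EC2 reports for machine 0):

# Reproduce: bootstrap, deploy, enable HA, then kill the original state server.
juju bootstrap
juju deploy ubuntu
juju ensure-availability -n 3
# Wait for all three state servers to report "started" in `juju status`, then
# stop machine 0 out from under Juju (EC2 console, or the AWS CLI):
aws ec2 stop-instances --instance-ids i-0123456789abcdef0   # placeholder instance ID
# Units on the surviving machines will start showing "agent lost":
juju status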

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Also, even though machine 0 is stopped, it still shows as "started" in `juju status`.
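
One way to see exactly what Juju is reporting for machine 0, rather than the summary view, is the YAML status output; a sketch using stock 1.x commands (field names vary a little between 1.24 and 1.26):

# Full detail for every machine and unit, including the per-machine
# agent-state and instance-state fields that the summary view collapses.
juju status --format=yaml
# Or just the state lines:
juju status --format=yaml | grep -E 'agent-state|instance-state|dns-name'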

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Mark Ramm (mark-ramm) wrote :

I think this should be assigned to 1.25.1 and prioritized accordingly. HA issues are critical.

I also think there needs to be a test case for this in our CI infrastructure.

Changed in juju-core:
importance: High → Critical
milestone: none → 1.25.1
tags: added: bug-squad ensure-availability
Revision history for this message
Cheryl Jennings (cherylj) wrote :

When recreating this with 1.24.7, the agents do seem to eventually reconnect in my small deployment, but it took over 10 minutes.

Revision history for this message
John A Meinel (jameinel) wrote :

So is this killing the entire machine-0, or is it just stopping the API server?
If you stop a Juju API server without stopping its associated Mongo instance, then we are in "known broken but not fixable right now" territory. We select the master Juju server as the one that has a local connection to the Mongo master. (We did this to avoid writing our own election mechanism, and because it gives us DB-level guarantees that if we stop being the master, we stop being able to make changes to the DB on that local connection.)
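
A quick way to see which Mongo replica-set member is currently primary is to ask Mongo directly from one of the state servers. This is only a sketch: it assumes Juju's MongoDB listens on the usual 37017 port with SSL and that a mongo shell is available (on some releases it lives under /usr/lib/juju/bin/); the isMaster command itself does not require authentication.

# Run on a surviving state server (e.g. via `juju ssh 1`).
mongo --ssl --sslAllowInvalidCertificates --port 37017 \
    --eval 'printjson(rs.isMaster())'
# "primary" in the output names the member the master Juju server should be
# colocated with; while the replica set is re-electing, there may be no
# primary at all.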

Revision history for this message
Andrew McDermott (frobware) wrote :

James: let's focus on 1.24 first.

Changed in juju-core:
assignee: nobody → James Tunnicliffe (dooferlad)
Revision history for this message
godleon (godleon) wrote :

Hi Canonical friends,

I am Leon from QCT Taiwan. I deployed OpenStack in my environment to reproduce the issue.

I had 3 state servers (machines 0, 4 and 5); they are also my controller nodes in OpenStack.

I shut down the bootstrap server (machine 0) around 2015-10-28 08:47. Juju took about 1-2 minutes to come back, but I saw a lot of agents reported as lost in juju status.

I originally thought the lost status would only affect the services on machine 0, but that seems not to be the case: almost all services on all machines are impacted by the failure of the bootstrap server.

I attached the logs to this comment; check from "2015-10-28 08:49" to see what happened when the bootstrap server failed in my environment.

By the way, I ran all my tests on physical servers with Juju 1.24.6 and MAAS 1.8.2 on a dedicated network.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

Just tried this on EC2 with 1.24:

juju bootstrap
juju deploy ubuntu
juju ensure-availability -n 3

An approximate timeline after this point:
0 minutes:
 - terminate machine 0 from EC2 console
2 minutes:
 - juju status returns a status
 - State server (machine 3) agent-status: started
 - State server (machine 2) agent-status: down
 - ubuntu charm agent-status: lost
5 minutes:
 - ubuntu charm agent-status: idle
 - State server (machine 3) agent-status: started
 - State server (machine 2) agent-status: down
7 minutes:
 - Everything OK (apart from machine 0, which we terminated).

I didn't need to bounce any processes. It would be better if this happened in seconds rather than minutes though.
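
A timeline like the one above can be collected with a simple watch loop; a minimal sketch using only the stock client:

# Print a timestamped snapshot of agent states every 30 seconds while the
# environment recovers; stop with Ctrl-C.
while true; do
    date
    juju status --format=yaml | grep -E 'agent-state|agent-status|instance-state'
    sleep 30
done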

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Bouncing jujud in my first comment was necessary for the unit agents under 1.26. There are logs from the original environment in comment #9.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

Using 1.26 I did "juju ssh 0 sudo halt -p" and machine 0 is still showing up in the status output as started, but EC2 shows it as stopped. Interesting.

Didn't need to do anything other than wait for the Ubuntu charm to get back to idle.

Strange, but machine 0 is reporting as started >10 minutes after the machine stopped.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

juju set-env logging-config=juju.state.presence=TRACE
juju debug-log
<snip>
machine-2: 2015-10-30 14:10:52 TRACE juju.state.presence presence.go:671 pinging "u#ubuntu/0" with seq=36
machine-2: 2015-10-30 14:10:52 TRACE juju.state.presence presence.go:671 pinging "m#3" with seq=37
machine-2: 2015-10-30 14:10:53 TRACE juju.state.presence presence.go:671 pinging "m#1" with seq=38
machine-2: 2015-10-30 14:11:04 TRACE juju.state.presence presence.go:671 pinging "m#2" with seq=30
machine-2: 2015-10-30 14:11:08 TRACE juju.state.presence presence.go:671 pinging "m#2" with seq=31
machine-1: 2015-10-30 14:11:05 TRACE juju.state.presence presence.go:671 pinging "m#1" with seq=39
machine-2: 2015-10-30 14:11:15 TRACE juju.state.presence presence.go:671 pinging "m#3" with seq=37
machine-2: 2015-10-30 14:11:15 TRACE juju.state.presence presence.go:671 pinging "u#ubuntu/0" with seq=36
machine-2: 2015-10-30 14:11:16 TRACE juju.state.presence presence.go:671 pinging "m#1" with seq=38
machine-1: 2015-10-30 14:11:28 TRACE juju.state.presence presence.go:671 pinging "m#1" with seq=39
machine-2: 2015-10-30 14:11:27 TRACE juju.state.presence presence.go:671 pinging "m#2" with seq=30
machine-2: 2015-10-30 14:11:31 TRACE juju.state.presence presence.go:671 pinging "m#2" with seq=31
machine-2: 2015-10-30 14:11:38 TRACE juju.state.presence presence.go:671 pinging "m#3" with seq=37

Looks like the pinger isn't trying to ping machine 0. Odd.
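
That can be confirmed by filtering the same TRACE output for the machine-0 presence key; a sketch that pipes debug-log through grep rather than relying on any particular debug-log flags:

# With juju.state.presence at TRACE (as set above), a live pinger for machine 0
# would keep emitting lines that mention its presence key "m#0".
juju debug-log | grep 'pinging "m#0"'
# No output here, while "m#1".."m#3" keep appearing, matches the observation
# that nothing is pinging on behalf of machine 0.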

Revision history for this message
John A Meinel (jameinel) wrote :

There was work being done (I think in 1.26) to move status into the machine document so that it could be included in the AllWatcher. (It may not have been done, but it was discussed.)
IIRC, the API server that a machine connected to was responsible for noticing when the connection went down and updating the machine doc.
However, an API server probably only talks to itself, and with it dead, nobody else is responsible for marking it dead.

That could explain 1.26 vs 1.24.

Revision history for this message
John A Meinel (jameinel) wrote :

As for the times listed in James's comment:
2 min until status: AFAIK we connect to Mongo with consistency STRONG, which means we can only read from the master. So while Mongo is re-establishing a master, there is nobody we can read from.
I'm not positive whether you can read from a slave when there is no master in other circumstances; we could investigate.
But last-login also wants to record the connection, so we do want a (soft?) write.
As for why machine 2 stayed down so long and its agent was lost, I don't have specific insights.

Revision history for this message
Cheryl Jennings (cherylj) wrote :

John - RE: comment #14, the discussions around machine presence haven't been implemented yet, so that code hasn't changed.

Revision history for this message
William Reade (fwereade) wrote :

Just to be clear: this information may be associated with the machine, but that's not a good reason to put it into the machine document. The AllWatcher can watch different collections and compose the changes into client-facing "docs" just fine; but every time we add a new write pattern for a mongodb document we (1) make O(N^2) transactions ever more likely and (2) make every other watcher for that document less efficient (perhaps significantly so at scale).

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

I have done some tests with a very simple setup that shows the "agent is lost, sorry!" message for somewhere between two and twenty minutes. I am still looking for the cause of this variability. It has always eventually settled down for me, though.

When Juju loses a state server, the units that are connected to it retry connecting, eventually give up and restart, triggering a reconnect to all state servers.

When you can't see state from "juju status", that is because your client is trying to talk to the dead state server. Once it works this out and tries another, you get your status information.

The problem is that if you don't re-run 'juju ensure-availability -n 3', you don't get a replacement state server, and Juju won't do this automatically. I think what is happening is:

A worker tries to start a watcher that is watching a machine that is down.
The watcher eventually gives up with 'ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down'.
Next, with the api worker as an example:
2015-10-28 09:29:39 INFO juju.worker runner.go:275 stopped "api", err: watcher has been stopped
2015-10-28 09:29:42 DEBUG juju.worker runner.go:203 "api" done: watcher has been stopped
2015-10-28 09:29:45 ERROR juju.worker runner.go:223 exited "api": watcher has been stopped
2015-10-28 09:29:48 INFO juju.worker runner.go:261 restarting "api" in 3s
2015-10-28 09:29:54 INFO juju.worker runner.go:269 start "api"
2015-10-28 09:29:57 INFO juju.api apiclient.go:331 dialing "wss://10.10.197.27:17070/environment/f388baa3-77af-4b05-8f0f-5f277e626e58/api"
2015-10-28 09:29:57 INFO juju.api apiclient.go:331 dialing "wss://10.10.197.29:17070/environment/f388baa3-77af-4b05-8f0f-5f277e626e58/api"
2015-10-28 09:29:57 INFO juju.api apiclient.go:331 dialing "wss://10.10.197.30:17070/environment/f388baa3-77af-4b05-8f0f-5f277e626e58/api"
2015-10-28 09:30:03 INFO juju.api apiclient.go:339 error dialing "wss://10.10.197.27:17070/environment/f388baa3-77af-4b05-8f0f-5f277e626e58/api": websocket.Dial wss://10.10.197.27:17070/environment/f388baa3-77af-4b05-8f0f-5f277e626e58/api: dial tcp 10.10.197.27:17070: no route to host
2015-10-28 09:30:03 INFO juju.api apiclient.go:263 connection established to "wss://10.10.197.29:17070/environment/f388baa3-77af-4b05-8f0f-5f277e626e58/api"
2015-10-28 09:30:12 DEBUG juju.worker runner.go:196 "api" started

Of course, since there is nothing on the end of 10.10.197.27, we get 'no route to host'. As machines 4 and 5 (the other state servers) still think machine 0 is in the peer group, I imagine that is why we see this repeatedly in the logs (this is from machine-4.log):

2015-10-28 09:19:08 INFO juju.state.watcher watcher.go:121 watcher loop failed: watcher iteration error: EOF
2015-10-28 09:19:08 ERROR juju.worker.peergrouper worker.go:138 peergrouper loop terminated: tomb: dying
2015-10-28 09:19:08 DEBUG juju.cmd.jujud machine.go:1604 worker "peergrouper" exited with watcher iteration error: EOF
2015-10-28 09:19:08 INFO juju.worker runner.go:275 stopped "peergrouper", err: watcher iteration error: EOF
2015-10-28 09:19:08 DEBUG juju.cmd.jujud machine.go:1604 worker "certupdater" exited with watcher iterat...
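
Given the note above that a replacement state server is not created automatically, the manual workaround is to ask for HA again once the environment is reachable; a sketch (field names as in the 1.2x YAML status output):

# Re-running ensure-availability asks Juju to bring the environment back up to
# 3 healthy state servers; it should provision a replacement for the dead one.
juju ensure-availability -n 3
# Then watch for the new state server machine to reach "started":
juju status --format=yaml | grep -E 'state-server-member-status|agent-state'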


Changed in juju-core:
status: Triaged → In Progress
Revision history for this message
Katherine Cox-Buday (cox-katherine-e) wrote :

James has said we should look at the peer grouper.

Revision history for this message
Katherine Cox-Buday (cox-katherine-e) wrote :

James also has some Python scripts he will be uploading to correlate log entries with the times things happened.

Revision history for this message
Katherine Cox-Buday (cox-katherine-e) wrote :

Need to validate whether this is happening in 1.26.

Revision history for this message
James Tunnicliffe (dooferlad) wrote :

The scripts I have hacked together to debug this are here: https://github.com/dooferlad/jujuWand/tree/master

I will try not to make them any more broken or less commented! What I have now is at rev 2 in case you need to rewind to this point in time.

Changed in juju-core:
assignee: James Tunnicliffe (dooferlad) → nobody
Changed in juju-core:
assignee: nobody → Wayne Witzel III (wwitzel3)
Revision history for this message
Wayne Witzel III (wwitzel3) wrote :

I verified this is happening in 1.26 as well. I am able to reproduce using the lxd provider, which speeds up iterating on and debugging the deployments.

Changed in juju-core:
milestone: 1.25.1 → 1.25.2
Tim Penhey (thumper)
Changed in juju-core:
assignee: Wayne Witzel III (wwitzel3) → Tim Penhey (thumper)
Revision history for this message
Tim Penhey (thumper) wrote :

I have replicated this on 1.26 with lxd with the following observations:

* firstly, after about 20-30 minutes it appears to have all settled down (not actually true)
* I killed machine-0 and it always shows as started in status, never down
* both machines 3 and 4 (the new state server machines) only list the machine-0 address for state addresses
  - something running in those agents is attempting to connect to just the down machine-0 address for mongo
* the machine agent for machine-1 was never shown as lost, but for some reason took 15 minutes to return from dialing the API server addresses.

2015-11-24 01:30:58 INFO juju.worker runner.go:269 start "api"
2015-11-24 01:30:58 INFO juju.api apiclient.go:476 dialing "wss://10.0.3.49:17070/environment/a0b3e91b-e80f-4ac8-8264-3b250e8f7599/api"
2015-11-24 01:30:58 INFO juju.api apiclient.go:476 dialing "wss://10.0.3.202:17070/environment/a0b3e91b-e80f-4ac8-8264-3b250e8f7599/api"
2015-11-24 01:30:58 INFO juju.api apiclient.go:476 dialing "wss://10.0.3.218:17070/environment/a0b3e91b-e80f-4ac8-8264-3b250e8f7599/api"
2015-11-24 01:31:01 INFO juju.api apiclient.go:484 error dialing "wss://10.0.3.49:17070/environment/a0b3e91b-e80f-4ac8-8264-3b250e8f7599/api": websocket.Dial wss://10.0.3.49:17070/environment/a0b3e91b-e80f-4ac8-8264-3b250e8f7599/api: dial tcp 10.0.3.49:17070: getsockopt: no route to host
2015-11-24 01:44:25 INFO juju.api apiclient.go:269 connection established to "wss://10.0.3.202:17070/environment/a0b3e91b-e80f-4ac8-8264-3b250e8f7599/api"

* the dependency engine used by the unit agent seemed to get itself into a pathological state, and took some time to reconnect everything when the server went away.
* the unit agent log file has timestamps recorded out of order (weird) - although it appears that this is only during the period when machine-0 was stopped
* it appears that as the uniter restarted in the unit agent, it re-ran the config-changed hook
* the "rsyslog-config-updater" worker is stuck restarting as it is only trying to look at the old machine-0 address
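
A quick way to see which state/API addresses an agent has actually recorded, relating to the observation above that machines 3 and 4 only list the machine-0 address; a sketch assuming the 1.x agent.conf layout with its stateaddresses/apiaddresses keys:

# On one of the new state server machines, dump the addresses its agent knows about.
juju ssh 3 'sudo grep -A 4 -E "stateaddresses|apiaddresses" /var/lib/juju/agents/machine-3/agent.conf'
# If only the dead machine-0 address is listed, workers such as the
# rsyslog-config-updater above have nothing else to dial.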

Revision history for this message
Tim Penhey (thumper) wrote :

OK, well at least I have eliminated one problem. The log elements out of order seem to be somehow related to the combination of juju ssh, less, and the lxd machine.

Copying the file from the lxd machine to the host shows that the log file is in fact in the correct order.

Which is good, because I was really beginning to wonder WTF was going on.

Changed in juju-core:
milestone: 1.25.2 → 1.25.3
Revision history for this message
Tim Penhey (thumper) wrote :

The fundamental issue here is that, once the machine dies, the rest of Juju doesn't appear to realise it. This seems to be the foundation upon which the other failures are layered.

I'm looking into the instance poller and presence code to work out where the failure is.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.3 → 2.0-alpha1
Changed in juju-core:
milestone: 2.0-alpha1 → 2.0-alpha2
Tim Penhey (thumper)
Changed in juju-core:
assignee: Tim Penhey (thumper) → nobody
Changed in juju-core:
status: In Progress → Triaged
milestone: 2.0-alpha2 → 2.0-alpha3
Changed in juju-core:
milestone: 2.0-alpha3 → 2.0-beta4
Changed in juju-core:
milestone: 2.0-beta4 → none
importance: Critical → High
Revision history for this message
Benjamin Kaehne (ben-kaehne) wrote :

This bug is also affecting me with juju 1.25.5 deploying xenial.

I notice that, for some reason or another, the initial state server will become unreachable; agents will be lost, and when leadership elections are performed for other parts of the environment, various machine units will hang.

I never see the machine units try to connect to the other state servers.
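
One way to check whether an affected machine agent ever attempts the other state servers is to look for the juju.api "dialing" lines (as in the logs earlier in this bug) in its local log; a sketch assuming the standard /var/log/juju location, with machine 1 as an example:

# Every connection attempt is logged as a "dialing" line; the addresses show
# which state servers the agent is actually trying.
juju ssh 1 'grep "dialing" /var/log/juju/machine-1.log | tail -n 20'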

tags: added: canonical-bootstack
affects: juju-core → juju
Changed in juju-core:
importance: Undecided → High
status: New → Triaged
Changed in juju-core:
status: Triaged → Won't Fix
Changed in juju:
milestone: none → 2.0-beta18
Changed in juju:
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released