flapping presence on MAAS in HA when controller shut down

Bug #1818041 reported by Christian Muirhead
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

Bootstrap a controller on MAAS and enable HA. Stop controller machine 2.

juju status -m controller will take a long time to change the state of machine 2 to down, and it will oscillate between started and down for a while (depending on which of the remaining controller machines the client asks for the status).

Running juju_presence_report and juju_pubsub_report on the running controller machines also shows the discrepancy - controller 0 will indicate that controller 2 is missing but controller 1 will show it as alive/connected. (See below)

Eventually (after about 10 mins?) controller 1 will notice that controller 2 is gone and the presence will stop flapping.

[controller 0]
ubuntu@nuc2:~$ juju_presence_report
Querying @jujud-machine-0 introspection socket: /presence/
[5983ba1d-ce0b-4b86-8a8d-1d9b4dfdd92e]

AGENT SERVER CONN ID STATUS
machine-0 machine-0 4 alive
machine-0 machine-0 6 alive
machine-0 machine-1 17 alive
machine-0 machine-2 8 missing
machine-1 machine-0 2 alive
machine-1 machine-1 8 alive
machine-1 machine-1 10 alive
machine-1 machine-2 6 missing
machine-2 machine-2 2 missing
machine-2 machine-2 4 missing

[5a08305e-027a-41c4-8ed9-ebd98d431aa7]

AGENT SERVER CONN ID STATUS
machine-0 (controller) machine-0 5 alive
machine-0 machine-1 2 alive
machine-1 (controller) machine-1 9 alive
machine-2 (controller) machine-2 3 missing
unit-ubuntu-lite-4 machine-0 399 alive
unit-ubuntu-lite-5 machine-0 402 alive
unit-ubuntu-lite-6 machine-0 403 alive

ubuntu@nuc2:~$ juju_pubsub_report
Querying @jujud-machine-0 introspection socket: /pubsub
PubSub Report:

Source: machine-0

Target: machine-1
  Status: connected
  Addresses: [10.0.0.170:17070]
  Queue length: 0
  Sent count: 148270

Target: machine-2
  Status: disconnected
  Addresses: [10.0.0.32:17070]
  Queue length: 0
  Sent count: 2183

[controller 1]
ubuntu@nuc7:~$ juju_presence_report
Querying @jujud-machine-1 introspection socket: /presence/
[5983ba1d-ce0b-4b86-8a8d-1d9b4dfdd92e]

AGENT SERVER CONN ID STATUS
machine-0 machine-0 4 alive
machine-0 machine-0 6 alive
machine-0 machine-1 17 alive
machine-0 machine-2 8 alive
machine-1 machine-0 2 alive
machine-1 machine-1 8 alive
machine-1 machine-1 10 alive
machine-1 machine-2 6 alive
machine-2 machine-2 2 alive
machine-2 machine-2 4 alive

[5a08305e-027a-41c4-8ed9-ebd98d431aa7]

AGENT SERVER CONN ID STATUS
machine-0 (controller) machine-0 5 alive
machine-0 machine-1 2 alive
machine-1 (controller) machine-1 9 alive
machine-2 (controller) machine-2 3 alive
unit-ubuntu-lite-4 machine-0 399 alive
unit-ubuntu-lite-5 machine-0 402 alive
unit-ubuntu-lite-6 machine-0 403 alive

ubuntu@nuc7:~$ juju_pubsub_report
Querying @jujud-machine-1 introspection socket: /pubsub
PubSub Report:

Source: machine-1

Target: machine-0
  Status: connected
  Addresses: [10.0.0.156:17070]
  Queue length: 0
  Sent count: 14468

Target: machine-2
  Status: connected
  Addresses: [10.0.0.32:17070]
  Queue length: 0
  Sent count: 760

Revision history for this message
Tim Penhey (thumper) wrote :

I think we should have the presence worker publish "I'm alive" messages over UDP on the controller port, falling back to the apiserver port if the controller port isn't there.

A UDP packet every second isn't a big issue, and we can mark an agent as failed if we think it is alive but we don't get a ping for 5 seconds.
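
A minimal sketch of what such a scheme might look like in Go (the 1 second/5 second timings are the ones suggested above; the addresses, message format and helper names are illustrative assumptions, not an actual Juju implementation):

// Illustrative sketch only: the heartbeat wire format and everything
// beyond the 1s/5s timings from the comment above are assumptions.
package presence

import (
	"fmt"
	"net"
	"time"
)

// sendHeartbeats emits an "I'm alive" datagram to a peer controller once a second.
func sendHeartbeats(peer, agentTag string, stop <-chan struct{}) error {
	conn, err := net.Dial("udp", peer)
	if err != nil {
		return err
	}
	defer conn.Close()
	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case <-stop:
			return nil
		case <-tick.C:
			if _, err := fmt.Fprintf(conn, "alive %s\n", agentTag); err != nil {
				return err
			}
		}
	}
}

// watchHeartbeats listens for those datagrams and calls missing() whenever
// nothing has arrived for five seconds.
func watchHeartbeats(listenAddr string, missing func()) error {
	pc, err := net.ListenPacket("udp", listenAddr)
	if err != nil {
		return err
	}
	defer pc.Close()
	buf := make([]byte, 256)
	for {
		_ = pc.SetReadDeadline(time.Now().Add(5 * time.Second))
		if _, _, err := pc.ReadFrom(buf); err != nil {
			if ne, ok := err.(net.Error); ok && ne.Timeout() {
				missing() // no ping within 5 seconds: treat the agent as down
				continue
			}
			return err
		}
	}
}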

Revision history for this message
Tim Penhey (thumper) wrote :

Discussed this with Joel in IS; he strongly suggested not using UDP for heartbeats if we can help it.

Discussed TCP keepalives further with Joel and read more around this too. Keepalive probes are only sent after a period of inactivity, and the connection isn't killed until a number of probes have failed; the default on Linux is 9.

Checked the Go source: when you set a keepalive period, it sets the socket options for both the interval and the idle time. The idle time is how long the connection needs to be idle before keepalive probes are sent, and the interval is the duration between probes. The Linux defaults are too long to be useful.
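
For reference, a minimal sketch of what this looks like when dialling in Go (the address argument and the chosen period are just examples):

package presence

import (
	"net"
	"time"
)

// dialWithKeepAlive dials a controller and sets a short TCP keepalive.
// Go's SetKeepAlivePeriod sets both TCP_KEEPIDLE (how long the connection
// must be idle before probes start) and TCP_KEEPINTVL (the gap between
// probes) to the same value, overriding the Linux defaults of 2 hours and
// 75 seconds respectively.
func dialWithKeepAlive(addr string, period time.Duration) (net.Conn, error) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	tcp := conn.(*net.TCPConn)
	if err := tcp.SetKeepAlive(true); err != nil {
		conn.Close()
		return nil, err
	}
	if err := tcp.SetKeepAlivePeriod(period); err != nil {
		conn.Close()
		return nil, err
	}
	return conn, nil
}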

Note that keepalives are only sent while the connection is idle for data. If there is a pending message it will be sent, and retransmitted if not acked; if the packet isn't acked then it is retransmitted after about 100 seconds. The current hypothesis is that the socket is considered idle during this waiting time, so a short keepalive timeout should be able to close the socket while it is awaiting an ack.

The API clients, both for normal API connections and for the streaming log connections, should use a 15 second keepalive time.

For the controller streaming websockets we should set a low TCP keepalive timeout. If we used a one second keepalive for the pubsub stream connections, then when the other side goes away we should see a socket disconnection after approximately ten seconds. A very low TCP keepalive for just the pubsub connections isn't likely to have much impact, as we are only talking about six connections in total for a three-node HA cluster.
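
A sketch of how those two periods might be wired up with net.Dialer (the split into two dialers is an assumption; only the 15 second and 1 second figures come from the comments above):

package presence

import (
	"net"
	"time"
)

// Setting Dialer.KeepAlive makes Go enable TCP keepalives on each new
// connection with the given period; these dialers are a sketch of the
// suggestion above, not Juju's actual wiring.

// apiDialer: normal API connections and log streaming clients.
var apiDialer = &net.Dialer{KeepAlive: 15 * time.Second}

// pubsubDialer: controller-to-controller pubsub websockets. With the
// Linux default of 9 probes, a one second period should surface a dead
// peer as a socket error after roughly ten seconds.
var pubsubDialer = &net.Dialer{KeepAlive: time.Second}

func dialPubsub(addr string) (net.Conn, error) {
	return pubsubDialer.Dial("tcp", addr)
}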

TCP details checked here: http://man7.org/linux/man-pages/man7/tcp.7.html

Tim Penhey (thumper)
Changed in juju:
importance: Undecided → High
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot