Intermittent HA controller down

Bug #1973323 reported by Juan M. Tirado
This bug affects 3 people
Affects:      Canonical Juju
Status:       Triaged
Importance:   High
Assigned to:  Unassigned
Milestone:    (none)

Bug Description

During the triage of bug https://bugs.launchpad.net/juju/+bug/1973164 I found that, after rebooting one of the controllers, the status command shows one of the controllers in an endless started->down loop.

To reproduce, use the same steps mentioned in the bug above and then reboot one of the controllers:

juju bootstrap localhost lxd-lcl-controller
juju add-machine -m controller -n 2
juju enable-ha --to 1,2

Find the HA primary:

  controller-machines:
    "0":
      instance-id: juju-4fe7ec-0
      ha-status: ha-enabled
      ha-primary: true
PRIMARY=0
juju ssh -m controller $PRIMARY -- sudo reboot
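
For reference, a minimal sketch of scripting the same steps, assuming jq is installed and that the JSON output of show-controller uses the same controller-machines / ha-primary field names as the YAML above:

# Locate the machine flagged as ha-primary and reboot it (sketch; field names are
# assumed to match the YAML snippet above).
PRIMARY=$(juju show-controller --format json \
  | jq -r '.[] | .["controller-machines"] | to_entries[] | select(.value["ha-primary"] == true) | .key')
juju ssh -m controller "$PRIMARY" -- sudo reboot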

In my case I rebooted controller 0. Then, juju status reports:

Model       Controller          Cloud/Region         Version  SLA          Timestamp
controller  lxd-lcl-controller  localhost/localhost  2.9.29   unsupported  14:53:51+02:00

Machine  State    DNS           Inst id        Series  AZ  Message
0        started  10.73.25.245  juju-c924af-0  focal       Running
1        started  10.73.25.60   juju-c924af-1  focal       Running
2        down     10.73.25.130  juju-c924af-2  focal       Running

and then...

Model       Controller          Cloud/Region         Version  SLA          Timestamp
controller  lxd-lcl-controller  localhost/localhost  2.9.29   unsupported  14:54:40+02:00

Machine  State    DNS           Inst id        Series  AZ  Message
0        started  10.73.25.245  juju-c924af-0  focal       Running
1        started  10.73.25.60   juju-c924af-1  focal       Running
2        started  10.73.25.130  juju-c924af-2  focal       Running

This might be a concurrency issue, since the problem only appears intermittently.
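
A simple way to capture the flapping over time is to poll the agent state of the affected machine (a sketch, assuming jq is available and the usual juju-status field layout in the JSON status output):

# Log the agent state of machine 2 every few seconds to catch the started->down loop.
while true; do
  state=$(juju status -m controller --format json | jq -r '.machines["2"]["juju-status"].current')
  echo "$(date -Is) machine 2: $state"
  sleep 5
done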

Tags: sts
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Please upload controller logs for investigative purposes. My understanding is that this is not always reproducible.

Revision history for this message
Juan M. Tirado (tiradojm) wrote :

I attach the logs for the three machines.

Revision history for this message
Heather Lanigan (hmlanigan) wrote :

I have hit the same result in a different way, after a controller upgrade.

The 2 secondary agents are going up and down.

Not sure why yet.

Nothing stands out in the logs provided in comment #2.

Changed in juju:
importance: Undecided → High
status: New → Triaged
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

A side effect is that agents of machines in other models go down and back up too. Uniter and leadership workers for units on those machines are restarting a lot (started-count: 1560).
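
The per-worker start counts can be read from the dependency engine report on an affected machine (a sketch; it assumes the juju introspection helpers such as juju_engine_report are sourced into login shells on juju-managed machines, and the model/machine names are placeholders):

# Dump the dependency engine report from a machine in an affected model and look
# for workers with unusually high start counts. "mymodel" and machine 0 are placeholders.
juju ssh -m mymodel 0 -- bash -lc 'juju_engine_report' > engine-report.yaml
grep -n -i 'count' engine-report.yaml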

Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
John A Meinel (jameinel)
Changed in juju:
milestone: 2.9.31 → none
Revision history for this message
Joseph Phillips (manadart) wrote :

I believe I experienced this with the develop HEAD.

I had a controller on AWS where one node kept reporting "down" every few seconds, but there were no errors in any of the logs, and status history showed the controller unit as "idle" since deployment.
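
For a case like this, the status history can also be pulled per entity with show-status-log, to confirm nothing was recorded while the node was reported down (the machine id, model name, and unit name below are placeholders):

juju show-status-log -m controller 2       # the flapping controller machine
juju show-status-log -m mymodel ubuntu/0   # a unit on an affected model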

Revision history for this message
Arif Ali (arif-ali) wrote :

Hi, we've had a similar issue at a few sites now with 2.9.29. Looking at the logs, we first get multiple WARNINGs about a failed TLS handshake, like the one below, and these appear on all 3 controllers:

WARNING juju.mongo open.go:166 TLS handshake failed: EOF

After this, we get the following error several times across all the hosts:

ERROR juju.rpc server.go:600 error writing response: *tls.permanentError write tcp <snip ip>:17070-><snip ip>:49096: write: connection reset by peer

This is when it starts to misbehave: other models start reporting various units as down, and then eventually the 2 secondary controllers start fluctuating.

Both sets of logs I have contain a similar set of entries.
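
To check whether a given controller shows the same sequence, the machine agent logs can be grepped directly (a sketch; /var/log/juju/machine-*.log is the default machine agent log location):

# Run on each controller machine; the timestamps show whether the TLS and RPC errors
# line up with the point where the agents start flapping.
grep -E 'TLS handshake failed|connection reset by peer' /var/log/juju/machine-*.log | tail -n 50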

Restarting the jujud-machine service on the controller machines where mongodb is SECONDARY works around the problem until the next time.
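
A rough sketch of that workaround, assuming the machine agents run as jujud-machine-<id> systemd units (the machine numbers are placeholders for the current SECONDARYs):

# Restart the machine agents on the non-primary (mongodb SECONDARY) controllers.
for m in 1 2; do
  juju ssh -m controller "$m" -- sudo systemctl restart "jujud-machine-$m.service"
done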

tags: added: sts
Changed in juju:
assignee: Heather Lanigan (hmlanigan) → nobody