Canonical Juju

HA Juju controllers showing inconsistent status

Bug #2020849 reported by James Simpson on 2023-05-26

This bug affects 5 people

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	New	Undecided	Joseph Phillips

Bug Description

Hi,

We've experienced an issue with the consistency and stability of our Juju controllers, and are struggling to pinpoint what's actually happening.

We're operating a HA controller set, running Juju 2.9.42, deployed in an Openstack cloud.

Symptoms we've observed have been:

* Issues with the stability of relationship hooks in deployed models (we have observed issues with relationships being created, updated, and departed)

* Controllers returning inconsistent "juju status" results
When running "juju status --debug" to make sure we get one result from each controller, we have observed that at least one controller will consistently return a different result than the other(s).
For example, this paste shows both secondary controllers reporting the primary controller as "agent-lost", while the primary disagrees: https://pastebin.ubuntu.com/p/VHrwFbm79Z/

Controller logs from the period in question have been made available via secure portal https://juju-controller-reports.admin.canonical.com/ps5-prodstack-is/

Model logs for the specific model in which we observed relationship hook issues are located in "special-request" under that directory.

Please advise if there are any additional logs we should supply, any metrics we can gather from the time, or anything else.

Thanks!

## ADDED ###

A customer(using 2.9.49, 2,9,42) faced the same issue. and I also see this intermittently(not easy to reproduce though) when one of controllers are restarted.

I analyzed a bit of juju code below.

https://pastebin.ubuntu.com/p/kZDTB7PDtZ/

it seems that Presence(connections..?) has different value (in this case 'down' for machine or 'agent lost' for unit) then DB info. and if c.Presence.AgentStatus has down, juju status shows us it is down.

It seems that when one of controllers are rebooted, that value has changed or remained down.

I'm not sure about 3.x but is there related patch for 3.x? and can we backport it if there is?

I keep analyzing code but I would appreciate if juju team can give me any advice.

Thanks a lot.

See original description

Tags:

Tom Haddon (mthaddon) on 2023-05-26

tags:

added: canonical-is

Revision history for this message

Ian Booth (wallyworld) wrote on 2023-05-28:

The controllers share agent connectivity info (aka presence) using pubsub. I don't think there's an explicit delivery guarantee for such messages.

Logging to turn on would be

juju.worker.pubsub=TRACE
juju.worker.presence=TRACE

Logs could also contain messages matching the format string

"%p programming error, e.ch=%v did not accept %v - missing Unwatch?\nwatch source:\n%s"

Extra relation debug can be obtained by setting

juju.worker.uniter.relation=TRACE

Vitaly Antonenko (anvial) on 2023-06-01

Changed in juju:
status:	New → Incomplete

Revision history for this message

James Simpson (jsimpso) wrote on 2023-06-01:

Thanks Ian - I've configured those via "juju model-config -m controller logging-config=...", next time we experience an de-sync event do we just need to grab the same logs as we provided above?

Junien F (axino) on 2023-06-02

Changed in juju:
status:	Incomplete → New

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-06-15:

Looking through the initial logs supplied, I can see on machine 2 that we were spewing 404s around 25/5.
https://pastebin.canonical.com/p/3ZkgCDRpCt/

It almost looks as though the HTTP server worker was down, then managed to come up. This *might* explain inconsistencies.

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-06-15:

Then there were a great deal of transaction issues:
https://pastebin.canonical.com/p/HD36KQxrWv/

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-06-15:

Connectivity issues continue for some time with connection resets and broken pipes, then the lease issues that inevitably follow along with model leadership issues.

Joseph Phillips (manadart) on 2023-06-22

Changed in juju:
status:	New → Triaged
assignee:	nobody → Joseph Phillips (manadart)

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-08-09:

Based on the latest logs and their corresponding engine reports, I'm not seeing an obvious cause for inconsistencies.

Can we get some detail on what the alerts indicate when they are raised?

Errors are mostly the usual firewaller worker ones for ipv6-icmp, which will be fixed by upgrading. This should be done in the short term, because it will reduce load and log spam.

I can see a sudden jump up in load on ubuntu/2 here:
https://grafana.admin.canonical.com/d/sR1-JkYmz/juju2-controllers-thumpers?orgId=1&var-controller=prodstack5-prodstack5-prodstack-is&var-host=All&var-node=All&from=1691466766995&to=1691556622995

There's nothing to explain this, and it is obviously affecting API response times.

Revision history for this message

Romain Couturat (romaincout) wrote on 2023-10-06:

around 2023-10-06T13:56:52+00:00 , running
alias jsft='juju status --format tabular'
4 times in a row gave me 3 different results (look at landscape-client scale) https://pastebin.canonical.com/p/QCGgK4kf24/

Revision history for this message

Joseph Phillips (manadart) wrote on 2023-10-09:

When you're seeing this, please capture the output with --debug supplied to the status command.

That will give us some idea of which controller we're actually connecting to for the result.

Joseph Phillips (manadart) on 2023-10-26

Changed in juju:
status:	Triaged → Incomplete

Revision history for this message

Seyeong Kim (seyeongkim) wrote on 2024-06-11:

Hello @manadart

Could you please check description I added once?

Thanks in advance.

description:	updated
Changed in juju:
status:	Incomplete → New

Seyeong Kim (seyeongkim) on 2024-06-11

description:	updated
tags:	added: sts
description:	updated
description:	updated

Revision history for this message

Seyeong Kim (seyeongkim) wrote on 2024-06-14:

#10

problematic controller FullStatus info
https://pastebin.ubuntu.com/p/2PM549bGhY/

non-problematic controller FullStatus info
https://pastebin.ubuntu.com/p/zKmX943VCM/

presence information doesn't seem to be updated.

Revision history for this message

Seyeong Kim (seyeongkim) wrote on 2024-06-14:

#11

I guess this is because race condition.

When one controller is restarted, the other controller gets presence request and run presence response.

presence response run UpdateServer() which update current status in Presence.

but as you can see juju status when one controller is restarted, shortly it shows us down or agent lost and getting back to normal.

While doing that, if presense response runs quickly before the status settle down. then it update Presence as down or agent lost.

This is assumption so I would appreciate if someone in juju team can confirm this.

Revision history for this message

Seyeong Kim (seyeongkim) wrote on 2024-06-14:

#12

So, restarting another controller again fixes the status.

Revision history for this message

John A Meinel (jameinel) wrote on 2024-06-14:

#13

I would certainly try to upgrade from 2.9.42 to 2.9.49. Some of the relation issues that you are seeing are likely fixed in 2.9.45: https://bugs.launchpad.net/juju/+bug/2011277
or
https://bugs.launchpad.net/bugs/2022855

You can see the high level of fixes here:
https://juju.is/docs/juju/roadmap#juju-juju-29-45

That doesn't directly impact `juju status` reporting whether or not the agent is 'lost', but it potentially addresses other issues that you're seeing.

Revision history for this message

Seyeong Kim (seyeongkim) wrote on 2024-06-15:

#14

@jameinel

Thanks for your info, I already tested 2.9.49. Unfortunately I still can see the issue.
reproducer is quite simple but takes time.
(restarting one of controllers again and again)
Thanks a lot.

Revision history for this message

Seyeong Kim (seyeongkim) wrote on 2024-06-19:

#15

3.1.5 has the same issue(3.1.0 doesn't work at deployment stage)
testing later version.

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.