HA Juju controllers showing inconsistent status
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju | New | Undecided | Joseph Phillips | | |
Bug Description
Hi,
We've experienced an issue with the consistency and stability of our Juju controllers, and are struggling to pinpoint what's actually happening.
We're operating an HA controller set, running Juju 2.9.42, deployed in an OpenStack cloud.
Symptoms we've observed have been:
* Issues with the stability of relation hooks in deployed models (we have observed issues with relations being created, updated, and departed)
* Controllers returning inconsistent "juju status" results
When running "juju status --debug" to make sure we get one result from each controller, we have observed that at least one controller will consistently return a different result than the other(s).
For example, this paste shows both secondary controllers reporting the primary controller as "agent-lost", while the primary disagrees: https:/
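One way to make this kind of disagreement explicit is to collect the status each controller reports and diff the views. The following is a minimal sketch, not a Juju tool: the function name, the controller names, and the sample data are all hypothetical, and it assumes the per-controller results have already been gathered into plain dictionaries.

```python
from typing import Dict


def find_disagreements(reports: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Return entities whose reported status differs between controllers.

    `reports` maps a controller name to {entity: status} as seen by that
    controller. Entities missing from a view are reported as "<missing>".
    """
    entities = set()
    for view in reports.values():
        entities.update(view)

    disagreements = {}
    for entity in sorted(entities):
        seen = {ctrl: view.get(entity, "<missing>") for ctrl, view in reports.items()}
        # More than one distinct value means the controllers disagree.
        if len(set(seen.values())) > 1:
            disagreements[entity] = seen
    return disagreements


# Hypothetical data modelled on the symptom described above:
# two controllers see machine-0 as "agent-lost", one does not.
example = {
    "controller-0": {"machine-0": "started", "machine-1": "started", "machine-2": "started"},
    "controller-1": {"machine-0": "agent-lost", "machine-1": "started", "machine-2": "started"},
    "controller-2": {"machine-0": "agent-lost", "machine-1": "started", "machine-2": "started"},
}
print(find_disagreements(example))
```

Recording the output of such a comparison together with timestamps may make it easier to line the disagreement up with the controller logs.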
Controller logs from the period in question have been made available via secure portal https:/
Model logs for the specific model in which we observed relation hook issues are located in "special-request" under that directory.
Please advise if there are any additional logs we should supply, any metrics we can gather from the time, or anything else.
Thanks!
## ADDED ###
A customer (using 2.9.49 and 2.9.42) faced the same issue, and I also see it intermittently (though it is not easy to reproduce) when one of the controllers is restarted.
I analyzed a bit of the Juju code below.
https:/
it seems that Presence(
It seems that when one of the controllers is rebooted, that value either changes or remains down.
I'm not sure about 3.x, but is there a related patch for 3.x? If so, can it be backported?
I will keep analyzing the code, but I would appreciate any advice the Juju team can give.
Thanks a lot.
tags: | added: canonical-is |
Changed in juju: | |
status: | New → Incomplete |
Changed in juju: | |
status: | Incomplete → New |
Changed in juju: | |
status: | New → Triaged |
assignee: | nobody → Joseph Phillips (manadart) |
Changed in juju: | |
status: | Triaged → Incomplete |
description: | updated |
tags: | added: sts |
The controllers share agent connectivity info (aka presence) using pubsub. I don't think there's an explicit delivery guarantee for such messages.
Logging to turn on would be
juju.worker.pubsub=TRACE presence=TRACE
juju.worker.
Logs could also contain messages matching the format string
"%p programming error, e.ch=%v did not accept %v - missing Unwatch?\nwatch source:\n%s"
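To check collected logs for that message, a small filter along these lines may help. It is only a sketch: the regular expression is derived from the format string above, the function name is hypothetical, and the sample log lines are invented for illustration rather than taken from real controller output.

```python
import re
from typing import Iterable, List

# Pattern derived from the format string quoted above; the %p and %v
# placeholders are matched loosely because their rendering varies.
UNWATCH_ERROR = re.compile(
    r"programming error, e\.ch=.+? did not accept .+? - missing Unwatch\?"
)


def find_unwatch_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match the 'missing Unwatch?' message."""
    return [line.rstrip("\n") for line in lines if UNWATCH_ERROR.search(line)]


# Invented sample lines, for illustration only.
sample = [
    "0xc000123456 programming error, e.ch=0xc000abcdef did not accept {7 42} - missing Unwatch?",
    "watch source:",
    "machine-0: unrelated log line",
]
print(find_unwatch_errors(sample))
```

Because the message spans several lines (the watch source follows on the next lines), it may be worth printing a few lines of context after each match when running this against real logs.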
Extra relation debug can be obtained by setting
juju.worker.uniter.relation=TRACE
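These logger settings are combined into a single semicolon-separated `logging-config` value. The sketch below assembles such a value from the logger names mentioned above, written without the stray spaces. The helper name and the `<root>=INFO` default are assumptions made for illustration; the resulting string would typically be passed to `juju model-config logging-config="..."` on the relevant model.

```python
from typing import Dict

# Log levels accepted by Juju's logging configuration.
VALID_LEVELS = {"TRACE", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def build_logging_config(levels: Dict[str, str], root: str = "INFO") -> str:
    """Build a logging-config string such as '<root>=INFO;some.logger=TRACE'.

    `levels` maps logger names to levels. The root default of INFO is an
    assumption for this sketch, not a recommendation from the Juju team.
    """
    for name, level in {"<root>": root, **levels}.items():
        if level not in VALID_LEVELS:
            raise ValueError(f"unknown level {level!r} for logger {name!r}")
    parts = [f"<root>={root}"]
    # Sort for a stable, reproducible string.
    parts += [f"{name}={level}" for name, level in sorted(levels.items())]
    return ";".join(parts)


config = build_logging_config(
    {
        "juju.worker.pubsub": "TRACE",
        "juju.worker.uniter.relation": "TRACE",
    }
)
print(config)
```

TRACE output is verbose, so it is probably best to revert to the previous configuration once the relevant period has been captured.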