Hi,
In a K8s model, the elected leader can sometimes be a unit that is in an error state (with its pod no longer around) or a terminating one; see the `juju status` snapshots below (2022-02-10), where leadership is held by mattermost/36* (terminated) in one snapshot and by mattermost/35* (error) in the other. Juju should not allow non-"active" units to be elected as the leader:
|
| Model Controller Cloud/Region Version SLA Timestamp
| prod-charmhub-mattermost prodstack-is-2 k8s-is-external/default 2.9.18 unsupported 03:13:27Z
|
| SAAS Status Store URL
| postgresql active prodstack-is-2 admin/prod-charmhub-mattermost-db.postgresql
|
| App Version Status Scale Charm Store Channel Rev OS Address Message
| mattermost mattermost:5.39.0-canonical active 8/5 mattermost local stable 4 kubernetes 10.85.0.39
|
| Unit Workload Agent Address Ports Message
| mattermost/35 error idle 10.86.56.182 8065/TCP hook failed: "db-relation-changed"
| mattermost/36* terminated failed 10.86.77.171 8065/TCP unit stopped by the cloud
| mattermost/37 terminated failed 10.86.56.205 8065/TCP unit stopped by the cloud
| mattermost/39 error idle 10.86.77.232 8065/TCP hook failed: "db-relation-created"
| mattermost/40 error idle 10.86.56.34 8065/TCP hook failed: "leader-settings-changed"
| mattermost/49 active idle 10.86.56.84 8065/TCP
| mattermost/50 active idle 10.86.56.85 8065/TCP
| mattermost/51 active idle 10.86.77.10 8065/TCP
| mattermost/52 active idle 10.86.77.11 8065/TCP
| mattermost/53 active idle 10.86.56.86 8065/TCP
|
| Model Controller Cloud/Region Version SLA Timestamp
| prod-charmhub-mattermost prodstack-is-2 k8s-is-external/default 2.9.18 unsupported 03:13:02Z
|
| SAAS Status Store URL
| postgresql active prodstack-is-2 admin/prod-charmhub-mattermost-db.postgresql
|
| App Version Status Scale Charm Store Channel Rev OS Address Message
| mattermost mattermost:5.39.0-canonical active 8/5 mattermost local stable 4 kubernetes 10.85.0.39
|
| Unit Workload Agent Address Ports Message
| mattermost/35* error idle 10.86.56.182 8065/TCP hook failed: "db-relation-changed"
| mattermost/36 terminated failed 10.86.77.171 8065/TCP unit stopped by the cloud
| mattermost/37 terminated failed 10.86.56.205 8065/TCP unit stopped by the cloud
| mattermost/39 error idle 10.86.77.232 8065/TCP hook failed: "db-relation-created"
| mattermost/40 error idle 10.86.56.34 8065/TCP hook failed: "leader-settings-changed"
| mattermost/49 active idle 10.86.56.84 8065/TCP
| mattermost/50 active idle 10.86.56.85 8065/TCP
| mattermost/51 active idle 10.86.77.10 8065/TCP
| mattermost/52 active idle 10.86.77.11 8065/TCP
| mattermost/53 active idle 10.86.56.86 8065/TCP
I had to run `juju_revoke_lease` multiple times before an "active" unit was eventually elected and things became functional again.
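For reference, a rough sketch of the manual workaround, assuming the standard agent introspection functions sourced into the shell on a controller machine (the model UUID below is a placeholder, and exact flags may vary by Juju version):

```shell
# SSH into a controller machine; the introspection functions
# (juju_engine_report, juju_revoke_lease, ...) are available there.
juju ssh -m controller 0

# Revoke the application-leadership lease so another unit can be
# elected. The model UUID here is a placeholder. As seen in this
# report, it may need repeating if an unhealthy unit wins again.
juju_revoke_lease -m 2af76187-5b55-4ecb-a26c-2d8532666ae5 -l mattermost
```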
This usually happens when the application has many units and the operator pod is simply too busy: the uniters running on the operator cannot process those events and state changes quickly enough (or may even be stuck temporarily). The new units should stabilize, and the dead units should be removed, after some time. Because Juju does not define resource limits/requests for the operator pod, the pod can consume as much as it can get from the k8s worker node, so we don't know the exact maximum number of units a single operator can manage.
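A quick way to confirm that no requests/limits are set on the operator pod is to query its container resources directly; the namespace and pod name below are illustrative, following the usual <model-name> / <application>-operator-0 pattern:

```shell
# Inspect the resources section of the operator pod's container spec
# (namespace and pod name are placeholders for this deployment).
kubectl -n prod-charmhub-mattermost get pod mattermost-operator-0 \
  -o jsonpath='{.spec.containers[*].resources}'
# An empty result means no requests/limits are defined, so the
# operator is only bounded by what the worker node can provide.
```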