With Juju 2.9-rc3 (829b0c7), a fresh deployment of single units of cs:~postgresql-charmers/postgresql and cs:~discourse-charmers/discourse triggered the following hook failure when setting up the relation. The sole discourse unit confirmed via is-leader that it was the leader, then called pod.set_spec, which failed with an error saying the unit is not the leader.
application-postgresql: 06:26:30 INFO juju.worker.uniter.relation joining relation "discourse:db postgresql:db-admin"
application-postgresql: 06:26:30 INFO juju.worker.uniter.relation joined relation "discourse:db postgresql:db-admin"
application-discourse: 06:26:42 DEBUG jujuc running hook tool "is-leader" for discourse/0-db-relation-changed-3008356497487723991
application-discourse: 06:26:43 DEBUG jujuc running hook tool "juju-log" for discourse/0-db-relation-changed-3008356497487723991
application-discourse: 06:26:43 ERROR unit.discourse/0.juju-log db:1: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 273, in <module>
    main(DiscourseCharm)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/main.py", line 401, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/main.py", line 140, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 234, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 678, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 723, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/pgsql/opslib/pgsql/client.py", line 396, in _on_changed
    self.on.master_changed.emit(**kwargs)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 234, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 678, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 723, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 269, in on_database_changed
    self.configure_pod()
  File "./src/charm.py", line 223, in configure_pod
    self.model.pod.set_spec(pod_spec)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/model.py", line 925, in set_spec
    raise ModelError('cannot set a pod spec as this unit is not a leader')
ops.model.ModelError: cannot set a pod spec as this unit is not a leader
application-discourse: 06:26:43 ERROR juju.worker.uniter.operation hook "db-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
application-discourse: 06:26:43 DEBUG juju.machinelock machine lock released for discourse/0 uniter (run relation-changed (1; app: postgresql) hook)
application-discourse: 06:26:43 DEBUG juju.worker.uniter.operation lock released for discourse/0
application-discourse: 06:26:43 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
application-discourse: 06:26:43 DEBUG juju.worker.uniter [AGENT-STATUS] error: hook failed: "db-relation-changed"
controller-0: 06:27:55 DEBUG juju.worker.caasapplicationprovisioner killing runner 0xc002b10ea0
controller-0: 06:27:55 INFO juju.worker.caasapplicationprovisioner runner is dying
controller-0: 06:27:55 INFO juju.worker.logger logger worker stopped
application-discourse: 06:29:10 ERROR juju.worker.uniter resolver loop error: cannot set status: read tcp 10.32.225.13:33912->10.32.225.13:37017: i/o timeout
application-discourse: 06:29:10 DEBUG juju.worker.uniter [AGENT-STATUS] failed: resolver loop error
application-postgresql: 06:29:10 INFO juju.worker.uniter unit "postgresql/0" shutting down: failed to initialize uniter for "unit-postgresql-0": cannot create relation state tracker: cannot set status: read tcp 127.0.0.1:60038->127.0.0.1:37017: i/o timeout
application-postgresql: 06:29:10 INFO juju.worker.caasoperator stopped "postgresql/0", err: leadership failure: lease operation timed out
application-postgresql: 06:29:10 DEBUG juju.worker.caasoperator "postgresql/0" done: leadership failure: lease operation timed out
application-postgresql: 06:29:10 ERROR juju.worker.caasoperator exited "postgresql/0": leadership failure: lease operation timed out
The is-leader hook tool returns true not simply if the unit is currently the leader, but if it is guaranteed to remain the leader for at least the requested lease duration (30 seconds).
In practice, we grant a lease for twice this duration, and the unit renews the lease each time it believes the requested duration has elapsed - so the lease is extended for another minute, every 30 seconds.
So in theory, is-leader should return true whenever the unit is currently the leader. The duration guarantee exists to mitigate a race where the leader has exhausted its 30-second window and is *right now* negotiating a lease extension.
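The timing described above can be illustrated with a toy model of the lease (the class and constant names here are hypothetical, purely for illustration; they are not Juju's actual internals):

```python
LEASE_REQUEST = 30               # seconds of leadership a unit asks to be guaranteed
LEASE_GRANT = 2 * LEASE_REQUEST  # leases are granted for twice the requested duration

class Lease:
    """Toy sketch of the controller-side leadership lease described above."""

    def __init__(self, now):
        # A fresh grant is valid for the full grant duration (one minute).
        self.expiry = now + LEASE_GRANT

    def renew(self, now):
        # The holder renews each time it believes the requested duration has
        # elapsed, so a healthy leader always has 30-60 seconds remaining.
        self.expiry = now + LEASE_GRANT

    def is_leader(self, now):
        # is-leader answers true only if leadership is guaranteed for the
        # full requested duration from "now", not merely at this instant.
        return self.expiry - now >= LEASE_REQUEST
```

With renewals every 30 seconds the guarantee always holds, but if a renewal is delayed (for example by a lease operation timing out against the controller), `is_leader` can answer false even though no other unit has taken over yet.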
If there is any flux affecting controllers, the lease operation may time out. This includes:
- controller-to-controller communication issues
- establishment of HA
- Raft leader election
I would anticipate that the caasoperator worker is restarted and that things subsequently quiesce. Is this not the case? Does the unit remain in an error/down state?
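Since the race window between the is-leader check and the set_spec call cannot be closed from charm code, one mitigation on the charm side is to tolerate losing leadership in that window rather than letting the hook fail. This is a minimal sketch of that pattern; `FakePod`, `configure_pod`, and the local `ModelError` stand-in are hypothetical names used so the example is self-contained (a real charm would use `ops.model`):

```python
class ModelError(Exception):
    """Stand-in for ops.model.ModelError, defined locally for this sketch."""

class FakePod:
    """Minimal stand-in for model.pod: refuses the spec unless we lead."""

    def __init__(self):
        self.leader = True
        self.spec = None

    def set_spec(self, spec):
        if not self.leader:
            raise ModelError('cannot set a pod spec as this unit is not a leader')
        self.spec = spec

def configure_pod(pod, is_leader, spec):
    """Check leadership, then apply the spec, tolerating the race where
    leadership is lost between the check and the call."""
    if not is_leader():
        return 'skipped'        # not the leader: nothing to do
    try:
        pod.set_spec(spec)
        return 'applied'
    except ModelError:
        # Leadership was lost (or a lease renewal raced) after the check;
        # leave the work to the new leader instead of failing the hook.
        return 'lost-leadership'
```

This does not fix the underlying lease timeout, but it turns the failure mode from a hook error into a no-op that the eventual leader retries.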