lead k8s unit fails to call pod-set-spec because it is not the leader

Bug #1903810 reported by Stuart Bishop
This bug affects 1 person

Affects: Canonical Juju
Status: Won't Fix
Importance: High
Assigned to: Joseph Phillips
Milestone: (none)

Bug Description

With Juju 2.9-rc3-829b0c7, a fresh deployment of single units of cs:~postgresql-charmers/postgresql and cs:~discourse-charmers/discourse triggered the following hook failure while setting up the relation. The sole discourse unit confirmed via is-leader that it was the leader, then called pod-set-spec, which failed with an error stating that the unit is not the leader.

application-postgresql: 06:26:30 INFO juju.worker.uniter.relation joining relation "discourse:db postgresql:db-admin"
application-postgresql: 06:26:30 INFO juju.worker.uniter.relation joined relation "discourse:db postgresql:db-admin"
application-discourse: 06:26:42 DEBUG jujuc running hook tool "is-leader" for discourse/0-db-relation-changed-3008356497487723991
application-discourse: 06:26:43 DEBUG jujuc running hook tool "juju-log" for discourse/0-db-relation-changed-3008356497487723991
application-discourse: 06:26:43 ERROR unit.discourse/0.juju-log db:1: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 273, in <module>
    main(DiscourseCharm)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/main.py", line 401, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/main.py", line 140, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 234, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 678, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 723, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/pgsql/opslib/pgsql/client.py", line 396, in _on_changed
    self.on.master_changed.emit(**kwargs)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 234, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 678, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/framework.py", line 723, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 269, in on_database_changed
    self.configure_pod()
  File "./src/charm.py", line 223, in configure_pod
    self.model.pod.set_spec(pod_spec)
  File "/var/lib/juju/agents/unit-discourse-0/charm/venv/ops/model.py", line 925, in set_spec
    raise ModelError('cannot set a pod spec as this unit is not a leader')
ops.model.ModelError: cannot set a pod spec as this unit is not a leader
application-discourse: 06:26:43 ERROR juju.worker.uniter.operation hook "db-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
application-discourse: 06:26:43 DEBUG juju.machinelock machine lock released for discourse/0 uniter (run relation-changed (1; app: postgresql) hook)
application-discourse: 06:26:43 DEBUG juju.worker.uniter.operation lock released for discourse/0
application-discourse: 06:26:43 INFO juju.worker.uniter awaiting error resolution for "relation-changed" hook
application-discourse: 06:26:43 DEBUG juju.worker.uniter [AGENT-STATUS] error: hook failed: "db-relation-changed"
controller-0: 06:27:55 DEBUG juju.worker.caasapplicationprovisioner killing runner 0xc002b10ea0
controller-0: 06:27:55 INFO juju.worker.caasapplicationprovisioner runner is dying
controller-0: 06:27:55 INFO juju.worker.logger logger worker stopped
application-discourse: 06:29:10 ERROR juju.worker.uniter resolver loop error: cannot set status: read tcp 10.32.225.13:33912->10.32.225.13:37017: i/o timeout
application-discourse: 06:29:10 DEBUG juju.worker.uniter [AGENT-STATUS] failed: resolver loop error
application-postgresql: 06:29:10 INFO juju.worker.uniter unit "postgresql/0" shutting down: failed to initialize uniter for "unit-postgresql-0": cannot create relation state tracker: cannot set status: read tcp 127.0.0.1:60038->127.0.0.1:37017: i/o timeout
application-postgresql: 06:29:10 INFO juju.worker.caasoperator stopped "postgresql/0", err: leadership failure: lease operation timed out
application-postgresql: 06:29:10 DEBUG juju.worker.caasoperator "postgresql/0" done: leadership failure: lease operation timed out
application-postgresql: 06:29:10 ERROR juju.worker.caasoperator exited "postgresql/0": leadership failure: lease operation timed out
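
For charms that still call pod-set-spec, one possible charm-side mitigation is to guard the call with is_leader() and treat a ModelError as a transient leadership/lease problem by deferring the event rather than letting the hook error out. This is a minimal sketch, not the discourse charm's actual code; ExampleCharm, the config-changed handler and the placeholder spec are illustrative:

from ops.charm import CharmBase
from ops.main import main
from ops.model import ModelError


class ExampleCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.config_changed, self._on_config_changed)

    def _on_config_changed(self, event):
        if not self.unit.is_leader():
            # Only the leader may set the pod spec.
            return
        pod_spec = {"version": 3, "containers": []}  # placeholder spec
        try:
            self.model.pod.set_spec(pod_spec)
        except ModelError:
            # Leadership (or the lease check behind it) may have lapsed between
            # is_leader() and set_spec(); defer and retry on a later dispatch.
            event.defer()


if __name__ == "__main__":
    main(ExampleCharm)

This does not address the underlying lease timeout, but it avoids leaving the unit in a hook-error state while the controller recovers.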

Stuart Bishop (stub)
description: updated
Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
importance: Undecided → High
status: New → Triaged
Revision history for this message
Joseph Phillips (manadart) wrote:

The is-leader hook tool returns true not just if the unit is currently the leader, but if it is guaranteed to be the leader for at least the requested lease duration (30 seconds).

In practice, we grant a lease for twice this duration, and the unit extends the lease each time it thinks it has reached the duration it requested - in other words, it renews for another minute every 30 seconds.

So is-leader should, in theory, return true if we are currently the leader. However, the logic also mitigates a race where the leader has reached its 30-second duration and is *right now* negotiating a lease extension.
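
A rough timing sketch of the behaviour described above (the constant names and the class are illustrative, not Juju's lease manager): the claimant asks for 30 seconds, the lease is written for twice that, and a renewal is attempted each time the requested duration elapses:

import time

REQUESTED_DURATION = 30                      # seconds is-leader is asked to guarantee
GRANTED_DURATION = 2 * REQUESTED_DURATION    # lease actually written (60s)


class LeaseSketch:
    """Toy model of the lease lifetime described in this comment."""

    def __init__(self):
        self.expiry = time.monotonic() + GRANTED_DURATION
        self.next_renewal = time.monotonic() + REQUESTED_DURATION

    def tick(self):
        now = time.monotonic()
        if now >= self.next_renewal:
            # Every 30s the holder extends, pushing expiry out another minute.
            self.expiry = now + GRANTED_DURATION
            self.next_renewal = now + REQUESTED_DURATION

    def held(self):
        # is-leader can only answer "yes" while the lease is still valid.
        return time.monotonic() < self.expiry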

If there is any flux affecting controllers, the lease operation may time out. This includes:
- controller-to-controller communication issues.
- establishment of HA.
- Raft leader election.

I would anticipate that the caasoperator worker is restarted and that we quiesce subsequently. Is this not the case? Does the unit remain in an error/down state?

Revision history for this message
John A Meinel (jameinel) wrote:

It sounds like we are having trouble managing the lease information:
application-postgresql: 06:29:10 INFO juju.worker.caasoperator stopped "postgresql/0", err: leadership failure: lease operation timed out

Revision history for this message
John A Meinel (jameinel) wrote:

From Joe, I understand that when we check whether a unit is the leader, we also check whether we are currently renewing the lease; if a renewal is in flight, we wait to see whether it succeeds. So if the raft engine dies at the moment we are checking leadership, we see the failure to extend, and that is translated into "you aren't the leader". In a sense that is correct: if we can't ask the lease engine whether you *are* the leader, then we can't guarantee that you are, and claiming leadership anyway could lead to multiple leaders.
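
A hedged sketch of that decision (in Python for illustration; Juju's actual lease code is Go, and these names are made up): if a renewal is in flight when leadership is queried, wait for it, and treat a failed or timed-out renewal as "not the leader", since leadership can no longer be guaranteed:

from concurrent.futures import Future
from typing import Callable, Optional


def check_leadership(lease_held: Callable[[], bool],
                     renewal: Optional[Future] = None,
                     timeout: float = 5.0) -> bool:
    """Answer an is-leader query, waiting on any lease renewal in flight."""
    if renewal is not None:
        try:
            # Block until the in-flight extension resolves (or times out).
            renewal.result(timeout=timeout)
        except Exception:
            # The lease engine died or timed out while extending the lease.
            # We cannot confirm leadership, so answer "no": claiming it here
            # could allow two leaders to act at once.
            return False
    return lease_held()

Under this logic, a renewal that times out surfaces to the charm as "not the leader", which is consistent with the traceback in the description.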

We need to understand why the lease engine is failing in the first place.

Revision history for this message
Joseph Phillips (manadart) wrote:

Given the deprecation of pod-spec charms and the replacement of the lease back-end in 3.2, we do not have plans to work on this.

Changed in juju:
status: Triaged → Won't Fix