Deleted k8s pods remain stuck in hook failed: "db-relation-broken"

Bug #1961074 reported by Loïc Gomez
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

In a Kubernetes environment running the ch:discourse-k8s charm with cross-model relations, Juju seems to have trouble clearing a unit when its pod gets deleted or replaced (for example after a juju config or juju refresh).

Logs from discourse-operator-0: https://pastebin.canonical.com/p/hkzwT8bBZh/ (sorry, Canonical link). Relevant lines:
2022-02-14 08:23:47 ERROR juju.worker.caasoperator caasoperator.go:616 could not get pod "unit-discourse-11" "facea0c9-880f-4cf9-937e-d82ae3f2e477" pod "facea0c9-880f-4cf9-937e-d82ae3f2e477" not found
2022-02-14 08:24:43 INFO juju.worker.caasoperator.uniter.discourse/11 resolver.go:150 awaiting error resolution for "relation-broken" hook
2022-02-14 08:24:51 ERROR juju-log db:0: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/model.py", line 1521, in _run
    result = run(args, **kwargs)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '('/var/lib/juju/tools/unit-discourse-11/relation-get', '-r', '0', '-', '', '--app', '--format=json')' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./src/charm.py", line 421, in <module>
    main(DiscourseCharm)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/main.py", line 426, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/main.py", line 142, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/framework.py", line 276, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/framework.py", line 736, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/framework.py", line 783, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/pgsql/opslib/pgsql/client.py", line 479, in _on_broken
    self.on.master_changed.emit(**kwargs)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/framework.py", line 276, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/framework.py", line 736, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/framework.py", line 783, in _reemit
    custom_handler(event)
  File "./src/charm.py", line 402, in on_database_changed
    if event.master is None:
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/pgsql/opslib/pgsql/client.py", line 65, in master
    conn_str = _master(self.log, self.relation, self._local_unit)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/pgsql/opslib/pgsql/client.py", line 600, in _master
    conn_str = reldata.get("master")
  File "/usr/lib/python3.8/_collections_abc.py", line 660, in get
    return self[key]
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/model.py", line 430, in __getitem__
    return self._data[key]
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/model.py", line 414, in _data
    data = self._lazy_data = self._load()
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/model.py", line 779, in _load
    return self._backend.relation_get(self.relation.id, self._entity.name, self._is_app)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/model.py", line 1588, in relation_get
    return self._run(*args, return_output=True, use_json=True)
  File "/var/lib/juju/agents/unit-discourse-11/charm/venv/ops/model.py", line 1523, in _run
    raise ModelError(e.stderr)
ops.model.ModelError: b'ERROR "" is not a valid unit or application\n'
2022-02-14 08:24:52 ERROR juju.worker.caasoperator.uniter.discourse/11.operation runhook.go:146 hook "db-relation-broken" (via hook dispatching script: dispatch) failed: exit status 1

The unit can be manually cleared with: juju resolve --no-retry discourse/11
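
When many units end up stuck like this at once (see the status output in a later comment), the same workaround can be scripted. Below is a minimal sketch, not part of the original report: it assumes the Juju 2.9 `status --format=json` field layout (`applications.<app>.units.<unit>.workload-status.current`) and reuses the `juju resolve --no-retry` command quoted above; the `stuck_units` helper and the default application name are illustrative.

```
#!/usr/bin/env python3
# Sketch only: apply the manual workaround above to every unit of an
# application stuck in an error state. Assumes the `juju` CLI is on PATH and
# the 2.9 JSON status layout (applications.<app>.units.<unit>.workload-status).
import json
import subprocess
import sys


def stuck_units(app: str) -> list:
    """Return the names of <app> units whose workload status is 'error'."""
    status = json.loads(
        subprocess.check_output(["juju", "status", "--format=json"])
    )
    units = status.get("applications", {}).get(app, {}).get("units", {})
    return [
        name
        for name, info in units.items()
        if info.get("workload-status", {}).get("current") == "error"
    ]


if __name__ == "__main__":
    app = sys.argv[1] if len(sys.argv) > 1 else "discourse"
    for unit in stuck_units(app):
        # --no-retry marks the failed hook as resolved without re-running it,
        # matching the workaround quoted above.
        subprocess.run(["juju", "resolve", "--no-retry", unit], check=True)
        print(f"resolved {unit}")
```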

We have seen lingering units like these on various Juju controllers (versions 2.9.21 and 2.9.18). AFAIK it also happened on version 2.8.9, but I can't find an example at the moment.

This bug might be related to https://bugs.launchpad.net/juju/+bug/1950705, whose fixing PR https://github.com/juju/juju/pull/13511 was released in 2.9.25, but I'm not 100% sure, hence opening this one. Feel free to close this bug if you find it's a duplicate or already resolved.

Thank you,
Loïc

Tags: canonical-is
Ian Booth (wallyworld) wrote :

Can you include the output of `juju status --format yaml` to show what units Juju thinks should exist in the model?
You are right that there have been issues in the past where a pod upgrade confused Juju. There have been some fixes, but we may need to revisit this.

Romain Couturat (romaincout) wrote :
Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.9.28
importance: Undecided → High
status: New → Triaged
Changed in juju:
milestone: 2.9.28 → 2.9.29
Changed in juju:
milestone: 2.9.29 → 2.9.30
Yang Kelvin Liu (kelvin.liu) wrote :

You mentioned this issue was happening on `2.9.18` and even `2.8.9`, which are versions before https://github.com/juju/juju/pull/13511, so I don't think it's related to that fix.

It might be a bug in the operator framework.
The command that was called,
`'/var/lib/juju/tools/unit-discourse-11/relation-get', '-r', '0', '-', '', '--app', '--format=json'`,
comes from
`return self._backend.relation_get(self.relation.id, self._entity.name, self._is_app)`
and you can see that `self._entity.name` was an empty string.
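
For illustration, here is a hedged sketch of a charm-side guard; it is not the actual pgsql opslib or discourse charm code. It skips the remote databag read when the remote application is no longer identifiable, which is the situation that produces the `ERROR "" is not a valid unit or application` failure in the traceback above.

```
# Hypothetical guard, not the pgsql opslib or discourse-k8s charm code.
# During relation-broken the remote application can already be gone, so calling
# relation-get for it fails as in the traceback above.
import logging

from ops.charm import CharmBase, RelationBrokenEvent
from ops.main import main

logger = logging.getLogger(__name__)


class GuardedCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        # Assumes a relation endpoint named "db" in metadata.yaml.
        self.framework.observe(self.on.db_relation_broken, self._on_db_relation_broken)

    def _on_db_relation_broken(self, event: RelationBrokenEvent):
        remote_app = event.relation.app
        if remote_app is None or not remote_app.name:
            # The remote side is no longer identifiable; reading its databag
            # would run `relation-get ... '' --app` and fail, so just drop
            # any locally cached connection details instead.
            logger.info("db relation broken; remote application already gone")
            return
        # Safe to read the remote application databag here.
        master = event.relation.data[remote_app].get("master")
        logger.info("db relation broken; last advertised master: %r", master)


if __name__ == "__main__":
    main(GuardedCharm)
```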

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
Yang Kelvin Liu (kelvin.liu) wrote (last edit):

Hi Romain,
I wasn't able to reproduce this bug.
The steps I tried:

```
juju deploy postgresql-k8s
juju deploy redis-k8s
juju deploy discourse-k8s
juju relate discourse-k8s postgresql-k8s:db-admin
juju relate discourse-k8s redis-k8s

juju scale-application discourse-k8s 6

# mkubectl is presumably an alias for `microk8s kubectl`; `-nt1` selects the
# model's namespace (t1). Delete the discourse pods several times to force
# Kubernetes to recreate them.
mkubectl -nt1 delete pods -l app.kubernetes.io/name=discourse-k8s
mkubectl -nt1 delete pods -l app.kubernetes.io/name=discourse-k8s
mkubectl -nt1 delete pods -l app.kubernetes.io/name=discourse-k8s
```
Are you still experiencing this issue?

John A Meinel (jameinel)
Changed in juju:
status: In Progress → Incomplete
milestone: 2.9.30 → none
Changed in juju:
assignee: Yang Kelvin Liu (kelvin.liu) → nobody
Loïc Gomez (kotodama) wrote :

Hi,

Yes, we're still seeing this issue on a 2.8.9 controller:
Unit Workload Agent Address Ports Message
discourse/6 error idle xxxxxxxxxxxx 3000/TCP hook failed: "db-relation-broken"
discourse/10 error idle xxxxxxxxxxxx 3000/TCP hook failed: "db-relation-broken"
discourse/11* error idle xxxxxxxxxxxx 3000/TCP hook failed: "db-relation-broken"
discourse/12 error idle xxxxxxxxxxxx 3000/TCP hook failed: "db-relation-broken"
discourse/13 error idle xxxxxxxxxxxx 3000/TCP hook failed: "db-relation-broken"
discourse/14 error idle xxxxxxxxxxxx 3000/TCP hook failed: "db-relation-broken"
discourse/15 error idle xxxxxxxxxxxx 3000/TCP hook failed: "db-relation-broken"
discourse/16 error idle xxxxxxxxxxxx 3000/TCP hook failed: "db-relation-broken"
discourse/17 active idle xxxxxxxxxxxx 3000/TCP
discourse/18 active idle xxxxxxxxxxxx 3000/TCP

The same happens on a 2.9.29 controller.

Changed in juju:
status: Incomplete → Triaged
tags: added: canonical-is