relation-joined hook not firing between created and changed

Bug #1903784 reported by Stuart Bishop
4
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Expired
High
Unassigned

Bug Description

With Juju 2.9rc2 we are seeing a problem where the db-relation-joined hook is not being run. Logs show these hooks running: install, db-relation-created, leader-settings-changed, db-relation-changed, config-changed, then db-relation-changed again, start, and db-relation-changed for the third time (which fails, due to required initialization from the db-relation-joined hook not being run). The charm is a k8s Operator Framework charm, using ./dispatch for hooks.

2020-11-10 17:21:24 INFO juju.worker.caasoperator runner.go:527 start "discourse/1"
2020-11-10 17:21:24 INFO juju.worker.leadership tracker.go:217 discourse leadership for discourse/1 denied
2020-11-10 17:21:24 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-discourse-1
2020-11-10 17:21:24 INFO juju.worker.uniter uniter.go:305 unit "discourse/1" started
2020-11-10 17:21:24 INFO juju.worker.uniter uniter.go:592 resuming charm install
2020-11-10 17:21:24 INFO juju.worker.uniter.charm bundles.go:79 downloading cs:~discourse-charmers/discourse-k8s-6 from API server
2020-11-10 17:21:24 INFO juju.downloader download.go:111 downloading from cs:~discourse-charmers/discourse-k8s-6
2020-11-10 17:21:24 INFO juju.downloader download.go:94 download complete ("cs:~discourse-charmers/discourse-k8s-6")
2020-11-10 17:21:24 INFO juju.downloader download.go:174 download verified ("cs:~discourse-charmers/discourse-k8s-6")
2020-11-10 17:21:25 INFO juju.worker.uniter uniter.go:323 hooks are retried true
2020-11-10 17:21:25 INFO juju.worker.uniter resolver.go:147 found queued "install" hook
2020-11-10 17:21:25 INFO juju-log Running legacy hooks/install.
2020-11-10 17:21:28 INFO juju.worker.uniter.operation runhook.go:142 ran "install" hook (via hook dispatching script: dispatch)
2020-11-10 17:21:28 INFO juju.worker.uniter.relation statetracker.go:157 joining relation "discourse:db postgresql:db-admin"
2020-11-10 17:21:28 INFO juju.worker.uniter.relation statetracker.go:193 joined relation "discourse:db postgresql:db-admin"
2020-11-10 17:21:29 INFO juju.worker.uniter.operation runhook.go:142 ran "db-relation-created" hook (via hook dispatching script: dispatch)
2020-11-10 17:21:32 INFO juju.worker.uniter.operation runhook.go:142 ran "leader-settings-changed" hook (via hook dispatching script: dispatch)
2020-11-10 17:21:32 INFO juju.worker.caasoperator initializer.go:124 started pod init on "discourse/1"
2020-11-10 17:21:36 INFO juju.worker.uniter.operation runhook.go:142 ran "db-relation-changed" hook (via hook dispatching script: dispatch)
2020-11-10 17:21:39 INFO juju.worker.uniter.operation runhook.go:142 ran "config-changed" hook (via hook dispatching script: dispatch)
2020-11-10 17:21:39 INFO juju.worker.uniter resolver.go:147 found queued "start" hook
2020-11-10 17:21:41 INFO juju.worker.uniter.operation runhook.go:142 ran "db-relation-changed" hook (via hook dispatching script: dispatch)
2020-11-10 17:21:41 INFO juju-log Running legacy hooks/start.
2020-11-10 17:21:43 INFO juju.worker.uniter.operation runhook.go:142 ran "start" hook (via hook dispatching script: dispatch)
2020-11-10 17:21:44 ERROR juju-log db:1: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 273, in <module>
    main(DiscourseCharm)
  File "/var/lib/juju/agents/unit-discourse-1/charm/venv/ops/main.py", line 401, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-discourse-1/charm/venv/ops/main.py", line 140, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-discourse-1/charm/venv/ops/framework.py", line 234, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-discourse-1/charm/venv/ops/framework.py", line 678, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-discourse-1/charm/venv/ops/framework.py", line 723, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-discourse-1/charm/venv/pgsql/opslib/pgsql/client.py", line 431, in _on_changed
    self._state.rels[relid]["master"] = new_master
  File "/var/lib/juju/agents/unit-discourse-1/charm/venv/ops/framework.py", line 979, in __getitem__
    return _wrap_stored(self._stored_data, self._under[key])
KeyError: 1
2020-11-10 17:21:45 ERROR juju.worker.uniter.operation runhook.go:136 hook "db-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
2020-11-10 17:21:45 INFO juju.worker.uniter resolver.go:143 awaiting error resolution for "relation-changed" hook
2020-11-10 17:21:50 INFO juju.worker.uniter resolver.go:143 awaiting error resolution for "relation-changed" hook

Revision history for this message
Ian Booth (wallyworld) wrote :

Do we think this is a regression in 2.9?
Does it work in 2.8 or 2.9 beta etc?

Revision history for this message
Tom Haddon (mthaddon) wrote :

Note that if we're testing on 2.8 we'll need to use an older version of cs:~postgresql-charmers/postgresql-k8s because of https://code.launchpad.net/~stub/charm-k8s-postgresql/+git/charm-k8s-postgresql/+merge/393311.

In any case, testing on 2.8 shouldn't block work on this bug, I think, as presumably the relation-joined hook should firing between created and changed?

Revision history for this message
Stuart Bishop (stub) wrote :

Debug level Logs from 2.9rc3. Shows an attempted deployment of cs:discourse-k8s and cs:~postgresql-charmers/postgresql-k8s. We see both units running their relation-created hooks, discourse/0 runs relation-joined and relation-created fine. discourse/1 skips relation-joined and thus relation-created fails.

Revision history for this message
Stuart Bishop (stub) wrote :

The second unit might be the key. I had a successful single unit discourse deployment. I added a second unit and it failed. db-relation-created ran, db-relation-joined skipped, and db-relation-changed failed.

Revision history for this message
Stuart Bishop (stub) wrote :

Further investigation shows that the db-relation-joined hook will eventually run, so the problem is that is it being run out of order rather than not being run at all.

Revision history for this message
John A Meinel (jameinel) wrote :

I would expect you to get db-relation-changed for any Application data if you have been setting it, before you get db-relation-changed for the Unit.
Can you confirm what event.unit event.app is for the events that are firing?

IIRC this is the same behavior in 2.8 and 2.9, so isn't a regression in 2.9.

Revision history for this message
John A Meinel (jameinel) wrote :

IIRC bug 1866828 was because you can see app data before relation-joined, but you can know about application data even if you haven't seen units yet.

Changed in juju:
status: New → Incomplete
importance: Undecided → High
milestone: none → 2.9-rc3
Revision history for this message
John A Meinel (jameinel) wrote :

If the ordering is confusing, we can change it, but we have to be careful because charms may be expecting the ordering.

There is a nice potential side effect if you send the application data last, in that charms could know that they've seen all of the "currently running units" by the time application data changed has fired. (as always, that number might be changing, there might be a unit that is actively starting but hasn't finished installing yet, etc.)

Revision history for this message
Stuart Bishop (stub) wrote :

Yes, the PostgreSQL server is setting app data on the relation. So the first client unit gets relation-created, relation-joined, then relation-changed, because it is setting up the relation before the server has had a chance to set application relation data. The second client unit gets relation-created, then one or more relation-changed due to the app data now existing, before relation-joined.

If this is works as intended, I guess we can close this. Charms will need to cope with the two possible initialization paths (probably by not bothering, and just having one handler for all relation events rather than risk dependencies on event ordering).

It is certainly confusing that relation-changed events can be received for a relation that has not yet been joined. It makes the life-cycle harder to explain, and will certainly need to be documented so that charmers don't make the same mistake we did, assuming that initialization hooks will happen in a consistent order. It also makes relation-joined completely pointless, as there is no point using a hook that *might* be called before relation-changed during relation initialization, so maybe it just gets deprecated and only relation-created, -changed, -departed & -broken get documented.

I don't think any charms are expecting this ordering though, since it is a new feature with 2.7. And I can't think of any useful way of depending on the existing ordering, since it entirely depends on at what point during setup the remote end gets around to initializing the application relation data.

The current ordering makes it tricky to switch existing relation protocols to using application relation settings, which is the case here. The PostgreSQL service is setting both application relation settings, and the per unit relation settings for backwards compatibility with older charms. However, the very act of daring to set application relation data means that the charms start triggering hooks in a different order, and client charms may well not cope.

Revision history for this message
Stuart Bishop (stub) wrote :

(what happens if a unit attempts to set relation data in its relation-changed hook on a relation it has not yet joined? Or does this work, because technically it has joined even though the hook hasn't had a chance to run yet?)

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1903784] Re: relation-joined hook not firing between created and changed

So relation-joined isn't *I* have joined the relation. relation-joined is
"another unit is now visible to me". You just can't see any other units
until you have also joined the relation. If you have postgres and
discourse, when discourse/1 starts, you'll get another relation-joined on
postgresql/0 to say that discourse/1 is now visible to you.
So you would already be getting: (from postgresql/0)

relation-created
relation-joined discourse/0
relation-changed discourse/0
relation-changed discourse # if discourse sets app data
relation-joined discourse/1
relation-changed discourse/1

So you can always get a later relation-joined.

The intent is that relation-created tells you that you can do
initialization for the app that you are related to, and it will happen
before a relation-changed for that app. While relation-joined happens for a
unit of that application and happens before relation-changed for that unit.

The actual code change isn't very hard. It is essentially just moving the
block that checks if there is a relation-changed event for application data
after we check for relation events for the units. I'll add it to the bug if
someone wants to try it out.

On Wed, Nov 11, 2020 at 10:10 PM Stuart Bishop <email address hidden>
wrote:

> (what happens if a unit attempts to set relation data in its relation-
> changed hook on a relation it has not yet joined? Or does this work,
> because technically it has joined even though the hook hasn't had a
> chance to run yet?)
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1903784
>
> Title:
> relation-joined hook not firing between created and changed
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1903784/+subscriptions
>

Revision history for this message
John A Meinel (jameinel) wrote :
Changed in juju:
milestone: 2.9-rc3 → 2.9-rc4
Changed in juju:
milestone: 2.9-rc4 → 2.9-rc5
Changed in juju:
milestone: 2.9-rc5 → 2.9-rc6
John A Meinel (jameinel)
Changed in juju:
milestone: 2.9-rc6 → 3.0.0
Changed in juju:
milestone: 3.0.0 → 3.0.1
Changed in juju:
milestone: 3.0.1 → 3.0.2
Changed in juju:
milestone: 3.0.2 → 3.0.3
Changed in juju:
milestone: 3.0.3 → 3.0.4
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Canonical Juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.