Unit relation data not updated until the _relation_changed event

Bug #2063087 reported by Javier de la Puente Alonso
Affects: Canonical Juju
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have tested this situation with Juju 3.1.8 under MicroK8s (it also happens in Charmed Kubernetes on top of OpenStack).

When one unit updates its unit relation data, the units on the other side of the relation do not see the updated data until the _relation_changed event.

This contrasts with the information in the application databag, which is updated automatically: the units in the other application can see it immediately.

This situation is problematic, as a unit in error state has to be resolved repeatedly with "--no-retry" until the _relation_changed event occurs.
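
From the charm's point of view, the asymmetry looks roughly like this. This is only an illustrative sketch using the ops framework; the `redis` endpoint name and the helper function are assumptions, not code from the charms involved:
```
# Illustrative sketch (ops framework); the "redis" endpoint name and this
# helper are assumptions, not taken from the charms involved.
import ops


def read_redis_relation(charm: ops.CharmBase) -> None:
    relation = charm.model.get_relation("redis")
    if relation is None:
        return
    for unit in relation.units:  # remote units, e.g. redis-k8s/0
        # Unit databag: served from the unit agent's cache, so when this runs
        # in any hook other than a redis-relation-* hook it can still return
        # the old value (here, the pre-restart redis IP address).
        hostname = relation.data[unit].get("hostname")
    # Application databag: the remote side's updates are visible here
    # immediately, even outside relation hooks.
    app_data = relation.data[relation.app]
```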

The full situation is as follows:

I followed the tutorial at https://github.com/canonical/discourse-k8s-operator/blob/main/docs/tutorial.md
(the only difference is `juju deploy discourse-k8s --channel edge`).

After a while, all the units will be active:
```
ubuntu@discourse:~/ischarms/discourse-k8s-operator$ juju status
Model Controller Cloud/Region Version SLA Timestamp
discourse microk8s-localhost microk8s/localhost 3.1.8 unsupported 10:12:39+02:00

App Version Status Scale Charm Channel Rev Address Exposed Message
discourse-k8s 3.2.0 waiting 1 discourse-k8s edge 120 10.152.183.141 no installing agent
postgresql-k8s 14.10 active 1 postgresql-k8s 14/stable 193 10.152.183.49 no
redis-k8s 7.0.4 active 1 redis-k8s latest/edge 27 10.152.183.219 no

Unit Workload Agent Address Ports Message
discourse-k8s/0* active idle 10.1.44.239
postgresql-k8s/0* active idle 10.1.44.197
redis-k8s/0* active idle 10.1.44.221
```

After that, I delete the pods for both redis-k8s and discourse-k8s:
`kubectl delete pod redis-k8s-0 -n discourse & kubectl delete pod discourse-k8s-0 -n discourse`

The unit addresses will have changed, and `discourse-k8s/0` will go into error state in `upgrade-charm`,
because the redis unit's IP address in the relation data is not updated (and will not be).

```
ubuntu@discourse:~/ischarms/discourse-k8s-operator$ juju status
Model Controller Cloud/Region Version SLA Timestamp
discourse microk8s-localhost microk8s/localhost 3.1.8 unsupported 10:14:34+02:00

App Version Status Scale Charm Channel Rev Address Exposed Message
discourse-k8s 3.2.0 waiting 1 discourse-k8s edge 120 10.152.183.141 no installing agent
postgresql-k8s 14.10 active 1 postgresql-k8s 14/stable 193 10.152.183.49 no Primary
redis-k8s 7.0.4 active 1 redis-k8s latest/edge 27 10.152.183.219 no

Unit Workload Agent Address Ports Message
discourse-k8s/0* error idle 10.1.44.206 hook failed: "upgrade-charm"
postgresql-k8s/0* active idle 10.1.44.197 Primary
redis-k8s/0* active idle 10.1.44.200
```

The relation data as seen from inside `discourse-k8s/0` still shows the old redis IP address:
```
ubuntu@discourse:~/ischarms/discourse-k8s-operator$ juju exec --unit discourse-k8s/0 "relation-get hostname redis-k8s/0 -r 5"
10.1.44.221
```

The correct value for the hostname field is, however, present in the unit relation data of redis-k8s/0, as `juju show-unit` reports:
```
ubuntu@discourse:~/ischarms/discourse-k8s-operator$ juju show-unit discourse-k8s/0
discourse-k8s/0:
  workload-version: |
    3.2.0
  opened-ports: []
  charm: ch:amd64/focal/discourse-k8s-120
  leader: true
  life: alive
  relation-info:
  ...
  - relation-id: 5
    endpoint: redis
    related-endpoint: redis
    application-data: {}
    related-units:
      redis-k8s/0:
        in-scope: true
        data:
          egress-subnets: 10.152.183.219/32
          hostname: 10.1.44.200
          ingress-address: 10.152.183.219
          port: "6379"
          private-address: 10.152.183.219
  ...
  provider-id: discourse-k8s-0
  address: 10.1.44.206
```

Resolving the unit's error state with `juju resolve discourse-k8s/0 --no-retry` will not update the
unit relation data until the `redis_relation_changed` event, so there will be many retries going
through events such as `upgrade-charm`, `config-changed`, `start` and `discourse-pebble-ready`.

This contrasts with setting a field in the application databag, for example:
```
juju exec --unit redis-k8s/0 "relation-set -r5 --app appfield=field2"
```

This is seen immediately by the other unit, without even having to resolve the error:
```
ubuntu@discourse:~/ischarms/discourse-k8s-operator$ juju exec --unit discourse-k8s/0 "relation-get - redis-k8s -r5 --app"
appfield: field2
```
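
For reference, the charm-side equivalent of publishing data into each databag looks roughly like this. Again, an illustrative sketch only; the `redis` endpoint name and the helper function are assumptions:
```
# Illustrative sketch: publishing into the unit and application databags from
# a charm. The "redis" endpoint name and this helper are assumptions.
import ops


def publish_redis_relation_data(charm: ops.CharmBase) -> None:
    relation = charm.model.get_relation("redis")
    if relation is None:
        return
    # Unit databag: any unit may write its own entry, but the remote side
    # does not observe the change until its next *-relation-changed hook.
    relation.data[charm.unit]["hostname"] = "10.1.44.200"
    # Application databag: only the leader may write it; per this report the
    # remote side sees the new value immediately, even outside relation hooks.
    if charm.unit.is_leader():
        relation.data[charm.app]["appfield"] = "field2"
```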

Is this asymmetry between unit and application data intended behavior?

Shouldn't the unit relation data be updated before the redis_relation_changed event?

Tags: canonical-is
Joseph Phillips (manadart) wrote (last edit):

I've gone over the logic for this. It is quite convoluted, but it does represent explicitly the scenario you've described.

We cache relation settings on the agent side, and invalidate/prune the cache selectively based on hook type and arguments.

It happens that application settings are always pruned indiscriminately when a new context is created, which causes the first fetch to go to the controller.

Because we only invalidate *unit* members at the beginning of a relation_* hook, the cache has the last fetched data in all other hook types and for exec.

I will discuss potential avenues with the team.

Joseph Phillips (manadart) wrote:

For reference.

Invalidation of cached remote unit settings for relation hook kinds (selective):
https://github.com/juju/juju/blob/f4a2a6605a6907eb40cf36133a5d419f7d4bd0f5/worker/uniter/runner/context/contextfactory.go#L268

Pruning of the cache members to those in the relation (all contexts):
https://github.com/juju/juju/blob/f4a2a6605a6907eb40cf36133a5d419f7d4bd0f5/worker/uniter/runner/context/contextfactory.go#L357

The latter always clears application settings in the cache, causing a fetch from the controller upon first access:
https://github.com/juju/juju/blob/f4a2a6605a6907eb40cf36133a5d419f7d4bd0f5/worker/uniter/runner/context/cache.go#L57
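
As a rough illustration of the policy those links describe (a toy sketch only, with hypothetical names; not the actual Go implementation in worker/uniter/runner/context):
```
# Toy model of the caching policy described above. Names and structure are
# hypothetical; this is not the actual Go implementation.
class RelationSettingsCache:
    def __init__(self):
        self._unit_settings = {}  # (relation_id, unit_name) -> settings dict
        self._app_settings = {}   # relation_id -> settings dict

    def new_hook_context(self, hook_kind, relation_id=None, remote_unit=None):
        # Application settings are pruned indiscriminately whenever a new
        # context is created, so the first read in any hook (or juju exec)
        # goes back to the controller and sees fresh data.
        self._app_settings.clear()
        # Remote *unit* settings are only invalidated at the start of a
        # relation-* hook, for the unit named in the hook arguments; every
        # other hook kind keeps serving the last value it fetched.
        if hook_kind.startswith("relation-") and remote_unit is not None:
            self._unit_settings.pop((relation_id, remote_unit), None)
```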

Tom Haddon (mthaddon) wrote:

We just saw this on a staging Charmed Kubernetes cluster. It happened when a discourse application and a redis application were both on the same Kubernetes worker, which was on an OpenStack compute host that rebooted. The "fix" for now is to kill the discourse pod, at which point it sees the correct IP in relation data.

Joseph Phillips (manadart) wrote:

I dug into this further.

Juju's behaviour around hooks and visible data is deliberate.

If we notify you of changed data via hook emission, that data should remain the same each time you read it, until we tell you it has changed again.

This is to provide idempotency around hook executions, and to eliminate potential charm behaviour of polling Juju in hook handlers waiting for data that may change.

The links I posted above indicate that it is the behaviour for *application* relation data that is incorrect, and *not* vice-versa.

We do not intend to undertake a fix for the unit relation data that you have observed.
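
In practice, given this ruling, a charm should pick up remote unit data in its *-relation-changed handler rather than expecting it to refresh in other hooks. A minimal ops-framework sketch of that pattern (the `redis` endpoint name is an assumption):
```
# Minimal sketch: treat the *-relation-changed hook as the point where remote
# unit data is fresh, rather than re-reading it from other hooks and expecting
# updates. The "redis" endpoint name is an assumption.
import ops


class DiscourseCharm(ops.CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on["redis"].relation_changed,
                               self._on_redis_relation_changed)

    def _on_redis_relation_changed(self, event: ops.RelationChangedEvent) -> None:
        # event.unit is the remote unit whose databag changed (None when only
        # application data changed). Inside this hook the agent has refreshed
        # its cached copy, so the read below returns the new value, e.g. the
        # redis pod's new IP address.
        if event.unit is None:
            return
        hostname = event.relation.data[event.unit].get("hostname")
        # ...reconfigure the workload with the fresh hostname here...


if __name__ == "__main__":
    ops.main(DiscourseCharm)
```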

Changed in juju:
status: New → Won't Fix