Potential race condition leads to broken application state

Bug #1914415 reported by Jonathan Hall on 2021-02-03
This bug affects 1 person
Affects               Status        Importance  Assigned to  Milestone
charm-k8s-postgresql  Fix Released  Medium      Unassigned
juju                  Invalid       Undecided   Unassigned

Bug Description

This problem is intermittent, so it seems likely to be a race condition of some sort.

On a freshly created GKE cluster, if I run the following script, I frequently end up with my application in a broken state:

    #!/bin/bash
    gcloud container clusters get-credentials cluster-1 --project canonical-jonathan
    juju add-k8s gke
    juju bootstrap gke
    juju add-model gke
    juju deploy cs:~postgresql-charmers/postgresql-k8s postgresql
    juju deploy cs:~mattermost-charmers/mattermost
    juju add-relation mattermost postgresql:db

After running the above script and waiting a few minutes for the services to start, the `postgresql` application eventually gets stuck. `juju status` shows:

    $ juju status
    Model  Controller       Cloud/Region     Version  SLA          Timestamp
    gke    gke-us-central1  gke/us-central1  2.9-rc6  unsupported  13:05:36+01:00

    App         Version       Status   Scale  Charm           Store       Rev  OS          Address     Message
    mattermost                waiting      1  mattermost      charmstore   17  kubernetes              Waiting for database relation
    postgresql  pgcharm:edge  active       1  postgresql-k8s  charmstore    8  kubernetes  10.8.1.106

    Unit           Workload  Agent  Address   Ports     Message
    mattermost/0*  waiting   idle                       Waiting for database relation
    postgresql/0*  error     idle   10.4.1.8  5432/TCP  hook failed: "db-relation-changed"

And `juju debug-log` shows:

    application-postgresql: 13:03:21 INFO juju.worker.caasoperator.uniter.postgresql/0 awaiting error resolution for "relation-changed" hook
    application-postgresql: 13:04:45 INFO juju.worker.caasoperator.uniter.postgresql/0 awaiting error resolution for "relation-changed" hook
    application-postgresql: 13:04:46 ERROR unit.postgresql/0.juju-log db:1: Uncaught exception while in charm code:
    Traceback (most recent call last):
    File "./src/charm.py", line 237, in <module>
        ops.main.main(PostgreSQLCharm, use_juju_for_storage=True)
    File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/main.py", line 402, in main
        _emit_charm_event(charm, dispatcher.event_name)
    File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/main.py", line 140, in _emit_charm_event
        event_to_emit.emit(*args, **kwargs)
    File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/framework.py", line 278, in emit
        framework._emit(event)
    File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/framework.py", line 722, in _emit
        self._reemit(event_path)
    File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/framework.py", line 767, in _reemit
        custom_handler(event)
    File "/var/lib/juju/agents/unit-postgresql-0/charm/src/clientrel.py", line 115, in on_db_relation_changed
        req = _ClientRequests(event)
    File "/var/lib/juju/agents/unit-postgresql-0/charm/src/clientrel.py", line 234, in __init__
        buckets = [event.relation.data[event.app]]
    File "/var/lib/juju/agents/unit-postgresql-0/charm/venv/ops/model.py", line 699, in __getitem__
        return self._data[key]
    KeyError: <ops.model.Application mattermost>
    application-postgresql: 13:04:46 ERROR juju.worker.caasoperator.uniter.postgresql/0.operation hook "db-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
    application-postgresql: 13:04:46 INFO juju.worker.caasoperator.uniter.postgresql/0 awaiting error resolution for "relation-changed" hook
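A defensive workaround at the charm level would be to guard the data access and defer the event when the remote application's data is not yet visible. This is only a sketch, not the charm's actual code: the handler name mirrors the traceback, and the stub event class is purely illustrative of the framework objects involved.

```python
class StubRelationEvent:
    """Illustrative stand-in for ops.charm.RelationChangedEvent (not real ops code)."""

    def __init__(self, app, relation_data):
        self.app = app
        # Mimic event.relation.data: a mapping keyed by app/unit objects.
        self.relation = type("StubRelation", (), {"data": relation_data})()
        self.deferred = False

    def defer(self):
        # In the real framework this re-queues the event for a later dispatch.
        self.deferred = True


def on_db_relation_changed(event):
    # Early in the relation's life, application data can arrive before any
    # remote units are visible, leaving event.app (or its data bucket)
    # unavailable, which is what raises the KeyError in the traceback above.
    if event.app is None or event.app not in event.relation.data:
        event.defer()  # retry on a later dispatch instead of crashing the hook
        return None
    return event.relation.data[event.app]
```

With the guard in place, the hook exits cleanly and the framework re-emits the event once the application data is available, instead of leaving the unit stuck in an error state.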

In contrast, if I add a delay between some of the steps, as in this script:

    #!/bin/bash
    gcloud container clusters get-credentials cluster-1 --project canonical-jonathan
    juju add-k8s gke
    juju bootstrap gke
    juju add-model gke
    juju deploy cs:~postgresql-charmers/postgresql-k8s postgresql
    sleep 60
    juju deploy cs:~mattermost-charmers/mattermost
    sleep 60
    juju add-relation mattermost postgresql:db

then the installation eventually succeeds:

    $ juju status
    Model  Controller       Cloud/Region     Version  SLA          Timestamp
    gke    gke-us-central1  gke/us-central1  2.9-rc6  unsupported  14:24:37+01:00

    App         Version            Status  Scale  Charm           Store       Rev  OS          Address     Message
    mattermost  mattermost:5.31.0  active      1  mattermost      charmstore   17  kubernetes  10.8.8.90
    postgresql  pgcharm:edge       active      1  postgresql-k8s  charmstore    8  kubernetes  10.8.6.235

    Unit           Workload  Agent  Address   Ports     Message
    mattermost/0*  active    idle   10.4.0.8  8065/TCP
    postgresql/0*  active    idle   10.4.0.7  5432/TCP  Pod configured

Additional info:

    $ juju version
    2.9-rc6-genericlinux-amd64

John A Meinel (jameinel) wrote :

This looks like it might be an issue with the postgresql charm. Possibly a missed expectation about the ordering of events (it seems the charm assumes the relation data dict is already populated during relation-changed).
I'm not sure what that expectation is, nor why waiting would make it more reliable.
Maybe the postgres operator is coming up and not yet able to see its own workload by the time the relation event arrives. The event may need to be deferred, or the charm may be waiting on some other piece of data before it can progress.

Tom Haddon (mthaddon) wrote :

I've retargeted this to charm-k8s-postgresql, and we'll take a look and update this bug.

affects: postgresql-charm → charm-k8s-postgresql
Stuart Bishop (stub) wrote :

The exception shows the charm is in a relation-changed hook and the db-relation-changed event has been fired. The remote application is known to be mattermost (event.app), but attempting to read mattermost's application relation data fails (i.e. relation-get is returning a failure). That smells like a Juju race condition to me (or possibly one in the Operator Framework). event.relation.data[event.app] should never raise a KeyError unless event.app is None; it isn't None here, and I can't think of a case where event.app would be None either.

John A Meinel (jameinel) wrote :

I think this is actually https://github.com/canonical/operator/issues/175

Which is that the operator framework determines what event.app is by iterating relation.units, so even though Juju is setting JUJU_REMOTE_APP, that isn't used, and thus event.app *is* None in this situation.

John A Meinel (jameinel) wrote :

To try and distill issue 175, I believe the issue is that Relation.app is set by:
https://github.com/canonical/operator/blob/master/ops/model.py#L642

Which iterates the known units, and then pulls the application from those units.

However, in events like relation-created and relation-changed, there may be data associated with the application but not with any units, so Relation.units is empty.
Juju itself is setting JUJU_REMOTE_APP as an environment variable during those events.

Which should be populated on the event object:
https://github.com/canonical/operator/blob/master/ops/main.py#L154

So I might be wrong. I was thinking about Relation.app (which has a known bug) not RelationEvent.app (which should be properly populated).

Unless it is:
https://github.com/canonical/operator/blob/master/ops/model.py#L684

which seems to only populate RelationData[app] if *Relation.app* is set.
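To make that failure mode concrete, here is a minimal sketch of deriving the remote app from the known units versus falling back to JUJU_REMOTE_APP, which is roughly what the eventual fix in canonical/operator does. This is not the actual ops source; units are represented as plain name strings for brevity.

```python
import os


def remote_app_from_units(unit_names):
    """Pre-fix behaviour (sketch): derive the remote application name
    from the first known unit, e.g. "mattermost/0" -> "mattermost"."""
    for name in unit_names:
        return name.rsplit("/", 1)[0]
    # No remote units are visible yet (only application data exists), so
    # Relation.app ends up None and relation.data[event.app] raises KeyError.
    return None


def remote_app(unit_names):
    """Post-fix behaviour (sketch): fall back to the JUJU_REMOTE_APP
    environment variable, which Juju sets for relation hooks."""
    return remote_app_from_units(unit_names) or os.environ.get("JUJU_REMOTE_APP")
```

Because application relation data can arrive before any remote units are visible, only the second version reliably identifies the remote app during early relation-created/relation-changed events.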

Ben Hoyt (benhoyt) wrote :

Yeah, that analysis is correct -- this is definitely the same issue as https://github.com/canonical/operator/issues/175. I was able to repro this locally quite easily (using a minikube cluster).

Note that it's the relation.data access that's raising the KeyError, which is due to Relation.app being None (I don't think it's related to RelationEvent, as the event.app access has already executed by this time).

Note that I was also able to work around this using "juju resolved":

juju resolved --no-retry postgresql/0

After that the setup continued and Mattermost and the db came up without an issue.

I'll look at a proper fix tomorrow (spent too much time today getting Docker and minikube going again on my refreshed laptop...).

Ben Hoyt (benhoyt) wrote :

This has been fixed via https://github.com/canonical/operator/pull/475 (the canonical/operator issue is https://github.com/canonical/operator/issues/175) and will be available in the next release of Operator Framework.

In the meantime, you can work around it using the "juju resolved" hack above, or use the latest (unreleased) version of the Operator Framework by pulling the latest commit on the master branch.

Changed in juju:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Ben Hoyt (benhoyt)
status: Triaged → Fix Committed
Changed in charm-k8s-postgresql:
status: New → Fix Committed
Ben Hoyt (benhoyt) wrote :

Per above, this has been fixed in the Operator Framework (canonical/operator). However, charm-k8s-postgresql will still need to be updated to use the new version of the Operator Framework.

I corrected the status for the "juju" project to "Invalid" as it's not a bug in Juju itself, and corrected the status for "charm-k8s-postgresql" to "Confirmed" as the charm itself still needs to be updated.

Changed in juju:
assignee: Ben Hoyt (benhoyt) → nobody
status: Fix Committed → Invalid
Changed in charm-k8s-postgresql:
status: Fix Committed → Confirmed
Changed in juju:
importance: High → Undecided
Tom Haddon (mthaddon) wrote :

I've built a custom version of the charm using the latest commit and published it to cs:~mthaddon/postgresql-k8s.

I've then spun up a juju k8s model and ran the following immediately after each other:

juju deploy cs:~mthaddon/postgresql-k8s postgresql
juju deploy cs:~mattermost-charmers/mattermost
juju add-relation mattermost postgresql:db

This worked fine, and I didn't experience the KeyError issue, so I think the fix is good. Would be good if others could confirm.

I did see the following error in debug-log - https://pastebin.ubuntu.com/p/skt8Jq8xB7/ - but things continued fine without my intervention (my model-config has automatically-retry-hooks set to true, which I believe is the default). I believe that's unrelated, though.

John A Meinel (jameinel) wrote :

That traceback looks like the update-status hook code is assuming that the application pod will always be up (or at least it isn't catching the exception raised when the pod isn't accepting connections yet).

That certainly doesn't seem like an Operator or Juju issue.

Thanks for testing the fix, we'll look to get it into a final Operator Framework release.

Tom Haddon (mthaddon) on 2021-03-02
Changed in charm-k8s-postgresql:
importance: Undecided → Medium
Tom Haddon (mthaddon) wrote :

We've released a fix for this by pinning the Operator Framework version in requirements.txt to git+git://github.com/canonical/operator@f521841#egg=ops. When a new version of the framework is released we'll revert to "ops".

Changed in charm-k8s-postgresql:
status: Confirmed → Fix Released