Potential race condition leads to broken application state
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Canonical Juju | Invalid | Undecided | Unassigned | |
| charm-k8s-postgresql | Fix Released | Medium | Unassigned | |
Bug Description
This problem is intermittent, which suggests a race condition of some sort.
On a freshly created GKE cluster, if I run the following script, I frequently end up with my application in a broken state:
```bash
#!/bin/bash
gcloud container clusters get-credentials cluster-1 --project canonical-jonathan
juju add-k8s gke
juju bootstrap gke
juju add-model gke
juju deploy cs:~postgresql-
juju deploy cs:~mattermost-
juju add-relation mattermost postgresql:db
```
After running the above script and waiting a few minutes for the services to start, the `postgresql` application eventually gets stuck. `juju status` shows:
```
$ juju status
Model  Controller       Cloud/Region     Version  SLA          Timestamp
gke    gke-us-central1  gke/us-central1  2.9-rc6  unsupported  13:05:36+01:00

App         Version       Status   Scale  Charm           Store       Rev  OS          Address     Message
mattermost                waiting      1  mattermost      charmstore   17  kubernetes              Waiting for database relation
postgresql  pgcharm:edge  active       1  postgresql-k8s  charmstore    8  kubernetes  10.8.1.106

Unit           Workload  Agent  Address   Ports     Message
mattermost/0*  waiting   idle                       Waiting for database relation
postgresql/0*  error     idle   10.4.1.8  5432/TCP  hook failed: "db-relation-
```
And `juju debug-log` shows (log prefixes and file paths truncated in the capture):
```
application
application
application
Traceback (most recent call last):
  File "./src/charm.py", line 237, in <module>
  File "/var/lib/
  File "/var/lib/
  File "/var/lib/
  File "/var/lib/
  File "/var/lib/
  File "/var/lib/
    req = _ClientRequests
  File "/var/lib/
    buckets = [event.
  File "/var/lib/
    return self._data[key]
KeyError: <ops.model.
application
application
```
In contrast, if I add a delay between some of the steps, as in this script:
```bash
#!/bin/bash
gcloud container clusters get-credentials cluster-1 --project canonical-jonathan
juju add-k8s gke
juju bootstrap gke
juju add-model gke
juju deploy cs:~postgresql-
sleep 60
juju deploy cs:~mattermost-
sleep 60
juju add-relation mattermost postgresql:db
```
then the installation eventually succeeds:
```
$ juju status
Model  Controller       Cloud/Region     Version  SLA          Timestamp
gke    gke-us-central1  gke/us-central1  2.9-rc6  unsupported  14:24:37+01:00

App         Version            Status  Scale  Charm           Store       Rev  OS          Address     Message
mattermost  mattermost:5.31.0  active      1  mattermost      charmstore   17  kubernetes  10.8.8.90
postgresql  pgcharm:edge       active      1  postgresql-k8s  charmstore    8  kubernetes  10.8.6.235

Unit           Workload  Agent  Address   Ports     Message
mattermost/0*  active    idle   10.4.0.8  8065/TCP
postgresql/0*  active    idle   10.4.0.7  5432/TCP  Pod configured
```
Additional info:

```
$ juju version
2.9-
```
Changed in charm-k8s-postgresql:
importance: Undecided → Medium
This looks like it might be an issue with the postgresql charm, possibly a missed expectation about the ordering of events: the handler appears to assume a dict is already populated when `db-relation-changed` fires, and raises a `KeyError` when it is not.
I'm not sure what that expectation is, nor why waiting would make it more reliable.
Maybe the postgres operator is coming up and not yet able to see its own workload by the time the relation event arrives. The event may need to be deferred, or the handler may need to wait for some other piece of data before it can progress.
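To illustrate the defer-and-retry pattern suggested above, here is a minimal plain-Python sketch. It is not the actual charm code (the failing handler and its data are truncated in the log above); it only models how the ops framework's `event.defer()` lets a handler bail out cleanly when expected relation data has not arrived yet, instead of raising a `KeyError`. The `Event` class, `handle_db_relation_changed` function, and the sample data are all hypothetical.

```python
# Hypothetical sketch of the "defer and retry" pattern; not the actual
# postgresql-k8s charm code. In a real charm, event.defer() asks Juju's
# operator framework to re-deliver the event later.

class Event:
    """Stand-in for an ops event object with a defer() method."""

    def __init__(self, name):
        self.name = name
        self.deferred = False

    def defer(self):
        # In ops, this requeues the event for the next dispatch.
        self.deferred = True


def handle_db_relation_changed(event, relation_data):
    # Guard against the KeyError seen in the traceback: if the data the
    # handler expects is not present yet, defer instead of raising.
    if "db" not in relation_data:
        event.defer()
        return None
    return relation_data["db"]


# First delivery: data not ready yet, so the event is deferred.
e = Event("db-relation-changed")
assert handle_db_relation_changed(e, {}) is None
assert e.deferred

# Later re-delivery: data has arrived and the handler succeeds.
e2 = Event("db-relation-changed")
data = {"db": {"host": "10.4.1.8", "port": 5432}}
assert handle_db_relation_changed(e2, data) == {"host": "10.4.1.8", "port": 5432}
assert not e2.deferred
```

Whether deferral is the right fix here depends on what the missing key actually is; if the handler is waiting on the charm's own workload state rather than remote relation data, a readiness check before processing the event may be more appropriate.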