bootstrap and resize timeout "(workload) Not all relations are ready" on keystone and placement

Bug #2036990 reported by Alexander Balderson
This bug affects 2 people
Affects         Status         Importance  Assigned to  Milestone
OpenStack Snap  Fix Committed  High        Unassigned

Bug Description

On a deployment of microstack with converged control, one keystone unit (keystone/2) and one placement unit (placement/2) both hang with "Not all relations are ready", which causes the deployment to time out.

Nothing in the pod logs for either of these units stands out, and it even seems that they have started handling GET and POST requests.

The test run can be found at https://solutions.qa.canonical.com/testruns/15c119d1-e198-4dfe-8d84-6c2794ee32d2/
with logs at https://oil-jenkins.canonical.com/artifacts/15c119d1-e198-4dfe-8d84-6c2794ee32d2/index.html

This is not a duplicate of LP:#2023664, as the log messages from that bug don't appear to be present.

Tags: cdo-qa
Marian Gasparovic (marosg) wrote :
summary: - bootstrap timeout "(workload) Not all relations are ready" on keystone
- and placement
+ bootstrap and resize timeout "(workload) Not all relations are ready" on
+ keystone and placement
James Page (james-page) wrote :

I think the original test run is on resize as well - the waiting units are both /2

Liam Young (gnuoy) wrote :

The problem appears to be that the workload status message is not updated after the ingress relation completes. To reproduce it, deploy keystone; assuming you did not hit this bug, you should see a status like:

$ juju status keystone traefik
Model              Controller  Cloud/Region        Version  SLA          Timestamp
zaza-49120919947b  micro       microk8s/localhost  3.2.3    unsupported  12:20:51Z

App       Version  Status  Scale  Charm         Channel        Rev  Address         Exposed  Message
keystone           active      1  keystone-k8s  2023.2/edge    136  10.152.183.244  no
traefik   2.10.4   active      1  traefik-k8s   1.0/candidate  148  10.20.21.4      no

Unit         Workload  Agent  Address       Ports  Message
keystone/0*  active    idle   10.1.214.191
traefik/0*   active    idle   10.1.214.182

Start a debug-hooks session on traefik/0 to stop it from executing hooks:
juju debug-hooks traefik/0

Collect the traefik relation app data

juju exec --unit keystone/0 "relation-ids ingress-public"
ingress-public:6

juju exec --unit traefik/0 "relation-get -r ingress:6 - traefik/0 --app"
ingress: '{"url": "http://10.20.21.4/zaza-49120919947b-keystone"}'

Now simulate traefik not being ready by wiping the app data between traefik and keystone:

juju exec --unit traefik/0 "relation-set -r ingress:6 --app ingress="

Keystone will now report "(ingress-public) integration incomplete" in its workload status.

Trigger a config-changed hook:

juju config keystone debug=true

Keystone will now report "(workload) Not all relations are ready"

Finally fix the relation data from traefik:

DATA='{"url": "http://10.20.21.4/zaza-49120919947b-keystone"}'
juju exec --unit traefik/0 "relation-set -r ingress:6 --app ingress='$DATA'"

Keystone will STILL report "(workload) Not all relations are ready"

This is a bug in charm-ops-sunbeam
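
For illustration, the app data captured in the reproducer above is a JSON document stored under the `ingress` key. A minimal sketch of a readiness check over such a databag might look like the following. The function name `ingress_url` and the plain-dict representation of the databag are assumptions made for this example; this is not the actual charm-ops-sunbeam or traefik library code.

```python
import json
from typing import Dict, Optional


def ingress_url(app_data: Dict[str, str]) -> Optional[str]:
    """Return the URL advertised in an ingress app databag, or None.

    Hypothetical helper: `app_data` is a plain dict standing in for the
    relation application data shown by `relation-get ... --app`.
    """
    raw = app_data.get("ingress")
    if not raw:
        # Key missing or empty: the provider has not published anything.
        return None
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    return payload.get("url") or None


# Databag as seen before the wipe in the reproducer.
populated = {"ingress": '{"url": "http://10.20.21.4/zaza-49120919947b-keystone"}'}
# Databag after the app data has been cleared.
wiped: Dict[str, str] = {}

print(ingress_url(populated))
print(ingress_url(wiped))
```

A check like this reads the raw relation data directly, so it reports "not ready" as soon as the databag is empty, regardless of anything an interface library may have cached.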

Liam Young (gnuoy) wrote :

That reproducer is a bit artificial because the traefik lib caches the ingress value and only emits an event if the value has changed.

Liam Young (gnuoy) wrote :

ok, I think this bug occurs when:

* A new unit of traefik is being provisioned
* A hook fires for another relation on keystone (or other sunbeam charm). This causes keystone to
  evaluate the state of its relations. It finds all app data has gone from the ingress relation *1 and
  reports the relation as not ready (sunbeam bypasses the traefik stored state).
* Eventually the new traefik unit sets its relation data. The traefik interface compares this relation
  data against its cache and does not emit an event because it hasn't changed.
* At this point keystone reports "(workload) Not all relations are ready". If another event fires
  (other than update-status), keystone will spot that the relation is ready. If this happens at the
  end of the deployment, no further event arrives and keystone stays in this waiting state.
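
The sequence above can be sketched as a small simulation. This is a toy model only: the class names `CachingIngressLib` and `ToyCharm`, the flat `url` key, and the status strings are all assumptions for illustration and do not reproduce the real traefik interface library or sunbeam code. It shows how a cache comparison on one side and a raw databag read on the other can leave the status stuck.

```python
from typing import Callable, Dict, List, Optional

URL = "http://10.20.21.4/zaza-49120919947b-keystone"


class CachingIngressLib:
    """Toy interface library that caches the last URL it saw."""

    def __init__(self, on_ready: Callable[[str], None]) -> None:
        self._cached_url: Optional[str] = None
        self._on_ready = on_ready

    def relation_changed(self, app_data: Dict[str, str]) -> None:
        url = app_data.get("url")
        # Emit only when a URL is present AND differs from the cache.
        if url and url != self._cached_url:
            self._cached_url = url
            self._on_ready(url)


class ToyCharm:
    """Toy charm that evaluates readiness from the raw databag."""

    def __init__(self, app_data: Dict[str, str]) -> None:
        self.app_data = app_data  # shared reference to the app databag
        self.status = "unknown"
        self.ready_events: List[str] = []
        self.lib = CachingIngressLib(self._on_ingress_ready)

    def _evaluate(self) -> None:
        # Bypasses the library cache and looks at the databag directly.
        self.status = "active" if self.app_data.get("url") else "waiting"

    def _on_ingress_ready(self, url: str) -> None:
        self.ready_events.append(url)
        self._evaluate()

    def config_changed(self) -> None:
        self._evaluate()

    def ingress_relation_changed(self) -> None:
        self.lib.relation_changed(self.app_data)


databag: Dict[str, str] = {"url": URL}
charm = ToyCharm(databag)
timeline: List[str] = []

charm.ingress_relation_changed()   # first publish: event emitted
timeline.append(charm.status)

databag.clear()                    # new provider unit: app data disappears
charm.config_changed()             # unrelated hook re-evaluates relations
timeline.append(charm.status)

databag["url"] = URL               # same URL is published again
charm.ingress_relation_changed()   # matches cache: no event, no re-evaluation
timeline.append(charm.status)

charm.config_changed()             # any later non-ingress hook unsticks it
timeline.append(charm.status)

print(timeline)
```

In this model the third step is the problematic one: the data is back, but because the library sees no change relative to its cache, nothing prompts the charm to re-evaluate its status.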

# Reproduce:
Deploy the keystone bundle with traefik. To catch the bug we need to intercept a new unit of the traefik charm and stop it from executing its hooks, so that we can update keystone while traefik is in this transient state.

while true; do juju debug-hooks traefik/1; sleep 0.1; done
juju add-unit traefik

# When debug-hooks session kicks in

juju config keystone debug=true

# Keystone will now go into the following state:

keystone/0* waiting idle 10.1.214.167 (workload) Not all relations are ready

# Cycle through the hooks in the traefik debug-hooks session and keystone will remain stuck in the waiting state

Liam Young (gnuoy) wrote :

*1 I was very surprised to see the app data missing while the new traefik unit was being added:

# relation-list -r ingress-internal:7
traefik/0
traefik/1
# relation-get -r ingress-internal:7 - traefik/0 --app
{}
# relation-get -r ingress-internal:7 - traefik/0
egress-subnets: 10.20.21.2/32
ingress-address: 10.20.21.2
private-address: 10.20.21.2
# relation-get -r ingress-internal:7 - traefik/1
egress-subnets: 10.20.21.2/32
ingress-address: 10.20.21.2
private-address: 10.20.21.2

# relation-list -r ingress-internal:8
traefik/0
traefik/1
# relation-get -r ingress-internal:8 - traefik/0 --app
{}
# relation-get -r ingress-internal:8 - traefik/0
egress-subnets: 10.20.21.2/32
ingress-address: 10.20.21.2
private-address: 10.20.21.2
# relation-get -r ingress-internal:8 - traefik/1
egress-subnets: 10.20.21.2/32
ingress-address: 10.20.21.2
private-address: 10.20.21.2

Liam Young (gnuoy) wrote :
James Page (james-page)
Changed in snap-openstack:
status: New → Fix Committed
importance: Undecided → High
milestone: none → 2023.2.1