incomplete relations too often

Bug #1967177 reported by Rodrigo Barbieri
This bug affects 5 people
Affects                          Status   Importance  Assigned to      Milestone
Canonical Juju                   Triaged  High        Joseph Phillips
MySQL InnoDB Cluster Charm       Invalid  Undecided   Unassigned
OpenStack Percona Cluster Charm  Invalid  Undecided   Unassigned
OpenStack Placement Charm        Invalid  Undecided   Unassigned

Bug Description

Juju being used: 2.9.27

When deploying the attached bundle, I consistently get incomplete relations errors between several units and mysql, despite the relations being declared correctly in the bundle.

The only solution I found is to remove the relation and add it again. IMO this is not an acceptable solution because it causes outages. The root cause of the problem should be addressed.

In certain other environments I don't get the same error and everything works fine.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

See the juju status attached for an example. The affected apps are usually random; most of the time placement and vault are affected, sometimes cinder and glance.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Added charm-percona-cluster as affected as well, because all the affected units seem to be affected in how they relate to percona-cluster (mysql in my bundle).

tags: added: sts
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

To fix the deployment whose juju status I pasted, I had to remove the relations between vault and mysql, placement and mysql, and keystone and mysql, and re-add all of them (in the same order, one by one). Those are the only steps I took to fix all the issues in that juju status, and the deployment is now working fine.
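
For reference, a sketch of what that workaround looks like (the application names match my bundle; endpoints are omitted on the assumption that each pair has only one matching relation):

juju remove-relation vault mysql
juju add-relation vault mysql
juju remove-relation placement mysql
juju add-relation placement mysql
juju remove-relation keystone mysql
juju add-relation keystone mysql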

description: updated
Revision history for this message
Alan Baghumian (alanbach) wrote (last edit ):

This exact same issue happens when adding new placement units. Tested with charm rev. 32 on focal/xena.

affects: charm-percona-cluster → charm-placement
Revision history for this message
Ian Booth (wallyworld) wrote :

What does "juju status --relations" show? We need to determine whether the relations are "incomplete" because juju isn't making them properly, ie they do not show as "Joined" in status, or is it that charms aren't properly processing the async events which are triggered as part of standing up the model.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

very good point, I should have attached a juju status --relations instead :facepalm:

Since I fixed my deployment in order to test something else, I will do that testing today; then I can tear it down, redeploy to hit the same issue, and attach a new juju status with --relations.

But in the meantime, what I can say is that I had checked juju status --relations and none of the relations (at that time) were stuck in "joining"; I had also checked the relation data using juju show-unit for the affected units.
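
For context, those checks were roughly the following (the unit name is just an example, not from this deployment):

juju status --relations
juju show-unit placement/0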

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Re-deployed env and captured juju status --relations (attached).

This time keystone was not affected by the issue. I also don't see the "joined"/"joining" values in the output.

tags: added: cdo-qa foundation-engine
tags: added: foundations-engine
removed: foundation-engine
Revision history for this message
Alexander Balderson (asbalderson) wrote :

We've been flagging test runs in the SQA lab with this bug lately, as we've seen a large number of cases where all but one unit in a service is up and running, and the one unit that isn't running says it's missing a relation.

Most often these services are mysql-routers, which never get the shared-db relation to finish or trigger a hook.

Here are some logs from a recent example, where neutron-api-mysql-router/0 (the leader unit) is waiting on "'shared-db' incomplete, Waiting for proxied DB creation from cluster" for a little over 3 hours.

We've recently added more logs to the juju-crashdump as well. Each unit's folder now has logs for juju show-machine and juju show-unit, as well as juju show-status-log.

Crashdump: https://oil-jenkins.canonical.com/artifacts/6834065e-9050-48a1-9da2-660752d77bba/generated/generated/openstack/juju-crashdump-openstack-2022-05-03-15.01.53.tar.gz

Controller Crashdump: https://oil-jenkins.canonical.com/artifacts/6834065e-9050-48a1-9da2-660752d77bba/generated/generated/juju_maas_controller/juju-crashdump-controller-2022-05-03-15.12.42.tar.gz

Testrun: https://solutions.qa.canonical.com/testruns/testRun/6834065e-9050-48a1-9da2-660752d77bba
All the logs for the run can be found via "view artifacts for this run" at the bottom of the page.

Revision history for this message
Ian Booth (wallyworld) wrote :

We can see that the ovn-chassis side had run the relation hooks, but not the ovn-central side.

We need some additional information to diagnose this further - a database dump and extra tracing.

If possible, "juju dump-db -m <model>" would be good (you need JUJU_DEV_FEATURE_FLAGS set to "developer-mode"). But at the least, a json dump of the collections is needed.

Also required is trace level logging for these loggers:

juju.state.relationunits
juju.worker.uniter.operation
juju.worker.uniter.resolved
juju.worker.uniter.relation

The above will help us diagnose why units on one side of the relation do not get to run the relation hooks.
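
A rough sketch of how to capture both (the logging-config value follows the "<logger>=<level>" syntax; whether the juju.state logger also needs to be raised on the controller model is an assumption I have not verified):

juju model-config -m <model> logging-config="<root>=INFO;juju.state.relationunits=TRACE;juju.worker.uniter.operation=TRACE;juju.worker.uniter.resolved=TRACE;juju.worker.uniter.relation=TRACE"
JUJU_DEV_FEATURE_FLAGS=developer-mode juju dump-db -m <model>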

Changed in juju:
status: New → Incomplete
John A Meinel (jameinel)
Changed in juju:
importance: Undecided → High
milestone: none → 2.9-next
status: Incomplete → Triaged
John A Meinel (jameinel)
Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Alexander Balderson (asbalderson) wrote :

Following up with some logs.
There are two major places we hit this: one is with ha cluster and the other is with mysql-innodb-router.

In these logs (with the extra logging above), vault-mysql-router is stuck waiting for the shared-db relation, but everything else reached a stable state:

crashdump: https://oil-jenkins.canonical.com/artifacts/09759a4c-0088-4bf4-9c97-82288f85f940/generated/generated/openstack/juju-crashdump-openstack-2022-09-06-19.30.38.tar.gz
db-dump: https://oil-jenkins.canonical.com/artifacts/09759a4c-0088-4bf4-9c97-82288f85f940/generated/generated/openstack/juju-dump-db-openstack-2022-09-06-19.30.38.tar.gz

testrun: https://solutions.qa.canonical.com/testruns/testRun/09759a4c-0088-4bf4-9c97-82288f85f940

Changed in juju:
status: Incomplete → New
Changed in juju:
status: New → Triaged
assignee: nobody → Joseph Phillips (manadart)
Revision history for this message
Joseph Phillips (manadart) wrote :

Was there a controller crash-dump for the run above?

John A Meinel (jameinel)
Changed in juju:
assignee: Joseph Phillips (manadart) → nobody
milestone: 2.9-next → 2.9.38
Revision history for this message
Jeffrey Chang (modern911) wrote :
Changed in juju:
milestone: 2.9.38 → 2.9.39
Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
Changed in juju:
milestone: 2.9.39 → 2.9.40
Changed in juju:
assignee: Heather Lanigan (hmlanigan) → Joseph Phillips (manadart)
Changed in juju:
milestone: 2.9.40 → 2.9.41
Changed in juju:
milestone: 2.9.41 → 2.9.42
Changed in juju:
milestone: 2.9.42 → 2.9.43
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I don't think this is a charm bug as it's also reported as affecting other charms 'randomly'. Please feel free to re-open if you have more information showing that it's more likely to be a charm bug.

Changed in charm-mysql-innodb-cluster:
status: New → Invalid
Changed in charm-percona-cluster:
status: New → Invalid
Changed in charm-placement:
status: New → Invalid
Changed in juju:
milestone: 2.9.43 → 2.9.44
Changed in juju:
milestone: 2.9.44 → 2.9.45
Revision history for this message
Jeffrey Chang (modern911) wrote :

Checked a handful of recent occurrences with Juju 2.9.43/44 on OpenStack Yoga.
The applications that have "incomplete relations" are the same for all of them.

App           Version  Status   Scale  Charm         Channel        Rev  Exposed  Message
ceilometer    18.0.0   blocked  3      ceilometer    yoga/stable    527  no       Incomplete relations: database, Run the ceilometer-upgrade action on the leader to initialize ceilometer and gnocchi
ceph-radosgw  17.2.5   waiting  3      ceph-radosgw  quincy/stable  548  no       Incomplete relations: mon
glance        24.2.0   waiting  3      glance        yoga/stable    562  no       Incomplete relations: storage-backend
nova-compute  25.1.1   waiting  6      nova-compute  yoga/stable    664  no       Incomplete relations: storage-backend, vault

Changed in juju:
milestone: 2.9.45 → 2.9.46
Revision history for this message
Ian Booth (wallyworld) wrote :

The next 2.9.46 candidate release will not include a fix for this bug and we don't plan on any more 2.9 releases. As such it is being removed from its 2.9 milestone.

If the bug is still important to you, let us know and we can consider it for inclusion on a 3.x milestone.

Changed in juju:
milestone: 2.9.46 → none
Revision history for this message
Jeffrey Chang (modern911) wrote :

@wallyworld
SolQA still see this in many test runs with Juju 3.x, like 2~3 times a week.
Please reconsider this for 3.x.

You can find all the runs at https://solutions.qa.canonical.com/bugs/1967177

Revision history for this message
Ian Booth (wallyworld) wrote :

Thanks for the input. We'll see how future test runs behave, and once we can gather enough evidence to understand the root cause, we can assign the bug to the next release milestone.

Revision history for this message
Andy Speagle (aspeagle) wrote :

This is affecting us as well... I can see clearly in the juju logs on the unit that it fails to connect:

2024-02-29 00:27:01 ERROR unit.placement-mysql-router/5.juju-log server.go:316 db-router:70: Failed to bootstrap mysqlrouter: Error: Unable to connect to the metadata server: Error connecting to MySQL server at 10.100.33.27:0: Access denied for user 'mysqlrouteruser'@'10.100.33.2' (using password: YES) (1045)

But looking in the database... the grants look right. At the end of all of this we even see an established connection to the RW member of the DB cluster.
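
For what it's worth, the grant check was essentially the following against the RW primary (the address and user come from the error above; running it as root over the network is an assumption about this deployment):

mysql -h 10.100.33.27 -u root -p -e "SELECT user, host FROM mysql.user WHERE user = 'mysqlrouteruser';"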

But... it's clearly not progressing beyond the bootstrap:
[metadata_cache:bootstrap]
cluster_type = gr
router_id = 35
user = mysql_router35_jboetej22v83
metadata_cluster = jujuCluster
ttl = 0.5
auth_cache_ttl = -1
auth_cache_refresh_interval = 2
use_gr_notifications = 0

[routing:bootstrap_rw]
bind_address = 127.0.0.1
bind_port = 3306
socket = /var/lib/mysql/placement-mysql-router/mysql.sock
destinations = metadata-cache://jujuCluster/?role=PRIMARY
routing_strategy = first-available
protocol = classic

[routing:bootstrap_ro]
bind_address = 127.0.0.1
bind_port = 3307
socket = /var/lib/mysql/placement-mysql-router/mysqlro.sock
destinations = metadata-cache://jujuCluster/?role=SECONDARY
routing_strategy = round-robin-with-fallback
protocol = classic

[routing:bootstrap_x_rw]
bind_address = 127.0.0.1
bind_port = 3308
socket = /var/lib/mysql/placement-mysql-router/mysqlx.sock
destinations = metadata-cache://jujuCluster/?role=PRIMARY
routing_strategy = first-available
protocol = x

[routing:bootstrap_x_ro]
bind_address = 127.0.0.1
bind_port = 3309
socket = /var/lib/mysql/placement-mysql-router/mysqlxro.sock
destinations = metadata-cache://jujuCluster/?role=SECONDARY
routing_strategy = round-robin-with-fallback
protocol = x
