incomplete relations too often

Bug #1967177 reported by Rodrigo Barbieri
This bug affects 5 people
Affects                          Status   Importance  Assigned to      Milestone
Canonical Juju                   Triaged  High        Joseph Phillips
MySQL InnoDB Cluster Charm       Invalid  Undecided   Unassigned
OpenStack Percona Cluster Charm  Invalid  Undecided   Unassigned
OpenStack Placement Charm        Invalid  Undecided   Unassigned

Bug Description

Juju being used: 2.9.27

When deploying the attached bundle, I consistently get incomplete relations errors between several units and mysql, despite the relations being declared correctly in the bundle.

The only solution I found is to remove the relation and add it again. IMO this is not an acceptable solution because it causes outages. The root cause of the problem should be addressed.

In certain other environments I don't get the same error and everything works fine.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

See the juju status attached for an example. The affected apps are usually random; most of the time placement and vault are affected, sometimes cinder and glance.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Added charm-percona-cluster as affected as well, because all the affected units seem to be affected in how they relate to percona-cluster (mysql in my bundle).

tags: added: sts
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

To fix the deployment whose juju status I pasted, I had to remove the relations between vault and mysql, placement and mysql, and keystone and mysql, and re-add all of them (in the same order, one by one). Those are the only steps I took to fix all the issues in that juju status, and the deployment is now working fine.
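
For reference, a sketch of what that workaround looks like (the application names match my bundle; endpoints are omitted on the assumption that each pair has only one matching relation):

juju remove-relation vault mysql
juju add-relation vault mysql
juju remove-relation placement mysql
juju add-relation placement mysql
juju remove-relation keystone mysql
juju add-relation keystone mysql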

description: updated
Revision history for this message
Alan Baghumian (alanbach) wrote (last edit ):

This exact same issue happens when adding new placement units. Tested with charm rev. 32 on focal/xena.

affects: charm-percona-cluster → charm-placement
Revision history for this message
Ian Booth (wallyworld) wrote :

What does "juju status --relations" show? We need to determine whether the relations are "incomplete" because juju isn't making them properly, ie they do not show as "Joined" in status, or is it that charms aren't properly processing the async events which are triggered as part of standing up the model.

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

very good point, I should have attached a juju status --relations instead :facepalm:

Since I fixed my deployment in order to test something else, I will do that testing today; then I can tear it down, redeploy to hit the same issue, and attach a new juju status with --relations.

But in the meantime, what I can say is that I had checked juju status --relations and none of the relations (at that time) were stuck in "joining"; I had also checked the relation data using juju show-unit for the affected units.
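
For context, those checks were roughly the following (the unit name is just an example, not from this deployment):

juju status --relations
juju show-unit placement/0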

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Re-deployed env and captured juju status --relations (attached).

This time keystone was not affected by the issue. I also don't see the "joined"/"joining" values in the output.

tags: added: cdo-qa foundation-engine
tags: added: foundations-engine
removed: foundation-engine
Revision history for this message
Alexander Balderson (asbalderson) wrote :

We've been flagging test runs in the SQA lab with this bug lately, as we've seen a large number of cases where all but one unit in a service is up and running, and the one unit that isn't running says it's missing a relation.

Most often these services are mysql-routers, which never get the shared-db relation to finish or trigger a hook.

Here are some logs from a recent example, where neutron-api-mysql-router/0 (the leader unit) is waiting on "'shared-db' incomplete, Waiting for proxied DB creation from cluster" for a little over 3 hours.

We've recently added more logs to the juju-crashdump as well. Each unit's folder now has logs for juju show-machine and juju show-unit, as well as juju show-status-log.

Crashdump: https://oil-jenkins.canonical.com/artifacts/6834065e-9050-48a1-9da2-660752d77bba/generated/generated/openstack/juju-crashdump-openstack-2022-05-03-15.01.53.tar.gz

Controller Crashdump: https://oil-jenkins.canonical.com/artifacts/6834065e-9050-48a1-9da2-660752d77bba/generated/generated/juju_maas_controller/juju-crashdump-controller-2022-05-03-15.12.42.tar.gz

Testrun: https://solutions.qa.canonical.com/testruns/testRun/6834065e-9050-48a1-9da2-660752d77bba
All the logs for the run can be found via "view artifacts for this run" at the bottom of the page.

Revision history for this message
Ian Booth (wallyworld) wrote :

We can see that the ovn-chassis side had run the relation hooks, but not the ovn-central side.

We need some additional information to diagnose this further - a database dump and extra tracing.

If possible, "juju dump-db -m <model>" would be good (you need JUJU_DEV_FEATURE_FLAGS set to "developer-mode"). But at the least, a json dump of the collections is needed.

Also required is trace level logging for these loggers:

juju.state.relationunits
juju.worker.uniter.operation
juju.worker.uniter.resolved
juju.worker.uniter.relation

The above will help us diagnose why units on one side of the relation do not get to run the relation hooks.
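
A rough sketch of how to capture both (the logging-config value follows the "<logger>=<level>" syntax; whether the juju.state logger also needs to be raised on the controller model is an assumption I have not verified):

juju model-config -m <model> logging-config="<root>=INFO;juju.state.relationunits=TRACE;juju.worker.uniter.operation=TRACE;juju.worker.uniter.resolved=TRACE;juju.worker.uniter.relation=TRACE"
JUJU_DEV_FEATURE_FLAGS=developer-mode juju dump-db -m <model>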

Changed in juju:
status: New → Incomplete
John A Meinel (jameinel)
Changed in juju:
importance: Undecided → High
milestone: none → 2.9-next
status: Incomplete → Triaged
John A Meinel (jameinel)
Changed in juju:
status: Triaged → Incomplete
Revision history for this message
Alexander Balderson (asbalderson) wrote :

Following up with some logs.
There are two major places we hit this: one is with ha cluster and the other is with mysql-innodb-router.

In these logs (with the extra logging above), vault-mysql-router is stuck waiting for the shared-db relation, but everything else reached a stable state:

crashdump: https://oil-jenkins.canonical.com/artifacts/09759a4c-0088-4bf4-9c97-82288f85f940/generated/generated/openstack/juju-crashdump-openstack-2022-09-06-19.30.38.tar.gz
db-dump: https://oil-jenkins.canonical.com/artifacts/09759a4c-0088-4bf4-9c97-82288f85f940/generated/generated/openstack/juju-dump-db-openstack-2022-09-06-19.30.38.tar.gz

testrun: https://solutions.qa.canonical.com/testruns/testRun/09759a4c-0088-4bf4-9c97-82288f85f940

Changed in juju:
status: Incomplete → New
Changed in juju:
status: New → Triaged
assignee: nobody → Joseph Phillips (manadart)
Revision history for this message
Joseph Phillips (manadart) wrote :

Was there a controller crash-dump for the run above?

John A Meinel (jameinel)
Changed in juju:
assignee: Joseph Phillips (manadart) → nobody
milestone: 2.9-next → 2.9.38
Revision history for this message
Jeffrey Chang (modern911) wrote :
Changed in juju:
milestone: 2.9.38 → 2.9.39
Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
Changed in juju:
milestone: 2.9.39 → 2.9.40
Changed in juju:
assignee: Heather Lanigan (hmlanigan) → Joseph Phillips (manadart)
Changed in juju:
milestone: 2.9.40 → 2.9.41
Changed in juju:
milestone: 2.9.41 → 2.9.42
Changed in juju:
milestone: 2.9.42 → 2.9.43
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I don't think this is a charm bug as it's also reported as affecting other charms 'randomly'. Please feel free to re-open if you have more information showing that it's more likely to be a charm bug.

Changed in charm-mysql-innodb-cluster:
status: New → Invalid
Changed in charm-percona-cluster:
status: New → Invalid
Changed in charm-placement:
status: New → Invalid
Changed in juju:
milestone: 2.9.43 → 2.9.44
Changed in juju:
milestone: 2.9.44 → 2.9.45
Revision history for this message
Jeffrey Chang (modern911) wrote :

Checked a handful of recent occurrences with Juju 2.9.43/44 on OpenStack Yoga.
The applications that have "incomplete relations" are the same for all of them.

App           Version  Status   Scale  Charm         Channel        Rev  Exposed  Message
ceilometer    18.0.0   blocked  3      ceilometer    yoga/stable    527  no       Incomplete relations: database, Run the ceilometer-upgrade action on the leader to initialize ceilometer and gnocchi
ceph-radosgw  17.2.5   waiting  3      ceph-radosgw  quincy/stable  548  no       Incomplete relations: mon
glance        24.2.0   waiting  3      glance        yoga/stable    562  no       Incomplete relations: storage-backend
nova-compute  25.1.1   waiting  6      nova-compute  yoga/stable    664  no       Incomplete relations: storage-backend, vault

Changed in juju:
milestone: 2.9.45 → 2.9.46
Revision history for this message
Ian Booth (wallyworld) wrote :

The next 2.9.46 candidate release will not include a fix for this bug and we don't plan on any more 2.9 releases. As such it is being removed from its 2.9 milestone.

If the bug is still important to you, let us know and we can consider it for inclusion on a 3.x milestone.

Changed in juju:
milestone: 2.9.46 → none
Revision history for this message
Jeffrey Chang (modern911) wrote :

@wallyworld
SolQA still see this in many test runs with Juju 3.x, like 2~3 times a week.
Please reconsider this for 3.x.

You can find all the runs at https://solutions.qa.canonical.com/bugs/1967177

Revision history for this message
Ian Booth (wallyworld) wrote :

Thanks for the input. We'll see how future test runs behave, and once we can gather enough evidence to understand the root cause, we can assign the bug to the next release milestone.

Revision history for this message
Andy Speagle (aspeagle) wrote :

This is affecting us as well... I can see clearly in the juju logs on the unit that it fails to connect:

2024-02-29 00:27:01 ERROR unit.placement-mysql-router/5.juju-log server.go:316 db-router:70: Failed to bootstrap mysqlrouter: Error: Unable to connect to the metadata server: Error connecting to MySQL server at 10.100.33.27:0: Access denied for user 'mysqlrouteruser'@'10.100.33.2' (using password: YES) (1045)

But looking in the database... the grants look right. At the end of all of this we even see an established connection to the RW member of the DB cluster.
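
For what it's worth, the grant check was essentially the following against the RW primary (the address and user come from the error above; running it as root over the network is an assumption about this deployment):

mysql -h 10.100.33.27 -u root -p -e "SELECT user, host FROM mysql.user WHERE user = 'mysqlrouteruser';"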

But... it's clearly not progressing beyond the bootstrap:
[metadata_cache:bootstrap]
cluster_type = gr
router_id = 35
user = mysql_router35_jboetej22v83
metadata_cluster = jujuCluster
ttl = 0.5
auth_cache_ttl = -1
auth_cache_refresh_interval = 2
use_gr_notifications = 0

[routing:bootstrap_rw]
bind_address = 127.0.0.1
bind_port = 3306
socket = /var/lib/mysql/placement-mysql-router/mysql.sock
destinations = metadata-cache://jujuCluster/?role=PRIMARY
routing_strategy = first-available
protocol = classic

[routing:bootstrap_ro]
bind_address = 127.0.0.1
bind_port = 3307
socket = /var/lib/mysql/placement-mysql-router/mysqlro.sock
destinations = metadata-cache://jujuCluster/?role=SECONDARY
routing_strategy = round-robin-with-fallback
protocol = classic

[routing:bootstrap_x_rw]
bind_address = 127.0.0.1
bind_port = 3308
socket = /var/lib/mysql/placement-mysql-router/mysqlx.sock
destinations = metadata-cache://jujuCluster/?role=PRIMARY
routing_strategy = first-available
protocol = x

[routing:bootstrap_x_ro]
bind_address = 127.0.0.1
bind_port = 3309
socket = /var/lib/mysql/placement-mysql-router/mysqlxro.sock
destinations = metadata-cache://jujuCluster/?role=SECONDARY
routing_strategy = round-robin-with-fallback
protocol = x
