Newly added units try to join a new cluster

Bug #2064127 reported by macchese

Affects                     Status   Importance  Assigned to  Milestone
MySQL InnoDB Cluster Charm  Triaged  Undecided   Unassigned   -
MySQL Router Charm          Triaged  Undecided   Unassigned   -
vault-charm                 Invalid  Undecided   Unassigned   -

Bug Description

bundle: openstack-2023.2
vault charm: 1.8/stable rev 209
ubuntu: 22.04

I have a 3-node Vault cluster, but after restarting my etcd containers, every new Vault unit I add joins a new cluster instead of the existing one.

What happened: I had 3 Vault units ready and clustered. I restarted the etcd nodes and 2 Vault units dropped out of the cluster, so I removed them and then added another 2 units, but they never joined the Vault cluster.

The newly added units remain stuck in "Vault needs to be initialized", whereas I would expect "unit is sealed".
I also tried removing the relations with etcd and recreating them, but it didn't work.

Looking at the logs of a newly added unit, it seems to try to join a new cluster with the other new units, even though they are all sealed:

unit-vault-15: 13:51:44 INFO unit.vault/15.juju-log certificates:346: Invoking reactive handler: hooks/relations/tls-certificates/provides.py:45:joined:certificates
unit-vault-15: 13:51:45 INFO juju.worker.uniter.operation ran "certificates-relation-joined" hook (via explicit, bespoke hook script)
unit-vault-15: 13:51:46 INFO unit.vault/15.juju-log certificates:397: Reactive main running for hook certificates-relation-changed
unit-vault-15: 13:51:47 ERROR unit.vault/15.juju-log certificates:397: Unable to find implementation for relation: peers of vault-ha
unit-vault-15: 13:51:48 INFO unit.vault/15.juju-log certificates:397: Initializing Snap Layer
unit-vault-15: 13:51:48 INFO unit.vault/15.juju-log certificates:397: Initializing Leadership Layer (is follower)
unit-vault-15: 13:51:48 INFO unit.vault/15.juju-log certificates:397: Invoking reactive handler: reactive/vault_handlers.py:360:mysql_setup
unit-vault-15: 13:51:48 INFO unit.vault/15.juju-log certificates:397: Invoking reactive handler: reactive/vault_handlers.py:391:database_not_ready
unit-vault-15: 13:51:48 INFO unit.vault/15.juju-log certificates:397: Invoking reactive handler: reactive/vault_handlers.py:481:cluster_connected
unit-vault-15: 13:51:48 INFO unit.vault/15.juju-log certificates:397: Invoking reactive handler: reactive/vault_handlers.py:743:prime_assess_status
unit-vault-15: 13:51:48 INFO unit.vault/15.juju-log certificates:397: Invoking reactive handler: reactive/vault_handlers.py:1120:sync_cert_from_cache
unit-vault-15: 13:51:48 INFO unit.vault/15.juju-log certificates:397: Couldn't get the chain from vault. Reason: Vault is sealed, on post http://127.0.0.1:8220/v1/auth/approle/login

These are the statuses of the 3 units; the first is the old one in the existing cluster and the other 2 are the new ones:

Key Value
--- -----
Seal Type shamir
Initialized true
Sealed false
Total Shares 5
Threshold 3
Version 1.8.8
Build Date n/a
Storage Type mysql
Cluster Name vault-cluster-e13ea140
Cluster ID 7fa4c481-1262-bf43-44fd-6819a2bc79ca
HA Enabled true
HA Cluster https://192.168.70.81:8201
HA Mode active
Active Since 2024-04-28T14:29:22.441820395Z

Key Value
--- -----
Seal Type shamir
Initialized false
Sealed true
Total Shares 0
Threshold 0
Unseal Progress 0/0
Unseal Nonce n/a
Version 1.8.8
Build Date n/a
Storage Type raft
HA Enabled true

Key Value
--- -----
Seal Type shamir
Initialized false
Sealed true
Total Shares 0
Threshold 0
Unseal Progress 0/0
Unseal Nonce n/a
Version 1.8.8
Build Date n/a
Storage Type raft
HA Enabled true

I also tried to unseal the new units, but it failed because they are not initialized.
It is also strange that their storage type is raft and not mysql like on the first unit.
So, how can I add new units to the existing cluster?

Alex Kavanagh (ajkavanagh) wrote:

Hi @macchese

Please could you add a juju_status.txt of the model and the logs from the vault units? Also, "juju show-unit" output for each of the vault units would be very helpful. This might explain why the vault units are not clustering.
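
For reference, these are roughly the commands that would produce the requested artefacts (a sketch, assuming the application is named vault and using the unit numbers from this model):

$ juju status --format=yaml > juju_status.txt
$ juju debug-log --replay --include vault > vault.log
$ juju show-unit vault/5 vault/14 vault/15 > vault-show-unit.txt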

Thanks.

Changed in vault-charm:
status: New → Incomplete
macchese (max-liccardo) wrote:

vault/5 is the "old" clustered unit; vault/14 and vault/15 are the newly added units that formed a new cluster.

juju status vault
https://paste.ubuntu.com/p/xmXrxNw9dS/

juju show-unit vault/5 vault/14 vault/15 |pastebinit
https://paste.ubuntu.com/p/5xPHvGqHVF/

ubuntu@juju:~$ juju debug-log --include vault/15
unit-vault-15: 14:48:04 INFO unit.vault/15.juju-log Reactive main running for hook update-status
unit-vault-15: 14:48:04 ERROR unit.vault/15.juju-log Unable to find implementation for relation: peers of vault-ha
unit-vault-15: 14:48:06 INFO unit.vault/15.juju-log Initializing Snap Layer
unit-vault-15: 14:48:06 INFO unit.vault/15.juju-log Initializing Leadership Layer (is follower)
unit-vault-15: 14:48:06 INFO unit.vault/15.juju-log Invoking reactive handler: reactive/vault_handlers.py:148:update_status
unit-vault-15: 14:48:06 INFO unit.vault/15.juju-log Invoking reactive handler: reactive/vault_handlers.py:167:check_really_is_update_status
unit-vault-15: 14:48:06 INFO unit.vault/15.juju-log Invoking reactive handler: reactive/vault_handlers.py:391:database_not_ready
unit-vault-15: 14:48:06 INFO unit.vault/15.juju-log Invoking reactive handler: reactive/vault_handlers.py:743:prime_assess_status
unit-vault-15: 14:48:06 INFO unit.vault/15.juju-log Invoking reactive handler: hooks/relations/tls-certificates/provides.py:45:joined:certificates
unit-vault-15: 14:48:07 INFO juju.worker.uniter.operation ran "update-status" hook (via explicit, bespoke hook script)
^C
ubuntu@juju:~$ juju debug-log --include vault/14
unit-vault-14: 14:48:38 INFO unit.vault/14.juju-log Reactive main running for hook update-status
unit-vault-14: 14:48:38 ERROR unit.vault/14.juju-log Unable to find implementation for relation: peers of vault-ha
unit-vault-14: 14:48:39 INFO unit.vault/14.juju-log Initializing Snap Layer
unit-vault-14: 14:48:39 INFO unit.vault/14.juju-log Initializing Leadership Layer (is follower)
unit-vault-14: 14:48:40 INFO unit.vault/14.juju-log Invoking reactive handler: reactive/vault_handlers.py:148:update_status
unit-vault-14: 14:48:40 INFO unit.vault/14.juju-log Invoking reactive handler: reactive/vault_handlers.py:167:check_really_is_update_status
unit-vault-14: 14:48:40 INFO unit.vault/14.juju-log Invoking reactive handler: reactive/vault_handlers.py:391:database_not_ready
unit-vault-14: 14:48:40 INFO unit.vault/14.juju-log Invoking reactive handler: reactive/vault_handlers.py:743:prime_assess_status
unit-vault-14: 14:48:40 INFO unit.vault/14.juju-log Invoking reactive handler: hooks/relations/tls-certificates/provides.py:45:joined:certificates
unit-vault-14: 14:48:41 INFO juju.worker.uniter.operation ran "update-status" hook (via explicit, bespoke hook script)
^C
ubuntu@juju:~$ juju debug-log --include vault/5
unit-vault-5: 14:49:17 ERROR unit.vault/5.juju-log Unable to find implementation for relation: peers of vault-ha
unit-vault-5: 14:49:18 INFO unit.vault/5.juju-log Initializing Snap Layer
unit-vault-5: 14:49:18 INFO unit.vault/5.juju-log Initializing Leadership Layer (is leader)
unit-vault-5: 14:49:19 INFO unit.vault/5.juju-log Invoking reactive handler: reactive/vault_handlers.py:148:update_status
unit-vault-5: 14:49:19 INFO unit.vault/...

Alex Kavanagh (ajkavanagh) wrote:

Thanks for the extra info, but I need to see the entire logs for the units, e.g. "juju debug-log --replay"; we need to work out why the unit isn't clustering properly.

Also, I noticed that the vault units are now /14 and /15. Are you having other issues that are causing you to remove/add units frequently? One thing I'm slightly curious about is whether IP addresses have perhaps been recycled.

macchese (max-liccardo) wrote:

juju debug-log --replay --include vault >vault.log
https://paste.ubuntu.com/p/CJVm4BWbdv/

I removed a number of units in order to recreate the cluster; the first unit, vault/5, was never removed.

Alex Kavanagh (ajkavanagh) wrote:

unit-vault-14:

unit-vault-14: 13:29:56 INFO unit.vault/14.juju-log cluster:295: Invoking reactive handler: reactive/vault_handlers.py:523:join_raft_peers
unit-vault-14: 13:29:56 INFO unit.vault/14.juju-log cluster:295: Joining raft cluster address http://192.168.70.81:8200

Then at:

unit-vault-14: 13:34:14 INFO unit.vault/14.juju-log shared-db:324: Reactive main running for hook shared-db-relation-joined
...
unit-vault-14: 13:34:16 INFO unit.vault/14.juju-log shared-db:324: Invoking reactive handler: reactive/vault_handlers.py:360:mysql_setup
unit-vault-14: 13:34:16 INFO unit.vault/14.juju-log shared-db:324: Invoking reactive handler: reactive/vault_handlers.py:391:database_not_ready

then:

unit-vault-14: 13:35:54 INFO unit.vault/14.juju-log shared-db:324: Reactive main running for hook shared-db-relation-changed

However, the relation-changed processing never completes.

Looking at the juju show-unit output, the shared-db relation never gets allowed_units for vault/14 and vault/15:

related-units:
      vault-mysql-router/6:
        in-scope: true
        data:
          allowed_units: vault/5
          db_host: 127.0.0.1
          db_port: "3306"
          egress-subnets: 192.168.70.81/32
          ingress-address: 192.168.70.81
          password: ***
          private-address: 192.168.70.81
          wait_timeout: "3600"
      vault-mysql-router/481:
        in-scope: true
        data:
          db_host: 127.0.0.1
          db_port: "3306"
          egress-subnets: 192.168.70.166/32
          ingress-address: 192.168.70.166
          password: ***
          private-address: 192.168.70.166
          wait_timeout: "3600"
      vault-mysql-router/482:
        in-scope: true
        data:
          db_host: 127.0.0.1
          db_port: "3306"
          egress-subnets: 192.168.70.167/32
          ingress-address: 192.168.70.167
          password: ***
          private-address: 192.168.70.167
          wait_timeout: "3600"

Thus, I think the problem may actually be with either the mysql-router or the mysql-innodb-cluster charm.
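
A quick way to confirm this from a vault unit would be something like the following (a sketch assuming Juju 3.x and the shared-db:324 relation id that appears in the vault/14 logs above); empty output for the two new router units would confirm that allowed_units was never set for them:

$ juju exec --unit vault/14 -- relation-ids shared-db
$ juju exec --unit vault/14 -- relation-get -r shared-db:324 allowed_units vault-mysql-router/6
$ juju exec --unit vault/14 -- relation-get -r shared-db:324 allowed_units vault-mysql-router/481
$ juju exec --unit vault/14 -- relation-get -r shared-db:324 allowed_units vault-mysql-router/482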

Please could you add a FULL juju_status.yaml to the bug (i.e. --format=yaml) so that all the versions and relations can be inspected.

Thanks.

macchese (max-liccardo) wrote:

$ juju status --relations |pastebinit
https://paste.ubuntu.com/p/NbmsFnDRPd/

$ juju export-bundle |pastebinit
https://paste.ubuntu.com/p/9sXSt2pGDw/

I receive this error when I run:
$ juju status --format=yaml
{}
ERROR cannot list storage details: getting details for storage block-devices/1: volume for storage instance "block-devices/1" not found

Alex Kavanagh (ajkavanagh) wrote:

> I receive this error when I run:
> $ juju status --format=yaml
> {}
> ERROR cannot list storage details: getting details for storage block-devices/1: volume for storage instance "block-devices/1" not found

This is a bit concerning. Does your client juju version match your controller version?

Alex Kavanagh (ajkavanagh) wrote:

Okay, for the next bit of debugging: the "juju show-unit" output for the mysql-innodb-cluster units, and for the 3 vault-mysql-router units, please. This will let us see what's going on on the relations between the units.

Essentially, what is happening is:

1. The new units are initialising to raft as the mysql relation isn't connected.
2. When the mysql shared-db relation connects, the relevant information is sent back except for the allowed units.
3. Thus, the mysql relation never becomes "complete" and so the charm doesn't switch to the mysql cluster (which is what it should do).

macchese (max-liccardo) wrote:

ubuntu@juju:~$ snap list juju
Name Version Rev Tracking Publisher Notes
juju 3.4.2 26968 3/stable canonical✓ -

ubuntu@juju:~$ juju status mysql-innodb-cluster vault|pastebinit
https://paste.ubuntu.com/p/p325mwfmgr/

ubuntu@juju:~$ juju show-unit mysql-innodb-cluster/13|pastebinit
https://paste.ubuntu.com/p/Jpb8fJXJRb/

ubuntu@juju:~$ juju show-unit mysql-innodb-cluster/14|pastebinit
https://paste.ubuntu.com/p/rZTwZbyk2y/

ubuntu@juju:~$ juju show-unit mysql-innodb-cluster/15|pastebinit
https://paste.ubuntu.com/p/VqnQnYCmgW/

ubuntu@juju:~$ juju show-unit vault-mysql-router/6|pastebinit
https://paste.ubuntu.com/p/jhn7npPcGG/

ubuntu@juju:~$ juju show-unit vault-mysql-router/481|pastebinit
https://paste.ubuntu.com/p/nCSxyc7xv4/

ubuntu@juju:~$ juju show-unit vault-mysql-router/482|pastebinit
https://paste.ubuntu.com/p/z4kqWfCvHM/

P.S.: if useful, I can send a dump of the vault MySQL DB.

macchese (max-liccardo) wrote (last edit):

Hi Alex,
looking into one of the mysql-router units' configuration, I see that MRUP_allowed_units and mysqlrouter_allowed_units are not the same on all the mysql-innodb-cluster units, and even some removed units still appear (vault-mysql-router/4, vault-mysql-router/5):

$ juju show-unit vault-mysql-router/6 | egrep "mysql-innodb-cluster|MRUP_allowed_units|mysqlrouter_allowed_units"

mysql-innodb-cluster/13:
  MRUP_allowed_units: '"vault-mysql-router/4 vault-mysql-router/5 vault-mysql-router/6"'
  mysqlrouter_allowed_units: '"vault-mysql-router/4 vault-mysql-router/5 vault-mysql-router/6"'
mysql-innodb-cluster/14:
  MRUP_allowed_units: '"vault-mysql-router/6 vault-mysql-router/481"'
  mysqlrouter_allowed_units: '"vault-mysql-router/481 vault-mysql-router/482"'
mysql-innodb-cluster/15:
  MRUP_allowed_units: '"vault-mysql-router/4 vault-mysql-router/5 vault-mysql-router/6"'
  mysqlrouter_allowed_units: '"vault-mysql-router/4 vault-mysql-router/5 vault-mysql-router/6"'

What if I remove the currently well-clustered unit (vault/5), add another one, and then initialise the vault?

Alex Kavanagh (ajkavanagh) wrote:

This looks to be a bug in the mysql-innodb-cluster charm, in that there are grants left over in the tables for various previous routers' IP addresses. E.g. if you log in to the database using the root password [1] and then do:

SHOW GRANTS for 'vault'@'192.168.70.81'

it will list the grants to the database for that user at that IP address. I suspect that, with all the adding and removing of vault mysql-router instances, it has got out of sync. It may be useful to have a look at the database and see what it says.

[1] to connect to the DB:

$ juju exec --unit <mysql-app>/<unit> -- leader-get | grep "mysql.passwd"
$ juju ssh <mysql-app>/<unit>

then:

mysql -u root -p   (enter the password from the first command at the prompt)
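
For example, with the units in this deployment (a sketch assuming Juju 3.x syntax and that the leader setting is named mysql.passwd, as implied above):

$ juju exec --unit mysql-innodb-cluster/13 -- leader-get mysql.passwd
$ juju ssh mysql-innodb-cluster/13
$ mysql -u root -p     # paste the password from the first command at the prompt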

macchese (max-liccardo) wrote:

mysql> SHOW GRANTS for 'vault'@'192.168.70.81';
+--------------------------------------------------------------+
| Grants for vault@192.168.70.81 |
+--------------------------------------------------------------+
| GRANT USAGE ON *.* TO `vault`@`192.168.70.81` |
| GRANT ALL PRIVILEGES ON `vault`.* TO `vault`@`192.168.70.81` |
+--------------------------------------------------------------+
2 rows in set (0.00 sec)

mysql> SHOW GRANTS for 'vault'@'192.168.70.166';
+---------------------------------------------------------------+
| Grants for vault@192.168.70.166 |
+---------------------------------------------------------------+
| GRANT USAGE ON *.* TO `vault`@`192.168.70.166` |
| GRANT ALL PRIVILEGES ON `vault`.* TO `vault`@`192.168.70.166` |
+---------------------------------------------------------------+
2 rows in set (0.00 sec)

mysql> SHOW GRANTS for 'vault'@'192.168.70.167';
+---------------------------------------------------------------+
| Grants for vault@192.168.70.167 |
+---------------------------------------------------------------+
| GRANT USAGE ON *.* TO `vault`@`192.168.70.167` |
| GRANT ALL PRIVILEGES ON `vault`.* TO `vault`@`192.168.70.167` |
+---------------------------------------------------------------+
2 rows in set (0.00 sec)

macchese (max-liccardo) wrote:

mysql> SELECT User, Host FROM mysql.user where Host ='192.168.70.81';
+-----------------+---------------+
| User | Host |
+-----------------+---------------+
| mysqlrouteruser | 192.168.70.81 |
| vault | 192.168.70.81 |
+-----------------+---------------+
2 rows in set (0.00 sec)

mysql> SELECT User, Host FROM mysql.user where Host ='192.168.70.166';
+-----------------+----------------+
| User | Host |
+-----------------+----------------+
| mysqlrouteruser | 192.168.70.166 |
| vault | 192.168.70.166 |
+-----------------+----------------+
2 rows in set (0.00 sec)

mysql> SELECT User, Host FROM mysql.user where Host ='192.168.70.167';
+-----------------+----------------+
| User | Host |
+-----------------+----------------+
| mysqlrouteruser | 192.168.70.167 |
| vault | 192.168.70.167 |
+-----------------+----------------+
2 rows in set (0.00 sec)

Alex Kavanagh (ajkavanagh) wrote:

Hi Max

That's really interesting; thanks for posting the grants, etc. from the mysql tables. It turns out my intuition was completely wrong and the tables are fine. Therefore, the problem looks like it may be in the mysql-innodb-cluster/mysql-router charms, in terms of cached data on the relations, and the vault charm therefore isn't doing the final step of switching to the mysql database; i.e. the shared-db relation data between vault unit <-> mysql-router <-> mysql-innodb-cluster isn't an accurate picture of what's actually in the system.

I'm going to switch the bug from charm-vault to charm-mysql-router & charm-mysql-innodb-cluster, but I'm still not exactly sure what is going on.

To recap: removing units and then adding new ones is causing incorrect information to appear in the shared-db relation, which causes the vault charm to make the wrong choice about which clustering technique to use: raft (the default) or mysql (used when the shared-db relation has a complete data set).

Thank you again for your patience in diagnosing this issue.

Changed in vault-charm:
status: Incomplete → Invalid
Changed in charm-mysql-router:
status: New → Triaged
Changed in charm-mysql-innodb-cluster:
status: New → Triaged
macchese (max-liccardo) wrote:

Hi Alex,
thank you very much.

I think it all started when I had to force-remove all the etcd units and then added another 3 of them. From that point the vault units stopped working, so I removed two vault units and then added 2 units again.

I'm sorry, but now I have to recover from this situation because my OpenStack system is in production and we need to set up every application in HA mode (where possible).
Do you have any tips for setting up, even from scratch, a 3-unit Vault HA deployment? I'm thinking of removing the "old" vault unit, adding another unit and then initialising and unsealing the 3 new vault units: what do you think?
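
A minimal sketch of that plan in commands (assuming the application is named vault, the vault binary from the snap, and the default API port 8200; note that re-initialising creates a brand new Vault with new unseal keys, so please check the charm documentation before doing this in production):

$ juju remove-unit vault/5
$ juju add-unit vault                        # back to 3 units in total
$ juju ssh vault/16                          # illustrative unit number
$ export VAULT_ADDR=http://127.0.0.1:8200    # port assumed; adjust if the charm listens elsewhere
$ /snap/bin/vault operator init -key-shares=5 -key-threshold=3
$ /snap/bin/vault operator unseal            # run 3 times with 3 different keys, on each of the 3 units
# the charm then needs to be authorised again (see the vault charm's authorize-charm action)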

macchese (max-liccardo) wrote:

Hi Alex,
might https://bugs.launchpad.net/charm-mysql-router/+bug/1861523 be the problem?
Maybe mysql-router tries to communicate with MySQL using the VIP and fails because no GRANT is configured for the VIP?

Alex Kavanagh (ajkavanagh) wrote:

Hi Max

> might be https://bugs.launchpad.net/charm-mysql-router/+bug/1861523 the problem?

I don't think that's the problem. Essentially, from the debug output you've posted, the information that the mysql-router instances are presenting to the mysql-innodb-cluster charm across the relation looks absolutely correct. However, what the mysql-innodb-cluster charm is sending back on the relation is completely wrong. The actual database grants are also completely correct.

In order to fix it we need to correct the relation data, but I'm worried that if we just do a "juju exec -u ... -- relation-set ..." it may get overwritten at the next hook, thus breaking things again. I'd really like to get down to the actual cause, so I'm running a minimal model to re-create the situation. I'm hoping to have some further facts about this issue soon.

macchese (max-liccardo) wrote:

Hi Alex,
I set up an LXD model using mysql (https://charmhub.io/mysql) instead of mysql-innodb-cluster and it always worked, even when I removed and added a number of vault units. Maybe the mysql charm is "better" than the innodb-cluster one?
This was my test bundle:
https://paste.ubuntu.com/p/88RGccxWbk/

Alex Kavanagh (ajkavanagh) wrote (last edit):

Hi Max

I'm now fairly certain what the issue is: stale data on the unit relations from mysql-innodb-cluster non-leader units, causing the mysql-router instances to not pick up the right information.

The juju show-unit outputs for the mysql-innodb-cluster units show different MRUP_allowed_units and mysqlrouter_allowed_units values across the 3 mysql-innodb-cluster units. I checked which of the mysql-innodb-cluster units is the leader, and it's not the first one; it's the one with the most up-to-date values. Unfortunately, I don't think that mysql-router aggregates the data across the units, and thus I think it's picking the first one.
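
For completeness, leadership can be checked like this (unit numbers as in this model; juju status also marks the leader unit with an asterisk):

$ juju exec --unit mysql-innodb-cluster/13 -- is-leader    # prints True on the leader
$ juju exec --unit mysql-innodb-cluster/14 -- is-leader
$ juju exec --unit mysql-innodb-cluster/15 -- is-leader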

It took me a bit too long to realise, but it's basically this error: https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1989505

I've been able to replicate the issue by removing a unit (which still leaves the unit in the relation data), then forcing a leadership change on the mysql-innodb-cluster app, and then adding a new unit; this results in inconsistent data.

We can 'fix' your mysql-innodb-cluster installation by doing:

$ juju exec -u <mysql-router-app>/<unit> -- relation-get -r <rel-id> - mysql-innodb-cluster/<leader-unit-id>
MRUP_allowed_units: '"mysql-router/0 mysql-router/2 mysql-router/7"'
MRUP_password: '"rmMGyj9Nb7PHjtdTJ4TT3b3mB94m69Bf"'
db_host: '"172.20.0.8"'
egress-subnets: 172.20.0.8/32
ingress-address: 172.20.0.8
mysqlrouter_allowed_units: '"mysql-router/0 mysql-router/2 mysql-router/7"'
mysqlrouter_password: '"8CL2rdgZqGjyJwB3g9RfJ5spXh5smZ2r"'
private-address: 172.20.0.8
wait_timeout: "3600"

# then set a local variable.
$ allowed_units="mysql-router/0 mysql-router/2 mysql-router/7"

# finally fix the relation data for the mysql-innodb-cluster units
$ juju exec -u mysql-innodb-cluster/<non-leader-unit> -- relation-set -r <rel-id> MRUP_allowed_units="\"$allowed_units\"" mysqlrouter_allowed_units="\"$allowed_units\""
$ #... same again for the other unit.

<rel-id> is the relation id, which in your system is currently 332, based on the show-unit output for the mysql-innodb-cluster units (the router-db relation to vault-mysql-router).

Note that the relation data values MUST be double-quoted (") in the relation data, hence the "\"...\"" escaping to force a " on the front and end of the string.

This should fix the relation data and allow the vault units to form back into the mysql cluster.
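
Applied to this deployment, a sketch would look like the following, using relation id 332 and the three current router units; the non-leader mysql-innodb-cluster unit numbers still need to be substituted after checking which unit is the leader, and the allowed_units value should be taken from the leader's relation data as shown above:

$ allowed_units="vault-mysql-router/6 vault-mysql-router/481 vault-mysql-router/482"
$ juju exec --unit mysql-innodb-cluster/<non-leader-1> -- relation-set -r 332 MRUP_allowed_units="\"$allowed_units\"" mysqlrouter_allowed_units="\"$allowed_units\""
$ juju exec --unit mysql-innodb-cluster/<non-leader-2> -- relation-set -r 332 MRUP_allowed_units="\"$allowed_units\"" mysqlrouter_allowed_units="\"$allowed_units\""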

The other way of fixing this is to force a leader election such that the leader ends up back on the first unit; the related units will then pick up the 'right' data from the first relation they see (numerically ascending) and start working properly. Obviously, it's tricky to achieve this, so I'd just go for the first solution.

Finally, re: the new mysql charm: yes, it's a newer charm (completely re-written) for managing mysql, but we've not actually validated it with OpenStack yet! Thus, I'm thrilled it worked, but can't say how it would be supported until we can get around to validating it.

We obviously do have to fix this bug in the mysql-innodb-cluster charm because there are lots of installations that could potentially be hit by it.
