[CMR - 2.7.7] remove-relation causes the offer to be terminated

Bug #1879645 reported by Peter Jose De Sousa
This bug affects 1 person
Affects                          Status        Importance  Assigned to  Milestone
Canonical Juju                   Fix Released  High        Ian Booth
  2.8                            Fix Released  High        Ian Booth
OpenStack Percona Cluster Charm  Invalid       Wishlist    Unassigned

Bug Description

Hello,

Problem:

When relating MySQL and Vault using CMR if the relation is removed the offer goes into a 'terminated' state.

Details:

ubuntu@lma-stack-2:~$ juju add-relation vault mysql
ubuntu@lma-stack-2:~$ juju add-relation vault mysql
ERROR cannot add relation "vault:shared-db mysql:shared-db": remote offer mysql is terminated
ubuntu@lma-stack-2:~$ juju add-relation vault mysql
ERROR cannot add relation "vault:shared-db mysql:shared-db": remote offer mysql is terminated
ubuntu@lma-stack-2:~$ juju add-relation vault mysql
ERROR cannot add relation "vault:shared-db mysql:shared-db": remote offer mysql is terminated
ubuntu@lma-stack-2:~$ juju add-relation

Looking at juju status, I can see that mysql flip-flops between "active" and "terminated". Attempting to remove the mysql offer throws another error:

ubuntu@lma-stack-2:~$ juju remove-offer admin/kubernetes-lma.mysql-shared-db
ERROR cannot delete application offer "mysql-shared-db": offer has 2 relations
ubuntu@lma-stack-2:~$

Steps to reproduce:

1. Deploy CMR bundle with MySQL in one model and vault in another with saas offers and consumers. (See bundles below)
2. Wait for the cluster to settle
3. Remove the relation between vault and mysql (watch "juju remove-relation vault mysql")
4. The relation should remove successfully
5. Attempt to re-add the relation.

I was unable to remove the offer and re-add it, and the relation never re-forms.

Bundles here:

https://pastebin.canonical.com/p/wQ2ffJ2y8C/

Workaround:

None at the time of writing.

Peter Jose De Sousa (pjds) wrote :

Subscribing field critical

Peter Jose De Sousa (pjds) wrote :

Marked https://bugs.launchpad.net/vault-charm/+bug/1878266 as field critical along with this bug. If either can be resolved, we can unblock.

Ryan Beisner (1chb1n) wrote :

The percona-cluster charm has no CMR testing or documented known-good CMR use cases (NotImplemented). This is more of a feature request.

Changed in charm-percona-cluster:
importance: Undecided → Wishlist
Ryan Beisner (1chb1n) wrote :

I would like to gather more detail about this use case so that we can form a specification to potentially drive feature development in the percona-cluster charm, if that is central to the use case.

Tim Penhey (thumper) wrote :

Peter, your description above does not give all the commands that we'd need to reproduce.

ubuntu@lma-stack-2:~$ juju add-relation vault mysql
ubuntu@lma-stack-2:~$ juju add-relation vault mysql
ERROR cannot add relation "vault:shared-db mysql:shared-db": remote offer mysql is terminated

Why are you adding the relation twice?

From the bundles, it looks like you are trying to relate the vault in the k8s model to the mysql in the LMA model, is that right?

Changed in juju:
status: New → Incomplete
Tim Penhey (thumper)
Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
Tim Penhey (thumper) wrote :

Here are some reproduction steps that I did locally:

juju bootstrap lxd test
juju deploy vault
juju add-model database
juju deploy percona-cluster mysql
juju config mysql min-cluster-size=3
juju add-unit mysql -n 2
juju offer mysql:shared-db
juju switch default
juju consume database.mysql
juju relate vault mysql
# added two more units of vault
juju add-unit vault -n 2

That was reasonably stable, except vault complained that the shared-db was incomplete, probably due to percona-cluster not being CMR aware.

juju remove-relation mysql vault

The relation was removed from the consuming model, but was still shown in the offering model.
Adding the relation again caused the SAAS status for mysql to move to terminated, and the relation was removed.

Changed in juju:
milestone: none → 2.7.7
Peter Jose De Sousa (pjds) wrote :

Hi Tim,

I have updated the bug description with steps of how to reproduce the issue. The Vault issue is another issue.

The subnets here are different, so MySQL is in one subnet and Percona Cluster is in another. I'm not sure if this contributes to the issue.

I was also unable to remove the offers in order to re-add them.

description: updated
description: updated
Ian Booth (wallyworld)
Changed in juju:
status: Incomplete → In Progress
importance: Undecided → High
Peter Jose De Sousa (pjds) wrote :

Correction to the above comment: Percona/MySQL is in one subnet and vault is in the other subnet.

Sorry and thanks!

Ian Booth (wallyworld) wrote :

I tried with juju 2.8 and couldn't reproduce.
Then with 2.7 I could reproduce.

So far, I have found that one of the vault units is not recorded as having left the relation in the offering model. This blocks the removal of the relation between the offered application and the remote proxy to the consuming app in the offering model. You can see that the relation is still there in the offering model by running juju offers. Note such cross model relations are not shown in status --relations.

You can force remove the offending relation by using the numeric id shown in juju offers.

$ juju remove-relation 1 --force

Once the relation is gone (force takes up to a minute) you can relate again successfully.
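
The recovery steps above could be wrapped in a small helper. This is only a sketch: the function name force_remove_stale_relation and the 60-second default wait are assumptions for illustration (the wait comes from the "up to a minute" remark), and it assumes the offering model is the currently selected model. It only uses the juju remove-relation --force and juju offers commands mentioned above.

```shell
# Hypothetical helper: force-remove a stale cross model relation by its
# numeric id (as shown by "juju offers"), wait, then list the offers again
# so the operator can confirm the relation is gone.
# Usage: force_remove_stale_relation RELATION_ID [WAIT_SECONDS]
force_remove_stale_relation() {
    relation_id=$1
    wait_seconds=${2:-60}   # assumed default; force removal may take up to a minute

    # Stop early if the force removal itself is rejected.
    juju remove-relation "$relation_id" --force || return 1

    # Give the controller time to finish the forced cleanup.
    sleep "$wait_seconds"

    # Show the offers so the remaining relation ids can be checked by eye.
    juju offers
}
```

For example, force_remove_stale_relation 1 would target the relation with id 1. Once juju offers no longer lists it, juju add-relation vault mysql can be run again from the consuming model.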

So the open question is why the departure of only one of the vault units failed to be communicated to the offering model. Interestingly, juju show-status-log shows that all the relation departed/broken hooks have run; it's just the final cleanup that is not getting propagated, and until this step is done, the relation remains in the offering model. There's no such issue in the consuming model, since the remove was done on that model and the processing happens locally.

Peter Jose De Sousa (pjds) wrote :

Hey Ian

Many thanks for your work on this. I can confirm this does indeed work. Interestingly, when I remove the relation again and add it again, it doesn't require this intervention.

It seems that vault remains in the broken state, but I have another bug for this issue.

Again, many thanks

Peter

Ian Booth (wallyworld) wrote :

I think it's timing related as to whether the issue shows up or not - it will work one time and fail the next. At least there's a cleanup step you can do to recover.

Still trying to get to the root cause. There's a lot of moving parts.

Ian Booth (wallyworld) wrote :

So I think I found the cause.

When the last unit leaves scope, the relation is also set to be deleted. If the relation deletion event is processed by the cross model workers before the unit change event arrives, the watcher for the unit change event is stopped and the unit event never gets processed (effectively, the event gets thrown away). The offering model therefore never gets told that the last unit is gone, and so it retains the relation on its side.

I hacked together a quick POC fix by not stopping the unit change watcher when the relation dies and that fixed it. Need to do a proper fix.
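
The ordering problem described above can be illustrated with a toy model. This is not Juju code: simulate_worker, the event names unit-departed and relation-dead, and the two modes are all made up for illustration. It only shows why the same two events give different outcomes depending on the order in which they are processed.

```shell
# Toy model of the cross model worker (hypothetical names throughout).
# Usage: simulate_worker MODE EVENT...
#   MODE "stop-on-dead":  stop the unit watcher when the relation dies
#                         (the behaviour described above)
#   MODE "keep-watching": keep the unit watcher running (the POC idea)
simulate_worker() {
    mode=$1
    shift
    watching=yes
    notified=no
    for event in "$@"; do
        case $event in
            relation-dead)
                # In the buggy mode the unit watcher is stopped here.
                if [ "$mode" = stop-on-dead ]; then
                    watching=no
                fi
                ;;
            unit-departed)
                # The unit event is only forwarded while the watcher runs.
                if [ "$watching" = yes ]; then
                    notified=yes
                fi
                ;;
        esac
    done
    if [ "$notified" = yes ]; then
        echo "offering model notified"
    else
        echo "offering model NOT notified"
    fi
}
```

Calling simulate_worker stop-on-dead relation-dead unit-departed models the failing order, where the unit event arrives after the watcher has been stopped. Swapping the two events, or using keep-watching, models the cases where the offering model does hear about the last unit leaving.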

Peter Jose De Sousa (pjds) wrote :

Hey Ian,

Many thanks for your work on this. I was still getting the terminated state on 2.8/edge when deploying across subnets.

I will be de-escalating these bugs now as we have found a different set of paths to resolve this and continue the deployment.

Thank you again!

Peter

Paul Collins (pjdc) wrote :

I just ran into a problem that I believe is this bug. Is there some way of removing the terminated SAAS from the consuming model? If not, is it safe to remove the model and recreate it?

Paul Collins (pjdc) wrote :

Some details: Controller is 2.9-rc2. Offering model is IAAS, offering charm is cs:postgresql (focal, channel=edge). Consuming model is CAAS, consuming charm is a mattermost k8s workload charm I'm developing.

I was removing my xenial deployment of postgresql from the IAAS model to redeploy it on focal, as well as removing the mattermost application from the CAAS model to test some charm changes, somewhat at the same time.

Ian Booth (wallyworld) wrote :

To remove the saas entry, simply juju remove-saas <foo>
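
As a rough sketch, clearing a terminated SAAS entry and consuming the offer again from the consuming model might look like the helper below. The function name reset_saas and its arguments are hypothetical. Whether the offer can be consumed again immediately after removal is an assumption, so a pause or a check of juju status between the two steps may be needed.

```shell
# Hypothetical helper, run with the consuming model selected.
# Usage: reset_saas SAAS_NAME OFFER_URL
#   SAAS_NAME: the SAAS entry as shown in juju status (e.g. mysql)
#   OFFER_URL: the offer to consume again (e.g. database.mysql)
reset_saas() {
    saas_name=$1
    offer_url=$2

    # Remove the terminated SAAS entry; stop if that fails.
    juju remove-saas "$saas_name" || return 1

    # Consume the offer again so a fresh SAAS entry is created.
    # Assumption: the old entry is fully gone by the time this runs.
    juju consume "$offer_url"
}
```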

It does seem like there's a common root cause here - if the relation remove events are processed as multiple units of work, the last ones can get missed.

Ian Booth (wallyworld) wrote :

This PR should fix the remove-relation case.

https://github.com/juju/juju/pull/11627

There's also a similar issue with destroy-model which will need a fair bit more work.
See bug 1871898

Paul Collins (pjdc) wrote :

> To remove the saas entry, simply juju remove-saas <foo>

Excellent, thanks! This worked great. I hadn't really internalized the SAAS being a separate entity, since in my (limited) experience so far with CMRs the SAASes just appear when they're needed.

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Billy Olsen (billy-olsen) wrote :

Marking as invalid on the percona-cluster charm, as none of what is described is a charm-related issue; it was in fact fixed in Juju. The question of the charm's CMR support remains, but that should be tracked in a separate request/bug.

Changed in charm-percona-cluster:
status: New → Invalid