Stale CMR offers causing model to not destroy properly

Bug #1871898 reported by Jeff Hillman
This bug affects 5 people
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Ian Booth

Bug Description

Juju 2.7.5

Testing CMR: one model in OpenStack with its own controller making offers, and one model on bare metal via a MAAS-based controller.

Offers are created in the openstack model and join successfully:

---

$ juju offers
Offer                  User   Relation id  Status  Endpoint      Interface         Role      Ingress subnets
easyrsa                admin  24           joined  client        tls-certificates  provider  172.16.7.0/24
etcd                   admin  23           joined  db            etcd              provider  172.16.7.0/24
kubeapi-load-balancer  admin  26           joined  website       http              provider  172.16.7.0/24
kubernetes-master      admin  25           joined  kube-control  kube-control      provider  172.16.7.0/24

---

Relations join fine in the consuming model:

---

$ juju status --relations
Model Controller Cloud/Region Version SLA Timestamp
k8s-worker jhillman-maas jhillman-maas 2.7.5 unsupported 14:27:22-04:00

SAAS Status Store URL
easyrsa active openstack-regionone admin/default.easyrsa
etcd active openstack-regionone admin/default.etcd
kubeapi-load-balancer active openstack-regionone admin/default.kubeapi-load-balancer
kubernetes-master active openstack-regionone admin/default.kubernetes-master

App Version Status Scale Charm Store Rev OS Notes
containerd active 1 containerd jujucharms 61 ubuntu
flannel 0.11.0 active 1 flannel jujucharms 468 ubuntu
kubernetes-worker-bm 1.17.4 active 1 kubernetes-worker jujucharms 634 ubuntu

Unit Workload Agent Machine Public address Ports Message
kubernetes-worker-bm/0* active idle 0 172.16.7.254 80/tcp,443/tcp Kubernetes worker running.
  containerd/0* active idle 172.16.7.254 Container runtime available
  flannel/0* active idle 172.16.7.254 Flannel subnet 10.1.76.1/24

Machine State DNS Inst id Series AZ Message
0 started 172.16.7.254 agrippa bionic default Deployed

Relation provider Requirer Interface Type Message
easyrsa:client kubernetes-worker-bm:certificates tls-certificates regular
etcd:db flannel:etcd etcd regular
kubeapi-load-balancer:website kubernetes-worker-bm:kube-api-endpoint http regular
kubernetes-master:kube-control kubernetes-worker-bm:kube-control kube-control regular
kubernetes-worker-bm:cni flannel:cni kubernetes-cni subordinate
kubernetes-worker-bm:container-runtime containerd:containerd container-runtime subordinate
kubernetes-worker-bm:coordinator kubernetes-worker-bm:coordinator coordinator peer

---

Whether the consuming model is destroyed completely, or even if the relations are removed by hand, the offering model still shows the offers as being consumed and will not destroy. It eventually times out.

---

ubuntu@jhillman-maas:~/cmr$ juju remove-relation kubernetes-master kubernetes-worker-bm
ubuntu@jhillman-maas:~/cmr$ juju remove-relation kubeapi-load-balancer kubernetes-worker-bm
ubuntu@jhillman-maas:~/cmr$ juju remove-relation easyrsa kubernetes-worker-bm
ubuntu@jhillman-maas:~/cmr$ juju remove-relation etcd flannel

$ juju status --relations
Model Controller Cloud/Region Version SLA Timestamp
k8s-worker jhillman-maas jhillman-maas 2.7.5 unsupported 14:30:54-04:00

SAAS Status Store URL
easyrsa active openstack-regionone admin/default.easyrsa
etcd active openstack-regionone admin/default.etcd
kubeapi-load-balancer active openstack-regionone admin/default.kubeapi-load-balancer
kubernetes-master active openstack-regionone admin/default.kubernetes-master

App Version Status Scale Charm Store Rev OS Notes
containerd active 1 containerd jujucharms 61 ubuntu
flannel 0.11.0 blocked 1 flannel jujucharms 468 ubuntu
kubernetes-worker-bm 1.17.4 waiting 1 kubernetes-worker jujucharms 634 ubuntu

Unit Workload Agent Machine Public address Ports Message
kubernetes-worker-bm/0* waiting idle 0 172.16.7.254 80/tcp,443/tcp Waiting for cluster DNS.
  containerd/0* active idle 172.16.7.254 Container runtime available
  flannel/0* blocked idle 172.16.7.254 Waiting for etcd relation.

Machine State DNS Inst id Series AZ Message
0 started 172.16.7.254 agrippa bionic default Deployed

Relation provider Requirer Interface Type Message
kubernetes-worker-bm:cni flannel:cni kubernetes-cni subordinate
kubernetes-worker-bm:container-runtime containerd:containerd container-runtime subordinate
kubernetes-worker-bm:coordinator kubernetes-worker-bm:coordinator coordinator peer

---

Even after removing saas cleanly:

---

ubuntu@jhillman-maas:~/cmr$ juju remove-saas easyrsa
ubuntu@jhillman-maas:~/cmr$ juju remove-saas etcd
ubuntu@jhillman-maas:~/cmr$ juju remove-saas kubeapi-load-balancer
ubuntu@jhillman-maas:~/cmr$ juju remove-saas kubernetes-master
ubuntu@jhillman-maas:~/cmr$ juju status
Model Controller Cloud/Region Version SLA Timestamp
k8s-worker jhillman-maas jhillman-maas 2.7.5 unsupported 14:31:37-04:00

App Version Status Scale Charm Store Rev OS Notes
containerd active 1 containerd jujucharms 61 ubuntu
flannel 0.11.0 blocked 1 flannel jujucharms 468 ubuntu
kubernetes-worker-bm 1.17.4 waiting 1 kubernetes-worker jujucharms 634 ubuntu

Unit Workload Agent Machine Public address Ports Message
kubernetes-worker-bm/0* waiting idle 0 172.16.7.254 80/tcp,443/tcp Waiting for cluster DNS.
  containerd/0* active idle 172.16.7.254 Container runtime available
  flannel/0* blocked idle 172.16.7.254 Waiting for etcd relation.

Machine State DNS Inst id Series AZ Message
0 started 172.16.7.254 agrippa bionic default Deployed

---

Some offers still show as joined. This isn't consistent as to which ones stay. It is different each time:

---

$ juju offers
Offer                  User   Relation id  Status  Endpoint  Interface         Role      Ingress subnets
easyrsa                admin  24           joined  client    tls-certificates  provider  172.16.7.0/24
etcd                   admin  23           joined  db        etcd              provider  172.16.7.0/24
kubeapi-load-balancer  admin  26           joined  website   http              provider  172.16.7.0/24
kubernetes-master      -

---

This prevents the offering model from being removed cleanly. In fact, the controller has to be manually removed and re-bootstrapped to resolve this.

Tags: cross-model
Jeff Hillman (jhillman)
description: updated
description: updated
Revision history for this message
Ian Booth (wallyworld) wrote :

It's interesting that the cross model relations join ok but won't get removed. The fact that they join ok suggests it's not an issue with communication between controllers.

It would be interesting to see if --force applied to remove-saas and/or destroy-model and/or remove-relation causes cleanup to complete. The --force option will eventually cleanup the model on which it is run even if the other side is not responding.

Before doing that though, we need more info to diagnose the issue. We need logs for both models and their controllers, after turning on extra debug logging on both models:

juju model-config logging-config="juju.apiserver.common.crossmodel=DEBUG;juju.apiserver.crossmodelrelations=DEBUG;juju.worker.uniter.remotestate=DEBUG;<root>=INFO;UNIT=DEBUG;"

Turn on the logging, set up the deployment, and then remove the relation etc. We need to see what CMR-related events get published between the models to understand why things are getting stuck.
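As a sketch, that workflow might look like the following; the model names (`offering-model`, `consuming-model`) are placeholders for the two sides of the CMR, not names from this deployment:

```shell
# Raise CMR-related logging on both models (run against each controller):
juju model-config -m offering-model \
    logging-config="juju.apiserver.common.crossmodel=DEBUG;juju.apiserver.crossmodelrelations=DEBUG;juju.worker.uniter.remotestate=DEBUG;<root>=INFO;UNIT=DEBUG;"
juju model-config -m consuming-model \
    logging-config="juju.apiserver.common.crossmodel=DEBUG;juju.apiserver.crossmodelrelations=DEBUG;juju.worker.uniter.remotestate=DEBUG;<root>=INFO;UNIT=DEBUG;"

# Capture the logs on each side while reproducing:
juju debug-log -m offering-model --replay > offering-model.log &
juju debug-log -m consuming-model --replay > consuming-model.log &

# Then reproduce: deploy, relate across the models, remove the relation.
```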

tags: added: cross-model
Revision history for this message
Ian Booth (wallyworld) wrote :

It's important to note that --force is needed in circumstances where one of the controllers is killed or firewalled off or otherwise disconnected so that things cannot be torn down cleanly. But a normal remove-relation or destroy-model where everything is still connected should work without --force.
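Concretely, the expected behaviour is something like this (a sketch using application and model names from the report above):

```shell
# Should work cleanly while both controllers are reachable:
juju remove-relation etcd flannel
juju destroy-model k8s-worker

# --force is the escape hatch for a dead or unreachable remote controller;
# it eventually cleans up the local model even if the other side never responds:
juju remove-relation etcd flannel --force
juju destroy-model k8s-worker --force --no-wait
```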

Revision history for this message
Jeff Hillman (jhillman) wrote :

Once I turn on that verbose logging, where am I pulling the specific log files from? I realize I can look in the unit logs or machine logs... is that required for this, or is there a location on the controller (or somewhere else) where I can get the entire model log?

Revision history for this message
Jeff Hillman (jhillman) wrote :

FWIW --force did work just fine.

It should also be noted that in all attempts on this, the MAAS controller has stayed up, as well as the Juju controller.

Revision history for this message
Ian Booth (wallyworld) wrote :

The model logs are located in /var/log/juju/models; there's a log file for each model.
There's also the controller log in /var/log/juju/machine-0.log.
You could just archive the entire /var/log/juju directory on both controllers.
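For example (a sketch, assuming machine 0 hosts each controller):

```shell
# On each controller, grab the whole log directory in one archive:
juju ssh -m controller 0 'sudo tar czf /tmp/juju-logs.tar.gz /var/log/juju'
juju scp -m controller 0:/tmp/juju-logs.tar.gz ./juju-logs.tar.gz
```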

Revision history for this message
Jeff Hillman (jhillman) wrote :

Still working with this. The last time I removed the model from bare-metal to k8s with CMR, it worked fine.

This time, I have CMR from ceph-mon bare-metal openstack to k8s running inside openstack.

K8s model destroyed fine, however, ceph-mon is failing to remove on the bare-metal model.

from 'juju debug-log --include ceph-mon'

---

unit-ceph-mon-0: 16:34:18 DEBUG juju.worker.uniter.remotestate got application change
unit-ceph-mon-0: 16:34:18 DEBUG juju.worker.uniter.remotestate got relations change: ok=true
unit-ceph-mon-0: 16:34:19 INFO juju.worker.uniter.operation skipped "client-relation-departed" hook (missing)
unit-ceph-mon-0: 16:34:19 INFO juju.worker.uniter.operation skipped "client-relation-departed" hook (missing)
unit-ceph-mon-0: 16:34:20 DEBUG juju.worker.uniter.remotestate got application change
unit-ceph-mon-0: 16:34:20 INFO juju.worker.uniter.operation skipped "client-relation-broken" hook (missing)
unit-ceph-mon-0: 16:34:22 INFO unit.ceph-mon/0.juju-log Unknown hook stop - skipping.
unit-ceph-mon-0: 16:34:25 INFO juju.worker.uniter.operation ran "stop" hook
unit-ceph-mon-0: 16:34:25 DEBUG juju.worker.uniter.remotestate got unit change
unit-ceph-mon-0: 16:34:25 INFO juju.worker.uniter unit "ceph-mon/0" shutting down: agent should be terminated
unit-ceph-mon-0: 16:33:40 DEBUG juju.worker.uniter.remotestate got relations change: ok=true
unit-ceph-mon-1: 16:33:40 DEBUG juju.worker.uniter.remotestate got relations change: ok=true
unit-ceph-mon-2: 16:33:40 DEBUG juju.worker.uniter.remotestate got relations change: ok=true
unit-ceph-mon-2: 16:33:40 INFO juju.worker.uniter.operation skipped "mds-relation-departed" hook (missing)
unit-ceph-mon-1: 16:33:40 INFO juju.worker.uniter.operation skipped "mds-relation-departed" hook (missing)
unit-ceph-mon-0: 16:33:40 INFO juju.worker.uniter.operation skipped "mds-relation-departed" hook (missing)
unit-ceph-mon-0: 16:33:41 DEBUG juju.worker.uniter.remotestate got application change
unit-ceph-mon-2: 16:33:41 DEBUG juju.worker.uniter.remotestate got application change
unit-ceph-mon-1: 16:33:41 DEBUG juju.worker.uniter.remotestate got application change
unit-ceph-mon-1: 16:33:41 DEBUG juju.worker.uniter.remotestate got relations change: ok=true
unit-ceph-mon-0: 16:33:41 DEBUG juju.worker.uniter.remotestate got relations change: ok=true
unit-ceph-mon-2: 16:33:41 DEBUG juju.worker.uniter.remotestate got relations change: ok=true
unit-ceph-mon-1: 16:33:41 INFO juju.worker.uniter.operation skipped "mds-relation-broken" hook (missing)
unit-ceph-mon-0: 16:33:41 INFO juju.worker.uniter.operation skipped "mds-relation-broken" hook (missing)
unit-ceph-mon-2: 16:33:41 INFO juju.worker.uniter.operation skipped "mds-relation-broken" hook (missing)
unit-ceph-mon-1: 16:33:42 INFO juju.worker.uniter.operation skipped "osd-relation-departed" hook (missing)
unit-ceph-mon-2: 16:33:43 INFO juju.worker.uniter.operation skipped "client-relation-departed" hook (missing)
unit-ceph-mon-0: 16:33:43 INFO juju.worker.uniter.operation skipped "osd-relation-departed" hook (missing)
unit-ceph-mon-1: 16:33:43 INFO juju.worker.uniter.operation skipped "radosgw-relation-departed" hook (missing)
unit-ceph-mon-2: 16:33:43 INFO...

Revision history for this message
Jeff Hillman (jhillman) wrote :

Logs from the controller are being uploaded. The relevant file for the model is admin-openstack-58bd3f.log.

Revision history for this message
Jeff Hillman (jhillman) wrote :
Revision history for this message
Ian Booth (wallyworld) wrote :

Can we please get logs from both controllers? It won't be possible to fully diagnose otherwise.

Revision history for this message
Jeff Hillman (jhillman) wrote :

Logs are being uploaded; sorry for the delay.

In these logs, both models failed to destroy on their own, so once they timed out, --force was added. Both had the suggested model-config logging settings enabled at the time of destroy.

Revision history for this message
Jeff Hillman (jhillman) wrote :

kubernetes model in this tarball is the relevant model. It has the offers.

Revision history for this message
Jeff Hillman (jhillman) wrote :

k8s-worker model in this tarball is the relevant model. It has the consumers.

Tim Penhey (thumper)
Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Ian Booth (wallyworld) wrote :

One of the causes here, I think, is the issue described in bug 1879645.

The issue is that removing a relation on the consuming side may not, depending on timing, send all events needed to the offering side to allow it to remove the relation from its model. If you run juju offers you can see the relation id numbers and use juju remove-relation <id> --force to get rid of them.

The destroy-model operation suffers from the same issue, I think - it stops the workers processing the relation removal before all events can be sent to the offering side. This is potentially harder to fix. Here's a fix to at least make remove-relation more robust in getting both the consumer and offer models to notice the relation is gone:

https://github.com/juju/juju/pull/11627

Revision history for this message
Ian Booth (wallyworld) wrote :

I'll add this to the 2.8 series to get the destroy-model issue looked at.
The remove-relation fix will land in 2.7.7.
Until the destroy-model issue is fixed, you can either use --force on the offering model or remove all CMR relations first.
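In practice the workaround looks roughly like this (a sketch; application and model names are taken from the deployment earlier in this report, and the offering model name is a placeholder):

```shell
# Option 1: remove the cross-model relations and SAAS entries first,
# then destroy normally:
juju remove-relation easyrsa kubernetes-worker-bm
juju remove-saas easyrsa
juju destroy-model k8s-worker

# Option 2: force the destroy on the offering model:
juju destroy-model <offering-model> --force
```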

Changed in juju:
milestone: none → 2.8.1
importance: Medium → High
Ian Booth (wallyworld)
summary: - [2.7.5] Stale CMR offers causing model to not destroy properly
+ Stale CMR offers causing model to not destroy properly
Tim Penhey (thumper)
Changed in juju:
milestone: 2.8.1 → 2.8-next
Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Isn't this included in 2.8.1? I've just checked the juju git repo, and 2.8.1 has the commits below:

https://github.com/juju/juju/pull/11627/commits

Revision history for this message
Ian Booth (wallyworld) wrote :

A partial fix landed in 2.8.1, but handling CMR cleanup when destroy-model is used is a lot of work and not done yet.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Hey Ian,

The customer reported that --force doesn't work with 2.8.1.

Can I have some advice on how they can delete them?

Model Controller Cloud/Region Version SLA Timestamp Notes
testwebi XX XXXX 2.8.1 unsupported 11:07:25+02:00 attempt 1 to destroy model failed (will retry): model not empty, found 1 application (model not empty)

SAAS Status Store URL
kubernetes-worker terminated XX admin/testkube.kubernetes-worker

App Version Status Scale Charm Store Rev OS Notes
apache2 unknown 0 apache2 jujucharms 35 ubuntu

Revision history for this message
Ian Booth (wallyworld) wrote :

Cross model relation cleanup when destroying models is not guaranteed to always work.
You can try removing the individual apps/saas entities with --force first and then destroying the model.

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Thanks Ian, but they can't delete it at all.

The model they can't delete was created on 2.8.0. Does the version of Juju the model was created with affect this issue?

Should I let them delete it from the DB, or would waiting for 2.8-next be safer?

Revision history for this message
Ian Booth (wallyworld) wrote :

What happens when they try to use remove-application --force? Are there errors/issues logged?

We will need more info to diagnose: ideally the output of "juju dump-db" (after setting JUJU_DEV_FEATURE_FLAGS=developer-mode) with any secrets redacted, plus TRACE logging turned on for "juju.state" before "remove-application --force" is run so we can see what's going on, i.e. juju model-config logging-config="juju.state=TRACE;<root>=INFO"
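A sketch of how those diagnostics could be gathered (the application name comes from the status output above; file names are arbitrary):

```shell
# Enable the developer feature flag so `juju dump-db` is available:
export JUJU_DEV_FEATURE_FLAGS=developer-mode
juju dump-db > db-dump.yaml          # redact any secrets before attaching

# Turn up state-layer logging, then run the forced removal:
juju model-config logging-config="juju.state=TRACE;<root>=INFO"
juju remove-application apache2 --force

# Capture the resulting logs:
juju debug-log --replay > remove-application-force.log
```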

The version of Juju when the model was created shouldn't matter. It's safe to leave the orphaned model, but it would be good to find out what's wrong; we'll need the above info to help do that.

Revision history for this message
Jay Kuri (jk0ne) wrote :

We have an environment with the same issue. The application on the other side of a CMR was removed, along with its relation, but the offering model doesn't seem to have gotten the message.

This attachment contains a juju dump-db of an environment experiencing the problem. It believes it has one more relation than it actually has, and no variation of juju remove-* (with or without --force) will convince it otherwise.

Also, the application itself can't be removed due to the relation it believes is active.

Revision history for this message
Ian Booth (wallyworld) wrote :

Thanks for the db dump. Do you also have the debug-log output for the controller model after turning on trace logging for "juju.state" before "remove-* --force" is run, i.e. juju model-config logging-config="juju.state=TRACE;<root>=INFO"?

Revision history for this message
Ian Booth (wallyworld) wrote :

To confirm: on the above model, you ran juju offers, saw there's a relation with id 123 on the postgresql-odm-kb offer, and then you tried:

$ juju remove-relation 123 --force

Right?
Having the debug logs would help a lot in solving the issue.

Revision history for this message
Ian Booth (wallyworld) wrote :

One reason for needing the debug logs is that I don't see any scheduled relation cleanup operations in the database dump. If a relation is already dying (which it will be after any remove-relation is first run), --force is required to schedule a cleanup operation. It may be that the cleanup expired before the dump was done. So running "remove-relation 123 --force", then giving it say 20s to kick things off, and then taking a database dump and logs would be good.
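In other words, a timed sequence along these lines (relation id 123 as shown by `juju offers`; output file names are arbitrary):

```shell
juju remove-relation 123 --force    # schedules the cleanup operation
sleep 20                            # give the cleanup worker time to kick off
juju dump-db > db-after-force.yaml  # requires JUJU_DEV_FEATURE_FLAGS=developer-mode
juju debug-log -m controller --replay > controller-after-force.log
```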

Revision history for this message
Ian Booth (wallyworld) wrote :

Here's a PR which is based on looking at the database dump and making an educated guess as to what needs fixing. The logic in the PR is designed to cope with a relation claiming to still have units in scope when there are none. That normally can't happen, but it could be that using --force in various ways partially cleaned up and left inconsistent data.

https://github.com/juju/juju/pull/12470

The fix can't be guaranteed to address the original issue reported when the bug was opened, but hopefully it closes another corner case.

Changed in juju:
milestone: 2.8-next → 2.8.8
Revision history for this message
Ian Booth (wallyworld) wrote :

I'll mark this as Fix Committed since the PR has landed; we can open a new bug if the issue isn't fully fixed.

Changed in juju:
status: Triaged → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released