Unable to remove-offer

Bug #1976311 reported by Simon Déziel
This bug affects 2 people
Affects  Status        Importance  Assigned to  Milestone
juju     Fix Released  High        Ian Booth

Bug Description

Once an offer has been used by a relation, it seems that the offer cannot be removed. Steps to reproduce:

Deploy the attached .yaml in 2 different models (`juju deploy -m test ./ovn-chassis-and-lxd.yaml` and `juju deploy -m ctrl ./ovn-central-and-vault.yaml`) then:

# offer ovn-dedicated-chassis' certificates interface for consumption by vault
juju offer test.ovn-dedicated-chassis:certificates

# relate vault to the remote ovn-dedicated-chassis
juju add-relation -m ctrl vault admin/test.ovn-dedicated-chassis

# why is this not working?
juju remove-relation -m ctrl vault admin/test.ovn-dedicated-chassis
> ERROR "admin/test.ovn-dedicated-chassis" is not a valid application name

# try a different way (that seems to have worked)
juju remove-relation -m ctrl vault:certificates ovn-dedicated-chassis:certificates

# now it fails
juju remove-offer admin/test.ovn-dedicated-chassis
> ERROR cannot delete application offer "ovn-dedicated-chassis": offer has 1 relation

Additional details:

$ juju offers -m test
Offer User Relation id Status Endpoint Interface Role Ingress subnets
ovn-dedicated-chassis admin 1 joined certificates tls-certificates requirer
$ juju remove-relation -m test 1
$ juju remove-offer admin/test.ovn-dedicated-chassis --debug
21:00:35 INFO juju.cmd supercommand.go:56 running juju [2.9.29 54b87ef5071691a2f089e26395908794321009a7 gc go1.17.9]
21:00:35 DEBUG juju.cmd supercommand.go:57 args: []string{"/snap/juju/19053/bin/juju", "remove-offer", "admin/test.ovn-dedicated-chassis", "--debug"}
21:00:35 INFO juju.juju api.go:78 connecting to API addresses: [172.17.40.184:17070]
21:00:35 DEBUG juju.api apiclient.go:1153 successfully dialed "wss://172.17.40.184:17070/api"
21:00:35 INFO juju.api apiclient.go:688 connection established to "wss://172.17.40.184:17070/api"
21:00:35 DEBUG juju.api monitor.go:35 RPC connection died
ERROR cannot delete application offer "ovn-dedicated-chassis": offer has 1 relation
21:00:35 DEBUG cmd supercommand.go:537 error stack:
/build/snapcraft-juju-a0467a25b9d0f69bced9dc04f12ad55f/parts/juju/src/rpc/params/params.go:103: cannot delete application offer "ovn-dedicated-chassis": offer has 1 relation
$ juju version
2.9.29-ubuntu-arm64

Tags: teardown
Revision history for this message
Ian Booth (wallyworld) wrote (last edit ):

To answer:

# why is this not working?
juju remove-relation -m ctrl vault admin/test.ovn-dedicated-chassis
> ERROR "admin/test.ovn-dedicated-chassis" is not a valid application name

You don't use the offer URL as an argument to "remove-relation". The relation is between the local application and the "SAAS" application as shown in status. Hence the second way

juju remove-relation -m ctrl vault:certificates ovn-dedicated-chassis:certificates

is the correct syntax. This is because you may well have consumed the offer (one or more times) using a local alias, and so must use the actual SAAS name from the consuming model, not the offer reference.
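As a sketch of why the SAAS name matters (the "chassis" alias below is hypothetical), the same offer can be consumed under a local alias, and the relation is then managed via that alias, never via the offer URL:

# consume the offer under a local alias (hypothetical alias "chassis")
juju consume -m ctrl admin/test.ovn-dedicated-chassis chassis

# relate, and later remove, using the SAAS name from the consuming model
juju add-relation -m ctrl vault:certificates chassis:certificates
juju remove-relation -m ctrl vault:certificates chassis:certificates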

Revision history for this message
Ian Booth (wallyworld) wrote :

# now it fails
juju remove-offer admin/test.ovn-dedicated-chassis
> ERROR cannot delete application offer "ovn-dedicated-chassis": offer has 1 relation

When did you run this command? Did you give the relation time to be fully torn down? For the relation to disappear, both the offering and consuming sides need to run the relation departed/broken hooks, and if there's an error doing that, the relation will remain.

Did either the offering or consuming sides show any charm hook errors for the relevant units? A show-status-log should show that the hooks have been run.

On the offering and consuming models, status --format yaml would show if the relation in question has life = 1 ("dying"). If so, that indicates that the hooks have failed to run properly.

Did the controller logs have any errors shown?
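For reference, the checks suggested above could look like the following (a sketch; unit and model names are taken from this bug, and output will vary):

# did the departed/broken hooks run on the offering side?
juju show-status-log -m test ovn-dedicated-chassis/0

# is the relation stuck in the "dying" state?
juju status -m test --format yaml

# any errors in the controller logs?
juju debug-log -m controller --level ERROR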

Revision history for this message
Ian Booth (wallyworld) wrote :

Deploying the attached bundles, the ovn-dedicated-chassis app did not complete deployment because it was missing a relation to 'ovsdb'. Nonetheless, I ran the remove-relation command and watched the status on the offering model and the offer connections changed from 1/1 to 0/0. At that point, I was able to remove the offer.
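In other words, the working sequence was roughly (a sketch of the steps described above):

# remove the cross-model relation from the consuming model
juju remove-relation -m ctrl vault:certificates ovn-dedicated-chassis:certificates

# wait for the offer connections shown on the offering model to drop from 1/1 to 0/0
juju offers -m test

# once no connections remain, removing the offer succeeds
juju remove-offer admin/test.ovn-dedicated-chassis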

I also verified that the expected relation-departed hooks ran, e.g.

running certificates-relation-departed hook for remote-5d4c7a1cda27412c85cbb773969d64a0/0
running certificates-relation-departed hook for ovn-dedicated-chassis/0

One thing to note is that Juju is currently bad at surfacing what it is doing and what progress it is making when things are removed. It is not at all obvious from status whether a unit, application, or relation removal is in progress, and the error message about not being able to remove the offer because of a relation should indicate that the relation is dying and in the process of being removed.

There is work planned for this cycle to address these sorts of issues - better report and surface progress, improve associated error messages etc.

I'll mark this as Incomplete for now - please re-open if the above comments are not relevant, i.e. if you wait a bit and ensure the relation hooks run, the relation should eventually disappear for you as it did for me.

Changed in juju:
status: New → Incomplete
tags: added: teardown
Revision history for this message
Simon Déziel (sdeziel) wrote :

Thanks Ian for taking the time to try reproducing it on your side! FYI, I can consistently reproduce this:

# model creation and bundle deployment
$ ...

# let's go with the manual steps
$ juju add-relation -m ctrl vault admin/test.ovn-dedicated-chassis

$ juju status --relations -m ctrl

Model Controller Cloud/Region Version SLA Timestamp
ctrl overlord maas/default 2.9.29 unsupported 17:02:38Z

SAAS Status Store URL
ovn-dedicated-chassis blocked overlord admin/test.ovn-dedicated-chassis

App Version Status Scale Charm Channel Rev Exposed Message
ovn-central 20.03.2 active 3 ovn-central stable 16 no Unit is ready (leader: ovnnb_db, ovnsb_db northd: active)
postgresql 12.11 active 1 postgresql stable 239 no Live master (12.11)
vault 1.5.9 active 1 vault stable 54 no Unit is ready (active: true, mlock: disabled)

Unit Workload Agent Machine Public address Ports Message
ovn-central/0* active idle 1 2602:fc62:b:3002:0:1:0:2 6641/tcp,6642/tcp Unit is ready (leader: ovnnb_db, ovnsb_db northd: active)
ovn-central/1 active idle 2 2602:fc62:b:3003:0:1:0:5 6641/tcp,6642/tcp Unit is ready
ovn-central/2 active idle 3 2602:fc62:b:3002:0:1:: 6641/tcp,6642/tcp Unit is ready
postgresql/0* active idle 0/lxd/0 172.17.32.8 5432/tcp Live master (12.11)
vault/0* active idle 0/lxd/1 172.17.32.7 8200/tcp Unit is ready (active: true, mlock: disabled)

Machine State DNS Inst id Series AZ Message
0 started 2602:fc62:b:3002:0:1:0:1 r02-amd64-04 focal default Deployed
0/lxd/0 started 172.17.32.8 juju-112cdc-0-lxd-0 focal default Container started
0/lxd/1 started 172.17.32.7 juju-112cdc-0-lxd-1 focal default Container started
1 started 2602:fc62:b:3002:0:1:0:2 r02-amd64-05 focal default Deployed
2 started 2602:fc62:b:3003:0:1:0:5 r03-amd64-06 focal default Deployed
3 started 2602:fc62:b:3002:0:1:: r02-amd64-03 focal default Deployed

Relation provider Requirer Interface Type Message
ovn-central:ovsdb-peer ovn-central:ovsdb-peer ovsdb-cluster peer
postgresql:coordinator postgresql:coordinator coordinator peer
postgresql:db vault:db pgsql regular
postgresql:replication postgresql:replication pgpeer peer
vault:certificates ovn-central:certificates tls-certificates regular
vault:certificates ovn-dedicated-chassis:certificates tls-certificates regular
vault:cluster vault:cluster vault-ha peer

$ juju status --relations -m test
Model Controller Cloud/Region Version SLA Timestamp
test overlord maas/default 2.9.29 unsupported 17:02:43Z

App ...

Revision history for this message
Chris Johnston (cjohnston) wrote :

I can reproduce the same thing with kubernetes-control-plane and coredns.

At 20:16:30 I ran:
$ juju remove-relation -m kubernetes coredns kubernetes-control-plane

I waited until all units stopped executing. No units went into error.

$ juju show-status-log coredns/0
Time Type Status Message
27 Jun 2022 20:09:49Z workload unknown
27 Jun 2022 20:09:49Z juju-unit executing running coredns-pebble-ready hook
27 Jun 2022 20:10:18Z juju-unit idle
27 Jun 2022 20:10:18Z workload blocked Forbidden to apply RBAC Policies.
27 Jun 2022 20:10:33Z juju-unit executing running config-changed hook
27 Jun 2022 20:10:45Z juju-unit idle
27 Jun 2022 20:10:50Z juju-unit executing running dns-provider-relation-created hook
27 Jun 2022 20:10:51Z juju-unit executing running dns-provider-relation-joined hook for remote-4b130fdcd28e44c28be54a9806de2fe9/0
27 Jun 2022 20:10:52Z juju-unit executing running dns-provider-relation-changed hook for remote-4b130fdcd28e44c28be54a9806de2fe9/0
27 Jun 2022 20:10:53Z juju-unit executing running dns-provider-relation-joined hook for remote-4b130fdcd28e44c28be54a9806de2fe9/1
27 Jun 2022 20:10:54Z juju-unit executing running dns-provider-relation-changed hook for remote-4b130fdcd28e44c28be54a9806de2fe9/1
27 Jun 2022 20:10:55Z juju-unit executing running dns-provider-relation-joined hook for remote-4b130fdcd28e44c28be54a9806de2fe9/2
27 Jun 2022 20:10:56Z juju-unit executing running dns-provider-relation-changed hook for remote-4b130fdcd28e44c28be54a9806de2fe9/2
27 Jun 2022 20:13:01Z juju-unit idle
27 Jun 2022 20:16:31Z juju-unit executing running dns-provider-relation-departed hook for remote-4b130fdcd28e44c28be54a9806de2fe9/0
27 Jun 2022 20:16:32Z juju-unit executing running dns-provider-relation-departed hook for remote-4b130fdcd28e44c28be54a9806de2fe9/1
27 Jun 2022 20:16:33Z juju-unit executing running dns-provider-relation-departed hook for remote-4b130fdcd28e44c28be54a9806de2fe9/2
27 Jun 2022 20:16:34Z juju-unit executing running dns-provider-relation-broken hook
27 Jun 2022 20:20:41Z juju-unit idle
27 Jun 2022 20:24:36Z workload active

$ juju show-status-log -m kubernetes kubernetes-control-plane/0
Time Type Status Message
27 Jun 2022 19:54:34Z workload active Kubernetes control-plane running.
27 Jun 2022 19:54:34Z juju-unit idle
27 Jun 2022 19:55:59Z workload waiting Waiting for 1 kube-system pod to start
27 Jun 2022 20:00:55Z workload active Kubernetes control-plane running.
27 Jun 2022 20:05:29Z workload waiting Waiting for 1 kube-system pod to start
27 Jun 2022 20:05:43Z juju-unit executing running config-changed hook
27 Jun 2022 20:06:30Z workload maintenance Restarting snap.kubelet.daemon service
27 Jun 2022 20:06:34Z juju-unit idle
27 Jun 2022 20:10:50Z juju-unit executing running dns-provider-relation-created hook
27 Jun 2022 20:11:33Z workload active Kubernetes control-plane running.
27 Jun 2022 20:11:47Z juju-unit executing running dns-provider-relation-joined hook for coredns/0
27 Jun 2022 20:12:35Z workload maintenance Restarting sn...


Revision history for this message
Ian Booth (wallyworld) wrote :

I tried to replicate the test scenario and got slightly different results.
When I removed the "coredns kubernetes-control-plane" relation, the consuming model ran the relation broken/departed hooks on the k8s control plane units as expected, and the relation was removed.

For my setup, on the offering side, the unit agent does not run the relation removal hooks because the coredns/0 unit goes into error with "crash loop backoff". Looking at the k8s namespace, we see the coredns-0 pod is failing its liveness/readiness probes:

  Warning Unhealthy 16m (x2 over 16m) kubelet Liveness probe failed: Get "http://192.168.108.66:38813/v1/health?level=alive": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning Unhealthy 11m (x52 over 16m) kubelet Readiness probe failed: Get "http://192.168.108.66:38813/v1/health?level=ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning BackOff 6m47s (x8 over 7m29s) kubelet Back-off restarting failed container
  Warning Unhealthy 109s (x15 over 16m) kubelet Startup probe failed: Get "http://192.168.108.66:3856/startup": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
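For anyone reproducing this, events like the above can be pulled with standard kubectl commands (the namespace placeholder below is hypothetical and must be replaced with the namespace Juju created for the k8s model):

# inspect the failing pod and recent events in the model's namespace
kubectl -n <model-namespace> describe pod coredns-0
kubectl -n <model-namespace> get events --sort-by=.lastTimestamp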

This is different to what you saw where the relation departed/broken hooks did appear to run.

Revision history for this message
Ian Booth (wallyworld) wrote (last edit ):

I set up a smaller scale test using a 2.9.32 LXD controller and adding a microk8s model. The k8s model hosted charmed-osm-mariadb-k8s and offered it; a machine model was running mediawiki with a cross model relation to the mariadb-k8s offer. Removing the relation from the consuming side correctly removed the offer connection and the offer could be removed.

A second attempt to reproduce with CDK and coredns was more successful - I might now have enough to dig a little deeper into it.

Revision history for this message
Ian Booth (wallyworld) wrote :

This PR should hopefully fix the issue
https://github.com/juju/juju/pull/14252

Changed in juju:
milestone: none → 2.9.33
assignee: nobody → Ian Booth (wallyworld)
importance: Undecided → High
status: Incomplete → In Progress
Revision history for this message
Ian Booth (wallyworld) wrote :

The PR has landed, hopefully it makes a difference in the field.

Changed in juju:
status: In Progress → Fix Committed
Revision history for this message
Simon Déziel (sdeziel) wrote :

Thanks Ian, that's much appreciated!

Changed in juju:
status: Fix Committed → Fix Released