Bug #1940983 “ceph mon does not render data to ceph-rados after ...” : Bugs : Ceph Monitor Charm

Narinder Gupta (narindergupta) on 2021-08-24

description:

updated

Revision history for this message

Narinder Gupta (narindergupta) wrote on 2021-08-24:

#1

-I am working on rados multi site configuration for mirroring the object storage and trying to convert the existing environment. here ceph-rados and ceph-mon are in different model and exposed data through offer and on ceph-radosgw consumes the relation. First-time deployment does work fine and when we redeploy the ceph-radosgw then see ceph-mon is not connecting to ceph-radosgw. I see remote id of the offer remains the same no matter you delete, add, or create multiple offers of the same API. In this case, cep-mon is writing the data to relation. But reading it from ceph-radosgw it gives error permission denied. I believe somehow data is written into an old relation structure with an old id. That is the reason the new radios unit can not read. Is there a way to change the remote-id of the offer so that I can test my theory.

Revision history for this message

Billy Olsen (billy-olsen) wrote on 2021-08-25:

#2

In a live debug session we observed that the ceph-mon is providing information back across the relation to the ceph-radosgw units, however the ceph-radosgw units never see the data and no -relation-changed hooks are firing on the ceph-radosgw units.

Revision history for this message

Aurelien Lourot (aurelien-lourot) wrote on 2021-08-25:

#3

Adding Juju as an affected project as the issue seems to be Juju not firing the hooks that are supposed to be fired.

Ponnuvel Palaniyappan (pponnuvel) on 2021-08-25

tags:

added: sts

Revision history for this message

Dominique Poulain (dominique-poulain) wrote on 2021-08-25:

#5

For documentary purposes: Vinod (vinodwa) earlier subscribed field-critical as the issue is blocking a multi-site RadosGW deployment for a customer.

Revision history for this message

Billy Olsen (billy-olsen) wrote on 2021-08-26:

#6

currently, this does not appear to be an issue with the charm itself. Marking as invalid while the investigation continues looking at the juju side of the house.

Changed in charm-ceph-mon:
status:	New → Invalid

Revision history for this message

Joseph Phillips (manadart) wrote on 2021-08-26:

#7

From the logs it appears that we have not removed the offer connection associated with the relation that was departed.

This line from the log:

2021-08-25 14:44:03 ERROR juju.worker.modelcache worker.go:366 watcher error: error loading entities for model b5c17291-b581-4144-8148-a81c1e9cd4dd: failed to initialise backing for remoteApplications:remote-2af79c70af4f497b8f3626e47c0134e2: relation "remote-2af79c70af4f497b8f3626e47c0134e2:ceph use01-ceph-mon:client" not found, getting new watcher

Is telling us that:
- this offer
- with this offer connection
- cannot find the connection's relation

John A Meinel (jameinel) on 2021-08-26

Changed in juju:
importance:	Undecided → High
milestone:	none → 2.9.13
status:	New → In Progress
importance:	High → Critical

Revision history for this message

Simon Richardson (simonrichardson) wrote on 2021-08-26:

#8

Juju PR[1] to ensure that we don't return an error from the AllWatcher if we don't find a KeyRelation based on a RelationKey.

As explained in the PR, other places in the Juju codebase handles the not found correctly. The undlerlying bug isn't solved here though... why aren't we getting a valid relation key from the connection (or is it empty?).

1. https://github.com/juju/juju/pull/13283

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-08-27 (last edit on 2021-08-27):

#9

There's several issues at play here. Firstly, if offers and/or offered applications and/or consumed offers (SAAS entities) are (force) removed, it appears either side of the relation can get a little out of sync, eg the removal of relation settings on one side happens while an asynchronous read of those settings is occurring on the other side. This can lead to the bouncing and restarting of the process involved in keeping things in sync, which prevents the proper cleanup of a removed application / unit. There's 2 PRs so far to address the current known issues:

https://github.com/juju/juju/pull/13283
https://github.com/juju/juju/pull/13284

There's another PR under development to fix a third issue where the unit agent gets wedged if a relation gets force removed and the agent doesn't get notified of that and it has a pending hook for that relation. This will end up blocking all subsequent hook invocations.

Another issue is that when a relation is added, the relation joined hook is not firing. In the latest status output we have, this occurs in several places, both in cross model relations and also subordinate relations in the same model. eg

same model:
cinder:juju-info rsyslog-forwarder-ha:juju-info juju-info subordinate joining

cross model:
use01-ceph-mon-client:client cinder-ceph-stable2:ceph ceph-client regular joining

Looking at the data model, in both cases, the expected records appear to be there in the relevant collections:
- relationscopes
- settings
- unitstates

Looking at the cross model case, logging seems to indicate that the relation was joined

unit-cinder-ceph-stable2-2 2021-08-26 02:18:12 INFO juju.worker.uniter.relation statetracker.go:195 joined relation "cinder-ceph-stable2:ceph use01-ceph-mon-client:client"

The update to juju status to set the relation state to "joined" from "joining" occurs only if the unit joining is the leader. juju status appears to show that there are no cinder unit leaders yet there is a leadership lease for cinder

{"_id":"42c0192f-6be6-4478-84c5-ad33f6a42394:application-leadership#cinder#","namespace":"application-leadership","model-uuid":"42c0192f-6be6-4478-84c5-ad33f6a42394","lease":"cinder","holder":"cinder/0","txn-revno":{"$numberLong":"6"},"txn-queue":[]}

The agents have been taken offline so that explains why status shows a lack of current leadership. It's hard to tell what was in play when the relation was established. But the status history does appear to show the relation-joined hooks were run for the relation in the same model (even though status still shows "joining")

eg

- _id: 611d2dda1a26d7cdb2d7d028
  globalkey: u#rsyslog-forwarder-ha/11
  model-uuid: 42c0192f-6be6-4478-84c5-ad33f6a42394
  status: executing
  statusdata: {}
  statusinfo: running juju-info-relation-joined hook for cinder/1

But I can't see a log entry for the cross model, even though the data looks correct.

There's not enough logging in the right places in the unit agent to definitively say right now what the root cause is for these discrepancies. We may need to add more logging as well as the above PRs.

There's several issues at play here. Firstly, if offers and/or offered applications and/or consumed offers (SAAS entities) are (force) removed, it appears either side of the relation can get a little out of sync, eg the removal of relation settings on one side happens while an asynchronous read of those settings is occurring on the other side. This can lead to the bouncing and restarting of the process involved in keeping things in sync, which prevents the proper cleanup of a removed application / unit. There's 2 PRs so far to address the current known issues:

https://github.com/juju/juju/pull/13283
https://github.com/juju/juju/pull/13284

There's another PR under development to fix a third issue where the unit agent gets wedged if a relation gets force removed and the agent doesn't get notified of that and it has a pending hook for that relation. This will end up blocking all subsequent hook invocations.

Another issue is that when a relation is added, the relation joined hook is not firing. In the latest status output we have, this occurs in several places, both in cross model relations and also subordinate relations in the same model. eg

same model:
cinder:juju-info rsyslog-forwarder-ha:juju-info juju-info subordinate joining

cross model:  
use01-ceph-mon-client:client cinder-ceph-stable2:ceph ceph-client regular joining

Looking at the data model, in both cases, the expected records appear to be there in the relevant collections:
- relationscopes
- settings
- unitstates

Looking at the cross model case, logging seems to indicate that the relation was joined

unit-cinder-ceph-stable2-2 2021-08-26 02:18:12 INFO juju.worker.uniter.relation statetracker.go:195 joined relation "cinder-ceph-stable2:ceph use01-ceph-mon-client:client"

The update to juju status to set the relation state to "joined" from "joining" occurs only if the unit joining is the leader. juju status appears to show that there are no cinder unit leaders yet there is a leadership lease for cinder

{"_id":"42c0192f-6be6-4478-84c5-ad33f6a42394:application-leadership#cinder#","namespace":"application-leadership","model-uuid":"42c0192f-6be6-4478-84c5-ad33f6a42394","lease":"cinder","holder":"cinder/0","txn-revno":{"$numberLong":"6"},"txn-queue":[]}

The agents have been taken offline so that explains why status shows a lack of current leadership. It's hard to tell what was in play when the relation was established. But the status history does appear to show the relation-joined hooks were run for the relation in the same model (even though status still shows "joining")

eg

- _id: 611d2dda1a26d7cdb2d7d028
  globalkey: u#rsyslog-forwarder-ha/11
  model-uuid: 42c0192f-6be6-4478-84c5-ad33f6a42394
  status: executing
  statusdata: {}
  statusinfo: running juju-info-relation-joined hook for cinder/1

But I can't see a log entry for the cross model, even though the data looks correct.

There's not enough logging in the right places in the unit agent to definitively say right now what the root cause is for these discrepancies. We may need to add more logging as well as the above PRs.

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-08-27:

#10

The logs we have (logsink from machine 2) have entries indicating that some units in a cross model relation are entering the scope of the relation, which is a prerequisite for running the relation-joined hook.

eg

2021-08-26 02:18:59 DEBUG juju.apiserver.common.crossmodel crossmodel.go:124 remote application for changed relation remote-5816942fe3a54d858170eb1f7fd25eb9:ceph use01-ceph-mon:client is remote-5816942fe3a54d858170eb1f7fd25eb9 in model b5c17291-b581-4144-8148-a81c1e9cd4dd
2021-08-26 02:18:59 DEBUG juju.apiserver.common.crossmodel crossmodel.go:166 remote-5816942fe3a54d858170eb1f7fd25eb9/18 entering scope (map[egress-subnets:172.25.80.132/32 ingress-address:172.25.80.132 private-address:172.25.80.132])

Note

"..... remote application for changed relation xxxx ..."
and
" xxxx/0 entering scope ...."

The thing is that these are for the offering model. I don't see any corresponding log entries to say that the (say ceph related) units on the consuming side have received change events. It could be that one or more the issues with "not found" errors is preventing the event processing from completing, though am not 100% sure yet.

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-08-27:

#11

I did manage to construct a scenario illustrating a sync issue on the consuming model. It's not directly the same issue as observed on site, but it might share some of the underlying cause.

To reproduce:

deploy a cross model scenario, eg mediawiki -> mariadb (offer)
in the consuming (mediawiki) model, juju remove-saas mariadb
then add the relation back again, juju relate mediawiki:db othermodel.mariadb

The relation is re-created, but the status on the mariadb saas entry in the consuming model is "unknown", so there's an issue pulling those details from the offering model.
We'll investigate...

Revision history for this message

Joseph Phillips (manadart) wrote on 2021-08-27:

#12

This looks very similar, and I reproduced it first time. Though on the offer side, it shows "1" rather than "1/1" for scale.

Steps:
https://pastebin.canonical.com/p/7brpNMkVjh/

Offer-side status:
https://pastebin.canonical.com/p/QnKpWTvmjk/

Consume-side status:
https://pastebin.canonical.com/p/XN2vNPQFgM/

Pertinent logs:
https://pastebin.canonical.com/p/F7dcxRjTyg/

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-08-30:

#13

With regard to the issue when re-adding a consumed application shows the wrong status (it shows "unknown"), it turns out that was a cosmetic issue rather than a root cause for the problems seen on site.

There's a PR to fix
https://github.com/juju/juju/pull/13290

The fix also includes an enhancement to the show-status-log command to allow the status history of SAAS and local applications to be viewed.

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-08-30:

#14

Download full text (6.2 KiB)

Just started looking at the latest logs

consumer model running ceph-radosgw

2021-08-27 17:14:55 INFO juju.worker.uniter.relation statetracker.go:159 joining relation "ceph-radosgw:mon use01-ceph-mon-radosgw:radosgw"
2021-08-27 17:14:56 INFO juju.worker.uniter.relation statetracker.go:195 joined relation "ceph-radosgw:mon use01-ceph-mon-radosgw:radosgw"

^^^
Appears to show the relation joining and then joined. This should be reflected in juju status and the status-log for the units.

offer model running use-ceph-mon offered as use01-ceph-mon-radosgw

2021-08-27 17:13:23 INFO juju unit_agent.go:253 Starting unit workers for "use01-ceph-mon/0"
2021-08-27 17:13:23 INFO juju.agent.setup agentconf.go:128 setting logging config to "<ROOT>=INFO;unit=DEBUG"
2021-08-27 17:26:08 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Reading package lists...
2021-08-27 17:26:09 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Building dependency tree...
2021-08-27 17:26:09 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Reading state information...
2021-08-27 17:26:09 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 0 upgraded, 0 newly installed, 0 to remove and 81 not upgraded.
2021-08-27 17:26:09 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 2021-08-27T17:26:09.842+0000 7f73d4959700 -1 auth: unable to find a keyring on /var/lib/ceph/mon/ceph-/keyring: (2) No such file or directory
2021-08-27 17:26:09 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 2021-08-27T17:26:09.842+0000 7f73d4959700 -1 AuthRegistry(0x7f73d0058fe0) no keyring found at /var/lib/ceph/mon/ceph-/keyring, disabling cephx
2021-08-27 17:26:10 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 exported keyring for client.radosgw.gateway
2021-08-27 17:26:10 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 updated caps for client.radosgw.gateway
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Reading package lists...
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Building dependency tree...
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Reading state information...
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 0 upgraded, 0 newly installed, 0 to remove and 81 not upgraded.
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 2021-08-27T17:26:14.094+0000 7f6971dba700 -1 auth: unable to find a keyring on /var/lib/ceph/mon/ceph-/keyring: (2) No such file or directory
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 2021-08-27T17:26:14.094+0000 7f6971dba700 -1 AuthRegistry(0x7f696c058fe0) no keyring found at /var/lib/ceph/mon/ceph-/keyring, disabling cephx
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 exported keyring for client.radosgw.gateway
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 updated caps for client.radosgw.gateway
2021-08-27 17:27:20 DEBU...

Just started looking at the latest logs

consumer model running ceph-radosgw

2021-08-27 17:14:55 INFO juju.worker.uniter.relation statetracker.go:159 joining relation "ceph-radosgw:mon use01-ceph-mon-radosgw:radosgw"
2021-08-27 17:14:56 INFO juju.worker.uniter.relation statetracker.go:195 joined relation "ceph-radosgw:mon use01-ceph-mon-radosgw:radosgw"

^^^
Appears to show the relation joining and then joined. This should be reflected in juju status and the status-log for the units.

offer model running use-ceph-mon offered as use01-ceph-mon-radosgw

2021-08-27 17:13:23 INFO juju unit_agent.go:253 Starting unit workers for "use01-ceph-mon/0"
2021-08-27 17:13:23 INFO juju.agent.setup agentconf.go:128 setting logging config to "<ROOT>=INFO;unit=DEBUG"
2021-08-27 17:26:08 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Reading package lists...
2021-08-27 17:26:09 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Building dependency tree...
2021-08-27 17:26:09 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Reading state information...
2021-08-27 17:26:09 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 0 upgraded, 0 newly installed, 0 to remove and 81 not upgraded.
2021-08-27 17:26:09 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 2021-08-27T17:26:09.842+0000 7f73d4959700 -1 auth: unable to find a keyring on /var/lib/ceph/mon/ceph-/keyring: (2) No such file or directory
2021-08-27 17:26:09 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 2021-08-27T17:26:09.842+0000 7f73d4959700 -1 AuthRegistry(0x7f73d0058fe0) no keyring found at /var/lib/ceph/mon/ceph-/keyring, disabling cephx
2021-08-27 17:26:10 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 exported keyring for client.radosgw.gateway
2021-08-27 17:26:10 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 updated caps for client.radosgw.gateway
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Reading package lists...
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Building dependency tree...
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Reading state information...
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 0 upgraded, 0 newly installed, 0 to remove and 81 not upgraded.
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 2021-08-27T17:26:14.094+0000 7f6971dba700 -1 auth: unable to find a keyring on /var/lib/ceph/mon/ceph-/keyring: (2) No such file or directory
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 2021-08-27T17:26:14.094+0000 7f6971dba700 -1 AuthRegistry(0x7f696c058fe0) no keyring found at /var/lib/ceph/mon/ceph-/keyring, disabling cephx
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 exported keyring for client.radosgw.gateway
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 updated caps for client.radosgw.gateway
2021-08-27 17:27:20 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Reading package lists...
2021-08-27 17:27:21 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Building dependency tree...
2021-08-27 17:27:21 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Reading state information...
2021-08-27 17:27:21 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 0 upgraded, 0 newly installed, 0 to remove and 81 not upgraded.
2021-08-27 17:27:21 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 2021-08-27T17:27:21.914+0000 7f90c852c700 -1 auth: unable to find a keyring on /var/lib/ceph/mon/ceph-/keyring: (2) No such file or directory
2021-08-27 17:27:21 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 2021-08-27T17:27:21.914+0000 7f90c852c700 -1 AuthRegistry(0x7f90c0058fe0) no keyring found at /var/lib/ceph/mon/ceph-/keyring, disabling cephx
2021-08-27 17:27:22 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 exported keyring for client.radosgw.gateway
2021-08-27 17:27:22 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 updated caps for client.radosgw.gateway
2021-08-27 17:27:25 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Reading package lists...
2021-08-27 17:27:25 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Building dependency tree...
2021-08-27 17:27:25 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Reading state information...
2021-08-27 17:27:25 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 0 upgraded, 0 newly installed, 0 to remove and 81 not upgraded.
2021-08-27 17:27:26 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 2021-08-27T17:27:26.198+0000 7f8576a17700 -1 auth: unable to find a keyring on /var/lib/ceph/mon/ceph-/keyring: (2) No such file or directory
2021-08-27 17:27:26 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 2021-08-27T17:27:26.198+0000 7f8576a17700 -1 AuthRegistry(0x7f8570058fe0) no keyring found at /var/lib/ceph/mon/ceph-/keyring, disabling cephx
2021-08-27 17:27:26 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 exported keyring for client.radosgw.gateway
2021-08-27 17:27:26 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 updated caps for client.radosgw.gateway

^^^^ appears to show relation-joined hook being retried due to an error (juju retries 3 times) amd then marks the hook as in error state

I would expect to see the above reflected in juju status and the status logs for the units. The database dumps are quite large so are interesting to grep through - juju status and status-log would make confirming this easier.

It's interesting that the logs above appear to show the relation-joined hook running on the consuming side when the initial report was that this wasn't happening. Is it possible to fix the error (/var/lib/ceph/mon/ceph-/keyring: (2) No such file or directory) on the offering side and juju resolve the error (assuming status does reflect this assumed scenario)?

Revision history for this message

John A Meinel (jameinel) wrote on 2021-08-31:

#15

Likely also related to bug #1936460

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-09-07:

#16

Juju 2.9.13 will add extra logging to help diagnose the issue.

state.relationunits=TRACE
juju.worker.uniter.relation=TRACE
juju.api.watcher.remoterelations=TRACE
juju.worker.remoterelations=TRACE
juju.apiserver.common.crossmodel=TRACE

possibly also (if needed)
juju.worker.uniter.resolver=TRACE
juju.worker.uniter.operation=TRACE

There's also a number of small fixes in Juju 2.9.13 which will help with issues we saw in the logs; we cannot say for sure if the root cause here has been found and fixed as we are still unable to reproduce locally.

We need to start from a clean slate - the consuming model has no SAAS entities; the offering model has no offer connections. One step at a time, deploy the radosgw charm in the consuming mode, consume the offer, relate to the offer. Between each step, in both the offer and consumer model run juju status --format yaml and juju dump-db. Gather the logs for each model for just the time window where the above was done.

The above will greatly help us get further along in identifying the issue.

Canonical Juju QA Bot (juju-qa-bot) on 2021-09-08

Changed in juju:
milestone:	2.9.13 → 2.9.14

Canonical Juju QA Bot (juju-qa-bot) on 2021-09-10

Changed in juju:
milestone:	2.9.14 → 2.9.15

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-09-16:

#17

Some more progress today.

It looks like one of the problems that has come up in trying to reproduce the issue has been reported here:
https://bugs.launchpad.net/juju/+bug/1873472

Trying to remove one offer if there's a second offer against the same application causes a "state changing too quickly" error. This then means that --force is needed to remove the offer. Ideally --force should not be used unless things are wedged because we want to give units a chance to shutdown cleanly.

This PR fixes the above problem:
https://github.com/juju/juju/pull/13337

There'a also an issue where making an offer twice but with different endpoints overwrites the first offer. It's more cosmetic than anything, because if there are relations it won't do the update. It still should be fixed, bug here:
https://bugs.launchpad.net/juju/+bug/1943785

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-09-17:

#18

Even better progress today

TL;DR; I found and fixed a major root cause of why things stop working when an offer is removed and re-added. My testing shows that all of the key scenarios I have tried are now working. ie consuming or relating again on the consuming side wakes everything up as expected and the relation is joined again. Previously the consuming side would be stuck.

There might still be issues to solve after this but it's a good fix.

I was going to have a PR up by now but I am not happy with an aspect of the underlying watcher to monitor saas changes and when I tweaked it, it broke things so I need to fix that up.

The main issue is when an offer is removed, the consuming model marks the saas proxy as "terminated". And when a new replacement offer with the same name as before is consumed, 1) there's no trigger to re-fire the watcher, 2) when when the watcher is triggered the worker to handle the saas application is still running from before but is set up for the old offer uuid so things get confused.

The changes to fix are minimal, but I want to get the watcher right before I put up a PR which will be done for the 2.9.15 release.

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-09-21:

#19

This PR:

https://github.com/juju/juju/pull/13345

fixes a few core issues as described in the PR description. These fixes are for the consuming model.

There's still a couple of fixes needed. On the offering model, force removing an offer which has connections does not fully remove all of the connection artefacts, which results in subsequent connections from the consuming model not to trigger the required relation processing on the offering side.

As well, there's a hook error (permission denied) observed on the consuming side when an offer is force removed. I hope that only one or 2 more PRs will be required.

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-09-22:

#20

Hopefully this latest PR

https://github.com/juju/juju/pull/13349

will fix the remaining observed issues. It adds some fallback forced removal of relations/units if the models get out of sync, and also ensures all artefacts are removed from the offering model when the offer is deleted.

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-09-22:

#21

I've thought of a scenario where disconnection between models at the time of removing a consumed offer might result in things staying stuck if the offer is consumed again. It is unlikely to happen in this case since it's a single controller deployment, but needs investigation nonetheless.

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-09-23:

#22

Another PR to address some issues on the consuming side. mainly dealing with things being removed as other processing is happening.

https://github.com/juju/juju/pull/13353

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-09-23:

#23

And one more to handle consume side add/remove more robustly

https://github.com/juju/juju/pull/13354

Revision history for this message

Ian Booth (wallyworld) wrote on 2021-09-24:

#24

Local testing seems to indicate the process of redeploy is now more robust.
Marking as fix committed but we can revisit if there are still issues.

Changed in juju:
assignee:	nobody → Ian Booth (wallyworld)
status:	In Progress → Fix Committed

Canonical Juju QA Bot (juju-qa-bot) on 2021-09-27

Changed in juju:
status:	Fix Committed → Fix Released

Revision history for this message

John A Meinel (jameinel) wrote on 2021-09-29:

#25

So we have an example where adding units ended up confusing the state of everything.
https://bugs.launchpad.net/juju/+bug/1945514

Ceph Monitor Charm

ceph mon does not render data to ceph-rados after redployment of ceph-radosgw only

Bug Description

Other bug subscribers

Remote bug watches

Affects		Status	Importance	Assigned to	Milestone
	Canonical Juju	Fix Released	Critical	Ian Booth	Canonical Juju 2.9.15
	Ceph Monitor Charm	Invalid	Undecided	Unassigned