ceph mon does not render data to ceph-rados after redployment of ceph-radosgw only

Bug #1940983 reported by Narinder Gupta
20
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
Critical
Ian Booth
Ceph Monitor Charm
Invalid
Undecided
Unassigned

Bug Description

Deploy OpenStack with ceph-radsosgw with cep-mon with CMR and everything works well. Then redeploy the ceph-radosgw after removing all units of ceph-radosgw and not ceph-mon does not render data to ceph-radosgw.

and ceph-radosgw is in a waiting state mentioning incomplete relation mon.

App Version Status Scale Charm Store Channel Rev OS Message
ceph-access-policy-routing active 3 advanced-routing charmstore stable 4 ubuntu Unit is ready
ceph-radosgw 15.2.13 waiting 3 ceph-radosgw charmstore stable 297 ubuntu Incomplete relations: mon
hacluster-radosgw active 3 hacluster charmstore stable 76 ubuntu Unit is ready and clustered

Unit Workload Agent Machine Public address Ports Message
ceph-radosgw/3* waiting idle 37 172.25.0.65 443/tcp Incomplete relations: mon
  ceph-access-policy-routing/8 active idle 172.25.0.65 Unit is ready
  hacluster-radosgw/3* active idle 172.25.0.65 Unit is ready and clustered
ceph-radosgw/4 waiting idle 38 172.25.0.68 443/tcp Incomplete relations: mon
  ceph-access-policy-routing/6* active idle 172.25.0.68 Unit is ready
  hacluster-radosgw/4 active idle 172.25.0.68 Unit is ready and clustered
ceph-radosgw/5 waiting idle 39 172.25.0.69 443/tcp Incomplete relations: mon
  ceph-access-policy-routing/7 active idle 172.25.0.69 Unit is ready
  hacluster-radosgw/5 active idle 172.25.0.69 Unit is ready and clustered

Tags: sts
description: updated
Revision history for this message
Narinder Gupta (narindergupta) wrote :

-I am working on rados multi site configuration for mirroring the object storage and trying to convert the existing environment. here ceph-rados and ceph-mon are in different model and exposed data through offer and on ceph-radosgw consumes the relation. First-time deployment does work fine and when we redeploy the ceph-radosgw then see ceph-mon is not connecting to ceph-radosgw. I see remote id of the offer remains the same no matter you delete, add, or create multiple offers of the same API. In this case, cep-mon is writing the data to relation. But reading it from ceph-radosgw it gives error permission denied. I believe somehow data is written into an old relation structure with an old id. That is the reason the new radios unit can not read. Is there a way to change the remote-id of the offer so that I can test my theory.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

In a live debug session we observed that the ceph-mon is providing information back across the relation to the ceph-radosgw units, however the ceph-radosgw units never see the data and no -relation-changed hooks are firing on the ceph-radosgw units.

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Adding Juju as an affected project as the issue seems to be Juju not firing the hooks that are supposed to be fired.

tags: added: sts
Revision history for this message
Dominique Poulain (dominique-poulain) wrote :

For documentary purposes: Vinod (vinodwa) earlier subscribed field-critical as the issue is blocking a multi-site RadosGW deployment for a customer.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

currently, this does not appear to be an issue with the charm itself. Marking as invalid while the investigation continues looking at the juju side of the house.

Changed in charm-ceph-mon:
status: New → Invalid
Revision history for this message
Joseph Phillips (manadart) wrote :

From the logs it appears that we have not removed the offer connection associated with the relation that was departed.

This line from the log:

2021-08-25 14:44:03 ERROR juju.worker.modelcache worker.go:366 watcher error: error loading entities for model b5c17291-b581-4144-8148-a81c1e9cd4dd: failed to initialise backing for remoteApplications:remote-2af79c70af4f497b8f3626e47c0134e2: relation "remote-2af79c70af4f497b8f3626e47c0134e2:ceph use01-ceph-mon:client" not found, getting new watcher

Is telling us that:
- this offer
- with this offer connection
- cannot find the connection's relation

John A Meinel (jameinel)
Changed in juju:
importance: Undecided → High
milestone: none → 2.9.13
status: New → In Progress
importance: High → Critical
Revision history for this message
Simon Richardson (simonrichardson) wrote :

Juju PR[1] to ensure that we don't return an error from the AllWatcher if we don't find a KeyRelation based on a RelationKey.

As explained in the PR, other places in the Juju codebase handles the not found correctly. The undlerlying bug isn't solved here though... why aren't we getting a valid relation key from the connection (or is it empty?).

1. https://github.com/juju/juju/pull/13283

Revision history for this message
Ian Booth (wallyworld) wrote (last edit ):

There's several issues at play here. Firstly, if offers and/or offered applications and/or consumed offers (SAAS entities) are (force) removed, it appears either side of the relation can get a little out of sync, eg the removal of relation settings on one side happens while an asynchronous read of those settings is occurring on the other side. This can lead to the bouncing and restarting of the process involved in keeping things in sync, which prevents the proper cleanup of a removed application / unit. There's 2 PRs so far to address the current known issues:

https://github.com/juju/juju/pull/13283
https://github.com/juju/juju/pull/13284

There's another PR under development to fix a third issue where the unit agent gets wedged if a relation gets force removed and the agent doesn't get notified of that and it has a pending hook for that relation. This will end up blocking all subsequent hook invocations.

Another issue is that when a relation is added, the relation joined hook is not firing. In the latest status output we have, this occurs in several places, both in cross model relations and also subordinate relations in the same model. eg

same model:
cinder:juju-info rsyslog-forwarder-ha:juju-info juju-info subordinate joining

cross model:
use01-ceph-mon-client:client cinder-ceph-stable2:ceph ceph-client regular joining

Looking at the data model, in both cases, the expected records appear to be there in the relevant collections:
- relationscopes
- settings
- unitstates

Looking at the cross model case, logging seems to indicate that the relation was joined

unit-cinder-ceph-stable2-2 2021-08-26 02:18:12 INFO juju.worker.uniter.relation statetracker.go:195 joined relation "cinder-ceph-stable2:ceph use01-ceph-mon-client:client"

The update to juju status to set the relation state to "joined" from "joining" occurs only if the unit joining is the leader. juju status appears to show that there are no cinder unit leaders yet there is a leadership lease for cinder

{"_id":"42c0192f-6be6-4478-84c5-ad33f6a42394:application-leadership#cinder#","namespace":"application-leadership","model-uuid":"42c0192f-6be6-4478-84c5-ad33f6a42394","lease":"cinder","holder":"cinder/0","txn-revno":{"$numberLong":"6"},"txn-queue":[]}

The agents have been taken offline so that explains why status shows a lack of current leadership. It's hard to tell what was in play when the relation was established. But the status history does appear to show the relation-joined hooks were run for the relation in the same model (even though status still shows "joining")

eg

- _id: 611d2dda1a26d7cdb2d7d028
  globalkey: u#rsyslog-forwarder-ha/11
  model-uuid: 42c0192f-6be6-4478-84c5-ad33f6a42394
  status: executing
  statusdata: {}
  statusinfo: running juju-info-relation-joined hook for cinder/1

But I can't see a log entry for the cross model, even though the data looks correct.

There's not enough logging in the right places in the unit agent to definitively say right now what the root cause is for these discrepancies. We may need to add more logging as well as the above PRs.

Revision history for this message
Ian Booth (wallyworld) wrote :

The logs we have (logsink from machine 2) have entries indicating that some units in a cross model relation are entering the scope of the relation, which is a prerequisite for running the relation-joined hook.

eg

2021-08-26 02:18:59 DEBUG juju.apiserver.common.crossmodel crossmodel.go:124 remote application for changed relation remote-5816942fe3a54d858170eb1f7fd25eb9:ceph use01-ceph-mon:client is remote-5816942fe3a54d858170eb1f7fd25eb9 in model b5c17291-b581-4144-8148-a81c1e9cd4dd
2021-08-26 02:18:59 DEBUG juju.apiserver.common.crossmodel crossmodel.go:166 remote-5816942fe3a54d858170eb1f7fd25eb9/18 entering scope (map[egress-subnets:172.25.80.132/32 ingress-address:172.25.80.132 private-address:172.25.80.132])

Note

"..... remote application for changed relation xxxx ..."
and
" xxxx/0 entering scope ...."

The thing is that these are for the offering model. I don't see any corresponding log entries to say that the (say ceph related) units on the consuming side have received change events. It could be that one or more the issues with "not found" errors is preventing the event processing from completing, though am not 100% sure yet.

Revision history for this message
Ian Booth (wallyworld) wrote :

I did manage to construct a scenario illustrating a sync issue on the consuming model. It's not directly the same issue as observed on site, but it might share some of the underlying cause.

To reproduce:

deploy a cross model scenario, eg mediawiki -> mariadb (offer)
in the consuming (mediawiki) model, juju remove-saas mariadb
then add the relation back again, juju relate mediawiki:db othermodel.mariadb

The relation is re-created, but the status on the mariadb saas entry in the consuming model is "unknown", so there's an issue pulling those details from the offering model.
We'll investigate...

Revision history for this message
Joseph Phillips (manadart) wrote :

This looks very similar, and I reproduced it first time. Though on the offer side, it shows "1" rather than "1/1" for scale.

Steps:
https://pastebin.canonical.com/p/7brpNMkVjh/

Offer-side status:
https://pastebin.canonical.com/p/QnKpWTvmjk/

Consume-side status:
https://pastebin.canonical.com/p/XN2vNPQFgM/

Pertinent logs:
https://pastebin.canonical.com/p/F7dcxRjTyg/

Revision history for this message
Ian Booth (wallyworld) wrote :

With regard to the issue when re-adding a consumed application shows the wrong status (it shows "unknown"), it turns out that was a cosmetic issue rather than a root cause for the problems seen on site.

There's a PR to fix
https://github.com/juju/juju/pull/13290

The fix also includes an enhancement to the show-status-log command to allow the status history of SAAS and local applications to be viewed.

Revision history for this message
Ian Booth (wallyworld) wrote :
Download full text (6.2 KiB)

Just started looking at the latest logs

consumer model running ceph-radosgw

2021-08-27 17:14:55 INFO juju.worker.uniter.relation statetracker.go:159 joining relation "ceph-radosgw:mon use01-ceph-mon-radosgw:radosgw"
2021-08-27 17:14:56 INFO juju.worker.uniter.relation statetracker.go:195 joined relation "ceph-radosgw:mon use01-ceph-mon-radosgw:radosgw"

^^^
Appears to show the relation joining and then joined. This should be reflected in juju status and the status-log for the units.

offer model running use-ceph-mon offered as use01-ceph-mon-radosgw

2021-08-27 17:13:23 INFO juju unit_agent.go:253 Starting unit workers for "use01-ceph-mon/0"
2021-08-27 17:13:23 INFO juju.agent.setup agentconf.go:128 setting logging config to "<ROOT>=INFO;unit=DEBUG"
2021-08-27 17:26:08 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Reading package lists...
2021-08-27 17:26:09 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Building dependency tree...
2021-08-27 17:26:09 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 Reading state information...
2021-08-27 17:26:09 DEBUG unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 0 upgraded, 0 newly installed, 0 to remove and 81 not upgraded.
2021-08-27 17:26:09 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 2021-08-27T17:26:09.842+0000 7f73d4959700 -1 auth: unable to find a keyring on /var/lib/ceph/mon/ceph-/keyring: (2) No such file or directory
2021-08-27 17:26:09 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 2021-08-27T17:26:09.842+0000 7f73d4959700 -1 AuthRegistry(0x7f73d0058fe0) no keyring found at /var/lib/ceph/mon/ceph-/keyring, disabling cephx
2021-08-27 17:26:10 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 exported keyring for client.radosgw.gateway
2021-08-27 17:26:10 WARNING unit.use01-ceph-mon/0.radosgw-relation-joined logger.go:60 updated caps for client.radosgw.gateway
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Reading package lists...
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Building dependency tree...
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 Reading state information...
2021-08-27 17:26:13 DEBUG unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 0 upgraded, 0 newly installed, 0 to remove and 81 not upgraded.
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 2021-08-27T17:26:14.094+0000 7f6971dba700 -1 auth: unable to find a keyring on /var/lib/ceph/mon/ceph-/keyring: (2) No such file or directory
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 2021-08-27T17:26:14.094+0000 7f6971dba700 -1 AuthRegistry(0x7f696c058fe0) no keyring found at /var/lib/ceph/mon/ceph-/keyring, disabling cephx
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 exported keyring for client.radosgw.gateway
2021-08-27 17:26:14 WARNING unit.use01-ceph-mon/0.radosgw-relation-changed logger.go:60 updated caps for client.radosgw.gateway
2021-08-27 17:27:20 DEBU...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :

Likely also related to bug #1936460

Revision history for this message
Ian Booth (wallyworld) wrote :

Juju 2.9.13 will add extra logging to help diagnose the issue.

state.relationunits=TRACE
juju.worker.uniter.relation=TRACE
juju.api.watcher.remoterelations=TRACE
juju.worker.remoterelations=TRACE
juju.apiserver.common.crossmodel=TRACE

possibly also (if needed)
juju.worker.uniter.resolver=TRACE
juju.worker.uniter.operation=TRACE

There's also a number of small fixes in Juju 2.9.13 which will help with issues we saw in the logs; we cannot say for sure if the root cause here has been found and fixed as we are still unable to reproduce locally.

We need to start from a clean slate - the consuming model has no SAAS entities; the offering model has no offer connections. One step at a time, deploy the radosgw charm in the consuming mode, consume the offer, relate to the offer. Between each step, in both the offer and consumer model run juju status --format yaml and juju dump-db. Gather the logs for each model for just the time window where the above was done.

The above will greatly help us get further along in identifying the issue.

Changed in juju:
milestone: 2.9.13 → 2.9.14
Changed in juju:
milestone: 2.9.14 → 2.9.15
Revision history for this message
Ian Booth (wallyworld) wrote :

Some more progress today.

It looks like one of the problems that has come up in trying to reproduce the issue has been reported here:
https://bugs.launchpad.net/juju/+bug/1873472

Trying to remove one offer if there's a second offer against the same application causes a "state changing too quickly" error. This then means that --force is needed to remove the offer. Ideally --force should not be used unless things are wedged because we want to give units a chance to shutdown cleanly.

This PR fixes the above problem:
https://github.com/juju/juju/pull/13337

There'a also an issue where making an offer twice but with different endpoints overwrites the first offer. It's more cosmetic than anything, because if there are relations it won't do the update. It still should be fixed, bug here:
https://bugs.launchpad.net/juju/+bug/1943785

Revision history for this message
Ian Booth (wallyworld) wrote :

Even better progress today

TL;DR; I found and fixed a major root cause of why things stop working when an offer is removed and re-added. My testing shows that all of the key scenarios I have tried are now working. ie consuming or relating again on the consuming side wakes everything up as expected and the relation is joined again. Previously the consuming side would be stuck.

There might still be issues to solve after this but it's a good fix.

I was going to have a PR up by now but I am not happy with an aspect of the underlying watcher to monitor saas changes and when I tweaked it, it broke things so I need to fix that up.

The main issue is when an offer is removed, the consuming model marks the saas proxy as "terminated". And when a new replacement offer with the same name as before is consumed, 1) there's no trigger to re-fire the watcher, 2) when when the watcher is triggered the worker to handle the saas application is still running from before but is set up for the old offer uuid so things get confused.

The changes to fix are minimal, but I want to get the watcher right before I put up a PR which will be done for the 2.9.15 release.

Revision history for this message
Ian Booth (wallyworld) wrote :

This PR:

https://github.com/juju/juju/pull/13345

fixes a few core issues as described in the PR description. These fixes are for the consuming model.

There's still a couple of fixes needed. On the offering model, force removing an offer which has connections does not fully remove all of the connection artefacts, which results in subsequent connections from the consuming model not to trigger the required relation processing on the offering side.

As well, there's a hook error (permission denied) observed on the consuming side when an offer is force removed. I hope that only one or 2 more PRs will be required.

Revision history for this message
Ian Booth (wallyworld) wrote :

Hopefully this latest PR

https://github.com/juju/juju/pull/13349

will fix the remaining observed issues. It adds some fallback forced removal of relations/units if the models get out of sync, and also ensures all artefacts are removed from the offering model when the offer is deleted.

Revision history for this message
Ian Booth (wallyworld) wrote :

I've thought of a scenario where disconnection between models at the time of removing a consumed offer might result in things staying stuck if the offer is consumed again. It is unlikely to happen in this case since it's a single controller deployment, but needs investigation nonetheless.

Revision history for this message
Ian Booth (wallyworld) wrote :

Another PR to address some issues on the consuming side. mainly dealing with things being removed as other processing is happening.

https://github.com/juju/juju/pull/13353

Revision history for this message
Ian Booth (wallyworld) wrote :

And one more to handle consume side add/remove more robustly

https://github.com/juju/juju/pull/13354

Revision history for this message
Ian Booth (wallyworld) wrote :

Local testing seems to indicate the process of redeploy is now more robust.
Marking as fix committed but we can revisit if there are still issues.

Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
John A Meinel (jameinel) wrote :

So we have an example where adding units ended up confusing the state of everything.
https://bugs.launchpad.net/juju/+bug/1945514

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.