Cross-model relation data still being provided for removed unit

Bug #1903216 reported by Paul Goins
This bug affects 2 people

Affects: Canonical Juju
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

I'm having trouble on a cloud where nagios is still receiving cross-model relation data from a unit which no longer exists.

Here is an example of what I mean: https://pastebin.ubuntu.com/p/fJBBzrGjtk/

The customer-cloud-kubernetes-worker-11 unit was removed weeks ago; however, nagios continues to see its data, and thus I cannot remove the associated nagios alerts for the removed unit, since they will just be recreated.

Tags: sts
Revision history for this message
Paul Goins (vultaire) wrote :

Juju controller version is 2.8.3. The k8s and k8s-lma models are also at 2.8.3.

Revision history for this message
Ian Booth (wallyworld) wrote :

Assuming nagios is the offer, you can run juju offers --application nagios to see what's connected.

Are there other units you want to keep in the relation? If the offering and consuming sides can't see each other when a unit is removed, the offering side can't be notified that the unit has gone away and so will keep it around. I don't think there's a clean way to remove the orphaned unit stub from the offering model. What you could do is run remove-relation --force <id> on the offering model and then add the relation again on the consuming model.
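
A minimal sketch of that workaround, assuming the offer is named nagios in the k8s-lma model, the stale relation id on the offering side is 108, and the controllers (lma-ctrl, k8s-ctrl) and consuming application (nrpe) are placeholders for this example:

# On the offering side, inspect what's connected to the offer:
juju switch lma-ctrl:k8s-lma
juju offers --application nagios

# Force-remove the stale relation on the offering model:
juju remove-relation 108 --force

# On the consuming model, re-establish the cross-model relation:
juju switch k8s-ctrl:k8s
juju add-relation nrpe:monitors lma-ctrl:admin/k8s-lma.nagios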

Revision history for this message
Pen Gale (pengale) wrote :

It seems like this is a situation that is going to come up with some frequency, just because you can't control what is happening on someone else's model.

Is the --force flag the right way to work around it? If so, is this documented somewhere?

Revision history for this message
Ian Booth (wallyworld) wrote :

--force is needed because if there are orphaned units, there's no way they can shut down cleanly with all the expected hooks.
This specific case is not called out explicitly (it is but one of many use cases where --force is helpful), but --force is documented:
https://juju.is/docs/removing-things

Revision history for this message
John A Meinel (jameinel) wrote :

It doesn't feel like this has a clear answer and resolution, so I'm marking this Incomplete, but it also doesn't look like we would implement changes on our end at this point.

Changed in juju:
status: New → Incomplete
importance: Undecided → Medium
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

I have seen this problem happen in a customer environment. Paul's description matches exactly. Juju version 2.9.31.

Among several removed CMRs, only one, removed months ago, still remained (other CMRs were removed successfully both before and after the troublesome one). The nagios unit would still report the removed units in its dashboard. Running the relation-ids and relation-list hook commands via juju run lists the troublesome monitors:108 relation and all of that relation's removed remote units, respectively.
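
For reference, a sketch of how those hook commands can be invoked from outside the unit on Juju 2.x; the monitors endpoint and relation id 108 are taken from this environment, and the unit name assumes nagios/0:

juju run --unit nagios/0 'relation-ids monitors'
juju run --unit nagios/0 'relation-list -r monitors:108'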

I checked the logs (at DEBUG level) and mongodb extensively. There are no logs pointing to anything gone wrong or out of the ordinary. In mongodb, relation 108 does not exist, and inspecting the entire db turns up no mention of it at all. Everything had been cleaned up.

However, a few things to point out:

1) At some point "remove-relation --force" was run, and the problem still persisted

2) After (1) above, the mongodb dump was collected for analysis, so we do not have data to compare against to confirm what "remove-relation --force" really removed/cleaned up.

3) Ultimately the final solution was to restart the jujud agent on the nagios/0 unit. This removed the monitors:108 entry and all the remote unit listings, and cleared up the invalid monitored entries in the nagios dashboard. The customer states that this had already been attempted in the past, but that was before "remove-relation --force".

It sounds to me like there is something that can be done here. If remove-relation --force can forcefully clean up orphaned stuff, as @wallyworld says, then it could also forcefully clear the cache, effectively as if jujud had been restarted, so that the cache has to be repopulated and no longer contains the orphaned units that were cleaned up. Alternatively, perhaps it could selectively remove the forcefully removed units from the cache.

I'm reopening this for further analysis now that we have more clues. As to what caused the issue in the first place, I suppose it could be random, like an RPC or HTTP message being lost, given that the customer reported that no other deleted CMRs hung around.

Changed in juju:
status: Expired → New
tags: added: sts
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Actually, if an RPC or HTTP request/response was lost, isn't there a retransmission or acknowledgment mechanism in place to prevent one side of the relation from not knowing that it was removed?

Revision history for this message
Ian Booth (wallyworld) wrote :

The underlying transport might well retry requests at that level, but for multi-controller cross model relations, there's no guarantee that the other controller is still running when the teardown happens. Hence --force is an option to ask the controller to tear down one side of the relation without getting blocked if the coordination with the other remote units fails.

Can we get info from the mongo dump that was taken after remove-relation --force but before the agent restart?

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

@wallyworld I sent you an internal email regarding the dump.

Revision history for this message
Ian Booth (wallyworld) wrote :

I can see from the mongodb dump that, in the nagios model, the relation for (at least) one of the units from the removed k8s model still exists. The relation is #165, and as well as the core relation entity existing, the nagios/0 unit state still references a number of the removed consuming units: remote-a57284f04d11431b8ad94b77de6ece98/1 ... remote-a57284f04d11431b8ad94b77de6ece98/14

The relation is marked as dying, so what appears to have happened is that the destroyed model on the consuming side was removed before the offering side was done removing its artefacts. One way this can happen is if --force is used, but the root cause is not clear.

The other artefacts for the removed consuming unit include the tokens to map the entities between the models.

So we have these mongo collections with "orphaned" data:

- relations
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:nagios:monitors remote-a57284f04d11431b8ad94b77de6ece98:monitors"

- applicationOfferConnections
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:nagios:monitors remote-a57284f04d11431b8ad94b77de6ece98:monitors"

- relationscopes
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/19"

- remoteApplications
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:remote-a57284f04d11431b8ad94b77de6ece98"

- remoteEntities
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:application-remote-a57284f04d11431b8ad94b77de6ece98"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:relation-nagios.monitors#remote-a57284f04d11431b8ad94b77de6ec

- settings
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#remote-a57284f04d11431b8ad94b77de6ece98"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/16"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/17"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/18"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/19"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/20"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/21"

- unitstates
The record with id "82dcf2b0-8352-4695-873c-5f791b279bb7:u#nagios/0#charm" has a "relation-state" map with key "108"

To start with, you could try to remove the dying relation:

juju remove-relation 165 --force

Depending on what gets cleaned up, you may then need to do some mongo surgery to manually remove the above records. The unitstates record is not removed, just the affected map entry in the "relation-state" map. Any manual db changes would need to have the three controller agents stopped first and restarted afterwards.
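
For illustration, a hypothetical sketch of that surgery in the mongo shell, connected to the controller's juju database. The collection names and ids come from the dump above, but the exact commands are an assumption, not something verified against this deployment:

// Remove the orphaned relation, offer connection and remote application docs.
db.relations.deleteOne({_id: "82dcf2b0-8352-4695-873c-5f791b279bb7:nagios:monitors remote-a57284f04d11431b8ad94b77de6ece98:monitors"})
db.applicationOfferConnections.deleteOne({_id: "82dcf2b0-8352-4695-873c-5f791b279bb7:nagios:monitors remote-a57284f04d11431b8ad94b77de6ece98:monitors"})
db.remoteApplications.deleteOne({_id: "82dcf2b0-8352-4695-873c-5f791b279bb7:remote-a57284f04d11431b8ad94b77de6ece98"})
db.remoteEntities.deleteOne({_id: "82dcf2b0-8352-4695-873c-5f791b279bb7:application-remote-a57284f04d11431b8ad94b77de6ece98"})
// (plus the second remoteEntities record, whose id is truncated above)

// Remove the per-relation scope and settings documents by id prefix.
db.relationscopes.deleteMany({_id: /^82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#/})
db.settings.deleteMany({_id: /^82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#/})

// For unitstates, only unset the "108" key in the relation-state map;
// do not delete the whole document.
db.unitstates.updateOne(
  {_id: "82dcf2b0-8352-4695-873c-5f791b279bb7:u#nagios/0#charm"},
  {$unset: {"relation-state.108": ""}}
)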

Changed in juju:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Canonical Juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired