Cross-model relation data still being provided for removed unit

Bug #1903216 reported by Paul Goins
This bug affects 2 people

Affects: Canonical Juju
Status: Expired
Importance: Medium
Assigned to: Unassigned

Bug Description

I'm having trouble on a cloud where nagios is still receiving cross-model relation data from a unit which no longer exists.

Here is an example of what I mean: https://pastebin.ubuntu.com/p/fJBBzrGjtk/

The customer-cloud-kubernetes-worker-11 unit was removed weeks ago; however, nagios continues to see its data, and thus I cannot remove the associated nagios alerts for the removed unit, since they will just be recreated.

Tags: sts
Revision history for this message
Paul Goins (vultaire) wrote :

Juju controller version is 2.8.3. The k8s and k8s-lma models are also at 2.8.3.

Revision history for this message
Ian Booth (wallyworld) wrote :

Assuming nagios is the offer, you can run juju offers --application nagios to see what's connected.

Are there other units you want to keep in the relation? If the offering and consuming sides can't see each other when a unit is removed, the offering side can't be notified that the unit has gone away and so will keep it around. I don't think there's a clean way to remove the orphaned unit stub from the offering model. What you could do is run remove-relation --force <id> on the offering model and then add the relation again on the consuming model.
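
A minimal sketch of that workaround, assuming the offer is named nagios in the k8s-lma model, the stale relation id on the offering side is 108, and the controllers (lma-ctrl, k8s-ctrl) and consuming application (nrpe) are placeholders for this example:

# On the offering side, inspect what's connected to the offer:
juju switch lma-ctrl:k8s-lma
juju offers --application nagios

# Force-remove the stale relation on the offering model:
juju remove-relation 108 --force

# On the consuming model, re-establish the cross-model relation:
juju switch k8s-ctrl:k8s
juju add-relation nrpe:monitors lma-ctrl:admin/k8s-lma.nagios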

Revision history for this message
Pen Gale (pengale) wrote :

It seems like this is a situation that is going to come up with some frequency, just because you can't control what is happening on someone else's model.

Is the --force flag the right way to work around it? If so, is this documented somewhere?

Revision history for this message
Ian Booth (wallyworld) wrote :

--force is needed because if there are orphaned units, there's no way they can shut down cleanly with all the expected hooks.
This specific case is not called out explicitly (it is but one of many use cases where --force is helpful), but --force is documented:
https://juju.is/docs/removing-things

Revision history for this message
John A Meinel (jameinel) wrote :

It doesn't feel like this has a clear answer and resolution, so I'm marking this Incomplete, but it also doesn't look like we would implement changes on our end at this point.

Changed in juju:
status: New → Incomplete
importance: Undecided → Medium
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

I have seen this problem happen in a customer environment. Paul's description matches exactly. Juju version 2.9.31.

Among several removed CMRs, only one, removed months ago, still remained (other CMRs were removed successfully both before and after the troublesome one). The nagios unit would still report the removed units in its dashboard. Running the relation-ids and relation-list hook commands via juju run lists the troublesome monitors:108 relation and all of that relation's removed remote units, respectively.
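
For reference, a sketch of how those hook commands can be invoked from outside the unit on Juju 2.x; the monitors endpoint and relation id 108 are taken from this environment, and the unit name assumes nagios/0:

juju run --unit nagios/0 'relation-ids monitors'
juju run --unit nagios/0 'relation-list -r monitors:108'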

I checked the logs (at DEBUG level) and mongodb extensively. There are no logs pointing to anything gone wrong or out of the ordinary. In mongodb, relation 108 does not exist, and inspecting the entire db turns up no mention of it at all. Everything had been cleaned up.

However, a few things to point out:

1) At some point "remove-relation --force" was run, and the problem still persisted

2) After (1) above, the mongodb dump was collected for analysis, so we do not have data to compare against to confirm what "remove-relation --force" really removed/cleaned up.

3) Ultimately the final solution was to restart the jujud agent on the nagios/0 unit. This removed the monitors:108 entry and all the remote unit listings, and cleared up the invalid monitored entries in the nagios dashboard. The customer states that this had already been attempted in the past, but that was before "remove-relation --force".

It sounds to me like there is something that can be done here. If remove-relation --force can forcefully clean up orphaned stuff, as @wallyworld says, then it could also forcefully clear the cache, effectively as if jujud had been restarted, so that the cache has to be repopulated and no longer contains the orphaned units that were cleaned up. Alternatively, perhaps it could selectively remove the forcefully removed units from the cache.

I'm reopening this for further analysis now that we have more clues. As to what caused the issue in the first place, I suppose it could be random, like an RPC or HTTP message being lost, given that the customer reported that no other deleted CMRs hung around.

Changed in juju:
status: Expired → New
tags: added: sts
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Actually, if an RPC or HTTP request/response was lost, isn't there a retransmission or acknowledgment mechanism in place to prevent one side of the relation from not knowing that it was removed?

Revision history for this message
Ian Booth (wallyworld) wrote :

The underlying transport might well retry requests at that level, but for multi-controller cross model relations, there's no guarantee that the other controller is still running when the teardown happens. Hence --force is an option to ask the controller to tear down one side of the relation without getting blocked if the coordination with the other remote units fails.

Can we get info from the mongo dump that was taken after remove-relation --force but before the agent restart?

Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

@wallyworld I sent you an internal email regarding the dump.

Revision history for this message
Ian Booth (wallyworld) wrote :

I can see from the mongodb dump that, in the nagios model, the relation for (at least) one of the units from the removed k8s model still exists. The relation is #165, and as well as the core relation entity existing, the nagios/0 unit state still references a number of the removed consuming units: remote-a57284f04d11431b8ad94b77de6ece98/1 ... remote-a57284f04d11431b8ad94b77de6ece98/14

The relation is marked as dying, so what appears to have happened is that the destroyed model on the consuming side was removed before the offering side was done removing its artefacts. One way this can happen is if --force is used, but the root cause is not clear.

The other artefacts for the removed consuming unit include the tokens to map the entities between the models.

So we have these mongo collections with "orphaned" data:

- relations
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:nagios:monitors remote-a57284f04d11431b8ad94b77de6ece98:monitors"

- applicationOfferConnections
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:nagios:monitors remote-a57284f04d11431b8ad94b77de6ece98:monitors"

- relationscopes
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/19"

- remoteApplications
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:remote-a57284f04d11431b8ad94b77de6ece98"

- remoteEntities
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:application-remote-a57284f04d11431b8ad94b77de6ece98"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:relation-nagios.monitors#remote-a57284f04d11431b8ad94b77de6ec

- settings
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#remote-a57284f04d11431b8ad94b77de6ece98"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/16"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/17"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/18"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/19"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/20"
id: "82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#provider#remote-a57284f04d11431b8ad94b77de6ece98/21"

- unitstates
The record with id "82dcf2b0-8352-4695-873c-5f791b279bb7:u#nagios/0#charm" has a "relation-state" map with key "108"

To start with, you could try to remove the dying relation:

juju remove-relation 165 --force

Depending on what gets cleaned up, you may then need to do some mongo surgery to manually remove the above records. The unitstates record is not removed, just the affected map entry in the "relation-state" map. Any manual db changes would need to have the three controller agents stopped first and restarted afterwards.
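
For illustration, a hypothetical sketch of that surgery in the mongo shell, connected to the controller's juju database. The collection names and ids come from the dump above, but the exact commands are an assumption, not something verified against this deployment:

// Remove the orphaned relation, offer connection and remote application docs.
db.relations.deleteOne({_id: "82dcf2b0-8352-4695-873c-5f791b279bb7:nagios:monitors remote-a57284f04d11431b8ad94b77de6ece98:monitors"})
db.applicationOfferConnections.deleteOne({_id: "82dcf2b0-8352-4695-873c-5f791b279bb7:nagios:monitors remote-a57284f04d11431b8ad94b77de6ece98:monitors"})
db.remoteApplications.deleteOne({_id: "82dcf2b0-8352-4695-873c-5f791b279bb7:remote-a57284f04d11431b8ad94b77de6ece98"})
db.remoteEntities.deleteOne({_id: "82dcf2b0-8352-4695-873c-5f791b279bb7:application-remote-a57284f04d11431b8ad94b77de6ece98"})
// (plus the second remoteEntities record, whose id is truncated above)

// Remove the per-relation scope and settings documents by id prefix.
db.relationscopes.deleteMany({_id: /^82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#/})
db.settings.deleteMany({_id: /^82dcf2b0-8352-4695-873c-5f791b279bb7:r#165#/})

// For unitstates, only unset the "108" key in the relation-state map;
// do not delete the whole document.
db.unitstates.updateOne(
  {_id: "82dcf2b0-8352-4695-873c-5f791b279bb7:u#nagios/0#charm"},
  {$unset: {"relation-state.108": ""}}
)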

Changed in juju:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Canonical Juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired