Comment 7 for bug 1903216

Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

I have seen this problem happen in a customer environment. Paul's description matches exactly. Juju version 2.9.31.

Among several removed CMRs, only one, removed months ago, still lingered (other CMRs were removed successfully both before and after the troublesome one). The Nagios unit still reported the removed units in its dashboard. Running the relation-ids and relation-list commands through juju run still listed, respectively, the monitors:108 relation (the troublesome one) and all the remote units of that relation that had been removed.
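For anyone wanting to repeat the check, here is a minimal sketch of the two queries, assuming Juju 2.9-style `juju run --unit` syntax. The helper names (`juju_run_cmd`, `stale_relation_checks`) are made up for illustration; only the unit name, the endpoint name and the relation id come from the environment described above. The script only builds and prints the command lines, it does not run them.

```python
import shlex


def juju_run_cmd(unit, hook_tool_cmd):
    """Build a Juju 2.9-style `juju run` invocation targeting a single unit."""
    return ["juju", "run", "--unit", unit, hook_tool_cmd]


def stale_relation_checks(unit, endpoint, relation_id):
    """Return the two hook-tool queries used to spot a leftover relation.

    relation-ids lists the relation ids the unit still knows for an endpoint;
    relation-list -r lists the remote units the unit still sees in one relation.
    """
    rel = f"{endpoint}:{relation_id}"
    return [
        juju_run_cmd(unit, f"relation-ids {endpoint}"),
        juju_run_cmd(unit, f"relation-list -r {rel}"),
    ]


if __name__ == "__main__":
    # Values taken from this report: unit nagios/0, relation monitors:108.
    for cmd in stale_relation_checks("nagios/0", "monitors", 108):
        print(shlex.join(cmd))
```

If the relation had really gone from the unit's point of view, the first command should no longer list monitors:108.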

I checked the logs (at DEBUG level) and mongodb extensively. Nothing in the logs points to anything going wrong or out of the ordinary. In mongodb, relation 108 does not exist, and a search of the entire db finds no mention of it. Everything had been cleaned up.

However, a few things to point out:

1) At some point "remove-relation --force" was run, and the problem still persisted

2) The mongodb dump was collected for analysis only after (1) above, so we have no earlier data to compare against and confirm what "remove-relation --force" really removed/cleaned up

3) Ultimately the final solution was to restart the jujud agent on the nagios/0 unit. This removed the monitors:108 entry and the listing of all its remote units, and cleared the invalid monitored entries from the nagios dashboard. The customer states that a restart had already been attempted in the past, but that was before "remove-relation --force" was run.

It sounds to me like there is something that can be done here. If remove-relation --force can forcefully clean up orphaned stuff as @wallyworld says, then it could also forcefully clear the cache, effectively as if jujud had been restarted, so that the cache is repopulated and no longer contains the orphaned units that were cleaned up. Alternatively, perhaps it could selectively remove just the forcefully removed units from the cache.
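To make the two options concrete, here is a toy sketch of what I mean. This is not Juju's actual code or data model: `RelationCache` and its methods are hypothetical names, and the cache is assumed to be a simple map from relation id to the remote units seen in it.

```python
class RelationCache:
    """Toy stand-in for an agent-side cache: relation id -> remote unit names."""

    def __init__(self):
        self._members = {}

    def add_unit(self, relation_id, unit):
        """Record that a remote unit is part of a relation."""
        self._members.setdefault(relation_id, set()).add(unit)

    def relation_ids(self):
        """Relation ids the cache still knows about, sorted."""
        return sorted(self._members)

    def units(self, relation_id):
        """Remote units cached for one relation, sorted (empty if unknown)."""
        return sorted(self._members.get(relation_id, ()))

    def clear(self):
        """Option 1: drop everything, as an agent restart effectively does."""
        self._members.clear()

    def evict_relation(self, relation_id):
        """Option 2: drop only the forcefully removed relation and its units.

        Returns True if an entry was actually evicted.
        """
        return self._members.pop(relation_id, None) is not None


if __name__ == "__main__":
    cache = RelationCache()
    # "monitors:108" is the stale relation from this report; the other id
    # and all unit names are made up for the example.
    cache.add_unit("monitors:108", "remote-app/0")
    cache.add_unit("monitors:108", "remote-app/1")
    cache.add_unit("monitors:112", "other-app/0")

    cache.evict_relation("monitors:108")
    print(cache.relation_ids())
```

The selective variant leaves healthy relations untouched, while the full clear is simpler and matches what the restart did in practice, at the cost of repopulating everything.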

I'm reopening this for further analysis now that we have more clues. As to what caused the issue in the first place, I suppose it could be something random, like a lost RPC or HTTP request, given that the customer reported that no other deleted CMRs hung around.