k8s resources dangling when removing stateful set based charm

Bug #1870457 reported by Harry Pidcock
Affects         Status        Importance  Assigned to  Milestone
Canonical Juju  Fix Released  High        Ian Booth

Bug Description

juju deploy cs:~charmed-osm/mongodb-k8s-24
# wait for all green
juju remove-application mongodb-k8s
# wait for application to disappear from juju status
juju deploy cs:~charmed-osm/mongodb-k8s-24
# deploy should be stuck in terminating state
juju debug-log
# has errors like "operator %q exists and is terminating due to dangling %s resource(s)"

Tags: k8s
Ian Booth (wallyworld)
tags: added: k8s
Ian Booth (wallyworld) wrote :

I tried to reproduce this on 2.8-beta1 edge snap.
As soon as the application disappeared from status, I re-deployed and everything seemed to work.

The only log noise I saw is below. It seems the firewaller could handle an app being removed more gracefully.
The same goes for the operator: it gets an app lifecycle event, exits with a not found error, and then restarts. On each restart the initialisation fails, as would be expected with a not found error. A few tweaks should fix that.

tracer: ++ queue handler reactive/mongo.py:34:set_mongodb_active
application-mongodb-k8s: 14:15:47 INFO unit.mongodb-k8s/0.juju-log Invoking reactive handler: reactive/mongo.py:34:set_mongodb_active
application-mongodb-k8s: 14:15:47 INFO unit.mongodb-k8s/0.juju-log status-set: active: ready
controller-0: 14:15:48 WARNING juju.worker.caasfirewaller processing change for application "mongodb-k8s", application "mongodb-k8s" not found
application-mongodb-k8s: 14:15:48 ERROR juju.worker.dependency "operator" manifold worker returned unexpected error: application "mongodb-k8s" not found
application-mongodb-k8s: 14:15:51 ERROR juju.worker.dependency "operator" manifold worker returned unexpected error: failed to initialize caasoperator for "mongodb-k8s": application "mongodb-k8s" not found
application-mongodb-k8s: 14:15:54 ERROR juju.worker.dependency "operator" manifold worker returned unexpected error: failed to initialize caasoperator for "mongodb-k8s": application "mongodb-k8s" not found
application-mongodb-k8s: 14:15:59 ERROR juju.worker.dependency "operator" manifold worker returned unexpected error: failed to initialize caasoperator for "mongodb-k8s": application "mongodb-k8s" not found
application-mongodb-k8s: 14:16:07 INFO unit.mongodb-k8s/1.juju-log Reactive main running for hook install

Changed in juju:
status: Triaged → Incomplete
Ian Booth (wallyworld)
Changed in juju:
status: Incomplete → Triaged
Ian Booth (wallyworld) wrote :

So, the caas operator worker calls client.Charm(), which returns a params.Error with code not found. This should have been converted into an errors.NotFound.

In the defer of the caas operator loop, there's a check for IsNotFound() and a return of ErrTerminateAgent if so. But due to the issue above, the worker still exits with the original not found error and hence bounces a few times until something finally kills it.

Fixing the NotFound error from the Charm() API call reproduces the issue: the caas operator worker is killed immediately, and the next attempt to deploy the charm happens while the original pod is still terminating (it is gone from the juju model but still present in the cluster, terminating).

So the delayed shutdown of the caas operator worker, caused by the mishandled error, is masking the problem. Ideally, the app would not be removed from the juju model until all the cluster resources are gone. More investigation needed.

Ian Booth (wallyworld) wrote :

One issue (not the root cause, though) is that the remote state watcher doesn't properly handle an app being removed. There's a relatively simple fix for that, which removes some of the noise.

Ian Booth (wallyworld)
Changed in juju:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Harry Pidcock (hpidcock)
Changed in juju:
status: Fix Committed → Fix Released