k8s provider: replicaset of 2 but there are more units than the 2

Bug #1950705 reported by Haw Loeung
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Yang Kelvin Liu

Bug Description

Hi,

Filing this bug as advised by Ian. We have a K8s model where the application has a scale of 2, but there are more than 2 units:

Model Controller Cloud/Region Version SLA Timestamp
stg-wordpress-k8s prodstack-is-2 k8s-is-external/default 2.9.18 unsupported 02:12:38Z

App Version Status Scale Charm Store Channel Rev OS Address Message
wordpress .../wordpress:edge active 3/2 wordpress local stable 2 kubernetes 10.85.0.215

Unit Workload Agent Address Ports Message
wordpress/82 terminated idle 80/TCP unit stopped by the cloud
wordpress/83 terminated idle 80/TCP unit stopped by the cloud
wordpress/84 terminated idle 80/TCP unit stopped by the cloud
wordpress/85 terminated idle 80/TCP unit stopped by the cloud
wordpress/86 terminated idle 80/TCP unit stopped by the cloud
wordpress/87 terminated idle 80/TCP unit stopped by the cloud
wordpress/88 terminated idle 80/TCP unit stopped by the cloud
wordpress/89 terminated idle 80/TCP unit stopped by the cloud
wordpress/90 terminated idle 80/TCP unit stopped by the cloud
wordpress/91 terminated idle 80/TCP unit stopped by the cloud
wordpress/92 terminated idle 80/TCP unit stopped by the cloud
wordpress/93 terminated idle 80/TCP unit stopped by the cloud
wordpress/94 terminated idle 80/TCP unit stopped by the cloud
wordpress/95 terminated idle 80/TCP unit stopped by the cloud
wordpress/96 terminated idle 80/TCP unit stopped by the cloud
wordpress/97 terminated idle 80/TCP unit stopped by the cloud
wordpress/98 terminated idle 80/TCP unit stopped by the cloud
wordpress/99 terminated idle 80/TCP unit stopped by the cloud
wordpress/100 terminated idle 80/TCP unit stopped by the cloud
wordpress/101 terminated idle 80/TCP unit stopped by the cloud
wordpress/102 terminated idle 80/TCP unit stopped by the cloud
wordpress/103 terminated idle 80/TCP unit stopped by the cloud
wordpress/104 terminated idle 80/TCP unit stopped by the cloud
wordpress/105 terminated idle 80/TCP unit stopped by the cloud
wordpress/106 terminated idle 80/TCP unit stopped by the cloud
wordpress/107 terminated idle 80/TCP unit stopped by the cloud
wordpress/108 terminated idle 80/TCP unit stopped by the cloud
wordpress/109 terminated idle 80/TCP unit stopped by the cloud
wordpress/110* terminated idle 80/TCP unit stopped by the cloud
wordpress/111 terminated idle 80/TCP unit stopped by the cloud
wordpress/112 terminated idle 80/TCP unit stopped by the cloud
wordpress/113 terminated idle 80/TCP unit stopped by the cloud
wordpress/114 terminated idle 80/TCP unit stopped by the cloud
wordpress/115 terminated idle 80/TCP unit stopped by the cloud
wordpress/116 terminated idle 80/TCP unit stopped by the cloud
wordpress/117 terminated idle 80/TCP unit stopped by the cloud
wordpress/118 terminated idle 80/TCP unit stopped by the cloud
wordpress/119 terminated idle 80/TCP unit stopped by the cloud
wordpress/120 terminated idle 80/TCP unit stopped by the cloud
wordpress/121 terminated idle 80/TCP unit stopped by the cloud
wordpress/122 terminated idle 80/TCP unit stopped by the cloud
wordpress/123 terminated idle 80/TCP unit stopped by the cloud
wordpress/124 terminated idle 80/TCP unit stopped by the cloud
wordpress/125 terminated idle 80/TCP unit stopped by the cloud
wordpress/126 terminated idle 80/TCP unit stopped by the cloud
wordpress/127 terminated idle 80/TCP unit stopped by the cloud
wordpress/128 terminated idle 80/TCP unit stopped by the cloud
wordpress/129 terminated idle 80/TCP unit stopped by the cloud
wordpress/130 terminated idle 80/TCP unit stopped by the cloud
wordpress/131 terminated idle 80/TCP unit stopped by the cloud
wordpress/132 terminated idle 80/TCP unit stopped by the cloud
wordpress/133 terminated idle 80/TCP unit stopped by the cloud
wordpress/134 terminated idle 80/TCP unit stopped by the cloud
wordpress/135 terminated idle 80/TCP unit stopped by the cloud
wordpress/136 terminated idle 80/TCP unit stopped by the cloud
wordpress/137 terminated idle 80/TCP unit stopped by the cloud
wordpress/138 error idle 10.86.56.178 80/TCP hook failed: "leader-settings-changed"
wordpress/139 terminated idle 80/TCP unit stopped by the cloud
wordpress/140 terminated idle 80/TCP unit stopped by the cloud
wordpress/141 terminated idle 80/TCP unit stopped by the cloud
wordpress/142 terminated idle 80/TCP unit stopped by the cloud
wordpress/143 terminated idle 80/TCP unit stopped by the cloud
wordpress/144 terminated idle 80/TCP unit stopped by the cloud
wordpress/147 error idle 10.86.56.216 80/TCP hook failed: "leader-settings-changed"
wordpress/148 error idle 10.86.77.184 80/TCP hook failed: "config-changed"

Juju should be removing / cleaning up these units so that there aren't 65 of them, as there are currently.
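
To put a number on the mismatch, the unit rows above can be tallied by workload status. The following is a minimal sketch in Python, assuming the plain-text `juju status` layout shown in this report; the `count_units_by_status` helper is hypothetical and not part of Juju, and the sample is just a few rows copied from the table above.

```python
from collections import Counter

def count_units_by_status(status_lines):
    """Tally units by workload status from `juju status` unit rows.

    Assumes each unit row starts with "<app>/<number>" (optionally followed
    by "*" for the leader) and that the second column is the workload status.
    """
    counts = Counter()
    for line in status_lines:
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[0]:
            continue  # skip headers, blank lines, and non-unit rows
        counts[fields[1]] += 1
    return counts

# A few rows copied from the status output in this report.
sample = [
    "Unit Workload Agent Address Ports Message",
    "wordpress/110* terminated idle 80/TCP unit stopped by the cloud",
    "wordpress/137 terminated idle 80/TCP unit stopped by the cloud",
    'wordpress/138 error idle 10.86.56.178 80/TCP hook failed: "leader-settings-changed"',
]
counts = count_units_by_status(sample)
print(dict(counts))
```

Applied to the full unit table in this report, the same tally comes to 62 units in "terminated" and 3 in "error", against a scale of 2.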

Loïc Gomez (kotodama) wrote :

Hi,

We're having this stale/terminated unit issue on another k8s model:
mm-pd-bot/15 terminated idle 2160/TCP unit stopped by the cloud

This is an environment with only one pod. We tried to scale the application down to 0 with:
$ juju remove-unit --application mm-pd-bot --num-units 1
This removed the valid pod, but the terminated pod was still there.

From the operator pod logs, it's being loop-restarted:
https://paste.ubuntu.com/p/4SWdk2nDGW/

It's also preventing mojo run / applying updates to the bundle without adding additional-ready-states=terminated to the manifest.

Is there a way to remove stale units/pods from juju when they're unknown to kubectl?
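
The kind of check being asked for can be sketched as a comparison between the units Juju still lists and the pods the cluster actually knows about. This is only an illustration in Python under stated assumptions: `find_stale_units` is a hypothetical helper, the inputs are plain data rather than calls to Juju or Kubernetes, and the `mm-pd-bot/16` unit is made up for the example. It is not how Juju implements the fix.

```python
def find_stale_units(juju_units, live_pod_units):
    """Return terminated units that have no backing pod, sorted by name.

    juju_units: mapping of unit name -> workload status as Juju reports it.
    live_pod_units: set of unit names that still have a pod in the cluster.
    """
    return sorted(
        name
        for name, status in juju_units.items()
        if status == "terminated" and name not in live_pod_units
    )

# mm-pd-bot/15 is the stale unit from this comment; mm-pd-bot/16 is invented.
juju_units = {"mm-pd-bot/15": "terminated", "mm-pd-bot/16": "active"}
live_pod_units = {"mm-pd-bot/16"}
stale = find_stale_units(juju_units, live_pod_units)
print(stale)
```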

John A Meinel (jameinel) wrote :

I don't believe things marked "terminated" hang around forever. I believe the deployment logic in Kubernetes will naturally start a new pod before killing the old ones while the pod spec is changing (hence the 'active 3/2').
However, the status output showing lots of "terminated" units looks like the workload pod is misconfigured, causing a lot of pods to start up and then die, with a new pod provisioned to replace each one.
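
The 'active 3/2' reading is consistent with a rolling update that is allowed to surge above the requested replica count. As a rough sketch of that arithmetic in Python, assuming the workload is a Kubernetes Deployment using the default RollingUpdate strategy (maxSurge of 25%, rounded up), which this report does not confirm; `max_pods_during_rollout` is a hypothetical helper name:

```python
import math

def max_pods_during_rollout(replicas, max_surge_fraction=0.25):
    """Upper bound on pods during a rolling update.

    Assumes a percentage maxSurge, which Kubernetes converts to an
    absolute number by rounding up.
    """
    return replicas + math.ceil(replicas * max_surge_fraction)

peak = max_pods_during_rollout(2)
print(peak)
```

With a scale of 2 this gives a peak of 3 pods, which would account for a brief 3/2 but not for dozens of lingering terminated units.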

Changed in juju:
importance: Undecided → High
status: New → Incomplete
Tom Haddon (mthaddon) wrote :

In reply to the comment about things marked terminated not hanging around forever, in the case of the mm-pd-bot/15 unit it's been in this state for almost a week now:

https://pastebin.ubuntu.com/p/MSBV3zmTn4/

Can you let us know what other information you need to help figure out what needs to be done for juju to remove the reference to the terminated unit?

Changed in juju:
status: Incomplete → New
Changed in juju:
assignee: nobody → Yang Kelvin Liu (kelvin.liu)
status: New → In Progress
milestone: none → 2.9.19
Changed in juju:
milestone: 2.9.19 → 2.9.20
Yang Kelvin Liu (kelvin.liu) wrote :

https://github.com/juju/juju/pull/13511 will land in 2.9 to fix this issue.

Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9.20 → 2.9.21
Changed in juju:
status: Fix Committed → Fix Released