Juju incorrectly thinks K8s unit still exists

Bug #1898630 reported by Kenneth Koski
This bug affects 1 person
Affects: Canonical Juju
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

An example can be found here:

https://github.com/juju-solutions/bundle-kubeflow/pull/225/checks?check_run_id=1211505199

Notice that in the `Deploy Kubeflow` step, these messages get repeated until `juju-wait` times out:

    DEBUG:root:dex-auth/2 workload status is terminated since 2020-10-05 21:19:50+00:00
    DEBUG:root:dex-auth/2 juju agent status is failed since 2020-10-05 21:19:58+00:00

Then under the `Debug failures` step, `juju status` shows this output:

    dex-auth/2 terminated failed 10.1.89.52 5556/TCP unit stopped by the cloud
    dex-auth/4* active idle 10.1.89.54 5556/TCP

Meanwhile, listing all pods in microk8s shows only this dex-auth pod:

    kubeflow dex-auth-64b88fb58-xpp7f 1/1 Running 0 9m3s

So Juju seems to think that dex-auth/2 is still around, even though it isn't.

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.8.5
status: New → Triaged
importance: Undecided → High
Thomas Miller (tlmiller)
Changed in juju:
assignee: nobody → Thomas Miller (tlmiller)
status: Triaged → In Progress
Changed in juju:
milestone: 2.8.5 → 2.8.6
Thomas Miller (tlmiller) wrote :

Hey Ken,

I have been looking into the bug. I have tracked the problem down to a reconciliation problem in Juju, where it compares how many units were around versus how many it should have and then puts the extra units into this removed state that requires manual cleanup. I am working with Ian now to figure out what the correct logic should be and will then submit the PR.

Cheers
Tom

Ian Booth (wallyworld) wrote :

The Juju logic seems correct - if a pod goes to terminating state, Juju will mark the corresponding unit as "stopped". It stays like that until the pod is finally fully removed, and then the unit gets removed too. Sometimes the pod can stay in terminating state for a while, so the juju unit remains as well. If the system is really busy, that can also affect how long it takes for the state of juju and k8s to reconcile.
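As an illustration only, the lifecycle described above (terminating pod, then "stopped" unit, then removal of both) could be sketched roughly as follows. This is not Juju's actual code: `PodPhase` and `reconcile_unit` are hypothetical names, and the "active" status for a running pod is an assumption taken from the `juju status` output earlier in this report.

```python
from enum import Enum
from typing import Optional


class PodPhase(Enum):
    """Hypothetical, simplified view of a pod's state (not a Kubernetes API type)."""
    RUNNING = "running"
    TERMINATING = "terminating"


def reconcile_unit(pod_phase: Optional[PodPhase]) -> Optional[str]:
    """Return the status a unit would be given for its pod.

    pod_phase is None once the pod has been fully removed from the cluster;
    in that case None is returned to signal that the unit is removed as well.
    """
    if pod_phase is None:
        # Pod is gone, so the unit goes away too.
        return None
    if pod_phase is PodPhase.TERMINATING:
        # Pod still exists but is shutting down: the unit lingers as "stopped".
        return "stopped"
    # Assumed status for a healthy, running pod.
    return "active"


# A pod stuck in the terminating state keeps its unit visible as "stopped".
print(reconcile_unit(PodPhase.TERMINATING))
```

Under this model, a unit that lingers in `juju status` would correspond to a pod that has not yet finished terminating, which is why the time to reconcile can vary with cluster load.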

We've not seen this same issue on recent kubeflow test runs. We can mark the bug as Incomplete until we get a reproduction with the latest kubeflow tests.

Changed in juju:
status: In Progress → Incomplete
assignee: Thomas Miller (tlmiller) → nobody
Changed in juju:
milestone: 2.8.6 → 2.8.7
Pen Gale (pengale)
Changed in juju:
milestone: 2.8.7 → none
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired