pylibjuju `model.wait_for_idle` observes different application status than `juju status`

Bug #1981833 reported by Andrew Scribner
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Caner Derici

Bug Description

I'm debugging some CI where I use python-libjuju's `model.wait_for_idle(status='active', raise_on_error=True, ...)` on a model that currently has this status:

```
juju status

Model Controller Cloud/Region Version SLA Timestamp
kubeflow uk8sx microk8s/localhost 2.9.32 unsupported 10:21:39-04:00

App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook .../notebooks/admission-web... error 1 admission-webhook 0 no creating or updating custom resource definitions: ensuring
custom resource definition "poddefaults.kubeflow.org" with version "v1beta1": cannot convert v1beta1 crd to v1: custom resource definition group "kubeflow.org" not valid

Unit Workload Agent Address Ports Message
admission-webhook/0* waiting idle waiting for container
```

My expectation was that this would raise because the application is in error, but it does not. When I debug and step through the test, I see the app here showing app.status='active'. Any idea why this might occur?

This can be recreated by:
* bootstrap juju on a kubernetes cloud >=v1.22
* juju deploy admission-webhook --channel 1.4/stable
* wait for charm to produce the error (the error is because one of the objects it asks Juju to create is deprecated in k8s 1.22)
* try wait_for_idle, for example like this using pytest-operator:

```
from pytest_operator.plugin import OpsTest

async def test_is_active(ops_test: OpsTest):
    await ops_test.model.wait_for_idle(
        apps=["admission-webhook"],
        status="active",
        raise_on_blocked=True,
        raise_on_error=True,
        timeout=1500,
    )
```

Caner Derici (cderici)
Changed in juju:
assignee: nobody → Caner Derici (cderici)
Revision history for this message
Caner Derici (cderici) wrote :

I don't think this is pylibjuju's fault; it looks like a bug in Juju status.

For some reason Juju is reporting the application status through the API as "active", even though `juju status` shows an "error".

Observe that even though the application appears to be in an error state, the `juju status error` command does not list the `admission-webhook` application, while `juju status active` does.

Investigating further why this is happening.

Changed in juju:
status: New → Triaged
Revision history for this message
Natasha Ho (natashaho) wrote :

I am experiencing something similar. I was comparing `kubectl get pods` and `juju status`. `juju status` reported the app as active around 30s before the pods were actually running.

```
$ uk get pods
NAME READY STATUS RESTARTS AGE
modeloperator-c85789ff5-8lkn2 1/1 Running 0 77s
spark-k8s-0 0/2 Running 0 33s
$ juju status
Model Controller Cloud/Region Version SLA Timestamp
test-spark-operator-integration-0v5s micro microk8s/localhost 2.9.32 unsupported 11:41:35-04:00

App Version Status Scale Charm Channel Rev Address Exposed Message
spark-k8s active 1 spark-k8s 0 10.152.183.100 no

Unit Workload Agent Address Ports Message
spark-k8s/0* active idle 10.1.64.95
```
juju version: 2.9.32-ubuntu-amd64
microk8s version: 1.22/stable

Revision history for this message
Natasha Ho (natashaho) wrote :

Following up on the admission-webhook example. I tried connecting to the model using libjuju. The app and unit are in active state according to libjuju, while `juju status` shows error or waiting statuses.
```
>>> from juju import jasyncio
>>> from juju.model import Model
>>> async def connect_current_model():
...     model = Model()
...     try:
...         # connect to the current model with the current user, per the Juju CLI
...         await model.connect()
...         print('There are {} applications'.format(len(model.applications)))
...         return model
...     except:
...         print("Failed to connect")
...
>>> current_model = jasyncio.run(connect_current_model())
>>> current_model.state.applications['admission-webhook'].status
'active'
>>> current_model.state.units['admission-webhook/0'].agent_status
'idle'
>>> current_model.state.units['admission-webhook/0'].workload_status
'active'
```
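The mismatch above can be demonstrated mechanically by parsing the CLI's JSON output and comparing it against what libjuju reports. A small sketch; `parse_cli_app_status` is a hypothetical helper that only assumes the `juju status --format=json` key layout:

```python
import json

def parse_cli_app_status(status_json: str, app: str) -> str:
    """Return the application status exactly as the Juju CLI's
    `juju status --format=json` output reports it."""
    data = json.loads(status_json)
    return data["applications"][app]["application-status"]["current"]
```

One could then compare its result against `model.state.applications['admission-webhook'].status` from the session above; per this bug, the CLI says "error" while libjuju says "active".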

Revision history for this message
Caner Derici (cderici) wrote :

Hey Natasha, thanks for the follow-up, I'm currently actively working on this.

So the problem is on the Juju side; it's not about libjuju. And it only happens for caas units (i.e. it works correctly on lxd). You can observe it without libjuju too: run `juju status error` or `juju status waiting` on the CLI and see that neither lists the application, while `juju status active` does show it, funny enough, with a status output exactly as you described, showing "error" in the state.

The status code for units is sort of convoluted, so for a reason I've yet to understand, I see the unit's status as `active` in the State while the output shows it in the error state. I'm trying to get to the bottom of it.

Changed in juju:
status: Triaged → In Progress
importance: Undecided → High
milestone: none → 2.9-next
Revision history for this message
John A Meinel (jameinel) wrote :

As a question to ask: there is the "status of the Unit Agent" vs the "status of the Unit" (e.g. one is talking about the 'jujud' agent and the other about the exit of a charm hook).

I wonder if

  raise_on_error=True

isn't looking at all the possible fields that can go into error. (It could also be the overall application status vs the unit status, etc.)
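The distinction John draws can be made concrete: a unit carries an agent status (the jujud agent, e.g. "idle"), a workload status (the charm, e.g. "error" or "waiting"), and the application has its own aggregate status. A robust raise_on_error check would need to consider all of them. `should_raise_on_error` below is a hypothetical illustration of that idea, not libjuju code:

```python
def should_raise_on_error(app_status: str,
                          unit_workload_status: str,
                          unit_agent_status: str) -> bool:
    """Return True if any of the three status fields reads 'error'.

    A check that inspects only one of these fields would miss the case
    this bug describes, where the app reports 'active' while the real
    state, as shown by `juju status`, is 'error'.
    """
    return "error" in {app_status, unit_workload_status, unit_agent_status}
```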

Ian Booth (wallyworld)
Changed in juju:
milestone: 2.9-next → none
Revision history for this message
Andrew Scribner (ca-scribner) wrote :

Is there any progress on this? This is impacting the robustness of our CI a fair bit - for example, calling wait_for_idle() on something that takes a long time to deploy (like a workload with a large container image) essentially doesn't work. wait_for_idle() sees everything as idle for the waiting period and passes.

Revision history for this message
Caner Derici (cderici) wrote :

Hey Andrew, sorry it appears that this was buried underneath other things for a while.

wait_for_idle has been part of many bug-related discussions lately; we've been fixing a bunch of things about it, and I'm about to push a fix for wait_for_idle discussed in https://github.com/juju/python-libjuju/pull/841 (which targets 2.9, which is what you're using).

However, if I remember correctly, I'm afraid this bug was not particularly about pylibjuju; it was about our API communication with `juju status` being weird about reporting units that are in the error status. I'll re-prioritize this problem, see if I can reproduce it after the wait_for_idle fixes, and continue from there to fix this ASAP for you. Sorry this has been a nuisance for you for quite a while.

In the meantime, if you're hitting issues with wait_for_idle other than this LP bug (like the one you're describing in #6, which sounds to me like a premature timeout), feel free to open an issue on pylibjuju's repo**.

** https://github.com/juju/python-libjuju/issues/new?assignees=&labels=bug&template=BugReport.yml&title=%5BBug%5D%3A+

Revision history for this message
Andrew Scribner (ca-scribner) wrote :

np, thanks for following up!

I think #6 is this bug. If you wait_for_idle() an application that has a slow-to-deploy container (maybe a big image or a slow network), wait_for_idle() sees that application as Active/Idle (as can be seen in wait_for_idle()'s print statements), even though the application is really idle but still working on deploying the container (as can be seen by running `juju status` in a parallel terminal).

Revision history for this message
Caner Derici (cderici) wrote :

It turns out the filtering problem I mentioned earlier in #4 is a tangential issue about the FullStatus endpoint (that I'll be pushing a fix for soon).

But for the main issue here we just pushed a fix[*], which is under review now. @Andrew, if you could give it a shot whenever you can to validate, that would be very useful to confirm the fix. Thanks!

https://github.com/juju/python-libjuju/pull/849

Revision history for this message
Caner Derici (cderici) wrote :

We have confirmation from @tiradojm that the fix is working, so it's landed. I'll mark this as Fix Committed, so @ca-scribner, hopefully it'll work for your setup as well, but feel free to test again with the fix and dispute if it doesn't. Cheers!

Changed in juju:
status: In Progress → Fix Committed
milestone: none → 2.9.43
Changed in juju:
status: Fix Committed → Fix Released