pylibjuju `model.wait_for_idle` does not seem to align with `juju status`

Bug #2034562 reported by Mehdi B.
This bug affects 1 person
Affects         Status   Importance  Assigned to   Milestone
Canonical Juju  Triaged  Undecided   Caner Derici

Bug Description

Hello,

Given the following juju deployment:

status:
    app: blocked
    units:
        0: active / idle (since +120sec)
        1: active / idle (since +120sec)
        2: active / idle (since +120sec)

I would expect the following to succeed:
await ops_test.model.wait_for_idle(
    apps=[app], status="active", timeout=1000, wait_for_exact_units=3, idle_period=120
)

But it does not. Instead, it **consistently** hits a TimeoutError (1000sec) waiting for the model, even though the log continuously reports:
INFO juju.model:model.py:2618 Waiting for model:
  opensearch/1 [idle] active:
  opensearch/2 [idle] active:
  opensearch/3 [idle] active:

-----
The same issue happens **intermittently** even when both the application and all of its units have been fully active for longer than idle_period.

-----
juju: 2.9.44

Thank you

Mehdi B. (medib)
description: updated
Changed in juju:
assignee: nobody → Caner Derici (cderici)
status: New → Triaged
Caner Derici (cderici) wrote :

Hi Mehdi, thanks for reporting this!

I'm not too surprised by the first example, where the status of the application is "blocked" while the units are "active / idle". Unfortunately, wait_for_idle makes no distinction between the application and its units when it comes to waiting for a particular status. That is, if you're waiting for a particular status, then both the application and the units (if you're waiting for a particular number of units) need to be in that status. We're currently working on a new set of "wait_for" methods to provide more granular control over this.
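In the meantime, one possible workaround is to wait on the unit statuses only and ignore the application status. The following is only a sketch: units_settled is a hypothetical helper (not part of pylibjuju), and it assumes each unit object exposes workload_status and agent_status attributes. The stand-in objects below exist only so the helper can be run without a live model. Note that, unlike wait_for_idle, this check does not honor idle_period.

```python
from types import SimpleNamespace


def units_settled(units, status="active", expected_units=3):
    """Return True when exactly `expected_units` units are in `status` and idle.

    Hypothetical helper: it looks only at the units, never at the application.
    """
    units = list(units)
    if len(units) != expected_units:
        return False
    return all(
        u.workload_status == status and u.agent_status == "idle" for u in units
    )


# Stand-in objects mimicking the deployment from the report
# (application blocked, three units active / idle).
fake_units = [
    SimpleNamespace(workload_status="active", agent_status="idle")
    for _ in range(3)
]
settled = units_settled(fake_units)

# In a real test this could be passed to Model.block_until, assuming `app`
# is the application name and `ops_test` is the pytest-operator fixture:
#
#   await ops_test.model.block_until(
#       lambda: units_settled(ops_test.model.applications[app].units),
#       timeout=1000,
#   )
```

Because the application status is never consulted, this condition can hold for the deployment described above even while the application itself reports "blocked".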

Your second example, where you said you're getting a timeout even when everything is active, is more concerning. I'd like to see more details about that, such as how to reproduce it.

One thing to note here is that you set idle_period to 120 seconds. That means you want everything to stay idle (with status=active) for 120 seconds before wait_for_idle decides that it's done waiting. Unless you have a good reason for this, it might be a good idea to reduce that number (the default is 15 seconds, I think), because during those 120 seconds the units might occasionally change state for maintenance or update purposes, which would reset the 120-second timer. That might be the reason your 1000-second timeout is being triggered.
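To illustrate the timer reset, here is a simplified model of an idle-period check. It is not pylibjuju's actual implementation; the function name, the 10-second sampling interval and the "brief non-idle blip every 100 seconds" pattern are all assumptions made up for the example.

```python
def first_time_idle_period_met(samples, idle_period):
    """Return the first timestamp at which the unit has been continuously
    idle for at least `idle_period` seconds, or None if that never happens.

    `samples` is an iterable of (timestamp_seconds, is_idle) observations.
    """
    idle_since = None
    for t, is_idle in samples:
        if not is_idle:
            # Any non-idle observation resets the timer.
            idle_since = None
            continue
        if idle_since is None:
            idle_since = t
        if t - idle_since >= idle_period:
            return t
    return None


# Hypothetical unit sampled every 10 s for 1000 s; it briefly leaves the
# idle state once every 100 s (e.g. for an update hook).
samples = [(t, t % 100 != 0) for t in range(0, 1001, 10)]

long_wait = first_time_idle_period_met(samples, idle_period=120)
short_wait = first_time_idle_period_met(samples, idle_period=15)
```

In this model the longest uninterrupted idle stretch is shorter than 120 seconds, so long_wait is None for the whole 1000-second window, while the 15-second period is satisfied shortly after the first blip.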
