pylibjuju `model.wait_for_idle` does not seem to align with `juju status`
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Triaged | Undecided | Caner Derici |
Bug Description
Hello,
Given the following juju deployment:
status:
app: blocked
units:
0: active / idle (since +120sec)
1: active / idle (since +120sec)
2: active / idle (since +120sec)
I would expect the following to succeed:
await ops_test.
apps=[app], status="active", timeout=1000, wait_for_
)
But it does not. Instead it **consistently** hits a TimeoutError (after 1000 seconds) while waiting for the model, even though it continuously reports:
INFO juju.model:
opensearch/1 [idle] active:
opensearch/2 [idle] active:
opensearch/3 [idle] active:
-----
The same issue happens **intermittently** even when both the application and its units are fully active for longer than idle_period.
-----
juju: 2.9.44
Thank you
description: updated
Changed in juju:
assignee: nobody → Caner Derici (cderici)
status: New → Triaged
Hi Mehdi, thanks for reporting this!
I'm not too surprised by the first example, where the application status is "blocked" while the units are "active / idle". Unfortunately, wait_for_idle makes no distinction between the application and its units when waiting for a particular status. That is, if you're waiting for a particular status, then both the application and the units (if you're waiting for a particular number of units) need to be in that status. We're currently working on a new set of "wait_for" methods to provide more granular control over this.
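To make the rule above concrete, here is a minimal plain-Python model of it. This is only a sketch of the behaviour as described in this comment, not pylibjuju's actual code; the function name `status_wait_satisfied` and its arguments are hypothetical.

```python
def status_wait_satisfied(app_status, unit_statuses, wanted):
    """Hypothetical model: the wait is satisfied only when the application
    AND every one of its units report the wanted status."""
    return app_status == wanted and all(s == wanted for s in unit_statuses)


# Deployment from the report: application blocked, all three units active.
reported = status_wait_satisfied("blocked", ["active", "active", "active"], "active")

# Same units, but with the application itself active as well.
all_active = status_wait_satisfied("active", ["active", "active", "active"], "active")

print(reported, all_active)
```

Under this model the reported deployment never satisfies a wait for status="active", however long the units stay idle, because the application-level status is still "blocked".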
Your second example, where you get a timeout even when everything is active, is more concerning. I'd like to see more details about that, such as how to reproduce it.
One thing to note is that you set idle_period to 120 seconds. That means everything must stay idle (with status=active) for 120 seconds before wait_for_idle decides that it's done waiting. Unless you have a good reason for this, it might be a good idea to reduce that number (the default is 15 seconds, I think), because during those 120 seconds the units might occasionally change state for maintenance or update purposes, which would reset the 120-second timer. That might be why your 1000-second timeout is being triggered.
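The timer-reset effect can be illustrated with a small simulation. This is a simplified sketch that assumes any non-idle observation restarts the idle timer; it is not pylibjuju's implementation, and `wait_finishes_at` and the sample data are made up for illustration.

```python
def wait_finishes_at(samples, idle_period, timeout):
    """Hypothetical model of an idle wait.

    samples: (timestamp_seconds, is_active_and_idle) pairs in time order.
    Returns the timestamp at which the wait would complete, or None if the
    timeout is reached first.
    """
    idle_since = None
    for now, ok in samples:
        if now > timeout:
            return None
        if not ok:
            idle_since = None  # any blip restarts the idle timer
            continue
        if idle_since is None:
            idle_since = now
        if now - idle_since >= idle_period:
            return now
    return None


# Made-up data: polled every 10s, with a brief non-idle blip every 100s.
samples = [(t, t % 100 != 0) for t in range(0, 1001, 10)]

long_wait = wait_finishes_at(samples, idle_period=120, timeout=1000)
short_wait = wait_finishes_at(samples, idle_period=15, timeout=1000)
print(long_wait, short_wait)
```

With these made-up samples, the longest uninterrupted idle stretch is shorter than 120 seconds, so the 120-second wait never completes before the 1000-second timeout, while the 15-second wait finishes at t=30.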