Timed out waiting for model 'openstack' due to executing unit agents

Bug #2061166 reported by Marian Gasparovic
This bug affects 3 people
Affects         Status     Importance  Assigned to  Milestone
OpenStack Snap  Confirmed  High        Unassigned

Bug Description

sunbeam enable ldap errored out with "Timed out while waiting for model 'openstack' to be ready"

There is nothing obvious in juju status openstack; it looks like some unit is executing every time sunbeam checks the status.

One of the testruns - https://solutions.qa.canonical.com/testruns/82f9489c-03ee-41ba-b432-f586e3f38fb7 and its artefacts - https://oil-jenkins.canonical.com/artifacts/82f9489c-03ee-41ba-b432-f586e3f38fb7/index.html

Tags: cdo-qa
James Page (james-page) wrote :

I think you've hit the nail on the head with:

"it looks like some unit is executing every time sunbeam checks the status."

Sunbeam uses the libjuju wait_for_idle method to determine when a particular state has been reached. The check takes a list of applications for which all units must be in the 'active' state, and it also requires that the agents are all idle, which means no hooks are executing.

When the number of units increases, the probability of the check always seeing something executing also increases (think update-status which runs every 5 mins, but across many units this could be *allthetime* somewhere in the model).
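A rough back-of-the-envelope sketch of that effect, not a measurement: assume each unit runs one hook (such as update-status) every 300 seconds, that each run takes `hook_seconds`, and that units are staggered independently. The function name and the hook durations are hypothetical, chosen only for illustration.

```python
def prob_all_idle(num_units: int, hook_seconds: float,
                  interval_seconds: float = 300.0) -> float:
    """Chance that no unit is executing a hook at one random instant.

    Assumes every unit runs exactly one hook per interval and that the
    units are independent of each other (a simplification).
    """
    # Fraction of each interval that a single unit spends executing.
    busy_fraction = min(hook_seconds / interval_seconds, 1.0)
    # All units must be idle at the same moment.
    return (1.0 - busy_fraction) ** num_units


if __name__ == "__main__":
    for units in (10, 50, 100):
        for hook_seconds in (5.0, 30.0):
            p = prob_all_idle(units, hook_seconds)
            print(f"{units:>3} units, {hook_seconds:>4.0f}s hooks: "
                  f"P(all idle) = {p:.4f}")
```

Under these assumptions the chance of catching every agent idle shrinks geometrically with the number of units, and shrinks faster when each hook takes longer to run.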

My gut feel is to not check for the idle agent status, and just rely on the units and the charm code supporting them knowing when active really means active.
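The difference between the two checks can be sketched in plain Python. This is an illustration only and does not use libjuju or Sunbeam code: `UnitStatus`, `model_ready` and the unit names are hypothetical stand-ins for what juju status reports.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class UnitStatus:
    """Hypothetical snapshot of one unit, as seen in juju status."""
    name: str
    workload: str  # e.g. "active", "waiting", "blocked"
    agent: str     # e.g. "idle", "executing"


def model_ready(units: Iterable[UnitStatus], require_idle: bool = True) -> bool:
    """Return True when every unit satisfies the readiness check.

    require_idle=True  : workload must be active AND agent must be idle.
    require_idle=False : only the workload status is considered.
    """
    for unit in units:
        if unit.workload != "active":
            return False
        if require_idle and unit.agent != "idle":
            return False
    return True


if __name__ == "__main__":
    snapshot = [
        UnitStatus("keystone/0", "active", "idle"),
        # Active workload, but a hook happens to be running right now.
        UnitStatus("nova/0", "active", "executing"),
    ]
    print("strict check :", model_ready(snapshot, require_idle=True))
    print("relaxed check:", model_ready(snapshot, require_idle=False))
```

With the relaxed check, a unit that is active but happens to be running a hook no longer blocks readiness. The trade-off is that it relies on the charms setting 'active' only when the workload really is ready.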

James Page (james-page) wrote :

Marking as confirmed as we've seen lots of instances of this both in Solutions QA and from Sunbeam users.

summary: - Timed out waiting for model 'openstack' after enable ldap
+ Timed out waiting for model 'openstack' due to executing unit agents
Changed in snap-openstack:
importance: Undecided → High
status: New → Confirmed
Andre Ruiz (andre-ruiz) wrote :

I see a lot of those in my CI runs. On very fast machines it happens much less often (about 20% of the time), but it still happens. On older machines that barely meet the requirements, it happens a lot (about 60% of the time).

Indeed, there's no obvious issue; the control plane deployment just never finishes, and sunbeam eventually gives up.

Most of the time, if you watch juju status you'll see that everything seems fine and active.

I have a lot of stored logs for such cases. I'm not attaching them yet, but let me know if I should.

James Page (james-page) wrote :

Faster machines spend less time executing hooks, so that might make sense.

I think we need to slightly tweak the way we wait for a model to be done.
