openstack: unit creation serialized?

Bug #1692493 reported by James Page
Affects                     Status        Importance  Assigned to      Milestone
Canonical Juju              Fix Released  Medium      Heather Lanigan
OpenStack Charm Test Infra  Opinion       High        Unassigned

Bug Description

version: 2.2-beta4
provider: openstack (ocata based cloud)

When deploying models, it would appear that unit creation is serialised - each unit appears to get tracked in more detail during creation:

Machine  State    DNS        Inst id                               Series  AZ    Message
0        started  10.5.0.21  f780bc5f-336a-4beb-b6e4-849eea5b518d  zesty   nova  ACTIVE
1        started  10.5.0.3   76be5c85-049c-4bc6-965f-5bb73ea10a7e  zesty   nova  ACTIVE
2        pending  10.5.0.4   a247ca2e-a433-4a48-b15d-bba797aa47b1  zesty   nova  ACTIVE
3        pending  10.5.0.19  062685c3-a9b6-4653-b24f-a3e5b4adae13  zesty   nova  ACTIVE
4        pending  10.5.0.17  af55210e-a3fc-4888-80a1-3a08f0cdbe50  zesty   nova  ACTIVE
5        pending             pending                               zesty         "instance \"82e35d37-2d1b-42a8-8f42-947902d76352\" has status BUILD", wait 10 seconds before retry, attempt 1

which is great; however, it now takes a lot longer to deploy models than with 2.1.x or 1.25.x, which did some level of parallel creation of units for applications.

Was this an intended change?

Revision history for this message
James Page (james-page) wrote :

(I've not tried on other providers yet).

Ryan Beisner (1chb1n)
tags: added: uosci
Changed in charm-test-infra:
importance: Undecided → High
Revision history for this message
John A Meinel (jameinel) wrote :

I believe the issue is that OpenStack may fail provisioning during the
BUILD stage (there have been requests to retry provisioning automatically
at a later stage).

I believe we've always provisioned one machine at a time but we were
assuming that they would succeed once they got to BUILD, but now we're
waiting longer. It may be that we need a different fix for automatic
retries.

I believe there was a quick 'maybe we should do this differently', as we
currently do one machine, and retry it, rather than make several requests
(batch/all?) and then iterate over the various results looking for what
needs to be retried.

John
=:->
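
To illustrate the two strategies described in the comment above, here is a minimal Python sketch. It is not Juju's provisioner code: it only models elapsed time in simulated seconds. The function names, the build times, and the default poll interval (borrowed from the "wait 10 seconds before retry" message in the bug description) are illustrative assumptions.

```python
def serial_provision(build_times, poll_interval=10):
    """Start one machine, poll it until it is ready, then start the next.

    build_times: seconds each (hypothetical) instance spends building.
    Returns the total simulated seconds until every machine is ready.
    """
    elapsed = 0
    for build in build_times:
        # Number of polls needed before this instance is seen as ready
        # (ceiling division without importing math).
        polls = -(-build // poll_interval)
        elapsed += polls * poll_interval
    return elapsed


def batched_provision(build_times, poll_interval=10):
    """Request every machine up front, then poll the whole batch.

    Each pass over the results keeps only the instances that are
    still building, so the total wait is set by the slowest one.
    """
    pending = list(build_times)
    elapsed = 0
    while pending:
        elapsed += poll_interval
        pending = [build for build in pending if build > elapsed]
    return elapsed


# Six machines that each take a made-up 60 seconds to build.
machines = [60] * 6
print("serial: ", serial_provision(machines), "seconds")
print("batched:", batched_provision(machines), "seconds")
```

With these made-up numbers the serial loop takes 360 simulated seconds and the batched loop takes 60, which is the kind of gap the comment suggests a batch-and-iterate approach could close.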

Revision history for this message
John A Meinel (jameinel) wrote :

When you say 'a lot longer', do you have numbers? For example, is it now
1 min to start each machine, so a deployment takes 50 min, where before we
would have kicked off each machine in ~10 s and been done in <10 min?

John
=:->

Revision history for this message
James Page (james-page) wrote :

I'll have to quantify 'a lot' better, but that's relatively easy to do.

Ryan Beisner (1chb1n)
tags: added: openstack-provider usability
Revision history for this message
Ryan Beisner (1chb1n) wrote :

@jam - Side question: Does the MAAS provider bring up machines in a serial fashion? It seems like not, but I'm not certain.

Revision history for this message
Tim Penhey (thumper) wrote :

@1chb1n all providers bring up machines in the same way from Juju's point of view.

They are all started in serial.

We have some work that has bubbled to the top with regard to refactoring the juju provisioner code, which would parallelise some of those tasks, but we aren't going to do that for 2.2.

Changed in juju:
status: New → Incomplete
Revision history for this message
James Page (james-page) wrote :

I did a quick comparison using three unit amulet functional tests from the percona-cluster charm; under 1.25.10:

DEBUG:runner:2017-03-02 13:06:51 Deployment complete in 491.09 seconds

under 2.2-beta4:

DEBUG:runner:2017-05-23 09:50:49 Deployment complete in 752.88 seconds

so a total time difference of about +262 seconds (752.88 - 491.09 = 261.79) to get to the point where the model is considered deployed (all units in active state).
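
For anyone repeating this comparison, the arithmetic can be checked with a short Python sketch. The two log lines are copied from this comment; the helper name `deployment_seconds` and the regular expression are assumptions made for illustration, not part of the test runner.

```python
import re

# Log lines quoted in the comment above, keyed by Juju version.
LOG_LINES = {
    "1.25.10": "DEBUG:runner:2017-03-02 13:06:51 Deployment complete in 491.09 seconds",
    "2.2-beta4": "DEBUG:runner:2017-05-23 09:50:49 Deployment complete in 752.88 seconds",
}


def deployment_seconds(line):
    """Extract the deployment duration in seconds from a runner log line."""
    match = re.search(r"Deployment complete in ([\d.]+) seconds", line)
    if match is None:
        raise ValueError(f"no deployment time found in: {line!r}")
    return float(match.group(1))


old = deployment_seconds(LOG_LINES["1.25.10"])
new = deployment_seconds(LOG_LINES["2.2-beta4"])
diff = new - old
increase_pct = 100 * diff / old
print(f"{diff:.2f} s slower ({increase_pct:.1f}% increase)")
```

This prints a difference of 261.79 seconds, an increase of about 53.3% over the 1.25.10 run.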

Ryan Beisner (1chb1n)
Changed in juju:
status: Incomplete → New
Revision history for this message
Ryan Beisner (1chb1n) wrote :

With 2.2b4, there is about a 50% increase in the time it takes to deploy our models on the Juju OpenStack provider. These are models which deploy and tear down many times per day and per week in order to deliver test results in the OpenStack Charms CI gate.

This is not observed at 2.1.3 or 1.25.10.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

The impact might seem low, with just a few extra minutes, but with the velocity that we have in our CI, all driven by Juju, this increase will be quite impactful to developers and other users of the CI system.

Changed in charm-test-infra:
status: New → Confirmed
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Background:

As of 2.2-beta4, the OpenStack provider waits for each instance to spin up successfully instead of assuming that all instances will do so. The code had attempted this before; however, it wasn't correctly checking for failures in the deployment of an instance. The fix resolved bug 1425808.
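
The wait-and-check behaviour described in this background note can be sketched roughly as follows. This is a hedged Python illustration, not the provider's actual code (Juju is written in Go). The names `wait_for_active`, `ProvisioningError`, `get_status`, and `max_attempts` are hypothetical. The BUILD and ACTIVE statuses and the 10-second retry delay come from the status output in the bug description; treating ERROR as the failure status is an assumption.

```python
import time


class ProvisioningError(Exception):
    """Raised when an instance fails to build or never becomes ACTIVE."""


def wait_for_active(get_status, instance_id, retry_delay=10,
                    max_attempts=30, sleep=time.sleep):
    """Poll an instance until it reports ACTIVE.

    get_status: caller-supplied function mapping an instance id to a
        status string (hypothetical; stands in for a cloud API call).
    Returns the number of polls it took to see ACTIVE.
    """
    for attempt in range(1, max_attempts + 1):
        status = get_status(instance_id)
        if status == "ACTIVE":
            return attempt
        if status == "ERROR":
            # Checking for a failed build instead of assuming success.
            raise ProvisioningError(f"instance {instance_id} failed to build")
        # Any other status (for example BUILD): wait, then retry.
        sleep(retry_delay)
    raise ProvisioningError(
        f"instance {instance_id} not ACTIVE after {max_attempts} attempts")


# Demo with a fake status source and a fake sleep that records delays.
statuses = iter(["BUILD", "BUILD", "ACTIVE"])
delays = []
attempts = wait_for_active(lambda _id: next(statuses),
                           "82e35d37-2d1b-42a8-8f42-947902d76352",
                           sleep=delays.append)
print(attempts, delays)
```

Because each machine is provisioned in turn, any time spent in this loop adds directly to the total deployment time, which matches the slowdown reported in this bug.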

Revision history for this message
Ashley Lai (alai) wrote :

I did a quick test deploying a simple OpenStack bundle on the same MAAS. Below is the time each deployment took to complete.

juju 2.1.3-xenial-amd64:
  37m23.062s

juju 2.2-beta4-xenial-amd64:
  39m20.475s

Revision history for this message
Ryan Beisner (1chb1n) wrote :

@alai - good to know the Juju MAAS provider timing isn't affected much.

This bug is about the Juju OpenStack provider. Do you have a system to exercise that to try to confirm?

Revision history for this message
Ashley Lai (alai) wrote :

@beisner - sorry I missed that. I can set that up and test it.

Revision history for this message
Ashley Lai (alai) wrote :

Comment #10 explained the performance issue. Bootstrapping on the OpenStack provider failed for me; see the bug below.

https://bugs.launchpad.net/juju/+bug/1696487

Revision history for this message
Anastasia (anastasia-macmood) wrote :

As per comment #6, we will aim to address this as part of reworking the provisioner code.

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
tags: added: provisioner-rework
Felipe Reyes (freyes)
tags: added: sts
Revision history for this message
Ryan Beisner (1chb1n) wrote :

As a temporary workaround in charm-test-infra, we're looking into increasing timeout thresholds to address failures such as those observed in these reviews:

https://review.openstack.org/#/c/479440/

https://review.openstack.org/#/c/476151/

Changed in juju:
status: Triaged → In Progress
assignee: nobody → Heather Lanigan (hmlanigan)
Ryan Beisner (1chb1n)
Changed in charm-test-infra:
status: Confirmed → Opinion
Revision history for this message
Heather Lanigan (hmlanigan) wrote :

Initial PR: https://github.com/juju/juju/pull/7838
Additional PRs will be forthcoming with refinements.

Time to deploy 16 machines in an OpenStack cloud with juju:

juju 2.2-beta3   8m21s  <-- before fix for bug 1425808
juju 2.2.5      12m40s  <-- current
juju 2.3-beta2   4m52s  <-- with parallel provisioning

16 machines are deployed with the openstack-novalxd bundle.

Improvements with other clouds:

AWS:
juju 2.2.5 4m36s
juju 2.3-beta2 3m17s

LXD:
juju 2.2.5 3m57s
juju 2.3-beta2 2m57s

Google:
juju 2.2.5 5m21s
juju 2.3-beta2 2m10s
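
The relative improvement for each cloud can be worked out from the timings above with a short Python sketch. The durations are copied from this comment; the helper names `to_seconds` and `speedup` are illustrative assumptions.

```python
import re

# (juju 2.2.5, juju 2.3-beta2) timings quoted in the comment above.
TIMINGS = {
    "OpenStack": ("12m40s", "4m52s"),
    "AWS": ("4m36s", "3m17s"),
    "LXD": ("3m57s", "2m57s"),
    "Google": ("5m21s", "2m10s"),
}


def to_seconds(text):
    """Convert a duration such as '12m40s' to whole seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def speedup(before, after):
    """Ratio of the old deployment time to the new one."""
    return to_seconds(before) / to_seconds(after)


for cloud, (before, after) in TIMINGS.items():
    print(f"{cloud}: {speedup(before, after):.2f}x faster")
```

This gives roughly 2.60x for OpenStack, 1.40x for AWS, 1.34x for LXD, and 2.47x for Google. The same helper shows that 2.2.5 took about 1.52 times as long as 2.2-beta3 on OpenStack (760 s against 501 s), which is consistent with the roughly 50% slowdown reported earlier in this bug.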

Changed in juju:
status: In Progress → Fix Committed
milestone: none → 2.3-beta2
Changed in juju:
status: Fix Committed → Fix Released