convergence: creating large resource groups (500+) is slow, then fails

Bug #1637486 reported by Steve Baker
Affects: OpenStack Heat
Status: Expired
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

Consider the following test template:
heat_template_version: 2016-10-14

parameters:
  count:
    default: 2
    type: number

resources:
  random_group:
    type: OS::Heat::ResourceGroup
    properties:
      count: {get_param: count}
      resource_def:
        type: OS::Heat::TestResource
        properties:
          action_wait_secs:
            create: 1
            delete: 1

For count=1600 on the legacy engine, creating the stack takes 90 seconds, and create time grows linearly with count.

For count=500 on the convergence engine, creating the stack takes 90 seconds. A count higher than 500 generally results in the stack timing out, sometimes failing early.

Previously I've observed contention when updating Stack.current_deps, so this could be the root cause. According to Zane, current_deps should only be written once if the implementation is being faithful to the POC: https://github.com/zaneb/heat-convergence-prototype/blob/master/converge/stack.py#L234

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

Here's a chart generated from the data in this spreadsheet; the second table is from the template in the description.

https://docs.google.com/a/redhat.com/spreadsheets/d/1VBWBza5NNtxR_49d7tBJhxErNRCix9ny_WTbuv07azI/edit?usp=sharing

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

The timings for the above chart were taken with the following commands:

  time openstack stack create bg -t big_group_simple.yaml --wait --parameter count=500
  time openstack stack delete --wait --yes bg
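
(For sweeping several counts in a single run, a loop along the following lines should work. This is only an illustrative sketch; it assumes the template from the description is saved as big_group_simple.yaml and the openstack CLI session is already authenticated.)

  for count in 500 1000 1900; do
    echo "count=${count}"
    time openstack stack create bg -t big_group_simple.yaml --wait --parameter count=${count}
    time openstack stack delete --wait --yes bg
  done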

Rico Lin (rico-lin)
Changed in heat:
importance: High → Medium
assignee: Anant Patil (ananta) → nobody
Revision history for this message
Zane Bitter (zaneb) wrote :

Needs investigation to confirm that this is still the case, but I'd consider this high-priority if it is.

Changed in heat:
milestone: none → queens-1
importance: Medium → High
Rico Lin (rico-lin)
Changed in heat:
milestone: queens-1 → queens-2
Revision history for this message
Rico Lin (rico-lin) wrote :

With a devstack env, I'm able to run with the following results:

count=500 3.74s user 0.18s system 0% cpu 8:39.94 total
count=1000 5.95s user 0.31s system 0% cpu 17:13.74 total
count=1900 9.53s user 0.50s system 0% cpu 32:45.36 total

Revision history for this message
Rico Lin (rico-lin) wrote :

On a single-node devstack in convergence mode.

Revision history for this message
Zane Bitter (zaneb) wrote :

So that's more than 5 times slower (yikes!) than even Steve's original tests. OTOH it looks like you're able to create 1900 resources, so I guess the timeout is gone. I'm not sure what to conclude here.

I think the real test is going to be to compare convergence against legacy. This takes differences in the underlying hardware largely out of play. Is it still 3 times slower or not?
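
(For reference, the engine is selected by the convergence_engine option in heat.conf, so a rough comparison procedure would look like the sketch below. The devstack systemd unit name is an assumption and may differ in other deployments.)

  # 1. In /etc/heat/heat.conf, under [DEFAULT], set:
  #      convergence_engine = false    # legacy engine; true selects convergence
  # 2. Restart heat-engine so the setting takes effect (devstack unit name assumed):
  sudo systemctl restart devstack@h-eng
  # 3. Run the identical timing in each mode and compare wall-clock times:
  time openstack stack create bg -t big_group_simple.yaml --wait --parameter count=500
  time openstack stack delete --wait --yes bg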

Revision history for this message
Zane Bitter (zaneb) wrote :

> Previously I've observed contention of updating Stack.current_deps, so this could be the root cause.
> According to Zane, current_deps should only be written to once if it is being faithful to the POC

AFAICT we do only write them in one place (when the initial stack create/update is started), and that appears to have been the case in Newton as well - in fact going all the way back to http://git.openstack.org/cgit/openstack/heat/commit/?id=5189bbebabbbe2c672c8bb492c496e5780892d01

So I'm curious how we could be getting contention there.
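
(As a quick sanity check, one could grep the tree for references to current_deps and inspect which of them actually write it; a rough sketch, assuming a devstack-style checkout under /opt/stack/heat:)

  git -C /opt/stack/heat grep -n "current_deps" -- heat/engine heat/objects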

Rico Lin (rico-lin)
Changed in heat:
milestone: queens-2 → queens-3
Revision history for this message
Rico Lin (rico-lin) wrote :

Moving to status Incomplete unless we can get more information on this.

Changed in heat:
status: New → Incomplete
milestone: queens-3 → none
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Heat because there has been no activity for 60 days.]

Changed in heat:
status: Incomplete → Expired