functional tests are taking too long and cause false failures

Bug #1433555 reported by Angus Salkeld
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Heat
Fix Released
Critical
Steve Baker

Bug Description

We need to investigate if we can logically test these better (without varying so much timewise).

http://logs.openstack.org/36/116336/36/gate/check-heat-dsvm-functional-mysql/02df233/console.html
http://logs.openstack.org/36/116336/38/check/check-heat-dsvm-functional-mysql/b9c991d/console.html

heat_integrationtests.scenario.test_neutron_loadbalancer.NeutronLoadBalancerTest.test_neutron_loadbalancer 332.804
2015-03-18 10:03:36.421 | 2015-03-18 10:03:35.022 | heat_integrationtests.functional.test_resource_group.ResourceGroupTest.test_update_increase_decrease_count 268.368
2015-03-18 10:03:36.422 | 2015-03-18 10:03:35.024 | heat_integrationtests.scenario.test_server_cfn_init.CfnInitIntegrationTest.test_server_cfn_init 241.258
2015-03-18 10:03:36.423 | 2015-03-18 10:03:35.026 | heat_integrationtests.functional.test_instance_group.InstanceGroupUpdatePolicyTest.test_instance_group_update_replace_huge_batch_size 205.132
2015-03-18 10:03:36.423 | 2015-03-18 10:03:35.028 | heat_integrationtests.functional.test_instance_group.InstanceGroupUpdatePolicyTest.test_instance_group_update_replace_huge_min_in_service 93.108
2015-03-18 10:03:36.423 | 2015-03-18 10:03:35.030 | heat_integrationtests.functional.test_instance_group.InstanceGroupUpdatePolicyTest.test_instance_group_update_no_replace 192.948
2015-03-18 10:03:36.424 | 2015-03-18 10:03:35.032 | heat_integrationtests.functional.test_instance_group.InstanceGroupUpdatePolicyTest.test_instance_group_update_replace_with_adjusted_capacity 160.822
2015-03-18 10:03:36.424 | 2015-03-18 10:03:35.033 | heat_integrationtests.functional.test_autoscaling.AutoscalingGroupUpdatePolicyTest.test_instance_group_update_replace_huge_min_in_service 149.795
2015-03-18 10:03:41.865 | 2015-03-18 10:03:35.035 | heat_integrationtests.functional.test_heat_autoscaling.HeatAutoscalingTest.test_path_attrs_nested 147.330
2015-03-18 10:03:41.865 | 2015-03-18 10:03:35.037 | heat_integrationtests.functional.test_autoscaling.AutoscalingGroupUpdatePolicyTest.test_instance_group_update_no_replace 133.241

Angus Salkeld (asalkeld)
Changed in heat:
importance: Undecided → High
status: New → Triaged
milestone: none → kilo-rc1
description: updated
Changed in heat:
assignee: nobody → Kanagaraj Manickam (kanagaraj-manickam)
Changed in heat:
assignee: Kanagaraj Manickam (kanagaraj-manickam) → Angus Salkeld (asalkeld)
status: Triaged → In Progress
Zane Bitter (zaneb)
Changed in heat:
importance: High → Critical
Revision history for this message
Zane Bitter (zaneb) wrote :

Seeing a lot of timeouts on this in the gate. Given that this was initially reported against the RPC nested stacks patch, it seems like there is a good chance that is what has slowed things down to the point where timeouts have become an issue. It's not surprising that this would cause extra load (since we have only a single thread, and it is designed to make things scale *out* more efficiently), but I suspect a comparison with previous runtimes may reveal it to be disproportionate.

The slowest tests from a recent failed run were:

  test_update_nochange_resource_needs_update [1049.335623s]
  AutoScalingSignalTest.test_signal_with_policy_update [832.644550s]
  all of AutoscalingGroupUpdatePolicyTest
  test_instance_group_update_no_replace [698.874717s]
  test_instance_group_update_replace_huge_min_in_service [508.550756s]

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/165944

Revision history for this message
Steve Baker (steve-stevebaker) wrote : Re: [Bug 1433555] Re: functional tests are taking too long and cause false failures

On 20/03/15 06:56, Zane Bitter wrote:
> The slowest tests from a recent failed run were:
>
> test_update_nochange_resource_needs_update [1049.335623s]
> AutoScalingSignalTest.test_signal_with_policy_update [832.644550s]
> all of AutoscalingGroupUpdatePolicyTest
> test_instance_group_update_no_replace [698.874717s]
> test_instance_group_update_replace_huge_min_in_service [508.550756s]
>

I'm wondering if the RPC nested stacks patch has added a
seldom-triggered regression.

Running the above tests locally I was seeing the nested stack remain
IN_PROGRESS for a very long time even though all its resources were
COMPLETE. When I ran the tests repeatedly while running resource-list it
heisenbugged and stopped showing that behavior, so I couldn't investigate.

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

Here is an odd 3 minute delay of the JobServerGroup going to CREATE_COMPLETE

Sometimes its instant. I'm fairly sure all of JobServerGroup's resources are CREATE_COMPLETE when this happens, but I'll try to get a capture of proof.

$ heat event-list a30c2247-9e87-4337-b236-13a83d74b5cc
+-----------------+--------------------------------------+----------------------------+--------------------+----------------------+
| resource_name | id | resource_status_reason | resource_status | event_time |
+-----------------+--------------------------------------+----------------------------+--------------------+----------------------+
| JobServerConfig | 9fcdcc2d-0c22-4c57-a857-4650416b822e | state changed | CREATE_IN_PROGRESS | 2015-03-19T20:13:50Z |
| JobServerConfig | 222e0bf9-9dbe-43e8-8312-55ee43060b68 | state changed | CREATE_COMPLETE | 2015-03-19T20:13:51Z |
| JobServerGroup | 200f192a-a027-47a1-b9e4-bb3418673e9e | state changed | CREATE_IN_PROGRESS | 2015-03-19T20:13:51Z |
| JobServerGroup | 58dfc6c2-002b-45e4-8204-fbd97c57532d | state changed | CREATE_COMPLETE | 2015-03-19T20:16:52Z |

Revision history for this message
Angus Salkeld (asalkeld) wrote :

Maybe the new oslo objects are not doing refresh() correctly. I noticed yesterday that they are not calling through to the model refresh().

Revision history for this message
Zane Bitter (zaneb) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/166030

Changed in heat:
assignee: Angus Salkeld (asalkeld) → Steve Baker (steve-stevebaker)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/166030
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=9bb663d81b44cdc08cf994e98b7f4bb11b2dc1b6
Submitter: Jenkins
Branch: master

commit 9bb663d81b44cdc08cf994e98b7f4bb11b2dc1b6
Author: Steve Baker <email address hidden>
Date: Fri Mar 20 11:33:03 2015 +1300

    Call db refresh in Stack object refresh

    The current Stack object refresh doesn't actually do any refreshing,
    which explains the functional test speed regression since the RPC
    nested stack code relies on stack.refresh() to poll for state change.

    With this change AutoScalingSignalTest.test_signal_with_policy_update
    locally takes ~60s consistently. Without this change ~50% of test runs
    have some form of execution delay, taking 80s -> 300s+.

    Change-Id: Ie06851fd5b0d8b802ebf701e8bf621562d9b6d88
    Closes-Bug: #1433555

Changed in heat:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in heat:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in heat:
milestone: kilo-rc1 → 2015.1.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.