OSWaitCondition test timeout

Bug #1738653 reported by Rico Lin
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Heat
In Progress
Medium
Rabi Mishra
heat-tempest-plugin
Fix Released
High
Zane Bitter

Bug Description

OSWaitCondition failed with the following message

http://logs.openstack.org/68/528568/1/check/heat-functional-orig-mysql-lbaasv2/e969bea/job-output.txt.gz#_2017-12-17_16_58_01_563687

2017-12-17 16:58:01.563687 | primary | 2017-12-17 16:58:01.563 | heat_integrationtests.functional.test_os_wait_condition.OSWaitCondition.test_create_stack_with_multi_signal_waitcondition
2017-12-17 16:58:01.565477 | primary | 2017-12-17 16:58:01.565 | -------------------------------------------------------------------------------------------------------------------------
2017-12-17 16:58:01.567376 | primary | 2017-12-17 16:58:01.566 |
2017-12-17 16:58:01.580484 | primary | 2017-12-17 16:58:01.580 | Captured traceback:
2017-12-17 16:58:01.582051 | primary | 2017-12-17 16:58:01.581 | ~~~~~~~~~~~~~~~~~~~
2017-12-17 16:58:01.583672 | primary | 2017-12-17 16:58:01.583 | Traceback (most recent call last):
2017-12-17 16:58:01.585671 | primary | 2017-12-17 16:58:01.585 | File "heat_integrationtests/functional/test_os_wait_condition.py", line 107, in test_create_stack_with_multi_signal_waitcondition
2017-12-17 16:58:01.587494 | primary | 2017-12-17 16:58:01.586 | self.stack_create(template=self.template, parameters=params)
2017-12-17 16:58:01.588978 | primary | 2017-12-17 16:58:01.588 | File "heat_integrationtests/common/test.py", line 621, in stack_create
2017-12-17 16:58:01.590387 | primary | 2017-12-17 16:58:01.589 | self._wait_for_stack_status(**kwargs)
2017-12-17 16:58:01.591814 | primary | 2017-12-17 16:58:01.591 | File "heat_integrationtests/common/test.py", line 374, in _wait_for_stack_status
2017-12-17 16:58:01.593562 | primary | 2017-12-17 16:58:01.593 | fail_regexp, is_action_cancelled):
2017-12-17 16:58:01.595718 | primary | 2017-12-17 16:58:01.594 | File "heat_integrationtests/common/test.py", line 337, in _verify_status
2017-12-17 16:58:01.597327 | primary | 2017-12-17 16:58:01.596 | stack_status_reason=stack.stack_status_reason)
2017-12-17 16:58:01.598884 | primary | 2017-12-17 16:58:01.598 | heat_integrationtests.common.exceptions.StackBuildErrorException: Stack OSWaitCondition-1697744023/48f388e7-8deb-46bf-a1cb-d228015b7422 is in CREATE_FAILED status due to 'Resource CREATE failed: WaitConditionTimeout: resources.wait_condition: 4 of 25 received - Signal 1 received;signal2;Signal 5 received;signal4'

Rico Lin (rico-lin)
description: updated
Rico Lin (rico-lin)
Changed in heat:
milestone: queens-3 → queens-rc1
Rabi Mishra (rabi)
Changed in heat:
milestone: queens-rc1 → rocky-1
Revision history for this message
Rico Lin (rico-lin) wrote :
Changed in heat:
assignee: Rico Lin (rico-lin) → Rabi Mishra (rabi)
milestone: rocky-1 → queens-rc1
status: Triaged → In Progress
Rico Lin (rico-lin)
Changed in heat:
milestone: queens-rc1 → rocky-1
Zane Bitter (zaneb)
tags: added: gate-failure
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/543064

Revision history for this message
Zane Bitter (zaneb) wrote :
Changed in heat-tempest-plugin:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Zane Bitter (zaneb)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat-tempest-plugin (master)

Change abandoned by Rabi Mishra (<email address hidden>) on branch: master
Review: https://review.openstack.org/531079

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/543064
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=e4d2452f2f669e8c709cda5043c9ab67d40a4f2f
Submitter: Zuul
Branch: master

commit e4d2452f2f669e8c709cda5043c9ab67d40a4f2f
Author: Zane Bitter <email address hidden>
Date: Fri Feb 9 20:20:22 2018 -0500

    Always use string ID for WaitConditionHandle signals

    Using an integer as the ID - which is what we currently do for
    OS::Heat::WaitConditionHandles without an ID explicitly supplied by the
    signaller - means that we can implicitly create an entry that will collide
    with an explicitly specified string ID when we serialise to and then
    deserialise from JSON to store the metadata in the database. These types of
    collisions are silent and avoid being logged (despite the fact that we
    attempt to catch and log collisions), which makes any issues very, very
    difficult to debug.

    Change-Id: Ia613e79dc9757c4594b84ed2105bfc1357d1e037
    Related-Bug: #1738653

Revision history for this message
Zane Bitter (zaneb) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat-tempest-plugin (master)

Reviewed: https://review.openstack.org/543063
Committed: https://git.openstack.org/cgit/openstack/heat-tempest-plugin/commit/?id=2cff12bceb4b568cd8673c9ffa5668d37fcc9da9
Submitter: Zuul
Branch: master

commit 2cff12bceb4b568cd8673c9ffa5668d37fcc9da9
Author: Zane Bitter <email address hidden>
Date: Fri Feb 9 19:57:44 2018 -0500

    Avoid race in OSWaitCondition test

    While sending a signal to a WaitCondition is synchronous, the actual update
    of the WaitConditionHandle metadata happens asynchronously since the fix
    for bug 1394095. As a result, it's not guaranteed that even the first 6
    signals (which are sent in serially, as opposed to the later ones which are
    deliberately sent in parallel) will be stored in the same order that they
    are sent.

    Crucially, that means that one or more of the signals explicitly sent with
    id 5 may arrive when there have been only three previous signals stored.
    This means that the next signal to arrive with an implicit ID will be the
    fifth signal stored, and therefore also get id 5. Of course we have a log
    message to indicate when an existing signal is overwritten by another with
    the same ID, and we are not seeing it except in the intended case where we
    explicitly send in the same ID twice. That's because the keys have
    different types in the data dict - the explicitly specified ID is the
    string "5", but the implicitly calculated one is the integer 5. But - get
    this - when we serialise the data to JSON both keys are serialised to the
    string "5", and upon deserialisation they collide and one is silently
    dropped on the floor.

    So if the signal with the explicit ID "5" is stored just before the one
    with reason "signal 4", then "signal 4" will effectively be silently
    ignored as the 5th signal to arrive - a slot already filled. And since that
    signal is ignored, the next signal will also be treated as the 5th to
    arrive and ignored, and so on. This leads inexorably to the dreaded
    "WaitConditionTimeout: resources.wait_condition: 4 of 25 received" error.

    For this reason, it's a bad idea to mix explicit IDs that are also integers
    with implicitly assigned IDs. Use an ID that won't collide instead.

    Change-Id: I507b43dba8dcea87d3e0c179f7ca6b34b2b31a12
    Closes-Bug: #1738653

Changed in heat-tempest-plugin:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/558826

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/pike)

Reviewed: https://review.openstack.org/558826
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=3ac0bbedccb50e4e5b6bad663a872985d908e82a
Submitter: Zuul
Branch: stable/pike

commit 3ac0bbedccb50e4e5b6bad663a872985d908e82a
Author: Zane Bitter <email address hidden>
Date: Wed Apr 4 09:51:12 2018 -0400

    Avoid race in OSWaitCondition test

    While sending a signal to a WaitCondition is synchronous, the actual update
    of the WaitConditionHandle metadata happens asynchronously since the fix
    for bug 1394095. As a result, it's not guaranteed that even the first 6
    signals (which are sent in serially, as opposed to the later ones which are
    deliberately sent in parallel) will be stored in the same order that they
    are sent.

    Crucially, that means that one or more of the signals explicitly sent with
    id 5 may arrive when there have been only three previous signals stored.
    This means that the next signal to arrive with an implicit ID will be the
    fifth signal stored, and therefore also get id 5. Of course we have a log
    message to indicate when an existing signal is overwritten by another with
    the same ID, and we are not seeing it except in the intended case where we
    explicitly send in the same ID twice. That's because the keys have
    different types in the data dict - the explicitly specified ID is the
    string "5", but the implicitly calculated one is the integer 5. But - get
    this - when we serialise the data to JSON both keys are serialised to the
    string "5", and upon deserialisation they collide and one is silently
    dropped on the floor.

    So if the signal with the explicit ID "5" is stored just before the one
    with reason "signal 4", then "signal 4" will effectively be silently
    ignored as the 5th signal to arrive - a slot already filled. And since that
    signal is ignored, the next signal will also be treated as the 5th to
    arrive and ignored, and so on. This leads inexorably to the dreaded
    "WaitConditionTimeout: resources.wait_condition: 4 of 25 received" error.

    For this reason, it's a bad idea to mix explicit IDs that are also integers
    with implicitly assigned IDs. Use an ID that won't collide instead.

    This patch is backported from
    https://git.openstack.org/cgit/openstack/heat-tempest-plugin/commit/?id=2cff12bceb4b568cd8673c9ffa5668d37fcc9da9

    Change-Id: Ie2608285ba9c0ec3f1e4a8bbf1a147ce35ccae00
    Depends-On: https://review.openstack.org/550682
    Closes-Bug: #1738653

tags: added: in-stable-pike
Rico Lin (rico-lin)
Changed in heat:
milestone: rocky-1 → rocky-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 9.0.4

This issue was fixed in the openstack/heat 9.0.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat-tempest-plugin 0.2.0

This issue was fixed in the openstack/heat-tempest-plugin 0.2.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.