Creation of stack with WaitCondition/WaitHandle resources periodically fails because of concurrent transactions

Bug #1497274 reported by Anastasia Kuznetsova
This bug affects 5 people
Affects: OpenStack Heat | Status: Fix Released | Importance: High | Assigned to: Sergey Kraynev

Bug Description

Steps to reproduce:
1. Create a stack using the following template:
heat_template_version: 2013-05-23
parameters:
  image:
    type: string
    description: Name of image to use for server
  flavor:
    type: string
    description: Flavor to use for server
  timeout:
    type: number
    description: Timeout for WaitCondition, depends on your image and environment
  network:
    type: string
resources:
  wait_condition:
    type: OS::Heat::WaitCondition
    depends_on: instance1
    properties:
      handle: {get_resource: wait_handle}
      count: 5
      timeout: {get_param: timeout}
  wait_handle:
    type: OS::Heat::WaitConditionHandle
  instance1:
    type: OS::Nova::Server
    properties:
      image: {get_param: image}
      flavor: {get_param: flavor}
      networks: [{network: {get_param: network}}]
      user_data_format: RAW
      user_data:
        str_replace:
          template: |
            #!/bin/sh
            wc_notify --data-binary '{"status": "SUCCESS"}'
            wc_notify --data-binary '{"status": "SUCCESS", "reason": "signal2"}'
            wc_notify --data-binary '{"status": "SUCCESS", "reason": "signal3", "data": "data3"}'
            wc_notify --data-binary '{"status": "SUCCESS", "reason": "signal4", "data": "data4"}'
            wc_notify --data-binary '{"status": "SUCCESS", "id": "5"}'
            wc_notify --data-binary '{"status": "SUCCESS", "id": "5"}'
          params:
            wc_notify: { get_attr: ['wait_handle', 'curl_cli'] }
outputs:
  curl_cli:
    value: { get_attr: ['wait_handle', 'curl_cli'] }
  wc_data:
    value: { get_attr: ['wait_condition', 'data'] }
2. Wait for stack creation to complete.

Observed result:
Stack creation fails because not all signals were received.
E.g.:
test_stack 16 minutes Create Failed Resource CREATE failed: WaitConditionTimeout: resources.wait_condition: 4 of 5 received - Signal 1 received;signal3;signal2;Signal 5 received
wait_condition - 16 minutes Create Failed WaitConditionTimeout: resources.wait_condition: 4 of 5 received - Signal 1 received;signal3;signal2;Signal 5 received
wait_handle 21 minutes Signal Complete Signal: status:SUCCESS reason:Signal 5 received
wait_handle 21 minutes Signal Complete Signal: status:SUCCESS reason:signal4
wait_handle 21 minutes Signal Complete Signal: status:SUCCESS reason:Signal 4 received
wait_handle 21 minutes Signal Complete Signal: status:SUCCESS reason:signal3
wait_handle 21 minutes Signal Complete Signal: status:SUCCESS reason:signal2
wait_handle 21 minutes Signal Complete Signal: status:SUCCESS reason:Signal 1 received

Changed in heat:
importance: Undecided → High
status: New → Triaged
milestone: none → next
tags: added: kilo-backport-potential
Changed in heat:
assignee: nobody → Peter Razumovsky (prazumovsky)
Revision history for this message
Steven Hardy (shardy) wrote :

It'd be interesting to see some analysis/logs around the "concurrent transactions" part of this. The report above is quite confusing: we see 5 "Signal N received" events at 21 minutes, but the stack failed with a timeout at 16 minutes, with only 4 of 5 received?

Revision history for this message
Sergey Kraynev (skraynev) wrote :

Steven: I reproduced it on our local deployment. However, I cannot get the same issue on my devstack, so I will try to reproduce it and then post the results here.
What kind of 'concurrent transactions' do you mean?
I think one of the best confirmations is the event log, which reports the wrong number of received signals.

On devstack it should look like:

http://paste.openstack.org/show/474592/

P.S.: and I see it on my devstack installation.

On the real deployment I see different event logs:

http://paste.openstack.org/show/474661/

So it looks like all the signals started being processed at the same time, because they have similar reasons (except the signals with predefined reasons in the request body).
Then, when handle_signal calls the parent method, it refreshes the metadata (which may differ from the metadata at the moment when we set the number in the signal reason via len(rsrc_metadata)). The signals are then written in parallel, each holding a different stale copy of the old metadata. For example:
1. Signal A gets number 1; its old data equals {}.
     Signal B also gets number 1 (because it arrived at the same time, before A was written); its old data also equals {}.
2. Signal A is added to the metadata, which now looks like {signal: A}.
3. After a short delay, the write of signal B's data starts. As a result the metadata equals {signal: B}, because the first value is overwritten.
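
The lost-update sequence above can be sketched as a minimal simulation (illustrative Python, not Heat code):

```python
def compute_update(old_metadata):
    """Number the signal from the metadata the caller read (the buggy step)."""
    reason = "Signal %d received" % (len(old_metadata) + 1)
    new = dict(old_metadata)
    new[reason] = {"status": "SUCCESS"}
    return new

db_metadata = {}

# Interleaved execution: both signals read before either writes.
read_a = dict(db_metadata)
read_b = dict(db_metadata)

db_metadata = compute_update(read_a)  # writes {"Signal 1 received": ...}
db_metadata = compute_update(read_b)  # also "Signal 1 received": overwrites A

# Two signals were sent, but only one is recorded.
assert list(db_metadata) == ["Signal 1 received"]
```

Because both writers number themselves from the same stale snapshot, the second write silently replaces the first, and the WaitCondition never reaches its expected count.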

description: updated
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

I can definitely see that concurrent signals would have race issues and would need to be changed to use resource locking like this [1]. Also, we may need a more general solution such as this fix [2].

[1] https://review.openstack.org/#/c/208778/2/heat/engine/service_software_config.py,cm
[2] https://review.openstack.org/#/c/216920/
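
The resource locking referenced in [1] is essentially optimistic concurrency control: each write succeeds only if nobody else has updated the row since it was read. A minimal sketch with an illustrative atomic_key column (the table layout here is an assumption for illustration, not Heat's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource (id INTEGER, atomic_key INTEGER, metadata TEXT)")
conn.execute("INSERT INTO resource VALUES (1, 0, '{}')")

def try_update(conn, res_id, old_key, new_metadata):
    """Compare-and-swap: succeeds only if nobody else bumped atomic_key."""
    cur = conn.execute(
        "UPDATE resource SET metadata = ?, atomic_key = atomic_key + 1 "
        "WHERE id = ? AND atomic_key = ?",
        (new_metadata, res_id, old_key))
    return cur.rowcount == 1  # 0 rows means a concurrent writer won; re-read and retry

# First writer wins and advances atomic_key from 0 to 1.
assert try_update(conn, 1, 0, '{"Signal 1 received": "SUCCESS"}')
# Second writer still holds the stale key 0, so its update matches no rows.
assert not try_update(conn, 1, 0, '{"Signal 1 received": "SUCCESS"}')
```

The losing writer then re-reads the fresh metadata, recomputes its update, and tries again, which is exactly what the eventual fix does.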

Revision history for this message
Sergey Kraynev (skraynev) wrote :

Steve Baker, thank you very much for the feedback. I used one of these patches as an example and it works.

Steven Hardy: good news: I have reproduced this error on devstack. Bad news: it's very, very rare. My bash script with a create/delete scenario ran for more than 10 minutes (1 failure and 17 successes) before I got it.

I have prepared a patch and a test for it. Will upload soon.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/232124

Changed in heat:
assignee: Peter Razumovsky (prazumovsky) → Sergey Kraynev (skraynev)
status: Triaged → In Progress
Changed in heat:
milestone: next → mitaka-3
Changed in heat:
milestone: mitaka-3 → mitaka-rc1
Changed in heat:
assignee: Sergey Kraynev (skraynev) → Zane Bitter (zaneb)
Zane Bitter (zaneb)
Changed in heat:
assignee: Zane Bitter (zaneb) → Sergey Kraynev (skraynev)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/292268

Changed in heat:
milestone: mitaka-rc1 → mitaka-rc2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/292268
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=82b744042e39529eaca6660704ad4923c2442be9
Submitter: Jenkins
Branch: master

commit 82b744042e39529eaca6660704ad4923c2442be9
Author: Sergey Kraynev <email address hidden>
Date: Mon Mar 14 03:57:15 2016 -0400

    Handling concurrent transactions in metadata_set method

    The following changes are in this patch:
     - Use the ConcurrentTransactions exception to handle concurrent
       transactions while writing metadata.
     - The @retry_on_conflict wrapper was applied to the metadata_set
       method to allow retrying in the event of a race. The same wrapper
       was added to the _push_metadata_software_deployments method.
     - Added a new parameter to the metadata_set method: merge_metadata.
       When a RetryRequest exception is raised, oslo_db_api.wrap_db_retry
       re-calls the metadata_set method, and in this case we need to
       refresh the old metadata. This is mostly needed for signals
       without data and id. For example:
         Signals A and B arrive at the same moment and both get number 1,
       because the metadata was empty. Then, during the database write, a
       RetryRequest exception is raised for signal B. That signal's
       metadata stores the old number, 1, so we must recalculate the
       value from the new metadata length and set number 2.

    Change-Id: I1ddbad7cde3036cfa9310c670609fcde607ffcac
    Co-Authored-By: Zane Bitter <email address hidden>
    Partially-Bug: #1497274
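
The retry-with-merge behaviour this commit message describes can be sketched as follows (names like Conflict and FakeDB are illustrative stand-ins, not the real Heat or oslo.db API):

```python
class Conflict(Exception):
    """Stands in for the database's concurrent-transaction error."""

def metadata_set(db, new_metadata, merge_metadata=None):
    for attempt in range(5):  # bounded retry, in the spirit of wrap_db_retry
        try:
            db.write(new_metadata)
            return
        except Conflict:
            if merge_metadata:
                # Recalculate against fresh metadata, e.g. renumber a
                # "Signal N received" entry from the new length.
                new_metadata = merge_metadata(db.read(), new_metadata)
    raise RuntimeError("gave up after repeated conflicts")

class FakeDB:
    def __init__(self):
        self.data = {"Signal 1 received": "SUCCESS"}  # signal A already landed
        self._conflicts = 1                           # first write will conflict

    def read(self):
        return dict(self.data)

    def write(self, value):
        if self._conflicts:
            self._conflicts -= 1
            raise Conflict()
        self.data = value

def merge(fresh, stale_update):
    # Signal B originally numbered itself 1; renumber from the fresh length.
    merged = dict(fresh)
    merged["Signal %d received" % (len(fresh) + 1)] = "SUCCESS"
    return merged

db = FakeDB()
metadata_set(db, {"Signal 1 received": "SUCCESS"}, merge_metadata=merge)
assert sorted(db.data) == ["Signal 1 received", "Signal 2 received"]
```

On conflict, the stale update is discarded and rebuilt from the metadata a concurrent writer just committed, so no signal is lost.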

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/232124
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=cb3183d0475ee959045dbf0098cd0d96b1ba1e98
Submitter: Jenkins
Branch: master

commit cb3183d0475ee959045dbf0098cd0d96b1ba1e98
Author: Sergey Kraynev <email address hidden>
Date: Thu Mar 10 21:39:48 2016 -0500

    Fix race condition for WaitCondition with several signals

    This fix is based on changes introduced in patch
    Ibf9dd58a66a77d9ae9d4b519b0f11567977f416c.

    The following changes were made in this patch:
     - Per the previous item, the code for calculating the metadata of
       wait_condition_handle was moved to a separate method,
       normalise_signal_data().
     - The same method was implemented for the BaseWaitConditionHandle
       class, where it returns the original signal data unchanged.
     - The handle_signal method of the BaseWaitConditionHandle class was
       updated:
       * all code related to updating and verifying metadata was moved
         to the merge_signal_metadata internal method, which is called
         from the metadata_set method.
       * the if/else block was changed to raise an error when the
         metadata has the wrong format, so the else section was deleted.
     - Corresponding tests were added for the waitcondition resource
       expecting several signals.

    Change-Id: Ia25146a742ce79dbb0480d9053131216037e5305
    Co-Authored-By: Zane Bitter <email address hidden>
    Closes-Bug: #1497274

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/296912

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/296914

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/296947

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/mitaka)

Reviewed: https://review.openstack.org/296912
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=71861f4f7c8063e618f95a72271ad0851e113f8f
Submitter: Jenkins
Branch: stable/mitaka

commit 71861f4f7c8063e618f95a72271ad0851e113f8f
Author: Sergey Kraynev <email address hidden>
Date: Mon Mar 14 03:57:15 2016 -0400

    Handling concurrent transactions in metadata_set method

    The following changes are in this patch:
     - Use the ConcurrentTransactions exception to handle concurrent
       transactions while writing metadata.
     - The @retry_on_conflict wrapper was applied to the metadata_set
       method to allow retrying in the event of a race. The same wrapper
       was added to the _push_metadata_software_deployments method.
     - Added a new parameter to the metadata_set method: merge_metadata.
       When a RetryRequest exception is raised, oslo_db_api.wrap_db_retry
       re-calls the metadata_set method, and in this case we need to
       refresh the old metadata. This is mostly needed for signals
       without data and id. For example:
         Signals A and B arrive at the same moment and both get number 1,
       because the metadata was empty. Then, during the database write, a
       RetryRequest exception is raised for signal B. That signal's
       metadata stores the old number, 1, so we must recalculate the
       value from the new metadata length and set number 2.

    Change-Id: I1ddbad7cde3036cfa9310c670609fcde607ffcac
    Co-Authored-By: Zane Bitter <email address hidden>
    Partially-Bug: #1497274
    (cherry picked from commit 82b744042e39529eaca6660704ad4923c2442be9)

tags: added: in-stable-mitaka
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/296914
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=443f867ceb65e2302229ef9a3d8c5afff51ae42b
Submitter: Jenkins
Branch: stable/mitaka

commit 443f867ceb65e2302229ef9a3d8c5afff51ae42b
Author: Sergey Kraynev <email address hidden>
Date: Thu Mar 10 21:39:48 2016 -0500

    Fix race condition for WaitCondition with several signals

    This fix is based on changes introduced in patch
    Ibf9dd58a66a77d9ae9d4b519b0f11567977f416c.

    The following changes were made in this patch:
     - Per the previous item, the code for calculating the metadata of
       wait_condition_handle was moved to a separate method,
       normalise_signal_data().
     - The same method was implemented for the BaseWaitConditionHandle
       class, where it returns the original signal data unchanged.
     - The handle_signal method of the BaseWaitConditionHandle class was
       updated:
       * all code related to updating and verifying metadata was moved
         to the merge_signal_metadata internal method, which is called
         from the metadata_set method.
       * the if/else block was changed to raise an error when the
         metadata has the wrong format, so the else section was deleted.
     - Corresponding tests were added for the waitcondition resource
       expecting several signals.

    Change-Id: Ia25146a742ce79dbb0480d9053131216037e5305
    Co-Authored-By: Zane Bitter <email address hidden>
    Closes-Bug: #1497274
    (cherry picked from commit cb3183d0475ee959045dbf0098cd0d96b1ba1e98)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/298132

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/296947
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=8f370d6174c7574c6ee611b5089eac8a4c37dbf2
Submitter: Jenkins
Branch: master

commit 8f370d6174c7574c6ee611b5089eac8a4c37dbf2
Author: Sergey Kraynev <email address hidden>
Date: Thu Mar 24 04:30:39 2016 -0400

    Increase timeout of WaitCondition due to ConcurrentTransactions

    Currently we have retry logic with a random delay period. There is a
    chance of hitting several retry attempts with the maximum timeout,
    which can be the root cause of intermittent failures when there is
    not enough time to receive all the signals.

    Also decrease the number of signals from 30 to 25, due to the small
    risk that one of the signals hits a ConcurrentTransactions error.

    Note that all these changes are made in the corresponding functional
    test and don't affect the main code.

    Change-Id: I9ca85c4378d0f62e316f4f40ff73597ffaaabadf
    Related-Bug: #1497274

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (stable/mitaka)

Reviewed: https://review.openstack.org/298132
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=db06ee4514b0cc8037f037788c34631c272cfc6b
Submitter: Jenkins
Branch: stable/mitaka

commit db06ee4514b0cc8037f037788c34631c272cfc6b
Author: Sergey Kraynev <email address hidden>
Date: Thu Mar 24 04:30:39 2016 -0400

    Increase timeout of WaitCondition due to ConcurrentTransactions

    Currently we have retry logic with a random delay period. There is a
    chance of hitting several retry attempts with the maximum timeout,
    which can be the root cause of intermittent failures when there is
    not enough time to receive all the signals.

    Also decrease the number of signals from 30 to 25, due to the small
    risk that one of the signals hits a ConcurrentTransactions error.

    Note that all these changes are made in the corresponding functional
    test and don't affect the main code.

    Change-Id: I9ca85c4378d0f62e316f4f40ff73597ffaaabadf
    Related-Bug: #1497274
    (cherry picked from commit 8f370d6174c7574c6ee611b5089eac8a4c37dbf2)

Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/heat 6.0.0.0rc2

This issue was fixed in the openstack/heat 6.0.0.0rc2 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/302565

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/302565
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=eeed79b7a2da4a8f7fcbe0c8a7c7a069f304ae11
Submitter: Jenkins
Branch: master

commit bea576f802f993204aaaaea82812820f2f51c668
Author: Pavlo Shchelokovskyy <email address hidden>
Date: Wed Mar 30 16:09:29 2016 +0300

    Sync integration tests requirements

    The patch that auto-generates integration tests requirements [1] missed
    Mitaka release.

    This patch manually syncs the integration tests requirements hopefully
    for the last time.

    [1] Ie79338cc10cc101fbf15b51c7923e3a7b8e4fbb4

    Change-Id: I3f001bcb7900f6d822055bf3567773b3a594227d

commit 950505d956e99bc9eed7d6fb0223800251d88b95
Author: Rabi Mishra <email address hidden>
Date: Wed Mar 30 09:34:36 2016 +0530

    Revert "Check RBAC policy for nested stacks"

    This is causing issues with tripleo failures
    with ValueError, when trying do stack-preview as
    part of policy enforcement for stack DELETE.

    This reverts commit a8accbba98b5fc15562864030f49d20570bf8412.

    Change-Id: I366316f27f24c650bd435e67dd17abd8676cedf4
    Closes-Bug: #1561172
    (cherry picked from commit 354f1bcef038340d95f7dc4bc29268c9efed642e)

commit 184b09a40672ccfe1964bbf4fc72234d4cd012b6
Author: OpenStack Proposal Bot <email address hidden>
Date: Wed Mar 30 06:08:07 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I656cedd8162b5a7d1682a33ab9b836ba25d22de9

commit 0c407b8596499466b93d5b511e19b0de130b6f03
Author: Zane Bitter <email address hidden>
Date: Mon Mar 28 17:19:47 2016 -0400

    Add translation rule to delete ssh auth key from Magnum baymodel

    Add a translation rule to ensure that users with existing templates will
    not have them break when they update due to the 'ssh_authorized_key'
    property having been removed from Magnum and hidden in Heat.

    Change-Id: Id73bace93bfc2379e80777da7dc0c0bb059d6cd3
    Related-Bug: #1560284
    (cherry picked from commit 0ab68f891348eeafa1efa219070ef39f776cb6c6)

commit db06ee4514b0cc8037f037788c34631c272cfc6b
Author: Sergey Kraynev <email address hidden>
Date: Thu Mar 24 04:30:39 2016 -0400

    Increase timeout of WaitCondition due to ConcurrentTransactions

    Currently we have retry logic with a random delay period. There is a
    chance of hitting several retry attempts with the maximum timeout,
    which can be the root cause of intermittent failures when there is
    not enough time to receive all the signals.

    Also decrease the number of signals from 30 to 25, due to the small
    risk that one of the signals hits a ConcurrentTransactions error.

    Note that all these changes are made in the corresponding functional
    test and don't affect the main code.

    Change-Id: I9ca85c4378d0f62e316f4f40ff73597ffaaabadf
    Related-Bug: #1497274
    (cherry picked from commit 8f370d6174c7574c6ee611b5089eac8a4c37dbf2)

commit 145be139cc8a34c57f84e1991031cfc2595477ba
Author: OpenStack Proposal Bot <openstack-infra...

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/heat 7.0.0.0b1

This issue was fixed in the openstack/heat 7.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/348856

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/liberty)

Reviewed: https://review.openstack.org/348856
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=c6d18486a5f307eec3293aeb99b487684f1fda79
Submitter: Jenkins
Branch: stable/liberty

commit c6d18486a5f307eec3293aeb99b487684f1fda79
Author: Sergey Kraynev <email address hidden>
Date: Mon Mar 14 03:57:15 2016 -0400

    Handling concurrent transactions in metadata_set method

    The following changes are in this patch:
     - Use the ConcurrentTransactions exception to handle concurrent
       transactions while writing metadata.
     - The @retry_on_conflict wrapper was applied to the metadata_set
       method to allow retrying in the event of a race. The same wrapper
       was added to the _push_metadata_software_deployments method.
     - Added a new parameter to the metadata_set method: merge_metadata.
       When a RetryRequest exception is raised, oslo_db_api.wrap_db_retry
       re-calls the metadata_set method, and in this case we need to
       refresh the old metadata. This is mostly needed for signals
       without data and id. For example:
         Signals A and B arrive at the same moment and both get number 1,
       because the metadata was empty. Then, during the database write, a
       RetryRequest exception is raised for signal B. That signal's
       metadata stores the old number, 1, so we must recalculate the
       value from the new metadata length and set number 2.

    Change-Id: I1ddbad7cde3036cfa9310c670609fcde607ffcac
    Co-Authored-By: Zane Bitter <email address hidden>
    Partially-Bug: #1497274
    (cherry picked from commit 82b744042e39529eaca6660704ad4923c2442be9)

tags: added: in-stable-liberty