tempest.api.orchestration.stacks.test_update.UpdateStackTestJSON.test_stack_update_add_remove - mismatch error

Bug #1370865 reported by Steve Baker
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Heat
Invalid
High
Unassigned
tempest
Fix Released
High
Sean Dague

Bug Description

tempest.api.orchestration.stacks.test_update.UpdateStackTestJSON.test_stack_update_add_remove[gate]
2014-09-17 05:35:15.471 | ---------------------------------------------------------------------------------------------------
2014-09-17 05:35:15.471 |
2014-09-17 05:35:15.471 | Captured traceback:
2014-09-17 05:35:15.471 | ~~~~~~~~~~~~~~~~~~~
2014-09-17 05:35:15.472 | Traceback (most recent call last):
2014-09-17 05:35:15.472 | File "tempest/api/orchestration/stacks/test_update.py", line 82, in test_stack_update_add_remove
2014-09-17 05:35:15.472 | self.list_resources(stack_identifier))
2014-09-17 05:35:15.472 | File "/usr/local/lib/python2.7/dist-packages/testtools/testcase.py", line 348, in assertEqual
2014-09-17 05:35:15.472 | self.assertThat(observed, matcher, message)
2014-09-17 05:35:15.472 | File "/usr/local/lib/python2.7/dist-packages/testtools/testcase.py", line 433, in assertThat
2014-09-17 05:35:15.472 | raise mismatch_error
2014-09-17 05:35:15.472 | MismatchError: !=:
2014-09-17 05:35:15.472 | reference = {'random1': 'OS::Heat::RandomString'}
2014-09-17 05:35:15.472 | actual = {u'random1': u'OS::Heat::RandomString', u'random2': u'OS::Heat::RandomString'}

seen in gate-tempest-dsvm-postgres-full, possibly only affects postgres.

Revision history for this message
Steve Baker (steve-stevebaker) wrote :
Changed in heat:
status: New → Triaged
assignee: nobody → Steve Baker (steve-stevebaker)
Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

This should be fixed by https://review.openstack.org/#/c/88075/ but still its occurring where HEAT return state UPDATE_COMPLETE but the new resources is not shown up while listing those.

Tempest test do wait for 'UPDATE_COMPLETE' status after update stack.

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

Currently i do not think any thing needs to be done for tempest. Depend on HEAT finding we can again reopen for tempest if needed.

Changed in tempest:
status: New → Invalid
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

Logstash query:
message:"reference = {'random1': 'OS::Heat::RandomString'}" AND tags:console

It is occurring about once per hour, and only on check-tempest-dsvm-postgres-full and check-tempest-dsvm-postgres-full-icehouse

So it likely is specific to postgres

Revision history for this message
Matt Riedemann (mriedem) wrote :

The nova-metadata-api service also runs in the postgresql jobs, but I've no idea if that could be related.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/123023

Changed in heat:
status: Triaged → In Progress
Changed in heat:
importance: Undecided → Medium
milestone: none → juno-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/123023
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=146a734555207583b4ca9b20e1f101c5f467a28f
Submitter: Jenkins
Branch: master

commit 146a734555207583b4ca9b20e1f101c5f467a28f
Author: Steve Baker <email address hidden>
Date: Mon Sep 22 10:45:51 2014 +1200

    Commit transaction on object delete

    The postgres gate jobs currently have an intermittent failure
    where a deleted resource shows up in the resource list.

    This change does a transaction commit on object delete, which
    brings delete behaviour in line with soft_delete behaviour
    (soft_delete uses update_and_save, which also commits).

    Change-Id: I5e1be4c197863fde3149c964f992ad816db88b7e
    Closes-Bug: #1370865

Changed in heat:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/124784

Thierry Carrez (ttx)
Changed in heat:
status: Fix Committed → Fix Released
Revision history for this message
Joe Gordon (jogo) wrote : Re: tempest.api.orchestration.stacks.test_update.UpdateStackTestJSON.test_stack_update_add_remove mismatch error
Revision history for this message
Sean Dague (sdague) wrote :

The fact is the test is broken by design, it doing an async op, and expecting synchronous results. Unless there is a way to know if the update took, other than the data, I don't think this is a test we can do.

Changed in heat:
status: Fix Released → Confirmed
Changed in tempest:
status: Invalid → Confirmed
Changed in tempest:
assignee: nobody → Sean Dague (sdague)
status: Confirmed → In Progress
Sean Dague (sdague)
summary: - tempest.api.orchestration.stacks.test_update.UpdateStackTestJSON.test_stack_update_add_remove
- mismatch error
+ tempest.api.orchestration.stacks.test_update are broken by design
Changed in tempest:
importance: Undecided → High
Revision history for this message
Matt Riedemann (mriedem) wrote : Re: tempest.api.orchestration.stacks.test_update are broken by design

The tests are being removed:

https://review.openstack.org/#/c/125989/

Revision history for this message
Zane Bitter (zaneb) wrote :

It's clearly not broken _by design_, and it is quite explicitly waiting to see if the update took:

  self.client.wait_for_stack_status(stack_identifier, 'UPDATE_COMPLETE')

summary: - tempest.api.orchestration.stacks.test_update are broken by design
+ tempest.api.orchestration.stacks.test_update.UpdateStackTestJSON.test_stack_update_add_remove
+ - mismatch error
Sean Dague (sdague)
Changed in heat:
importance: Medium → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/126074

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/126084

Changed in heat:
assignee: Steve Baker (steve-stevebaker) → Zane Bitter (zaneb)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/125989
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=0ca4933e9cb5ffc0536d087a30c3c911f21320b6
Submitter: Jenkins
Branch: master

commit 0ca4933e9cb5ffc0536d087a30c3c911f21320b6
Author: Sean Dague <email address hidden>
Date: Fri Oct 3 10:52:36 2014 -0400

    remove heat test_updates tests

    These perform async operations, but expect synchronous results, so
    they can often fail if the system is slowed down in the least. These
    tests should be removed as they are non deterministic.

    Closes-Bug: #1370865

    Change-Id: I9f0ef9a2f8a3b8971c6e52b99723fe4137ec2d0a

Changed in tempest:
status: In Progress → Fix Released
Revision history for this message
Zane Bitter (zaneb) wrote :

Hopefully that revert should stem the tide of failures somewhat. However, after digging into it I don't think the first patch solved the original problem. It tried to improve the deletion of resources from the Resource table, but the resource list doesn't come from there - it's obtained from the template.

Revision history for this message
Steven Hardy (shardy) wrote :

Could it be related to this bug #1377681 where it again appears an update removing resources is not actually removing the expected resources?

Revision history for this message
Steven Hardy (shardy) wrote :

> The fact is the test is broken by design

As mentioned in comment #2 we wait for UPDATE_COMPLETE, so I don't understand this attitude (or appreciate it really, as the original author of the test).

AFAICT the test is working, and it's caught a real regression in heat.

Revision history for this message
Zane Bitter (zaneb) wrote :

It seems unlikely that bug #1377681 is related. Most likely that's a problem with the InstanceGroup resource. This one appears to be a DB problem, since it doesn't occur all the time, and was occurring almost exclusively on Postgres up until the big burst of failures on Friday, which I suspect can be fixed by https://review.openstack.org/#/c/126084/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/126084
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=ce22b59b62f8f0e7136a2a7d675ca72a992a0543
Submitter: Jenkins
Branch: master

commit ce22b59b62f8f0e7136a2a7d675ca72a992a0543
Author: Zane Bitter <email address hidden>
Date: Mon Oct 6 16:46:55 2014 -0400

    Revert "Hard delete backup stack after successful update."

    This reverts commit 7cf6e52cacd2382b44b9255e9a5ed13271841261.

    This change appears to be the cause of many Tempest test breakages. Prior
    to this a new raw_template entry was created in the database after an
    update, and the pointer to the new raw_template and the stack status were
    changed atomically. Now those two pieces of data are being updated
    separately... it seems like it is possible for the database to return the
    new version of the stack on one call followed by the old version of the
    template on another call. Tempest is observing the stack status change to
    UPDATE_COMPLETE but then still getting data from the old template when
    doing a resource list immediately after.

    Change-Id: Ieea739bc45dcc365ccb6f09d0e848e310d327134
    Partial-Bug: #1370865

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/126464

Revision history for this message
Anant Patil (ananta) wrote :

The stack.store() is not an atomic call at all. First it stores the
template and then the user credentials and then the stack itself.

It is natural that if threads poll DB between these three operations,
they will find conflicting data. It was working earlier because each
update was creating new entry for raw template, which the external
polling thread wouldn't know unless the stack itself is stored with its
new template ID and status.

In the reverted change, the template id of old template is reused to
avoid creating new row in raw_template table.

When the old template ID is reused to persist new template, there is
possibility that a thread will find new template there, with the old ID,
before stack status changed to COMPLETED, because the stack is stored
_after_ template is stored. Since the stack store method is not atomic,
this happens.

If old template ID is not re-used when updating a stack, new entries for
new templates are created in DB, leaving the old template entry
dangling. Since back-up stack has its own raw template, it would get
deleted but the stale template remains.

Revision history for this message
Zane Bitter (zaneb) wrote :

> When the old template ID is reused to persist new template, there is
> possibility that a thread will find new template there, with the old ID,
> before stack status changed to COMPLETED, because the stack is stored
> _after_ template is stored.

You might think that, but in this bug what we're seeing is the opposite. I suspect that the databases do not guarantee at all the order of operations that are not part of a transaction.

Given that we were already seeing the error intermittently before that patch (although the frequency went through the roof around the time it was committed), it's possible that it only introduced a timing change or something of that sort to exacerbate the existing problem.

Revision history for this message
Anant Patil (ananta) wrote :

> You might think that, but in this bug what we're seeing is the
> opposite. I suspect that the databases do not guarantee at all the
> order of operations that are not part of a transaction.

Yes, I agree. There is no ordering in how transactions will be
committed, given that the three operations in stack.store() are actually
three different DB transactions.

I see that we can get rid of lot of these types of problems if we can
make the store methods transactional. For example, in this patch
https://review.openstack.org/#/c/123397/ , the author is defining a
rollback method and explicitly calling it. IMHO, one would have to make
such _workarounds_ in order to impose ACID properties on a method, which
I think we can get rid of if we write a generic closure to rollback the
store operations.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (stable/icehouse)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/icehouse
Review: https://review.openstack.org/124784
Reason: tests failing are removed in tempest/master

Thierry Carrez (ttx)
Changed in heat:
milestone: juno-rc1 → none
Zane Bitter (zaneb)
Changed in heat:
importance: Critical → High
assignee: Zane Bitter (zaneb) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/129497

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/129497
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=8f50ef5b753684ee825e9e1213d960a9497e5616
Submitter: Jenkins
Branch: master

commit 8f50ef5b753684ee825e9e1213d960a9497e5616
Author: Steve Baker <email address hidden>
Date: Mon Oct 20 14:34:41 2014 +1300

    Functional test UpdateStackTest

    This is a port of tempest.api.orchestration.stacks.stack_update
    in an effort to reproduce the postgres race described in
    bug #1370865.

    Subsequent changes will create an experimental postgres check job and
    a change to the test which runs test_stack_update_add_remove enough
    times to triger the race.

    Change-Id: I8aee26e29f242b11a5d3c8d56a859ae770be86c2
    Related-Bug: #1370865

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/130416

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tempest (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/126074

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Steve Baker (<email address hidden>) on branch: master
Review: https://review.openstack.org/126464
Reason: This has moved to an in-tree functional test

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Steve Baker (<email address hidden>) on branch: master
Review: https://review.openstack.org/130416
Reason: haven't been able to trigger the race with this approach

Angus Salkeld (asalkeld)
Changed in heat:
milestone: none → kilo-rc1
Angus Salkeld (asalkeld)
Changed in heat:
milestone: kilo-rc1 → next
tags: added: kilo-rc-potential
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

Marking as invalid until we can reproduce

Changed in heat:
status: In Progress → Invalid
Angus Salkeld (asalkeld)
Changed in heat:
milestone: next → none
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.