Conflict errors when snapshotting and restoring stacks in parallel

Bug #1456672 reported by Kairat Kushaev
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Kairat Kushaev
Milestone: 5.0.0

Bug Description

While testing Heat with Rally I found the following problem:
1. rally input task is the following
HeatStacks.create_snapshot_restore_delete_stack:
    -
      args:
        template_path: "templates/random_strings.yaml.template"
      runner:
        type: "constant"
        times: 8
        concurrency: 4
      context:
        users:
          tenants: 3
          users_per_tenant: 2
So Rally runs the scenario 8 times with a concurrency of 4, across 3 tenants with 2 users per tenant.
create_snapshot_restore_delete_stack executes simple create, snapshot, restore, and delete operations through the client (a rough client-side sketch of this sequence is given after the results table below).
2. When I launched the test I got the following output:
2015-05-19 16:48:34.521 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 1 START
2015-05-19 16:48:34.519 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 0 START
2015-05-19 16:48:34.531 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 2 START
2015-05-19 16:48:34.547 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 3 START
2015-05-19 16:48:59.881 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 1 END: Error HTTPConflict: ERROR: Stack rally_stack_Ax7mX0n already has an action (SNAPSHOT) in progress.
2015-05-19 16:48:59.911 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 4 START
2015-05-19 16:49:16.641 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 2 END: OK
2015-05-19 16:49:16.652 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 5 START
2015-05-19 16:49:23.423 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 3 END: OK
2015-05-19 16:49:23.436 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 6 START
2015-05-19 16:49:23.534 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 0 END: OK
2015-05-19 16:49:23.549 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 7 START
2015-05-19 16:49:46.729 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 4 END: OK
2015-05-19 16:49:55.416 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 5 END: OK
2015-05-19 16:50:01.911 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 6 END: OK
2015-05-19 16:50:03.542 26343 INFO rally.benchmark.runners.base [-] Task 38c5e989-23bc-4d45-825a-b8ae19e2beb0 | ITER: 7 END: OK
+---------------------------------------------------------------------------------------------+
|                                     Response Times (sec)                                     |
+---------------------+--------+--------+--------+--------+--------+--------+---------+-------+
| action              | min    | median | 90%ile | 95%ile | max    | avg    | success | count |
+---------------------+--------+--------+--------+--------+--------+--------+---------+-------+
| heat.create_stack   | 12.068 | 13.911 | 15.478 | 15.537 | 15.595 | 14.122 | 100.0%  | 8     |
| heat.delete_stack   | 4.732  | 10.61  | 14.012 | 14.048 | 14.084 | 9.544  | 87.5%   | 8     |
| heat.restore_stack  | 7.282  | 11.237 | 11.991 | 12.38  | 12.77  | 10.553 | 87.5%   | 8     |
| heat.snapshot_stack | 7.475  | 9.467  | 10.286 | 10.827 | 11.369 | 9.231  | 100.0%  | 8     |
| total               | 38.474 | 42.094 | 48.922 | 48.958 | 48.994 | 43.43  | 87.5%   | 8     |
+---------------------+--------+--------+--------+--------+--------+--------+---------+-------+
So the restore of stack rally_stack_Ax7mX0n did not finish successfully.
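
For reference, the scenario boils down to the client-side sequence sketched below. This is only a rough illustration, not Rally's actual code: wait_for_status is a hypothetical polling helper, and the python-heatclient calls and signatures used here are assumptions on my part.

import time

# "heat" below is assumed to be an authenticated python-heatclient client,
# e.g. heatclient.client.Client('1', session=...); treat the exact method
# signatures as assumptions.


def wait_for_status(heat, stack_id, target, timeout=600, interval=2):
    # hypothetical polling helper (Rally's real implementation differs)
    deadline = time.time() + timeout
    while time.time() < deadline:
        stack = heat.stacks.get(stack_id)
        if stack.stack_status == target:
            return stack
        time.sleep(interval)
    raise RuntimeError('timed out waiting for %s' % target)


def create_snapshot_restore_delete(heat, name, template):
    # create
    heat.stacks.create(stack_name=name, template=template)
    stack = wait_for_status(heat, name, 'CREATE_COMPLETE')

    # snapshot: as soon as the stack reports SNAPSHOT_COMPLETE the scenario
    # moves on, which is exactly the window where this bug bites
    snapshot = heat.stacks.snapshot(stack.id)
    wait_for_status(heat, stack.id, 'SNAPSHOT_COMPLETE')

    # restore: this is the call that fails with HTTPConflict while the
    # engine still holds the stack lock
    heat.stacks.restore(stack.id, snapshot['id'])
    wait_for_status(heat, stack.id, 'RESTORE_COMPLETE')

    # delete
    heat.stacks.delete(stack.id)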

Please see the heat-engine logs in attachment.

Env:
heat - master branch - liberty.1
rally - master branch - liberty.1

Revision history for this message
Kairat Kushaev (kkushaev) wrote :
Changed in heat:
assignee: nobody → Kairat Kushaev (kkushaev)
description: updated
Revision history for this message
Kairat Kushaev (kkushaev) wrote :

If you dig deeper into the logs you will see the following:
2015-05-19 16:48:59.849 DEBUG heat.engine.stack_lock [-] Engine 21ec8e3e-369e-4ed0-9e2b-e9220629b49e released lock on stack 5665946a-64fa-4ae7-a05f-04f198e30103 from (pid=11867) release /opt/stack/heat/heat/engine/stack_lock.py:140

and after that

2015-05-19 16:48:59.857 DEBUG heat.engine.stack_lock [req-a0e537a5-9784-4a55-bead-1fb82c01724f None ctx_rally_38c5e989-23bc-4d45-825a-b8ae19e2beb0_tenant_2] Lock on stack 5665946a-64fa-4ae7-a05f-04f198e30103 is owned by engine 21ec8e3e-369e-4ed0-9e2b-e9220629b49e from (pid=11867) acquire /opt/stack/heat/heat/engine/stack_lock.py:90
2015-05-19 16:48:59.857 DEBUG oslo_messaging.rpc.dispatcher [req-a0e537a5-9784-4a55-bead-1fb82c01724f None ctx_rally_38c5e989-23bc-4d45-825a-b8ae19e2beb0_tenant_2] Expected exception during message handling (Stack rally_stack_Ax7mX0n already has an action (SNAPSHOT) in progress.) from (pid=11867) _dispatch_and_reply /usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py:145

It seems that lock release does not work correctly in a multithreaded environment.
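
To make the window concrete: the engine marks the stack SNAPSHOT_COMPLETE, keeps working for a while with the lock still held, and only then releases it, while the client treats SNAPSHOT_COMPLETE as permission to start the next action. Here is a minimal, heat-independent sketch of that ordering with plain Python threading (all names are illustrative; none of this is heat's API):

import threading
import time

stack_lock = threading.Lock()                  # stands in for heat's per-stack lock
state = {'status': 'SNAPSHOT_IN_PROGRESS'}


def engine_snapshot():
    with stack_lock:
        # ... snapshot work ...
        state['status'] = 'SNAPSHOT_COMPLETE'  # the client can already observe "complete"
        time.sleep(1)                          # prepare_abandon + snapshot DB update
                                               # still happen here, lock still held


def client_restore():
    while state['status'] != 'SNAPSHOT_COMPLETE':
        time.sleep(0.05)                       # client polls and sees "complete"...
    if stack_lock.acquire(False):              # ...but the engine has not released the lock yet
        stack_lock.release()
        print('restore would proceed normally')
    else:
        print('ActionInProgress: stack already has an action (SNAPSHOT) in progress')


engine = threading.Thread(target=engine_snapshot)
client = threading.Thread(target=client_restore)
engine.start(); client.start()
engine.join(); client.join()                   # almost always prints the conflict message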

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

If there is a lock release failure, it may only affect the snapshot operation.

Changed in heat:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Kairat Kushaev (kkushaev) wrote :

Finally, I found the root cause of the problem:
It turns out that heat does not release the lock when the stack state becomes (SNAPSHOT, COMPLETE).
Look at engine/service.py:
stack.snapshot()
data = stack.prepare_abandon()
snapshot_object.Snapshot.update(
    cnxt, snapshot.id,
    {'data': data, 'status': stack.status,
     'status_reason': stack.status_reason})
After the stack becomes SNAPSHOT_COMPLETE, heat still executes two additional operations that require the stack lock to be held.

Rally starts the stack restore as soon as the stack becomes SNAPSHOT_COMPLETE, so it tries to acquire the lock while heat is still executing prepare_abandon and updating the snapshot in the DB. As a result we get ActionInProgress errors.

So I am thinking about how to make stack snapshot more modular. IMO, it would be better to execute prepare_abandon and the snapshot DB update BEFORE the stack becomes SNAPSHOT_COMPLETE, because if any errors happen during these operations the user will know about them.
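
A sketch of that idea, reusing the names from the excerpt above (this is not the merged patch; the actual change may be structured differently, and save_snapshot_func here is a hypothetical hook):

def _stack_snapshot(cnxt, stack, snapshot):
    def save_snapshot(s):
        # Runs while the action is still IN_PROGRESS, i.e. before the stack
        # is marked SNAPSHOT_COMPLETE and before the stack lock is released,
        # so any failure here is still reported to the user.
        # Note: s.status is still IN_PROGRESS at this point; the real patch
        # may record the final status differently.
        data = s.prepare_abandon()
        snapshot_object.Snapshot.update(
            cnxt, snapshot.id,
            {'data': data, 'status': s.status,
             'status_reason': s.status_reason})

    # Hypothetical hook: have the stack invoke save_snapshot just before it
    # flips its own state to COMPLETE.
    stack.snapshot(save_snapshot_func=save_snapshot)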

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184769

Changed in heat:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/184769
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=3433be542642c359486a30dcd5bd744e103196ba
Submitter: Jenkins
Branch: master

commit 3433be542642c359486a30dcd5bd744e103196ba
Author: kairat_kushaev <email address hidden>
Date: Thu May 21 15:26:59 2015 +0300

    Save snapshot to db before stack snapshot complete

    When executing a stack snapshot, heat does not release the stack lock
    for some undefined amount of time because it needs to prepare
    some data for the stack snapshot. This leads to a situation where the
    stack snapshot is complete (from the user's perspective) but nobody can
    perform any operations on the stack. So the fix saves the snapshot to
    the DB right before the stack becomes complete.

    Change-Id: Id011142498fee49fee9ec1437fc0816b25780e48
    Closes-bug: #1456672

Changed in heat:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in heat:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in heat:
milestone: liberty-1 → 5.0.0