heat engine db transaction error: InvalidRequestError: This session is in 'committed' state; no further SQL can be emitted within this transaction.

Bug #1328983 reported by Matt Riedemann
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Heat
Fix Released
Critical
Steve Baker

Bug Description

http://logs.openstack.org/73/97573/5/gate/gate-tempest-dsvm-neutron/8de83cb/console.html

2014-06-11 12:41:32.597 | {0} tearDownClass (tempest.api.orchestration.stacks.test_stacks.StacksTestJSON) [0.000000s] ... FAILED
2014-06-11 12:41:32.597 |
2014-06-11 12:41:32.597 | Captured traceback:
2014-06-11 12:41:32.597 | ~~~~~~~~~~~~~~~~~~~
2014-06-11 12:41:32.597 | Traceback (most recent call last):
2014-06-11 12:41:32.597 | File "tempest/api/orchestration/base.py", line 138, in tearDownClass
2014-06-11 12:41:32.597 | cls._clear_stacks()
2014-06-11 12:41:32.598 | File "tempest/api/orchestration/base.py", line 89, in _clear_stacks
2014-06-11 12:41:32.598 | stack_identifier, 'DELETE_COMPLETE')
2014-06-11 12:41:32.598 | File "tempest/services/orchestration/json/orchestration_client.py", line 199, in wait_for_stack_status
2014-06-11 12:41:32.598 | raise exceptions.TimeoutException(message)
2014-06-11 12:41:32.598 | TimeoutException: Request timed out
2014-06-11 12:41:32.598 | Details: Stack heat-209866829 failed to reach DELETE_COMPLETE status within the required time (900 s).

I see this db transaction error in the heat engine log:

http://logs.openstack.org/73/97573/5/gate/gate-tempest-dsvm-neutron/8de83cb/logs/screen-h-eng.txt.gz?level=TRACE#_2014-06-11_12_26_32_530

2014-06-11 12:26:32.530 27343 ERROR heat.openstack.common.db.sqlalchemy.session [-] DB exception wrapped.
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session Traceback (most recent call last):
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session File "/opt/stack/new/heat/heat/openstack/common/db/sqlalchemy/session.py", line 439, in _wrap
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session return f(self, *args, **kwargs)
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session File "/opt/stack/new/heat/heat/openstack/common/db/sqlalchemy/session.py", line 705, in flush
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session return super(Session, self).flush(*args, **kwargs)
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 1903, in flush
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session self._flush(objects)
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 2021, in _flush
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session transaction.rollback(_capture_exception=True)
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session compat.reraise(type_, value, traceback)
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 2021, in _flush
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session transaction.rollback(_capture_exception=True)
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 386, in rollback
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session self._assert_active(prepared_ok=True, rollback_ok=True)
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 192, in _assert_active
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session "This session is in 'committed' state; no further "
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session InvalidRequestError: This session is in 'committed' state; no further SQL can be emitted within this transaction.
2014-06-11 12:26:32.530 27343 TRACE heat.openstack.common.db.sqlalchemy.session

That shows up elsewhere:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiSW52YWxpZFJlcXVlc3RFcnJvclxcOiBUaGlzIHNlc3Npb24gaXMgaW4gJ2NvbW1pdHRlZCcgc3RhdGVcXDsgbm8gZnVydGhlciBTUUwgY2FuIGJlIGVtaXR0ZWQgd2l0aGluIHRoaXMgdHJhbnNhY3Rpb24uXCIgQU5EIHRhZ3M6XCJzY3JlZW4taC1lbmcudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjE3MjgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MDI1MDUxNDM5NTV9

16 hits in 2 days, check and gate, all failures.

Tags: db
Matt Riedemann (mriedem)
summary: - tempest.api.orchestration.stacks.test_stacks.StacksTestJSON fails with
- timeout: Stack failed to reach DELETE_COMPLETE status within the
- required time
+ heat engine db transaction error: InvalidRequestError: This session is
+ in 'committed' state; no further SQL can be emitted within this
+ transaction.
tags: added: db
Revision history for this message
Zane Bitter (zaneb) wrote :
Changed in heat:
importance: Undecided → Critical
Revision history for this message
Matt Riedemann (mriedem) wrote :

Yeah looks like this is probably a duplicate of bug 1306029 given the changes made for that bug and looking at the failures in log stash, the graphs are very similar when both issues show up.

Changed in heat:
status: New → Triaged
assignee: nobody → Steve Baker (steve-stevebaker)
milestone: none → juno-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/102377

Changed in heat:
status: Triaged → In Progress
Revision history for this message
Steve Baker (steve-stevebaker) wrote :

This happens in the following scenario:
* single heat-engine (so no multi-engine locking paths are invoked)
* multiple stack deletes on the same stack
* when the threads are destroyed from a previous delete call so that the new delete call can acquire the lock and do the delete

I don't think this is a race as-such. The error happens when user_creds_delete is being called by the thread which is currently being killed, and the session is already in the process of being shut down.

I have a change which does a graceful stop of the threadgroup instead of a hard stop. A timeout will be added to the threadgroup waiting, to give the threads an opportunity to finish within the timeout.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/103715

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/103716

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Steve Baker (<email address hidden>) on branch: master
Review: https://review.openstack.org/102377
Reason: Alternative fix here https://review.openstack.org/#/c/103716/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/103715
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=657b339967f67ee103fb1d96914e412632a088a5
Submitter: Jenkins
Branch: master

commit 657b339967f67ee103fb1d96914e412632a088a5
Author: Steve Baker <email address hidden>
Date: Wed Jul 2 12:51:25 2014 +1200

    Catch NotFound for every stack delete db operation

    A stack delete can be interrupted by another delete request
    at any stage, so a stack may be in any state of partial deletion.

    This change ensures that every db operation in a stack delete
    will catch NotFound exceptions so that the delete process can
    continue.

    Change-Id: I3b665acbc7979147f24793a314da37150d4efeb8
    Partial-Bug: 1328983

Revision history for this message
Zane Bitter (zaneb) wrote :

So based on Steve's work, it appears our best guess at the problem is this:

* SQLAlchemy marks a transaction as COMMITTED at some point _before_ it completes processing of the commit().
* If an exception (in this case GreenletExit) is raised during the commit(), SQLAlchemy attempts to rollback the transaction.
* If the exception occurs after the transaction state has changed to COMMITTED, the rollback() fails with this error.

So this is a somewhat artificial problem, in that it's triggered by the timing of Tempest sending multiple delete calls to a stack that contains no resources. The proposed solution is equally artificial and just adds a delay to prevent that case.

This is probably the right way to go *assuming* that we are not missing out on any important operations by cancelling the commit() operation (albeit after the state has reached COMMITTED) without attempting a rollback. This is a question we need to take up with the SQLAlchemy folks.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/103716
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=cb966ff38a9829e1d7d86701cb73bcdd2537595a
Submitter: Jenkins
Branch: master

commit cb966ff38a9829e1d7d86701cb73bcdd2537595a
Author: Steve Baker <email address hidden>
Date: Wed Jul 2 15:16:06 2014 +1200

    Sleep before stopping threads for delete

    The logged error in bug #1328983 occurs in the following scenario:
    * single heat-engine (so no multi-engine locking paths are invoked)
    * multiple stack deletes on the same stack
    * when the threads are stopped from a previous delete call so
      that the new delete call can acquire the lock and do the delete

    This change does an eventlet sleep before forcefully stopping
    the thread when taking over a delete from a different
    thread. This gives the previous delete thread an opportunity to
    finish naturally.

    This may not prevent the error in bug #1328983 from ever
    being logged in a production environment, but it is an internal error
    which does not propagate to the user. This change should however
    prevent this error from being logged during a tempest run, and the
    original gate failure was due to ERROR log entry detection rather
    than actual failing tests.

    By experimentation the timeout of 0.2s was chosen. The error in
    bug #1328983 started being logged again when the timeout was
    reduced to 0.02s

    Closes-Bug: #1328983

    Change-Id: I8f95f29bd238e097ed9f4b889afe12c88d193240

Changed in heat:
status: In Progress → Fix Committed
Changed in heat:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in heat:
milestone: juno-2 → 2014.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.