Tests fails with SQL error 'Command Out of Sync'

Bug #1546431 reported by Thomas Herve
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Heat
Fix Released
High
Steve Baker

Bug Description

Seen here: http://logs.openstack.org/91/280691/2/gate/gate-heat-dsvm-functional-orig-mysql/065e324

 Captured traceback:
 ~~~~~~~~~~~~~~~~~~~
     Traceback (most recent call last):
       File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 339, in _stack_delete
         success_on_not_found=True)
       File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 329, in _wait_for_stack_status
         raise exceptions.TimeoutException(message)
     heat_integrationtests.common.exceptions.TimeoutException: Request timed out
     Details: Stack InstanceGroupUpdatePolicyTest-2000329881 failed to reach DELETE_COMPLETE status within the required time (1200 s).

On the engine logs we can see:

Exception during reset or similar
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 636, in _finalize_fairy
    fairy._reset(pool)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 776, in _reset
    pool._dialect.do_rollback(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/dialects/mysql/base.py", line 2526, in do_rollback
    dbapi_connection.rollback()
  File "/usr/local/lib/python2.7/dist-packages/pymysql/connections.py", line 762, in rollback
    self._read_ok_packet()
  File "/usr/local/lib/python2.7/dist-packages/pymysql/connections.py", line 738, in _read_ok_packet
    raise err.OperationalError(2014, "Command Out of Sync")
OperationalError: (2014, 'Command Out of Sync')

It could be a thread-safety issue? There is no other error that I can see.

Angus Salkeld (asalkeld)
tags: added: gate-failure
Revision history for this message
Zane Bitter (zaneb) wrote :

I've also noticed that tests that timeout with no explanation (i.e. no failed tests, the gate runner just kills the job) are due to this error, since in the test cleanup code we delete any stacks the test created *and* wait for them to reach DELETE_COMPLETE (why???). So this problem could be much more widespread than we thought.

Example: http://logs.openstack.org/98/323998/2/check/gate-heat-dsvm-functional-orig-mysql-lbaasv2/0fdca14/

Changed in heat:
status: New → Triaged
Zane Bitter (zaneb)
Changed in heat:
milestone: none → newton-2
Revision history for this message
Zane Bitter (zaneb) wrote :

Oh yeah, this is everywhere.

Revision history for this message
Zane Bitter (zaneb) wrote :

I wouldn't be surprised if bug 1483670 turned out to have the same cause.

Revision history for this message
Steve Baker (steve-stevebaker) wrote :
Revision history for this message
Rabi Mishra (rabi) wrote :

It seems to be happening mostly when we stop the greenthread while deleting an IN_PROGRESS stack and the existing action is interrupted.

In these cases the 'Command Out of Sync' errors are associated with ResourceClosedError.

- Few cases it fails when the thread linked function(release) is called which in turn calls persist_state_and_release_lock[1].
- Other cases it fails while setting the resource state to FAILED[2]

Most of the cases this happens for StackValidationTest where we try to delete(cleanup) an IN_PROGRESS stack.

[1]

http://logs.openstack.org/99/330799/5/gate/gate-heat-dsvm-functional-orig-mysql-lbaasv2/c7e34e3/logs/screen-h-eng.txt.gz#_2016-06-30_05_01_58_311

[2]
http://logs.openstack.org/03/335303/3/check/gate-heat-dsvm-functional-orig-mysql-lbaasv2/ffb5434/logs/screen-h-eng.txt.gz#_2016-06-29_22_17_02_904

There are few other cases it fails with just 'Exception during reset or simillar'. Not sure how that happens.

Thomas Herve (therve)
Changed in heat:
milestone: newton-2 → newton-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/341373

Changed in heat:
assignee: nobody → Rabi Mishra (rabi)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/341373
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=d01b4dc7a8020fb1ae88580ef3b41d279e66a70c
Submitter: Jenkins
Branch: master

commit d01b4dc7a8020fb1ae88580ef3b41d279e66a70c
Author: rabi <email address hidden>
Date: Wed Jul 13 14:18:17 2016 +0530

    Don't persist_state of IN_PROGRESS stack again

    When a greenthread is stopped(ex. deleting an IN_PROGRESS stack),
    the linked function(release) is called and it tries to save
    the state of the stack. As we already saved the state of the
    IN_PROGRESS stack, we should not try to do it again.

    Change-Id: I598afcd286881bc1d9394152fe38701d5fed473f
    Partial-Bug: #1546431

Rabi Mishra (rabi)
Changed in heat:
assignee: Rabi Mishra (rabi) → nobody
Thomas Herve (therve)
Changed in heat:
status: In Progress → New
milestone: newton-3 → next
Revision history for this message
Zane Bitter (zaneb) wrote :

I believe bug 1499669 is a duplicate of this one.

Changed in heat:
status: New → Triaged
milestone: next → newton-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/369638

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

I'm going to mark this as the duplicate, since there is a local reproducer in 1546431

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/369638
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=6d0e8e9f68018ae64d61f76db900cd20d60d3eed
Submitter: Jenkins
Branch: master

commit 6d0e8e9f68018ae64d61f76db900cd20d60d3eed
Author: Zane Bitter <email address hidden>
Date: Tue Sep 13 15:05:16 2016 -0400

    Always use context manager for DB transactions

    An exception could occur at any time, so we want to make sure we roll
    back any uncompleted transaction when that happens.

    Change-Id: Id0d5b65f44981bde24840b8a0d5fdc0b88f24d8a
    Related-bug: #1546431

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/369827

Changed in heat:
assignee: nobody → Steve Baker (steve-stevebaker)
status: Triaged → In Progress
Revision history for this message
Zane Bitter (zaneb) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/370301

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/369827
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=3000f904080d8dcd841d913dcd2ae658fb526c1a
Submitter: Jenkins
Branch: master

commit 3000f904080d8dcd841d913dcd2ae658fb526c1a
Author: Steve Baker <email address hidden>
Date: Fri Sep 16 03:29:59 2016 +0000

    Legacy delete attempt thread cancel before stop

    The error messages 'Command Out of Sync' are due to the threads being
    stopped in the middle of the database operations. This happens in the
    legacy action when delete is requested during a stack create.

    We have the thread cancel message but that was not being used in this
    case. Thread cancel should provide a more graceful way of ensuring the
    stack is in a FAILED state before the delete is attempted.

    This changes does the following in the delete_stack service method for
    legace engine:
    - if the stack is still locked, send thread cancel message
    - in a subthread wait for the lock to be released, or until a
      timeout based on the 4 minute cancel grace period
    - if the stack is still locked, do a thread stop as before

    Closes-Bug: #1499669
    Closes-Bug: #1546431
    Closes-Bug: #1536451
    Change-Id: I4cd613681f07d295955c4d8a06505d72d83728a0

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/373518

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/newton)

Reviewed: https://review.openstack.org/373518
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=2dd44db1b9cf4b789d8a083df6f97ae1fb5e22d5
Submitter: Jenkins
Branch: stable/newton

commit 2dd44db1b9cf4b789d8a083df6f97ae1fb5e22d5
Author: Steve Baker <email address hidden>
Date: Fri Sep 16 03:29:59 2016 +0000

    Legacy delete attempt thread cancel before stop

    The error messages 'Command Out of Sync' are due to the threads being
    stopped in the middle of the database operations. This happens in the
    legacy action when delete is requested during a stack create.

    We have the thread cancel message but that was not being used in this
    case. Thread cancel should provide a more graceful way of ensuring the
    stack is in a FAILED state before the delete is attempted.

    This changes does the following in the delete_stack service method for
    legace engine:
    - if the stack is still locked, send thread cancel message
    - in a subthread wait for the lock to be released, or until a
      timeout based on the 4 minute cancel grace period
    - if the stack is still locked, do a thread stop as before

    Closes-Bug: #1499669
    Closes-Bug: #1546431
    Closes-Bug: #1536451
    Change-Id: I4cd613681f07d295955c4d8a06505d72d83728a0
    (cherry picked from commit 3000f904080d8dcd841d913dcd2ae658fb526c1a)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Crag Wolfe (<email address hidden>) on branch: master
Review: https://review.openstack.org/372804

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 7.0.0.0rc2

This issue was fixed in the openstack/heat 7.0.0.0rc2 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 7.0.0

This issue was fixed in the openstack/heat 7.0.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 8.0.0.0b1

This issue was fixed in the openstack/heat 8.0.0.0b1 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.