[nailgun] Deadlock in deployment task during cluster deletion

Bug #1433528 reported by Alexander Kislitsky
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Invalid
Medium
Fuel Python (Deprecated)
7.0.x
Won't Fix
Medium
Fuel Python (Deprecated)
8.0.x
Won't Fix
Medium
Fuel Python (Deprecated)

Bug Description

Deadlock occurs in the process of deletion cluster during deployment.

nailgun/test/integration/test_charset_issues.py:74: in test_deletion_during_deployment
    headers=self.default_headers
.tox/py26/lib/python2.6/site-packages/webtest/app.py:312: in delete
    content_type=content_type)
.tox/py26/lib/python2.6/site-packages/webtest/app.py:626: in _gen_request
    expect_errors=expect_errors)
.tox/py26/lib/python2.6/site-packages/webtest/app.py:521: in do_request
    self._check_status(status, res)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <webtest.app.TestApp object at 0x774c8d0>, status = None
res = <400 Bad Request text/html body='{"message... []}'/719>

    def _check_status(self, status, res):
        if status == '*':
            return
        res_status = res.status
        if (isinstance(status, string_types) and '*' in status):
            if re.match(fnmatch.translate(status), res_status, re.I):
                return
        if isinstance(status, (list, tuple)):
            if res.status_int not in status:
                raise AppError(
                    "Bad response: %s (not one of %s for %s)\n%s",
                    res_status, ', '.join(map(str, status)),
                    res.request.url, res)
            return
        if status is None:
            if res.status_int >= 200 and res.status_int < 400:
                return
            raise AppError(
                "Bad response: %s (not 200 OK or 3xx redirect for %s)\n%s",
                res_status, res.request.url,
> res)
E AppError: Bad response: 400 Bad Request (not 200 OK or 3xx redirect for http://localhost/api/clusters/28/)
E '{"message": "(TransactionRollbackError) deadlock detected\\nDETAIL: Process 25782 waits for ShareLock on transaction 307254; blocked by process 25808.\\nProcess 25808 waits for ShareLock on transaction 307256; blocked by process 25782.\\nHINT: See server log for query details.\\n \'SELECT tasks.id AS tasks_id, tasks.cluster_id AS tasks_cluster_id, tasks.uuid AS tasks_uuid, tasks.name AS tasks_name, tasks.message AS tasks_message, tasks.status AS tasks_status, tasks.progress AS tasks_progress, tasks.result AS tasks_result, tasks.parent_id AS tasks_parent_id, tasks.weight AS tasks_weight \\\\nFROM tasks \\\\nWHERE tasks.cluster_id = %(cluster_id_1)s ORDER BY tasks.id ASC FOR UPDATE\' {\'cluster_id_1\': 28}", "errors": []}'

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/165355

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Anastasia Palkina (apalkina) wrote :

Deadlock on ISO #202

"build_id": "2015-03-16_22-54-44", "ostf_sha": "e86c961ceacfa5a8398b6cbda7b70a5f06afb476", "build_number": "202", "release_versions": {"2014.2-6.1": {"VERSION": {"build_id": "2015-03-16_22-54-44", "ostf_sha": "e86c961ceacfa5a8398b6cbda7b70a5f06afb476", "build_number": "202", "api": "1.0", "nailgun_sha": "874df0d06e32f14db77746cfeb2dd74d4a6e528c", "production": "docker", "python-fuelclient_sha": "2509c9b72cdcdbe46c141685a99b03cd934803be", "astute_sha": "93e427ac49109fa3fd8b0e1d0bb3d14092be2e8c", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "608b72a6f79a719cf01c35a19d0091fe20c8288a", "fuellib_sha": "924d73ae4766646e1c3a44d7b59c4120985e45f0"}}}, "auth_required": true, "api": "1.0", "nailgun_sha": "874df0d06e32f14db77746cfeb2dd74d4a6e528c", "production": "docker", "python-fuelclient_sha": "2509c9b72cdcdbe46c141685a99b03cd934803be", "astute_sha": "93e427ac49109fa3fd8b0e1d0bb3d14092be2e8c", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "608b72a6f79a719cf01c35a19d0091fe20c8288a", "fuellib_sha": "924d73ae4766646e1c3a44d7b59c4120985e45f0"

May be it happens during stop provisioning or earlier.

http://paste.openstack.org/show/193125/

All logs are here: https://drive.google.com/a/mirantis.com/file/d/0B6SjzarTGFxaSkpGMWpYc0YweXM/view?usp=sharing

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/165355
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=d0e7b5d7177998e17a70b39ae1fa1e467b58b98e
Submitter: Jenkins
Branch: master

commit d0e7b5d7177998e17a70b39ae1fa1e467b58b98e
Author: Alexander Kislitsky <email address hidden>
Date: Wed Mar 18 13:45:04 2015 +0300

    Deadlock in deletion cluster during deployment

    Cluster locking added into removing obsolete tasks
    in ApplyChangesTaskManager

    Change-Id: If31346ad7483e57660db92c40b2d763302ed36f8
    Closes-Bug: #1433528

Changed in fuel:
status: In Progress → Fix Committed
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :
Download full text (8.8 KiB)

Reproduced on RC2(288 ISO) when delete cluster action from UI was started

[7f843e605740] (cluster) Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/nailgun/api/v1/handlers/cluster.py", line 65, in DELETE task_manager.execute() File "/usr/lib/python2.6/site-packages/nailgun/task/manager.py", line 1001, in execute objects.NodeCollection.lock_for_update(nodes).all() File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py", line 2300, in all return list(self) File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py", line 2412, in __iter__ return self._execute_and_instances(context) File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py", line 2427, in _execute_and_instances result = conn.execute(querycontext.statement, self._params) File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 729, in execute return meth(self, multiparams, params) File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/elements.py", line 321, in _execute_on_connection return connection._execute_clauseelement(self, multiparams, params) File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 826, in _execute_clauseelement compiled_sql, distilled_params File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 958, in _execute_context context) File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1160, in _handle_dbapi_exception exc_info File "/usr/lib64/python2.6/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause reraise(type(exception), exception, tb=exc_tb) File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 951, in _execute_context context) File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py", line 436, in do_execute cursor.execute(statement, parameters) DBAPIError: (TransactionRollbackError) deadlock detected DETAIL: Process 1214 waits for ShareLock on transaction 4249; blocked by process 1178. Process 1178 waits for ShareLock on transaction 4250; blocked by process 1214. HINT: See server log for query details. 'SELECT nodes.id AS nodes_id, nodes.uuid AS nodes_uuid, nodes.cluster_id AS nodes_cluster_id, nodes.group_id AS nodes_group_id, nodes.name AS nodes_name, nodes.status AS nodes_status, nodes.meta AS nodes_meta, nodes.mac AS nodes_mac, nodes.ip AS nodes_ip, nodes.hostname AS nodes_hostname, nodes.manufacturer AS nodes_manufacturer, nodes.platform_name AS nodes_platform_name, nodes.kernel_params AS nodes_kernel_params, nodes.progress AS nodes_progress, nodes.os_platform AS nodes_os_platform, nodes.pending_addition AS nodes_pending_addition, nodes.pending_deletion AS nodes_pending_deletion, nodes.error_type AS nodes_error_type, nodes.error_msg AS nodes_error_msg, nodes.timestamp AS nodes_timestamp, nodes.online AS nodes_online, nodes.labels AS nodes_labels, nodes.roles AS nodes_roles, nodes.pending_roles AS nodes_pending_roles, nodes.primary_roles AS nodes_primary_roles, nodes.agent_checksum AS nodes_agent_checksum, nodes.replaced_deployment_info AS nodes_replaced_deployment_info, nodes.replaced_provisioning_info AS nodes_replaced_provisioning_info, nodes.network_template AS nodes...

Read more...

Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :
Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

[7f843e605740] (base) Error during execution of deployment task on environment '3': (TransactionRollbackError) deadlock detected DETAIL: Process 1177 waits for ShareLock on transaction 11695; blocked by process 1197. Process 1197 waits for ShareLock on transaction 11696; blocked by process 1177. HINT: See server log for query details. 'DELETE FROM tasks WHERE tasks.id = %(id)s' ({'id': 51}, {'id': 52}) [pid: 799|app: 0|req: 6121/18957] 10.109.0.19 () {36 vars in 527 bytes} [Sat Sep 12 20:20:20 2015] PUT /api/nodes/agent/ => generated 10 bytes in 1288 msecs (HTTP/1.1 200) 4 headers in 185 bytes (2 switches on core 0) [pid: 798|app: 0|req: 2030/18958] 10.109.0.2 () {44 vars in 793 bytes} [Sat Sep 12 20:20:19 2015] DELETE /api/tasks/47 => generated 0 bytes in 1406 msecs (HTTP/1.1 204) 3 headers in 161 bytes (2 switches on core 0) [pid: 796|app: 0|req: 7796/18959] 10.109.0.2 () {48 vars in 918 bytes} [Sat Sep 12 20:20:19 2015] PUT /api/clusters/3/changes => generated 324 bytes in 1411 msecs (HTTP/1.1 500) 4 headers in 197 bytes (2 switches on core 0)

Steps to reproduce:

1)Deploy cluster with 3 controllers and Neutron VLan
2) Add 2 compute node
3) Deploy cluster

Actual Result:
Deploy task was not started and warning message was showed.

After that deploy cluster changes - all work's fine

no longer affects: fuel/8.0.x
no longer affects: fuel/8.0.x
no longer affects: fuel/8.0.x
no longer affects: fuel/6.1.x
tags: added: tricky
tags: added: module-nailgun
removed: nailgun
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/224161

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Alexander Kislitsky (akislitsky)
status: Confirmed → In Progress
Dmitry Pyzhov (dpyzhov)
tags: added: tech-debt
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/224161
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=ad79aecc37d1b6e3003fffe132064433ba09e43d
Submitter: Jenkins
Branch: master

commit ad79aecc37d1b6e3003fffe132064433ba09e43d
Author: Alexander Kislitsky <email address hidden>
Date: Mon Sep 14 13:18:57 2015 +0300

    Detection of possible deadlocks improved

    Tracing of locking order of objects with ids added:
     - object modifications traced
     - object deletion traced
     - object getting traced
     - fetching first object traced
     - fetching one object traced
     - bulk upgrade query traced
     - bulk delete query traced
     - added mixin class for locking ordered objects with
       before deletion and updating

    Observing DB model added for ids tracing.
    Cleaning of locks on flush removed.
    Raising of errors on detection of possible deadlocks
    enabled only for deadlock_detector test.

    Change-Id: I73830cbcb84466171eb25f922fff4da2aaa25a1d
    Partial-Bug: #1433528
    Closes-Bug: #1452359

Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/8.0.x
Changed in fuel:
status: Confirmed → Fix Committed
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
tags: added: on-verification
Revision history for this message
Andrey Lavrentyev (alavrentyev) wrote :
Download full text (4.5 KiB)

The issue still can be reproduced on Fuel 8 (build 202) with the following steps:

1. Make deployment with 3 nodes with the following roles (per each): controller, compute, cinder
2. When Progress Bar hits 100% click 'Stop' button and Stop deployment
3. Check Log tabs, filter by Error and by RPC consumer

Log Error Msg:
[7fbb60a3a700] (receiverd) Traceback (most recent call last): File "/usr/lib/python2.6/site-packages/nailgun/rpc/receiverd.py", line 52, in consume_msg callback(**body["args"]) File "/usr/lib/python2.6/site-packages/nailgun/rpc/receiver.py", line 275, in deploy_resp objects.NodeCollection.lock_for_update(q_nodes).all() File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py", line 2300, in all return list(self) File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py", line 2412, in __iter__ return self._execute_and_instances(context) File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py", line 2427, in _execute_and_instances result = conn.execute(querycontext.statement, self._params) File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 729, in execute return meth(self, multiparams, params) File "/usr/lib64/python2.6/site-packages/sqlalchemy/sql/elements.py", line 321, in _execute_on_connection return connection._execute_clauseelement(self, multiparams, params) File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 826, in _execute_clauseelement compiled_sql, distilled_params File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 958, in _execute_context context) File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 1160, in _handle_dbapi_exception exc_info File "/usr/lib64/python2.6/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause reraise(type(exception), exception, tb=exc_tb) File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py", line 951, in _execute_context context) File "/usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py", line 436, in do_execute cursor.execute(statement, parameters) DBAPIError: (TransactionRollbackError) deadlock detected DETAIL: Process 1762 waits for ShareLock on transaction 27520; blocked by process 1125. Process 1125 waits for ShareLock on transaction 27521; blocked by process 1762. HINT: See server log for query details. 'SELECT nodes.id AS nodes_id, nodes.uuid AS nodes_uuid, nodes.cluster_id AS nodes_cluster_id, nodes.group_id AS nodes_group_id, nodes.name AS nodes_name, nodes.status AS nodes_status, nodes.meta AS nodes_meta, nodes.mac AS nodes_mac, nodes.ip AS nodes_ip, nodes.hostname AS nodes_hostname, nodes.manufacturer AS nodes_manufacturer, nodes.platform_name AS nodes_platform_name, nodes.kernel_params AS nodes_kernel_params, nodes.progress AS nodes_progress, nodes.os_platform AS nodes_os_platform, nodes.pending_addition AS nodes_pending_addition, nodes.pending_deletion AS nodes_pending_deletion, nodes.error_type AS nodes_error_type, nodes.error_msg AS nodes_error_msg, nodes.timestamp AS nodes_timestamp, nodes.online AS nodes_online, nodes.labels AS nodes_labels, nodes.roles AS nodes_roles, nodes.pending_roles AS nodes_pending_roles, node...

Read more...

Changed in fuel:
status: Fix Committed → Confirmed
Changed in fuel:
milestone: 8.0 → 9.0
status: Confirmed → New
Artem Roma (aroma-x)
Changed in fuel:
status: New → Confirmed
tags: removed: on-verification
Dmitry Pyzhov (dpyzhov)
no longer affects: fuel/future
Revision history for this message
Maciej Kwiek (maciej-iai) wrote :

I was not able to reproduce this on ISO 417. Marking as incomplete.

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Maciej Kwiek (maciej-iai)
status: Confirmed → Incomplete
status: Incomplete → Invalid
assignee: Maciej Kwiek (maciej-iai) → Fuel Python Team (fuel-python)
Revision history for this message
Alexander Kislitsky (akislitsky) wrote :

@Andrey, sorry for late response. Could you please provide information is deployment was stopped successfully or not. AFAIS this error shouldn't affect result of the stop operation and it should be successfully finished. Thus changing importance to Medium.

Deadlock wasn't reproduced on 200 times repeated stop deployment test.

Changed in fuel:
importance: High → Medium
status: Invalid → Incomplete
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Moving medium bug to 9.0 because we have SCF in action for 8.0

Revision history for this message
Dmitry Klenov (dklenov) wrote :

Moving to invalid as issue is more than a month in incomplete state.

Changed in fuel:
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.