Controller redeployment can lead to nailgun DB deadlock

Bug #1624230 reported by Valentyn Yakovlev
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Released
High
Georgy Kibardin
Nominated for Ocata by Georgy Kibardin
Future
Fix Committed
High
Georgy Kibardin
Newton
Fix Released
High
Georgy Kibardin

Bug Description

Steps to reproduce:
 1) deploy cluster:
   4 nodes with controller
   1 Cinder node

 2) Delete primary controller.
 3) Add another one controller.
 4) Re-deploy cluster.

Expected Result:
Cluster ready after re-deployment

Actual:
Redeployment fail with error

2016-09-16 16:27:17 ERROR [7f1886631880] (helpers) Extracting of actor_id failed
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nailgun/task/helpers.py", line 351, in prepare_action_log_kwargs
    actor_id = action_log.actor_id
AttributeError: 'NoneType' object has no attribute 'actor_id'
2016-09-16 15:42:06 ERROR [7f1886631880] (base) Unexpected exception occured
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nailgun/api/v1/handlers/base.py", line 297, in handle_errors
    return func(cls, *args, **kwargs)
  File "<string>", line 2, in PUT
  File "/usr/lib/python2.7/site-packages/nailgun/api/v1/handlers/base.py", line 355, in validate
    return func(cls, *args, **kwargs)
  File "<string>", line 2, in PUT
  File "/usr/lib/python2.7/site-packages/nailgun/api/v1/handlers/base.py", line 381, in serialize
    resp = func(cls, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nailgun/api/v1/handlers/node.py", line 207, in PUT
    self.collection.single.update_by_agent(node, nd)
  File "/usr/lib/python2.7/site-packages/nailgun/objects/node.py", line 928, in update_by_agent
    return cls.update(instance, data)
  File "/usr/lib/python2.7/site-packages/nailgun/objects/node.py", line 779, in update
    super(Node, cls).update(instance, data)
  File "/usr/lib/python2.7/site-packages/nailgun/objects/base.py", line 96, in update
    db().flush()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2019, in flush
    self._flush(objects)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2137, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2101, in _flush
    flush_context.execute()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute
    rec.execute(self)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute
    uow
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 170, in save_obj
    mapper, table, update)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 706, in _emit_update_statements
    execute(statement, multiparams)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
    return meth(self, multiparams, params)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
    context)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
    exc_info
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 200, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
    context)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
    cursor.execute(statement, parameters)
OperationalError: (psycopg2.extensions.TransactionRollbackError) deadlock detected
DETAIL: Process 8983 waits for ShareLock on transaction 2768; blocked by process 26677.
Process 26677 waits for ShareLock on transaction 2769; blocked by process 8983.
HINT: See server log for query details.
CONTEXT: SQL statement "SELECT 1 FROM ONLY "public"."clusters" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
 [SQL: 'UPDATE nodes SET mac=%(mac)s, agent_checksum=%(agent_checksum)s WHERE nodes.id = %(nodes_id)s'] [parameters: {'agent_checksum': u'deeeec8fce9c1e3ed5c150684b86aa008bebc1ec', 'mac': u'64:86:C0:AD:16:15', 'nodes_id': 9}]
[pid: 7727|app: 0|req: 440/790] 10.109.10.11 () {40 vars in 559 bytes} [Fri Sep 16 15:42:05 2016] PUT /api/nodes/agent/ => generated 39 bytes in 1294 msecs (HTTP/1.1 500) 5 headers in 223 bytes (2 switches on core 0)
[pid: 7727|app: 0|req: 441/791] 10.109.10.1 () {38 vars in 554 bytes} [Fri Sep 16 15:42:07 2016] GET /api/tasks/1 => generated 276 bytes in 8 msecs (HTTP/1.1 200) 4 headers in 185 bytes (2 switches on core 0)

Revision history for this message
Valentyn Yakovlev (vyakovlev) wrote :
Changed in fuel:
assignee: nobody → Fuel QA Team (fuel-qa)
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel Sustaining (fuel-sustaining-team)
Revision history for this message
Dmitry Klenov (dklenov) wrote :

@Valentyn Yakovlev, please specify fuel version and return the issue back to New state.

Changed in fuel:
importance: Undecided → High
tags: added: area-library
Changed in fuel:
status: New → Incomplete
Revision history for this message
Valentyn Yakovlev (vyakovlev) wrote :

shotgun2 short-report
cat /etc/fuel_build_id:
 495
cat /etc/fuel_build_number:
 495
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-agent-9.0.0-1.mos291.noarch
 fuel-bootstrap-cli-9.0.0-1.mos291.noarch
 fuel-ui-9.0.0-1.mos2814.noarch
 rubygem-astute-9.0.0-1.mos770.noarch
 fuel-setup-9.0.0-1.mos6355.noarch
 shotgun-9.0.0-1.mos90.noarch
 python-fuelclient-9.0.0-1.mos354.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8853.noarch
 python-packetary-9.0.0-1.mos149.noarch
 fuel-ostf-9.0.0-1.mos944.noarch
 nailgun-mcagents-9.0.0-1.mos770.noarch
 fuel-migrate-9.0.0-1.mos8594.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-utils-9.0.0-1.mos8594.noarch
 fuelmenu-9.0.0-1.mos275.noarch
 fuel-misc-9.0.0-1.mos8594.noarch
 fuel-mirror-9.0.0-1.mos149.noarch
 fuel-notify-9.0.0-1.mos8594.noarch
 fuel-nailgun-9.0.0-1.mos8853.noarch
 fuel-library9.0-9.0.0-1.mos8594.noarch
 fuel-9.0.0-1.mos6355.noarch
 fuel-release-9.0.0-1.mos6355.noarch
 fuel-openstack-metadata-9.0.0-1.mos8853.noarch

Changed in fuel:
status: Incomplete → New
Dmitry Klenov (dklenov)
Changed in fuel:
milestone: none → 9.1
status: New → Confirmed
description: updated
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Dmitry Guryanov (dguryanov)
Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

This is a random failure and it needs more time for investigation. Moving to 9.2.

Changed in fuel:
milestone: 9.1 → 9.2
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Dmitry Guryanov (dguryanov) → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-python
removed: area-library
summary: - fail re-deploys after replacing the main controller
+ Controller redeployment can lead to nailgun DB deadlock
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Georgy Kibardin (gkibardin)
Revision history for this message
Georgy Kibardin (gkibardin) wrote :

From current log it is almost impossible to figure the second transaction. I'm trying to reproduce. The reproduction is blocked by https://bugs.launchpad.net/fuel/+bug/1648732

Changed in fuel:
status: Confirmed → Invalid
status: Invalid → In Progress
Revision history for this message
Georgy Kibardin (gkibardin) wrote :

Cannot reproduce.

Changed in fuel:
status: In Progress → Incomplete
Changed in fuel:
milestone: 9.2 → 9.3
Revision history for this message
Leontii Istomin (listomin) wrote :

Has been reproduced on ~400-nodes environment:
from nailgun.app log: http://paste.openstack.org/show/596030/
from postgresql log: http://paste.openstack.org/show/596035/

[root@fuel ~]# ps aux | grep 25785
root 19595 0.0 0.0 112648 972 pts/2 S+ 12:09 0:00 grep --color=auto 25785
postgres 25785 0.3 0.1 340048 142576 ? Ss Jan16 33:28 postgres: nailgun nailgun 10.21.0.2(60143) idle
[root@fuel ~]# ps aux | grep 7371
postgres 7371 1.1 0.1 341064 147300 ? Ss Jan17 101:14 postgres: nailgun nailgun 10.21.0.2(33717) idle
root 19605 0.0 0.0 112652 972 pts/2 S+ 12:09 0:00 grep --color=auto 7371

Will try to reproduce with additional logging in postgresql:
postgresql::server::config_entry { 'log_statement':
      value => 'all',
}

postgresql::server::config_entry { 'log_line_prefix':
      value => '%t %c %x ',
}

Changed in fuel:
status: Incomplete → New
Revision history for this message
Roman Rufanov (rrufanov) wrote :

Fix for 1648732 was release, it is unblocked

Changed in fuel:
status: New → Confirmed
Revision history for this message
Sergey Galkin (sgalkin) wrote :

[root@fuel pg_log]# grep deadlock postgresql-Tue.log
2017-01-24 17:19:42 UTC 58874217.6c91 4918627 ERROR: deadlock detected
2017-01-24 17:20:42 UTC 588760c4.18b 4919171 ERROR: deadlock detected
2017-01-24 17:25:36 UTC 58874217.6c91 4922171 ERROR: deadlock detected
2017-01-24 17:25:38 UTC 588760c4.18b 4922187 ERROR: deadlock detected
2017-01-24 17:25:38 UTC 588760b6.183 4922197 ERROR: deadlock detected
2017-01-24 17:25:39 UTC 58874217.6c91 4922186 ERROR: deadlock detected
2017-01-24 18:16:25 UTC 58874217.6c91 4946083 ERROR: deadlock detected
2017-01-24 18:20:51 UTC 58874217.6c91 4948926 ERROR: deadlock detected
2017-01-24 18:29:26 UTC 588760c6.18e 4954100 ERROR: deadlock detected
2017-01-24 18:47:38 UTC 58874217.6c91 4965626 ERROR: deadlock detected
2017-01-24 18:50:45 UTC 588760c5.18d 4967597 ERROR: deadlock detected
2017-01-24 18:55:28 UTC 58874217.6c91 4970344 ERROR: deadlock detected
2017-01-24 19:40:42 UTC 58874217.6c91 4992839 ERROR: deadlock detected
2017-01-24 19:45:42 UTC 588760b6.183 4995935 ERROR: deadlock detected
2017-01-24 19:57:48 UTC 58874217.6c91 5003711 ERROR: deadlock detected
2017-01-24 20:04:33 UTC 58874217.6c91 5007770 ERROR: deadlock detected
2017-01-24 20:11:34 UTC 588760c4.18c 5012187 ERROR: deadlock detected

Revision history for this message
Sergey Galkin (sgalkin) wrote :
Revision history for this message
Georgy Kibardin (gkibardin) wrote :

There are two concurrent transactions:
1. Node update by nailgun agent. First it updates node's meta and then ip and mac.
2. Node progress update by receiver. First it makes "select for update" for node and then updates progress etc.

Figuring how to fix this better.

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/425153

Revision history for this message
Georgy Kibardin (gkibardin) wrote :
  • 1 Edit (114.2 KiB, text/plain)

Transaction log

Revision history for this message
Georgy Kibardin (gkibardin) wrote :
  • 2 Edit (223.9 KiB, text/plain)

And another one (opposite example)

Revision history for this message
Sergey Galkin (sgalkin) wrote :

it seems fix does not help
2017-01-25 15:05:31 UTC 58874217.6c91 5527561 ERROR: deadlock detected
2017-01-25 15:13:41 UTC 5888aa4c.76ea 5532825 ERROR: deadlock detected
2017-01-25 15:24:32 UTC 58874217.6c91 5540151 ERROR: deadlock detected
[root@fuel pg_log]# date
Wed Jan 25 15:25:31 UTC 2017

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Georgy Kibardin (<email address hidden>) on branch: master
Review: https://review.openstack.org/425153

Revision history for this message
Mikhail Chernik (mchernik) wrote :

Reproduced on 9.2 RC1 during deployment of 203-nodes environment from scratch.

postgresql log: http://mos-scale-share.mirantis.com/postgresql-Wed.log.gz

Full diagnostic snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2017-01-25_20-30-57.tar

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/426076

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/426151

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-nailgun-agent (master)

Reviewed: https://review.openstack.org/426076
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=5e7127f9ec13c0ed0e62b9d9150d835ca2a27b72
Submitter: Jenkins
Branch: master

commit 5e7127f9ec13c0ed0e62b9d9150d835ca2a27b72
Author: georgy <email address hidden>
Date: Fri Jan 27 08:58:52 2017 +0300

    Fix volatile CPU 'size'

    The 'size' parameter for CPU returned by lshw is current CPU clock
    which changes constantly breaking nailgun optimization which expects
    hw data to not change each time.
    This also must reduce probability of deadlocks in #1624230.

    Change-Id: I7d9b5282991b17424d458c2612dfa4eeeb52be48
    Partial-Bug: #1624230

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/430064

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-nailgun-agent (stable/newton)

Reviewed: https://review.openstack.org/430064
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=dd36db6e062ede9a1d4af0da1c83c210eae88d02
Submitter: Jenkins
Branch: stable/newton

commit dd36db6e062ede9a1d4af0da1c83c210eae88d02
Author: georgy <email address hidden>
Date: Fri Jan 27 08:58:52 2017 +0300

    Fix volatile CPU 'size'

    The 'size' parameter for CPU returned by lshw is current CPU clock
    which changes constantly breaking nailgun optimization which expects
    hw data to not change each time.
    This also must reduce probability of deadlocks in #1624230.

    Change-Id: I7d9b5282991b17424d458c2612dfa4eeeb52be48
    Partial-Bug: #1624230
    (cherry picked from commit 5e7127f9ec13c0ed0e62b9d9150d835ca2a27b72)

tags: added: in-stable-newton
Changed in fuel:
milestone: 9.x-updates → 9.2-mu-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-nailgun-agent (stable/mitaka)

Reviewed: https://review.openstack.org/426151
Committed: https://git.openstack.org/cgit/openstack/fuel-nailgun-agent/commit/?id=026399208d614fe8629a90e0f5717889bdc8af34
Submitter: Jenkins
Branch: stable/mitaka

commit 026399208d614fe8629a90e0f5717889bdc8af34
Author: georgy <email address hidden>
Date: Fri Jan 27 08:58:52 2017 +0300

    Fix volatile CPU 'size'

    The 'size' parameter for CPU returned by lshw is current CPU clock
    which changes constantly breaking nailgun optimization which expects
    hw data to not change each time.
    This also must reduce probability of deadlocks in #1624230.

    Change-Id: I7d9b5282991b17424d458c2612dfa4eeeb52be48
    Partial-Bug: #1624230
    (cherry picked from commit 5e7127f9ec13c0ed0e62b9d9150d835ca2a27b72)

tags: added: in-stable-mitaka
Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Ekaterina Shutova (eshutova) wrote :

Verified on 9.2 mu1 updates.
Environment redeployed successfully with described scenario.

Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (stable/mitaka)

Change abandoned by Andreas Jaeger (<email address hidden>) on branch: stable/mitaka
Review: https://review.opendev.org/426156
Reason: This repo is retired now, no further work will get merged.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.