MySQL crash after restart via Pacemaker: aborting due to conflicting prims

Bug #1617400 reported by Aleksei Shishkin
This bug affects 2 people
Affects             Status         Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Committed  Critical    Sergii Golovatiuk
  (Nominated for Ocata by Sergii Golovatiuk)
Mitaka              Fix Released   Critical    Fedor Zhadaev
Newton              Fix Committed  Critical    Sergii Golovatiuk

Bug Description

Detailed bug description:
  MySQL crashes after being restarted with the ‘pcs resource restart clone_p_mysqld’ command in MOS 9.0.
  About 50% of restart attempts end with the following MySQL error: WSREP: exception from gcomm, backend must be restarted: 281c836a aborting due to conflicting prims: older overrides (FATAL)

Steps to reproduce:
  Run the following command:
  pcs resource restart clone_p_mysqld

Expected results:
  MySQL restarts without errors in the log files.
  After the restart, wsrep_cluster_size equals the number of controllers (3 in my case).

Actual result:
  MySQL restarted, and the following error appeared in the MySQL log file on one of the controller nodes:
    2016-08-23T16:59:02.214831+02:00 node-3 mysqld: 2016-08-23 16:59:02 22172 [ERROR] WSREP: exception from gcomm, backend must be restarted: 1a2cd5b7 aborting due to conflicting prims: older overrides (FATAL)

  The node with this error reports wsrep_cluster_size=1; the other two nodes report wsrep_cluster_size=2.
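
The symptom above can be checked mechanically once wsrep_cluster_size has been collected from every controller. The following is a minimal Python sketch, not part of the original report: the function names are hypothetical, and the assignment of sizes to node-1 and node-2 is an assumption for illustration (the report only states that the failing node sees 1 and the other two see 2).

```python
from collections import defaultdict


def group_by_cluster_size(sizes):
    """Group node names by the wsrep_cluster_size each node reports."""
    groups = defaultdict(list)
    for node, size in sorted(sizes.items()):
        groups[size].append(node)
    return dict(groups)


def is_healthy(sizes, expected):
    """True only if every node reports the expected cluster size."""
    return all(size == expected for size in sizes.values())


# Values as described in the actual result; node names are illustrative.
observed = {"node-1": 2, "node-2": 2, "node-3": 1}
print(group_by_cluster_size(observed))
print(is_healthy(observed, expected=3))
```

More than one group, or any size below the controller count, indicates that the cluster did not re-form as a single component.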

Reproducibility:
 This error reproduces in about 50% of restart attempts.
 It was also observed in a customer environment.

Workaround:
 Replace the native MOS 9.0 OCF script (/usr/lib/ocf/resource.d/fuel/mysql-wss) with the version from the master branch: https://github.com/openstack/fuel-library/blob/master/files/fuel-ha-utils/ocf/mysql-wss

Impact:
 Critical impact
 Leads to MySQL split-brain

Description of the environment:
 Operating system: Ubuntu 14.04
 Versions of components: MOS 9.0

Additional information:
Logs from controllers:
 controller-1
   2016-08-26T16:59:02.221676+02:00 node-1 mysqld: 2016-08-26 16:59:02 24841 [Warning] WSREP: bae1b7a9 conflicting prims: my prim: view_id(PRIM,bae1b7a9,6) other prim: view_id(PRIM,1a2cd5b7,1)
   2016-08-26T16:59:02.221682+02:00 node-1 mysqld: 2016-08-26 16:59:02 24841 [Warning] WSREP: bae1b7a9 discarding other prim view: older overrides

 controller-2
   2016-08-26T16:59:02.218143+02:00 node-2 mysqld: 2016-08-26 16:59:02 9624 [ERROR] WSREP: failed to open gcomm backend connection: 131: 20c98253 last prims not consistent (FATAL)
   2016-08-26T16:59:02.218143+02:00 node-2 mysqld: 2016-08-26 16:59:02 9624 [ERROR] WSREP: gcs/src/gcs.cpp:long int gcs_open(gcs_conn_t*, const char*, const char*, bool)():1379: Failed to open channel 'openstack' at 'gcomm://192.168.2.7,192.168.2.8,192.168.2.9': -131 (State not recoverable)

 controller-3
   2016-08-26T16:59:02.214527+02:00 node-3 mysqld: 2016-08-26 16:59:02 22172 [Warning] WSREP: 1a2cd5b7 conflicting prims: my prim: view_id(PRIM,1a2cd5b7,1) other prim: view_id(PRIM,bae1b7a9,6)
   2016-08-26T16:59:02.214831+02:00 node-3 mysqld: 2016-08-26 16:59:02 22172 [ERROR] WSREP: exception from gcomm, backend must be restarted: 1a2cd5b7 aborting due to conflicting prims: older overrides (FATAL)
   2016-08-26T16:59:02.470825+02:00 node-3 mysqld: 2016-08-26 16:59:02 22172 [Warning] WSREP: Send action {(nil), 7222, TORDERED} returned -103 (Software caused connection abort)
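
The "conflicting prims" warnings on controller-1 and controller-3 mirror each other: each node lists its own primary view and the other node's. Below is a small Python sketch for pulling the view IDs out of such lines when comparing logs from several nodes. The regular expression is derived only from the two warning lines quoted above, so it is an assumption that other Galera versions format the message the same way; the function names are hypothetical.

```python
import re

# Pattern built from the warning lines quoted in this report.
PRIM_RE = re.compile(
    r"WSREP: (?P<node>[0-9a-f]+) conflicting prims: "
    r"my prim: view_id\(PRIM,(?P<my_uuid>[0-9a-f]+),(?P<my_seq>\d+)\) "
    r"other prim: view_id\(PRIM,(?P<other_uuid>[0-9a-f]+),(?P<other_seq>\d+)\)"
)


def parse_conflict(line):
    """Return the two primary views named in a 'conflicting prims' line, or None."""
    m = PRIM_RE.search(line)
    if m is None:
        return None
    return {
        "node": m["node"],
        "my_prim": (m["my_uuid"], int(m["my_seq"])),
        "other_prim": (m["other_uuid"], int(m["other_seq"])),
    }


def mirrored(a, b):
    """True if two parsed records describe the same conflict from both sides."""
    return a["my_prim"] == b["other_prim"] and a["other_prim"] == b["my_prim"]


node1 = parse_conflict(
    "2016-08-26T16:59:02.221676+02:00 node-1 mysqld: 2016-08-26 16:59:02 24841 "
    "[Warning] WSREP: bae1b7a9 conflicting prims: my prim: "
    "view_id(PRIM,bae1b7a9,6) other prim: view_id(PRIM,1a2cd5b7,1)"
)
node3 = parse_conflict(
    "2016-08-26T16:59:02.214527+02:00 node-3 mysqld: 2016-08-26 16:59:02 22172 "
    "[Warning] WSREP: 1a2cd5b7 conflicting prims: my prim: "
    "view_id(PRIM,1a2cd5b7,1) other prim: view_id(PRIM,bae1b7a9,6)"
)
print(node1)
print(mirrored(node1, node3))
```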

tags: added: area-library
Changed in fuel:
milestone: none → 10.0
Aleksei Shishkin (ashishkin) wrote :

A customer found this issue on MOS 9.0.
A backport is needed.

Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
importance: Undecided → Critical
Maksim Malchuk (mmalchuk) wrote :

Please provide the diagnostic snapshot.

Changed in fuel:
status: New → Incomplete
Miroslav Anashkin (manashkin) wrote :

Maksim,

this issue is already fixed in the latest version of the OCF script:
https://github.com/openstack/fuel-library/blob/master/files/fuel-ha-utils/ocf/mysql-wss

This bug was filed intentionally, to get the fix included in the upcoming 9.1 version before the code freeze.

Changed in fuel:
status: Incomplete → Confirmed
Maksim Malchuk (mmalchuk) wrote :

Miroslav, we backported almost all of the MySQL OCF changes from master to stable/mitaka. I believe the issue is already fixed in 9.1. To check this, we need more information.

Changed in fuel:
status: Confirmed → Incomplete
Anton Matveev (amatveev)
tags: added: sla1
Changed in fuel:
status: Incomplete → Confirmed
status: Confirmed → Incomplete
Changed in fuel:
status: Incomplete → Triaged
Changed in fuel:
milestone: 10.0 → 10.1
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/460948

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Vladimir Kuklin (vkuklin)
status: Triaged → In Progress
Changed in fuel:
milestone: 10.1 → 11.x-updates
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Sergii Golovatiuk (sgolovatiuk)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/462134

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/462135

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/462136

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/460948
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=c8373d4aea13b881bfbd17ff060516ca35055b49
Submitter: Jenkins
Branch: master

commit c8373d4aea13b881bfbd17ff060516ca35055b49
Author: Vladimir Kuklin <email address hidden>
Date: Fri Apr 28 13:43:49 2017 +0300

    Fix race condition for primary component bootstrap

    Create node is_pc flag before starting to check if there
    is more than one of those flags. Thus, we avoid race condition
    when there is 0 is_pc flags and galera starts with --wsrep-new-cluster
    on 2 nodes.

    We set it before the check and, as setting them is synchronous through
    Pacemaker CIB, in that case when >1 nodes attempt to bootstrap with
    --wsrep-new-cluster, only one node will see <= 1 is_pc flags. Others
    will see more than one and fail and reattempt to start. At that point
    one of the nodes will already be bootstrapped, thus reelection will not
    be triggered and the section of bootstrap will be skipped

    Change-Id: I82a71132eef7877ac7ab1ed04263044b3b1e8d9b
    Closes-bug: #1617400
    Signed-off-by: Sergii Golovatiuk <email address hidden>
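
The ordering argument in this commit message can be illustrated with a small model. The Python sketch below is not the OCF script: it treats the is_pc flags as a plain shared set, gives each node two steps, and enumerates every interleaving. The "check first" variant is an assumed simplification of the pre-fix behaviour, and the retry after a failed start is not modelled. All names are hypothetical.

```python
from itertools import permutations


def interleavings(nodes):
    """Yield every global ordering of two steps per node, keeping each
    node's own steps in order."""
    steps = [(n, i) for n in nodes for i in (0, 1)]
    for order in permutations(steps):
        if all(order.index((n, 0)) < order.index((n, 1)) for n in nodes):
            yield order


def run(order, set_first):
    """Replay one interleaving and return the nodes that would bootstrap."""
    flags = set()          # stands in for is_pc flags in the Pacemaker CIB
    seen = {}              # flag count each node observed (check-first only)
    bootstrappers = set()
    for node, step in order:
        if set_first:
            if step == 0:
                flags.add(node)              # set own flag, then ...
            elif len(flags) <= 1:            # ... count the flags
                bootstrappers.add(node)
        else:
            if step == 0:
                seen[node] = len(flags)      # count the flags, then ...
            elif seen[node] == 0:
                flags.add(node)              # ... set own flag and bootstrap
                bootstrappers.add(node)
    return bootstrappers


def worst_case(nodes, set_first):
    """Largest number of simultaneous bootstrappers over all interleavings."""
    return max(len(run(o, set_first)) for o in interleavings(nodes))


nodes = ["node-1", "node-2"]
print(worst_case(nodes, set_first=False))  # 2
print(worst_case(nodes, set_first=True))   # 1
```

In this model, setting the flag before counting means at most one node can ever observe a count of one, which matches the claim in the commit message; some interleavings leave no bootstrapper at all, which is where the reattempt described there comes in.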

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/newton)

Reviewed: https://review.openstack.org/462135
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=31efd1e2c90cc0ad203423cf44a4366c6f04e024
Submitter: Jenkins
Branch: stable/newton

commit 31efd1e2c90cc0ad203423cf44a4366c6f04e024
Author: Vladimir Kuklin <email address hidden>
Date: Fri Apr 28 13:43:49 2017 +0300

    Fix race condition for primary component bootstrap

    Create node is_pc flag before starting to check if there
    is more than one of those flags. Thus, we avoid race condition
    when there is 0 is_pc flags and galera starts with --wsrep-new-cluster
    on 2 nodes.

    We set it before the check and, as setting them is synchronous through
    Pacemaker CIB, in that case when >1 nodes attempt to bootstrap with
    --wsrep-new-cluster, only one node will see <= 1 is_pc flags. Others
    will see more than one and fail and reattempt to start. At that point
    one of the nodes will already be bootstrapped, thus reelection will not
    be triggered and the section of bootstrap will be skipped

    Change-Id: I82a71132eef7877ac7ab1ed04263044b3b1e8d9b
    Closes-bug: #1617400
    Signed-off-by: Sergii Golovatiuk <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/ocata)

Reviewed: https://review.openstack.org/462134
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=afa04a86d9e7acb677366325ce42cc77e0c8a324
Submitter: Jenkins
Branch: stable/ocata

commit afa04a86d9e7acb677366325ce42cc77e0c8a324
Author: Vladimir Kuklin <email address hidden>
Date: Fri Apr 28 13:43:49 2017 +0300

    Fix race condition for primary component bootstrap

    Create node is_pc flag before starting to check if there
    is more than one of those flags. Thus, we avoid race condition
    when there is 0 is_pc flags and galera starts with --wsrep-new-cluster
    on 2 nodes.

    We set it before the check and, as setting them is synchronous through
    Pacemaker CIB, in that case when >1 nodes attempt to bootstrap with
    --wsrep-new-cluster, only one node will see <= 1 is_pc flags. Others
    will see more than one and fail and reattempt to start. At that point
    one of the nodes will already be bootstrapped, thus reelection will not
    be triggered and the section of bootstrap will be skipped

    Change-Id: I82a71132eef7877ac7ab1ed04263044b3b1e8d9b
    Closes-bug: #1617400
    Signed-off-by: Sergii Golovatiuk <email address hidden>

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/462136
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=179b27932f6f792693bb05bce6de5bc004085f34
Submitter: Jenkins
Branch: stable/mitaka

commit 179b27932f6f792693bb05bce6de5bc004085f34
Author: Vladimir Kuklin <email address hidden>
Date: Fri Apr 28 13:43:49 2017 +0300

    Fix race condition for primary component bootstrap

    Create node is_pc flag before starting to check if there
    is more than one of those flags. Thus, we avoid race condition
    when there is 0 is_pc flags and galera starts with --wsrep-new-cluster
    on 2 nodes.

    We set it before the check and, as setting them is synchronous through
    Pacemaker CIB, in that case when >1 nodes attempt to bootstrap with
    --wsrep-new-cluster, only one node will see <= 1 is_pc flags. Others
    will see more than one and fail and reattempt to start. At that point
    one of the nodes will already be bootstrapped, thus reelection will not
    be triggered and the section of bootstrap will be skipped

    Change-Id: I82a71132eef7877ac7ab1ed04263044b3b1e8d9b
    Closes-bug: #1617400
    Signed-off-by: Sergii Golovatiuk <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/465674

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/465674
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=3e8fe44dbe77a4b6595f3fe0aac5d7b62d777e1a
Submitter: Jenkins
Branch: stable/mitaka

commit 3e8fe44dbe77a4b6595f3fe0aac5d7b62d777e1a
Author: Vladimir Kuklin <email address hidden>
Date: Wed May 17 19:42:09 2017 +0300

    Introduce critical section on master election process

    This commit adds a node attribute with value which
    means the timestamp of when the election process started.
    If we have election on any node in process we sleep for a while
    unless the attribute is outdated.

    We start the election only if the attributes for all nodes are outdated
    or if they do not exist.

    This prevents us from hitting rare condition when several nodes
    start simultaneously but do not agree on the master node due to
    race condition in MySQL start time and pacemaker attribute setting

    Change-Id: I7f4728b75ce5577338dff182634b608823cff74e
    Closes-bug: #1617400
    Co-Authored-By: Fedor Zhadaev <email address hidden>
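
The rule described in this commit message (start the election only if every node's election timestamp is outdated or missing, otherwise sleep) can be sketched as follows. This is a Python illustration under assumptions, not the script's code: the function names, the max_age threshold, and the use of None for a missing attribute are all hypothetical.

```python
def may_start_election(started_at, now, max_age):
    """Return True if no node has a fresh election timestamp.

    started_at maps node name -> election start timestamp in seconds,
    or None when the node has no such attribute.
    """
    return all(ts is None or now - ts > max_age for ts in started_at.values())


def next_action(started_at, now, max_age):
    """Decide between entering the critical section and waiting."""
    return "elect" if may_start_election(started_at, now, max_age) else "sleep"


# node-2 started an election 5 seconds ago, so others should wait.
print(next_action({"node-1": None, "node-2": 995, "node-3": None}, now=1000, max_age=60))
# The only timestamp present is 400 seconds old, so it is treated as outdated.
print(next_action({"node-1": 600, "node-2": None, "node-3": None}, now=1000, max_age=60))
```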

Dmitry (dtsapikov) wrote :

Verified on 9.2+mu2

tags: added: on-verification
tags: removed: on-verification
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/478413

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/478413
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=81274177d6fa32e787b281f6bc22bf27d9f6a0ae
Submitter: Jenkins
Branch: stable/mitaka

commit 81274177d6fa32e787b281f6bc22bf27d9f6a0ae
Author: Dmitry Sutyagin <email address hidden>
Date: Wed Jun 28 11:38:45 2017 +0400

    Make election attribute name unique for each node

    This commit changes logic of setting election attribute.
    Using different names of election attribute for each node resolves
    issue when one node cleared attribute set by another node.

    Change-Id: Id22d8a4d9cda4b4a2efda447eab7e30a1d6c9410
    Closes-bug: #1617400
    Co-Authored-By: Denis Ipatov <email address hidden>
    Co-Authored-By: Dmitry Sutyagin <email address hidden>
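
The collision this commit addresses can be shown with a toy model. The Python sketch below assumes, for illustration only, that election attributes live in one shared key/value store; the attribute naming scheme and function names are hypothetical and are not taken from the script.

```python
def attr_name(node, unique):
    """Attribute name used by a node: shared, or suffixed with the node name."""
    return f"election-{node}" if unique else "election"


def node1_marker_survives(unique):
    """Replay: node-1 and node-2 both mark an election, then node-2 clears
    its marker. Return True if node-1's marker is still present."""
    store = {}
    store[attr_name("node-1", unique)] = 100       # node-1 starts election
    store[attr_name("node-2", unique)] = 101       # node-2 starts election
    store.pop(attr_name("node-2", unique), None)   # node-2 clears its marker
    return attr_name("node-1", unique) in store


print(node1_marker_survives(unique=False))  # False
print(node1_marker_survives(unique=True))   # True
```

With a shared name, node-2's cleanup also removes the only record that node-1 is still electing; with per-node names each node can only clear its own marker.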

tags: added: in-stable-mitaka
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library ocata-eol

This issue was fixed in the openstack/fuel-library ocata-eol release.
