MySQL crash after restart via Pacemaker: aborting due to conflicting prims

Bug #1617400 reported by Aleksei Shishkin on 2016-08-26
This bug affects 2 people
Affects             Importance  Assigned to
Fuel for OpenStack  Critical    Sergii Golovatiuk
  (nominated for Ocata by Sergii Golovatiuk)
Mitaka              Critical    Fedor Zhadaev
Newton              Critical    Sergii Golovatiuk

Bug Description

Detailed bug description:
  MySQL crashes after being restarted with the 'pcs resource restart clone_p_mysqld' command in MOS 9.0.
  Almost 50% of MySQL restart attempts end with the following MySQL error: WSREP: exception from gcomm, backend must be restarted: 281c836a aborting due to conflicting prims: older overrides (FATAL)

Steps to reproduce:
  Run the following command:
  pcs resource restart clone_p_mysqld

Expected results:
  MySQL restarts without errors in the log files.
  After the restart, wsrep_cluster_size equals the number of controllers (3 in my case).

Actual result:
  MySQL restarted, and an error appeared in the MySQL log file on one of the controller nodes:
    2016-08-23T16:59:02.214831+02:00 node-3 mysqld: 2016-08-23 16:59:02 22172 [ERROR] WSREP: exception from gcomm, backend must be restarted: 1a2cd5b7 aborting due to conflicting prims: older overrides (FATAL)

  The node with this error reports wsrep_cluster_size=1; the other two nodes report wsrep_cluster_size=2.
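
The split described above can be spotted by comparing each node's wsrep_cluster_size with the expected controller count. A minimal sketch, assuming the values have already been collected from every controller (for example with SHOW STATUS LIKE 'wsrep_cluster_size'); the function names and the dictionary layout are illustrative and not part of Fuel:

```python
from collections import defaultdict


def group_by_cluster_size(cluster_sizes):
    """Group node names by the wsrep_cluster_size each of them reports."""
    groups = defaultdict(list)
    for node, size in sorted(cluster_sizes.items()):
        groups[size].append(node)
    return dict(groups)


def is_healthy(cluster_sizes, expected_size):
    """True only if every node sees the full cluster."""
    return all(size == expected_size for size in cluster_sizes.values())


# Values reported in this bug for a 3-controller environment.
observed = {"node-1": 2, "node-2": 2, "node-3": 1}
print(group_by_cluster_size(observed))
print(is_healthy(observed, expected_size=3))
```

More than one group, or any size below the controller count, indicates that the nodes no longer form a single cluster.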

Reproducibility:
 This error can be reproduced in almost 50% of restart attempts.
 Also found on a customer environment.

Workaround:
 Replace the native MOS 9.0 OCF script (/usr/lib/ocf/resource.d/fuel/mysql-wss) with the one from the master branch: https://github.com/openstack/fuel-library/blob/master/files/fuel-ha-utils/ocf/mysql-wss

Impact:
 Critical impact
 Leads to MySQL split-brain

Description of the environment:
 Operating system: Ubuntu 14.04
 Versions of components: MOS 9.0

Additional information:
Logs from controllers:
 controller-1
   2016-08-26T16:59:02.221676+02:00 node-1 mysqld: 2016-08-26 16:59:02 24841 [Warning] WSREP: bae1b7a9 conflicting prims: my prim: view_id(PRIM,bae1b7a9,6) other prim: view_id(PRIM,1a2cd5b7,1)
   2016-08-26T16:59:02.221682+02:00 node-1 mysqld: 2016-08-26 16:59:02 24841 [Warning] WSREP: bae1b7a9 discarding other prim view: older overrides

 controller-2
   2016-08-26T16:59:02.218143+02:00 node-2 mysqld: 2016-08-26 16:59:02 9624 [ERROR] WSREP: failed to open gcomm backend connection: 131: 20c98253 last prims not consistent (FATAL)
   2016-08-26T16:59:02.218143+02:00 node-2 mysqld: 2016-08-26 16:59:02 9624 [ERROR] WSREP: gcs/src/gcs.cpp:long int gcs_open(gcs_conn_t*, const char*, const char*, bool)():1379: Failed to open channel 'openstack' at 'gcomm://192.168.2.7,192.168.2.8,192.168.2.9': -131 (State not recoverable)

 controller-3
   2016-08-26T16:59:02.214527+02:00 node-3 mysqld: 2016-08-26 16:59:02 22172 [Warning] WSREP: 1a2cd5b7 conflicting prims: my prim: view_id(PRIM,1a2cd5b7,1) other prim: view_id(PRIM,bae1b7a9,6)
   2016-08-26T16:59:02.214831+02:00 node-3 mysqld: 2016-08-26 16:59:02 22172 [ERROR] WSREP: exception from gcomm, backend must be restarted: 1a2cd5b7 aborting due to conflicting prims: older overrides (FATAL)
   2016-08-26T16:59:02.470825+02:00 node-3 mysqld: 2016-08-26 16:59:02 22172 [Warning] WSREP: Send action {(nil), 7222, TORDERED} returned -103 (Software caused connection abort)
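
The "conflicting prims" warnings above name both competing primary views, so they can be pulled out of the logs mechanically when comparing controllers. A minimal sketch using only the Python standard library; the function name is hypothetical, and reading the third view_id field as a view sequence number is an assumption based on the log format, not something stated in this report:

```python
import re

# Matches e.g. view_id(PRIM,1a2cd5b7,1)
VIEW_ID = re.compile(r"view_id\((\w+),([0-9a-f]+),(\d+)\)")


def parse_conflict(line):
    """Return ((my_id, my_seq), (other_id, other_seq)) for a
    'conflicting prims' warning, or None for any other line."""
    if "conflicting prims: my prim" not in line:
        return None
    views = VIEW_ID.findall(line)
    if len(views) != 2:
        return None
    (_, my_id, my_seq), (_, other_id, other_seq) = views
    return (my_id, int(my_seq)), (other_id, int(other_seq))


# Warning taken from the controller-3 log in this report.
line = (
    "2016-08-26 16:59:02 22172 [Warning] WSREP: 1a2cd5b7 conflicting prims: "
    "my prim: view_id(PRIM,1a2cd5b7,1) other prim: view_id(PRIM,bae1b7a9,6)"
)
print(parse_conflict(line))
```

Running it over the controller-1 and controller-3 logs shows the two nodes reporting each other's view as the conflicting one, which matches the picture of two independently bootstrapped primary components.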

tags: added: area-library
Changed in fuel:
milestone: none → 10.0
Aleksei Shishkin (ashishkin) wrote :

The customer found this on MOS 9.0.
A backport is needed.

Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
importance: Undecided → Critical
Maksim Malchuk (mmalchuk) wrote :

Please provide the diagnostic snapshot.

Changed in fuel:
status: New → Incomplete
Miroslav Anashkin (manashkin) wrote :

Maksim,

this issue is already fixed in the latest OCF script version.
https://github.com/openstack/fuel-library/blob/master/files/fuel-ha-utils/ocf/mysql-wss

This bug was filed intentionally, to get this fix included in the upcoming 9.1 version before the code freeze.

Changed in fuel:
status: Incomplete → Confirmed
Maksim Malchuk (mmalchuk) wrote :

Miroslav, we backported almost all the changes for MySQL OCF from the master to stable/mitaka. I believe the issue is already fixed in 9.1. To check this we need more information.

Changed in fuel:
status: Confirmed → Incomplete
Anton Matveev (amatveev) on 2016-08-29
tags: added: sla1
Changed in fuel:
status: Incomplete → Confirmed
status: Confirmed → Incomplete
Changed in fuel:
status: Incomplete → Triaged
Changed in fuel:
milestone: 10.0 → 10.1

Fix proposed to branch: master
Review: https://review.openstack.org/460948

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Vladimir Kuklin (vkuklin)
status: Triaged → In Progress
Changed in fuel:
milestone: 10.1 → 11.x-updates
Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Sergii Golovatiuk (sgolovatiuk)

Reviewed: https://review.openstack.org/460948
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=c8373d4aea13b881bfbd17ff060516ca35055b49
Submitter: Jenkins
Branch: master

commit c8373d4aea13b881bfbd17ff060516ca35055b49
Author: Vladimir Kuklin <email address hidden>
Date: Fri Apr 28 13:43:49 2017 +0300

    Fix race condition for primary component bootstrap

    Create node is_pc flag before starting to check if there
    is more than one of those flags. Thus, we avoid race condition
    when there is 0 is_pc flags and galera starts with --wsrep-new-cluster
    on 2 nodes.

    We set it before the check and, as setting them is synchronous through
    Pacemaker CIB, in that case when >1 nodes attempt to bootstrap with
    --wsrep-new-cluster, only one node will see <= 1 is_pc flags. Others
    will see more than one and fail and reattempt to start. At that point
    one of the nodes will already be bootstrapped, thus reelection will not
    be triggered and the section of bootstrap will be skipped

    Change-Id: I82a71132eef7877ac7ab1ed04263044b3b1e8d9b
    Closes-bug: #1617400
    Signed-off-by: Sergii Golovatiuk <email address hidden>
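
The ordering argument in this commit message can be checked with a small model. The sketch below is not the OCF script: it assumes a plain Python set standing in for the is_pc attributes in the Pacemaker CIB, treats each flag write and each check as one atomic step, and enumerates every interleaving of two nodes. All names are hypothetical.

```python
from itertools import permutations

SET_FLAG, CHECK = "set", "check"


def interleavings(nodes, steps):
    """Yield every global ordering in which each node runs its own steps in order."""
    events = [(node, step) for node in nodes for step in steps]
    for order in permutations(events):
        if all(order.index((n, steps[0])) < order.index((n, steps[1])) for n in nodes):
            yield order


def bootstrappers(order):
    """Return the nodes that would start with --wsrep-new-cluster for this ordering."""
    flags = set()  # stands in for the is_pc attributes in the CIB (assumption)
    winners = []
    for node, step in order:
        if step == SET_FLAG:
            flags.add(node)
        elif len(flags) <= 1:  # the "not more than one is_pc flag" check
            winners.append(node)
    return winners


def worst_case(steps, nodes=("node-1", "node-2")):
    """Largest number of simultaneous bootstraps over all interleavings."""
    return max(len(bootstrappers(order)) for order in interleavings(nodes, steps))


old_order = worst_case((CHECK, SET_FLAG))  # check first, then create the flag
new_order = worst_case((SET_FLAG, CHECK))  # create the flag first, then check
print(old_order, new_order)
```

Under this model the worst case is two simultaneous bootstraps when the check precedes the flag write, and one when the flag is written first. An ordering in which both nodes write their flags before either checks yields no bootstrap at all, which corresponds to the "fail and reattempt to start" path in the commit message.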

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/462135
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=31efd1e2c90cc0ad203423cf44a4366c6f04e024
Submitter: Jenkins
Branch: stable/newton

commit 31efd1e2c90cc0ad203423cf44a4366c6f04e024
Author: Vladimir Kuklin <email address hidden>
Date: Fri Apr 28 13:43:49 2017 +0300

    Fix race condition for primary component bootstrap

    Create node is_pc flag before starting to check if there
    is more than one of those flags. Thus, we avoid race condition
    when there is 0 is_pc flags and galera starts with --wsrep-new-cluster
    on 2 nodes.

    We set it before the check and, as setting them is synchronous through
    Pacemaker CIB, in that case when >1 nodes attempt to bootstrap with
    --wsrep-new-cluster, only one node will see <= 1 is_pc flags. Others
    will see more than one and fail and reattempt to start. At that point
    one of the nodes will already be bootstrapped, thus reelection will not
    be triggered and the section of bootstrap will be skipped

    Change-Id: I82a71132eef7877ac7ab1ed04263044b3b1e8d9b
    Closes-bug: #1617400
    Signed-off-by: Sergii Golovatiuk <email address hidden>

Reviewed: https://review.openstack.org/462134
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=afa04a86d9e7acb677366325ce42cc77e0c8a324
Submitter: Jenkins
Branch: stable/ocata

commit afa04a86d9e7acb677366325ce42cc77e0c8a324
Author: Vladimir Kuklin <email address hidden>
Date: Fri Apr 28 13:43:49 2017 +0300

    Fix race condition for primary component bootstrap

    Create node is_pc flag before starting to check if there
    is more than one of those flags. Thus, we avoid race condition
    when there is 0 is_pc flags and galera starts with --wsrep-new-cluster
    on 2 nodes.

    We set it before the check and, as setting them is synchronous through
    Pacemaker CIB, in that case when >1 nodes attempt to bootstrap with
    --wsrep-new-cluster, only one node will see <= 1 is_pc flags. Others
    will see more than one and fail and reattempt to start. At that point
    one of the nodes will already be bootstrapped, thus reelection will not
    be triggered and the section of bootstrap will be skipped

    Change-Id: I82a71132eef7877ac7ab1ed04263044b3b1e8d9b
    Closes-bug: #1617400
    Signed-off-by: Sergii Golovatiuk <email address hidden>

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/462136
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=179b27932f6f792693bb05bce6de5bc004085f34
Submitter: Jenkins
Branch: stable/mitaka

commit 179b27932f6f792693bb05bce6de5bc004085f34
Author: Vladimir Kuklin <email address hidden>
Date: Fri Apr 28 13:43:49 2017 +0300

    Fix race condition for primary component bootstrap

    Create node is_pc flag before starting to check if there
    is more than one of those flags. Thus, we avoid race condition
    when there is 0 is_pc flags and galera starts with --wsrep-new-cluster
    on 2 nodes.

    We set it before the check and, as setting them is synchronous through
    Pacemaker CIB, in that case when >1 nodes attempt to bootstrap with
    --wsrep-new-cluster, only one node will see <= 1 is_pc flags. Others
    will see more than one and fail and reattempt to start. At that point
    one of the nodes will already be bootstrapped, thus reelection will not
    be triggered and the section of bootstrap will be skipped

    Change-Id: I82a71132eef7877ac7ab1ed04263044b3b1e8d9b
    Closes-bug: #1617400
    Signed-off-by: Sergii Golovatiuk <email address hidden>

Reviewed: https://review.openstack.org/465674
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=3e8fe44dbe77a4b6595f3fe0aac5d7b62d777e1a
Submitter: Jenkins
Branch: stable/mitaka

commit 3e8fe44dbe77a4b6595f3fe0aac5d7b62d777e1a
Author: Vladimir Kuklin <email address hidden>
Date: Wed May 17 19:42:09 2017 +0300

    Introduce critical section on master election process

    This commit adds a node attribute with value which
    means the timestamp of when the election process started.
    If we have election on any node in process we sleep for a while
    unless the attribute is outdated.

    We start the election only if the attributes for all nodes are outdated
    or if they do not exist.

    This prevents us from hitting rare condition when several nodes
    start simultaneously but do not agree on the master node due to
    race condition in MySQL start time and pacemaker attribute setting

    Change-Id: I7f4728b75ce5577338dff182634b608823cff74e
    Closes-bug: #1617400
    Co-Authored-By: Fedor Zhadaev <email address hidden>
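
The rule described in this commit message (start an election only if every node's election timestamp is outdated or absent) can be sketched as a single predicate. This is an illustration under assumptions, not the code from the OCF script: the dictionary stands in for the per-node Pacemaker attributes, and the function name and the max_age parameter are hypothetical.

```python
def may_start_election(election_started_at, now, max_age):
    """Return True if no node has a fresh election timestamp.

    election_started_at maps node name -> timestamp in seconds, or None
    when the attribute does not exist for that node.
    """
    return all(
        started is None or now - started > max_age
        for started in election_started_at.values()
    )


# node-1 started an election at t=1000, node-2 has no attribute,
# node-3 has a timestamp from t=900.
attrs = {"node-1": 1000, "node-2": None, "node-3": 900}
print(may_start_election(attrs, now=1010, max_age=60))
print(may_start_election(attrs, now=1100, max_age=60))
```

With max_age set to 60 seconds, the first call is refused because node-1's timestamp is only 10 seconds old, so the caller would sleep and retry; the second call is allowed because every existing timestamp is older than 60 seconds.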

Dmitry (dtsapikov) wrote :

Verified on 9.2+mu2

tags: added: on-verification
tags: removed: on-verification

Reviewed: https://review.openstack.org/478413
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=81274177d6fa32e787b281f6bc22bf27d9f6a0ae
Submitter: Jenkins
Branch: stable/mitaka

commit 81274177d6fa32e787b281f6bc22bf27d9f6a0ae
Author: Dmitry Sutyagin <email address hidden>
Date: Wed Jun 28 11:38:45 2017 +0400

    Make election attribute name unique for each node

    This commit changes logic of setting election attribute.
    Using different names of election attribute for each node resolves
    issue when one node cleared attribute set by another node.

    Change-Id: Id22d8a4d9cda4b4a2efda447eab7e30a1d6c9410
    Closes-bug: #1617400
    Co-Authored-By: Denis Ipatov <email address hidden>
    Co-Authored-By: Dmitry Sutyagin <email address hidden>
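
The problem this commit describes, one node clearing an attribute set by another, can be shown with a deliberately simplified model. The sketch assumes a flat dictionary in place of the Pacemaker attribute store; the attribute names "election" and "election-<node>" are hypothetical and are not taken from the OCF script.

```python
def clear_shared(attrs):
    """Shared attribute name: clearing removes the marker no matter who set it."""
    attrs.pop("election", None)


def clear_per_node(attrs, node):
    """Per-node attribute name: clearing only touches the caller's own marker."""
    attrs.pop(f"election-{node}", None)


# Shared name: node-1 and node-2 both write the same key, then node-2 clears it.
shared = {}
shared["election"] = 100  # written by node-1
shared["election"] = 101  # overwritten by node-2
clear_shared(shared)      # node-2 finishes and clears
node1_marker_survives_shared = "election" in shared

# Per-node names: node-2 clearing its marker leaves node-1's marker in place.
per_node = {"election-node-1": 100, "election-node-2": 101}
clear_per_node(per_node, "node-2")
node1_marker_survives_per_node = "election-node-1" in per_node

print(node1_marker_survives_shared, node1_marker_survives_per_node)
```

In the shared-name case node-1's election marker is gone after node-2 cleans up, so other nodes could start a new election while node-1 is still in its critical section; with per-node names the marker survives.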

tags: added: in-stable-mitaka