Mysql-wss OCF script raises false-positive split-brain errors

Bug #1651982 reported by Ivan Udovichenko
This bug affects 3 people
Affects             Status         Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Committed  High        Vladimir Kuklin
Mitaka              Fix Released   High        Vladimir Kuklin
Newton              Fix Committed  High        Vladimir Kuklin
Ocata               Fix Committed  High        Vladimir Kuklin

Bug Description

Test 'Launch instance, create snapshot, launch instance from snapshot' failed during OSTF health checks [1][2].
Target component is Glance [3].
Traceback from node-1 [4], node-2 [5], node-3 [6].

Split-brain also took place. [7]

[1] https://ci.fuel-infra.org/job/11.0-community.main.ubuntu.bvt_2/224/console
[2] http://paste.openstack.org/show/593084/
[3] https://github.com/openstack/fuel-ostf/blob/master/fuel_health/tests/smoke/test_nova_image_actions.py#L139
[4] http://paste.openstack.org/show/593082/
[5] http://paste.openstack.org/show/593080/
[6] http://paste.openstack.org/show/593083/
[7] http://paste.openstack.org/show/593111/

Changed in fuel:
importance: Undecided → High
milestone: none → 11.0
summary: - [BVT_2][224] Timout while launching another instance from the snapshot
+ [BVT_2][224] Timeout while launching another instance from the snapshot
Changed in fuel:
assignee: nobody → Sergii Golovatiuk (sgolovatiuk)
status: New → Confirmed
Nastya Urlapova (aurlapova) wrote : Re: [BVT_2][224] Timeout while launching another instance from the snapshot

The same issue on 9.2 snapshot #684:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.repetitive_restart/158/testReport/(root)/ceph_partitions_repetitive_cold_restart/ceph_partitions_repetitive_cold_restart/

  Scenario:
        1. Revert snapshot 'prepare_load_ceph_ha'
        2. Wait until MySQL Galera is UP on some controller
        3. Check Ceph status
        4. Run ostf
        5. Fill ceph partitions on all nodes up to 30%
        6. Check Ceph status
        7. Disable UMM
        8. Run RALLY
        9. 100 times repetitive reboot:
        10. Cold restart of all nodes
        11. Wait for HA services ready
        12. Wait until MySQL Galera is UP on some controller
        13. Run ostf <<< failed here

>>>
2016-12-27T01:22:44.400903+00:00 err: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:12845, this is a split-brain!
2016-12-27T01:22:44.405356+00:00 err: ERROR: p_mysqld: mysql_monitor(): I'm a master, and my GTID: c27b16c7-cbd2-11e6-97b3-03ad09eb7038:0, which was not expected

tags: added: swarm-fail
summary: - [BVT_2][224] Timeout while launching another instance from the snapshot
+ Timeout while launching another instance from the snapshot
Nastya Urlapova (aurlapova) wrote : Re: Timeout while launching another instance from the snapshot
tags: added: area-library
Vladimir Kuklin (vkuklin) wrote :

The issue is a race condition on slow environments: some nodes update their GTID seqno while the monitor operation is running on other nodes. As a result, the node running with the `--wsrep-new-cluster` option is marked as a duplicate primary component, which makes Pacemaker fail the resource and restart that node. The proper solution would be to use a real master-slave OCF script for Galera, but that is not possible for the already released Mitaka and Newton because of significant update impediments. So the solution is to refactor the `check_if_new_cluster` method to raise a split-brain error only if there is actually more than one primary component. To do that, each node records in the CIB whether it is a primary component, and the check counts how many nodes are recorded as such.
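
To illustrate the rule described in this comment, here is a minimal Python model of the decision. It is only a sketch, not the code of the mysql-wss OCF script: the function names, the dictionary standing in for per-node attributes stored in the CIB, and the way the "master" node is passed in are all assumptions made for illustration.

```python
from typing import Mapping


def count_primary_components(is_pc_by_node: Mapping[str, bool]) -> int:
    """Count nodes whose (assumed) per-node primary-component flag is set."""
    return sum(1 for flag in is_pc_by_node.values() if flag)


def should_fail_monitor(
    node: str,
    master_gtid_node: str,
    is_pc_by_node: Mapping[str, bool],
) -> bool:
    """Return True only when a real split-brain is suspected.

    Hypothetical inputs:
      node              - the node running the monitor operation
      master_gtid_node  - the node currently holding the 'master' GTID
      is_pc_by_node     - stand-in for the flags the nodes store in the CIB
    """
    has_master_gtid = node == master_gtid_node
    node_is_pc = is_pc_by_node.get(node, False)
    more_than_one_pc = count_primary_components(is_pc_by_node) > 1
    # Fail only if all three conditions hold; a lone primary component
    # whose GTID was overtaken during the race is left alone.
    return (not has_master_gtid) and node_is_pc and more_than_one_pc


if __name__ == "__main__":
    # Race scenario: node-1 bootstrapped the cluster, but node-2 now
    # appears to hold the 'master' GTID. Only one primary component exists.
    race = {"node-1": True, "node-2": False, "node-3": False}
    print(should_fail_monitor("node-1", "node-2", race))

    # Two nodes both claim to be a primary component.
    split = {"node-1": True, "node-2": True, "node-3": False}
    print(should_fail_monitor("node-1", "node-2", split))
```

In this model the race scenario no longer triggers a failure, because only one node is flagged as a primary component; the failure is reserved for the case where two or more nodes carry the flag.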

summary: - Timeout while launching another instance from the snapshot
+ Mysql-wss OCF script raises false-positive split-brain errors
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/417434

Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Vladimir Kuklin (vkuklin)
status: Confirmed → In Progress
tags: added: galera
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/418892

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/418893

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/417434
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=3a2cc9e24a6d695af7c0a65f65b8f62ce1c8920c
Submitter: Jenkins
Branch: master

commit 3a2cc9e24a6d695af7c0a65f65b8f62ce1c8920c
Author: Vladimir Kuklin <email address hidden>
Date: Fri Jan 6 19:11:09 2017 +0300

    Avoid false-positive split-brain detection for mysql-wss

    With change Iaa4855d769fe1e0203fcfb9981413273e0e4dda2
    we detect whether the node is running as a primary component
    while it is not master. While it is a good solution, sometimes
    we face a race condition when the node which is a 'master' gets a lower
    sequence number due to other nodes updating their GTID at the same
    time. Although it happens rarely and mostly on slow or overloaded
    environments, it leads to redundant mysql restarts and service
    downtime for OpenStack APIs.

    The proper fix would be to use master-slave resource and corresponding
    script, but this is far too big a change for the bug in question.

    The solution proposed checks if the node is a primary component during
    start and monitor operations and also checks for number of currently
    running primary components by setting and querying an additional
    attribute `is_pc`. It triggers monitor failure only when the node
    is not running with the 'master' GTID, is a primary component,
    and there is more than one primary component.

    Misc: fix functions return codes to reflect shell 'true'
    and 'false' numeric values.

    Change-Id: Id3ea32347ed37a6efffd3ee85dfb3110b2e8c8ca
    Closes-bug: #1651982

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/newton)

Reviewed: https://review.openstack.org/418893
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=c6f0b71a8b5960a625f193f64fd9e313371864ab
Submitter: Jenkins
Branch: stable/newton

commit c6f0b71a8b5960a625f193f64fd9e313371864ab
Author: Vladimir Kuklin <email address hidden>
Date: Fri Jan 6 19:11:09 2017 +0300


OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/418892
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=3a6b0976e529c549419654f3952e12bf16cdb957
Submitter: Jenkins
Branch: stable/mitaka

commit 3a6b0976e529c549419654f3952e12bf16cdb957
Author: Vladimir Kuklin <email address hidden>
Date: Fri Jan 6 19:11:09 2017 +0300


Sergey Novikov (snovikov) wrote :

The issue has not been reproduced during the latest 9.2 swarm runs (including RC1). I've moved the bug to "Fix Released".

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 11.0.0.0rc1

This issue was fixed in the openstack/fuel-library 11.0.0.0rc1 release candidate.
