Bug #1651982 “Mysql-wss OCF script raises false-positive split-b...” : Bugs : Fuel for OpenStack

Ivan Udovichenko (iudovichenko) on 2016-12-22

Changed in fuel:
importance:	Undecided → High
milestone:	none → 11.0

Ivan Udovichenko (iudovichenko) on 2016-12-22

summary:

- [BVT_2][224] Timout while launching another instance from the snapshot
+ [BVT_2][224] Timeout while launching another instance from the snapshot

Oleksiy Molchanov (omolchanov) on 2016-12-22

Changed in fuel:
assignee:	nobody → Sergii Golovatiuk (sgolovatiuk)
status:	New → Confirmed

Nastya Urlapova (aurlapova) wrote on 2016-12-27: Re: [BVT_2][224] Timeout while launching another instance from the snapshot

#1

The same issue on 9.2 snapshot #684:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.repetitive_restart/158/testReport/(root)/ceph_partitions_repetitive_cold_restart/ceph_partitions_repetitive_cold_restart/

  Scenario:
        1. Revert snapshot 'prepare_load_ceph_ha'
        2. Wait until MySQL Galera is UP on some controller
        3. Check Ceph status
        4. Run ostf
        5. Fill ceph partitions on all nodes up to 30%
        6. Check Ceph status
        7. Disable UMM
        8. Run RALLY
        9. 100 times repetitive reboot:
        10. Cold restart of all nodes
        11. Wait for HA services ready
        12. Wait until MySQL Galera is UP on some controller
        13. Run ostf <<< failed here

>>>
2016-12-27T01:22:44.400903+00:00 err: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:12845, this is a split-brain!
2016-12-27T01:22:44.405356+00:00 err: ERROR: p_mysqld: mysql_monitor(): I'm a master, and my GTID: c27b16c7-cbd2-11e6-97b3-03ad09eb7038:0, which was not expected

tags:

added: swarm-fail

Nastya Urlapova (aurlapova) on 2016-12-27

summary:

- [BVT_2][224] Timeout while launching another instance from the snapshot
+ Timeout while launching another instance from the snapshot

Nastya Urlapova (aurlapova) wrote on 2016-12-27: Re: Timeout while launching another instance from the snapshot

#2

fail_error_ceph_partitions_repetitive_cold_restart-fuel-snapshot-2016-12-27_01-53-57.tar (218.6 MiB, application/x-tar)

Oleksiy Molchanov (omolchanov) on 2016-12-27

tags:

added: area-library

Vladimir Kuklin (vkuklin) wrote on 2017-01-06:

#3

So, the issue here is a race condition on some slow environments: some nodes update their GTID SEQNO while the monitor operation is running on the other nodes. As a result, the node that runs with the `--wsrep-new-cluster` option is marked as a duplicate primary component, which in turn makes Pacemaker fail and restart the node in question. The proper solution would be to use a real master-slave OCF script for Galera, which is not possible for the already released Mitaka and Newton due to significant update impediments. So the solution is to refactor the `check_if_new_cluster` method to raise a split-brain error only if there is actually more than one primary component: each node stores in the CIB whether it is a primary component, and the check counts whether there is more than one.
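The counting approach described above can be sketched as follows. This is only an illustrative model in Python, not the OCF script itself (which is shell): the dictionary standing in for the CIB, the node names, the `seqno` values, and the "true"/"false" encoding are all assumptions. The attribute name `is_pc` is the one mentioned in the fix's commit message.

```python
from typing import Dict

# Hypothetical stand-in for the Pacemaker CIB: node name -> node attributes.
Cib = Dict[str, Dict[str, str]]


def count_primary_components(cib: Cib) -> int:
    """Count nodes that have recorded themselves as a primary component."""
    return sum(1 for attrs in cib.values() if attrs.get("is_pc") == "true")


def is_split_brain(cib: Cib) -> bool:
    """Report a split-brain only when more than one node claims to be a PC."""
    return count_primary_components(cib) > 1


# Race from the comment: node-1 bootstrapped the cluster, but the other nodes
# updated their seqno while node-1 was being monitored. Only one PC exists.
racy = {
    "node-1": {"is_pc": "true", "seqno": "0"},
    "node-2": {"is_pc": "false", "seqno": "5"},
    "node-3": {"is_pc": "false", "seqno": "5"},
}

# Genuine split-brain: two nodes each believe they bootstrapped the cluster.
split = {
    "node-1": {"is_pc": "true", "seqno": "0"},
    "node-2": {"is_pc": "true", "seqno": "0"},
    "node-3": {"is_pc": "false", "seqno": "5"},
}
```

In this model, the `racy` state is not reported as a split-brain even though the bootstrapping node has the lowest sequence number, while the `split` state is.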

summary:

- Timeout while launching another instance from the snapshot
+ Mysql-wss OCF script raises false-positive split-brain errors

OpenStack Infra (hudson-openstack) wrote on 2017-01-06: Fix proposed to fuel-library (master)

#4

Fix proposed to branch: master
Review: https://review.openstack.org/417434

Changed in fuel:
assignee:	Sergii Golovatiuk (sgolovatiuk) → Vladimir Kuklin (vkuklin)
status:	Confirmed → In Progress

Bogdan Dobrelya (bogdando) on 2017-01-09

tags:

added: galera

OpenStack Infra (hudson-openstack) wrote on 2017-01-11: Fix proposed to fuel-library (stable/mitaka)

#5

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/418892

OpenStack Infra (hudson-openstack) wrote on 2017-01-11: Fix proposed to fuel-library (stable/newton)

#6

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/418893

OpenStack Infra (hudson-openstack) wrote on 2017-01-11: Fix merged to fuel-library (master)

#7

Reviewed: https://review.openstack.org/417434
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=3a2cc9e24a6d695af7c0a65f65b8f62ce1c8920c
Submitter: Jenkins
Branch: master

commit 3a2cc9e24a6d695af7c0a65f65b8f62ce1c8920c
Author: Vladimir Kuklin <email address hidden>
Date: Fri Jan 6 19:11:09 2017 +0300

Avoid false-positive split-brain detection for mysql-wss

    With change Iaa4855d769fe1e0203fcfb9981413273e0e4dda2
    we detect whether the node is running as a primary component
    while it is not master. While it is a good solution, sometimes
    we face a race condition when the node which is a 'master' gets a lower
    sequence number due to other nodes updating their GTID at the same
    time. Although it happens rarely and mostly on slow or overloaded
    environments, it leads to redundant mysql restarts and service
    downtime for OpenStack APIs.

    The proper fix would be to use a master-slave resource and the corresponding
    script, but this is far too big a change for the bug in question.

    The proposed solution checks if the node is a primary component during
    start and monitor operations and also checks the number of currently
    running primary components by setting and querying an additional
    attribute `is_pc`. It triggers a monitor failure only when the node
    is not running with the 'master' GTID, is a primary component,
    and there is more than one primary component.

    Misc: fix function return codes to reflect shell 'true'
    and 'false' numeric values.

    Change-Id: Id3ea32347ed37a6efffd3ee85dfb3110b2e8c8ca
    Closes-bug: #1651982
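
The monitor condition in the commit message reads as a conjunction of three checks, and the "Misc" note refers to the shell convention that exit status 0 means true and non-zero means false. A minimal sketch of both is given here in Python for illustration only. The function and parameter names are hypothetical and do not come from the actual OCF script.

```python
def monitor_should_fail(has_master_gtid: bool, is_pc: bool, pc_count: int) -> bool:
    """Fail the monitor only if all three conditions from the commit hold:
    the node does not run with the 'master' GTID, it is a primary component,
    and more than one primary component is currently running."""
    return (not has_master_gtid) and is_pc and pc_count > 1


def shell_rc(condition: bool) -> int:
    """Map a boolean to the shell convention: 0 is 'true', 1 is 'false'."""
    return 0 if condition else 1
```

Under these assumptions, a primary component that lacks the 'master' GTID but is the only primary component (`pc_count == 1`) passes the monitor, which is the false-positive case this bug describes.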

Changed in fuel:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2017-01-11: Fix merged to fuel-library (stable/newton)

#8

Reviewed: https://review.openstack.org/418893
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=c6f0b71a8b5960a625f193f64fd9e313371864ab
Submitter: Jenkins
Branch: stable/newton

commit c6f0b71a8b5960a625f193f64fd9e313371864ab
Author: Vladimir Kuklin <email address hidden>
Date: Fri Jan 6 19:11:09 2017 +0300

Avoid false-positive split-brain detection for mysql-wss

    With change Iaa4855d769fe1e0203fcfb9981413273e0e4dda2
    we detect whether the node is running as a primary component
    while it is not master. While it is a good solution, sometimes
    we face a race condition when the node which is a 'master' gets a lower
    sequence number due to other nodes updating their GTID at the same
    time. Although it happens rarely and mostly on slow or overloaded
    environments, it leads to redundant mysql restarts and service
    downtime for OpenStack APIs.

    The proper fix would be to use a master-slave resource and the corresponding
    script, but this is far too big a change for the bug in question.

    The proposed solution checks if the node is a primary component during
    start and monitor operations and also checks the number of currently
    running primary components by setting and querying an additional
    attribute `is_pc`. It triggers a monitor failure only when the node
    is not running with the 'master' GTID, is a primary component,
    and there is more than one primary component.

    Misc: fix function return codes to reflect shell 'true'
    and 'false' numeric values.

    Change-Id: Id3ea32347ed37a6efffd3ee85dfb3110b2e8c8ca
    Closes-bug: #1651982

OpenStack Infra (hudson-openstack) wrote on 2017-01-11: Fix merged to fuel-library (stable/mitaka)

#9

Reviewed: https://review.openstack.org/418892
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=3a6b0976e529c549419654f3952e12bf16cdb957
Submitter: Jenkins
Branch: stable/mitaka

commit 3a6b0976e529c549419654f3952e12bf16cdb957
Author: Vladimir Kuklin <email address hidden>
Date: Fri Jan 6 19:11:09 2017 +0300

Avoid false-positive split-brain detection for mysql-wss

    With change Iaa4855d769fe1e0203fcfb9981413273e0e4dda2
    we detect whether the node is running as a primary component
    while it is not master. While it is a good solution, sometimes
    we face a race condition when the node which is a 'master' gets a lower
    sequence number due to other nodes updating their GTID at the same
    time. Although it happens rarely and mostly on slow or overloaded
    environments, it leads to redundant mysql restarts and service
    downtime for OpenStack APIs.

    The proper fix would be to use a master-slave resource and the corresponding
    script, but this is far too big a change for the bug in question.

    The proposed solution checks if the node is a primary component during
    start and monitor operations and also checks the number of currently
    running primary components by setting and querying an additional
    attribute `is_pc`. It triggers a monitor failure only when the node
    is not running with the 'master' GTID, is a primary component,
    and there is more than one primary component.

    Misc: fix function return codes to reflect shell 'true'
    and 'false' numeric values.

    Change-Id: Id3ea32347ed37a6efffd3ee85dfb3110b2e8c8ca
    Closes-bug: #1651982

Sergey Novikov (snovikov) wrote on 2017-01-27:

#10

The issue has not been reproduced during the latest 9.2 swarm runs (including RC1). I've moved the bug to "Fix Released".

OpenStack Infra (hudson-openstack) wrote on 2017-02-27: Fix included in openstack/fuel-library 11.0.0.0rc1

#11

This issue was fixed in the openstack/fuel-library 11.0.0.0rc1 release candidate.

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Fix Committed	High	Vladimir Kuklin	Fuel for OpenStack 11.0
Mitaka	Fix Released	High	Vladimir Kuklin	Fuel for OpenStack 9.2
Newton	Fix Committed	High	Vladimir Kuklin	Fuel for OpenStack 10.1
Ocata	Fix Committed	High	Vladimir Kuklin	Fuel for OpenStack 11.0