[SWARM] OSTF tests failed due to split-brain issue.

Bug #1636841 reported by Alexey. Kalashnikov on 2016-10-26
This bug affects 3 people
Affects             Importance  Assigned to
Fuel for OpenStack  High        Sergii Golovatiuk
  (Nominated for Ocata by Oleksiy Molchanov)
Mitaka              High        Sergii Golovatiuk
Newton              High        Sergii Golovatiuk

Bug Description

During today's swarm tests, at least two tests failed because of a "split-brain" issue, which led to OSTF test failures.

Failed tests:
Deploy HA environment with Cinder, Neutron and network template on two nodegroups.
Error Message:
http://paste.openstack.org/show/587077/
SLAVE_NODE_MEMORY=3968
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.multiracks_2/106/console
diagnostic snapshot:
https://drive.google.com/open?id=0B0EB6QSDWt2vcGRDZmR0c1hMQXM
shotgun2 report:
http://paste.openstack.org/show/587091/
/var/log/remote/node-1.test.domain.local/ocf-mysql-wss.log:2016-10-25T23:55:50.384664+00:00 err: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:5342, this is a split-brain!

Deploy HA environment with NeutronVXLAN and 2 nodegroups
Error Message:
http://paste.openstack.org/show/587079/
SLAVE_NODE_MEMORY=3968
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.thread_7/106/console
diagnostic snapshot:
https://drive.google.com/open?id=0B0EB6QSDWt2vZFJjQk9OVThKZ1E
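
When triaging similar failures, it can help to scan collected logs for the resource agent's split-brain message quoted in the first failure above. The following is only a minimal sketch, assuming the message text stays as shown in this report; the function name `find_split_brain` and the second sample line are hypothetical and not part of any Fuel tooling.

```python
import re

# Matches the split-brain message emitted by check_if_galera_pc(), as quoted
# in this report. The wording is taken from the log line above; other
# versions of the resource agent may phrase it differently (assumption).
SPLIT_BRAIN_RE = re.compile(
    r"ERROR: (?P<resource>\S+): check_if_galera_pc\(\): "
    r".*PID:(?P<pid>\d+), this is a split-brain!"
)


def find_split_brain(lines):
    """Return a (resource, pid) tuple for every split-brain message found."""
    hits = []
    for line in lines:
        match = SPLIT_BRAIN_RE.search(line)
        if match:
            hits.append((match.group("resource"), int(match.group("pid"))))
    return hits


if __name__ == "__main__":
    sample = [
        # Copied from the report.
        "2016-10-25T23:55:50.384664+00:00 err: ERROR: p_mysqld: "
        "check_if_galera_pc(): But I'm running a new cluster, PID:5342, "
        "this is a split-brain!",
        # Hypothetical unrelated line, included only to show it is skipped.
        "2016-10-25T23:55:51.000000+00:00 info: INFO: p_mysqld: unrelated",
    ]
    print(find_split_brain(sample))
```

The pattern anchors on the function name and the closing phrase rather than on the timestamp prefix, so it should work for both the remote-log and the syslog-style lines that appear in this report.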

Additional information
Related issues:
https://bugs.launchpad.net/fuel/+bug/1630233
https://bugs.launchpad.net/fuel/+bug/1620268

Changed in fuel:
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
Changed in fuel:
milestone: none → 11.0
status: New → Confirmed
tags: added: area-library
Changed in fuel:
importance: Undecided → High
tags: added: swarm-blocker
Dmitry Pyzhov (dpyzhov) on 2016-11-01
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Sergii Golovatiuk (sgolovatiuk)
Bogdan Dobrelya (bogdando) wrote :

The failure was a permanent race condition; see http://pastebin.com/nc20bkVb. This is a design flaw in the OCF MySQL RA.

Roman Rufanov (rrufanov) wrote :

@Bogdan - what is the proposed solution then? Are you saying that the OSTF tests will keep failing and we cannot fix it?

Sergii Golovatiuk (sgolovatiuk) wrote :

According to the log, p_mysqld_start_0 timed out. It didn't fail; it just timed out. This is not related to how we assemble Galera. It's related to the load average during the test run. This bug is sporadic and related to the environment rather than to the Galera assembly process.

Changed in fuel:
status: Confirmed → Won't Fix
Changed in fuel:
status: Won't Fix → Confirmed

Reviewed: https://review.openstack.org/400141
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=081f27670fc065f4c52920c1ac911163c9f4ad11
Submitter: Jenkins
Branch: master

commit 081f27670fc065f4c52920c1ac911163c9f4ad11
Author: Sergii Golovatiuk <email address hidden>
Date: Mon Nov 21 10:18:06 2016 +0100

    Remove mysqld_safe from mysql-wss OCF

    Pacemaker controls mysqld, thus we don't need mysqld_safe wrapper which
    does the same. This should help to get statuses for OCF scripts on very
    high loaded systems

    Change-Id: I73649d60c3cc08cbe696c6bc97ee5aa0ad430908
    Related-Bug: 1636841
    Signed-off-by: Sergii Golovatiuk <email address hidden>

Elena Ezhova (eezhova) wrote :

It is likely that this issue was hit in the recent 9.x BVT job [0]. The failures are shown at https://paste.mirantis.net/show/2850/.
There are traces in the Keystone and other service logs at the time of the failure:
<11>Dec 8 15:14:52 node-1 keystone-admin: 2016-12-08 15:14:52.797 4633 ERROR keystone.common.wsgi [req-c56c4e38-c945-4b82-8e35-2fcb71b65433 - - - - -] (_mysql_exceptions.OperationalError) (2013, "Lost connection to MySQL server at 'reading initial communication packet', system error: 11") [SQL: u'SELECT 1']

It turns out that mysqld was stopped by Pacemaker at that time (https://paste.mirantis.net/show/2851/), and a split-brain happened.

<27>Dec 8 15:14:50 node-1 ocf-mysql-wss: ERROR: p_mysqld: check_if_galera_pc(): But I'm running a new cluster, PID:24125, this is a split-brain!
<27>Dec 8 15:14:50 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_monitor(): I'm a master, and my GTID: 9500b2b2-bd54-11e6-be60-cf0a892e175d:3738, which was not expected
<27>Dec 8 15:14:50 node-1 ocf-ns_vrouter: ERROR: RTNETLINK answers: File exists
<27>Dec 8 15:14:57 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): PIDFile /var/run/resource-agents/mysql-wss/mysql-wss.pid of MySQL server not found. Sleeping for 2 seconds. 0 retries left
<27>Dec 8 15:14:59 node-1 ocf-mysql-wss: ERROR: p_mysqld: mysql_status(): MySQL is not running

[0] https://product-ci.infra.mirantis.net/job/9.x.main.ubuntu.bvt_2/611
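
The mysql_status() messages above suggest a status check that looks for a PID file and retries with a 2-second sleep before reporting that MySQL is not running. The sketch below illustrates that general pattern only; it is not the mysql-wss resource agent's code. The names `pid_is_running` and `check_status`, the default retry count, and the POSIX-only use of signal 0 are assumptions made for the example.

```python
import os
import time


def pid_is_running(pid):
    """Return True if a process with this PID exists (POSIX only).

    Signal 0 performs the existence and permission checks without
    delivering a signal to the process.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def check_status(pidfile, retries=3, delay=2.0, sleep=time.sleep):
    """Return True if pidfile names a live process.

    While the PID file is missing or unreadable, wait `delay` seconds and
    try again, up to `retries` extra attempts. Defaults are illustrative.
    """
    for attempt in range(retries + 1):
        try:
            with open(pidfile) as handle:
                pid = int(handle.read().strip())
        except (FileNotFoundError, ValueError):
            if attempt < retries:
                sleep(delay)
            continue
        return pid_is_running(pid)
    return False
```

On a heavily loaded node, a check of this kind can run out of retries before the daemon has written its PID file, which is one way a slow start can be reported as "not running".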

Reviewed: https://review.openstack.org/410598
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=dda74618cd03aa2720afe8b0980947bf735dd6b2
Submitter: Jenkins
Branch: stable/mitaka

commit dda74618cd03aa2720afe8b0980947bf735dd6b2
Author: Sergii Golovatiuk <email address hidden>
Date: Mon Nov 21 10:18:06 2016 +0100

    Remove mysqld_safe from mysql-wss OCF

    Pacemaker controls mysqld, thus we don't need mysqld_safe wrapper which
    does the same. This should help to get statuses for OCF scripts on very
    high loaded systems

    Change-Id: I73649d60c3cc08cbe696c6bc97ee5aa0ad430908
    Related-Bug: 1636841
    Signed-off-by: Sergii Golovatiuk <email address hidden>
    (cherry picked from commit 081f27670fc065f4c52920c1ac911163c9f4ad11)

tags: added: in-stable-mitaka

Fix proposed to branch: master
Review: https://review.openstack.org/413034

Changed in fuel:
status: Confirmed → In Progress

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/413109

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/413034
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=139c5f4fadd45c3d1430c38b6a3c7238612d4211
Submitter: Jenkins
Branch: master

commit 139c5f4fadd45c3d1430c38b6a3c7238612d4211
Author: Sergii Golovatiuk <email address hidden>
Date: Tue Dec 20 12:23:07 2016 +0100

    Fix dependency issues between tasks

    DB tasks should wait for galera explicitly. There shouldn't be
    conditions when primary-database is done, allowing *-db tasks to create
    users while galera is syncing with other nodes. That causes split brain
    as *-db tasks create users in MyISAM tables which is not handled by
    Galera.

    Closes-Bug: #1636841
    Change-Id: I729ba0f2bfce1e731de36c932f5f0350e91adb22
    Signed-off-by: Sergii Golovatiuk <email address hidden>
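
The idea in the commit message above, that database tasks should wait for Galera explicitly, can be illustrated with a small readiness gate. This is a hedged sketch, not the fuel-library change itself, which adjusts task dependencies. The names `galera_is_ready`, `wait_for_galera` and the `fetch_status` callable are hypothetical; `fetch_status` is assumed to return the result of something like `SHOW GLOBAL STATUS LIKE 'wsrep_%'` as a dict. The status variables `wsrep_cluster_status`, `wsrep_local_state_comment` and `wsrep_ready` are standard Galera status variables.

```python
import time


def galera_is_ready(status):
    """Return True if a node reports itself as a synced primary-component member.

    `status` maps wsrep status variable names to their string values.
    """
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        and status.get("wsrep_ready") == "ON"
    )


def wait_for_galera(fetch_status, timeout=600.0, interval=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_status() until the node is ready or the timeout expires.

    fetch_status is a caller-supplied (hypothetical) callable returning a
    dict of wsrep status variables. Timeout and interval are illustrative.
    """
    deadline = clock() + timeout
    while True:
        if galera_is_ready(fetch_status()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A task that creates database users would call `wait_for_galera(...)` first and proceed only on `True`, so that user creation does not run while the node is still syncing with the rest of the cluster.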

Reviewed: https://review.openstack.org/413109
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=0271ec3be2e3a5ab06a03298d33935f86ee955ca
Submitter: Jenkins
Branch: stable/mitaka

commit 0271ec3be2e3a5ab06a03298d33935f86ee955ca
Author: Sergii Golovatiuk <email address hidden>
Date: Tue Dec 20 12:23:07 2016 +0100

    Fix dependency issues between tasks

    DB tasks should wait for galera explicitly. There shouldn't be
    conditions when primary-database is done, allowing *-db tasks to create
    users while galera is syncing with other nodes. That causes split brain
    as *-db tasks create users in MyISAM tables which is not handled by
    Galera.

    Closes-Bug: #1636841
    Change-Id: I729ba0f2bfce1e731de36c932f5f0350e91adb22
    Signed-off-by: Sergii Golovatiuk <email address hidden>

Reviewed: https://review.openstack.org/413107
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=bfe89ff3c91f8446ba361a8a4c7e3312f3754ceb
Submitter: Jenkins
Branch: stable/newton

commit bfe89ff3c91f8446ba361a8a4c7e3312f3754ceb
Author: Sergii Golovatiuk <email address hidden>
Date: Tue Dec 20 12:23:07 2016 +0100

    Fix dependency issues between tasks

    DB tasks should wait for galera explicitly. There shouldn't be
    conditions when primary-database is done, allowing *-db tasks to create
    users while galera is syncing with other nodes. That causes split brain
    as *-db tasks create users in MyISAM tables which is not handled by
    Galera.

    Closes-Bug: #1636841
    Change-Id: I729ba0f2bfce1e731de36c932f5f0350e91adb22
    Signed-off-by: Sergii Golovatiuk <email address hidden>
    (cherry picked from commit 139c5f4fadd45c3d1430c38b6a3c7238612d4211)

This issue was fixed in the openstack/fuel-library 11.0.0.0rc1 release candidate.
