[ostf] HA tests failures after restarting entire cluster: data replication over mysql, galera environment state

Bug #1610180 reported by Andrey Lavrentyev
This bug affects 2 people
Affects              Status          Importance   Assigned to     Milestone
Fuel for OpenStack   Fix Committed   High         Alex Schultz
Mitaka               Fix Released    High         Alex Schultz

Bug Description

Detailed bug description:
HA tests failures after restarting entire cluster: data replication over mysql, galera environment state

AssertionError: Failed 2 OSTF tests; should fail 0 tests. Names of failed tests:
  - Check data replication over mysql (failure) Database creation failed Please refer to OpenStack logs for more details.
  - Check galera environment state (failure) Actual value - 3,

Auto acceptance failure: https://product-ci.infra.mirantis.net/job/9.x.acceptance.ubuntu.failover_group_3/2/testReport/%28root%29/shutdown_ceph_for_all/

Steps to reproduce:
Execute 'shutdown_ceph_for_all' test with:
1. Create a cluster with Neutron VXLAN, Ceph for all, Ceph replication factor 3
2. Add 3 controller, 2 compute, and 3 Ceph nodes
3. Verify Network
4. Deploy cluster
5. Verify networks
6. Run OSTF
7. Create 2 volumes and 2 instances with attached volumes
8. Fill Ceph storage up to 30% (15% for each instance)
9. Shut down all nodes
10. Wait 5 minutes
11. Start cluster
12. Wait until OSTF 'HA' suite passes
13. Verify networks
14. Run OSTF tests

Expected results:
OSTF HA passes

Actual result:
2 OSTF HA tests failed

Description of the environment:
9.1 snapshot #93
[root@nailgun log]# shotgun2 short-report
cat /etc/fuel_build_id:
 495
cat /etc/fuel_build_number:
 495
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 python-packetary-9.0.0-1.mos142.noarch
 fuel-migrate-9.0.0-1.mos8496.noarch
 fuel-release-9.0.0-1.mos6349.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-openstack-metadata-9.0.0-1.mos8748.noarch
 fuel-ostf-9.0.0-1.mos938.noarch
 nailgun-mcagents-9.0.0-1.mos753.noarch
 shotgun-9.0.0-1.mos90.noarch
 python-fuelclient-9.0.0-1.mos325.noarch
 fuel-9.0.0-1.mos6349.noarch
 fuel-library9.0-9.0.0-1.mos8496.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8748.noarch
 rubygem-astute-9.0.0-1.mos753.noarch
 fuel-setup-9.0.0-1.mos6349.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-misc-9.0.0-1.mos8496.noarch
 fuelmenu-9.0.0-1.mos275.noarch
 fuel-notify-9.0.0-1.mos8496.noarch
 fuel-nailgun-9.0.0-1.mos8748.noarch
 fuel-ui-9.0.0-1.mos2718.noarch
 fuel-mirror-9.0.0-1.mos142.noarch
 fuel-utils-9.0.0-1.mos8496.noarch

MOS_CENTOS_OS_MIRROR_ID: os-2016-06-23-135731
MOS_CENTOS_PROPOSED_MIRROR_ID: proposed-2016-08-04-102320
MOS_CENTOS_UPDATES_MIRROR_ID: updates-2016-06-23-135916
MOS_CENTOS_SECURITY_MIRROR_ID: security-2016-06-23-140002
MOS_CENTOS_HOLDBACK_MIRROR_ID: holdback-2016-06-23-140047
MOS_UBUNTU_MIRROR_ID: 9.0-2016-08-04-084321
UBUNTU_MIRROR_ID: ubuntu-2016-08-03-174238
CENTOS_MIRROR_ID: centos-7.2.1511-2016-05-31-083834

Logs: https://drive.google.com/open?id=0B5HPBFb7K7gXSTlnOGwzY010elk

Changed in fuel:
milestone: none → 9.1
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-library
Changed in fuel:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Bogdan, can you check this MySQL failure?

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The root cause is a split brain in the OCF logic:
2016-08-05T05:04:49.635061+00:00 node-5 ocf-mysql-wss.log info: INFO: p_mysqld: get_master() Possible masters: node-5.test.domain.local
2016-08-05T05:04:49.640088+00:00 node-5 ocf-mysql-wss.log info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
2016-08-05T05:04:50.044341+00:00 node-2 ocf-mysql-wss.log info: INFO: p_mysqld: get_master() Possible masters: node-2.test.domain.local node-5.test.domain.local
2016-08-05T05:04:50.049333+00:00 node-2 ocf-mysql-wss.log info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
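
For context: these log lines come from the ocf-mysql-wss resource agent in fuel-library. The following is only a minimal bash sketch of how such a race can arise, under the assumption (illustrative, not the agent's actual implementation) that every node publishes its Galera GTID as a pacemaker node attribute named "gtid" and that master determination picks the node(s) with the best published value:

    #!/bin/bash
    # Sketch of the race window, NOT the real ocf-mysql-wss code.
    # Assumed for illustration only: peers publish their GTID as a pacemaker
    # node attribute called "gtid" before the election runs.
    get_master_sketch() {
        local node gtid best="" masters=""
        for node in $(crm_node --partition); do
            # Peers that have not published a GTID yet are silently skipped.
            # When all nodes boot at the same moment, each node may see only
            # its own GTID here -- that is the race window.
            gtid=$(crm_attribute --query --quiet --node "$node" --name gtid 2>/dev/null)
            [ -z "$gtid" ] && continue
            if [ -z "$best" ] || [ "$gtid" \> "$best" ]; then  # lexical compare, illustration only
                best=$gtid
                masters=$node
            elif [ "$gtid" = "$best" ]; then
                masters="$masters $node"
            fi
        done
        echo "Possible masters: $masters"
        # A node whose candidate list contains only itself concludes it is the
        # Primary Component -- exactly the "Join me!" lines seen above.
        if [ "$masters" = "$(crm_node --name)" ]; then
            echo "I'm Primary Component. Join me!"
        fi
    }

With three controllers started at the same time, two of them can run this check a few hundred milliseconds apart and each see a different candidate set, which is what the node-2/node-5 log lines above show.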

Revision history for this message
ElenaRossokhina (esolomina) wrote :

There is a similar issue on CI during repetitive restarts:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.repetitive_restart/20/testReport/(root)/ceph_partitions_repetitive_cold_restart/

Scenario:
1. Revert snapshot 'prepare_load_ceph_ha'
2. Wait until MySQL Galera is UP on some controller
3. Check Ceph status
4. Run ostf
5. Fill ceph partitions on all nodes up to 30%
6. Check Ceph status
7. Disable UMM
8. Run RALLY
9. Perform 100 repetitive cold reboots (this step fails during the OSTF check). Names of failed tests:
  - Check data replication over mysql (failure) Can not get data from database node node-5 Please refer to OpenStack logs for more details.
  - Check galera environment state (failure) Actual value - 3,

A brief investigation revealed MySQL problems and similar messages in ocf-mysql-wss.log.

Full logs are available at https://drive.google.com/open?id=0B2ag_Bf-ShtTcFFmZXRTMDZFNlE

tags: added: swarm-fail
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Alex, please check this.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

OCF MySQL split brain: node-4 and node-5 both thought they were primary because the primary determination happened ~288 ms apart.

node-3.test.domain.local/ocf-mysql-wss.log:2016-08-08T06:29:18.434981+00:00 info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
node-3.test.domain.local/ocf-mysql-wss.log:2016-08-08T07:43:02.305038+00:00 info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
node-4.test.domain.local/ocf-mysql-wss.log:2016-08-08T07:58:21.933554+00:00 info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
node-5.test.domain.local/ocf-mysql-wss.log:2016-08-08T07:32:24.277443+00:00 info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
node-5.test.domain.local/ocf-mysql-wss.log:2016-08-08T07:58:21.645710+00:00 info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!

I suppose we could attempt to do some sort of random sleep/retry when only one node returns a GTID to make sure we don't end up with two nodes trying to claim master at once.
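
The fix that was eventually merged (see the reviews below) takes this approach: the GTID fetch done during master determination is retried with a random 1-10 second sleep when too few peers have reported. A hedged bash sketch of the idea, where fetch_peer_gtids is a hypothetical stand-in for the agent's real pacemaker attribute query:

    # Sketch only: "fetch_peer_gtids" is a hypothetical helper that prints one
    # "node:gtid" line per node that has already published a GTID.
    fetch_peer_gtids_with_retry() {
        local attempt gtids
        for attempt in 1 2 3; do
            gtids=$(fetch_peer_gtids)
            # More than one node reported a GTID: the election has enough
            # information to pick a single master, so stop retrying.
            if [ "$(printf '%s\n' "$gtids" | grep -c ':')" -gt 1 ]; then
                break
            fi
            # Only our own GTID is visible: back off for a random 1-10 seconds
            # so peers get a chance to publish theirs before we claim master.
            sleep $(( (RANDOM % 10) + 1 ))
        done
        printf '%s\n' "$gtids"
    }

The random component is what matters: a fixed sleep would only shift the collision, while desynchronizing the nodes makes it likely that at least one of them sees the others' GTIDs before declaring itself Primary Component.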

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/356122

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/356123

Changed in fuel:
status: Confirmed → In Progress
no longer affects: fuel/newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/356123
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=9df2fb42852d0790adab854a633200edb161bac4
Submitter: Jenkins
Branch: master

commit 9df2fb42852d0790adab854a633200edb161bac4
Author: Alex Schultz <email address hidden>
Date: Tue Aug 16 13:45:33 2016 -0600

    Add retry to master gtid query

    If all the mysql nodes are booted at the exact same time, we can end up
    with a situation where the master determination can occur almost at the
    same time. This change updates the gtid fetching that is done during
    master determination to include a retry with a random 1-10 second sleep
    in an attempt to allow for the other nodes to update pacemaker with
    their gtid information.

    Change-Id: Ib12fb927391857ca9e3fb0a3ee45a7eec9e7913e
    Closes-Bug: #1610180

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/356122
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=f6c1191ed41e5d0a8c456329be7084f644e531a7
Submitter: Jenkins
Branch: stable/mitaka

commit f6c1191ed41e5d0a8c456329be7084f644e531a7
Author: Alex Schultz <email address hidden>
Date: Tue Aug 16 13:45:33 2016 -0600

    Add retry to master gtid query

    If all the mysql nodes are booted at the exact same time, we can end up
    with a situation where the master determination can occur almost at the
    same time. This change updates the gtid fetching that is done during
    master determination to include a retry with a random 1-10 second sleep
    in an attempt to allow for the other nodes to update pacemaker with
    their gtid information.

    Change-Id: Ib12fb927391857ca9e3fb0a3ee45a7eec9e7913e
    Closes-Bug: #1610180

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0rc1

This issue was fixed in the openstack/fuel-library 10.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0

This issue was fixed in the openstack/fuel-library 10.0.0 release.
