[ostf] HA tests failures after restarting entire cluster: data replication over mysql, galera environment state

Bug #1610180 reported by Andrey Lavrentyev
This bug affects 2 people
Affects              Status          Importance   Assigned to     Milestone
Fuel for OpenStack   Fix Committed   High         Alex Schultz
Mitaka               Fix Released    High         Alex Schultz

Bug Description

Detailed bug description:
HA tests failures after restarting entire cluster: data replication over mysql, galera environment state

AssertionError: Failed 2 OSTF tests; should fail 0 tests. Names of failed tests:
  - Check data replication over mysql (failure) Database creation failed Please refer to OpenStack logs for more details.
  - Check galera environment state (failure) Actual value - 3,

Auto acceptance failure: https://product-ci.infra.mirantis.net/job/9.x.acceptance.ubuntu.failover_group_3/2/testReport/%28root%29/shutdown_ceph_for_all/

Steps to reproduce:
Execute 'shutdown_ceph_for_all' test with:
1. Create a cluster with Neutron VXLAN, Ceph for all, Ceph replication factor 3
2. Add 3 controller, 2 compute, and 3 Ceph nodes
3. Verify Network
4. Deploy cluster
5. Verify networks
6. Run OSTF
7. Create 2 volumes and 2 instances with attached volumes
8. Fill Ceph storage up to 30% (15% for each instance)
9. Shut down all nodes
10. Wait 5 minutes
11. Start cluster
12. Wait until OSTF 'HA' suite passes
13. Verify networks
14. Run OSTF tests

Expected results:
OSTF HA passes

Actual result:
2 OSTF HA tests failed

Description of the environment:
9.1 snapshot #93
[root@nailgun log]# shotgun2 short-report
cat /etc/fuel_build_id:
 495
cat /etc/fuel_build_number:
 495
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 python-packetary-9.0.0-1.mos142.noarch
 fuel-migrate-9.0.0-1.mos8496.noarch
 fuel-release-9.0.0-1.mos6349.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-openstack-metadata-9.0.0-1.mos8748.noarch
 fuel-ostf-9.0.0-1.mos938.noarch
 nailgun-mcagents-9.0.0-1.mos753.noarch
 shotgun-9.0.0-1.mos90.noarch
 python-fuelclient-9.0.0-1.mos325.noarch
 fuel-9.0.0-1.mos6349.noarch
 fuel-library9.0-9.0.0-1.mos8496.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8748.noarch
 rubygem-astute-9.0.0-1.mos753.noarch
 fuel-setup-9.0.0-1.mos6349.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-misc-9.0.0-1.mos8496.noarch
 fuelmenu-9.0.0-1.mos275.noarch
 fuel-notify-9.0.0-1.mos8496.noarch
 fuel-nailgun-9.0.0-1.mos8748.noarch
 fuel-ui-9.0.0-1.mos2718.noarch
 fuel-mirror-9.0.0-1.mos142.noarch
 fuel-utils-9.0.0-1.mos8496.noarch

MOS_CENTOS_OS_MIRROR_ID: os-2016-06-23-135731
MOS_CENTOS_PROPOSED_MIRROR_ID: proposed-2016-08-04-102320
MOS_CENTOS_UPDATES_MIRROR_ID: updates-2016-06-23-135916
MOS_CENTOS_SECURITY_MIRROR_ID: security-2016-06-23-140002
MOS_CENTOS_HOLDBACK_MIRROR_ID: holdback-2016-06-23-140047
MOS_UBUNTU_MIRROR_ID: 9.0-2016-08-04-084321
UBUNTU_MIRROR_ID: ubuntu-2016-08-03-174238
CENTOS_MIRROR_ID: centos-7.2.1511-2016-05-31-083834

Logs: https://drive.google.com/open?id=0B5HPBFb7K7gXSTlnOGwzY010elk

Changed in fuel:
milestone: none → 9.1
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-library
Changed in fuel:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Bogdan, can you check this MySQL failure?

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The root cause is a split brain in the OCF logic:
2016-08-05T05:04:49.635061+00:00 node-5 ocf-mysql-wss.log info: INFO: p_mysqld: get_master() Possible masters: node-5.test.domain.local
2016-08-05T05:04:49.640088+00:00 node-5 ocf-mysql-wss.log info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
2016-08-05T05:04:50.044341+00:00 node-2 ocf-mysql-wss.log info: INFO: p_mysqld: get_master() Possible masters: node-2.test.domain.local node-5.test.domain.local
2016-08-05T05:04:50.049333+00:00 node-2 ocf-mysql-wss.log info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
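
For context: these log lines come from the ocf-mysql-wss resource agent in fuel-library. The following is only a minimal bash sketch of how such a race can arise, under the assumption (illustrative, not the agent's actual implementation) that every node publishes its Galera GTID as a pacemaker node attribute named "gtid" and that master determination picks the node(s) with the best published value:

    #!/bin/bash
    # Sketch of the race window, NOT the real ocf-mysql-wss code.
    # Assumed for illustration only: peers publish their GTID as a pacemaker
    # node attribute called "gtid" before the election runs.
    get_master_sketch() {
        local node gtid best="" masters=""
        for node in $(crm_node --partition); do
            # Peers that have not published a GTID yet are silently skipped.
            # When all nodes boot at the same moment, each node may see only
            # its own GTID here -- that is the race window.
            gtid=$(crm_attribute --query --quiet --node "$node" --name gtid 2>/dev/null)
            [ -z "$gtid" ] && continue
            if [ -z "$best" ] || [ "$gtid" \> "$best" ]; then  # lexical compare, illustration only
                best=$gtid
                masters=$node
            elif [ "$gtid" = "$best" ]; then
                masters="$masters $node"
            fi
        done
        echo "Possible masters: $masters"
        # A node whose candidate list contains only itself concludes it is the
        # Primary Component -- exactly the "Join me!" lines seen above.
        if [ "$masters" = "$(crm_node --name)" ]; then
            echo "I'm Primary Component. Join me!"
        fi
    }

With three controllers started at the same time, two of them can run this check a few hundred milliseconds apart and each see a different candidate set, which is what the node-2/node-5 log lines above show.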

Revision history for this message
ElenaRossokhina (esolomina) wrote :

There is a similar issue on CI during repetitive restarts:
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.repetitive_restart/20/testReport/(root)/ceph_partitions_repetitive_cold_restart/

Scenario:
1. Revert snapshot 'prepare_load_ceph_ha'
2. Wait until MySQL Galera is UP on some controller
3. Check Ceph status
4. Run ostf
5. Fill ceph partitions on all nodes up to 30%
6. Check Ceph status
7. Disable UMM
8. Run RALLY
9. Perform 100 repetitive cold reboots (this step fails during the OSTF check). Names of failed tests:
  - Check data replication over mysql (failure) Can not get data from database node node-5 Please refer to OpenStack logs for more details.
  - Check galera environment state (failure) Actual value - 3,

A brief investigation revealed MySQL problems and similar messages in ocf-mysql-wss.log.

Full logs are available at https://drive.google.com/open?id=0B2ag_Bf-ShtTcFFmZXRTMDZFNlE

tags: added: swarm-fail
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Alex, please check this.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

OCF MySQL split brain: node-4 and node-5 both thought they were primary because the primary determination happened ~288 ms apart.

node-3.test.domain.local/ocf-mysql-wss.log:2016-08-08T06:29:18.434981+00:00 info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
node-3.test.domain.local/ocf-mysql-wss.log:2016-08-08T07:43:02.305038+00:00 info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
node-4.test.domain.local/ocf-mysql-wss.log:2016-08-08T07:58:21.933554+00:00 info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
node-5.test.domain.local/ocf-mysql-wss.log:2016-08-08T07:32:24.277443+00:00 info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!
node-5.test.domain.local/ocf-mysql-wss.log:2016-08-08T07:58:21.645710+00:00 info: INFO: p_mysqld: check_if_galera_pc(): I'm Primary Component. Join me!

I suppose we could attempt to do some sort of random sleep/retry when only one node returns a GTID to make sure we don't end up with two nodes trying to claim master at once.
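
The fix that was eventually merged (see the reviews below) takes this approach: the GTID fetch done during master determination is retried with a random 1-10 second sleep when too few peers have reported. A hedged bash sketch of the idea, where fetch_peer_gtids is a hypothetical stand-in for the agent's real pacemaker attribute query:

    # Sketch only: "fetch_peer_gtids" is a hypothetical helper that prints one
    # "node:gtid" line per node that has already published a GTID.
    fetch_peer_gtids_with_retry() {
        local attempt gtids
        for attempt in 1 2 3; do
            gtids=$(fetch_peer_gtids)
            # More than one node reported a GTID: the election has enough
            # information to pick a single master, so stop retrying.
            if [ "$(printf '%s\n' "$gtids" | grep -c ':')" -gt 1 ]; then
                break
            fi
            # Only our own GTID is visible: back off for a random 1-10 seconds
            # so peers get a chance to publish theirs before we claim master.
            sleep $(( (RANDOM % 10) + 1 ))
        done
        printf '%s\n' "$gtids"
    }

The random component is what matters: a fixed sleep would only shift the collision, while desynchronizing the nodes makes it likely that at least one of them sees the others' GTIDs before declaring itself Primary Component.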

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/356122

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/356123

Changed in fuel:
status: Confirmed → In Progress
no longer affects: fuel/newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/356123
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=9df2fb42852d0790adab854a633200edb161bac4
Submitter: Jenkins
Branch: master

commit 9df2fb42852d0790adab854a633200edb161bac4
Author: Alex Schultz <email address hidden>
Date: Tue Aug 16 13:45:33 2016 -0600

    Add retry to master gtid query

    If all the mysql nodes are booted at the exact same time, we can end up
    with a situation where the master determination can occur almost at the
    same time. This change updates the gtid fetching that is done during
    master determination to include a retry with a random 1-10 second sleep
    in an attempt to allow for the other nodes to update pacemaker with
    their gtid information.

    Change-Id: Ib12fb927391857ca9e3fb0a3ee45a7eec9e7913e
    Closes-Bug: #1610180

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/356122
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=f6c1191ed41e5d0a8c456329be7084f644e531a7
Submitter: Jenkins
Branch: stable/mitaka

commit f6c1191ed41e5d0a8c456329be7084f644e531a7
Author: Alex Schultz <email address hidden>
Date: Tue Aug 16 13:45:33 2016 -0600

    Add retry to master gtid query

    If all the mysql nodes are booted at the exact same time, we can end up
    with a situation where the master determination can occur almost at the
    same time. This change updates the gtid fetching that is done during
    master determination to include a retry with a random 1-10 second sleep
    in an attempt to allow for the other nodes to update pacemaker with
    their gtid information.

    Change-Id: Ib12fb927391857ca9e3fb0a3ee45a7eec9e7913e
    Closes-Bug: #1610180

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0rc1

This issue was fixed in the openstack/fuel-library 10.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0

This issue was fixed in the openstack/fuel-library 10.0.0 release.
