Cluster cannot recover after cold reboot

Bug #1585506 reported by ElenaRossokhina
Affects              Status    Importance  Assigned to       Milestone
Fuel for OpenStack   Invalid   High        Bogdan Dobrelya
Mitaka               Invalid   Undecided   Unassigned
Newton               Invalid   High        Bogdan Dobrelya

Bug Description

Detailed bug description:
The cluster cannot recover after a cold restart (even after several hours)
Steps to reproduce:
https://mirantis.testrail.com/index.php?/cases/view/842819
Expected results:
all steps passed

Actual result:
The HA check set fails:
[root@nailgun ~]# fuel health --env 1 --check ha
[ 1 of 7] [success] 'Check state of haproxy backends on controllers' (1.57 s)
[ 2 of 7] [failure] 'Check data replication over mysql' (2.937 s) Can not get data from database node node-2 Please refer to OpenStack logs for more details.
[ 3 of 7] [success] 'Check if amount of tables in databases is the same on each node' (3.782 s)
[ 4 of 7] [failure] 'Check galera environment state' (0.6464 s) Actual value - 3,
[ 5 of 7] [success] 'Check pacemaker status' (1.586 s)
[ 6 of 7] [success] 'RabbitMQ availability' (13.78 s)
[ 7 of 7] [success] 'RabbitMQ replication' (32.69 s)

Full snapshot: https://drive.google.com/a/mirantis.com/file/d/0B2ag_Bf-ShtTUjlCSjFRejBOYkE/view?usp=sharing

[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 376
cat /etc/fuel_build_number:
 376
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6346.noarch
 fuel-bootstrap-cli-9.0.0-1.mos281.noarch
 fuel-migrate-9.0.0-1.mos8376.noarch
 rubygem-astute-9.0.0-1.mos745.noarch
 fuel-misc-9.0.0-1.mos8376.noarch
 network-checker-9.0.0-1.mos72.x86_64
 fuel-mirror-9.0.0-1.mos136.noarch
 fuel-openstack-metadata-9.0.0-1.mos8693.noarch
 fuel-notify-9.0.0-1.mos8376.noarch
 nailgun-mcagents-9.0.0-1.mos745.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8693.noarch
 python-fuelclient-9.0.0-1.mos315.noarch
 fuelmenu-9.0.0-1.mos270.noarch
 fuel-9.0.0-1.mos6346.noarch
 fuel-utils-9.0.0-1.mos8376.noarch
 fuel-setup-9.0.0-1.mos6346.noarch
 fuel-library9.0-9.0.0-1.mos8376.noarch
 shotgun-9.0.0-1.mos88.noarch
 fuel-agent-9.0.0-1.mos281.noarch
 fuel-ui-9.0.0-1.mos2688.noarch
 fuel-ostf-9.0.0-1.mos934.noarch
 python-packetary-9.0.0-1.mos136.noarch
 fuel-nailgun-9.0.0-1.mos8693.noarch

Revision history for this message
ElenaRossokhina (esolomina) wrote :

I've investigated the issue more deeply and it seems similar to https://bugs.launchpad.net/fuel/+bug/1563899
Part of ostf.log: http://paste.openstack.org/show/505301/

Ilya Kutukov (ikutukov)
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
Ilya Kutukov (ikutukov)
Changed in fuel:
milestone: none → 9.0
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-python
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I cannot download the logs.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
ElenaRossokhina (esolomina) wrote :

Sorry, the logs have been uploaded again, please check:
https://drive.google.com/open?id=0B2ag_Bf-ShtTZjV0X3FodWxURXM

Changed in fuel:
status: Incomplete → Confirmed
tags: added: area-library galera
removed: area-python
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

A few comments on how the DB OSTF failures may be treated: they are either expected given the current cluster state and not fatal (no full downtime is associated with them), or they indicate correctness/availability issues, which are fatal failures.

There are 3 types of failures logged, for example:

Type 1:
 ../ostf/cluster_1_ha.log:2016-05-24 16:00:51 FAILURE Check data replication over mysql (fuel_health.tests.ha.test_mysql_replication.TestMysqlReplication.test_mysql_replication) Can not get data from database node node-2 Please refer to OpenStack logs for more details. File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 67, in testPartExecutor

This is only about this type of failure:
  SSHExecCommandFailed: Command 'mysql -h localhost -e "SELECT * FROM ost1601.ost474 WHERE data = "9026077276"" ', exit status: 1, Error:
  ERROR 1146 (42S02) at line 1: Table 'ost1601.ost474' doesn't exist

The test creates a table from one node, then checks it from another node. I'm not sure this test is correct: firstly, we don't use A/A for Galera yet, so there is no sense in verifying data in A/A mode. Second, I need more details on how *exactly* it executes the SELECT after the create/insert/update, e.g. does it use transactions with RR isolation? Does it wait for sync (http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-sync-wait)?
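
For illustration only, a minimal sketch of a causality-safe read, assuming the test issues plain mysql commands over SSH as the quoted error suggests and the wsrep build supports wsrep_sync_wait (see the link above); the table and value are the ones from the quoted error and serve only as placeholders:

 # wait for all currently known write-sets to be applied before the SELECT
 mysql -h localhost -e "SET SESSION wsrep_sync_wait = 1; SELECT * FROM ost1601.ost474 WHERE data = '9026077276';"

With wsrep_sync_wait = 1, the node delays the SELECT until it has applied the write-sets it knows about at that moment, so a row inserted on another node just before the check should already be visible.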

Depending on the results of additional investigation (TBD), the test case may have to be reworked.

Type 2:
 ../ostf/cluster_1_ha.log:2016-05-24 16:00:56 FAILURE Check galera environment state (fuel_health.tests.ha.test_mysql_status.TestMysqlStatus.test_state_of_galera_cluster) Actual value - 3, File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 67, in testPartExecutor
  (wsrep_cluster_size 2, while 3 was expected)

This only means that 1 of 3 nodes was unexpectedly in the cluster. I believe we can just ignore this; as long as there are active backends, we're fine to go.
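
For reference, a minimal manual check of the Galera membership on a controller (a sketch, assuming local root access to mysql, as the OSTF test itself uses):

 # expect wsrep_cluster_size = 3 and wsrep_cluster_status = Primary once all controllers have rejoined
 mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'; SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"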

Type 3:
 ../ostf/cluster_1_ha.log:2016-05-24 15:52:58 FAILURE Check data replication over mysql (fuel_health.tests.ha.test_mysql_replication.TestMysqlReplication.test_mysql_replication) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct

This likely means there are no active backends left, which is definitely a fatal failure and must be investigated.
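
A minimal sketch of how the active backends can be inspected on a controller; the haproxy admin socket path below is an assumption and may need adjusting to the deployment:

 # list the mysqld backend servers and their UP/DOWN state via the haproxy stats socket
 echo "show stat" | socat unix-connect:/var/lib/haproxy/stats stdio | grep mysqld

If no mysqld server line reports UP, that matches the "no active backends" interpretation above.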

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

A little fix: "...This is only means that 1 of 3 nodes was unexpectedly *not* in the cluster."

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here is the RCA: http://pastebin.com/d1W6GvgC, done with the aforementioned 3 types of failures in mind.

By results:
* Fatal OSTF failures of "Type 3" are expected with the current cluster state, which has yet to be reassembled after the cold restart. This fits the SLA for DB cluster recovery after a cold restart, which is from 5 to 20 minutes, IIRC (a sketch of how to watch the reassembly follows this list).
* OSTF failures of "Type 2" are expected with the recent cluster state, i.e. 1 of 3 nodes is down; not fatal, just ignore this.
* OSTF failures of "Type 1" are only in scope for A/A Galera, and we use A/P, so not fatal, just ignore this.
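
For reference, a minimal way to follow the recovery on a controller while waiting for that SLA window (crm_mon ships with pacemaker; the exact resource names depend on the deployment):

 # one-shot cluster status; re-run until the mysqld/galera clone is reported as Started on all controllers
 crm_mon -1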

Moving to Invalid, and opening an OSTF bug for Type 1 to be reworked.

Changed in fuel:
status: Confirmed → Invalid