Cluster cannot recover after cold reboot

Bug #1585506 reported by ElenaRossokhina
Affects              Status    Importance  Assigned to       Milestone
Fuel for OpenStack   Invalid   High        Bogdan Dobrelya
Mitaka               Invalid   Undecided   Unassigned
Newton               Invalid   High        Bogdan Dobrelya

Bug Description

Detailed bug description:
The cluster cannot recover after a cold restart (even after several hours)
Steps to reproduce:
https://mirantis.testrail.com/index.php?/cases/view/842819
Expected results:
all steps passed

Actual result:
The HA check set fails:
[root@nailgun ~]# fuel health --env 1 --check ha
[ 1 of 7] [success] 'Check state of haproxy backends on controllers' (1.57 s)
[ 2 of 7] [failure] 'Check data replication over mysql' (2.937 s) Can not get data from database node node-2 Please refer to OpenStack logs for more details.
[ 3 of 7] [success] 'Check if amount of tables in databases is the same on each node' (3.782 s)
[ 4 of 7] [failure] 'Check galera environment state' (0.6464 s) Actual value - 3,
[ 5 of 7] [success] 'Check pacemaker status' (1.586 s)
[ 6 of 7] [success] 'RabbitMQ availability' (13.78 s)
[ 7 of 7] [success] 'RabbitMQ replication' (32.69 s)

Full snapshot: https://drive.google.com/a/mirantis.com/file/d/0B2ag_Bf-ShtTUjlCSjFRejBOYkE/view?usp=sharing

[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 376
cat /etc/fuel_build_number:
 376
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6346.noarch
 fuel-bootstrap-cli-9.0.0-1.mos281.noarch
 fuel-migrate-9.0.0-1.mos8376.noarch
 rubygem-astute-9.0.0-1.mos745.noarch
 fuel-misc-9.0.0-1.mos8376.noarch
 network-checker-9.0.0-1.mos72.x86_64
 fuel-mirror-9.0.0-1.mos136.noarch
 fuel-openstack-metadata-9.0.0-1.mos8693.noarch
 fuel-notify-9.0.0-1.mos8376.noarch
 nailgun-mcagents-9.0.0-1.mos745.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8693.noarch
 python-fuelclient-9.0.0-1.mos315.noarch
 fuelmenu-9.0.0-1.mos270.noarch
 fuel-9.0.0-1.mos6346.noarch
 fuel-utils-9.0.0-1.mos8376.noarch
 fuel-setup-9.0.0-1.mos6346.noarch
 fuel-library9.0-9.0.0-1.mos8376.noarch
 shotgun-9.0.0-1.mos88.noarch
 fuel-agent-9.0.0-1.mos281.noarch
 fuel-ui-9.0.0-1.mos2688.noarch
 fuel-ostf-9.0.0-1.mos934.noarch
 python-packetary-9.0.0-1.mos136.noarch
 fuel-nailgun-9.0.0-1.mos8693.noarch

Revision history for this message
ElenaRossokhina (esolomina) wrote :

I've investigated the issue more deeply and it seems similar to https://bugs.launchpad.net/fuel/+bug/1563899
Part of ostf.log: http://paste.openstack.org/show/505301/

Ilya Kutukov (ikutukov)
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
Ilya Kutukov (ikutukov)
Changed in fuel:
milestone: none → 9.0
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
tags: added: area-python
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I cannot download the logs.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
ElenaRossokhina (esolomina) wrote :

Sorry, the logs have been uploaded again, please check:
https://drive.google.com/open?id=0B2ag_Bf-ShtTZjV0X3FodWxURXM

Changed in fuel:
status: Incomplete → Confirmed
tags: added: area-library galera
removed: area-python
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

A few comments on how the DB OSTF failures may be treated: they are either expected given the current cluster state and not fatal (no full downtime is associated with them), or they indicate correctness/availability issues, which are fatal failures.

There are 3 types of failures logged, for example:

Type 1:
 ../ostf/cluster_1_ha.log:2016-05-24 16:00:51 FAILURE Check data replication over mysql (fuel_health.tests.ha.test_mysql_replication.TestMysqlReplication.test_mysql_replication) Can not get data from database node node-2 Please refer to OpenStack logs for more details. File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 67, in testPartExecutor

This is only about this type of failure:
  SSHExecCommandFailed: Command 'mysql -h localhost -e "SELECT * FROM ost1601.ost474 WHERE data = "9026077276"" ', exit status: 1, Error:
  ERROR 1146 (42S02) at line 1: Table 'ost1601.ost474' doesn't exist

The test creates a table from one node, then checks it from another node. I'm not sure this test is correct: firstly, we don't use A/A for Galera yet, so there is no sense in verifying data in A/A mode. Second, I need more details on how *exactly* it executes the SELECT after the create/insert/update, e.g. does it use transactions with RR isolation? Does it wait for sync (http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-sync-wait)?
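
For illustration only, a minimal sketch of a causality-safe read, assuming the test issues plain mysql commands over SSH as the quoted error suggests and the wsrep build supports wsrep_sync_wait (see the link above); the table and value are the ones from the quoted error and serve only as placeholders:

 # wait for all currently known write-sets to be applied before the SELECT
 mysql -h localhost -e "SET SESSION wsrep_sync_wait = 1; SELECT * FROM ost1601.ost474 WHERE data = '9026077276';"

With wsrep_sync_wait = 1, the node delays the SELECT until it has applied the write-sets it knows about at that moment, so a row inserted on another node just before the check should already be visible.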

Depending on the results of additional investigation (TBD), the test case may have to be reworked.

Type 2:
 ../ostf/cluster_1_ha.log:2016-05-24 16:00:56 FAILURE Check galera environment state (fuel_health.tests.ha.test_mysql_status.TestMysqlStatus.test_state_of_galera_cluster) Actual value - 3, File "/usr/lib/python2.7/site-packages/unittest2/case.py", line 67, in testPartExecutor
  (wsrep_cluster_size 2, while 3 was expected)

This only means that 1 of 3 nodes was unexpectedly in the cluster. I believe we can just ignore this; as long as there are active backends, we're fine to go.
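
For reference, a minimal manual check of the Galera membership on a controller (a sketch, assuming local root access to mysql, as the OSTF test itself uses):

 # expect wsrep_cluster_size = 3 and wsrep_cluster_status = Primary once all controllers have rejoined
 mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'; SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"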

Type 3:
 ../ostf/cluster_1_ha.log:2016-05-24 15:52:58 FAILURE Check data replication over mysql (fuel_health.tests.ha.test_mysql_replication.TestMysqlReplication.test_mysql_replication) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct

This likely means there are no active backends left, which is definitely a fatal failure and must be investigated.
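
A minimal sketch of how the active backends can be inspected on a controller; the haproxy admin socket path below is an assumption and may need adjusting to the deployment:

 # list the mysqld backend servers and their UP/DOWN state via the haproxy stats socket
 echo "show stat" | socat unix-connect:/var/lib/haproxy/stats stdio | grep mysqld

If no mysqld server line reports UP, that matches the "no active backends" interpretation above.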

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

A little fix: "...This is only means that 1 of 3 nodes was unexpectedly *not* in the cluster."

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Here is the RCA: http://pastebin.com/d1W6GvgC, done with the aforementioned 3 types of failures in mind.

By results:
* Fatal OSTF failures of "Type 3" are expected with the current cluster state, which has yet to be reassembled after the cold restart. This fits the SLA for DB cluster recovery after a cold restart, which is from 5 to 20 minutes, IIRC (a sketch of how to watch the reassembly follows this list).
* OSTF failures of "Type 2" are expected with the recent cluster state, i.e. 1 of 3 nodes is down; not fatal, just ignore this.
* OSTF failures of "Type 1" are only in scope for A/A Galera, and we use A/P, so not fatal, just ignore this.
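
For reference, a minimal way to follow the recovery on a controller while waiting for that SLA window (crm_mon ships with pacemaker; the exact resource names depend on the deployment):

 # one-shot cluster status; re-run until the mysqld/galera clone is reported as Started on all controllers
 crm_mon -1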

Moving to Invalid, and opening an OSTF bug for Type 1 to be reworked.

Changed in fuel:
status: Confirmed → Invalid