[regression] 9.0 MOS Mitaka fails to deploy with nodes in error state
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Committed | Critical | Bogdan Dobrelya |
Mitaka | In Progress | Critical | Bogdan Dobrelya |
Bug Description
Since about ISO #262 of 9.0 MOS Mitaka, deployment of the environment has started to fail. In about 50% of cases the deployment fails with nodes stuck in the error state.
The following error is seen in the Pacemaker logs on one of the affected environments:
warning: status_from_rc: Action 61 (p_mysqld_start_0) on node-2.
Diagnostic snapshot from the env is attached.
The problem appears only on environments with 3 controllers; test environments with a single controller have deployed fine so far.
To reproduce the problem, deploy an environment from a 9.0 Mitaka ISO #262 or later. The environment should contain at least 3 controllers, one compute, and one storage node (Cinder or Ceph, it doesn't matter).
Changed in fuel:
assignee: nobody → Fuel Library (Deprecated) (fuel-library)
importance: Undecided → Critical
summary: 9.0 mitaka mos is failed to deploy with nodes in error state → [regression] 9.0 mitaka mos is failed to deploy with nodes in error state
tags: added: mysql
no longer affects: fuel/newton
It looks like the deployment fails because Fuel can't set up the MySQL and RabbitMQ clusters in HA mode; many Puppet errors like the following appear in the logs:
(/Stage[main]/Cluster::Mysql/Exec[wait-initial-sync]) Failed to call refresh: mysql -uclustercheck -pd83wmbYJXsbpeo9ZCJz3sJI9 -Nbe "show status like 'wsrep_local_state_comment'" | grep -q -e Synced && sleep 10 returned 1 instead of one of [0]
Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)
(/Stage[main]/Rabbitmq::Install::Rabbitmqadmin/Staging::File[rabbitmqadmin]/Exec[/var/lib/rabbitmq/rabbitmqadmin]/returns) change from notrun to 0 failed: curl -k --noproxy localhost --retry 30 --retry-delay 6 -f -L -o /var/lib/rabbitmq/rabbitmqadmin http://nova:kOVE8QMA6taaje1mZZjHCbZe@localhost:15672/cli/rabbitmqadmin returned 7 instead of one of [0]
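For reference, the failing wait-initial-sync exec boils down to asking MySQL for wsrep_local_state_comment and succeeding only when the node reports Synced. A minimal sketch of that check is below; the `is_synced` helper is hypothetical (the real exec pipes live `mysql` output straight into grep, and the clustercheck credentials are environment-specific):

```shell
#!/bin/sh
# Sketch of the Galera readiness check behind the failing Puppet exec.
# In the real deployment the status line would come from something like:
#   mysql -uclustercheck -p<password> -Nbe "show status like 'wsrep_local_state_comment'"
# Here is_synced (a hypothetical helper) just inspects a status line,
# so the logic can be exercised without a live cluster.

is_synced() {
    # Succeeds (exit 0) only when the line reports the Synced state,
    # mirroring the `grep -q -e Synced` in the Puppet exec.
    echo "$1" | grep -q -e Synced
}

if is_synced "wsrep_local_state_comment	Synced"; then
    echo "node is synced"
else
    echo "node is NOT synced"
fi
```

On the broken environments this check keeps returning non-zero (the node never reaches Synced, or mysqld is not reachable at all), which is why the exec retries and eventually fails.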
This is a blocker for the QA team because many of the test configurations we use to run Tempest tests fail with this error.
Priority changed to Critical because deployment of the environment fails and we don't have a workaround yet.