[regression] 9.0 MOS Mitaka fails to deploy with nodes in error state
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Committed | Critical | Bogdan Dobrelya |
Mitaka | In Progress | Critical | Bogdan Dobrelya |
Bug Description
Since about ISO #262 of 9.0 MOS Mitaka, deployment of the environment has started to fail. In about 50% of cases the deployment fails with nodes stuck in the error state.
The following error is seen in the Pacemaker logs on one of the affected environments:
warning: status_from_rc: Action 61 (p_mysqld_start_0) on node-2.
Diagnostic snapshot from the env is attached.
The problem appears only on environments with 3 controllers; test environments with a single controller have deployed fine so far.
To reproduce the problem, deploy an environment from a 9.0 Mitaka ISO #262 or later. The environment should contain at least 3 controllers, one compute, and one storage node (Cinder or Ceph, it doesn't matter).
Changed in fuel:
assignee: nobody → Fuel Library (Deprecated) (fuel-library)
importance: Undecided → Critical
summary: 9.0 mitaka mos is failed to deploy with nodes in error state → [regression] 9.0 mitaka mos is failed to deploy with nodes in error state
tags: added: mysql
no longer affects: fuel/newton
It looks like the deployment fails because Fuel can't set up the MySQL and RabbitMQ clusters in HA mode; many Puppet errors like the following appear in the logs:
(/Stage[main]/Cluster::Mysql/Exec[wait-initial-sync]) Failed to call refresh: mysql -uclustercheck -pd83wmbYJXsbpeo9ZCJz3sJI9 -Nbe "show status like 'wsrep_local_state_comment'" | grep -q -e Synced && sleep 10 returned 1 instead of one of [0]
Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)
(/Stage[main]/Rabbitmq::Install::Rabbitmqadmin/Staging::File[rabbitmqadmin]/Exec[/var/lib/rabbitmq/rabbitmqadmin]/returns) change from notrun to 0 failed: curl -k --noproxy localhost --retry 30 --retry-delay 6 -f -L -o /var/lib/rabbitmq/rabbitmqadmin http://nova:kOVE8QMA6taaje1mZZjHCbZe@localhost:15672/cli/rabbitmqadmin returned 7 instead of one of [0]
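For reference, the failing wait-initial-sync exec boils down to asking MySQL for wsrep_local_state_comment and succeeding only when the node reports Synced. A minimal sketch of that check is below; the `is_synced` helper is hypothetical (the real exec pipes live `mysql` output straight into grep, and the clustercheck credentials are environment-specific):

```shell
#!/bin/sh
# Sketch of the Galera readiness check behind the failing Puppet exec.
# In the real deployment the status line would come from something like:
#   mysql -uclustercheck -p<password> -Nbe "show status like 'wsrep_local_state_comment'"
# Here is_synced (a hypothetical helper) just inspects a status line,
# so the logic can be exercised without a live cluster.

is_synced() {
    # Succeeds (exit 0) only when the line reports the Synced state,
    # mirroring the `grep -q -e Synced` in the Puppet exec.
    echo "$1" | grep -q -e Synced
}

if is_synced "wsrep_local_state_comment	Synced"; then
    echo "node is synced"
else
    echo "node is NOT synced"
fi
```

On the broken environments this check keeps returning non-zero (the node never reaches Synced, or mysqld is not reachable at all), which is why the exec retries and eventually fails.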
This is a blocker for the QA team because many of the test configurations we use to run Tempest tests fail with this error.
Priority changed to Critical because deployment of the environment fails and we don't have a workaround yet.