RabbitMQ server failed to start after unexpected reboot and maintenance mode manipulation
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Fuel for OpenStack | Invalid | Critical | Dmitry Ilyin | |
Bug Description
Steps to Reproduce:
1. Create a cluster
2. Add 3 nodes with controller and mongo roles
3. Add 2 nodes with compute and cinder roles
4. Deploy the cluster
5. Run OSTF
6. Trigger an unexpected reboot
7. Wait until the controller switches into maintenance mode
8. Exit maintenance mode
9. Check that the controller becomes available
10. Run OSTF
Expected Result:
OSTF tests pass
Actual:
{
"RabbitMQ availability (failure)": "Number of RabbitMQ nodes is not equal to number of cluster nodes."
},
{
"RabbitMQ replication (failure)": "Failed to establish AMQP connection to 5673/tcp port on 10.109.2.6 from controller node! Please refer to OpenStack logs for more details."
}
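The second failure above is a plain TCP reachability problem on port 5673. A minimal sketch of such a probe follows; the function name `amqp_port_reachable` is hypothetical and not part of OSTF, and the probe only opens a TCP connection without performing an AMQP handshake.

```python
import socket


def amqp_port_reachable(host: str, port: int = 5673, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: it checks TCP reachability only, so a True result
    does not prove that RabbitMQ itself is healthy behind the port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

Run from a controller node, `amqp_port_reachable("10.109.2.6")` would be expected to return False while the failure reported above persists.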
I reverted the environment, waited about 20 minutes, ran OSTF again, and got the same results:
http://
crm_mon -1 then shows that the RabbitMQ master is not running:
Clone Set: clone_p_dns [p_dns]
Started: [ node-1.
Master/Slave Set: master_
p_
Slaves: [ node-3.
There is also no rabbitmq-server process running at all:
root@node-
rabbitmq 5432 0.0 0.0 90308 2172 ? Ss 08:11 0:00 /usr/bin/python /usr/bin/
rabbitmq 12676 0.2 0.0 8900 1976 ? S 08:14 0:09 /usr/lib/
root 30719 0.0 0.0 10464 940 pts/0 S+ 09:31 0:00 grep --color=auto rabbit
At the same time, it seems we try to start it, but the start fails with the following last message in the log:
Error: {could_
The other 2 nodes have also lost the cluster:
From node-4:
[root@nailgun ~]# ssh node-4
Warning: Permanently added 'node-4' (RSA) to the list of known hosts.
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-64-generic x86_64)
* Documentation: https:/
root@node-4:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-4' ...
[{nodes,
{running_
{cluster_
{partitions,[]}]
root@node-4:~#
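The "RabbitMQ availability" failure compares the running nodes with the cluster nodes, which can be sketched by parsing `rabbitmqctl cluster_status` output. This assumes the Erlang-term output format of RabbitMQ 3.x. The helper names and the sample text (node names `ctrl-a`, `ctrl-b`, `ctrl-c`) are hypothetical and are not taken from this environment.

```python
import re

# Matches Erlang node names such as 'rabbit@ctrl-a' (quotes optional).
NODE_RE = re.compile(r"'?([A-Za-z0-9_]+@[A-Za-z0-9_.\-]+)'?")

# Hypothetical sample, NOT the output from this bug report.
SAMPLE = """Cluster status of node 'rabbit@ctrl-a' ...
[{nodes,[{disc,['rabbit@ctrl-a','rabbit@ctrl-b','rabbit@ctrl-c']}]},
 {running_nodes,['rabbit@ctrl-a']},
 {cluster_name,<<"rabbit@ctrl-a.example">>},
 {partitions,[]}]
"""


def parse_cluster_status(output: str):
    """Return (known_nodes, running_nodes) parsed from cluster_status text."""
    # Everything between "{nodes," and "{running_nodes," lists the members.
    nodes_part = re.search(r"\{nodes,(.*?)\{running_nodes,", output, re.S)
    running_part = re.search(r"\{running_nodes,\s*\[(.*?)\]", output, re.S)
    known = NODE_RE.findall(nodes_part.group(1)) if nodes_part else []
    running = NODE_RE.findall(running_part.group(1)) if running_part else []
    return known, running


def missing_nodes(output: str):
    """Return cluster members that are not currently running."""
    known, running = parse_cluster_status(output)
    return [node for node in known if node not in running]


print(missing_nodes(SAMPLE))
```

An empty list from `missing_nodes` would correspond to the state the OSTF check expects. This is only an approximation of that check, not its actual implementation.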
From node-3:
root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,
VERSION:
feature_groups:
- mirantis
production: "docker"
release: "7.0"
openstack_
api: "1.0"
build_number: "295"
build_id: "295"
nailgun_sha: "16a39d40120dd4
python-
fuel-agent_sha: "082a47bf014002
fuel-
astute_sha: "6c5b73f93e24cc
fuel-library_sha: "8e9a9ae51abbbd
fuel-ostf_sha: "1f08e6e7102117
fuelmain_sha: "6b83
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Ilyin (idv1985)
Changed in fuel:
status: Confirmed → Invalid
summary:
- Rabbit server failed to start after unexpected reboot and maintenace
+ Rabbitmq server failed to start after unexpected reboot and maintenace mode manipulation
The description looks too vague. How long did you wait between these two steps:
9. Check the controller become available
10. Run ostf
You should wait at least 5 minutes *after* the controller becomes available (Pacemaker and Corosync started) *and* before checking whether the RabbitMQ cluster has recovered. Did you account for the failover time?
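The waiting advice above can be scripted as a polling loop instead of a fixed sleep. The sketch below is a generic helper, not part of Fuel or OSTF. The name `wait_until`, the 600-second default (chosen to exceed the 5-minute minimum) and the 10-second interval are all assumptions.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout: float = 600.0,
               interval: float = 10.0) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the check passes, False if the deadline is
    reached first. The defaults are assumptions, not Fuel settings.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Here `check` would be a caller-supplied function, for example one that runs `rabbitmqctl cluster_status` on a controller and reports whether all cluster nodes are running. OSTF would be started only after `wait_until` returns True.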