RabbitMQ server failed to start after unexpected reboot and maintenance mode manipulation

Bug #1495885 reported by Tatyanka
This bug affects 1 person
Affects:     Fuel for OpenStack
Status:      Invalid
Importance:  Critical
Assigned to: Dmitry Ilyin
Milestone:   (not set)

Bug Description

https://product-ci.infra.mirantis.net/job/7.0.system_test.ubuntu.cic_maintenance_mode/93/testReport/junit/%28root%29/auto_cic_maintenance_mode/auto_cic_maintenance_mode/

Steps to Reproduce:
1. Create cluster
2. Add 3 nodes with controller and mongo roles
3. Add 2 nodes with compute and cinder roles
4. Deploy the cluster
5. Run OSTF
6. Trigger an unexpected reboot of a controller
7. Wait until the controller switches into maintenance mode
8. Exit maintenance mode
9. Check that the controller becomes available
10. Run OSTF (steps 6-10 are sketched in shell below)
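
For reference, a rough shell sketch of steps 6-10 as driven from the Fuel master node. The sysrq-based reboot, the umm utility, and the node name are assumptions about how the system test exercises the scenario, not details taken from this report:

# Simulate an unexpected reboot of one controller (assumption: sysrq is enabled)
ssh node-1 'echo b > /proc/sysrq-trigger'

# The node is then expected to come back up in maintenance mode
# (assumption: Fuel's UMM utility is present on the controller)
ssh node-1 'umm status'
ssh node-1 'umm off'    # exit maintenance mode; the node reboots into normal mode

# Wait until pacemaker reports the node online again before re-running OSTF
until ssh node-1 'crm_mon -1' | grep -q 'Online:.*node-1'; do sleep 30; done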

Expected Result:
OSTF tests are passed

Actual:
 {
  "RabbitMQ availability (failure)": "Number of RabbitMQ nodes is not equal to number of cluster nodes."
 },
 {
  "RabbitMQ replication (failure)": "Failed to establish AMQP connection to 5673/tcp port on 10.109.2.6 from controller node! Please refer to OpenStack logs for more details."
 }
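
The failing check can be reproduced by hand from another controller. A minimal sketch, assuming netcat is installed and using the IP and port from the failure message above:

# Is anything listening on the inter-node AMQP port of the affected controller?
nc -zv 10.109.2.6 5673

# Local broker state on the affected controller itself
rabbitmqctl status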

I reverted the environment, waited about 20 minutes, then ran OSTF again and got the same results:
http://paste.openstack.org/show/462674/

Then crm_mon -1 shows that the rabbit master is not running:
Clone Set: clone_p_dns [p_dns]
     Started: [ node-1.test.domain.local node-3.test.domain.local node-4.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     p_rabbitmq-server (ocf::fuel:rabbitmq-server): FAILED Master node-1.test.domain.local
     Slaves: [ node-3.test.domain.local node-4.test.domain.local ]
Also, there is no rabbitmq-server process running at all:

root@node-1:/var/log/rabbitmq# ps uax| grep rabbit
rabbitmq 5432 0.0 0.0 90308 2172 ? Ss 08:11 0:00 /usr/bin/python /usr/bin/rabbit-fence.py
rabbitmq 12676 0.2 0.0 8900 1976 ? S 08:14 0:09 /usr/lib/erlang/erts-5.10.4/bin/epmd -daemon
root 30719 0.0 0.0 10464 940 pts/0 S+ 09:31 0:00 grep --color=auto rabbit

At the same time it seems we try to start it, but the start fails; the last message in the log is:
Error: {could_not_start,rabbitmq_management,
           {{shutdown,
                {failed_to_start_child,rabbit_mgmt_sup,
                    {'EXIT',
                        {{shutdown,
                             [{{already_started,<5613.6927.0>},
                               {child,undefined,rabbit_mgmt_db,
                                   {rabbit_mgmt_db,start_link,[]},
                                   permanent,4294967295,worker,
                                   [rabbit_mgmt_db]}}]},
                         {gen_server2,call,
                             [<5200.4474.0>,
                              {init,<5200.4472.0>},
                              infinity]}}}}},
            {rabbit_mgmt_app,start,[normal,[]]}}}
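
A common next step here would be to clear the failed resource state in pacemaker so the OCF agent retries the start. A hedged sketch using crmsh (shipped on Fuel 7.0 controllers); this was not attempted as part of this report:

# Show the accumulated fail count for the rabbitmq resource on the broken node
crm resource failcount p_rabbitmq-server show node-1.test.domain.local

# Clear the failure history so pacemaker attempts the start again
crm resource cleanup master_p_rabbitmq-server

# Re-check the resource state after a couple of minutes
crm_mon -1 | grep -A 2 rabbitmq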

Also, the other two nodes lost the cluster:
from node-4:
[root@nailgun ~]# ssh node-4
Warning: Permanently added 'node-4' (RSA) to the list of known hosts.
Welcome to Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-64-generic x86_64)

 * Documentation: https://help.ubuntu.com/
root@node-4:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-4' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-3','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-4']},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]}]
root@node-4:~#

from node-3:
root@node-3:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-3']}]}]
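
node-3 has effectively dropped out of the cluster: it only lists itself as a disc node. A manual re-join would look roughly like the sequence below; this is the generic rabbitmqctl procedure, not what the OCF agent does, and node-1 as the join target is an assumption:

# On node-3: stop the broker application (the Erlang VM stays up), wipe local
# mnesia state, and re-join the cluster through a healthy node
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@node-1
rabbitmqctl start_app
rabbitmqctl cluster_status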

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "295"
  build_id: "295"
  nailgun_sha: "16a39d40120dd4257698795f12de4ae8200b1778"
  python-fuelclient_sha: "2864459e27b0510a0f7aedac6cdf27901ef5c481"
  fuel-agent_sha: "082a47bf014002e515001be05f99040437281a2d"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "8e9a9ae51abbbd4edef1432809311004461eec94"
  fuel-ostf_sha: "1f08e6e71021179b9881a824d9c999957fcc7045"
  fuelmain_sha: "6b83

Tags: rabbitmq
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Andrey Maximov (maximov)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Ilyin (idv1985)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The description looks too vague. How long did you wait between:
9. Check that the controller becomes available
10. Run OSTF

You should wait at least 5 minutes *after* the controller became available (pacemaker with corosync started) *before* checking whether the rabbitmq cluster recovered. Did you forget about the failover time?
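
A minimal sketch of that kind of wait, polling pacemaker until a rabbitmq master is promoted before starting OSTF; the 5-minute budget comes from the comment above, the rest is an assumption:

# Wait up to ~5 minutes for a promoted rabbitmq master before running OSTF
for i in $(seq 1 30); do
    crm_mon -1 | grep -A 1 'master_p_rabbitmq-server' | grep -q 'Masters:' && break
    sleep 10
done
crm_mon -1 | grep -A 3 'master_p_rabbitmq-server'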

Changed in fuel:
status: New → Incomplete
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Bogdan, answering your question:
We wait more than 5 minutes: up to 60*10 seconds for the node to come back online, and then 1500 seconds for the OSTF HA run, so no, we have not forgotten about the failover time. Also, after the revert the failed environment sat for about an hour and the issue is still there.

Changed in fuel:
status: Incomplete → New
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

The root cause is incorrect behaviour of the OCF script, introduced as a workaround for bug #1472230. That whole change has since been rewritten except for the simple get_status command, which returns a generic error instead of not_running. As a result, pacemaker does not know what to do with the resource and never performs a fail-over.
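
For context, the distinction referred to here, shown as a trimmed, illustrative OCF monitor fragment (not the actual fuel-library code): returning OCF_NOT_RUNNING lets pacemaker recover the resource, while a generic error leaves it FAILED as seen in the crm_mon output above.

# Illustrative only: how an OCF monitor/status action can report a dead beam process.
# Returning "$OCF_ERR_GENERIC" (rc=1) here instead of "$OCF_NOT_RUNNING" (rc=7) leaves
# the resource stuck in the FAILED state with no fail-over attempted.
get_status() {
    if ! pgrep -f 'beam.*rabbit' > /dev/null; then
        return "$OCF_NOT_RUNNING"
    fi
    # ... further health checks ...
    return "$OCF_SUCCESS"
}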

Changed in fuel:
status: New → Confirmed
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Actually, this bug is a duplicate of https://bugs.launchpad.net/fuel/+bug/1484280

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

According to the atop log, about 1 GB out of 3 GB is already swapped, while a mongo setup requires 6 gigabytes. Thus I am marking this bug as Invalid.

ATOP - node-3 2015/09/14 23:49:11 ------ 20s elapsed
PRC | sys 1.17s | user 5.82s | | #proc 207 | #trun 2 | #tslpi 462 | #tslpu 0 | #zombie 0 | clones 740 | | #exit 597 |
CPU | sys 6% | user 30% | irq 2% | | idle 62% | wait 0% | | steal 0% | guest 0% | curf 3.29GHz | curscal ?% |
CPL | avg1 0.67 | avg5 0.91 | | avg15 1.02 | | | csw 61818 | intr 30944 | | | numcpu 1 |
MEM | tot 2.9G | free 111.6M | cache 164.6M | dirty 1.0M | buff 15.0M | | slab 67.4M | | | | |
SWP | tot 3.0G | free 2.4G | | | | | | | | vmcom 6.3G | vmlim 4.5G |
PAG | scan 0 | | stall 0 | | | | | swin 16 | | | swout 0 |
LVM | mysql-root | busy 0% | read 2 | write 134 | KiB/r 4 | | KiB/w 8 | MBr/s 0.00 | MBw/s 0.05 | avq 2.57 | avio 0.21 ms |
LVM | logs-log | busy 0% | read 0 | write 70 | KiB/r 0 | | KiB/w 8 | MBr/s 0.00 | MBw/s 0.03 | avq 1
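
A quick way to confirm the memory pressure without atop, as a minimal sketch:

# Current memory and swap usage on the controller
free -m

# Committed virtual memory vs. the commit limit (the vmcom/vmlim figures in the atop header)
grep -E 'Committed_AS|CommitLimit' /proc/meminfo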

summary:
- Rabbit server failed to start after unexpected reboot and maintenance mode manipulation
+ RabbitMQ server failed to start after unexpected reboot and maintenance mode manipulation
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

@Vladimir, this bug looks identical to https://bugs.launchpad.net/bugs/1472230
