Sometimes RabbitMQ processes do not start again after they were killed on all rabbit nodes.

Bug #1614508 reported by Alexander Koryagin
Affects: Fuel for OpenStack
Status: Confirmed
Importance: High
Assigned to: MOS Oslo

Bug Description

Hello,
Please take a look at the following issue:
    Sometimes RabbitMQ processes do not start again after they were killed on all rabbit nodes.

Configuration:
  MOS 9.0 official ISO: MirantisOpenStack-9.0.iso

  [root@nailgun ~]# fuel nodes
      id | status | name | cluster | ip | mac | roles | pending_roles | online | group_id
      ---+--------+---------------------+---------+------------+-------------------+------------+---------------+--------+---------
      1 | ready | slave-01_controller | 1 | 10.109.2.3 | 64:9f:ca:b0:49:c0 | controller | | 1 | 1
      2 | ready | slave-02_controller | 1 | 10.109.2.4 | 64:6d:92:eb:98:6f | controller | | 1 | 1
      4 | ready | slave-03_controller | 1 | 10.109.2.5 | 64:bd:82:d3:a5:64 | controller | | 1 | 1
      5 | ready | slave-04_compute | 1 | 10.109.2.6 | 64:ba:bf:de:02:4f | compute | | 1 | 1
      6 | ready | slave-05_compute | 1 | 10.109.2.7 | 64:25:a3:5a:26:08 | compute | | 1 | 1
      3 | ready | slave-06_cinder | 1 | 10.109.2.8 | 64:8a:c2:f8:de:e3 | cinder | | 1 | 1

Actions:

  !! Perform all actions below on all RabbitMQ nodes (controllers) step by step,
  so that the rabbit processes are killed on all rabbit nodes at almost the same time
  (one possible way to run the kill in parallel is sketched below).
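
  One possible way to kill the processes on all controllers at almost the same time is to run the
  kill loop over ssh in the background from a single host. This is only an illustrative sketch:
  the node names are the controllers of this environment, and the parallel-ssh approach itself is
  an assumption, not how the steps were originally executed.

      # Illustrative only: kill beam.smp on all three controllers nearly simultaneously
      # (run e.g. from the Fuel master node; assumes passwordless ssh to the controllers).
      for node in node-1 node-2 node-4; do
          ssh "$node" 'date; for pid in $(pgrep beam.smp); do kill -9 $pid; echo "killed: $pid"; done' &
      done
      wait    # wait for all background ssh sessions to return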

  1) OK - Check RabbitMQ processes on node:
      # ps aux | grep -v grep | grep 'beam.smp'

  2) OK - Kill all RabbitMQ processes on node:
      # date ; for pid in $(pgrep 'beam.smp'); do kill -9 $pid; echo 'killed: ' $pid; done

  3) OK - Wait 5-15 minutes until the RabbitMQ processes start and the cluster recovers.

  4) OK - Check RabbitMQ cluster status:
      # for cmd in 'rabbitmqctl cluster_status' 'rabbitmqctl status' 'pcs resource show p_rabbitmq-server' 'pcs resource show master_p_rabbitmq-server'; do $cmd &> /dev/null ; echo "$cmd => $?" ; done

      Each command above is expected to return exit code '0' when launched from _ALL_ Rabbit nodes.
      If not - wait 5 more minutes.
      Waiting more than 30 minutes most likely indicates that the RabbitMQ cluster will not recover on that node
      (a polling sketch that automates this wait is shown after the example output below).

  5) NOK - Repeat steps 1-4 on all RabbitMQ nodes (controllers) 10 times.

    Usually, after 2-6 rounds of kills and automatic restarts on all nodes, the RabbitMQ process does not start on one of the nodes even after 30 minutes of waiting.

    root@node-4:~# ps aux | grep -v grep | grep 'beam.smp'
        rabbitmq 18238 7.5 1.9 553272 79504 ? Sl 09:45 0:08 /usr/lib/erlang/erts-7.1/bin/beam.smp -W w -A 64 -K true -A4 -P 1048576 -K true -B i -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /var/lib/rabbitmq/native-code -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.6.1/ebin -noshell -noinput -sname rabbit@messaging-node-4 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -rabbit tcp_listeners [{"10.109.1.4",5673}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/<email address hidden>"} -rabbit sasl_error_logger {file,"/<email address hidden>"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.6.1/plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@messaging-node-4-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@messaging-node-4"

    root@node-4:~# date ; for pid in $(pgrep 'beam.smp'); do kill -9 $pid; echo 'killed: ' $pid; done
        Thu Aug 18 09:47:08 UTC 2016
        killed: 18238
        killed: 31842

    Wait 30 minutes.
    On the other two nodes rabbit is OK. The problem is only on node-4.

    root@node-4:~# for cmd in 'rabbitmqctl cluster_status' 'rabbitmqctl status' 'pcs resource show p_rabbitmq-server' 'pcs resource show master_p_rabbitmq-server'; do $cmd &> /dev/null ; echo "$cmd => $?" ; done
        rabbitmqctl cluster_status => 69
        rabbitmqctl status => 69
        pcs resource show p_rabbitmq-server => 0
        pcs resource show master_p_rabbitmq-server => 0

    root@node-4:~# ps aux | grep -v grep | grep 'beam.smp'
        {empty}

    root@node-4:~# rabbitmqctl cluster_status
        Cluster status of node 'rabbit@messaging-node-4' ...
        Error: unable to connect to node 'rabbit@messaging-node-4': nodedown

        DIAGNOSTICS
        ===========

        attempted to contact: ['rabbit@messaging-node-4']

        rabbit@messaging-node-4:
          * connected to epmd (port 4369) on messaging-node-4
          * epmd reports: node 'rabbit' not running at all
                          other nodes on messaging-node-4: ['rabbitmq-cli-18']
          * suggestion: start the node

        current node details:
        - node name: 'rabbitmq-cli-18@node-4'
        - home dir: /var/lib/rabbitmq
        - cookie hash: soeIWU2jk2YNseTyDSlsEA==
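
  A minimal sketch of how the waiting in steps 3-4 could be automated on each controller: poll the
  same four check commands until they all return exit code '0' or a 30-minute deadline passes. The
  polling loop, the 60-second interval and the deadline handling are illustrative assumptions and
  not part of the original procedure.

      #!/bin/bash
      # Illustrative only: poll the RabbitMQ and Pacemaker checks until all succeed or 30 minutes pass.
      deadline=$(( $(date +%s) + 30 * 60 ))
      while true; do
          failed=0
          for cmd in 'rabbitmqctl cluster_status' 'rabbitmqctl status' \
                     'pcs resource show p_rabbitmq-server' 'pcs resource show master_p_rabbitmq-server'; do
              $cmd &> /dev/null || failed=1
          done
          if [ "$failed" -eq 0 ]; then
              echo "RabbitMQ cluster recovered on $(hostname)"
              break
          fi
          if [ "$(date +%s)" -ge "$deadline" ]; then
              echo "RabbitMQ cluster did NOT recover on $(hostname) within 30 minutes"
              exit 1
          fi
          sleep 60    # re-check once a minute
      done

  Running such a check after each kill round (step 5) makes it easy to see on which round and on
  which node the cluster stops recovering.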

Tags: area-oslo
Timur Nurlygayanov (tnurlygayanov) wrote:

Hi MOS Oslo team, could you please check whether we kill the services correctly, and what is the expected behaviour for this scenario?

Thank you!

Changed in fuel:
assignee: nobody → MOS Oslo (mos-oslo)
milestone: none → 9.1
importance: Undecided → High
tags: added: area-oslo
Changed in fuel:
status: New → Confirmed
Bogdan Dobrelya (bogdando) wrote:

The log entry from node-4:
2016-08-18T09:45:11.407352+00:00 notice: ** FATAL ** Failed to merge schema: Bad cookie in table definition mirrored_sup_childspec: 'rabbit@messaging-node-4' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@messaging-node-4','rabbit@messaging-node-2','rabbit@messaging-node-1'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1471249727370204720,-576460752303423223,1},'rabbit@messaging-node-1'},{{4,0},{'rabbit@messaging-node-4',{1471,249929,805005}}}}, 'rabbit@messaging-node-1' = {cstruct,mirrored_sup_childspec,ordered_set,['rabbit@messaging-node-1'],[],[],0,read_write,false,[],[],false,mirrored_sup_childspec,[key,mirroring_pid,childspec],[],[],[],{{1471513460790531689,-576460752303422815,1},'rabbit@messaging-node-1'},{{2,0},[]}}

shows us that this is a known issue.
