Pacemaker thinks RabbitMQ is up, but no beam process running

Bug #1487581 reported by Ilya Shakhat
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: MOS QA Team

Bug Description

Pacemaker reports that RabbitMQ is running on node-4:

root@node-4:~# pcs resource
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-5.domain.tld ]
     Slaves: [ node-2.domain.tld node-4.domain.tld ]

but there is actually no beam process:

root@node-4:~# ps aux | grep beam
root 25298 0.0 0.0 10460 928 pts/19 S+ 18:52 0:00 grep --color=auto beam

The issue happened after failover.
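
For reference, a minimal sketch of how the mismatch can be cross-checked on the node itself (commands only, using the resource and host names from the output above; rabbitmqctl will simply fail to connect if no beam process is running):

root@node-4:~# pcs resource
root@node-4:~# ps aux | grep '[b]eam'
root@node-4:~# rabbitmqctl status
root@node-4:~# rabbitmqctl cluster_status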

Revision history for this message
Ilya Shakhat (shakhat) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "189"
  build_id: "2015-08-18_17-24-26"
  nailgun_sha: "d0b727cdd0d8e7ce5e17e6ea1306d835bfdfb5e7"
  python-fuelclient_sha: "6ad5e0eb4dbbca6cade3444554606927ecd3f16f"
  fuel-agent_sha: "57145b1d8804389304cd04322ba0fb3dc9d30327"
  fuel-nailgun-agent_sha: "e01693992d7a0304d926b922b43f3b747c35964c"
  astute_sha: "e24ca066bf6160bc1e419aaa5d486cad1aaa937d"
  fuel-library_sha: "7c80eed2119260cc15a700068b9eb20ccc773926"
  fuel-ostf_sha: "235f21b230fea15724d625b2dc44ade0464527e2"
  fuelmain_sha: "c9dad194e82a60bf33060eae635fff867116a9ce"

Revision history for this message
Ilya Shakhat (shakhat) wrote :

From lrmd.log @ node-4:
2015-08-21T15:33:56.928132+00:00 info: INFO: p_rabbitmq-server: get_monitor(): CHECK LEVEL IS: 0
2015-08-21T15:33:57.473084+00:00 info: INFO: p_rabbitmq-server: get_monitor(): get_status() returns 0.
2015-08-21T15:33:57.477217+00:00 info: INFO: p_rabbitmq-server: get_monitor(): also checking if we are master.
2015-08-21T15:33:57.981325+00:00 info: INFO: p_rabbitmq-server: get_monitor(): master attribute is 1
2015-08-21T15:33:58.528780+00:00 info: INFO: p_rabbitmq-server: get_monitor(): checking if rabbit app is running
2015-08-21T15:33:58.540762+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. checking if we are the part of healthy cluster
2015-08-21T15:33:58.588117+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. looking for master on node-4.domain.tld
2015-08-21T15:33:58.636903+00:00 info: INFO: p_rabbitmq-server: get_monitor(): fetched master attribute for node-4.domain.tld. attr value is 1
2015-08-21T15:33:58.650643+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. looking for master on node-5.domain.tld
2015-08-21T15:33:58.702779+00:00 info: INFO: p_rabbitmq-server: get_monitor(): fetched master attribute for node-5.domain.tld. attr value is 0
2015-08-21T15:33:58.715842+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. master is node-5.domain.tld
2015-08-21T15:33:59.302150+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running and is member of healthy cluster
2015-08-21T15:33:59.308134+00:00 info: INFO: p_rabbitmq-server: get_monitor(): preparing to update master score for node
2015-08-21T15:33:59.392991+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (21226) with node-5.domain.tld (23354)
2015-08-21T15:34:09.342443+00:00 info: INFO: p_rabbitmq-server: notify: pre-promote begin.
2015-08-21T15:34:09.355550+00:00 info: INFO: p_rabbitmq-server: my_host(): hostlist is: node-2.domain.tld
2015-08-21T15:34:11.976741+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 0: /usr/sbin/rabbitmqctl list_channels 2>&1 > /dev/null
2015-08-21T15:34:12.677780+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 0: /usr/sbin/rabbitmqctl -q eval 'rabbit_alarm:get_alarms().'
2015-08-21T15:34:19.960448+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 0: /usr/sbin/rabbitmqctl -q list_queues memory messages consumer_utilisation
2015-08-21T15:34:21.973623+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 0: /usr/sbin/rabbitmqctl -q status
2015-08-21T15:34:22.045735+00:00 info: INFO: p_rabbitmq-server: get_monitor(): RabbitMQ is running 231 queues consuming 6m of 24245m total, with 14 queued messages, average consumer utilization 0.793103
2015-08-21T15:34:22.100066+00:00 info: INFO: p_rabbitmq-server: get_monitor(): RabbitMQ status: [{pid,3691}, {running_applications, [{rabbitmq_management,"RabbitMQ Management Console","3.5.4"}, {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.5.4"}, {webmachine,"webmachine","1.10.3-rmq3.5.4-gite9359c7"}, {mochiweb,"M...

Revision history for this message
Ilya Shakhat (shakhat) wrote :
tags: added: rabbitmq
Changed in fuel:
milestone: none → 7.0
assignee: nobody → MOS Oslo (mos-oslo)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Ilya, could you please also add a diagnostic snapshot?

Changed in fuel:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Ilya Shakhat (shakhat) wrote :

Timur, the snapshot is ~700 MB, which is why I attached only the necessary files.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Ilya, please provide us with an env where the issue is reproduced. It is really hard to tell what went wrong from the logs alone.

Changed in fuel:
status: Confirmed → Incomplete
assignee: MOS Oslo (mos-oslo) → Ilya Shakhat (shakhat)
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Targeted to 7.0-updates and 8.0, as we do not have enough time to fix it for 7.0.

Changed in fuel:
milestone: 7.0 → 7.0-updates
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Ilya provided access to the env, but by that time it had healed itself. One observation from me: it looks like log rotation triggered Pacemaker's recovery; lrmd.log shows RabbitMQ starting 11 minutes after the rotation (which happened at 2015-08-23T00:00:00). See the lrmd and pacemaker logs in the attached archive. Unfortunately we do not have pacemaker logs for earlier times; they have already been rotated.

The issue seems to be the same as https://bugs.launchpad.net/fuel/+bug/1472230

We need access to an env where the issue is still active. Until then I am leaving the issue in the Incomplete state and reassigning it back to the MOS Oslo team.

Changed in fuel:
assignee: Ilya Shakhat (shakhat) → MOS Oslo (mos-oslo)
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

We have an automated test which reproduces this bug with a probability of around 70%: the ha_neutron_tun_scale test.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Folks, can we get logs from these failing tests? I do not see any updates here, and we have merged a set of fixes since the latest report.

Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0-updates → 8.0
no longer affects: fuel/8.0.x
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote :

Need logs per Vladimir's request in comment #10.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

QA team, could you please reproduce the issue and collect the snapshot after that? Please also attach the /var/log/pacemaker.log* files from each controller, as they are not included in the snapshot.
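
A possible way to pull those files, assuming passwordless root SSH from the Fuel master to the controllers and using the controller names from the later comments as an example:

for n in node-2 node-3 node-5; do
    mkdir -p "pacemaker-logs-$n"
    scp "$n:/var/log/pacemaker.log*" "pacemaker-logs-$n/"
done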

Changed in fuel:
assignee: MOS Oslo (mos-oslo) → MOS QA Team (mos-qa)
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Ok, will do.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Reproduced on my environment, MOS 8.0 ISO #343, 3 controllers, 2 computes, Neutron VLAN.

Steps to reproduce:
1. Log in to one controller.
2. Get the PID of the beam process:
ps ax | grep beam
3. Kill this process by PID:
kill -9 <PID>
4. Check the status of the RabbitMQ cluster in Pacemaker (a consolidated sketch of these steps follows the output below):

root@node-3:~# ps ax | grep beam
19876 pts/8 S+ 0:00 grep --color=auto beam
root@node-3:~# pcs resource
 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started
 vip__public (ocf::fuel:ns_IPaddr2): Started
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     p_rabbitmq-server (ocf::fuel:rabbitmq-server): FAILED
     Masters: [ node-2.test.domain.local ]
     Slaves: [ node-5.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-2.test.domain.local ]
     Slaves: [ node-3.test.domain.local node-5.test.domain.local ]
 sysinfo_node-2.test.domain.local (ocf::pacemaker:SysInfo): Started
 sysinfo_node-5.test.domain.local (ocf::pacemaker:SysInfo): Started
 sysinfo_node-3.test.domain.local (ocf::pacemaker:SysInfo): Started
 Clone Set: clone_p_ntp [p_ntp]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]

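A consolidated sketch of the steps above, run as root on one controller (resource and process names as in this report; the 5-minute sleep is only an assumption about how long Pacemaker needs to notice the dead process):

# find and kill the RabbitMQ beam process
PID=$(pgrep -f beam.smp | head -n 1)
kill -9 "$PID"

# right after the kill, pacemaker still lists the node under the rabbitmq master/slave set
pcs resource | grep -A 4 rabbitmq
ps ax | grep '[b]eam'

# after ~5 minutes the node should be reported as Stopped/FAILED for p_rabbitmq-server
sleep 300
pcs resource | grep -A 4 rabbitmq
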
Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

After about 5 minutes, pcs detects that RabbitMQ has been killed:

root@node-3:~# pcs resource
 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started
 vip__public (ocf::fuel:ns_IPaddr2): Started
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-2.test.domain.local ]
     Slaves: [ node-5.test.domain.local ]
     Stopped: [ node-3.test.domain.local ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-2.test.domain.local ]
     Slaves: [ node-3.test.domain.local node-5.test.domain.local ]
 sysinfo_node-2.test.domain.local (ocf::pacemaker:SysInfo): Started
 sysinfo_node-5.test.domain.local (ocf::pacemaker:SysInfo): Started
 sysinfo_node-3.test.domain.local (ocf::pacemaker:SysInfo): Started
 Clone Set: clone_p_ntp [p_ntp]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-2.test.domain.local node-3.test.domain.local node-5.test.domain.local ]
root@node-3:~# ps ax | grep beam
 1210 pts/8 S+ 0:00 grep --color=auto beam

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

So, we can reproduce the bug: when the beam process is killed, Pacemaker detects it only after several minutes, but it is then able to start the service again properly:

(5-10 minutes after kill -9 <ID of beam process>)

root@node-3:~# ps ax | grep beam
 2246 ? Sl 0:06 /usr/lib/erlang/erts-7.1/bin/beam.smp -W w -A 64 -K true -A30 -P 1048576 -K true -B i -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.5.6/sbin/../ebin -noshell -noinput -sname rabbit@node-3 -boot start_sasl -config /etc/rabbitmq/rabbitmq -kernel inet_default_connect_options [{nodelay,true}] -rabbit tcp_listeners [{"10.109.1.7",5673}] -sasl errlog_type error -sasl sasl_error_logger false -rabbit error_logger {file,"/<email address hidden>"} -rabbit sasl_error_logger {file,"/<email address hidden>"} -rabbit enabled_plugins_file "/etc/rabbitmq/enabled_plugins" -rabbit plugins_dir "/usr/lib/rabbitmq/lib/rabbitmq_server-3.5.6/sbin/../plugins" -rabbit plugins_expand_dir "/var/lib/rabbitmq/mnesia/rabbit@node-3-plugins-expand" -os_mon start_cpu_sup false -os_mon start_disksup false -os_mon start_memsup false -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@node-3"
 5469 pts/8 S+ 0:00 grep --color=auto beam
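
If waiting several minutes is inconvenient while debugging, Pacemaker can be asked to re-evaluate the resource immediately; a sketch, assuming the pcs and crm_resource versions shipped with this release support these subcommands:

# run the resource agent's monitor action directly on the local node
crm_resource --resource p_rabbitmq-server --force-check

# or clear the resource's state/failures so pacemaker re-probes it cluster-wide
pcs resource cleanup p_rabbitmq-server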

Revision history for this message
Ivan Ponomarev (ivanzipfer) wrote :

Please provide additional information and reopen if required

Changed in fuel:
status: Incomplete → Invalid