Continuously restarting rabbitmq container for CentOS

Bug #1564773 reported by Yogesh
This bug affects 5 people
Affects   Status     Importance   Assigned to    Milestone
kolla     Invalid    Critical     Paul Bourke
Liberty   Won't Fix  Critical     Unassigned
Mitaka    Won't Fix  Critical     Unassigned

Bug Description

We are setting up a 3 node + 2 storage node deployment, but the rabbitmq container on the control node restarts frequently and fails to start properly. We rebuilt the rabbitmq image and tested again, but it still fails.
This was previously working fine with an all-in-one node. We also tried running only the rabbitmq container by itself, with the same result: it keeps restarting.

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
00ddcfdc5994 10.44.82.22:4000/kollaglue/centos-binary-rabbitmq:2.0.0 "kolla_start" About an hour ago Restarting (1) 34 minutes ago rabbitmq
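
For a restart loop like this, the container's last exit code and final output are the most useful first data points; for example:

docker inspect -f '{{.State.ExitCode}} {{.RestartCount}}' rabbitmq   # last exit code and restart count
docker logs --tail 50 rabbitmq                                       # last output before the crash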

Here is the log of the rabbitmq container:

Crash dump was written to: erl_crash.dump
init terminating in do_boot ()
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Copying service configuration files
INFO:__main__:Removing existing destination: /etc/rabbitmq/rabbitmq-env.conf
INFO:__main__:Copying /var/lib/kolla/config_files/rabbitmq-env.conf to /etc/rabbitmq/rabbitmq-env.conf
INFO:__main__:Setting permissions for /etc/rabbitmq/rabbitmq-env.conf
INFO:__main__:Removing existing destination: /etc/rabbitmq/rabbitmq.config
INFO:__main__:Copying /var/lib/kolla/config_files/rabbitmq.config to /etc/rabbitmq/rabbitmq.config
INFO:__main__:Setting permissions for /etc/rabbitmq/rabbitmq.config
INFO:__main__:Removing existing destination: /etc/rabbitmq/rabbitmq_clusterer.config
INFO:__main__:Copying /var/lib/kolla/config_files/rabbitmq_clusterer.config to /etc/rabbitmq/rabbitmq_clusterer.config
INFO:__main__:Setting permissions for /etc/rabbitmq/rabbitmq_clusterer.config
INFO:__main__:Writing out command to execute
Running command: '/usr/sbin/rabbitmq-server'
{"init terminating in do_boot",{undef,[{rabbit_clusterer,boot,[],[]},{init,start_it,1,[]},{init,start_em,1,[]}]}}

Any idea?

Yogesh (yogesh-deshmukh)
description: updated
Steven Dake (sdake)
Changed in kolla:
importance: Undecided → Critical
milestone: none → 2.0.0
status: New → Triaged
milestone: 2.0.0 → newton-1
Changed in kolla:
assignee: nobody → MD NADEEM (mail2nadeem92)
Revision history for this message
Yongfeng Du (dolpherdu) wrote :

I'm not sure if this is the same problem, but I have encountered problems when starting the rabbitmq-server container. I'm running all-in-one on Ubuntu 14.04.4, installed from source.
The error messages:
{error_logger,{{2016,4,25},{8,16,56}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",eaddrnotavail]}
{error_logger,{{2016,4,25},{8,16,56}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,320}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[<0.18.0>]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,610},{stack_size,27},{reductions,470}],[]]}
{error_logger,{{2016,4,25},{8,16,56}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[[rabbit,shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2016,4,25},{8,16,56}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2016,4,25},{8,16,56}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.9.0>},{registered_name,[]},{error_info,{exit,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,133}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}},{ancestors,[<0.8.0>]},{messages,[{'EXIT',<0.10.0>,normal}]},{links,[<0.8.0>,<0.7.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,376},{stack_size,27},{reductions,117}],[]]}
{error_logger,{{2016,4,25},{8,16,56}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}}},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}}}"}

Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},
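
If it helps anyone triaging: nodistribution with eaddrnotavail generally means the Erlang VM could not bind its distribution listener for the name the node resolved, which usually comes down to hostname resolution inside the container. A quick check, assuming the container stays up long enough to exec into:

docker exec rabbitmq sh -c 'hostname; getent hosts "$(hostname)"; cat /etc/hosts'   # hostname should resolve to a usable address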

Steven Dake (sdake)
Changed in kolla:
milestone: newton-1 → newton-2
Changed in kolla:
milestone: newton-2 → newton-3
Changed in kolla:
milestone: newton-3 → newton-rc1
Changed in kolla:
milestone: newton-rc1 → newton-rc2
Revision history for this message
Chris Hoge (hoge) wrote :

I'm experiencing a similar problem with the rabbitmq container on an all-in-one deployment.

Changed in kolla:
importance: Critical → High
Revision history for this message
Matt McEuen (mm9745) wrote :

I'm having a similar issue with the rabbitmq container restarting in a multinode deployment on Debian.

root@kittencloud-1:~# docker logs rabbitmq
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Copying service configuration files
INFO:__main__:Removing existing destination: /etc/rabbitmq/rabbitmq-env.conf
INFO:__main__:Copying /var/lib/kolla/config_files/rabbitmq-env.conf to /etc/rabbitmq/rabbitmq-env.conf
INFO:__main__:Setting permissions for /etc/rabbitmq/rabbitmq-env.conf
INFO:__main__:Copying /var/lib/kolla/config_files/rabbitmq.config to /etc/rabbitmq/rabbitmq.config
INFO:__main__:Setting permissions for /etc/rabbitmq/rabbitmq.config
INFO:__main__:Copying /var/lib/kolla/config_files/rabbitmq-clusterer.config to /etc/rabbitmq/rabbitmq-clusterer.config
INFO:__main__:Setting permissions for /etc/rabbitmq/rabbitmq-clusterer.config
INFO:__main__:Copying /var/lib/kolla/config_files/definitions.json to /etc/rabbitmq/definitions.json
INFO:__main__:Setting permissions for /etc/rabbitmq/definitions.json
INFO:__main__:Writing out command to execute
Running command: '/usr/sbin/rabbitmq-server'
(the block above repeats continually)
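
(For anyone else hitting this: docker logs --timestamps makes the restart boundaries visible, and the actual rabbit logs land in the kolla_logs volume rather than on stdout; the volume path below assumes a default deploy.)

docker logs --timestamps --tail 100 rabbitmq                 # interleave restarts with timestamps
ls /var/lib/docker/volumes/kolla_logs/_data/rabbitmq/        # rabbit's own log files, if any were written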

Revision history for this message
sean mooney (sean-k-mooney) wrote :

This is happening on master with Ubuntu source as well.

Changed in kolla:
status: Triaged → Confirmed
Revision history for this message
Jeffrey Zhang (jeffrey4l) wrote :

@sean, could you provide your logs?

Revision history for this message
Matt McEuen (mm9745) wrote :

It takes a couple of seconds for the rabbitmq container to restart after being started, so I'm able to look around inside it during that window. Is there any context I can gather or commands I could run that would be helpful? Here's a ps:

root@kittencloud-1:/etc/kolla/rabbitmq# docker restart rabbitmq; sleep 2; docker exec rabbitmq ps aux | grep rabbit
rabbitmq
rabbitmq 1 0.0 0.0 4508 756 ? Ss+ 07:52 0:00 /bin/sh /usr/sbin/rabbitmq-server
rabbitmq 15 0.0 0.0 4508 1748 ? S+ 07:52 0:00 /bin/sh -e /usr/lib/rabbitmq/bin/rabbitmq-server
rabbitmq 45 0.0 0.1 527308 26572 ? Sl+ 07:52 0:00 /usr/lib/erlang/erts-7.3/bin/beam.smp -- -root /usr/lib/erlang -progname erl -- -home /var/lib/rabbitmq -epmd_port 4369 -- -pa /usr/lib/rabbitmq/lib/rabbitmq_server-3.5.7/sbin/../ebin -boot start_clean -noshell -noinput -hidden -s rabbit_prelaunch -sname rabbitmqprelaunch15 -extra rabbit
rabbitmq 52 0.0 0.0 26304 232 ? S 07:52 0:00 /usr/lib/erlang/erts-7.3/bin/epmd -daemon
rabbitmq 85 0.0 0.0 7504 892 ? Ss 07:52 0:00 inet_gethost 4
rabbitmq 86 0.0 0.0 9624 1500 ? S 07:52 0:00 inet_gethost 4
rabbitmq 87 0.0 0.0 34424 2900 ? Rs 07:52 0:00 ps aux
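
One artifact worth grabbing between restarts is the Erlang crash dump; the earlier logs say "Crash dump was written to: erl_crash.dump", which is relative to the server's working directory (typically /var/lib/rabbitmq, though that path is an assumption here):

docker cp rabbitmq:/var/lib/rabbitmq/erl_crash.dump ./erl_crash.dump   # works even while the container is stopped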

Revision history for this message
Matt McEuen (mm9745) wrote :

Here is my docker inspect rabbitmq output:
http://pastebin.com/nYkgwgv1

Changed in kolla:
assignee: Md Nadeem (mail2nadeem92) → nobody
Steven Dake (sdake)
Changed in kolla:
importance: High → Critical
Revision history for this message
Steven Dake (sdake) wrote :

So there are a bunch of different Erlang crashes in this bug report, and we think we have fixed them all by introducing new versions of Erlang as well as RabbitMQ. This was done prior to rc2.

I have found a new issue that is blocking tagging of rc2. We are not tagging a release candidate that doesn't work out of the box. The issue seems related to multinode.

Here are the logs:
a856e619aad9 192.168.1.103:4000/kolla/centos-source-rabbitmq:3.0.0 "kolla_start" 15 minutes ago Restarting (140) About a minute ago rabbitmq

INFO:__main__:Setting permissions for /etc/rabbitmq/definitions.json
INFO:__main__:Writing out command to execute
Running command: '/usr/sbin/rabbitmq-server'
{"init terminating in do_boot",{undef,[{rabbit_clusterer,boot,[],[]},{init,start_it,1,[]},{init,start_em,1,[]}]}}
/usr/lib/rabbitmq/bin/rabbitmq-server: line 232: 186 User defined signal 2 start_rabbitmq_server "$@"
I

Looking inside the container at rabbitmq-server (a shell script):
else
    # When RabbitMQ runs in the foreground but the Erlang shell is
    # disabled, we setup signal handlers to stop RabbitMQ properly. This
    # is at least useful in the case of Docker.

    # The Erlang VM should ignore SIGINT.
    RABBITMQ_SERVER_START_ARGS="${RABBITMQ_SERVER_START_ARGS} +B i"

    # Signal handlers. They all stop RabbitMQ properly (using
    # rabbitmqctl stop). Depending on the signal, this script will exit
    # with a non-zero error code:
    # SIGHUP SIGTERM SIGTSTP
    # They are considered a normal process termination, so the script
    # exits with 0.
    # SIGINT
    # They are considered an abnormal process termination, the script
    # exits with the job exit code.
    trap "stop_rabbitmq_server; exit 0" HUP TERM TSTP
   trap "stop_rabbitmq_server" INT

    start_rabbitmq_server "$@" &

    # Block until RabbitMQ exits or a signal is caught.
    # Waits for last command (which is start_rabbitmq_server)
    wait $!

It appears to me that rabbitmq-server has been modified in some way to integrate with Docker somewhere. I'm not sure if that's a recent addition.

I'm not sure what the SIGUSR2 is about.
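
One plausible reading of the exit status 140 shown in docker ps above: POSIX shells report 128 plus the signal number when an awaited child dies from a signal, and SIGUSR2 is 12 on Linux (128 + 12 = 140). A minimal stand-in for the wrapper's trap/wait structure demonstrates this (not kolla code, just a sketch):

( sleep 30 & wait $! ) &            # subshell stands in for the rabbitmq-server wrapper
wrapper=$!
sleep 1                             # give the child a moment to start
pkill -USR2 -x sleep                # deliver SIGUSR2 to the child, as in the log above
wait "$wrapper"; echo "exit: $?"    # prints "exit: 140" (128 + 12)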

I'm not sure why it's happening in the clusterer, but that file was recently touched.

This problem only occurs on 1 of my 3 nodes but is cratering rabbitmq entirely.

I have duplicated this issue. Jeffrey, please contact me ASAP; this is blocking the tag of rc2. This doesn't seem to happen on AIO, but perhaps on multinode only. The backtrace is above.

Revision history for this message
Steven Dake (sdake) wrote :

This was fixed in my environment by executing kolla-ansible pull followed by a fresh deploy. At one point our master images were broken, and my system still had those on it. Why deploy doesn't pull new images each time is unknown.
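
For anyone wanting to replicate, the sequence was roughly as follows (the inventory path is an example):

kolla-ansible pull -i ~/multinode      # force-refresh images on all nodes
kolla-ansible deploy -i ~/multinode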

Changed in kolla:
milestone: newton-rc2 → newton-rc3
Changed in kolla:
assignee: nobody → Paul Bourke (pauldbourke)
Revision history for this message
Michał Jastrzębski (inc007) wrote :

So this is only a development-environment issue, as you shouldn't really change images unless you upgrade, and an upgrade won't hit this because of the tag difference. This will happen when you try to redeploy 3.0.0 (or any other single version) with stale images. Bottom line, use

kolla-ansible destroy -i ~/multinode --yes-i-really-really-mean-it -e destroy_include_images=yes

on your dev env.

Revision history for this message
Steven Dake (sdake) wrote :

If anyone has any rabbitmq crashes with rc2, PLEASE REPORT THEM in this bug with the full backtrace. Make sure you're using rc2 images. If working with a registry, deploy does not re-pull images that already exist in the Docker cache. I guess this is rationalized by the fact that each version has a separate tag (and results in a separate pull). Still seems fishy to me. This particular "old image" problem should affect only developers in the short term, and I think we can handle some pain :)

Thanks
-steve

Changed in kolla:
status: Confirmed → Incomplete
Revision history for this message
Paul Bourke (pauldbourke) wrote :

Just deployed a master copy of multinode rabbitmq, no issues (git ce23dbe).

Revision history for this message
Matt McEuen (mm9745) wrote :

Thanks, Steve - pulling fresh images resolved the issue for me!

Revision history for this message
Matt McEuen (mm9745) wrote :

Sorry, spoke too soon. Pulling fresh images got Ansible through the deployment successfully, but my rabbitmq is still restarting incessantly, without any errors that I can find. I'll try nuking it and starting from scratch when I get a chance.

Revision history for this message
Steven Dake (sdake) wrote :

Matt,

How are you pulling? Is openstack_release set to 3.0.0, and are you pulling from your local registry of built images?

If not, that could be the problem, as 2.0.2 may have problems with rabbitmq (fixes for which we will be backporting).

Revision history for this message
Matt McEuen (mm9745) wrote :

Unfortunately I'm pulling 3.0.0 binaries, but thanks for the idea, Steve.

Revision history for this message
Christian Berendt (berendt) wrote :

Any news about the backport for 2.0.2? I think this bug can be set to Invalid for Newton and Liberty (EOL).

Changed in kolla:
status: Incomplete → Invalid
Revision history for this message
Aric Renzo (ar7520) wrote :

I seem to be running into this issue as well while attempting a multinode deployment of Kolla on Ubuntu 16.04 bare-metal target machines, kolla version 3.0.1. I am getting this error when I attempt to deploy with the centos-binary and ubuntu-binary container images. I am building these images into a local registry on the deploy host. The deployment finishes successfully, but the rabbitmq container keeps restarting itself. See the paste outputs below:

The output of "docker ps -a" right after a fresh deployment: http://paste.openstack.org/show/593883/

Attempting to manually restart the container: http://paste.openstack.org/show/593889/

Everything under kolla_logs/rabbitmq: http://paste.openstack.org/show/593875/

The output of: "docker inspect rabbitmq": http://paste.openstack.org/show/593888/

I have destroyed this kolla installation multiple times, and the rabbitmq container never seems to stay up. The prechecks are passing, as well as the bootstrap-servers process. After the installation, I can log into the horizon dashboard, but I am unable to spin up Nova VMs due to the rabbitmq issue.

Revision history for this message
Aric Renzo (ar7520) wrote :

My RabbitMQ container restart issue has been resolved. Essentially, we had automation scripts that were creating duplicate /etc/hosts entries on each controller node in the environment. Each time the RabbitMQ container started, it resolved its own hostname to 127.0.0.1 instead of the bind IP address specified in the kolla configuration. If you are experiencing a similar issue, check your /etc/hosts file or DNS name resolution for potential problems.
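
A quick sanity check on each controller, along the lines of what bit us:

grep -n "$(hostname)" /etc/hosts    # look for duplicate or 127.0.0.1 entries for this host
getent hosts "$(hostname)"          # should return the bind IP from the kolla config, not 127.0.0.1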
