rabbitmq crashed on centos binary deploy

Bug #1562701 reported by Jeffrey Zhang
This bug affects 1 person
Affects   Status         Importance   Assigned to       Milestone
kolla     Fix Released   Critical     Vikram Hosakote
Mitaka    Fix Released   Critical     Vikram Hosakote

Bug Description

Here is the crash log. For the full log, please refer to [0].

2016-03-28 04:42:51.806 | ++ docker ps -a --format '{{.Names}}' --filter status=exited
2016-03-28 04:42:51.833 | + failed_containers=rabbitmq
2016-03-28 04:42:51.833 | + for failed in '${failed_containers}'
2016-03-28 04:42:51.833 | + docker logs --tail all rabbitmq
2016-03-28 04:42:51.858 | INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
2016-03-28 04:42:51.858 | INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
2016-03-28 04:42:51.858 | INFO:__main__:Validating config file
2016-03-28 04:42:51.858 | INFO:__main__:Copying service configuration files
2016-03-28 04:42:51.858 | INFO:__main__:Copying /var/lib/kolla/config_files/rabbitmq-env.conf to /etc/rabbitmq/rabbitmq-env.conf
2016-03-28 04:42:51.859 | INFO:__main__:Setting permissions for /etc/rabbitmq/rabbitmq-env.conf
2016-03-28 04:42:51.859 | INFO:__main__:Copying /var/lib/kolla/config_files/rabbitmq.config to /etc/rabbitmq/rabbitmq.config
2016-03-28 04:42:51.859 | INFO:__main__:Setting permissions for /etc/rabbitmq/rabbitmq.config
2016-03-28 04:42:51.859 | INFO:__main__:Copying /var/lib/kolla/config_files/rabbitmq_clusterer.config to /etc/rabbitmq/rabbitmq_clusterer.config
2016-03-28 04:42:51.859 | INFO:__main__:Setting permissions for /etc/rabbitmq/rabbitmq_clusterer.config
2016-03-28 04:42:51.859 | INFO:__main__:Writing out command to execute
2016-03-28 04:42:51.859 | Running command: '/usr/sbin/rabbitmq-server'
2016-03-28 04:42:51.859 | {error_logger,{{2016,3,28},{4,40,11}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",econnrefused]}

2016-03-28 04:42:51.859 | {error_logger,{{2016,3,28},{4,40,11}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,320}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.93>,<0.18.0>]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,610},{stack_size,27},{reductions,801}],[]]}

2016-03-28 04:42:51.859 | {error_logger,{{2016,3,28},{4,40,11}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[[rabbitmqprelaunch1,shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}

2016-03-28 04:42:51.859 | {error_logger,{{2016,3,28},{4,40,11}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}

2016-03-28 04:42:51.860 | {error_logger,{{2016,3,28},{4,40,11}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.9.0>},{registered_name,[]},{error_info,{exit,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,133}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}},{ancestors,[<0.8.0>]},{messages,[{'EXIT',<0.10.0>,normal}]},{links,[<0.8.0>,<0.7.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,376},{stack_size,27},{reductions,117}],[]]}

2016-03-28 04:42:51.860 | {error_logger,{{2016,3,28},{4,40,11}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}}},{type,permanent}]}

2016-03-28 04:42:51.860 | {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}}}"}

[0] http://logs.openstack.org/82/296982/3/check/gate-kolla-dsvm-deploy-centos-binary/ebaca75/console.html#_2016-03-28_04_42_51_806

Changed in kolla:
importance: Undecided → Critical
milestone: none → newton-1
status: New → Confirmed
Revision history for this message
Swapnil Kulkarni (coolsvap-deactivatedaccount) wrote :

I am not able to reproduce this in a local environment.

Jeffrey Zhang (jeffrey4l) wrote :

I saw this again; see [0].
BTW, I cannot reproduce this issue locally.

[0] http://logs.openstack.org/34/297434/2/check/gate-kolla-dsvm-deploy-centos-source/2b0a42b/console.html#_2016-04-04_16_13_45_009

Martin André (mandre)
Changed in kolla:
assignee: nobody → Martin André (mandre)
Vikram Hosakote (vhosakot) wrote :

I googled the main error in the rabbitmq crash, "Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",econnrefused], and found that it is caused either by a dead epmd daemon (related to an IPv6 address) or by the hostname having been renamed.

Here are some links.

Crash due to a dead epmd daemon (IPv6 address):
---------------------------------------------------------------------------------

https://bugs.launchpad.net/ubuntu/+source/rabbitmq-server/+bug/1434395

https://bugs.launchpad.net/ubuntu/+source/erlang/+bug/1374109

http://stackoverflow.com/questions/26096126/epmd-error-opening-stream-socket-address-family-not-supported-by-protocol

https://github.com/mistio/mist.io/issues/428

Crash due to a renamed hostname:
--------------------------------------------------

https://groups.google.com/forum/#!msg/zulip-devel/8qCQM252hr8/mMyEYLquBQAJ

http://www.techsfo.com/blog/2013/06/rabbitmq-breaks-when-you-rename-hostname/

Steven Dake (sdake) wrote :

Instead of a dead epmd daemon, it may be that the epmd daemon is not started at all. This seemed to start happening around the time I added the regex patch to replace the pid-checking task that was there previously.
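
A quick way to tell "epmd is not running or not listening" apart from other failures is to try a TCP connection to the epmd port from inside the container. The following is a minimal sketch, not part of kolla; it assumes epmd is on its default port 4369, and the function name and default host are made up for illustration.

```python
import socket

EPMD_PORT = 4369  # epmd's default port; an assumption if it was overridden


def epmd_reachable(host="127.0.0.1", port=EPMD_PORT, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    This only proves that a listener exists; it does not verify that
    the listener is actually epmd.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling epmd_reachable() with the address that rabbitmq is expected to use, and getting False, would be consistent with the econnrefused in the crash log, though it does not by itself explain why epmd is missing.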

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (master)

Fix proposed to branch: master
Review: https://review.openstack.org/303686

Vikram Hosakote (vhosakot) wrote :

Interestingly, every time rabbitmq crashes in the centos gate, the task "Creating admin project, user, role, service, and endpoint" in ansible/roles/keystone/tasks/register.yml fails as well, with the error "ValueError: No JSON object could be decoded".

http://logs.openstack.org/52/300852/2/check/gate-kolla-dsvm-deploy-centos-binary/ec324e9/console.html#_2016-04-05_03_07_36_211

Vikram Hosakote (vhosakot) wrote :

Commenting out "export ERL_EPMD_ADDRESS ..." in ansible/roles/rabbitmq/templates/rabbitmq-env.conf.j2 seems to have resolved this issue. rabbitmq did not crash in the gate on either CentOS binary or CentOS source, and the nova VM booted fine and reached ACTIVE state on CentOS.

https://github.com/openstack/kolla/blob/master/ansible/roles/rabbitmq/templates/rabbitmq-env.conf.j2#L9

Nova VM in ACTIVE state in CentOS source gate:

http://logs.openstack.org/86/303686/5/check/gate-kolla-dsvm-deploy-centos-source/8fe9507/console.html#_2016-04-09_19_28_19_793

Patch set is at https://review.openstack.org/#/c/303686/.
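
For anyone who wants to try the same workaround by hand on an already rendered rabbitmq-env.conf before picking up the patch, the edit can be sketched as below. This is only an illustration of the idea, not the content of the patch set; the function name is hypothetical, and it assumes the setting appears on a line of its own starting with "export ERL_EPMD_ADDRESS".

```python
import re

# Matches an active (not yet commented) "export ERL_EPMD_ADDRESS..." line,
# capturing leading whitespace separately so indentation is preserved.
_EPMD_LINE = re.compile(r"^(\s*)(export\s+ERL_EPMD_ADDRESS\b.*)$")


def comment_out_epmd_address(text):
    """Return text with every active 'export ERL_EPMD_ADDRESS' line commented out."""
    out = []
    for line in text.splitlines():
        m = _EPMD_LINE.match(line)
        out.append(f"{m.group(1)}# {m.group(2)}" if m else line)
    # Keep the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")
```

Lines that are already commented out are left untouched, so running it twice gives the same result as running it once.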

Swapnil Kulkarni (coolsvap-deactivatedaccount) wrote :
Changed in kolla:
assignee: Martin André (mandre) → Vikram Hosakote (vhosakot)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (master)

Reviewed: https://review.openstack.org/303686
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=915d3f12b51d988c322f27f5292b0ada7e3dc617
Submitter: Jenkins
Branch: master

commit 915d3f12b51d988c322f27f5292b0ada7e3dc617
Author: Vikram Hosakote <email address hidden>
Date: Sat Apr 9 04:59:12 2016 +0000

    Fix rabbitmq crash in centos gate

    Please refer to the Closes-Bug identifier for detailed information
    pertaining to this issue.

    Closes-Bug: #1562701

    Change-Id: I77563930e14e11ea48e7edfef0bff80002279381

Changed in kolla:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/303969

Martin André (mandre) wrote :

As stated by sdake, binding with ERL_EPMD_ADDRESS can in some circumstances cause epmd not to bind to any interface and exit immediately. When epmd isn't present, erlang crashes, taking rabbitmq with it. This condition only occurs when IPv6 is compiled into erlang.

Related (abandoned) patch set: https://review.openstack.org/#/c/303837/

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (stable/mitaka)

Reviewed: https://review.openstack.org/303951
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=898a19812e7791ff056903e05f1a2ec19ec5261e
Submitter: Jenkins
Branch: stable/mitaka

commit 898a19812e7791ff056903e05f1a2ec19ec5261e
Author: Vikram Hosakote <email address hidden>
Date: Sat Apr 9 04:59:12 2016 +0000

    Fix rabbitmq crash in centos gate

    Please refer to the Closes-Bug identifier for detailed information
    pertaining to this issue.

    Closes-Bug: #1562701

    Change-Id: I77563930e14e11ea48e7edfef0bff80002279381
    (cherry picked from commit 915d3f12b51d988c322f27f5292b0ada7e3dc617)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla (master)

Reviewed: https://review.openstack.org/303969
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=ed1c71837f3676ae14e1ad882b22780c2661f394
Submitter: Jenkins
Branch: master

commit ed1c71837f3676ae14e1ad882b22780c2661f394
Author: Martin André <email address hidden>
Date: Mon Apr 11 10:47:46 2016 +0200

    Clarify comment about binding erlang to IPv4

    The comment was confusing and not explaining what the real issue is
    when binding erlang to an IPv4 address.

    Change-Id: I819ea137fa37c0b2711efb1e7cb1e518ae26b9ab
    Related-Bug: #1562701

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/304091

OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla (stable/mitaka)

Reviewed: https://review.openstack.org/304091
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=a9afbb8e27565eb0f1e69ffd626d67de594d2bcd
Submitter: Jenkins
Branch: stable/mitaka

commit a9afbb8e27565eb0f1e69ffd626d67de594d2bcd
Author: Martin André <email address hidden>
Date: Mon Apr 11 10:47:46 2016 +0200

    Clarify comment about binding erlang to IPv4

    The comment was confusing and not explaining what the real issue is
    when binding erlang to an IPv4 address.

    Change-Id: I819ea137fa37c0b2711efb1e7cb1e518ae26b9ab
    Related-Bug: #1562701
    (cherry picked from commit ed1c71837f3676ae14e1ad882b22780c2661f394)

tags: added: in-stable-mitaka
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/kolla 2.0.0.0rc4

This issue was fixed in the openstack/kolla 2.0.0.0rc4 release candidate.

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/kolla 2.0.0

This issue was fixed in the openstack/kolla 2.0.0 release.

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/kolla 1.1.0

This issue was fixed in the openstack/kolla 1.1.0 release.

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/kolla 3.0.0.0b1

This issue was fixed in the openstack/kolla 3.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (master)

Reviewed: https://review.openstack.org/369773
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=5480bd9b1d3a9efcd3618ddf12718d2621ceeb47
Submitter: Jenkins
Branch: master

commit 5480bd9b1d3a9efcd3618ddf12718d2621ceeb47
Author: Jeffrey Zhang <email address hidden>
Date: Wed Sep 14 09:52:38 2016 +0800

    Bind EPMD to api interface address

    Closes-Bug: #1562701
    Change-Id: Ica68bdee81223232995bc21ad5e5d5fbf9e8b05f

Dave Walker (davewalker) wrote :

I have been seeing the old broken behaviour since this landed:

 https://review.openstack.org/369773

Blocking at:
TASK [keystone : Creating admin project, user, role, service, and endpoint] ****

Rabbitmq container restarting with log:

Running command: '/usr/sbin/rabbitmq-server'
{error_logger,{{2016,9,23},{18,36,31}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",econnrefused]}
{error_logger,{{2016,9,23},{18,36,31}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.23.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,344}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.603>,<0.20.0>]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,987},{stack_size,27},{reductions,858}],[]]}
{error_logger,{{2016,9,23},{18,36,31}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{id,net_kernel},{mfargs,{net_kernel,start_link,[[rabbitmqprelaunch6,shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2016,9,23},{18,36,31}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}},{offender,[{pid,undefined},{id,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2016,9,23},{18,36,31}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.9.0>},{registered_name,[]},{error_info,{exit,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}},[{application_master,init,4,[{file,"application_master.erl"},{line,134}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}},{ancestors,[<0.8.0>]},{messages,[{'EXIT',<0.10.0>,normal}]},{links,[<0.8.0>,<0.7.0>]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,376},{stack_size,27},{reductions,117}],[]]}
{error_logger,{{2016,9,23},{18,36,31}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}}},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{kernel,start,[normal,[]]}}}"}

Crash dump is being written to: erl_crash.dump...done
Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{'EXIT',nodistribution}}}}},{k

Jeffrey Zhang (jeffrey4l) wrote :

@Dave,

I reproduced this. When the Linux kernel's IPv6 feature is disabled, this issue happens. We should probably disable the bind-address feature for epmd.

related issue: https://bugs.launchpad.net/ubuntu/+source/erlang/+bug/1374109
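
To check whether a given host matches the condition described here (IPv6 unavailable at the kernel level), one option is to try binding a socket to the IPv6 loopback address. This is a rough sketch with a hypothetical function name; it is an assumption that a failed bind to ::1 is the same condition that makes epmd exit.

```python
import socket


def ipv6_loopback_usable():
    """Return True if a TCP socket can be bound to the IPv6 loopback (::1)."""
    if not socket.has_ipv6:
        # The Python build itself has no IPv6 support.
        return False
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as s:
            s.bind(("::1", 0))  # port 0 lets the OS pick any free port
        return True
    except OSError:
        # Raised, for example, when IPv6 is disabled in the kernel.
        return False
```

If this returns False on a host where rabbitmq keeps restarting with the nodistribution error, the host is at least a candidate for the problem discussed in this bug.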
