RabbitMQ: unable to connect to node: nodedown during `Enable queue mirroring`

Bug #1513668 reported by Byron McCollum
This bug affects 1 person
Affects            Status        Importance  Assigned to  Milestone
OpenStack-Ansible  Fix Released  High        Unassigned
Kilo               Fix Released  High        Unassigned
Liberty            Fix Released  High        Unassigned
Trunk              Fix Released  High        Unassigned

Bug Description

There is a tendency to run into the following failure, which appears to be transient:

TASK: [Enable queue mirroring] ************************************************
failed: [98ac7559-infra1_rabbit_mq_container-8167b789] => {"cmd": "/usr/sbin/rabbitmqctl -q -n rabbit list_policies -p /", "failed": true, "rc": 2}
stderr: Error: unable to connect to node 'rabbit@98ac7559-infra1_rabbit_mq_container-8167b789': nodedown

DIAGNOSTICS
===========

attempted to contact: ['rabbit@98ac7559-infra1_rabbit_mq_container-8167b789']

rabbit@98ac7559-infra1_rabbit_mq_container-8167b789:
  * connected to epmd (port 4369) on 98ac7559-infra1_rabbit_mq_container-8167b789
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on 98ac7559-infra1_rabbit_mq_container-8167b789
  * suggestion: start the node

current node details:
- node name: 'rabbitmq-cli-3155@98ac7559-infra1_rabbit_mq_container-8167b789'
- home dir: /var/lib/rabbitmq
- cookie hash: 8E8VlCxIkt05tZMekpVcGQ==

msg: Error:********@98ac7559-infra1_rabbit_mq_container-8167b789'
- home dir: /var/lib/rabbitmq
- cookie hash: 8E8VlCxIkt05tZMekpVcGQ==
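
As a rough illustration of how this failure could be told apart from other rabbitmqctl errors, the sketch below checks the return code and stderr text for the markers shown in the output above. The function names are hypothetical and the string matching is only an assumption based on this one log excerpt, not a documented rabbitmqctl contract.

```python
def is_node_down(rc: int, stderr: str) -> bool:
    """Return True when a failed rabbitmqctl call reports 'nodedown' in stderr."""
    return rc != 0 and "nodedown" in stderr


def epmd_says_not_running(stderr: str) -> bool:
    """Return True when epmd was reachable but reported the node as not running.

    This matches the DIAGNOSTICS text in the report: the Erlang port mapper
    answered, so the host is up, but no 'rabbit' node is registered with it.
    """
    return "connected to epmd" in stderr and "not running at all" in stderr
```

If both checks are true, the broker process itself never came up, which points at the node's own startup log rather than at a network or timing problem.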

Changed in openstack-ansible:
status: New → In Progress
assignee: nobody → Byron McCollum (byron-mccollum)
Byron McCollum (byron-mccollum) wrote :

After further digging, I found that the problem isn't that RabbitMQ needs more time to start up; it is a permission problem with some of the directories and files.

I have 3 rabbit nodes. Only 1 of the 3 is working; the other 2 will not start.

a6232de5-infra1_rabbit_mq_container-ce59fdbb: failed
a6232de5-infra1_rabbit_mq_container-5a485765: failed
a6232de5-infra1_rabbit_mq_container-654a51db: working

The startup_log of the failed nodes indicates a permission-related problem:

BOOT FAILED
===========

Error description:
   {error,
       {cannot_read_enabled_plugins_file,"/etc/rabbitmq/enabled_plugins",
           eacces}}

Log files (may contain more information):
   /var/log/rabbitmq/rabbit@a6232de5-infra1_rabbit_mq_container-ce59fdbb.log
   /var/log/rabbitmq/rabbit@a6232de5-infra1_rabbit_mq_container-ce59fdbb-sasl.log

Stack trace:
   [{rabbit_plugins,read_enabled,1,[]},
    {rabbit_plugins,setup,0,[]},
    {rabbit,broker_start,0,[]},
    {rabbit,start_it,1,[]},
    {init,start_it,1,[]},
    {init,start_em,1,[]}]

{"init terminating in do_boot",{error,{cannot_read_enabled_plugins_file,"/etc/rabbitmq/enabled_plugins",eacces}}}
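
The error term above carries both the file path and the POSIX reason (eacces, i.e. permission denied). A minimal sketch of pulling those two fields out of a startup log follows; the regular expression and the function name are assumptions written for this one error shape, not a general Erlang term parser.

```python
import re

# Matches the error tuple in both forms seen above: on one line, or with
# the reason wrapped onto the next line after the comma.
BOOT_ERROR = re.compile(
    r'cannot_read_enabled_plugins_file,\s*"(?P<path>[^"]+)",\s*(?P<reason>\w+)'
)


def parse_boot_error(log_text):
    """Return (path, reason) for the plugins-file boot error, or None if absent."""
    match = BOOT_ERROR.search(log_text)
    if match is None:
        return None
    return match.group("path"), match.group("reason")
```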

Comparing the permissions of /etc/rabbitmq reveals the following:

a6232de5-infra1_rabbit_mq_container-654a51db | success | rc=0 >>
total 20
drwxr-xr-x 2 root root 4096 Nov 6 10:38 .
drwxr-xr-x 73 root root 4096 Nov 6 14:25 ..
-rw-r--r-- 1 root root 23 Nov 6 10:38 enabled_plugins
-rw-r--r-- 1 rabbitmq rabbitmq 1704 Nov 6 10:38 rabbitmq.key
-rw-r--r-- 1 rabbitmq rabbitmq 1363 Nov 6 10:38 rabbitmq.pem

a6232de5-infra1_rabbit_mq_container-5a485765 | success | rc=0 >>
total 20
drwxr-x--- 2 root root 4096 Nov 6 14:20 .
drwxr-xr-x 73 root root 4096 Nov 6 14:24 ..
-rw-r--r-- 1 root root 23 Nov 6 10:38 enabled_plugins
-rw-r----- 1 rabbitmq rabbitmq 1704 Nov 6 10:38 rabbitmq.key
-rw-r----- 1 rabbitmq rabbitmq 1363 Nov 6 10:38 rabbitmq.pem

a6232de5-infra1_rabbit_mq_container-ce59fdbb | success | rc=0 >>
total 20
drwxr-x--- 2 root root 4096 Nov 6 10:38 .
drwxr-xr-x 73 root root 4096 Nov 6 10:37 ..
-rw-r--r-- 1 root root 23 Nov 6 10:38 enabled_plugins
-rw-r----- 1 rabbitmq rabbitmq 1704 Nov 6 10:38 rabbitmq.key
-rw-r----- 1 rabbitmq rabbitmq 1363 Nov 6 10:38 rabbitmq.pem

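The permissions of /etc/rabbitmq differ between the working node (a6232de5-infra1_rabbit_mq_container-654a51db) and the failed nodes: the directory is drwxr-xr-x (0755) on the working node but drwxr-x--- (0750, owned by root:root) on the failed ones. I can't understand how this came to be.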
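
To make the difference in the listings concrete, here is a small sketch that converts the ls-style permission strings into numeric modes and checks the "other" execute bit, which is what a user outside the owning user and group needs to enter a directory. It assumes the rabbitmq user is not a member of the root group, which the listings suggest but do not prove; the helper names are hypothetical.

```python
import stat


def mode_from_ls(perm: str) -> int:
    """Convert an ls-style string such as 'drwxr-x---' to permission bits.

    Only plain r/w/x/- characters are handled; setuid, setgid and sticky
    markers (s, S, t, T) are outside the scope of this sketch.
    """
    mode = 0
    for ch in perm[1:10]:  # skip the leading file-type character
        mode = (mode << 1) | (ch != "-")
    return mode


def others_can_enter(dir_mode: int) -> bool:
    """True if a user who is neither owner nor in the group can traverse the directory."""
    return bool(dir_mode & stat.S_IXOTH)
```

With the values from the listings, `mode_from_ls("drwxr-xr-x")` is 0o755 and passes the check, while `mode_from_ls("drwxr-x---")` is 0o750 and fails it. That is consistent with enabled_plugins being unreadable on the two failed nodes even though the file itself is world-readable.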

description: updated
Changed in openstack-ansible:
status: In Progress → New
assignee: Byron McCollum (byron-mccollum) → nobody
OpenStack Infra (hudson-openstack) wrote : Change abandoned on openstack-ansible (master)

Change abandoned by Byron McCollum (<email address hidden>) on branch: master
Review: https://review.openstack.org/242306
Reason: Actual issue is related to permissions, not service startup time.

Byron McCollum (byron-mccollum) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (master)

Reviewed: https://review.openstack.org/242595
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=5862d0e894429929062119016517bddb88fb17a1
Submitter: Jenkins
Branch: master

commit 5862d0e894429929062119016517bddb88fb17a1
Author: Major Hayden <email address hidden>
Date: Fri Nov 6 12:38:41 2015 -0600

    Fixing /etc/rabbitmq permission bug

    A change from the "Remove dir_mode from rabbit key distribution" commit caused
    a bug where non-primary RabbitMQ containers would keep /etc/rabbitmq
    permissions set to 0750 (the default is 0755). This prevented the plugins
    file from being read and it broke queue mirroring.

    This patch ensures that the default permission of 0755 is set on RabbitMQ
    and should prevent problems with future upgrades.

    Closes-bug: 1513668

    Change-Id: I62d6b09dad0eef0d9543442bb727f6c946d8738e
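
The merged change lives in the openstack-ansible RabbitMQ role and is not reproduced here. Purely as an illustration of the idea the commit message describes (always enforcing 0755 on the directory rather than relying on whatever mode it already has), a minimal Python sketch could look like this. The function name and its default are assumptions, not the actual patch.

```python
import os
import stat


def ensure_dir_mode(path: str, mode: int = 0o755) -> bool:
    """Set the permission bits of `path` to `mode`.

    Returns True if the mode had to be changed and False if it was already
    correct, so repeated runs are idempotent.
    """
    current = stat.S_IMODE(os.stat(path).st_mode)
    if current == mode:
        return False
    os.chmod(path, mode)
    return True
```

Calling `ensure_dir_mode("/etc/rabbitmq")` as root on an affected container would reset a 0750 directory to 0755 and do nothing on a node that is already correct.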

Changed in openstack-ansible:
status: New → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (liberty)

Fix proposed to branch: liberty
Review: https://review.openstack.org/243226

OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (kilo)

Fix proposed to branch: kilo
Review: https://review.openstack.org/243227

OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (liberty)

Reviewed: https://review.openstack.org/243226
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=24b7ea2b584776acee1426edbad2731a511aa85c
Submitter: Jenkins
Branch: liberty

commit 24b7ea2b584776acee1426edbad2731a511aa85c
Author: Major Hayden <email address hidden>
Date: Fri Nov 6 12:38:41 2015 -0600

    Fixing /etc/rabbitmq permission bug

    A change from the "Remove dir_mode from rabbit key distribution" commit caused
    a bug where non-primary RabbitMQ containers would keep /etc/rabbitmq
    permissions set to 0750 (the default is 0755). This prevented the plugins
    file from being read and it broke queue mirroring.

    This patch ensures that the default permission of 0755 is set on RabbitMQ
    and should prevent problems with future upgrades.

    Closes-bug: 1513668

    Change-Id: I62d6b09dad0eef0d9543442bb727f6c946d8738e
    (cherry picked from commit 5862d0e894429929062119016517bddb88fb17a1)

tags: added: in-liberty
tags: added: in-kilo
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (kilo)

Reviewed: https://review.openstack.org/243227
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=d5e2c653e8ffed0da5439e26351d0cf13c55ed4f
Submitter: Jenkins
Branch: kilo

commit d5e2c653e8ffed0da5439e26351d0cf13c55ed4f
Author: Major Hayden <email address hidden>
Date: Fri Nov 6 12:38:41 2015 -0600

    Fixing /etc/rabbitmq permission bug

    A change from the "Remove dir_mode from rabbit key distribution" commit caused
    a bug where non-primary RabbitMQ containers would keep /etc/rabbitmq
    permissions set to 0750 (the default is 0755). This prevented the plugins
    file from being read and it broke queue mirroring.

    This patch ensures that the default permission of 0755 is set on RabbitMQ
    and should prevent problems with future upgrades.

    Closes-bug: 1513668

    Change-Id: I62d6b09dad0eef0d9543442bb727f6c946d8738e
    (cherry picked from commit 5862d0e894429929062119016517bddb88fb17a1)

Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/openstack-ansible 13.0.0

This issue was fixed in the openstack/openstack-ansible 13.0.0 release.

Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/openstack-ansible 12.0.11

This issue was fixed in the openstack/openstack-ansible 12.0.11 release.

Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/openstack-ansible 11.2.14

This issue was fixed in the openstack/openstack-ansible 11.2.14 release.

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/openstack-ansible 11.2.15

This issue was fixed in the openstack/openstack-ansible 11.2.15 release.
