Rabbitmq fails to start on two nodes on HA IPv6 configuration

Bug #1627729 reported by Gabriele Cerami
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Haïkel Guémar

Bug Description

The HA IPv6 configuration takes a long time to deploy. After the deployment completes, a nova list on the overcloud succeeds. However, pcs status shows that rabbitmq has not started on two of the three overcloud nodes.

Cluster name: tripleo_cluster
Last updated: Mon Sep 26 12:26:00 2016 Last change: Mon Sep 26 09:11:16 2016 by root via cibadmin on overcloud-controller-0
Stack: corosync
Current DC: overcloud-controller-2 (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
3 nodes and 19 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-fd00.fd00.fd00.2000..14 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.0.2.8 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-2001.db8.fd00.1000..19 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 ip-fd00.fd00.fd00.3000..11 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-0 ]
     Slaves: [ overcloud-controller-1 overcloud-controller-2 ]
 ip-fd00.fd00.fd00.2000..10 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 ip-fd00.fd00.fd00.4000..19 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-1

Failed Actions:
* rabbitmq_start_0 on overcloud-controller-2 'unknown error' (1): call=66, status=complete, exitreason='none',
    last-rc-change='Mon Sep 26 08:47:31 2016', queued=0ms, exec=5945ms
* rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=61, status=complete, exitreason='none',
    last-rc-change='Mon Sep 26 08:47:02 2016', queued=0ms, exec=18237ms
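
For anyone triaging similar output, here is a minimal sketch of pulling the stopped clone members and the failed actions out of saved pcs status text. The function names and the SAMPLE string are illustrative only (the sample is trimmed from the output above), and the regular expressions assume the pcs layout shown in this report, which may differ in other pacemaker versions.

```python
import re


def stopped_clone_members(pcs_status_text):
    """Map each clone / master-slave set name to the nodes where it is Stopped."""
    stopped = {}
    current = None
    for line in pcs_status_text.splitlines():
        header = re.match(r"\s*(?:Clone|Master/Slave) Set: (\S+) \[", line)
        if header:
            current = header.group(1)
            continue
        members = re.match(r"\s*Stopped: \[ (.*?) \]", line)
        if members and current:
            stopped[current] = members.group(1).split()
    return stopped


def failed_actions(pcs_status_text):
    """Return (operation, node) pairs from the 'Failed Actions' section."""
    return re.findall(r"^\* (\S+) on (\S+) ", pcs_status_text, flags=re.MULTILINE)


# Trimmed copy of the pcs status output from this report.
SAMPLE = """\
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Failed Actions:
* rabbitmq_start_0 on overcloud-controller-2 'unknown error' (1): call=66, status=complete, exitreason='none',
* rabbitmq_start_0 on overcloud-controller-1 'unknown error' (1): call=61, status=complete, exitreason='none',
"""

print(stopped_clone_members(SAMPLE))
print(failed_actions(SAMPLE))
```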

The logs in /var/log/rabbitmq show many crashes with this report:
=CRASH REPORT==== 26-Sep-2016::09:43:50 ===
  crasher:
    initial call: rabbit_reader:init/4
    pid: <0.1570.0>
    registered_name: []
    exception exit: {aborted,
                        {no_exists,[rabbit_runtime_parameters,cluster_name]}}
      in function mnesia:abort/1 (mnesia.erl, line 313)
      in call from rabbit_runtime_parameters:lookup0/2 (src/rabbit_runtime_parameters.erl, line 272)
      in call from rabbit_runtime_parameters:value0/2 (src/rabbit_runtime_parameters.erl, line 268)
      in call from rabbit_reader:server_properties/1 (src/rabbit_reader.erl, line 282)
      in call from rabbit_reader:start_connection/3 (src/rabbit_reader.erl, line 1091)
      in call from rabbit_reader:handle_input/3 (src/rabbit_reader.erl, line 1041)
      in call from rabbit_reader:recvloop/4 (src/rabbit_reader.erl, line 446)
      in call from rabbit_reader:run/1 (src/rabbit_reader.erl, line 428)
    ancestors: [<0.1568.0>,<0.847.0>,<0.846.0>,<0.845.0>,rabbit_sup,
                  <0.697.0>]
    messages: [{'EXIT',#Port<0.9969>,normal}]
    links: [<0.1568.0>]
    dictionary: [{process_name,
                      {rabbit_reader,
                          <<"[FD00:FD00:FD00:2000::10]:57696 -> [FD00:FD00:FD00:2000::18]:5672">>}}]
    trap_exit: true
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 1613
  neighbours:
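
Since the log contains many of these reports, it may help to group them by crashing call and exit reason to check whether they all share one cause. The following is a rough sketch, assuming the report layout shown above; summarize_crashes and SAMPLE are hypothetical names, and the sample is a shortened copy of the report from this bug.

```python
import re
from collections import Counter

# Matches the header line that starts every crash report in the log.
CRASH_HEADER = re.compile(r"^=CRASH REPORT==== (\S+) ===$", re.MULTILINE)


def summarize_crashes(log_text):
    """Count crash reports keyed by (initial call, exception exit term)."""
    counts = Counter()
    # With one capture group, split() yields [prefix, ts1, body1, ts2, body2, ...].
    parts = CRASH_HEADER.split(log_text)
    for body in parts[2::2]:
        call = re.search(r"initial call: (\S+)", body)
        # The exit term may wrap over several lines; stop at the stack trace.
        exc = re.search(r"exception exit: (.*?)\n\s+in function", body, re.DOTALL)
        key = (
            call.group(1) if call else "unknown",
            "".join(exc.group(1).split()) if exc else "unknown",
        )
        counts[key] += 1
    return counts


# Shortened copy of the crash report from this bug.
SAMPLE = """\
=CRASH REPORT==== 26-Sep-2016::09:43:50 ===
  crasher:
    initial call: rabbit_reader:init/4
    pid: <0.1570.0>
    exception exit: {aborted,
                        {no_exists,[rabbit_runtime_parameters,cluster_name]}}
      in function mnesia:abort/1 (mnesia.erl, line 313)
"""

for (call, reason), n in summarize_crashes(SAMPLE).items():
    print(n, call, reason)
```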

journalctl on one of the failing nodes shows:

Sep 26 12:56:19 overcloud-controller-2 pengine[3007]: warning: Forcing rabbitmq-clone away from overcloud-controller-2 after 1000000 failures (max=1000000)

See http://logs.openstack.org/74/363674/27/experimental-tripleo/gate-tripleo-ci-centos-7-ovb-ha-ipv6/74c9d65/ for more information.

Changed in tripleo:
status: New → Confirmed
description: updated
Gabriele Cerami (gcerami) wrote :

Latest findings:

Pacemaker reports that rabbitmq has failed, but it appears to be up and running, although there are many crash reports in the logs.

ocf::heartbeat:rabbitmq-cluster is implemented in /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster, which is owned by the resource-agents package. But that package doesn't appear to be installed.

Gabriele Cerami (gcerami) wrote :

My bad, the package is there

Michele Baldessari (michele) wrote :

I can reproduce as well on a fresh install. Full sosreports for the controllers + compute nodes are here: http://acksyn.org/files/tripleo/newton-rabbit-ipv6/

Michele Baldessari (michele) wrote :

So I had a chat with Peter this morning and we need to at least try with updated packages. Namely:
- rabbitmq 3.6.3-3
- erlang-erts-18.3.4.1

The versions I had in my image are the following:
[root@overcloud-controller-1 heat-admin]# rpm -qa |grep -E "^rabbitmq|erlang-erts"
rabbitmq-server-3.6.2-3.el7.noarch
erlang-erts-18.3.3-1.el7.x86_64
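
To check other images against the versions mentioned above, a small sketch like the one below may help. It is a simplification: it compares only the dotted upstream version and ignores the RPM release field, so it is not a full rpm version comparison. MINIMUMS, parse_rpm and outdated are names made up for this example; the minimum versions are the ones suggested in this comment.

```python
import re

# Minimum upstream versions suggested in this comment (release field ignored).
MINIMUMS = {
    "rabbitmq-server": (3, 6, 3),
    "erlang-erts": (18, 3, 4, 1),
}


def parse_rpm(line):
    """Split 'name-version-release.arch' into (name, version tuple)."""
    m = re.match(r"^(.+)-(\d+(?:\.\d+)*)-([^-]+)$", line.strip())
    if not m:
        return None
    version = tuple(int(part) for part in m.group(2).split("."))
    return m.group(1), version


def outdated(rpm_lines, minimums=MINIMUMS):
    """Return names of packages whose version is below the given minimum."""
    result = []
    for line in rpm_lines:
        parsed = parse_rpm(line)
        if parsed is None:
            continue
        name, version = parsed
        if name in minimums and version < minimums[name]:
            result.append(name)
    return result


# Output of the rpm -qa command shown above.
INSTALLED = [
    "rabbitmq-server-3.6.2-3.el7.noarch",
    "erlang-erts-18.3.3-1.el7.x86_64",
]

print(outdated(INSTALLED))
```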

Michele Baldessari (michele) wrote :

So after *a lot* of rpm chasing and package upgrading I can confirm that the issue is gone:
Full list of resources:

 ip-fd00.fd00.fd00.3000..12 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 overcloud-controller-3 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 overcloud-controller-3 ]
 ip-192.0.2.9 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 ip-2001.db8.fd00.1000..12 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 ip-fd00.fd00.fd00.2000..15 (ocf::heartbeat:IPaddr2): Started overcloud-controller-3
 ip-fd00.fd00.fd00.2000..17 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 overcloud-controller-3 ]

I used the following:
[root@overcloud-controller-3 ~]# rpm -qa |grep -E "^rabbitmq|erlang-ert"
rabbitmq-server-3.6.5-1.el7.noarch
erlang-erts-18.3.4.2-1.el7.centos.x86_64

Changed in tripleo:
milestone: none → newton-rc2
importance: Undecided → Critical
Changed in tripleo:
status: Confirmed → Triaged
Changed in tripleo:
assignee: nobody → Haïkel Guémar (hguemar)
Emilien Macchi (emilienm) wrote :

The package has been updated in RDO, but we haven't yet checked that it works in CI. I'll keep this open. If anyone confirms it now works, please set it to "Fix Released".

Changed in tripleo:
milestone: newton-rc2 → ocata-1
tags: added: newton-backport-potential
Michele Baldessari (michele) wrote :

So Ben noted that rabbitmq-server is still at the old version here:
http://buildlogs.centos.org/centos/7/cloud/x86_64/openstack-newton/common/

It is rabbitmq-server-3.6.2-3.el7.noarch.rpm but it should be 3.6.3-5 or 3.6.5-something

Gabriele Cerami (gcerami) wrote :

After Haïkel pushed the new rabbitmq package, the problem was solved: rabbitmq now starts correctly on all 3 controllers.

Changed in tripleo:
status: Triaged → Fix Released