RabbitMQ cluster member may fail to start after a cloud reboot

Bug #1915220 reported by Nikolay Vinogradov
This bug affects 1 person
Affects: OpenStack RabbitMQ Server Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Running a Bionic/Ussuri OpenStack cloud with hardware offloading enabled. During a reboot test, after all nodes had been rebooted, one of the RabbitMQ cluster members failed to start, reporting the following error:

BOOT FAILED
===========

Error description:
   {could_not_start,rabbit,
       {{function_clause,
            [{rabbit_exchange,callback,
                 [undefined,remove_bindings,transaction,
                  [undefined,
                   [{binding,
                        {resource,<<"nagios-rabbitmq-server-0">>,exchange,
                            <<"test_exchange">>},
                        <<"test_mq">>,
                        {resource,<<"nagios-rabbitmq-server-0">>,queue,
                            <<"test_exchange_queue">>},
                        []}]]],
                 [{file,"src/rabbit_exchange.erl"},{line,122}]},
             {rabbit_binding,x_callback,4,
                 [{file,"src/rabbit_binding.erl"},{line,568}]},
             {rabbit_binding,'-process_deletions/1-fun-0-',2,
                 [{file,"src/rabbit_binding.erl"},{line,550}]},
             {dict,map_bucket,2,[{file,"dict.erl"},{line,481}]},
             {dict,map_bkt_list,2,[{file,"dict.erl"},{line,477}]},
             {dict,map_bkt_list,2,[{file,"dict.erl"},{line,477}]},
             {dict,map_seg_list,2,[{file,"dict.erl"},{line,472}]},
             {dict,map_dict,2,[{file,"dict.erl"},{line,467}]}]},
        {rabbit,start,[normal,[]]}}}

Log files (may contain more information):
   /<email address hidden>
   /<email address hidden>

Error: {could_not_start,rabbit,
           {{function_clause,
                [{rabbit_exchange,callback,
                     [undefined,remove_bindings,transaction,
                      [undefined,
                       [{binding,
                            {resource,<<"nagios-rabbitmq-server-0">>,
                                exchange,<<"test_exchange">>},
                            <<"test_mq">>,
                            {resource,<<"nagios-rabbitmq-server-0">>,queue,
                                <<"test_exchange_queue">>},
                            []}]]],
                     [{file,"src/rabbit_exchange.erl"},{line,122}]},
                 {rabbit_binding,x_callback,4,
                     [{file,"src/rabbit_binding.erl"},{line,568}]},
                 {rabbit_binding,'-process_deletions/1-fun-0-',2,
                     [{file,"src/rabbit_binding.erl"},{line,550}]},
                 {dict,map_bucket,2,[{file,"dict.erl"},{line,481}]},
                 {dict,map_bkt_list,2,[{file,"dict.erl"},{line,477}]},
                 {dict,map_bkt_list,2,[{file,"dict.erl"},{line,477}]},
                 {dict,map_seg_list,2,[{file,"dict.erl"},{line,472}]},
                 {dict,map_dict,2,[{file,"dict.erl"},{line,467}]}]},
            {rabbit,start,[normal,[]]}}}

The cluster itself was operational. Looking deeper into the RabbitMQ entities, it turned out that the binding nagios-rabbitmq-server-0 existed but the corresponding queue was missing. Because the NRPE check provided by the charm connects to RabbitMQ using the localhost address, and the local member was not running, the check could not reinitialize the queue.
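
The mismatch described above (a binding whose destination queue no longer exists) can be looked for with a small script. The following is a minimal sketch, not part of the charm. It assumes the binding and queue lists are shaped like the JSON returned by the RabbitMQ management HTTP API (fields `vhost`, `source`, `destination`, `destination_type`, `routing_key` for bindings, and `vhost`, `name` for queues). The helper name `find_dangling_bindings` is hypothetical, and the sample data is only modeled on the trace above. Whether the management API actually lists such a binding in this failure mode is not confirmed by this report.

```python
from typing import Dict, List


def find_dangling_bindings(bindings: List[Dict], queues: List[Dict]) -> List[Dict]:
    """Return queue bindings whose destination queue is absent from the same vhost."""
    # Set of (vhost, queue name) pairs that actually exist.
    existing = {(q["vhost"], q["name"]) for q in queues}
    return [
        b for b in bindings
        if b.get("destination_type") == "queue"
        and (b["vhost"], b["destination"]) not in existing
    ]


# Illustrative data modeled on the trace in this report (assumption, not a real dump).
bindings = [
    {
        "vhost": "nagios-rabbitmq-server-0",
        "source": "test_exchange",
        "destination": "test_exchange_queue",
        "destination_type": "queue",
        "routing_key": "test_mq",
    },
]
queues = []  # the destination queue is missing

for b in find_dangling_bindings(bindings, queues):
    print(f'{b["vhost"]}: {b["source"]} -> {b["destination"]} (queue missing)')
```

In practice the two lists would be fetched from a healthy cluster member, since the failing member cannot answer requests.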

I tried to rebuild the failing member's Mnesia database and re-add the member to the cluster, but that did not help; most likely the problem was in the database itself. What helped was re-running the Nagios NRPE check for the broken unit from a healthy unit: it re-created the queue and the binding, and the rabbitmq-server-0 member started successfully.
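
The check's own implementation is not shown in this report. As a rough sketch of the kind of declarations that would re-create the missing entities from a healthy member, the function below takes a channel object with pika-style `exchange_declare`, `queue_declare` and `queue_bind` methods. The function name `recreate_check_entities` is hypothetical. The exchange type is not stated in the report, so the caller must supply it, and declaring an existing exchange with a different type would be rejected by the broker.

```python
def recreate_check_entities(channel, exchange, queue, routing_key, exchange_type):
    """Declare the exchange and queue, then bind them (declarations are idempotent
    as long as the properties match what already exists on the broker)."""
    channel.exchange_declare(exchange=exchange, exchange_type=exchange_type)
    channel.queue_declare(queue=queue)
    channel.queue_bind(queue=queue, exchange=exchange, routing_key=routing_key)


# Usage sketch (assumptions: pika is installed, "healthy-host" stands for a working
# cluster member, and credentials for the unit's vhost are available):
#
#   import pika
#   params = pika.ConnectionParameters(
#       host="healthy-host", virtual_host="nagios-rabbitmq-server-0")
#   with pika.BlockingConnection(params) as conn:
#       recreate_check_entities(conn.channel(), "test_exchange",
#                               "test_exchange_queue", "test_mq",
#                               exchange_type="...")  # type unknown from the report
```

Connecting to a healthy member rather than localhost is what matters here, since the broken member is not accepting connections.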

Tags: cold-start
tags: added: cold-start
Changed in charm-rabbitmq-server:
status: New → Triaged
importance: Undecided → Medium