Rabbitmq app fails to start due to mnesia dir for plugins never cleaned up

Bug #1457766 reported by Bogdan Dobrelya
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Fix Released
Critical
Bogdan Dobrelya
5.1.x
Won't Fix
Critical
Denis Meltsaykin
6.0.x
Won't Fix
Critical
Denis Meltsaykin

Bug Description

How to reproduce:
1) Move master to node-2
2) wait for ostf ha passed
3) kill the node-2 and start count for failover time in a wait loop (wait for ostf ha passed)
4) power on node-2 and wait for it joined the rabbit cluster
5) repeat 1-4

Expected: the node-2 will always rejoin cluster after the step #4.

Actual: after few iterations, the node-2 ended up in bad state - cannot start rabbit app, even mnesia erasing doesn't help.

Here is the output of the manual check of the action start with OCF wrapper command:
ocf_handler_rabbitmq-server start 2>&1 | tee start
http://pastebin.com/nAzMdaP8

Here is the output of commands doing start rabbitmq-server in foreground and try to start_app:

# RABBITMQ_NODE_ONLY=1 su rabbitmq -s /bin/sh -c "/usr/sbin/rabbitmq-server" &
(note, beam starting as always and kernel module reported as running)
# rabbitmqctl start_app
Starting node 'rabbit@node-2' ...

BOOT FAILED
===========

Error description:
   {case_clause,{ok,[]}}

Log files (may contain more information):
   /<email address hidden>
   /<email address hidden>

Stack trace:
   [{rabbit_plugins,plugin_info,2,
                    [{file,"src/rabbit_plugins.erl"},{line,163}]},
    {rabbit_plugins,'-list/1-lc$^2/1-2-',2,
                    [{file,"src/rabbit_plugins.erl"},{line,64}]},
    {rabbit_plugins,list,1,[{file,"src/rabbit_plugins.erl"},{line,64}]},
    {rabbit_plugins,active,0,[{file,"src/rabbit_plugins.erl"},{line,49}]},
    {rabbit,'-start/0-fun-1-',0,[{file,"src/rabbit.erl"},{line,317}]},
    {rabbit,start_it,1,[{file,"src/rabbit.erl"},{line,354}]},
    {rpc,'-handle_call_call/6-fun-0-',5,[{file,"rpc.erl"},{line,205}]}]

Error: {rabbit,failure_during_boot,{case_clause,{ok,[]}}}

This happens due to incomplete mnesia cleaning method in OCF script:
# rm -rf /var/lib/rabbitmq/mnesia/rabbit@node-2/
This is not enougth and may cause reported error!

Plugins dir should be erased as well:
# rm -rf /var/lib/rabbitmq/mnesia/rabbit@node-2-plugins-expand/

After that, If beam restarted, rabbit app will start w/o issues:
# rabbitmqctl start_app
Starting node 'rabbit@node-2' ...
...done.

ISO info:
      build_id: 2015-05-20_08-41-33
      build_number: '441'

Changed in fuel:
importance: Undecided → Critical
milestone: none → 6.1
status: New → Confirmed
assignee: nobody → Bogdan Dobrelya (bogdando)
Changed in fuel:
status: Confirmed → Triaged
summary: - Rabbitmq app fails to start
+ Rabbitmq app fails to start due to mnesia dir for plugins never cleaned
+ up
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/184971

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/184971
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=fa448a62a92eebbcbe2d460c8ceb97122a764b53
Submitter: Jenkins
Branch: master

commit fa448a62a92eebbcbe2d460c8ceb97122a764b53
Author: Bogdan Dobrelya <email address hidden>
Date: Fri May 22 10:16:13 2015 +0200

    Fix rabbit OCF reset_mnesia

    W/o this fix, when rabbit app cannot start due to
    corrupted mnesia state, the mnesia would be cleaned
    not completely. This may prevent the rabbit app from
    start and take the node out of the cluster permanently.

    The solution is to remove all rabbit node related mnesia
    files.

    Closes-bug: #1457766

    Change-Id: I680efbf573c22aa9a13d8429d985b5a57235b2bf
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
tags: added: on-verification
Revision history for this message
Sergey Novikov (snovikov) wrote :

Verified on fuel-6.1-466-2015-05-25_20-55-26.iso.

Steps to verify:
   1. Move 'master_p_rabbitmq-server' to node-2
   2. Wait for OSTF HA passed
   3. Force shutdown node-2 and start count for failover time in a wait loop (wait for OSTF HA passed)
   4. Power on node-2 and wait for it joined the rabbit cluster
   5. Repeat steps 1-4 10 times

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Setting this as Won't Fix for 5.1.1-updates and 6.0-updates, as such a complex change cannot be delivered in the scope of the Maintenance Update. Also, the possible solution of the backporting of RabbitMQ OCF script is covered in details by the Operations Guide from the official documentation of the Product.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.