RabbitMQ cluster node removal operation may hang for ever as rabbitmqctl may hang

Bug #1459173 reported by Bogdan Dobrelya
This bug affects 1 person
Affects             Status         Importance  Assigned to       Milestone
Fuel for OpenStack  Fix Committed  High        Bogdan Dobrelya
5.1.x               Won't Fix      High        Denis Meltsaykin
6.0.x               Won't Fix      High        Denis Meltsaykin

Bug Description

This bug is not easy to reproduce. I managed to reproduce it only after ~300 consecutive node failovers. The repro steps can be found here: https://bugs.launchpad.net/fuel/+bug/1458830

The issue is that the following commands may not work as expected (we expect that disconnecting a node helps to kick it from the cluster, but the disconnect may sometimes fail and return false):
# rabbitmqctl eval "disconnect_node(list_to_atom(\"rabbit@node-1\"))."; time rabbitmqctl forget_cluster_node rabbit@node-1
and rabbitmqctl hangs forever, ending up in a situation where none of the RabbitMQ nodes can rejoin the cluster on failover, because they cannot be forgotten and join_cluster reports that they are already clustered.
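
A minimal sketch of how the two commands above could be bounded in time so that a hung rabbitmqctl cannot block the caller indefinitely. It assumes GNU coreutils `timeout` is available; the function names `run_bounded` and `forget_node`, the 60-second limit and the 5-second kill delay are illustrative choices, not taken from the actual OCF script.

```shell
#!/bin/sh
# Hypothetical helper: run a command under a time limit.
# GNU `timeout` sends TERM when the limit expires and, with -k 5,
# sends KILL if the command is still running 5 seconds later.
# Exit status 124 means the command timed out; 137 (128 + 9) means
# it had to be killed with SIGKILL.
run_bounded() {
    limit="$1"; shift
    timeout -k 5 "$limit" "$@"
    rc=$?
    case "$rc" in
        124|137) echo "command timed out or was killed (exit $rc): $*" >&2 ;;
    esac
    return "$rc"
}

# Hypothetical wrapper: disconnect a node, then forget it.
# Each step is bounded by 60 seconds (an assumed value).
forget_node() {
    node="$1"
    run_bounded 60 rabbitmqctl eval "disconnect_node(list_to_atom(\"$node\"))."
    run_bounded 60 rabbitmqctl forget_cluster_node "$node"
}
```

For example, `forget_node rabbit@node-1` would return 124 or 137 instead of hanging if forget_cluster_node never completes. This only bounds the wait; it does not repair the broken node.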

Note that in this scenario the AMQP cluster remains completely down, because the nodes cannot join the mnesia master and the latter is running in a
broken state: rabbitmqctl list_channels hangs as well. Perhaps the only solution is to detect in the monitor whether list_channels hangs and to restart the
affected nodes. This will cause full cluster downtime until a new mnesia master is elected, but at least it will ensure that the cluster is reassembled.
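
The detection idea above could look roughly like the following sketch. It is not the code from the fuel-library OCF script: the names `PROBE_CMD`, `PROBE_TIMEOUT` and `check_rabbitmqctl_responds`, and the 30-second default, are assumptions made for illustration. It relies on GNU coreutils `timeout` and on the standard OCF return codes, where 0 is success and 1 is a generic error.

```shell
#!/bin/sh
# Return codes as defined by the OCF resource agent API.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical settings; the probe command can be overridden for testing.
PROBE_CMD="${PROBE_CMD:-rabbitmqctl list_channels}"
PROBE_TIMEOUT="${PROBE_TIMEOUT:-30}"

# Hypothetical monitor step: report a generic error when the probe hangs,
# so that the cluster manager can restart the affected node.
check_rabbitmqctl_responds() {
    # $PROBE_CMD is intentionally unquoted so it splits into command and args.
    timeout -k 5 "$PROBE_TIMEOUT" $PROBE_CMD >/dev/null 2>&1
    rc=$?
    case "$rc" in
        124|137)
            # 124: timed out; 137: killed with SIGKILL (128 + 9).
            echo "ERROR: '$PROBE_CMD' hung (exit $rc)" >&2
            return "$OCF_ERR_GENERIC"
            ;;
    esac
    # Other exit codes are out of scope for this sketch and are assumed
    # to be handled by the rest of the monitor logic.
    return "$OCF_SUCCESS"
}
```

Making the probe command configurable keeps the check testable without a running broker; a real monitor would likely combine this result with its other health checks.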

ISO info:
      build_id: 2015-05-20_08-41-33
      build_number: '441'
but the manifests were synced with the current master.

Tags: ha rabbitmq
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Bogdan Dobrelya (bogdando)
milestone: none → 6.1
summary: - RabbitMQ may hang on the cluster node removal
+ RabbitMQ cluster node removal operation may hang for ever
description: updated
Bogdan Dobrelya (bogdando) wrote : Re: RabbitMQ cluster node removal operation may hang for ever

Example of the lrmd.log
2015-05-27T09:00:45.267143+00:00 info: INFO: p_rabbitmq-server: unjoin_nodes_from_cluster(): node 'rabbit@node-1' disconnected succesfully.
2015-05-27T09:01:36.174799+00:00 info: INFO: p_rabbitmq-server: unjoin_nodes_from_cluster(): Execute forget_cluster_node with timeout: 60
2015-05-27T09:02:36.220831+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 137: /usr/sbin/rabbitmqctl forget_cluster_node rabbit@node-1
2015-05-27T09:02:36.224676+00:00 warning: WARNING: p_rabbitmq-server: unjoin_nodes_from_cluster(): unjoining node 'rabbit@node-1' failed.
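
In the excerpt above, exit code 137 is 128 + 9, meaning the command was terminated with SIGKILL after the 60-second timeout expired. A small sketch for finding such entries in a log follows; the function name `hung_commands` is hypothetical, and the pattern assumes the message format shown in the excerpt.

```shell
#!/bin/sh
# Hypothetical helper: print log lines for commands that exited with
# 124 (timed out) or 137 (killed with SIGKILL).
# Usage: hung_commands /path/to/lrmd.log
hung_commands() {
    grep -E 'su_rabbit_cmd\(\): the invoked command exited (124|137):' "$1"
}
```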

description: updated
summary: - RabbitMQ cluster node removal operation may hang for ever
+ RabbitMQ cluster node removal operation may hang for ever as rabbitmqctl
+ may hang
Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/186002
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=e8e777a55b6d31e197c97cc6380c2c0e49927b0a
Submitter: Jenkins
Branch: master

commit e8e777a55b6d31e197c97cc6380c2c0e49927b0a
Author: Bogdan Dobrelya <email address hidden>
Date: Wed May 27 15:47:42 2015 +0200

    Check if the rabbitmqctl command is responding

    Without this fix, rabbitmqctl may sometimes hang, causing
    many commands to fail. This is a problem because it leaves the rabbit
    node in an unresponsive and broken state. It may also affect
    entire cluster operations, for example, when the failed command is
    forget_cluster_node.

    The solution is to check for the cases when the command rabbitmqctl
    list_channels timed out and was killed, or terminated with exit codes
    137 or 124, and to return a generic error.
    A related confusing error message, "get_status() returns generic
    error", which may be logged when the rabbit node is running outside the
    cluster, is fixed as well.

    Closes-bug: #1459173

    Change-Id: Ia52fc5f2ab7adb36252a7194f9209ab87ce487de
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
tags: added: ha rabbitmq
Denis Meltsaykin (dmeltsaykin) wrote :

Setting this to Won't Fix for 5.1.1-updates and 6.0-updates, as such a complex change cannot be delivered within the scope of a Maintenance Update. Also, a possible solution, backporting the RabbitMQ OCF script, is covered in detail in the Operations Guide in the official product documentation.
