Comment 36 for bug 1460762

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.1)

Reviewed: https://review.openstack.org/251760
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=ff320db58993770880aee126cdc27dad9e41aa7f
Submitter: Jenkins
Branch: stable/6.1

commit ff320db58993770880aee126cdc27dad9e41aa7f
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Jun 8 15:24:36 2015 +0200

    Adjust the RabbitMQ kernel net_ticktime parameter

    Without this fix, arbitrary queue masters can get
    stuck during short CPU load spikes. As a result,
    rabbitmqctl hangs, and the OCF monitor logic of
    the related pacemaker RA erases and restarts the
    affected rabbit node.

    This is an issue because Oslo.messaging does not yet
    seem to be in good enough shape to survive such a
    short outage of the underlying AMQP layer without
    disrupting the RPC calls in flight.

    The workaround is to reduce the net_ticktime Erlang
    kernel parameter from the default 60 seconds to 10
    seconds. This allows the RabbitMQ cluster to better
    detect short-lived partitions and autoheal them. Once
    a partition has been detected and healed, rabbitmqctl
    should hopefully no longer hang, because the stuck
    queue masters will have been recovered.

    Partial-bug: #1460762

    Change-Id: I3aa47b51ae080bb4a8b298c61a629ac8225d2abd
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 3ffae36b9de967747a1cc1045eae0923d6bd893c)
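
For a rough sense of what the change from 60 to 10 seconds means, the sketch below computes the failure detection window for a given net_ticktime. It assumes the behaviour described in the Erlang kernel application documentation: a tick is sent every net_ticktime / 4 seconds, and an unresponsive peer is declared down somewhere between net_ticktime - net_ticktime / 4 and net_ticktime + net_ticktime / 4 seconds. The function name detection_window is hypothetical and is not part of fuel-library or RabbitMQ.

```python
from typing import Tuple


def detection_window(net_ticktime: float) -> Tuple[float, float]:
    """Return the (min, max) seconds before an unresponsive peer is declared down.

    Assumption: the window is net_ticktime -/+ net_ticktime / 4, as stated
    in the Erlang kernel application documentation.
    """
    quarter = net_ticktime / 4
    return (net_ticktime - quarter, net_ticktime + quarter)


if __name__ == "__main__":
    # Compare the default value with the value set by this commit.
    for ticktime in (60, 10):
        low, high = detection_window(ticktime)
        print(
            f"net_ticktime={ticktime}s: tick every {ticktime / 4}s, "
            f"peer declared down after {low}s to {high}s"
        )
```

Under that assumption, the default gives a window of 45 to 75 seconds, while the new value gives 7.5 to 12.5 seconds, which is why short partitions are noticed much sooner.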