Restart rabbit if queues can't be listed or a memory alarm is found
Without this fix, a dead-end situation is possible: when a rabbit node
has no free memory left, the cluster blocks all publishing, by design.
The application, however, keeps waiting for the publish block to be
lifted and so never recovers.
The workaround is to monitor the results of crucial rabbitmqctl
commands and to restart the rabbit node if queues, channels, or alarms
cannot be listed, or if memory alarms are found.
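A minimal sketch of that monitor logic, assuming bash and stock
rabbitmqctl commands (the function name, timeout value, and the
eval-based alarm query are illustrative, not the exact OCF agent code):

    # Sketch: fail the monitor if crucial listings fail or a memory
    # alarm is raised, so Pacemaker restarts the rabbit node.
    check_rabbit_health() {
        local t=10 alarms
        timeout "${t}" rabbitmqctl -q list_queues name >/dev/null 2>&1 || return 1
        timeout "${t}" rabbitmqctl -q list_channels >/dev/null 2>&1 || return 1
        # Alarms have no plain "list" command here; query them via eval.
        alarms=$(timeout "${t}" rabbitmqctl -q eval 'rabbit_alarm:get_alarms().') || return 1
        echo "${alarms}" | grep -q memory && return 1
        return 0
    }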
This mirrors the logic we already have for the case when rabbitmqctl
list_channels hangs. The channels check is also fixed to treat a
non-zero exit code as a failure only while the rabbit app is running.
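A sketch of the corrected channels check, under the assumption that the
app's running state is probed via rabbit:is_running() (the agent's
actual probe may differ):

    # Sketch: a failing list_channels only counts as an error while
    # the rabbit app itself is running; if the app is stopped, the
    # failed listing is expected and must not trigger a restart.
    check_channels() {
        rabbitmqctl -q list_channels >/dev/null 2>&1 && return 0
        # Only treat the failure as fatal while the rabbit app is up.
        if rabbitmqctl -q eval 'rabbit:is_running().' 2>/dev/null | grep -q true; then
            return 1
        fi
        return 0
    }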
Additional checks added to the monitor also require extending
the timeout window for the monitor action from 60 to 180 seconds.
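In Pacemaker terms, that corresponds to a resource definition along
these lines (crm shell syntax; the resource and provider names and the
interval and start/stop values are assumptions, not the literal Fuel
configuration):

    crm configure primitive p_rabbitmq-server ocf:fuel:rabbitmq-server \
        op monitor interval=30 timeout=180 \
        op start timeout=360 \
        op stop timeout=120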
Besides that, this patch makes the monitor action gather the rabbit
status and runtime stats: memory consumed by all queues as a share of
total Mem+Swap, the total number of messages across all queues, and the
average queue consumer utilization. This info should help with
troubleshooting failures.
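The gathering itself boils down to stock rabbitmqctl calls; a sketch
(the awk roll-up is illustrative, and the consumer_utilisation column
is only reported by RabbitMQ 3.3+):

    rabbitmqctl status    # node status: memory use, alarms, uptime
    rabbitmqctl -q list_queues memory messages consumer_utilisation \
        | awk '{mem += $1; msg += $2; util += $3; n++}
               END {printf "mem=%d msgs=%d avg_util=%.2f\n", mem, msg, (n ? util / n : 0)}'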
DocImpact: ops guide. If any rabbitmq node exceeds its memory
threshold, publishing becomes blocked cluster-wide, by design. In such
cases, the rabbit node is now recovered from the raised memory alarm
and immediately stopped, to be restarted later by Pacemaker. Otherwise,
the blocked publishing state might never be lifted if the pressure from
the OpenStack apps persists.
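For operators, the recovery amounts to something like the following
(a sketch; the agent's actual actions and their ordering may differ):

    # If a memory alarm is found on this node, stop the rabbit app;
    # Pacemaker then restarts the node, which clears the alarm and
    # lifts the cluster-wide publish block.
    if rabbitmqctl -q eval 'rabbit_alarm:get_alarms().' | grep -q memory; then
        rabbitmqctl stop_app    # Pacemaker will restart the node later
    fi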
Reviewed: https://review.openstack.org/222608
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=599961e60858f182811dd5bc166b4d76e3b3de36
Submitter: Jenkins
Branch: stable/6.1
commit 599961e60858f182811dd5bc166b4d76e3b3de36
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Jun 10 13:44:53 2015 +0200
Closes-bug: #1463433
Change-Id: I91dec2d30d77b166ff9fe88109f3acdd19ce9ff9
Signed-off-by: Bogdan Dobrelya <email address hidden>
(cherry picked from commit bf604f80d72f69e771152b153973fa38fa83afd8)