Status of alarms and queues is silently ignored in rabbit monitoring ocf

Bug #1487517 reported by Ilya Shakhat on 2015-08-21
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
High
Dmitry Mescheryakov
6.1.x
High
Denis Meltsaykin

Bug Description

Status of queues is checked by the following code:
    local queues=`su_rabbit_cmd "${OCF_RESKEY_ctl} -q list_queues memory messages consumer_utilisation"`
    local rc_queues=$?
However value of 'rc_queues' will always be 0 (since it is exit code of local var assignment). The same issue is for alarms detection (https://github.com/stackforge/fuel-library/blob/master/files/fuel-ha-utils/ocf/rabbitmq#L1155-L1183). It is expected that when one of statuses fails then monitoring should return error exit code.

The issue found analyzing the logs:
2015-08-20T08:46:33.714650+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 137: /usr/sbin/rabbitmqctl -q list_queues memory messages consumer_utilisation
2015-08-20T08:46:33.719737+00:00 info: INFO: p_rabbitmq-server: get_monitor(): get_monitor function ready to return 0

Ilya Shakhat (shakhat) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "189"
  build_id: "2015-08-18_17-24-26"
  nailgun_sha: "d0b727cdd0d8e7ce5e17e6ea1306d835bfdfb5e7"
  python-fuelclient_sha: "6ad5e0eb4dbbca6cade3444554606927ecd3f16f"
  fuel-agent_sha: "57145b1d8804389304cd04322ba0fb3dc9d30327"
  fuel-nailgun-agent_sha: "e01693992d7a0304d926b922b43f3b747c35964c"
  astute_sha: "e24ca066bf6160bc1e419aaa5d486cad1aaa937d"
  fuel-library_sha: "7c80eed2119260cc15a700068b9eb20ccc773926"
  fuel-ostf_sha: "235f21b230fea15724d625b2dc44ade0464527e2"
  fuelmain_sha: "c9dad194e82a60bf33060eae635fff867116a9ce"

Changed in fuel:
milestone: none → 7.0
assignee: nobody → MOS Oslo (mos-oslo)
tags: added: ocf rabbitmq
Bogdan Dobrelya (bogdando) wrote :

The exit code should be fine. I tested it with a simple script and it returns 1:

# cat test.sh
foo(){
  kill -0 9999; local rc=$?
  echo $rc
}

foo

Changed in fuel:
status: New → Invalid
Ilya Shakhat (shakhat) wrote :

Bogdan, you example is wrong, in the ocf script function result is assigned to local variable. I.e. instead of "kill -0 9999; " you should try: "local v=`kill -0 9999`".

Here's my sample to confirm the issue:

root@node-2:~# cat confirm.sh
#!/bin/bash

func() {
  kill -0 9999
  local rc=$?
  echo "Direct call: ${rc}"

  local a=`kill -0 9999`
  rc=$?
  echo "Assignment to local: ${rc}"
}

func

root@node-2:~# ./confirm.sh
./confirm.sh: line 4: kill: (9999) - No such process
Direct call: 1
./confirm.sh: line 8: kill: (9999) - No such process
Assignment to local: 0

Changed in fuel:
status: Invalid → New
status: New → Confirmed
Changed in fuel:
status: Confirmed → New
Bogdan Dobrelya (bogdando) wrote :

Indeed. Good catch, thank you.
We discussed this with Dmitry Mescheryakov, and he suggested to fix this not directly, but instead introduce a rabbitmqctl failures counter to be kept in CIB alongside with rabbit uptime.

The issue is that is seems, there are too many restart actions inducted by hanged rabbitmqctl * checks under load. Fixing this bug directly, would have onle increased restarts numbers under high load even more. So, the idea is to increase that counter on each rabbitmqctl failed check and report the resource failed only then this threshold has exceeded.

Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: MOS Oslo (mos-oslo) → Dmitry Mescheryakov (dmitrymex)

I love this new approach Dmitry and Bogdan. Thanks!

Changed in fuel:
status: Confirmed → In Progress
Changed in fuel:
assignee: Dmitry Mescheryakov (dmitrymex) → Sergii Golovatiuk (sgolovatiuk)
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Dmitry Mescheryakov (dmitrymex)

Reviewed: https://review.openstack.org/217738
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=2707a5ebbff7012a94de77b60fd594f5bcb29e05
Submitter: Jenkins
Branch: master

commit 2707a5ebbff7012a94de77b60fd594f5bcb29e05
Author: Dmitry Mescheryakov <email address hidden>
Date: Tue Aug 25 17:38:44 2015 +0300

    Make RabbitMQ OCF script tolerate rabbitmqctl timeouts

    The change makes OCF script ignore small number of timeouts of rabbitmqctl
    for 'heavy' operations: list_channels, get_alarms and list_queues.
    Number of tolerated timeouts in a row is configured through a new variable
    'max_rabbitmqctl_timeouts'. By default it is set to 1, i.e. rabbitmqctl
    timeouts are not tolerated at all.

    Bug #1487517 is fixed by extracting declaration of local variables
    'rc_alarms' and 'rc_queues' from assignment operations.

    Text for Operations Guide:

    If on node where RabbitMQ is deployed
    other processes consume significant part of CPU, RabbitMQ starts
    responding slow to queries by 'rabbitmqctl' utility. The utility is
    used by RabbitMQ's OCF script to monitor state of the RabbitMQ.
    When utility fails to return in pre-defined timeout, OCF script
    considers RabbitMQ to be down and restarts it, which might lead to
    a limited (several minutes) OpenStack downtime. Such restarts
    are undesirable as they cause downtime without benefit. To
    mitigate the issue, the OCF script might be told to tolerate
    certain amount of rabbitmqctl timeouts in a row using the following
    command:
      crm_resource --resource p_rabbitmq-server --set-parameter \
          max_rabbitmqctl_timeouts --parameter-value N

    Here N should be replaced with the number of timeouts. For instance,
    if it is set to 3, the OCF script will tolerate two rabbitmqctl
    timeouts in a row, but fail if the third one occurs.

    By default the parameter is set to 1, i.e. rabbitmqctl timeout is not
    tolerated at all. The downside of increasing the parameter is that
    if a real issue occurs which causes rabbitmqctl timeout, OCF script
    will detect that only after N monitor runs and so the restart, which
    might fix the issue, will be delayed.

    To understand that RabbitMQ's restart was caused by rabbitmqctl timeout
    you should examine lrmd.log of the corresponding controller on Fuel
    master node in /var/log/docker-logs/remote/ directory. Here lines like
    "the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels ..."

    indicate rabbitmqctl timeout. The next line will explain if it
    caused restart or not. For example:
    "rabbitmqctl timed out 2 of max. 3 time(s) in a row. Doing nothing for now."

    DocImpact: user-guide, operations-guide

    Closes-Bug: #1479815
    Closes-Bug: #1487517
    Change-Id: I9dec06fc08dbeefbc67249b9e9633c8aab5e09ca

Changed in fuel:
status: In Progress → Fix Committed
tags: added: on-verivication
Veronica Krayneva (vkrayneva) wrote :

Checked on 301 iso. Fixed.

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: on-verivication

Reviewed: https://review.openstack.org/222614
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=a304fac9bf1ee4e98cfc355e3058b9664c2768c2
Submitter: Jenkins
Branch: stable/6.1

commit a304fac9bf1ee4e98cfc355e3058b9664c2768c2
Author: Dmitry Mescheryakov <email address hidden>
Date: Tue Aug 25 17:38:44 2015 +0300

    Make RabbitMQ OCF script tolerate rabbitmqctl timeouts

    The change makes OCF script ignore small number of timeouts of rabbitmqctl
    for 'heavy' operations: list_channels, get_alarms and list_queues.
    Number of tolerated timeouts in a row is configured through a new variable
    'max_rabbitmqctl_timeouts'. By default it is set to 1, i.e. rabbitmqctl
    timeouts are not tolerated at all.

    Bug #1487517 is fixed by extracting declaration of local variables
    'rc_alarms' and 'rc_queues' from assignment operations.

    Text for Operations Guide:

    If on node where RabbitMQ is deployed
    other processes consume significant part of CPU, RabbitMQ starts
    responding slow to queries by 'rabbitmqctl' utility. The utility is
    used by RabbitMQ's OCF script to monitor state of the RabbitMQ.
    When utility fails to return in pre-defined timeout, OCF script
    considers RabbitMQ to be down and restarts it, which might lead to
    a limited (several minutes) OpenStack downtime. Such restarts
    are undesirable as they cause downtime without benefit. To
    mitigate the issue, the OCF script might be told to tolerate
    certain amount of rabbitmqctl timeouts in a row using the following
    command:
      crm_resource --resource p_rabbitmq-server --set-parameter \
          max_rabbitmqctl_timeouts --parameter-value N

    Here N should be replaced with the number of timeouts. For instance,
    if it is set to 3, the OCF script will tolerate two rabbitmqctl
    timeouts in a row, but fail if the third one occurs.

    By default the parameter is set to 1, i.e. rabbitmqctl timeout is not
    tolerated at all. The downside of increasing the parameter is that
    if a real issue occurs which causes rabbitmqctl timeout, OCF script
    will detect that only after N monitor runs and so the restart, which
    might fix the issue, will be delayed.

    To understand that RabbitMQ's restart was caused by rabbitmqctl timeout
    you should examine lrmd.log of the corresponding controller on Fuel
    master node in /var/log/docker-logs/remote/ directory. Here lines like
    "the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels ..."

    indicate rabbitmqctl timeout. The next line will explain if it
    caused restart or not. For example:
    "rabbitmqctl timed out 2 of max. 3 time(s) in a row. Doing nothing for now."

    DocImpact: user-guide, operations-guide

    Closes-Bug: #1479815
    Closes-Bug: #1487517
    Change-Id: I9dec06fc08dbeefbc67249b9e9633c8aab5e09ca
    (cherry picked from commit 2707a5ebbff7012a94de77b60fd594f5bcb29e05)

Vitaly Sedelnik (vsedelnik) wrote :

Reassigned to Denis Meltsaykin to create release notes entry for MOS 6.1 MU 3

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers