collect_rabbitmq_stats.sh sometimes creates cache files check_rabbitmq_queues.py can't parse

Bug #1850948 reported by Andrea Ieri
This bug affects 2 people
Affects: OpenStack RabbitMQ Server Charm
Status: Fix Released
Importance: Medium
Assigned to: Tianqi Xiao

Bug Description

Not sure under which conditions this happens, but collect_rabbitmq_stats.sh has just created these entries in *_queue_stats.dat on one of our clouds:

landscape landscape.notifications-queue.b9557ad1-9908-425e-a860-d424e34f63d7 0 0 0 0 34952 1572621605
landscape landscape.notifications-queue.8b1cdd8c-f098-4f48-abd0-7fdeea076308 0 0 0 0 34952 1572621605
landscape landscape.notifications-queue.473e25d6-1df5-47e0-9935-45fc06463921 0 0 0 0 34952 1572621605
landscape landscape.notifications-queue.80a52f1f-b96a-48ea-a24e-6e26a9c717c7 0 0 0 0 34952 1572621605
landscape landscape.notifications-queue.b43d82b1-3378-4420-9da9-4eba9994657e 0 0 0 0 34952 1572621605
landscape landscape.notifications-queue.6f703d3a-d5da-47fd-9a43-491c6662221c 0 0 0 0 34952 1572621605
landscape landscape.notifications-queue.d65a2e58-23bd-42ca-8220-38bbd05e7168 1572621605
landscape landscape.notifications-queue.a5068dad-e162-4815-942b-7af8650d0204 1572621605
landscape landscape.notifications-queue.aed6fb68-b1ff-4a68-980e-df6adf786beb 1572621605

[formatting is poor, but the last three entries only have 3 columns]

The check_rabbitmq_queues.py nrpe check relies on each line of the dat file having 8 columns. If some are missing, the tokenization in the check breaks and an "ERROR: problem parsing the stats file" alert is produced.

The nrpe check should be more resilient, and at most generate a warning if some queues cannot be parsed, not a critical.
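To illustrate the behaviour requested above, here is a minimal sketch of a tolerant parser. It is not the code in check_rabbitmq_queues.py: the names parse_stats, exit_status and EXPECTED_COLUMNS are hypothetical, and the only assumptions are the 8-column layout described in this report and the usual Nagios exit codes (0 = OK, 1 = WARNING).

```python
# Hypothetical sketch; none of these names come from check_rabbitmq_queues.py.
# Assumed columns: vhost, name, messages_ready, messages_unacknowledged,
# messages, consumers, memory, time.
EXPECTED_COLUMNS = 8


def parse_stats(lines):
    """Return (parsed_rows, skipped_lines) instead of failing on bad input."""
    parsed, skipped = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or header comment
        fields = line.split()
        if len(fields) != EXPECTED_COLUMNS:
            skipped.append(line)  # e.g. a 3-column entry
            continue
        try:
            numbers = [int(value) for value in fields[2:]]
        except ValueError:
            skipped.append(line)  # right column count, non-numeric value
            continue
        parsed.append((fields[0], fields[1], *numbers))
    return parsed, skipped


def exit_status(skipped):
    """Nagios convention: 0 = OK, 1 = WARNING."""
    return 1 if skipped else 0


sample = [
    "#Vhost Name Messages_ready Messages_unacknowledged Messages Consumers Memory Time",
    "landscape queue-a 0 0 0 0 34952 1572621605",
    "landscape queue-b 1572621605",
]
parsed, skipped = parse_stats(sample)
```

With the sample above, the well-formed line is parsed, the short line lands in skipped, and exit_status(skipped) reports a warning rather than a critical.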

Andrea Ieri (aieri)
description: updated
Alex Kavanagh (ajkavanagh) wrote :

There's not really a lot to go on with this report. Please could you add some logs, versions of charms/ubuntu/openstack, etc. Thanks.

Changed in charm-rabbitmq-server:
status: New → Incomplete
Andrea Ieri (aieri) wrote :

This cloud is bionic-queens, running charms 19.07. rmq-server is rev 95.

Unfortunately I don't really have logs, because the cronjob never seems to have logged anything in syslog (besides the fact that the job ran), and /var/lib/rabbitmq/logs/list_queues.log gets overwritten every time the cronjob runs.

It's also odd that this has not happened again since.

Anyway, I think that regardless of the reason why /usr/local/bin/collect_rabbitmq_stats.sh sometimes generates incorrect output, it would be good if /usr/local/lib/nagios/plugins/check_rabbitmq_queues.py was a little more resilient when it encounters unexpected entries.

Changed in charm-rabbitmq-server:
status: Incomplete → New
Alex Kavanagh (ajkavanagh) wrote :

The offending lines in collect_rabbitmq_stats.sh are (starting at line 47):

echo "#Vhost Name Messages_ready Messages_unacknowledged Messages Consumers Memory Time" > ${TMP_DATA_FILE}
/usr/sbin/rabbitmqctl -q list_vhosts | \
while read VHOST; do
    /usr/sbin/rabbitmqctl -q list_queues -p $VHOST name messages_ready messages_unacknowledged messages consumers memory | \
    awk "{print \"$VHOST \" \$0 \" $(date +'%s') \"}" >> ${TMP_DATA_FILE} 2>${LOG_DIR}/list_queues.log
done

So basically, for those queues rabbitmqctl list_queues isn't printing the "messages_ready", "messages_unacknowledged", "messages", "consumers" and "memory" queueinfoitems.
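As a rough illustration of why such entries end up with exactly 3 columns, the sketch below mimics the awk step in Python: it prefixes the vhost and appends the timestamp to whatever rabbitmqctl printed. The function name wrap_line and the queue names are made up for this example, and the space-separated input lines are an assumption, not captured rabbitmqctl output.

```python
def wrap_line(vhost, rabbitmqctl_line, timestamp):
    """Mimic the awk print: vhost, the original line, then the timestamp."""
    return f"{vhost} {rabbitmqctl_line} {timestamp} "


ts = 1572621605

# All six requested queueinfoitems present (name plus five counters).
healthy = wrap_line("landscape", "queue-a 0 0 0 0 34952", ts)

# Only the queue name present, as in the three short entries in the report.
degraded = wrap_line("landscape", "queue-b", ts)

print(len(healthy.split()), len(degraded.split()))  # prints "8 3"
```

The vhost and timestamp are always added by the script, so a line that carries only the queue name still yields three fields.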

From the rabbitmqctl man page (state is another option):

state
    The state of the queue. Normally "running", but may be "{syncing, message_count}" if the queue is synchronising.

    Queues which are located on cluster nodes that are currently down will be shown with a status of "down" (and most other queueinfoitem will be unavailable).

So I suspect that the state is "down" for those queues, which is why the information is missing. As you've suggested, "check_rabbitmq_queues.py" should probably handle this condition (which I suggest we now treat as expected) and not fail.

Changed in charm-rabbitmq-server:
status: New → Triaged
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-rabbitmq-server (master)
Changed in charm-rabbitmq-server:
status: Triaged → In Progress
Tianqi Xiao (txiao)
Changed in charm-rabbitmq-server:
assignee: nobody → Tianqi Xiao (txiao)
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-rabbitmq-server (master)

Reviewed: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/836053
Committed: https://opendev.org/openstack/charm-rabbitmq-server/commit/9376aeb8e66c9672baf8af950ce4164b1da4c466
Submitter: "Zuul (22348)"
Branch: master

commit 9376aeb8e66c9672baf8af950ce4164b1da4c466
Author: Tianqi <email address hidden>
Date: Thu Mar 31 17:45:03 2022 +0000

    Handle non-uniform queue stats output

    RabbitMQ server sometimes creates non-uniform outputs that nrpe
    can't parse. Instead of breaking the check, this commit outputs
    the error messages and continues the check.

    This problem is most likely caused by queue state being
    "down" [1]. However, because the current charm doesn't show such
    information and the bug is hard to manually reproduce, this
    commit adds the state attribute when creating queue_state file
    for future debugging.

    [1] https://www.rabbitmq.com/rabbitmqctl.8.html#state_2

    Closes-Bug: #1850948
    Change-Id: Iaa493c8270f344cde8ad7c89bd2bb548f0ad71bd

Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
Changed in charm-rabbitmq-server:
milestone: none → 22.04
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released