nrpe queue check fails due to missing data directory

Bug #1779171 reported by Márton Kiss
This bug affects 1 person
Affects: OpenStack RabbitMQ Server Charm
Status: Fix Released
Importance: Low
Assigned to: Robert Gildein
Milestone: 21.01

Bug Description

The check_rabbitmq_queue command fails because the /var/lib/rabbitmq/data directory is missing. As a consequence, all rabbitmq services show a WARNING status in Nagios with the message "NRPE: Unable to read output".

root@juju-21ef40-17-lxd-6:/etc/nagios/nrpe.d# /usr/local/lib/nagios/plugins/check_rabbitmq_queues.py -c \* \* 100 200 /var/lib/rabbitmq/data/juju-21ef40-17-lxd-6_queue_stats.dat
Traceback (most recent call last):
  File "/usr/local/lib/nagios/plugins/check_rabbitmq_queues.py", line 80, in <module>
    stats_collated = collate_stats(stats, args.c)
  File "/usr/local/lib/nagios/plugins/check_rabbitmq_queues.py", line 36, in collate_stats
    for vhost, queue, m_all in stats:
  File "/usr/local/lib/nagios/plugins/check_rabbitmq_queues.py", line 21, in gen_stats
    for line in data_lines:
  File "/usr/local/lib/nagios/plugins/check_rabbitmq_queues.py", line 14, in gen_data_lines
    with open(filename, "rb") as fin:
IOError: [Errno 2] No such file or directory: '/var/lib/rabbitmq/data/juju-21ef40-17-lxd-6_queue_stats.dat'
root@juju-21ef40-17-lxd-6:/etc/nagios/nrpe.d# ll /var/lib/rabbitmq/data
ls: cannot access '/var/lib/rabbitmq/data': No such file or directory

https://github.com/openstack/charm-rabbitmq-server/blob/master/hooks/rabbitmq_server_relations.py#L121

I see no code to create this data directory or to validate its presence and permissions.

Revision history for this message
Liam Young (gnuoy) wrote :

This issue only exists for at most 5 minutes immediately after the nrpe relation is formed. A cronjob is set up (/etc/cron.d/rabbitmq-stats), which in turn calls /usr/local/bin/collect_rabbitmq_stats.sh. collect_rabbitmq_stats.sh creates the directories and sets their permissions.

The check is designed like this to avoid escalating the permissions of the nagios user.
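
One way to make the check tolerate that initial window is to report a missing stats file as UNKNOWN instead of raising. The following is a minimal sketch, not the charm's actual code: `read_stats_lines` and its return shape are hypothetical names chosen for illustration, and only the standard Nagios exit codes are assumed.

```python
import errno

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_stats_lines(filename):
    """Read the stats file that the cron job is expected to write.

    Hypothetical helper: returns (status, message, lines). A missing file or
    directory yields UNKNOWN with a readable message instead of a traceback.
    """
    try:
        with open(filename, "r") as fin:
            lines = fin.read().splitlines()
    except (IOError, OSError) as exc:
        if exc.errno != errno.ENOENT:
            raise  # Permission problems and other errors still surface.
        message = "UNKNOWN: {} not found, stats may not be collected yet".format(
            filename
        )
        return UNKNOWN, message, []
    return OK, "OK: read {} line(s)".format(len(lines)), lines
```

A plugin built on this would print the returned message and exit with the returned status, so NRPE always receives a line of output even before the first cron run.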

Changed in charm-rabbitmq-server:
importance: Undecided → Low
status: New → Confirmed
Revision history for this message
Márton Kiss (marton-kiss) wrote :

In this specific case, the problem seems to be permanent, because the /etc/cron.d/rabbitmq-stats file is simply missing from the filesystem:

root@juju-21ef40-17-lxd-6:~# ll /etc/cron.d/
total 20
drwxr-xr-x 2 root root 4096 Jun 14 20:00 ./
drwxr-xr-x 105 root root 4096 Jun 28 06:22 ../
-rw-r--r-- 1 root root 589 Jul 16 2014 mdadm
-rw-r--r-- 1 root root 102 Apr 5 2016 .placeholder
-rw-r--r-- 1 root root 190 Jun 12 08:21 popularity-contest

However, the cron schedule option has a valid value:

$ juju config rabbitmq-server
  ...
  stats_cron_schedule:
    default: '*/5 * * * *'
    description: |
      Cron schedule used to generate rabbitmq stats. To disable,
      either unset this config option or set it to an empty string ('').
    source: default
    type: string
    value: '*/5 * * * *'

And the juju unit log has the following entry:
2018-06-14 20:00:35 DEBUG juju-log cluster:43: Writing file /etc/cron.d/rabbitmq-stats root:root 444

Revision history for this message
Liam Young (gnuoy) wrote :

Can you attach the /var/log/juju/unit* logs from the affected unit please?

Revision history for this message
Liam Young (gnuoy) wrote :

From an initial look this appears to be caused by this bit of code:

https://github.com/openstack/charm-rabbitmq-server/blob/master/hooks/rabbitmq_server_relations.py#L620

This change in behaviour was introduced by https://review.openstack.org/#/c/519594/ which looks like a mistake.

Workaround: juju config rabbitmq-server management_plugin=True

Revision history for this message
Liam Young (gnuoy) wrote :

Err, I mean the placement of the "if config('management_plugin'):" block was a mistake, not the whole change.

Revision history for this message
Lorenzo Cavassa (lorenzo-cavassa) wrote :

I set management_plugin=True to work around this issue.

After some hours RMQ got an 'error' state on 2 units.

Looks like I faced something similar to https://bugs.launchpad.net/charm-rabbitmq-server/+bug/1783203

I had to set 'management_plugin=False' again to avoid it.

Revision history for this message
Alvaro Uria (aluria) wrote :

I have been discussing this bug with a colleague, and we've seen that the condition mentioned by Liam is now "config('stats_cron_schedule')".
https://github.com/openstack/charm-rabbitmq-server/blob/master/hooks/rabbitmq_server_relations.py#L651

However, we have been reviewing the scripts' code and here is what we see:
1) cronjob script: writes "{schedule} root timeout ... {timeout} {command}" in /etc/cron.d
2) {command} is the collect_rabbitmq_stats script, which generates 2 output files
3) check_rabbitmq.py sends probe messages via rmq
4) check_rabbitmq_queues.py parses one of the files generated by the cronjob at #2

I think the fix involves 2 scenarios:
a) the one mentioned by Liam, where the output file parsed by #4 does not exist (the first 5 minutes until the first run of the cronjob). A possible fix can be found at https://review.opendev.org/661814

b) when the cronjob is removed, the nrpe check that calls check_rabbitmq_queues.py should also be removed. The reason is that the output file wouldn't be updated (or it would not even exist in the first place), so the check should not exist. The only reasons cronjobs are used in charms are: permission issues (checks are run as "nagios", but "root" may be needed); and the time taken to run a check (rmq checks can take more than 10 seconds, so check_nrpe could return a "socket timeout" if we don't run it asynchronously [just checking the output file]).

To fix b),
b.1) rsync of check_rabbitmq_queues.py from scripts/* to NAGIOS_PLUGINS should be done in the same place where the cronjob is copied
b.2) when the cronjob is removed, the script could also be removed
b.3) when the nrpe check is added, under the same condition as above, nrpe_compat.remove_check should be called instead of add_check.
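
Points b.2 and b.3 can be sketched as a single decision keyed on the cron schedule. This is only an illustration under assumptions: `FakeNRPE` is a hypothetical in-memory stand-in for the charm's `nrpe_compat` object, and `sync_queue_check`, the `rabbitmq_queue` shortname, and the example command string are invented for the example.

```python
class FakeNRPE(object):
    """Hypothetical stand-in for nrpe_compat; keeps checks in a dict."""

    def __init__(self):
        self.checks = {}

    def add_check(self, shortname, description, check_cmd):
        self.checks[shortname] = (description, check_cmd)

    def remove_check(self, shortname):
        # Removing a check that was never added is a no-op.
        self.checks.pop(shortname, None)


def sync_queue_check(nrpe_compat, stats_cron_schedule, check_cmd):
    """Register the queue check only while the stats cron job is enabled.

    Returns True if the check is registered, False if it was removed.
    """
    shortname = "rabbitmq_queue"  # assumed name, for illustration only
    if stats_cron_schedule:
        nrpe_compat.add_check(
            shortname=shortname,
            description="Check RabbitMQ queues",
            check_cmd=check_cmd,
        )
        return True
    # No cron job means no fresh stats file, so the check must go too.
    nrpe_compat.remove_check(shortname=shortname)
    return False
```

With this shape, unsetting stats_cron_schedule removes both the source of the stats file and the check that reads it, so Nagios is not left polling a file that will never be refreshed.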

tags: added: canonical-bootstack
Changed in charm-rabbitmq-server:
assignee: nobody → Robert Gildein (rgildein)
Changed in charm-rabbitmq-server:
status: In Progress → Fix Committed
milestone: none → 21.01
David Ames (thedac)
Changed in charm-rabbitmq-server:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-rabbitmq-server (master)

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/661814
Reason: This review is > 12 weeks without comment, and failed testing the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Alex Kavanagh <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/579105
Reason: The review is overtaken by events by https://review.opendev.org/c/openstack/charm-rabbitmq-server/+/759887 which has solved the same problem in a different way.
