Comment 7 for bug 1779171

Alvaro Uria (aluria) wrote :

I have been discussing this bug with a colleague and we've seen that the condition mentioned by Liam is now config('stats_cron_schedule'):
https://github.com/openstack/charm-rabbitmq-server/blob/master/hooks/rabbitmq_server_relations.py#L651

However, we have been reviewing the scripts' code and here is what we see:
1) cronjob script: writes "{schedule} root timeout ... {timeout} {command}" in /etc/cron.d
2) {command} is the collect_rabbitmq_stats script, which generates 2 output files
3) check_rabbitmq.py sends probe messages via rmq
4) check_rabbitmq_queues.py parses one of the files generated by the cronjob at #2

I think the fix involves 2 scenarios:
a) the one mentioned by Liam, where the output file parsed by #4 does not exist (the first 5 minutes until the first run of the cronjob). A possible fix can be found at https://review.opendev.org/661814
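As an illustration of scenario a), here is a minimal sketch of a check that tolerates a missing output file. It is not taken from the review linked above; the function name, the messages and the choice of UNKNOWN for a missing file are assumptions made for the example. Only the exit code values are the standard Nagios plugin ones.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_stats_or_status(path):
    """Return (exit_code, message, lines) for a cron-generated stats file.

    Hypothetical helper: a missing file is reported as UNKNOWN instead of
    raising, because the cronjob may simply not have run yet.
    """
    if not os.path.isfile(path):
        msg = "stats file {} not found (cronjob not run yet?)".format(path)
        return UNKNOWN, msg, []
    with open(path) as f:
        # Keep non-empty lines only; the real file format is not assumed here.
        lines = [line.rstrip("\n") for line in f if line.strip()]
    return OK, "read {} line(s) from {}".format(len(lines), path), lines
```

Whether a missing file should map to UNKNOWN, OK or WARNING is a policy decision; the point is only that the check returns a status instead of failing with a traceback.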

b) when the cronjob is removed, the nrpe check that calls check_rabbitmq_queues.py should also be removed. The reason is that the output file would no longer be updated (or would not exist in the first place), so the check should not exist either. Cronjobs are used in charms for only two reasons: permissions (checks are run as "nagios", but "root" may be needed); and the time a check takes to run (rmq checks can take more than 10 seconds, so check_nrpe could return a "socket timeout" unless the work runs asynchronously and the check only reads the output file).

To fix b),
b.1) rsync of check_rabbitmq_queues.py from scripts/* to NAGIOS_PLUGINS should be done in the same place where the cronjob is copied
b.2) when the cronjob is removed, the script could also be removed
b.3) where the nrpe check is added, if the cronjob has been removed (same condition as above), nrpe_compat.remove_check should be called instead of add_check.
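Steps b.1 to b.3 could look roughly like the sketch below. It is not the charm's code: FakeNRPE is a stand-in that only records calls, shutil.copy2 replaces the charm's rsync step, and the function name, the "rabbitmq_queue" shortname and the description text are hypothetical.

```python
import os
import shutil


class FakeNRPE:
    """Stand-in for the charm's NRPE helper; it only records the calls."""

    def __init__(self):
        self.calls = []

    def add_check(self, shortname, description, check_cmd):
        self.calls.append(("add", shortname))

    def remove_check(self, shortname):
        self.calls.append(("remove", shortname))


def sync_queue_check(cron_schedule, script_src, plugins_dir, nrpe):
    """Install or remove the queue check together with the cronjob."""
    script_dst = os.path.join(plugins_dir, os.path.basename(script_src))
    if cron_schedule:
        # b.1: copy the check script in the same branch that sets up the cronjob.
        shutil.copy2(script_src, script_dst)
        nrpe.add_check(
            shortname="rabbitmq_queue",  # hypothetical shortname
            description="Check RabbitMQ queues",
            check_cmd=script_dst,
        )
    else:
        # b.2: cronjob removed, so remove the script as well.
        if os.path.exists(script_dst):
            os.remove(script_dst)
        # b.3: remove the nrpe check instead of adding it.
        nrpe.remove_check(shortname="rabbitmq_queue")
```

Keeping the script copy, the script removal and the nrpe call behind the same condition means the check cannot outlive the cronjob that feeds it.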