collect_rabbitmq_stats.sh locking can timeout causing a monitoring self-DOS

Bug #1830036 reported by James Troup
This bug affects 1 person
Affects: OpenStack RabbitMQ Server Charm
Status: Incomplete
Importance: Medium
Assigned to: Martin Kalcok

Bug Description

collect_rabbitmq_stats.sh uses lockfile-create. The lockfile-create manpage says:

  Once a file is locked, the lock must be touched at least once every five minutes or the lock will be considered stale

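The end result is that if your rabbitmq server is not down but also not returning results, a hung collect_rabbitmq_stats.sh instance holds its lock for longer than five minutes, the lock is then considered stale, and the next run starts anyway. The monitoring keeps starting new collect_rabbitmq_stats.sh instances until you run out of resources and possibly crash the machine.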

I believe the locking should be switched to something which does not make purely time-based assumptions about freshness.
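One possible shape for such locking is flock(1), where the kernel holds the lock for as long as the file descriptor stays open and releases it when the process exits, with no staleness timer involved. The following is only a minimal sketch, not the charm's code: LOCK_FILE, collect_stats and run_exclusively are hypothetical names used for illustration.

```shell
#!/bin/bash
# Sketch only: kernel-held lock via flock(1) instead of lockfile-create.
# LOCK_FILE, collect_stats and run_exclusively are hypothetical names.
LOCK_FILE="${LOCK_FILE:-/tmp/collect_rabbitmq_stats.lock}"

collect_stats() {
    # Placeholder for the real stats collection.
    echo "collecting stats"
}

run_exclusively() {
    (
        # -n: fail immediately instead of waiting if the lock is already held.
        if ! flock -n 9; then
            echo "another instance is still running, skipping" >&2
            exit 1
        fi
        # The lock stays held until this subshell (and its children) exit.
        "$@"
    ) 9>"$LOCK_FILE"
}

# Usage from the cron-driven script would look like:
#   run_exclusively collect_stats
```

With this approach a hung instance keeps the lock until it is killed or exits, so later runs skip instead of piling up.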

Changed in charm-rabbitmq-server:
status: New → Triaged
importance: Undecided → Medium
Changed in charm-rabbitmq-server:
assignee: nobody → Martin Kalcok (martin-kalcok)
status: Triaged → In Progress
Martin Kalcok (martin-kalcok) wrote :

Isn't the main issue that we can have 'collect_rabbitmq_stats.sh' scripts hanging forever (or at least for more than 5 minutes)? I'm not that familiar with RabbitMQ, but 5 minutes just to collect stats seems like a lot to me.

We could use "-t TIMEOUT" on rabbitmqctl command [1] to ensure that the nrpe script does not hang forever and we could report back unresponsive services.

[1] https://www.rabbitmq.com/rabbitmqctl.8.html
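A rough sketch of what that could look like follows. It assumes the stats come from "rabbitmqctl list_queues name messages", which this thread does not confirm; RABBITMQCTL, RMQ_TIMEOUT and list_queues_or_unknown are hypothetical names, and exit code 3 is the Nagios UNKNOWN state.

```shell
#!/bin/bash
# Sketch only: bound the rabbitmqctl call with its own -t option (seconds).
# RABBITMQCTL, RMQ_TIMEOUT and list_queues_or_unknown are hypothetical names.
RABBITMQCTL="${RABBITMQCTL:-rabbitmqctl}"
RMQ_TIMEOUT="${RMQ_TIMEOUT:-30}"

list_queues_or_unknown() {
    local output
    # Assumption: queue names and message counts are the stats of interest.
    if output=$("$RABBITMQCTL" -t "$RMQ_TIMEOUT" list_queues name messages 2>&1); then
        printf '%s\n' "$output"
        return 0
    fi
    # Report an unresponsive or failing broker instead of hanging.
    echo "UNKNOWN: rabbitmqctl failed or did not answer within ${RMQ_TIMEOUT}s"
    return 3
}
```

The timeout value would still need to be chosen relative to the 5 minute cron interval.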

Peter Sabaini (peter-sabaini) wrote :

During normal operation rabbitmq shouldn't take this long, but we have seen rabbitmq hanging nevertheless.

Relatedly, I've proposed https://review.opendev.org/#/c/757955/5 to have the queue check (which consumes collect_rabbitmq_stats.sh output) check not only queue size but also stats freshness.

Alvaro Uria (aluria) wrote :

Hey Martin. Let me give you some background. Nagios checks use check_nrpe to call the commands configured on each unit (./check_nrpe -H <remote-unit-ip> -c <command> [-a <args>]). The default timeout for check_nrpe is 10s, but sometimes we extend it to 30s (if we find checks take more time, maybe due to network/dns issues). On the other hand, check_nrpe checks are called every 5 minutes (Nagios' default).

There are nrpe checks that may take longer than 10s/30s, and for those cases, we use a couple of configs:
1) set up a cronjob (e.g. in the RMQ units) that runs every 5 minutes and generates an output file (so health updates match the calls Nagios makes to check_nrpe)
2) configure an nrpe check (e.g. in the RMQ units), in /etc/nagios/nrpe.d, which parses the generated output file (parsing is expected to take less than 10s/30s)

Another reason a cronjob+nrpe-check pair is used is when privileges are needed. An example of such a case is check_ceph_health, where "ceph status" needs to be run as root, but check_ceph_health runs as the "nagios" user.

In the case of RMQ, it is debatable whether the check should ever take more than 5 minutes to run, but what should happen, it seems, is:
1) cronjob runs every 5 minutes
2) if another process is still running, then skip the execution (by doing this, we make sure a single process will be running)
3) nrpe check that parses the generated output (this already exists)
4) nrpe check which verifies the timestamp of the generated file (nagios_plugin3.py, part of charm-nrpe, already provides a function to run such a check). By doing this, and verifying that the file's timestamp is not older than e.g. 15 minutes, we monitor that the cronjob in step #1 is not stuck and is continually reporting updates
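The freshness check in step 4 can be sketched in plain shell as below. This is not the nagios_plugin3.py function mentioned above, just an illustration of the idea; check_file_freshness is a hypothetical name, GNU stat is assumed, and reporting a stale file as CRITICAL rather than WARNING is an arbitrary choice made here.

```shell
#!/bin/bash
# Sketch only: alert when the stats file has not been updated recently.
# check_file_freshness is a hypothetical helper; exit codes follow the
# Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).
check_file_freshness() {
    local file="$1" max_age="$2" now mtime age
    if [ ! -e "$file" ]; then
        echo "UNKNOWN: $file does not exist"
        return 3
    fi
    now=$(date +%s)
    mtime=$(stat -c %Y "$file")   # GNU stat: mtime in seconds since the epoch
    age=$(( now - mtime ))
    if [ "$age" -gt "$max_age" ]; then
        echo "CRITICAL: $file last updated ${age}s ago (limit ${max_age}s)"
        return 2
    fi
    echo "OK: $file updated ${age}s ago"
    return 0
}

# Usage, with the 15 minute limit from the example above and a placeholder path:
#   check_file_freshness /path/to/stats-file 900
```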

Alternatively, if "-t <timeout>" is used, that value should be configurable via Juju config. Bear in mind the cronjob is fixed to run every 5 minutes (which aligns with Nagios default value). If the timeout is bigger than 5min, the cronjob script should skip further runs until 5min later.

In the end, we don't want multiple copies of the same health script running in parallel and exhausting resources. If the script does not return data before <timeout>, the check should alert (probably with the "UNKNOWN" return code [1]).

[1] https://nagios-plugins.org/doc/guidelines.html#AEN78

Martin Kalcok (martin-kalcok) wrote :

Are there any steps to reproduce this issue? (I mean the self-DOS itself.) After closer inspection, I found that the 'collect_rabbitmq_stats.sh' script is already run by cron under a timeout, with a default value of 300s. So it seems unlikely to me that the stats-collecting scripts keep piling up and eventually deplete resources on the unit.

Alvaro Uria (aluria) wrote :

This is not an easy bug to reproduce because you need an unresponsive RMQ instance. However, "timeout" should send the KILL (9) signal if we really want to make sure the monitoring process is terminated. Other signals may be caught, so the "timeout" option wouldn't effectively terminate the rabbitmqctl process.

OTOH, I think the collect_rabbitmq_stats.sh script writes into a temporary output file and then moves it to the final destination. If that is correct, the output file would not be updated and the check introduced on bug 1898523 would catch that a problem exists (because the mtime of the file is too old).
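For illustration, the write-to-temporary-file-then-move pattern described above could look like the sketch below. Whether collect_rabbitmq_stats.sh does exactly this is not confirmed in this thread, and write_atomically is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: update the output file only when the producing command succeeds.
# write_atomically is a hypothetical helper, not taken from the charm.
write_atomically() {
    local dest="$1" tmp
    shift
    # Create the temporary file next to the destination so that mv is a
    # rename on the same filesystem. mktemp creates it with mode 0600, so a
    # real script may need a chmod for the nagios user to read the result.
    tmp=$(mktemp "${dest}.XXXXXX") || return 1
    if "$@" > "$tmp"; then
        mv -f "$tmp" "$dest"
    else
        # On failure the old file, and therefore its old mtime, is left alone.
        rm -f "$tmp"
        return 1
    fi
}
```

Because a failed or killed run never replaces the destination, its mtime stops advancing, which is what a freshness check would pick up.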

Martin Kalcok (martin-kalcok) wrote :

> However, "timeout" should send the KILL (9) signal if we really want to make sure the monitoring process is terminated.

It does that. It has a (default) 300s timeout, then it sends SIGINT, and if the process is not dead after another 10s, it sends SIGKILL.
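The escalation described here corresponds to the -s and -k options of GNU timeout. The sketch below shows the idea only: the actual cron entry is not quoted in this thread, and run_with_deadline is a hypothetical wrapper name.

```shell
#!/bin/bash
# Sketch only: send SIGINT after the limit, then SIGKILL after a grace period.
# run_with_deadline is a hypothetical wrapper around GNU coreutils timeout.
run_with_deadline() {
    local limit="$1" grace="$2"
    shift 2
    # -s INT: signal sent when the limit expires.
    # -k:     send SIGKILL if the command is still alive after the grace period.
    timeout -k "$grace" -s INT "$limit" "$@"
}

# With the values mentioned above, a call would look like:
#   run_with_deadline 300 10 /path/to/collect_rabbitmq_stats.sh
```

SIGKILL cannot be caught or ignored, so a command that ignores SIGINT is still terminated once the grace period ends.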

Changed in charm-rabbitmq-server:
status: In Progress → Incomplete
Martin Kalcok (martin-kalcok) wrote :

There does not seem to be enough information to reproduce this issue or to determine its real cause at the moment.

If this ever occurs again, it would be helpful to see the output of 'ps waxu | grep -i collect_rabbitmq_stats' and maybe also an overview of the resources consumed by the rabbitmq-server process.
