Bug #1830036 “collect_rabbitmq_stats.sh locking can timeout caus...” : Bugs : OpenStack RabbitMQ Server Charm

Alex Kavanagh (ajkavanagh) on 2019-10-28

Changed in charm-rabbitmq-server:
status:	New → Triaged
importance:	Undecided → Medium

Martin Kalcok (martin-kalcok) on 2020-10-28

Changed in charm-rabbitmq-server:
assignee:	nobody → Martin Kalcok (martin-kalcok)
status:	Triaged → In Progress

Martin Kalcok (martin-kalcok) wrote on 2020-10-28:

#1

Isn't the main issue that 'collect_rabbitmq_stats.sh' scripts can hang forever (or at least for more than 5 minutes)? I'm not that familiar with RabbitMQ, but 5 minutes just to collect stats seems like a lot to me.

We could use "-t TIMEOUT" on the rabbitmqctl command [1] to ensure that the nrpe script does not hang forever, and we could then report unresponsive services back.

[1] https://www.rabbitmq.com/rabbitmqctl.8.html
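
As a rough illustration of the idea (not the charm's actual code), a collector could bound the time it waits on a command and report UNKNOWN when the command does not answer. The function name `run_with_timeout` is hypothetical; the value 3 is the standard Nagios plugin return code for UNKNOWN.

```python
import subprocess
import sys

UNKNOWN = 3  # Nagios plugin return code for "UNKNOWN"


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (exit_code, output).

    If cmd does not finish within timeout_s seconds, subprocess.run kills
    the child and we report UNKNOWN instead of hanging.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return UNKNOWN, f"UNKNOWN: {cmd[0]} did not respond within {timeout_s}s"
    return proc.returncode, proc.stdout
```

A call might look like `run_with_timeout(["rabbitmqctl", "-t", "30", "list_queues"], 60)`, where the inner "-t" is the rabbitmqctl option mentioned above and the outer 60 seconds is a safety net; both values are placeholders.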

Peter Sabaini (peter-sabaini) wrote on 2020-10-28:

#2

During normal operation rabbitmq shouldn't take this long, but we have seen rabbitmq hanging nevertheless.

Relatedly, I've proposed https://review.opendev.org/#/c/757955/5 to have the queue check (which consumes the collect_rabbitmq_stats.sh output) check not only queue size but also stats freshness.

Alvaro Uria (aluria) wrote on 2020-10-28:

#3

Hey Martin. Let me give you some background. Nagios checks use check_nrpe to call the commands configured on each unit (./check_nrpe -H <remote-unit-ip> -c <command> [-a <args>]). The default timeout for check_nrpe is 10s, but sometimes we extend it to 30s (if we find checks take more time, maybe due to network/dns issues). On the other hand, check_nrpe checks are called every 5 minutes (Nagios' default).

There are nrpe checks that may take longer than 10s/30s, and for those cases, we use a couple of configs:
1) set up a cronjob (e.g. in the RMQ units) that runs every 5 minutes and generates an output file (so health updates match the calls Nagios makes to check_nrpe)
2) configure an nrpe check (e.g. in the RMQ units), in /etc/nagios/nrpe.d, which parses the generated output file (which is expected to take less than 10s/30s)

Another reason to use a cronjob plus an nrpe check is when privileges are needed. An example of such a case is check_ceph_health, where "ceph status" needs to be run as root, but check_ceph_health runs as the "nagios" user.

In the case of RMQ, it is debatable whether the check should ever take more than 5 minutes to run, but what should happen, it seems, is:
1) cronjob runs every 5 minutes
2) if another process is still running, then skip the execution (by doing this, we make sure a single process will be running)
3) nrpe check that parses the generated output (this already exists)
4) nrpe check which verifies the timestamp of the generated file (nagios_plugin3.py, part of charm-nrpe, already provides a function to run such a check). By doing this, and verifying that the file's timestamp is not older than e.g. 15 minutes, we monitor that the cronjob in step #1 is not stuck and is continually reporting updates
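
Steps 2 and 4 above could be sketched roughly as follows. This is only an illustration under assumptions, not the charm's code: `try_lock` and `is_stale` are hypothetical names, and the sketch does not reproduce the nagios_plugin3.py helper mentioned above. The 15-minute threshold is the example value from step 4.

```python
import fcntl
import os
import time


def try_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken.

    LOCK_NB makes flock fail immediately instead of waiting, so a new run
    can skip itself while a previous run still holds the lock (step 2).
    """
    fh = open(lock_path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        fh.close()
        return None
    return fh  # the lock is released when this file is closed


def is_stale(path, max_age_s, now=None):
    """Return True if path is missing or its mtime is older than max_age_s (step 4)."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return True
    return now - mtime > max_age_s
```

A cron-driven collector would call `try_lock(...)` first and exit quietly on `None`; the nrpe side would call something like `is_stale(stats_file, 15 * 60)` and alert when it returns True.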

Alternatively, if "-t <timeout>" is used, that value should be configurable via Juju config. Bear in mind the cronjob is fixed to run every 5 minutes (which aligns with the Nagios default). If the timeout is longer than 5 minutes, the cronjob script should skip further runs until 5 minutes later.

In the end, we don't want multiple copies of the same health script running in parallel and exhausting resources. If the script does not return data before <timeout>, the check should alert (probably with the "UNKNOWN" return code [1]).

1. https://nagios-plugins.org/doc/guidelines.html#AEN78

Martin Kalcok (martin-kalcok) wrote on 2020-11-05:

#4

Are there any steps to reproduce this issue? (I mean the self-DOS itself.) After closer inspection, I found that the 'collect_rabbitmq_stats.sh' script is already run with a timeout (by cron) that has a default value of 300s. So it seems unlikely to me that the stats-collecting scripts keep piling up and eventually deplete resources on the unit.

Alvaro Uria (aluria) wrote on 2020-11-05:

#5

This is not an easy bug to reproduce because you need an unresponsive RMQ instance. However, "timeout" should send the KILL (9) signal if we really want to make sure the monitoring process is terminated. Other signals may be caught, so the "timeout" option wouldn't effectively terminate the rabbitmqctl process.

OTOH, I think the collect_rabbitmq_stats.sh script writes into a temporary output file and then moves it to the final destination. If that is correct, the output file would not be updated and the check introduced on bug 1898523 would catch that a problem exists (because the mtime of the file is too old).
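
The write-then-move pattern described here can be sketched as below. This is a minimal illustration, assuming the script behaves as guessed in this comment; `write_stats_atomically` and the `collect` callback are hypothetical names. The point is that a failed or hung collection never touches the final file, so its mtime stays old and a freshness check can notice.

```python
import os
import tempfile


def write_stats_atomically(dest_path, collect):
    """Write collect() output to a temp file, then rename it over dest_path.

    The temp file lives in the same directory as dest_path so that
    os.replace is an atomic rename on the same filesystem.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Note: mkstemp creates the file with mode 0600; a real collector would
    # have to make the result readable by the monitoring user.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".stats-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(collect())
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Collection or rename failed: drop the partial file, keep the old one.
        os.unlink(tmp_path)
        raise
```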

Martin Kalcok (martin-kalcok) wrote on 2020-11-05:

#6

> However, "timeout" should send the KILL (9) signal if we really want to make sure the monitoring process is terminated.

It does that. It has a (default) 300s timeout, after which it sends SIGINT; if the process is still not dead after another 10s, it sends SIGKILL.
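
For readers unfamiliar with this two-stage behaviour, here is a rough Python equivalent of the escalation just described. It is a sketch for illustration only, assuming a POSIX system; the charm relies on the "timeout" utility run from cron, not on this code, and `run_with_escalation` is a hypothetical name. The defaults mirror the 300s and 10s values quoted above.

```python
import signal
import subprocess


def run_with_escalation(cmd, timeout_s=300, grace_s=10):
    """Run cmd; send SIGINT after timeout_s, then SIGKILL after a further grace_s.

    Returns the child's exit status (negative if it was killed by a signal).
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGINT)  # polite request; may be caught or ignored
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        return proc.wait()
```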

Martin Kalcok (martin-kalcok) on 2020-11-05

Changed in charm-rabbitmq-server:
status:	In Progress → Incomplete

Martin Kalcok (martin-kalcok) wrote on 2020-11-05:

#7

There does not seem to be enough information to reproduce this issue or to determine its real cause at the moment.

If this ever occurs again, it would be helpful to see the output of 'ps waxu | grep -i collect_rabbitmq_stats' and maybe also an overview of the resources consumed by the rabbitmq-server process.

collect_rabbitmq_stats.sh locking can timeout causing a monitoring self-DOS
