2016-05-10 17:37:29 |
Swann Croiset |
description |
MOS 8.0 build 589, ElasticSearch from origin/master
Environment:
3 controllers
193 compute (20 of them are also ceph nodes)
3 elasticsearch nodes
3 influxdb nodes
1 infra alerting node (apache2/nagios3)
How to reproduce:
Deploy the environment described above.
Actual result:
* Some service statuses are "UNKNOWN: No data received for at least 130 seconds" (and flap OK -> UNKN -> OK ...)
* The operator receives false alerts
* CPU usage at 100%
* High fork rate (~110/s)
Expected result:
Service statuses stay OK, or at least remain "stable".
Diagnostic:
Apache cannot handle the load: all nodes send their status (AFD) directly to Nagios through CGI, and the aggregator sends the cluster status (GSE).
There are 1109 AFD/GSE statuses, each posted to Apache every 10 seconds: ~111 req/s |
MOS 8.0 build 589, Infrastructure Alerting plugin from origin/master
Environment:
3 controllers
193 compute (20 of them are also ceph nodes)
3 elasticsearch nodes
3 influxdb nodes
1 infra alerting node (apache2/nagios3)
How to reproduce:
Deploy the environment described above.
Actual result:
* Some service statuses are "UNKNOWN: No data received for at least 130 seconds" (and flap OK -> UNKN -> OK ...)
* The operator receives false alerts
* CPU usage at 100%
* High fork rate (~110/s)
Expected result:
Service statuses stay OK, or at least remain "stable".
Diagnostic:
Apache cannot handle the load: all nodes send their status (AFD) directly to Nagios through CGI, and the aggregator sends the cluster status (GSE).
There are 1109 AFD/GSE statuses, each posted to Apache every 10 seconds: ~111 req/s |
|
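The request rate quoted in the diagnostic can be checked with simple arithmetic. The sketch below only reproduces that calculation from the two figures given in the report (1109 statuses, 10-second interval); the function name `request_rate` is illustrative and not part of any plugin. Because classic CGI starts one process per request, a rate of this size would also be consistent with the observed fork rate of ~110/s.

```python
def request_rate(sources: int, interval_s: float) -> float:
    """Average requests per second when each source posts once per interval."""
    return sources / interval_s


# Figures taken from the report: 1109 AFD/GSE statuses, one POST every 10 seconds.
rate = request_rate(1109, 10)
print(f"{rate:.1f} req/s")  # prints "110.9 req/s"
```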