Comment 5 for bug 1552772

Swann Croiset (swann-w) wrote :

Optimal buffer sizes for Nagios outputs
=======================================
An AFD/GSE message cannot be larger than 2KB, and the buffer size must be greater than Heka's max_message_size (currently 256KB). From these two constraints we can compute the required buffer size.

By default, 27 AFDs are configured per controller and 7 per compute/storage node:
 -> 2KB * 27 = 54KB per controller --> 114KB allowing for x2 additional AFDs
 -> 2KB * 7 = 14KB per compute/storage node --> 28KB allowing for x2 additional AFDs

By default, there are 15 global clusters and 6 node clusters: 2KB * 15 * 6 = 180KB --> rounded up to 200KB

The 'theoretical' sizes:
for controllers = 200 + 114 = 314KB
for compute/storage = 260KB (it must exceed max_message_size)

Ideally, we should configure the buffer sizing options per output queue type (AFD, GSE_global, GSE_service).

max_buffer_size = 321536 # 314KB
max_file_size = 266240 # 260KB > max_message_size 256KB
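
The sizing arithmetic above can be reproduced with a short Python sketch. The function names are hypothetical and exist only for illustration; the x2 headroom factor and the 2KB / 256KB limits are taken from this comment. Note that a strict doubling of the controller figure gives 108KB, whereas the figure used above is 114KB, so the byte values below are derived directly from the 314KB and 260KB totals quoted in the comment.

```python
KB = 1024

MAX_AFD_MESSAGE_KB = 2     # upper bound for a single AFD/GSE message
MAX_MESSAGE_SIZE_KB = 256  # Heka max_message_size quoted in the comment


def afd_worst_case_kb(afd_count, growth_factor=2):
    """One maximum-size message per AFD, times a headroom factor (assumed x2)."""
    return MAX_AFD_MESSAGE_KB * afd_count * growth_factor


def gse_worst_case_kb(global_clusters=15, node_clusters=6):
    """Worst case for GSE messages, before rounding up to 200KB."""
    return MAX_AFD_MESSAGE_KB * global_clusters * node_clusters


def kb_to_bytes(size_kb):
    """Convert a size in KB to the byte value used in the configuration."""
    return size_kb * KB


if __name__ == "__main__":
    print("controller AFD, strict x2:", afd_worst_case_kb(27), "KB")
    print("compute/storage AFD, x2:", afd_worst_case_kb(7), "KB")
    print("GSE before rounding:", gse_worst_case_kb(), "KB")
    # The compute/storage AFD figure is far below max_message_size,
    # so the 256KB floor is what drives the 260KB value.
    print("below floor:", afd_worst_case_kb(7) < MAX_MESSAGE_SIZE_KB)
    print("max_buffer_size =", kb_to_bytes(314))
    print("max_file_size =", kb_to_bytes(260))
```

The byte conversions match the two configuration values above (321536 and 266240).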

Conclusion: we can safely reduce the buffer size from 2MB to 500KB

Test results with buffer size 400KB
===================================
400KB is not enough: after 5 minutes of downtime (Apache shut down), the same behavior is observed.
However, the test succeeded with 100 nodes (with a long period of high load on the infra_alerting node while it caught up on the buffered messages).

Recommendations to mitigate this issue in the short term
========================================================
1/ decrease the buffer sizes to 500KB
2/ increase the interval of all AFD filters from 10s to 20s (at least on compute nodes)
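
As a rough illustration of why the second recommendation helps, the sketch below estimates an upper bound on the data emitted by the AFD filters of one node during a downtime. It assumes (this is not stated in the comment) that every filter emits one maximum-size 2KB message at each interval; the function name is hypothetical, and real filters may well emit less.

```python
def emitted_kb(afd_count, downtime_s, interval_s, message_kb=2):
    """Upper bound, in KB, on what afd_count filters emit during downtime_s.

    Assumes one message of message_kb per filter per interval (worst case).
    """
    ticks = downtime_s // interval_s
    return afd_count * ticks * message_kb


if __name__ == "__main__":
    downtime_s = 5 * 60  # the 5 minute downtime used in the test
    for interval_s in (10, 20):
        print(
            "interval %ds: compute/storage <= %dKB, controller <= %dKB"
            % (
                interval_s,
                emitted_kb(7, downtime_s, interval_s),
                emitted_kb(27, downtime_s, interval_s),
            )
        )
```

Under these assumptions, doubling the interval halves the worst-case volume produced while the endpoint is unavailable.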

Long-term solution
==================
This issue will be fixed when the CGI script is replaced by a lightweight interface (nagios-api).