No data on the Grafana and Kibana dashboard after about a month of properly running

Bug #1545414 reported by Hamza
This bug affects 2 people
Affects: StackLight
Status: Incomplete
Importance: Low
Assigned to: LMA-Toolchain Fuel Plugins

Bug Description

Hi,

We deployed Fuel 7.0 along with the LMA toolchain. Everything worked properly for about a month, then suddenly no data is shown on either the Grafana or the Kibana dashboard.

The lma_collector service is running on all the nodes. Below are the logs picked up from the monitoring node and the controller nodes:

1. Monitoring node:

root@node-112:~# tail /var/log/lma_collector.log
2016/02/14 07:17:55 Plugin 'aggregator_tcpoutput' error: writing to 192.168.0.2:5565: write tcp 192.168.0.2:5565: broken pipe
2016/02/14 07:17:55 Plugin 'elasticsearch_output' error: ElasticSearch server reported error within JSON: {"took":60036,"errors":true,"items":[{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}}]}
2016/02/14 07:18:55 Plugin 'elasticsearch_output' error: ElasticSearch server reported error within JSON: {"took":60001,"errors":true,"items":[{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}}]}
2016/02/14 07:18:55 Plugin 'aggregator_tcpoutput' error: writing to 192.168.0.2:5565: write tcp 192.168.0.2:5565: broken pipe
2016/02/14 07:19:55 Plugin 'elasticsearch_output' error: ElasticSearch server reported error within JSON: {"took":60002,"errors":true,"items":[{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}}]}
2016/02/14 07:20:55 Plugin 'elasticsearch_output' error: ElasticSearch server reported error within JSON: {"took":60001,"errors":true,"items":[{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}}]}
2016/02/14 07:20:55 Plugin 'aggregator_tcpoutput' error: writing to 192.168.0.2:5565: write tcp 192.168.0.2:5565: broken pipe
2016/02/14 07:21:55 Plugin 'elasticsearch_output' error: ElasticSearch server reported error within JSON: {"took":60001,"errors":true,"items":[{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}}]}
2016/02/14 07:22:55 Plugin 'elasticsearch_output' error: ElasticSearch server reported error within JSON: {"took":60001,"errors":true,"items":[{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}}]}
2016/02/14 07:23:55 Plugin 'elasticsearch_output' error: ElasticSearch server reported error within JSON: {"took":60001,"errors":true,"items":[{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}},{"index":{"_index":"log-2016.02.14","_type":"message","_id":null,"status":503,"error":"ProcessClusterEventTimeoutException[failed to process cluster event (create-index [log-2016.02.14], cause [auto(bulk api)]) within 1m]"}}]}
2016/02/14 07:23:55 Plugin 'aggregator_tcpoutput' error: writing to 192.168.0.2:5565: write tcp 192.168.0.2:5565: broken pipe
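
The repeated "ProcessClusterEventTimeoutException" with status 503 means Elasticsearch could not create the daily log-2016.02.14 index within one minute, which usually points to an overloaded or stuck Elasticsearch master. A quick way to check the cluster health and its backlog of pending cluster-state tasks from the monitoring node, assuming Elasticsearch listens on the default port 9200 (the LMA plugin may bind it to the management address instead):

root@node-112:~# curl -s 'http://localhost:9200/_cluster/health?pretty'
root@node-112:~# curl -s 'http://localhost:9200/_cat/pending_tasks?v'
root@node-112:~# curl -s 'http://localhost:9200/_cat/indices?v' | wc -l

A red cluster status, a long pending-task list, or a very large number of old log-* indices that are never cleaned up can all make index creation slow enough to hit this one-minute timeout.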

2. Primary controller node:
root@node-110:~# crm resource status lma_collector
resource lma_collector is running on: node-110.cdta.net
root@node-110:~# tail /var/log/lma_collector.log
2016/02/14 08:44:26
2016/02/14 08:44:56 Diagnostics: 9 packs have been idle more than 120 seconds.
2016/02/14 08:44:56 Diagnostics: (inject) Plugin names and quantities found on idle packs:
2016/02/14 08:44:56 Diagnostics: 39 packs have been idle more than 120 seconds.
2016/02/14 08:44:56 Diagnostics: (input) Plugin names and quantities found on idle packs:
2016/02/14 08:44:56 Diagnostics: influxdb_accumulator_filter: 9
2016/02/14 08:44:56
2016/02/14 08:44:56 Diagnostics: http_metrics_filter: 20
2016/02/14 08:44:56 Diagnostics: elasticsearch_output: 32
2016/02/14 08:44:56

Please help.

Thank you,
Hamza

Revision history for this message
Hamza (h16mara) wrote :

Hi,

I managed to get the data back on both dashboards by restarting the hekad processes on all the nodes:

/etc/init.d/heka stop
/etc/init.d/heka start
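
On the controller nodes the collector is managed by Pacemaker (see the crm resource status output above), so it may be cleaner to restart it through the cluster manager rather than the init script; a minimal sketch, assuming the resource name is lma_collector as shown above:

root@node-110:~# crm resource restart lma_collector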

But I still don't know the root cause of the problem.

Regards,
Hamza

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

The diagnostic messages say that message packs are being fed into output and filter plugins but are never freed. That is not a good situation, but it can happen if Heka is overloaded and cannot consume the messages in its queues. The root cause is hard to pin down because there are many possibilities. Try to see whether there was an unusual load (a stress test, for example) before the problem occurred.
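
A quick way to see whether hekad itself is saturating a CPU when these diagnostics start piling up, assuming the collector process is named hekad:

root@node-110:~# top -b -n 1 -p "$(pgrep -d, -f hekad)"

If hekad is pinned at roughly 100% of one core, the collector is probably unable to keep up with the incoming message volume and the plugin queues will keep growing.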

Changed in lma-toolchain:
status: New → Incomplete
importance: Undecided → Low
assignee: nobody → LMA-Toolchain Fuel Plugins (mos-lma-toolchain)
Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

So I am setting this bug report to Incomplete; if the same situation occurs in the future, don't hesitate to report it again.

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

Also, next time please tell us the version of LMA you are using.

Revision history for this message
Hamza (h16mara) wrote :

Thank you, Guillaume.
It did happen again, but this time only on one of the controller nodes rather than on the whole cluster like before.
As before, I restarted the Heka service and the data came back, but this time I am still getting the message "UNKNOWN: No data received for at least 130 seconds" in the Nagios GUI.

Here is our configuration:
Controllers + Ceph: 3 nodes; computes: 9 nodes; LMA (version 0.8-0.8.0-1): 1 node.
Neutron-GRE; 1 network interface for the Admin network, 3 network interfaces (LACP bond) for the other networks.

We are not running any kind of stress test; we have some Host Aggregates and ratio configuration, but I think this is unrelated.

I can provide any logs you need.

Thank you
Hamza

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

If you have data in Grafana, it means that metrics are being pushed and LMA is at least partially working.

This error means that LMA is not pushing any data to Nagios. So you should check that the Heka outputs for Nagios are started on the controller. In /var/log/lma_collector.log you should see:

2016/02/23 08:47:02 Output started: nagios_afd_nodes_output
2016/02/23 08:47:02 Output started: nagios_gse_global_clusters_output
2016/02/23 08:47:02 Output started: nagios_gse_node_clusters_output

without any errors afterwards.

Then check that Nagios itself is working on the monitoring node by looking at /var/log/nagios3/nagios.log.
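
A minimal way to run both checks, using the paths mentioned above:

# on the controller: the three Nagios outputs should appear as started, with no later errors
root@node-110:~# grep -i nagios /var/log/lma_collector.log

# on the monitoring node: the Nagios log should show recent activity
root@node-112:~# tail -f /var/log/nagios3/nagios.log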

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

I assumed that the message "UNKNOWN: No data received for at least 130 seconds" was for a controller, but if that is not the case you need to check that the Heka Nagios outputs are started on the node that shows this message.

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

And if it is not a controller but a compute node, you need to check /var/log/upstart/lma_collector.log instead.

Revision history for this message
Hamza (h16mara) wrote :

Thank you, Guillaume.
It appears that the Heka Nagios outputs never started in my case; there is no sign of Nagios in the logs:

Controller:
root@node-110:~# grep nagios /var/log/lma_collector.log
root@node-110:~#

Compute:
root@node-115:~# grep nagios /var/log/upstart/lma_collector.log
root@node-115:~#

The monitoring node:
root@node-112:~# tail /var/log/nagios3/nagios.log
[1453728651] Nagios 3.5.1 starting... (PID=37021)
[1453728651] Local time is Mon Jan 25 13:30:51 UTC 2016
[1453728651] LOG VERSION: 2.0
[1453728651] Finished daemonizing... (New PID=37022)
[1453728654] Caught SIGTERM, shutting down...
[1453728654] Successfully shutdown... (PID=37022)

Nagios daemon is running:
root@node-112:~# pgrep -lf nagios
12092 nagios3
root@node-112:~#
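
Note that the last entries in nagios.log above are from January 25 and end with a shutdown, while the running daemon has a different PID. A quick sanity check, just a sketch, to confirm when the running process started and whether it is still writing to that log file:

root@node-112:~# ps -o pid,lstart,cmd -p 12092
root@node-112:~# ls -l /var/log/nagios3/nagios.log

If the log has not been touched since January, the running nagios3 instance may be using a different configuration or logging somewhere else, and that is worth investigating alongside the missing Heka Nagios outputs.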

Revision history for this message
Hamza (h16mara) wrote :

I will continue troubleshooting the issue and let you know if I make progress.
Thank you
Hamza
