No Metrics in InfluxDB or Grafana for Certain Nodes (MOS 9)

Bug #1626762 reported by Matt Jansen
This bug affects 1 person
Affects: StackLight
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I have run into this issue multiple times and have not been able to find a solution. I have tried the following:

Restarting metric_collector on the broken nodes via crm or the upstart restart command (see the command sketch after this list).
Running "crm resource cleanup metric_collector" on the controllers.
Force quitting collectd and hekad and restarting metric_collector.
Completely deleting and rebuilding my Fuel environment from scratch.
Running the hiera and post-deployment tasks as detailed in the LMA collector user guide.
Moving the StackLight node to different physical servers.
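
For reference, a rough sketch of the commands behind the first two items (resource and service names as used on a MOS 9 controller; the upstart line for non-controller nodes is an assumption):

    # On a controller, the collector is managed by Pacemaker
    crm resource status metric_collector     # check the resource state
    crm resource restart metric_collector    # restart it
    crm resource cleanup metric_collector    # clear any failed actions
    # On non-controller nodes (upstart), assuming the same job name
    restart metric_collector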

My environment consists of 14 nodes: 3 controllers, 7 compute, 3 Ceph, and 1 StackLight. All of the compute nodes except one, plus the StackLight node, show metric data in Grafana. The controllers and Ceph nodes DO NOT show any data in Grafana. If I manually run the queries in InfluxDB, no data is returned. As far as I can tell, all the expected collector processes (hekad, collectd, metric_collector) are running on the nodes that have no data.
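
For illustration, a query of this form (the database name lma, the measurement cpu_idle, and the hostname node-1 are assumptions based on a default StackLight deployment) comes back empty for the affected controllers and Ceph nodes:

    influx -database lma -execute \
      "SELECT * FROM cpu_idle WHERE hostname = 'node-1' AND time > now() - 10m"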

It appears that metrics for the broken nodes are shown, at least partly, during deployment; however, once deployment completes, the metrics no longer appear. The screenshots show this as well.

I have attached the LMA diagnostic snapshot for all nodes. I have also placed a link below with screenshots of my Grafana dashboard. Please let me know if you need any more information; this is currently running in a DEV environment.

Screenshots: http://imgur.com/a/mBEiu

Thanks!

Revision history for this message
Swann Croiset (swann-w) wrote :

Thanks for the report.

Looks like InfluxDB cannot handle the load (probably due to a "slow" disk):
* disks are busy (sdb and sdc)
* the collectors are buffering metrics (/var/cache/metric_collector/output_queues/influxdb_output)

IIRC, Elasticsearch is using the same disk, which doesn't help.
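
A rough way to confirm both symptoms (the device names and the cache path are the ones mentioned above):

    # On the InfluxDB/StackLight node: %util near 100 on sdb/sdc means the disks are saturated
    iostat -dx sdb sdc 5
    # On the affected nodes: a growing backlog means hekad cannot flush metrics fast enough
    watch -n 30 du -sh /var/cache/metric_collector/output_queues/influxdb_output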

I'd recommend one of the following:
* ensure that Elasticsearch and InfluxDB do not use the same disk
* OR use an SSD for InfluxDB
* OR enable the InfluxDB option: "Store WAL files in memory"
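
A quick check before picking between the first two options (the data directory paths below are assumptions; substitute the directories your plugins are actually configured with):

    # See which block device each data directory lives on
    df -h /var/lib/influxdb /opt/es-data
    lsblk -o NAME,SIZE,ROTA,MOUNTPOINT   # ROTA=0 indicates an SSD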

Revision history for this message
Soeren Schatz (soerens) wrote :

I have the same problem in my MOS 9 environment, but I don't think it is related to an I/O problem on the InfluxDB node.
I have an environment with 3 controllers, 6 computes, 1 alerting node and 1 InfluxDB node.
The alerting and InfluxDB nodes are located on a KVM host with SSDs. The rest of the environment is bare metal.
The I/O on all nodes looks fine. For the compute nodes I get accurate data without lag, even when I run several virtual machines on the cloud. But all services, and the metrics of the controllers themselves, lag. I reviewed the I/O (network and disk) on the controllers and it seems to be fine. The cache files of the metric collector don't grow (the whole folder stays at a constant level below 100MB) and they hold data from the last 2 minutes. I noticed that the controllers (especially the primary) have a high CPU load caused by hekad.
Maybe version 1.0 of the toolchain has too many metrics for my controllers to handle, but I can't figure out where the bottleneck is. On version 0.10 everything works fine, but we would like to use version 1.0 because of the extended metrics.
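
For reference, a sketch of how the hekad CPU load and the collector backlog described above can be watched on a controller (the cache path is the one mentioned earlier in this bug):

    # CPU usage of the hekad processes
    top -b -n 1 -p "$(pgrep -d, hekad)"
    # Size of the buffered output queues (stays below 100MB in this case)
    du -sh /var/cache/metric_collector/output_queues/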
