Comment 3 for bug 1626762

Revision history for this message
Soeren Schatz (soerens) wrote :

I have the same problem in my MOS 9 environment, but I don't think it is related to an I/O problem on the InfluxDB node.
I have an environment with 3 controllers, 6 computes, 1 alerting node and 1 InfluxDB node.
The alerting and InfluxDB nodes are located on a KVM host with SSDs; the rest of the environment is bare metal.
The I/O on all nodes looks fine. For the compute nodes I get accurate data without lag, even when several virtual machines are running on the cloud. But all services and the metrics of the controllers themselves lag. I reviewed the I/O (network and disk) on the controllers and it seems fine. The cache files of the metric collector don't grow (the whole folder stays at a constant level below 100 MB) and they hold data from the last 2 minutes. I also noticed that the controllers (especially the primary) have a high CPU load caused by hekad.
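For reference, the checks above amounted to roughly the following commands. This is only a sketch; the cache directory path is an assumption and may differ in your deployment:

```shell
# Size of the metric collector cache folder.
# NOTE: /var/cache/lma_collector is an assumed path; adjust for your setup.
CACHE_DIR=/var/cache/lma_collector
du -sh "$CACHE_DIR" 2>/dev/null

# Top CPU consumers on the node (hekad showed up near the top here).
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 10

# CPU usage of the hekad process specifically.
ps -eo pcpu,comm | grep -i hekad
```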
Maybe version 1.0 of the toolchain emits too many metrics for my controllers to handle, but I can't figure out where the bottleneck is. On version 0.10 everything works fine, but we would like to use version 1.0 because of the extended metrics.