LMA can't report service status after running for a while

Bug #1531509 reported by Yaguang Tang
This bug affects 4 people
Affects             Status         Importance  Assigned to                  Milestone
Fuel for OpenStack  Fix Committed  High        LMA-Toolchain Fuel Plugins
StackLight          Fix Committed  Medium      LMA-Toolchain Fuel Plugins

Bug Description

MOS version 7.0

Plugin info and versions:

id | name                        | version | package_version
---|-----------------------------|---------|----------------
 5 | openbook                    | 1.1.0   | 3.0.0
 6 | lma_collector               | 0.8.0   | 2.0.0
 7 | influxdb_grafana            | 0.8.0   | 3.0.0
 8 | elasticsearch_kibana        | 0.8.0   | 3.0.0
 9 | lma_infrastructure_alerting | 0.8.0   | 3.0.0

The environment has 10 nodes: 3 controllers and 1 node dedicated to LMA. After a few days, the Grafana dashboard shows no data for service status.
The collectd log shows:

[2016-01-06 21:37:26] haproxy: Mapping missing for "murano-api"
[2016-01-06 21:37:26] haproxy: Mapping missing for "murano-api"
[2016-01-06 21:37:26] haproxy: Mapping missing for "murano_rabbitmq"
[2016-01-06 21:37:26] haproxy: Mapping missing for "murano_rabbitmq"
[2016-01-06 21:37:26] haproxy: Mapping missing for "murano_rabbitmq"
[2016-01-06 21:37:26] haproxy: Mapping missing for "murano_rabbitmq"
[2016-01-06 21:37:26] haproxy: Mapping missing for "murano_rabbitmq"
[2016-01-06 21:37:26] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2016-01-06 21:37:32] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2016-01-06 21:37:34] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
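
For context: the "Low water mark reached" message is emitted by collectd when its internal write queue grows past the configured thresholds because a write plugin (here write_http, which posts metrics to the local hekad) cannot drain the queue fast enough, so collectd starts dropping metrics. A quick way to inspect the configured thresholds, assuming the stock Ubuntu config location (the config may be split across included files):

grep -ri WriteQueueLimit /etc/collectd/
# Typical output (values are illustrative, not taken from this deployment):
#   WriteQueueLimitHigh 10000
#   WriteQueueLimitLow   8000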

Tags: area-plugins
Yaguang Tang (heut2008)
summary: - LMA can't report service status of running for a while
+ LMA can't report service status after running for a while
Artem Roma (aroma-x)
tags: added: area-plugins
Changed in fuel:
milestone: none → 7.0-updates
assignee: nobody → Fuel Plugins Bugs (fuel-plugins-bugs)
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Simon Pasquier (simon-pasquier) wrote :

Could you attach the following log files for all the nodes?
- /var/log/lma_collector.log
- /var/log/upstart/lma_collector.log
- /var/log/collectd.log
Thanks!

Changed in fuel:
assignee: Fuel Plugins Bugs (fuel-plugins-bugs) → LMA-Toolchain Fuel Plugins (mos-lma-toolchain)
status: Confirmed → Incomplete
Revision history for this message
Yaguang Tang (heut2008) wrote :

The collectd log shows:

[2016-01-13 17:33:52] haproxy: Mapping missing for "murano-api"
[2016-01-13 17:33:52] haproxy: Mapping missing for "murano_rabbitmq"
[2016-01-13 17:33:52] haproxy: Mapping missing for "murano_rabbitmq"
[2016-01-13 17:33:52] haproxy: Mapping missing for "murano_rabbitmq"
[2016-01-13 17:33:52] haproxy: Mapping missing for "murano_rabbitmq"
[2016-01-13 17:33:52] haproxy: Mapping missing for "murano_rabbitmq"
[2016-01-13 17:33:52] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2016-01-13 17:33:55] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2016-01-13 17:34:01] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2016-01-13 17:34:02] haproxy: Mapping missing for "radosgw"

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

According to lma_collector.log, the last entries show that metrics cannot be sent to the LMA aggregator: the TCP input plugin for the aggregator cannot bind to its port because the port is already in use. It looks like another process is already holding that port.

Have you tried stopping LMA (crm resource stop lma_collector on the controller), checking that no hekad process is still running, and then starting the collector again (crm resource start lma_collector)? Please also attach the /var/log/lma_collector.log file.
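
For reference, a minimal sketch of that check-and-restart sequence; the aggregator input port shown (5565) is an assumption, so substitute the port reported in lma_collector.log:

# See which process is holding the suspected port (5565 is an assumption)
ss -lntp | grep 5565
# Stop the collector via Pacemaker and verify no hekad process is left over
crm resource stop lma_collector
pgrep -a hekad
# Start it again and watch the log for bind errors
crm resource start lma_collector
tail -f /var/log/lma_collector.log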

So I am setting this to Incomplete.

Changed in lma-toolchain:
status: New → Incomplete
importance: Undecided → Medium
assignee: nobody → LMA-Toolchain Fuel Plugins (mos-lma-toolchain)
milestone: none → 0.8.0
Revision history for this message
Mike Nguyen (moozoo) wrote :

I'm getting the same issue at the moment on my deployment.

Also on MOS 7 and LMA 0.8.0.

A bit more concerning: this controller currently holds vip__management/vip__vrouter_pub/vip__vrouter, and I'm getting the same error messages. I've tried killing off everything related to start things off "fresh", but the error just keeps coming back:

- kill -9 collectd (because trying to kill it nicely just does not work)

- crm resource stop lma_collector (make sure hekad has really stopped, as it sometimes takes a minute or two)

- crm resource start lma_collector

At this point everything seems fine: LMA is running and there are no particular error messages, even after letting it run for 2 minutes.

- /etc/init.d/collectd start

And in /var/log/lma_collector.log, I'll start seeing this:

2016/02/23 21:34:41 Decoder 'collectd_httplisten-collectd_decoder' error: FATAL: process_message() /usr/share/lma_collector/decoders/collectd.lua:35: bad argument #1 to 'gsub' (string expected, got nil)
2016/02/23 21:34:41 Shutdown initiated.

Hekad proceeds to shut down. Pacemaker then starts it back up.

But again, that error message comes back:

2016/02/23 21:36:20 Decoder 'collectd_httplisten-collectd_decoder' error: FATAL: process_message() /usr/share/lma_collector/decoders/collectd.lua:35: bad argument #1 to 'gsub' (string expected, got nil)
2016/02/23 21:36:20 Shutdown initiated.

And the loop keeps going...

Of course, in /var/log/collectd.log, at this point, I'm going to start seeing:

[2016-02-23 21:41:27] write_http plugin: curl_easy_perform failed with status 7: Failed to connect to 127.0.0.1 port 8325: Connection refused
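
That write_http error is consistent with the crash loop described above: collectd posts its metrics to hekad's HTTP listener on 127.0.0.1:8325, so while hekad is down between Pacemaker restarts the connection is refused. A quick, non-intrusive way to check whether the listener is up (ss from iproute2; netstat -lntp works the same way):

# Look for a LISTEN socket on hekad's collectd input port
ss -lntp | grep 8325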

Revision history for this message
Max Mazur (mmaxur) wrote :

I faced the same issue and did a small investigation.

It looks like, at least in my case, the firewall on the Elasticsearch node was not configured correctly for some reason.
I was not able to connect to Elasticsearch --> heka was not able to deliver messages --> collectd was not able to send metrics.
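
A quick way to verify that first link in the chain from a controller node; the Elasticsearch address is a placeholder to substitute, and port 9200 matches the "100 elasticsearch" rule in the iptables diff below:

# Returns the Elasticsearch banner JSON when the firewall is open;
# times out or is rejected when the INPUT rule is missing
curl -s http://<elasticsearch_node>:9200/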

I re-applied firewall.pp manually:

cd /etc/fuel/plugins/elasticsearch_kibana-0.8/ && puppet apply -d -v --modulepath=puppet/modules:/etc/puppet/modules puppet/manifests/firewall.pp

Diff between iptables config before and after:

diff /root/iptables-save /root/iptables-save-1
1c1
< # Generated by iptables-save v1.4.21 on Mon Feb 29 17:32:47 2016
---
> # Generated by iptables-save v1.4.21 on Mon Feb 29 17:34:53 2016
4,5c4,5
< :FORWARD ACCEPT [19:3138]
< :OUTPUT ACCEPT [36633:56370787]
---
> :FORWARD ACCEPT [1:68]
> :OUTPUT ACCEPT [256:179808]
10a11,12
> -A INPUT -p tcp -m multiport --ports 9200 -m comment --comment "100 elasticsearch" -j ACCEPT
> -A INPUT -p tcp -m multiport --ports 80 -m comment --comment "101 kibana" -j ACCEPT
16c18
< # Completed on Mon Feb 29 17:32:47 2016
---
> # Completed on Mon Feb 29 17:34:53 2016

After that I restarted collectd (kill -9 to stop it) and now it looks OK.

So at first view it looks like a Puppet-related issue, not heka or collectd.

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

@moozoo: The problem you are facing has been solved here -> https://bugs.launchpad.net/lma-toolchain/+bug/1517053

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

@Max: I have already seen this problem with the firewall. It is really strange because it happens "randomly" and I never see any errors in puppet.log, so I don't know what the conditions for the failure are. It needs more investigation.

Changed in fuel:
status: Incomplete → Fix Committed
Changed in lma-toolchain:
status: Incomplete → Fix Committed