Logs and notifications are dropped during a "long" Elasticsearch outage

Bug #1566748 reported by Simon Pasquier
This bug affects 1 person
Affects: StackLight
Status: Fix Released
Importance: High
Assigned to: Swann Croiset

Bug Description

The current buffering policy for the Heka output plugins is 'drop'. As a result, when the Elasticsearch server is down for a relatively long time, the Elasticsearch output plugin fills its local queue (the limit is 1G) and then starts dropping the collected logs and notifications.
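
To illustrate the difference between the 'drop' policy described above and a 'block' policy, here is a minimal Python sketch. It is a toy model, not Heka code: the `BoundedBuffer` class, the `offer` method and the `simulate_outage` helper are hypothetical names, and the buffer counts messages rather than bytes.

```python
from collections import deque


class BoundedBuffer:
    """Toy model of an output buffer with a size limit (counts messages, not bytes)."""

    def __init__(self, max_size, full_action):
        if full_action not in ("drop", "block"):
            raise ValueError("full_action must be 'drop' or 'block'")
        self.max_size = max_size
        self.full_action = full_action
        self.queue = deque()
        self.dropped = 0

    def offer(self, message):
        """Return True if the producer may move on, False if it must wait and retry."""
        if len(self.queue) < self.max_size:
            self.queue.append(message)
            return True
        if self.full_action == "drop":
            self.dropped += 1  # the message is lost for good
            return True        # the producer carries on with the next message
        return False           # 'block': the producer keeps the message and waits


def simulate_outage(full_action, produced=10, max_size=4):
    """Feed `produced` messages while the backend is down (nothing drains the queue)."""
    buf = BoundedBuffer(max_size, full_action)
    pending = list(range(produced))
    while pending and buf.offer(pending[0]):
        pending.pop(0)
    # (messages buffered, messages lost, messages still held by the producer)
    return len(buf.queue), buf.dropped, len(pending)


print(simulate_outage("drop"))   # (4, 6, 0)
print(simulate_outage("block"))  # (4, 0, 6)
```

With 'drop', everything beyond the buffer limit is lost during the outage. With 'block', nothing is lost, but the producer stalls until the backend comes back.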

Changed in lma-toolchain:
milestone: 1.0.0 → 0.10.0
Changed in lma-toolchain:
assignee: LMA-Toolchain Fuel Plugins (mos-lma-toolchain) → Swann Croiset (swann-w)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote: Fix merged to fuel-plugin-lma-collector (master)

Reviewed: https://review.openstack.org/300447
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=ebac150f8a0f3bb6e13c6759ad7c4ddaf2fad226
Submitter: Jenkins
Branch: master

commit ebac150f8a0f3bb6e13c6759ad7c4ddaf2fad226
Author: Swann Croiset <email address hidden>
Date: Sun Mar 27 22:46:52 2016 +0200

    Separate the (L)og of the LMA collector

    This change separates the processing of the logs/notifications and
    metric/alerting into 2 dedicated hekad processes, these services are
    named 'log_collector' and 'metric_collector'.

    Both services are managed by Pacemaker on controller nodes and by Upstart on
    other nodes.

    All metrics computed by log_collector (HTTP response times and creation time
    for instances and volumes) are sent directly to the metric_collector via TCP.
    Elasticsearch output (log_collector) uses full_action='block' and the
    TCP output uses full_action='drop'.

    All outputs of metric_collector (InfluxDB, HTTP and TCP) use
    full_action='drop'.

    The buffer size configurations are:
    * metric_collector:
      - influxdb-output buffer size is increased to 1Gb.
      - aggregator-output (tcp) buffer size is decreased to 256Mb (vs 1Gb).
      - nagios outputs (x3) buffer size are decreased to 1Mb.
    * log_collector:
      - elasticsearch-output buffer size is decreased to 256Mb (vs 1Gb).
      - tcp-output buffer size is set to 256Mb.

    Implements: blueprint separate-lma-collector-pipelines
    Fixes-bug: #1566748

    Change-Id: Ieadb93b89f81e944e21cf8e5a65f4d683fd0ffb8
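
The buffer settings listed in the commit message above can be summarized as data to estimate the worst-case buffer footprint of each service. This is only a sketch: the `BUFFERS` table and the `worst_case_bytes` helper are hypothetical names, and it assumes that "Gb" and "Mb" in the commit message mean GiB and MiB.

```python
MIB = 1024 * 1024  # assumption: "Mb" in the commit message means MiB

# (output name, buffer size in bytes, full_action, number of outputs)
BUFFERS = {
    "metric_collector": [
        ("influxdb-output", 1024 * MIB, "drop", 1),
        ("aggregator-output (tcp)", 256 * MIB, "drop", 1),
        ("nagios output", 1 * MIB, "drop", 3),
    ],
    "log_collector": [
        ("elasticsearch-output", 256 * MIB, "block", 1),
        ("tcp-output", 256 * MIB, "drop", 1),
    ],
}


def worst_case_bytes(service):
    """Sum the buffer limits of every output of a service."""
    return sum(size * count for _, size, _, count in BUFFERS[service])


for service in BUFFERS:
    print(service, worst_case_bytes(service) // MIB, "MiB")
```

Under these assumptions the metric_collector buffers can grow to 1283 MiB in total and the log_collector buffers to 512 MiB. Only the Elasticsearch output blocks when its buffer is full; every other output drops.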

Changed in lma-toolchain:
status: In Progress → Fix Committed
Changed in lma-toolchain:
status: Fix Committed → Fix Released