nagios failure "check-graylog-health" fails with Indexer failures 69000

Bug #1999080 reported by Alexander Balderson
This bug affects 1 person
Affects: Graylog Charm
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

Running against the latest/stable graylog charm and the graylog 3/stable snap channel, SQA had all the nagios units fail the "check-graylog-health" check with "Indexer failures: 69000".

We run elasticsearch with the setting `auto-create-index: .watches,.triggered_watches,.watcher-history-*`.

Elasticsearch is on a system with 12G of memory, 2 CPUs, and a 50G disk; graylog is on a system with 8G of memory, 2 CPUs, and a 40G disk.

Nothing else stands out in the elasticsearch or graylog logs that would explain the problem.
The outputs from nagios are:

- check: check-graylog-health
  id: '124'
  results:
    check-output: 'CRITICAL: Indexer failures: 69000

      OK: Indexer cluster health: green; Journal uncommitted messages: 3; Outstanding
      notifications: 0'
    return-code: 0
  status: completed
  timing:
    completed: 2022-12-07 07:59:37 +0000 UTC
    enqueued: 2022-12-07 07:59:36 +0000 UTC
    started: 2022-12-07 07:59:37 +0000 UTC
  unit: nrpe/7
- check: check-graylog-health
  id: '126'
  results:
    check-output: 'CRITICAL: Indexer failures: 69000

      OK: Indexer cluster health: green; Journal uncommitted messages: 0; Outstanding
      notifications: 0'
    return-code: 0
  status: completed
  timing:
    completed: 2022-12-07 07:59:37 +0000 UTC
    enqueued: 2022-12-07 07:59:36 +0000 UTC
    started: 2022-12-07 07:59:37 +0000 UTC
  unit: nrpe/6
- check: check-graylog-health
  id: '121'
  results:
    check-output: 'CRITICAL: Indexer failures: 69000

      OK: Indexer cluster health: green; Journal uncommitted messages: 103; Outstanding
      notifications: 0'
    return-code: 0
  status: completed
  timing:
    completed: 2022-12-07 07:59:37 +0000 UTC
    enqueued: 2022-12-07 07:59:36 +0000 UTC
    started: 2022-12-07 07:59:37 +0000 UTC
  unit: nrpe/1
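
To compare the units side by side, the check output can be split into name/value pairs. The following is only a rough sketch: `parse_check_output` and `SAMPLE` are hypothetical names, not part of the charm, and the sample string assumes the folded YAML scalar above has already been decoded into two lines.

```python
import re

# Decoded check-output of nrpe/7 as shown above (assumed form after YAML parsing).
SAMPLE = (
    "CRITICAL: Indexer failures: 69000\n"
    "OK: Indexer cluster health: green; "
    "Journal uncommitted messages: 3; Outstanding notifications: 0"
)

def parse_check_output(text):
    """Extract 'name: value' pairs from a check-graylog-health output string."""
    fields = {}
    for line in text.splitlines():
        # Drop the leading severity label (OK, WARNING, CRITICAL, UNKNOWN).
        line = re.sub(r"^\s*(OK|WARNING|CRITICAL|UNKNOWN):\s*", "", line)
        # The remaining entries are separated by ';' and formatted 'name: value'.
        for part in line.split(";"):
            name, sep, value = part.partition(":")
            if sep:
                fields[name.strip()] = value.strip()
    return fields

if __name__ == "__main__":
    for name, value in parse_check_output(SAMPLE).items():
        print(f"{name} = {value}")
```

Values are kept as strings, since "Indexer cluster health" is not numeric.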

testrun can be found at:
https://solutions.qa.canonical.com/v2/testruns/6e60d071-7912-492a-8784-3c53d7d532ce/
with bundle at:
https://oil-jenkins.canonical.com/artifacts/6e60d071-7912-492a-8784-3c53d7d532ce/generated/generated/lma-openstack/bundle.yaml
and crashdump at:
https://oil-jenkins.canonical.com/artifacts/6e60d071-7912-492a-8784-3c53d7d532ce/generated/generated/lma-maas/juju-crashdump-lma-maas-2022-12-07-08.01.03.tar.gz

summary: - nagios failure "check-graylog-health" fails with too many uncommited
- journal messages
+ nagios failure "check-graylog-health" fails with Indexer failures 69000
Eric Chen (eric-chen)
tags: added: bseng-910
Andy Wu (qch2012) wrote:

Hit the same issue during a ps6 deployment.

Eric Chen (eric-chen)
Changed in charm-graylog:
importance: Undecided → Medium
Eric Chen (eric-chen)
Changed in charm-graylog:
importance: Medium → High
Chi Wai CHAN (raychan96) wrote :

I tried to reproduce the issue with the bundle you provided (plus nagios and nrpe), but I did not see any "Indexer failures" in my test environment. I checked the custom nagios check that produces this alert, "check_graylog_health.py": https://git.launchpad.net/charm-graylog/tree/src/files/check_graylog_health.py. The related CRITICAL message is pretty generic: it only tells you that more than 10 indexer operations have failed in the past, but not what the errors were or whether the failed operations were critical. Still, 69000 failed indexer operations is unusual.

Given that you mentioned nothing stood out in the elastic or graylog logs, I can only assume that the operations were unimportant or were recovered later. More information to help troubleshoot this issue can be found in graylog's web UI under the "System > Overview" tab.
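
For illustration, the threshold behaviour described above could look roughly like the sketch below. This is not the code from check_graylog_health.py: the function name and the `crit_threshold` parameter are hypothetical, the threshold of 10 is taken from the description in this comment, and the return codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
# Hypothetical sketch of the threshold logic; not the charm's actual check.
OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

def classify_indexer_failures(count, crit_threshold=10):
    """Return (exit_code, message) for a given indexer failure count.

    The message carries only the count, which is why the alert by itself
    does not say which operations failed or why.
    """
    if count > crit_threshold:
        return CRITICAL, f"CRITICAL: Indexer failures: {count}"
    return OK, f"OK: Indexer failures: {count}"

if __name__ == "__main__":
    code, message = classify_indexer_failures(69000)
    print(code, message)
```

With a check like this, any count above the threshold yields the same CRITICAL line, so the failure details still have to come from graylog itself.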

Eric Chen (eric-chen)
Changed in charm-graylog:
importance: High → Medium
Eric Chen (eric-chen) wrote :

This charm is in maintenance mode. Only critical bugs will be handled.
Please consider using the new Canonical Observability Stack instead. (https://charmhub.io/topics/canonical-observability-stack)

Changed in charm-graylog:
status: New → Won't Fix