Kibana dashboards unavailable after an ElasticSearch scale up from 1 to 3 nodes

Bug #1552258 reported by Ivan Lozgachev
This bug affects 2 people
Affects     Status        Importance  Assigned to          Milestone
StackLight  Fix Released  Medium      Simon Pasquier
0.9         Won't Fix     High        guillaume thouvenin

Bug Description

MOS 8.0 build 589, ElasticSearch from origin/master

Environment:
3 controllers
194 compute nodes (20 of them are also Ceph nodes)
1 elasticsearch node
3 influxdb nodes

How to reproduce:
1. Go to the Fuel dashboard and add 2 new ElasticSearch nodes.
2. Deploy the changes.
3. Check the status of ElasticSearch.

Actual result:
curl node-189:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open log-2016.03.02 5 0 13575537 0 2.8gb 2.8gb
red open kibana-int 5 2
green open notification-2016.03.02 5 0 43833 0 30.4mb 30.4mb
green open log-2016.03.01 5 0 2432417 0 412mb 412mb

The kibana-int index is "red", which indicates failure.

LMA reports a CRITICAL status for the ES cluster.
Kibana dashboards are unavailable.

Note: logs and notifications indexing is not impacted; the cluster works as expected for these data. Only the kibana-int index is affected.

Swann Croiset (swann-w)
Changed in lma-toolchain:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → LMA-Toolchain Fuel Plugins (mos-lma-toolchain)
tags: added: elasticsearch
tags: added: scale
description: updated
Revision history for this message
Swann Croiset (swann-w) wrote :

Impact:
=====
Kibana dashboards are not available and the Elasticsearch cluster is 'red' (CRITICAL from the LMA point of view).

Analysis:
=======

The cluster health is red because the kibana-int index is in bad shape.

curl node-189:9200/_cluster/health?pretty
{
  "cluster_name" : "lma",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 15,
  "active_shards" : 15,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 15,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

All requests to the kibana-int index respond with this error:

{"error":"NoShardAvailableActionException[[kibana-int][2] null]","status":503}

The detailed status of this index:

    "kibana-int" : {
      "status" : "red",
      "number_of_shards" : 5,
      "number_of_replicas" : 2,
      "active_primary_shards" : 0,
      "active_shards" : 0,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 15,
      "shards" : {
        "0" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 3
        },
        "1" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 3
        },
        "2" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 3
        },
        "3" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 3
        },
        "4" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 3
        }
      }
    }

Workaround
==========
To solve the issue, here are the commands to run against the Elasticsearch VIP:

# disable the auto expand setting
curl -XPUT 192.168.0.4:9200/kibana-int/_settings -d '
{ "index": { "auto_expand_replicas": false } }'

# force the number_of_replicas to zero
curl -XPUT 192.168.0.4:9200/kibana-int/_settings -d '
{ "index": { "number_of_replicas": 0 } }'

# explicitly increase the number_of_replicas to 2
curl -XPUT 192.168.0.4:9200/kibana-int/_settings -d '
{ "index": { "number_of_replicas": 2 } }'

Diagnostic
========

The kibana-int index is configured with the setting index.auto_expand_replicas = "0-all".
see https://github.com/openstack/fuel-plugin-elasticsearch-kibana/blob/stable/0.9/deployment_scripts/puppet/modules/lma_logging_analytics/templates/es_template_kibana.json.erb#L4

This setting adjusts the number_of_replicas of the kibana-int index dynamically, depending on the total number of ES instances within the cluster. This feature h...

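To confirm that an affected index carries this setting, its live settings can be inspected directly (VIP address as in the workaround above):

# show the current settings of the kibana-int index
curl 192.168.0.4:9200/kibana-int/_settings?pretty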

Revision history for this message
Swann Croiset (swann-w) wrote :

This bug will be solved definitively by removing the "auto_expand_replicas" setting and by having the plugin explicitly configure "number_of_replicas" after a scaling operation.

Changed in lma-toolchain:
milestone: none → 1.0.0
description: updated
Swann Croiset (swann-w)
summary: - ElasticSearch failed to scale up from 1 to 3 nodes
+ Kibana dashboards unavailable after an ElasticSearch scale up from 1 to
+ 3 nodes
Swann Croiset (swann-w)
Changed in lma-toolchain:
milestone: 1.0.0 → 0.9.0
no longer affects: lma-toolchain/1.0
Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

So it should be fixed by https://review.openstack.org/#/c/272092/
I will check this.

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

I just tested with the latest stable/0.9 (commit bbf79a6379e6de6e0544802183a461b2b1a48967) and everything is green after scaling up from 1 to 3 ES nodes.

root@node-11:~# curl 10.109.6.9:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open log-2016.06.08 5 0 797919 0 179.8mb 179.8mb
green open kibana-int 5 2 2 0 78.4kb 26.1kb
green open log-2016.06.07 5 0 981046 0 196.8mb 196.8mb
green open notification-2016.06.08 5 0 360 0 620.2kb 620.2kb
green open notification-2016.06.07 5 0 431 0 549.8kb 549.8kb

And

root@node-11:~# curl 10.109.6.9:9200/_cluster/health?pretty=true
{
  "cluster_name" : "lma",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 25,
  "active_shards" : 35,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

So I'm closing the bug; if it happens again, we will reopen it.

Changed in lma-toolchain:
status: Confirmed → Won't Fix
status: Won't Fix → Fix Committed
milestone: 0.9.0 → 0.10.0
assignee: LMA-Toolchain Fuel Plugins (mos-lma-toolchain) → guillaume thouvenin (guillaume-thouvenin)
Changed in lma-toolchain:
status: Fix Committed → Fix Released
Revision history for this message
Swann Croiset (swann-w) wrote :

I've reproduced this issue with 0.10.
The workaround in comment #1 is still valid.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-elasticsearch-kibana (master)

Fix proposed to branch: master
Review: https://review.openstack.org/417955

Changed in lma-toolchain:
assignee: nobody → Simon Pasquier (simon-pasquier)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-elasticsearch-kibana (master)

Reviewed: https://review.openstack.org/417955
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-elasticsearch-kibana/commit/?id=1c35a4eda8d1a10411835bd156588324b4e1140a
Submitter: Jenkins
Branch: master

commit 1c35a4eda8d1a10411835bd156588324b4e1140a
Author: Simon Pasquier <email address hidden>
Date: Mon Jan 9 17:20:20 2017 +0100

    Get rid of auto_expand_replicas for Kibana indices

    Setting auto_expand_replicas to "0-all" can make the Kibana indices
    unavailable when scaling up the Elasticsearch cluster. As a consequence,
    the Kibana service is unavailable and the operator needs to manually fix
    the problem. This change applies the same replication settings as for
    the log and notification indices leading to more predictable behavior.
    It also enforces the replication settings in the provision_services.pp
    manifest to deal with scale-down and scale-up operations.

    Change-Id: I8979f3d006ccd908711cbe0862032dc7b73d9b62
    Closes-Bug: #1552258
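
For illustration only, the change amounts to pinning an explicit replica count on the Kibana index template instead of relying on auto_expand_replicas; a rough sketch of the equivalent API call (template name, address and replica count are assumptions, not the plugin's actual values):

# hypothetical sketch: register a template with a fixed number_of_replicas
curl -XPUT 192.168.0.4:9200/_template/kibana -d '
{ "template": "kibana-int", "settings": { "index": { "number_of_replicas": 1 } } }'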

Changed in lma-toolchain:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-elasticsearch-kibana (stable/1.0)

Fix proposed to branch: stable/1.0
Review: https://review.openstack.org/419940

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-elasticsearch-kibana (stable/1.0)

Reviewed: https://review.openstack.org/419940
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-elasticsearch-kibana/commit/?id=5a65cb7d331ab26735acc531f8f9b568fee55470
Submitter: Jenkins
Branch: stable/1.0

commit 5a65cb7d331ab26735acc531f8f9b568fee55470
Author: Simon Pasquier <email address hidden>
Date: Mon Jan 9 17:20:20 2017 +0100

    Get rid of auto_expand_replicas for Kibana indices

    Setting auto_expand_replicas to "0-all" can make the Kibana indices
    unavailable when scaling up the Elasticsearch cluster. As a consequence,
    the Kibana service is unavailable and the operator needs to manually fix
    the problem. This change applies the same replication settings as for
    the log and notification indices leading to more predictable behavior.
    It also enforces the replication settings in the provision_services.pp
    manifest to deal with scale-down and scale-up operations.

    Change-Id: I8979f3d006ccd908711cbe0862032dc7b73d9b62
    Closes-Bug: #1552258
    (cherry picked from commit 1c35a4eda8d1a10411835bd156588324b4e1140a)

no longer affects: lma-toolchain/1.0
Changed in lma-toolchain:
status: Fix Committed → Fix Released