Kibana dashboards unavailable after an ElasticSearch scale up from 1 to 3 nodes

Bug #1552258 reported by Ivan Lozgachev
This bug affects 2 people
Affects     Status        Importance  Assigned to          Milestone
StackLight  Fix Released  Medium      Simon Pasquier
0.9         Won't Fix     High        guillaume thouvenin

Bug Description

MOS 8.0 build 589, ElasticSearch from origin/master

Environment:
3 controllers
194 compute nodes (20 of them are also Ceph nodes)
1 elasticsearch node
3 influxdb nodes

How to reproduce:
1. Go to the Fuel dashboard and add 2 new ElasticSearch nodes.
2. Deploy the changes.
3. Check the status of ElasticSearch.

Actual result:
curl node-189:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open log-2016.03.02 5 0 13575537 0 2.8gb 2.8gb
red open kibana-int 5 2
green open notification-2016.03.02 5 0 43833 0 30.4mb 30.4mb
green open log-2016.03.01 5 0 2432417 0 412mb 412mb

The kibana-int index is "red", which indicates failure.

LMA reports a CRITICAL status for the ES cluster.
Kibana dashboards are unavailable.

Note: logs and notifications indexing is not impacted; the cluster works as expected for these data. Only the kibana-int index is affected.

Swann Croiset (swann-w)
Changed in lma-toolchain:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → LMA-Toolchain Fuel Plugins (mos-lma-toolchain)
tags: added: elasticsearch
tags: added: scale
description: updated
Revision history for this message
Swann Croiset (swann-w) wrote :

Impact:
=====
Kibana dashboards are not available and the Elasticsearch cluster is 'red' (CRITICAL from the LMA point of view).

Analysis:
=======

The cluster health is red because the kibana-int index is in bad shape.

curl node-189:9200/_cluster/health?pretty
{
  "cluster_name" : "lma",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 15,
  "active_shards" : 15,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 15,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

All requests to the kibana-int index respond with this error:

{"error":"NoShardAvailableActionException[[kibana-int][2] null]","status":503}

The detailed status of this index:

    "kibana-int" : {
      "status" : "red",
      "number_of_shards" : 5,
      "number_of_replicas" : 2,
      "active_primary_shards" : 0,
      "active_shards" : 0,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 15,
      "shards" : {
        "0" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 3
        },
        "1" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 3
        },
        "2" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 3
        },
        "3" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 3
        },
        "4" : {
          "status" : "red",
          "primary_active" : false,
          "active_shards" : 0,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 3
        }
      }
    }

Workaround
==========
To solve the issue, here are the commands to run against the Elasticsearch VIP:

# disable the auto expand setting
curl -XPUT 192.168.0.4:9200/kibana-int/_settings -d '
{ "index": { "auto_expand_replicas": false } }'

# force the number_of_replicas to zero
curl -XPUT 192.168.0.4:9200/kibana-int/_settings -d '
{ "index": { "number_of_replicas": 0 } }'

# explicitly increase the number_of_replicas to 2
curl -XPUT 192.168.0.4:9200/kibana-int/_settings -d '
{ "index": { "number_of_replicas": 2 } }'

Diagnostic
========

The kibana-int index is configured with the setting index.auto_expand_replicas = "0-all".
see https://github.com/openstack/fuel-plugin-elasticsearch-kibana/blob/stable/0.9/deployment_scripts/puppet/modules/lma_logging_analytics/templates/es_template_kibana.json.erb#L4

This setting adjusts the number_of_replicas of the kibana-int index dynamically, depending on the total number of ES instances within the cluster. This feature h...

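To confirm that an affected index carries this setting, its live settings can be inspected directly (VIP address as in the workaround above):

# show the current settings of the kibana-int index
curl 192.168.0.4:9200/kibana-int/_settings?pretty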

Revision history for this message
Swann Croiset (swann-w) wrote :

This bug will be solved definitively by removing the "auto_expand_replicas" setting and by having the plugin explicitly configure "number_of_replicas" after a scaling operation.

Changed in lma-toolchain:
milestone: none → 1.0.0
description: updated
Swann Croiset (swann-w)
summary: - ElasticSearch failed to scale up from 1 to 3 nodes
+ Kibana dashboards unavailable after an ElasticSearch scale up from 1 to
+ 3 nodes
Swann Croiset (swann-w)
Changed in lma-toolchain:
milestone: 1.0.0 → 0.9.0
no longer affects: lma-toolchain/1.0
Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

So it should be fixed by https://review.openstack.org/#/c/272092/
I will check this.

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

I just tested with the latest stable/0.9 (commit bbf79a6379e6de6e0544802183a461b2b1a48967) and everything is green after scaling up from 1 to 3 ES nodes.

root@node-11:~# curl 10.109.6.9:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open log-2016.06.08 5 0 797919 0 179.8mb 179.8mb
green open kibana-int 5 2 2 0 78.4kb 26.1kb
green open log-2016.06.07 5 0 981046 0 196.8mb 196.8mb
green open notification-2016.06.08 5 0 360 0 620.2kb 620.2kb
green open notification-2016.06.07 5 0 431 0 549.8kb 549.8kb

And

root@node-11:~# curl 10.109.6.9:9200/_cluster/health?pretty=true
{
  "cluster_name" : "lma",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 25,
  "active_shards" : 35,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

Revision history for this message
guillaume thouvenin (guillaume-thouvenin) wrote :

So I'm closing the bug; if it happens again, we will reopen it.

Changed in lma-toolchain:
status: Confirmed → Won't Fix
status: Won't Fix → Fix Committed
milestone: 0.9.0 → 0.10.0
assignee: LMA-Toolchain Fuel Plugins (mos-lma-toolchain) → guillaume thouvenin (guillaume-thouvenin)
Changed in lma-toolchain:
status: Fix Committed → Fix Released
Revision history for this message
Swann Croiset (swann-w) wrote :

I've reproduced this issue with 0.10.
The workaround in comment #1 is still valid.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-elasticsearch-kibana (master)

Fix proposed to branch: master
Review: https://review.openstack.org/417955

Changed in lma-toolchain:
assignee: nobody → Simon Pasquier (simon-pasquier)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-elasticsearch-kibana (master)

Reviewed: https://review.openstack.org/417955
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-elasticsearch-kibana/commit/?id=1c35a4eda8d1a10411835bd156588324b4e1140a
Submitter: Jenkins
Branch: master

commit 1c35a4eda8d1a10411835bd156588324b4e1140a
Author: Simon Pasquier <email address hidden>
Date: Mon Jan 9 17:20:20 2017 +0100

    Get rid of auto_expand_replicas for Kibana indices

    Setting auto_expand_replicas to "0-all" can make the Kibana indices
    unavailable when scaling up the Elasticsearch cluster. As a consequence,
    the Kibana service is unavailable and the operator needs to manually fix
    the problem. This change applies the same replication settings as for
    the log and notification indices leading to more predictable behavior.
    It also enforces the replication settings in the provision_services.pp
    manifest to deal with scale-down and scale-up operations.

    Change-Id: I8979f3d006ccd908711cbe0862032dc7b73d9b62
    Closes-Bug: #1552258
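
For illustration only, the change amounts to pinning an explicit replica count on the Kibana index template instead of relying on auto_expand_replicas; a rough sketch of the equivalent API call (template name, address and replica count are assumptions, not the plugin's actual values):

# hypothetical sketch: register a template with a fixed number_of_replicas
curl -XPUT 192.168.0.4:9200/_template/kibana -d '
{ "template": "kibana-int", "settings": { "index": { "number_of_replicas": 1 } } }'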

Changed in lma-toolchain:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-plugin-elasticsearch-kibana (stable/1.0)

Fix proposed to branch: stable/1.0
Review: https://review.openstack.org/419940

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-plugin-elasticsearch-kibana (stable/1.0)

Reviewed: https://review.openstack.org/419940
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-elasticsearch-kibana/commit/?id=5a65cb7d331ab26735acc531f8f9b568fee55470
Submitter: Jenkins
Branch: stable/1.0

commit 5a65cb7d331ab26735acc531f8f9b568fee55470
Author: Simon Pasquier <email address hidden>
Date: Mon Jan 9 17:20:20 2017 +0100

    Get rid of auto_expand_replicas for Kibana indices

    Setting auto_expand_replicas to "0-all" can make the Kibana indices
    unavailable when scaling up the Elasticsearch cluster. As a consequence,
    the Kibana service is unavailable and the operator needs to manually fix
    the problem. This change applies the same replication settings as for
    the log and notification indices leading to more predictable behavior.
    It also enforces the replication settings in the provision_services.pp
    manifest to deal with scale-down and scale-up operations.

    Change-Id: I8979f3d006ccd908711cbe0862032dc7b73d9b62
    Closes-Bug: #1552258
    (cherry picked from commit 1c35a4eda8d1a10411835bd156588324b4e1140a)

no longer affects: lma-toolchain/1.0
Changed in lma-toolchain:
status: Fix Committed → Fix Released