Graylog Rev. 54 error: hook failed: "*-relation-joined"

Bug #1970964 reported by Bas de Bruijne
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Elasticsearch Charm
Invalid
Medium
Sudeep Bhandari
Graylog Charm
Fix Released
High
Chi Wai CHAN

Bug Description

In testrun:
https://solutions.qa.canonical.com/testruns/testRun/9715b21a-be66-4e66-a5f7-8684aeaf60f5
With console output:
https://oil-jenkins.canonical.com/job/fce_build/1508//console

graylog dies:
graylog/0* error idle (...) hook failed: "elasticsearch-relation-joined"

In the logs we see
-------------------------------------------------
unit-graylog-0: 13:19:23 INFO unit.graylog/0.juju-log elasticsearch:13: Error calling graylog api: HTTPConnectionPool(host='127.0.0.1', port=9000): Max retries exceeded with url: /api/?pretty=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fed44324f70>: Failed to establish a new connection: [Errno 111] Connection refused'))
unit-graylog-0: 13:19:23 INFO unit.graylog/0.juju-log elasticsearch:13: Invoking reactive handler: reactive/graylog.py:718:configure_elasticsearch
unit-graylog-0: 13:19:23 INFO unit.graylog/0.juju-log elasticsearch:13: Trying to connect to endpoint http://172.31.39.237:9200
unit-graylog-0: 13:19:23 INFO unit.graylog/0.juju-log elasticsearch:13: Endpoint http://172.31.39.237:9200 is not a valid endpoint
unit-graylog-0: 13:19:23 INFO unit.graylog/0.juju-log elasticsearch:13: Can not connect to ES API
unit-graylog-0: 13:19:23 ERROR unit.graylog/0.juju-log elasticsearch:13: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-graylog-0/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-graylog-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-graylog-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-graylog-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-graylog-0/charm/reactive/graylog.py", line 736, in configure_elasticsearch
    es = esapi.ElasticSearchApi(http_hosts)
  File "/var/lib/juju/agents/unit-graylog-0/charm/lib/charms/layer/elasticsearch/api.py", line 27, in __init__
    raise ConnectionError
ConnectionError
-------------------------------------------------

https://oil-jenkins.canonical.com/job/fce_debug_help/1526//console shows more context.

This is not a bug we have seen before, I suspect it was introduced in charm revision 54.

Link to crashdumps etc:
https://oil-jenkins.canonical.com/artifacts/7084908e-36b3-4a45-8d19-c7d0a165479c/index.html

Related branches

description: updated
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

We are also seeing this on 53 so its probably not introduced in 54. Maybe its related to Elasticsearch 59?

Running juju resolved fixes the error, so its probably a race condition.

Revision history for this message
Alexander Balderson (asbalderson) wrote :

we also see this on revision 55

Revision history for this message
Bas de Bruijne (basdbruijne) wrote (last edit ):

Changing the title because the error is not only "elasticsearch-relation-joined" but also:

"website-relation-changed" in https://solutions.qa.canonical.com/testruns/testRun/0efd25b2-fc1a-451c-a9a9-8a707875717f

"mongodb-relation-joined" in https://solutions.qa.canonical.com/testruns/testRun/2adf0ca6-f387-453b-bcd9-c71ab3235733

"nrpe-external-master-relation-changed" in https://solutions.qa.canonical.com/testruns/testRun/6bc4d432-51d3-420f-8a9b-3a493b32e28e

summary: - Graylog Rev. 54 error: hook failed: "elasticsearch-relation-joined"
+ Graylog Rev. 54 error: hook failed: "*-relation-joined"
Revision history for this message
Alexander Balderson (asbalderson) wrote :

I can reproduce this by deploying graylog and mongodb and waiting for the two to come and idle out.

juju bootstrap localhost --config automatically-retry-hooks=false
juju deploy graylog
juju relate graylog:mongodb mongodb:database

wait for idle

juju deploy elasticsearch
juju relate graylog:elasticsearch elasticsearch:client

wait for elasticsearch to start install hook
juju add-unit -n 2 elasticsearch

wait for graylog to error.

The pause here between the first and following 2 units is to simulate the delay it may take for all 3 units to come up.

graylog does try to connect to the first elasticsearch unit which comes up. But it seems like elasticsearch isn't ready for the connection yet.

Is it possible that the code in graylog which connects to elasticsearch happens before elasticsearch is certainly ready for a connection?

In my re-production environment I can see that graylog is trying to connect to elasticsearch about 10 seconds before elasticsearch runs the "client-relation-joined" hook for graylog.

Eric Chen (eric-chen)
Changed in charm-graylog:
importance: Undecided → High
tags: added: bseng-191
Eric Chen (eric-chen)
Changed in charm-graylog:
status: New → Triaged
Eric Chen (eric-chen)
Changed in charm-graylog:
status: Triaged → New
Changed in charm-graylog:
assignee: nobody → Sudeep Bhandari (sudeephb)
Changed in charm-elasticsearch:
assignee: nobody → Sudeep Bhandari (sudeephb)
Revision history for this message
Sudeep Bhandari (sudeephb) wrote :

Replicated following the steps stated above by @asbalderson. The issue appears only when elasticsearch peers are added, and not when only one instance of ES is running.
Also, the fix of https://bugs.launchpad.net/charm-elasticsearch/+bug/1846366 has also fixed the issue for elasticsearch-relation-joined and elasticsearch-relation-changed hook errors.

Revision history for this message
Bartłomiej Poniecki-Klotz (barteus) wrote :

This also happens in the client environment (LMA).

The workaround is to wait until other components go to idle, error or blocked and then use juju resolved on all graylog units.

This helped during the last 4 redeployments.

tags: added: cdo-qa
Eric Chen (eric-chen)
Changed in charm-elasticsearch:
status: New → Won't Fix
status: Won't Fix → Invalid
importance: Undecided → Medium
Changed in charm-graylog:
assignee: Sudeep Bhandari (sudeephb) → Chi Wai CHAN (raychan96)
Changed in charm-graylog:
status: New → Fix Committed
milestone: none → 23.01
Revision history for this message
Konstantinos Kaskavelis (kaskavel) wrote :

We notice a new occurrence of this bug in a jammy baremetal run in lma-maas layer with elasticsearch & graylog charms in latest/candidate channel: https://solutions.qa.canonical.com/testruns/testRun/84242d24-91f1-4614-81ee-21a13d2f92b1

Changed in charm-graylog:
status: Fix Committed → Fix Released
Revision history for this message
Konstantinos Kaskavelis (kaskavel) wrote (last edit ):
Revision history for this message
Chi Wai CHAN (raychan96) wrote :

Hi, thanks for reporting this. From the logs, I notice that there's a short down time (~10s) of ES cluster immediately after the health check, while graylog is trying configure ES after the health check and during the downtime. I proposed a change on how graylog should configure ES, and hopefully it can fix the problem. Thanks!

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.