Graylog Charm

Graylog Rev. 54 error: hook failed: "*-relation-joined"

Bug #1970964 reported by Bas de Bruijne on 2022-04-29

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Elasticsearch Charm	Invalid	Medium	Sudeep Bhandari
	Graylog Charm	Fix Released	High	Chi Wai CHAN	Graylog Charm 23.01

Bug Description

In testrun:
https://solutions.qa.canonical.com/testruns/testRun/9715b21a-be66-4e66-a5f7-8684aeaf60f5
With console output:
https://oil-jenkins.canonical.com/job/fce_build/1508//console

graylog dies:
graylog/0* error idle (...) hook failed: "elasticsearch-relation-joined"

In the logs we see
-------------------------------------------------
unit-graylog-0: 13:19:23 INFO unit.graylog/0.juju-log elasticsearch:13: Error calling graylog api: HTTPConnectionPool(host='127.0.0.1', port=9000): Max retries exceeded with url: /api/?pretty=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fed44324f70>: Failed to establish a new connection: [Errno 111] Connection refused'))
unit-graylog-0: 13:19:23 INFO unit.graylog/0.juju-log elasticsearch:13: Invoking reactive handler: reactive/graylog.py:718:configure_elasticsearch
unit-graylog-0: 13:19:23 INFO unit.graylog/0.juju-log elasticsearch:13: Trying to connect to endpoint http://172.31.39.237:9200
unit-graylog-0: 13:19:23 INFO unit.graylog/0.juju-log elasticsearch:13: Endpoint http://172.31.39.237:9200 is not a valid endpoint
unit-graylog-0: 13:19:23 INFO unit.graylog/0.juju-log elasticsearch:13: Can not connect to ES API
unit-graylog-0: 13:19:23 ERROR unit.graylog/0.juju-log elasticsearch:13: Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-graylog-0/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-graylog-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-graylog-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-graylog-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-graylog-0/charm/reactive/graylog.py", line 736, in configure_elasticsearch
    es = esapi.ElasticSearchApi(http_hosts)
  File "/var/lib/juju/agents/unit-graylog-0/charm/lib/charms/layer/elasticsearch/api.py", line 27, in __init__
    raise ConnectionError
ConnectionError
-------------------------------------------------

https://oil-jenkins.canonical.com/job/fce_debug_help/1526//console shows more context.

This is not a bug we have seen before, I suspect it was introduced in charm revision 54.

Link to crashdumps etc:
https://oil-jenkins.canonical.com/artifacts/7084908e-36b3-4a45-8d19-c7d0a165479c/index.html

See original description

Tags:

Related branches

~raychan96/charm-graylog:fix_temporarily_es_down

Merged into charm-graylog:master at revision 06202c3a18baf796676ac309994f1291b3cf4e0e

Ramesh Sattaru (community): Approve on 2023-02-24

Erhan Sunar (community): Approve on 2023-02-24

🤖 prod-jenkaas-bootstack (community): Approve (continuous-integration) on 2023-02-24

JamesLin: Needs Fixing on 2023-02-21

~raychan96/charm-graylog:1970964_and_1835156

Merged into charm-graylog:master at revision 4f92d5ba4a8fe0f32622190d9214300e8dc3281e

JamesLin: Approve on 2022-12-26

🤖 prod-jenkaas-bootstack (community): Approve (continuous-integration) on 2022-12-23

Eric Chen: Approve on 2022-12-23

BootStack Reviewers: Pending requested 2022-12-21

Bas de Bruijne (basdbruijne) on 2022-04-29

description:

updated

Revision history for this message

Bas de Bruijne (basdbruijne) wrote on 2022-04-29:

We are also seeing this on 53 so its probably not introduced in 54. Maybe its related to Elasticsearch 59?

Running juju resolved fixes the error, so its probably a race condition.

Revision history for this message

Alexander Balderson (asbalderson) wrote on 2022-05-06:

we also see this on revision 55

Revision history for this message

Bas de Bruijne (basdbruijne) wrote on 2022-05-09 (last edit on 2022-05-09):

Changing the title because the error is not only "elasticsearch-relation-joined" but also:

"website-relation-changed" in https://solutions.qa.canonical.com/testruns/testRun/0efd25b2-fc1a-451c-a9a9-8a707875717f

"mongodb-relation-joined" in https://solutions.qa.canonical.com/testruns/testRun/2adf0ca6-f387-453b-bcd9-c71ab3235733

"nrpe-external-master-relation-changed" in https://solutions.qa.canonical.com/testruns/testRun/6bc4d432-51d3-420f-8a9b-3a493b32e28e

summary:

- Graylog Rev. 54 error: hook failed: "elasticsearch-relation-joined"
+ Graylog Rev. 54 error: hook failed: "*-relation-joined"

Revision history for this message

Alexander Balderson (asbalderson) wrote on 2022-05-20:

I can reproduce this by deploying graylog and mongodb and waiting for the two to come and idle out.

juju bootstrap localhost --config automatically-retry-hooks=false
juju deploy graylog
juju relate graylog:mongodb mongodb:database

wait for idle

juju deploy elasticsearch
juju relate graylog:elasticsearch elasticsearch:client

wait for elasticsearch to start install hook
juju add-unit -n 2 elasticsearch

wait for graylog to error.

The pause here between the first and following 2 units is to simulate the delay it may take for all 3 units to come up.

graylog does try to connect to the first elasticsearch unit which comes up. But it seems like elasticsearch isn't ready for the connection yet.

Is it possible that the code in graylog which connects to elasticsearch happens before elasticsearch is certainly ready for a connection?

In my re-production environment I can see that graylog is trying to connect to elasticsearch about 10 seconds before elasticsearch runs the "client-relation-joined" hook for graylog.

Eric Chen (eric-chen) on 2022-06-24

Changed in charm-graylog:
importance:	Undecided → High
tags:	added: bseng-191

Eric Chen (eric-chen) on 2022-06-24

Changed in charm-graylog:
status:	New → Triaged

Eric Chen (eric-chen) on 2022-06-24

Changed in charm-graylog:
status:	Triaged → New

Sudeep Bhandari (sudeephb) on 2022-07-26

Changed in charm-graylog:
assignee:	nobody → Sudeep Bhandari (sudeephb)

Sudeep Bhandari (sudeephb) on 2022-07-26

Changed in charm-elasticsearch:
assignee:	nobody → Sudeep Bhandari (sudeephb)

Revision history for this message

Sudeep Bhandari (sudeephb) wrote on 2022-07-26:

Replicated following the steps stated above by @asbalderson. The issue appears only when elasticsearch peers are added, and not when only one instance of ES is running.
Also, the fix of https://bugs.launchpad.net/charm-elasticsearch/+bug/1846366 has also fixed the issue for elasticsearch-relation-joined and elasticsearch-relation-changed hook errors.

Revision history for this message

Bartłomiej Poniecki-Klotz (barteus) wrote on 2022-08-12:

This also happens in the client environment (LMA).

The workaround is to wait until other components go to idle, error or blocked and then use juju resolved on all graylog units.

This helped during the last 4 redeployments.

Jeffrey Chang (modern911) on 2022-11-16

tags:

added: cdo-qa

Eric Chen (eric-chen) on 2023-01-03

Changed in charm-elasticsearch:
status:	New → Won't Fix
status:	Won't Fix → Invalid
importance:	Undecided → Medium

Sudeep Bhandari (sudeephb) on 2023-01-11

Changed in charm-graylog:
assignee:	Sudeep Bhandari (sudeephb) → Chi Wai CHAN (raychan96)

Sudeep Bhandari (sudeephb) on 2023-01-11

Changed in charm-graylog:
status:	New → Fix Committed
milestone:	none → 23.01

Revision history for this message

Konstantinos Kaskavelis (kaskavel) wrote on 2023-01-23:

We notice a new occurrence of this bug in a jammy baremetal run in lma-maas layer with elasticsearch & graylog charms in latest/candidate channel: https://solutions.qa.canonical.com/testruns/testRun/84242d24-91f1-4614-81ee-21a13d2f92b1

Sudeep Bhandari (sudeephb) on 2023-02-14

Changed in charm-graylog:
status:	Fix Committed → Fix Released

Revision history for this message

Konstantinos Kaskavelis (kaskavel) wrote on 2023-02-16 (last edit on 2023-02-16):

We are actually still seeing this bug, eg:

https://solutions.qa.canonical.com/testruns/testRun/4f834f60-f1f6-40c2-a9af-57b56602ff20

Revision history for this message

Chi Wai CHAN (raychan96) wrote on 2023-02-27:

Hi, thanks for reporting this. From the logs, I notice that there's a short down time (~10s) of ES cluster immediately after the health check, while graylog is trying configure ES after the health check and during the downtime. I proposed a change on how graylog should configure ES, and hopefully it can fix the problem. Thanks!

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.