elasticsearch errors with message hook failed: "peer-relation-joined" while elasticsearch is not even running

Bug #1992796 reported by Konstantinos Kaskavelis
This bug affects 1 person
Affects: Elasticsearch Charm
Status: Won't Fix
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

The SQA team has a failing run where elasticsearch errors with the message:
hook failed: "peer-relation-joined"

From the logs we see that elasticsearch is not running (Timeout when waiting for 127.0.0.1:9200):

<localhost> PUT /root/.ansible/tmp/ansible-local-252613yr5bpb3/tmpr2np4_mo TO /root/.ansible/tmp/ansible-tmp-1664550140.4646742-121386142852947/AnsiballZ_wait_for.py
<localhost> EXEC /bin/sh -c 'chmod u+x /root/.ansible/tmp/ansible-tmp-1664550140.4646742-121386142852947/ /root/.ansible/tmp/ansible-tmp-1664550140.4646742-121386142852947/AnsiballZ_wait_for.py && sleep 0'
<localhost> EXEC /bin/sh -c '/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1664550140.4646742-121386142852947/AnsiballZ_wait_for.py && sleep 0'
<localhost> EXEC /bin/sh -c 'rm -f -r /root/.ansible/tmp/ansible-tmp-1664550140.4646742-121386142852947/ > /dev/null 2>&1 && sleep 0'
fatal: [localhost]: FAILED! => {
    "changed": false,
    "elapsed": 300,
    "invocation": {
        "module_args": {
            "active_connection_states": [
                "ESTABLISHED",
                "FIN_WAIT1",
                "FIN_WAIT2",
                "SYN_RECV",
                "SYN_SENT",
                "TIME_WAIT"
            ],
            "connect_timeout": 5,
            "delay": 0,
            "exclude_hosts": null,
            "host": "127.0.0.1",
            "msg": null,
            "path": null,
            "port": 9200,
            "search_regex": null,
            "sleep": 1,
            "state": "started",
            "timeout": 300
        }
    },
    "msg": "Timeout when waiting for 127.0.0.1:9200"
}

PLAY RECAP *********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined Traceback (most recent call last):
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined File "/var/lib/juju/agents/unit-elasticsearch-1/charm/hooks/peer-relation-joined", line 244, in <module>
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined hooks.execute(sys.argv)
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined File "/var/lib/juju/agents/unit-elasticsearch-1/charm/hooks/charmhelpers/contrib/ansible/__init__.py", line 292, in execute
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined charmhelpers.contrib.ansible.apply_playbook(
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined File "/var/lib/juju/agents/unit-elasticsearch-1/charm/hooks/charmhelpers/contrib/ansible/__init__.py", line 226, in apply_playbook
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined raise e
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined File "/var/lib/juju/agents/unit-elasticsearch-1/charm/hooks/charmhelpers/contrib/ansible/__init__.py", line 218, in apply_playbook
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined subprocess.check_output(call, env=env)
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined File "/usr/lib/python3.8/subprocess.py", line 516, in run
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined raise CalledProcessError(retcode, process.args,
unit-elasticsearch-1: 15:07:21 WARNING unit.elasticsearch/1.peer-relation-joined subprocess.CalledProcessError: Command '['ansible-playbook', '-vvv', '-c', 'local', 'playbook.yaml', '--tags', 'peer-relation-joined']' returned non-zero exit status 2.
unit-elasticsearch-1: 15:07:21 ERROR juju.worker.uniter.operation hook "peer-relation-joined" (via explicit, bespoke hook script) failed: exit status 1
unit-elasticsearch-1: 15:07:21 DEBUG juju.machinelock created rotating log file "/var/log/juju/machine-lock.log" with max size 10 MB and max backups 5
unit-elasticsearch-1: 15:07:21 DEBUG juju.machinelock machine lock released for elasticsearch/1 uniter (run relation-joined (0; unit: elasticsearch/0) hook)
unit-elasticsearch-1: 15:07:21 DEBUG juju.worker.uniter.operation lock released for elasticsearch/1
unit-elasticsearch-1: 15:07:21 INFO juju.worker.uniter awaiting error resolution for "relation-joined" hook
unit-elasticsearch-1: 15:07:21 DEBUG juju.worker.uniter [AGENT-STATUS] error: hook failed: "peer-relation-joined"

Test run:

https://solutions.qa.canonical.com/testruns/testRun/5c197da2-6de7-4d65-ba3e-d9335465a8ad

Tags: bseng-495
Chi Wai CHAN (raychan96) wrote :

One possible cause of this error is that elasticsearch isn't running because of a JVM memory issue. This can be checked in the system service status and in elasticsearch's log. One way to avoid the JVM memory issue is to add an appropriate memory constraint when deploying elasticsearch (e.g. --constraints mem=4G).
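
As a rough illustration of the log check suggested above, the sketch below scans log lines for strings that a JVM commonly prints when it cannot obtain enough memory. The marker strings and the function name are assumptions made for illustration; they are not taken from the failing run, and the actual elasticsearch log may word the failure differently.

```python
from typing import Iterable, List

# Assumed markers: strings a JVM commonly emits when it runs short of memory.
# They are illustrative and were not observed in the failing test run.
JVM_MEMORY_MARKERS = (
    "java.lang.OutOfMemoryError",
    "There is insufficient memory for the Java Runtime Environment",
    "Cannot allocate memory",
)


def find_jvm_memory_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain any of the assumed memory markers."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line for marker in JVM_MEMORY_MARKERS)
    ]
```

The function accepts any iterable of lines, so it can be fed an open log file or the captured output of the service journal; which file or unit name to read depends on the deployment and is not specified in this report.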

Eric Chen (eric-chen)
tags: added: bseng-495
JamesLin (jneo8) wrote :

I deployed Elasticsearch on serverstack with 5 m1-small (2G RAM, 20G Disk, CPU) machines. I ran the functional tests, which include a scale test, and also used the command `juju add-unit -n 2` to scale up the elasticsearch cluster. Everything seems to work fine.

If I understand correctly, the peer-relation-joined hook just sets up elasticsearch using ansible and then waits for it to become available on port 9200.

We may need more information to reproduce it.
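
To help with reproduction, the wait step described above can be approximated outside the charm. This is a minimal sketch, not Ansible's `wait_for` implementation: the function name is hypothetical, and the defaults simply mirror the `module_args` in the failing log (host 127.0.0.1, port 9200, timeout 300, connect_timeout 5, sleep 1).

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 9200,
                  timeout: float = 300.0, connect_timeout: float = 5.0,
                  sleep: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=connect_timeout):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly, then retry.
            time.sleep(sleep)
    return False
```

Running `wait_for_port()` on an affected unit and getting `False` would match the "Timeout when waiting for 127.0.0.1:9200" message, which points at the service never listening rather than at the relation hook itself.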

Eric Chen (eric-chen) wrote :

There have been no new records since 10/13. This issue may have been resolved after the memory was increased to 4G.

No new record:
https://solutions.qa.canonical.com/bugs/bugs/bug/1992796

Before #1992796 was created, this issue was categorized under #1846366, and there are no new records there either.
https://solutions.qa.canonical.com/bugs/bugs/bug/1846366

We can close it after one week of observation.

JamesLin (jneo8)
Changed in charm-elasticsearch:
status: New → Won't Fix
Eric Chen (eric-chen)
Changed in charm-elasticsearch:
status: Won't Fix → Invalid
Konstantinos Kaskavelis (kaskavel) wrote :

We have a new occurrence of the bug in https://solutions.qa.canonical.com/v2/testruns/6bd0dab8-e01b-4e3d-acb2-e5b4ec9b85cd, could we take another look at it?

Jeffrey Chang (modern911) wrote :

Changed status to New since we have more data to investigate.
Solutions QA hit this 5 (on #1846366) + 2 (on this bug) times in Nov.

Changed in charm-elasticsearch:
status: Invalid → New
Eric Chen (eric-chen)
Changed in charm-elasticsearch:
importance: Undecided → Low
Eric Chen (eric-chen) wrote :

This charm is in maintenance mode. Only critical bugs will be handled.
Please consider using the new Canonical Observability Stack (https://charmhub.io/topics/canonical-observability-stack) or the opensearch-operator (https://github.com/canonical/opensearch-operator) instead.

Changed in charm-elasticsearch:
status: New → Won't Fix