Elasticsearch fails on peer-relation-joined due to health check due to ES trying to use FAN IPs, which aren't open on the firewall

Bug #1880729 reported by Michael Skalka
This bug affects 3 people
Affects: Elasticsearch Charm
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

As seen on this test run: https://solutions.qa.canonical.com/#/qa/testRun/b7709dc7-e913-4a45-bb76-a294196d54c8

It appears that Elasticsearch is trying to initialize its cluster; however, during the peer-relation-joined hook it queries its own health (http://localhost:9200/_cluster/health) and the check fails.
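
For reference, the failing check can be reproduced by hand on an affected unit; a sketch of the same endpoint the charm queries, assuming the default port:

$ curl -s 'http://localhost:9200/_cluster/health?pretty'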

Artifacts can be found at the bottom of the above link.


Revision history for this message
Marian Gasparovic (marosg) wrote :
Revision history for this message
Diko Parvanov (dparv) wrote :

Might be related to a recent commit: https://code.launchpad.net/~xavpaice/charm-elasticsearch/+git/charm-elasticsearch/+merge/384414 where the Elasticsearch listen address gets changed to 0.0.0.0.

Diko Parvanov (dparv)
Changed in charm-elasticsearch:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Diko Parvanov (dparv) wrote :

Tried to reproduce the issue and got the following:

[2020-06-03T09:29:48,961][INFO ][o.e.t.TransportService ] [juju-860c22-1] publish_address {192.168.1.102:9300}, bound_addresses {[::]:9300}
[2020-06-03T09:29:49,228][INFO ][o.e.b.BootstrapChecks ] [juju-860c22-1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2020-06-03T09:29:49,247][ERROR][o.e.b.Bootstrap ] [juju-860c22-1] node validation exception
[2] bootstrap checks failed

The elasticsearch service does not start. Adding

transport.host: 127.0.0.1

to /etc/elasticsearch/elasticsearch.yml fixes the issue.
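
For anyone reproducing this, the change amounts to the following (a sketch, assuming the stock package paths):

$ echo 'transport.host: 127.0.0.1' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
$ sudo systemctl restart elasticsearch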

Could you please re-check with cs:~dparv/elasticsearch-1 and see if the issue is solved? If so, I will submit an MP to fix it.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This continues to block SQA runs - subscribed to field-high.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We tried cs:~dparv/elasticsearch-1 over the weekend and continued to hit this issue, here is an example test run:

https://solutions.qa.canonical.com/#/qa/testRun/4acc5d1f-13db-4c57-b326-4321bbf2054a

Jeremy Lounder (jldev)
Changed in charm-elasticsearch:
importance: Medium → High
Revision history for this message
Paul Goins (vultaire) wrote :

It'd be very useful to get the juju log output of the failed unit. Can this be provided?

I am trying to reproduce this, and while I did hit a separate bug with Focal (https://bugs.launchpad.net/charm-elasticsearch/+bug/1882824), on Bionic I'm not yet reproducing this issue. This is with the promulgated charm, not with Diko's version.

Revision history for this message
Michael Skalka (mskalka) wrote :

Paul,

If you follow the links to the test run above there is a link to the full juju crash dump for that run at the bottom of the page.

Revision history for this message
Paul Goins (vultaire) wrote :

Additionally, if you can stand up an environment where this issue occurs outside of CI, could you SSH into one of the ES units and run the following? It'd be really helpful to get a pastebin of the output.

$ sudo su -
# cd /var/lib/juju/agents/unit-elasticsearch-*/charm
# ansible-playbook -vvv -c local playbook.yaml --tags peer-relation-joined

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Paul, here's the output from the command you requested in #9:

http://paste.ubuntu.com/p/9nqB8nrRbj/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

In elasticsearch.log, it looks like it's trying to use fan IP addresses:
[2020-06-12T18:43:37,442][WARN ][o.e.d.z.ZenDiscovery ] [lnCuF_C] failed to connect to master [{2KHB_Si}{2KHB_SigQQa3eqKxqaA_lw}{xcl7FFdeTguXT2ONNMHcXw}{252.140.0.1}{252.140.0.1:9300}], retrying...

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

These two addresses are the native OpenStack address and the fan address for the same node. I can reach Elasticsearch on the native OpenStack one, but not on the fan one.

ubuntu@juju-59bcbe-kubernetes-4:~$ telnet 172.16.0.143 9300
Trying 172.16.0.143...
Connected to 172.16.0.143.
Escape character is '^]'.
^]
telnet> Connection closed.
ubuntu@juju-59bcbe-kubernetes-4:~$ telnet 252.143.0.1 9300
Trying 252.143.0.1...

I can ping the fan one though:
ubuntu@juju-59bcbe-kubernetes-4:~$ ping 252.143.0.1
PING 252.143.0.1 (252.143.0.1) 56(84) bytes of data.
64 bytes from 252.143.0.1: icmp_seq=1 ttl=64 time=0.840 ms

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

So the problem is that the charm/Elasticsearch is trying to connect using the fan addresses, but the firewall rules are only set up to allow access via the OpenStack addresses:

ubuntu@juju-59bcbe-kubernetes-3:~$ sudo iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination
ufw-before-logging-input all -- 0.0.0.0/0 0.0.0.0/0
ufw-before-input all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-input all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-logging-input all -- 0.0.0.0/0 0.0.0.0/0
ufw-reject-input all -- 0.0.0.0/0 0.0.0.0/0
ufw-track-input all -- 0.0.0.0/0 0.0.0.0/0

Chain FORWARD (policy DROP)
target prot opt source destination
ufw-before-logging-forward all -- 0.0.0.0/0 0.0.0.0/0
ufw-before-forward all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-forward all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-logging-forward all -- 0.0.0.0/0 0.0.0.0/0
ufw-reject-forward all -- 0.0.0.0/0 0.0.0.0/0
ufw-track-forward all -- 0.0.0.0/0 0.0.0.0/0

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
ufw-before-logging-output all -- 0.0.0.0/0 0.0.0.0/0
ufw-before-output all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-output all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-logging-output all -- 0.0.0.0/0 0.0.0.0/0
ufw-reject-output all -- 0.0.0.0/0 0.0.0.0/0
ufw-track-output all -- 0.0.0.0/0 0.0.0.0/0

Chain ufw-after-forward (1 references)
target prot opt source destination

Chain ufw-after-input (1 references)
target prot opt source destination
ufw-skip-to-policy-input udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:137
ufw-skip-to-policy-input udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:138
ufw-skip-to-policy-input tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:139
ufw-skip-to-policy-input tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:445
ufw-skip-to-policy-input udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:67
ufw-skip-to-policy-input udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:68
ufw-skip-to-policy-input all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type BROADCAST

Chain ufw-after-logging-forward (1 references)
target prot opt source destination
LOG all -- 0.0.0.0/0 0.0.0.0/0 limit: avg 3/min burst 10 LOG flags 0 level 4 prefix "[UFW BLOCK] "

Chain ufw-after-logging-input (1 references)
target prot opt source destination

Chain ufw-after-logging-output (1 references)
target prot opt source destination

Chain ufw-after-output (1 references)
target prot opt source destination

Chain ufw-bef...


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Without fan networking we can still get a hook failure, but I feel like that's a different underlying issue from what is reported here. This issue - with the fan networking IP address being used - doesn't clear on retry. The hook failure with fan disabled does.

Revision history for this message
Paul Goins (vultaire) wrote :

I'm still taking a look at this, but based on what I've been able to gather so far - from unit logs and /var/log/elasticsearch/elasticsearch.log - this appears to be some sort of clustering issue, and may be a race condition.

If we have an acceptable workaround here (don't use fan networking, and retry hooks on initial failure), do we still think this is a field-high issue, or does the priority lower somewhat?
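
For reference, the retry part of that workaround refers to Juju's automatically-retry-hooks model setting (enabled by default); a sketch:

$ juju model-config automatically-retry-hooks=true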

Clearly there is an issue re: fan networking, and I think this issue should be retitled for that specifically. And there may be an issue regarding initial clustering race conditions - I can make a bug for that.

Revision history for this message
Paul Goins (vultaire) wrote :

Actually, I did find a bug for the latter: https://bugs.launchpad.net/charm-elasticsearch/+bug/1835410

Revision history for this message
Michael Skalka (mskalka) wrote :

Paul,

We explicitly do not enable retries in testing so as to catch issues like this. That has been a policy in SQA for at least two years now. We cannot accept a workaround that involves enabling hook retries.

Revision history for this message
Paul Goins (vultaire) wrote :

@Michael: Thanks for your reply, and understood.

@All: With regards to 1835410 (which is the issue which remains when fan networking is disabled, and is what the retry workaround addresses): I'll comment further on that ticket, so as to keep these issues separate. I'd like to keep this issue focused on fan networking.

summary: - Elasticsaerch fails on peer-relation-joined due to health check
+ Elasticsearch fails on peer-relation-joined due to health check
summary: - Elasticsearch fails on peer-relation-joined due to health check
+ Elasticsearch fails on peer-relation-joined due to health check due to ES
+ trying to use FAN IPs, which aren't open on the firewall
Xav Paice (xavpaice)
Changed in charm-elasticsearch:
assignee: nobody → Xav Paice (xavpaice)
status: Triaged → In Progress
Revision history for this message
Xav Paice (xavpaice) wrote :

The ufw setup is activated by `ufw: rule=allow src={{ lookup('dns', item['private-address']) }} port=9200 proto=tcp` for each client relation. If the client has several addresses and private-address returns the "wrong" one, then this will not open the firewall for a connection from the address the client actually uses. The rule for port 9300 is very similar.
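
For reference, each rendered rule is roughly equivalent to the following ufw invocation (a sketch; 10.0.8.41 stands in for a resolved client address):

$ sudo ufw allow proto tcp from 10.0.8.41 to any port 9200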

However, I've tried to reproduce this:

Elastic x 2 on bare machines, with fan network enabled, and client located in the fan network
Elastic x 1 on fan network, client on fan network

I've been unable to reproduce, using cs:elasticsearch-44 and Bionic.

Can you share the layout of the test runs that are failing so that I might be able to reproduce? Also, a current fail log would be good, as the link in earlier comments has expired.

Changed in charm-elasticsearch:
status: In Progress → Incomplete
Revision history for this message
Jose Guedez (jfguedez) wrote :

I was working on LP1881633, which I believe is a duplicate of this bug (I was working on it and only later discovered this one).

As some comments above mention, it is related to the handling of fan networking (or peer-to-peer connectivity over multiple interfaces in general). It seems to be caused by a mismatch between the Elasticsearch configuration (bind to all interfaces, with no peer IPs set in the discovery config until the relation propagates the data) and the firewall rules being set up by the charm.

The race occurs because Elasticsearch is restarted several times while the firewall rules are being added. If Elasticsearch is restarted after the firewall rules are in place, peer traffic will be blocked, leading to the timeouts.
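
On an affected unit, the mismatch can be checked by comparing the addresses Elasticsearch is bound to against what the firewall permits (a sketch using standard tools):

$ sudo ss -tlnp | grep -E '9200|9300'
$ sudo ufw status numbered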

This issue can be reproduced reliably by setting up a cluster with multiple nodes (in a model with a fan network configured) and restarting Elasticsearch after the model settles:

1- Create the model, configure fan networking, and deploy ES:

juju add-model elasticsearch
juju model-config container-networking-method=fan
juju model-config fan-config=10.0.8.0/24=252.0.0.0/8 # adjust based on environment
juju deploy cs:elasticsearch-45 -n 3

2- Wait for the model to fully settle, then restart Elasticsearch:

juju run --app elasticsearch -- systemctl restart elasticsearch

3- After the update hooks run, multiple units will block (or form single-unit clusters), especially if there is an index with multiple shards:

juju status | grep blocked
elasticsearch/0* blocked idle 0 10.0.8.93 9200/tcp elasticsearch is reporting problems with local host - please check health
elasticsearch/1 blocked idle 1 10.0.8.41 9200/tcp elasticsearch is reporting problems with local host - please check health
elasticsearch/2 blocked idle 2 10.0.8.2 9200/tcp elasticsearch is reporting problems with local host - please check health

Configuring ES to bind only to the interface whitelisted in the firewall would solve this issue. However, as far as I can tell it was set up this way to deal with LP1714126, so that would be considered a regression. I will discuss the options with the team (we probably need to properly support spaces, which might not be straightforward). I will also raise the question of severity, given the sensitivity to service restarts.
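
For illustration only (as noted, this would regress LP1714126), the charm-rendered /etc/elasticsearch/elasticsearch.yml would need to pin the bind address to the whitelisted interface, along these lines (10.0.8.93 is a placeholder for the unit's own private address):

network.host: 10.0.8.93
transport.host: 10.0.8.93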

Changed in charm-elasticsearch:
status: Incomplete → Confirmed
Jose Guedez (jfguedez)
Changed in charm-elasticsearch:
assignee: Xav Paice (xavpaice) → Jose Guedez (jfguedez)
Jose Guedez (jfguedez)
Changed in charm-elasticsearch:
status: Confirmed → In Progress
Revision history for this message
Jose Guedez (jfguedez) wrote :

After some discussions with other members of the team, the consensus was that the Elasticsearch charm should not really be adding and updating firewall rules.

If the user needs this functionality (due to e.g. security concerns), it should be managed outside of the charm. In any case, the current firewall rule logic is not very comprehensive and doesn't handle multiple interfaces properly, causing issues like this bug.

The ideal solution would be to use spaces, but that is not supported by the charm (see LP1714126).

Revision history for this message
Michael Skalka (mskalka) wrote :
Jose Guedez (jfguedez)
Changed in charm-elasticsearch:
assignee: Jose Guedez (jfguedez) → nobody
Changed in charm-elasticsearch:
status: In Progress → Fix Committed
Celia Wang (ziyiwang)
Changed in charm-elasticsearch:
status: Fix Committed → Fix Released