Elasticsearch fails on peer-relation-joined due to health check due to ES trying to use FAN IPs, which aren't open on the firewall

Bug #1880729 reported by Michael Skalka
This bug affects 3 people
Affects: Elasticsearch Charm
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

As seen on this test run: https://solutions.qa.canonical.com/#/qa/testRun/b7709dc7-e913-4a45-bb76-a294196d54c8

It appears that Elasticsearch is trying to initialize its cluster; however, during the peer-relation-joined hook it queries its own health (http://localhost:9200/_cluster/health) and the check fails.
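
For reference, the failing check can be reproduced by hand on an affected unit; a sketch of the same endpoint the charm queries, assuming the default port:

$ curl -s 'http://localhost:9200/_cluster/health?pretty'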

Artifacts can be found at the bottom of the above link.


Revision history for this message
Marian Gasparovic (marosg) wrote :
Revision history for this message
Diko Parvanov (dparv) wrote :

Might be related to a recent commit: https://code.launchpad.net/~xavpaice/charm-elasticsearch/+git/charm-elasticsearch/+merge/384414 where the Elasticsearch listen address gets changed to 0.0.0.0.

Diko Parvanov (dparv)
Changed in charm-elasticsearch:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Diko Parvanov (dparv) wrote :

Tried to reproduce the issue and got the following:

[2020-06-03T09:29:48,961][INFO ][o.e.t.TransportService ] [juju-860c22-1] publish_address {192.168.1.102:9300}, bound_addresses {[::]:9300}
[2020-06-03T09:29:49,228][INFO ][o.e.b.BootstrapChecks ] [juju-860c22-1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2020-06-03T09:29:49,247][ERROR][o.e.b.Bootstrap ] [juju-860c22-1] node validation exception
[2] bootstrap checks failed

The elasticsearch service does not start. Adding

transport.host: 127.0.0.1

to /etc/elasticsearch/elasticsearch.yml fixes the issue.
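
For anyone reproducing this, the change amounts to the following (a sketch, assuming the stock package paths):

$ echo 'transport.host: 127.0.0.1' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
$ sudo systemctl restart elasticsearch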

Could you please re-check with cs:~dparv/elasticsearch-1 and see if the issue is solved? If so, I will submit an MP to fix it.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This continues to block SQA runs - subscribed to field-high.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We tried cs:~dparv/elasticsearch-1 over the weekend and continued to hit this issue, here is an example test run:

https://solutions.qa.canonical.com/#/qa/testRun/4acc5d1f-13db-4c57-b326-4321bbf2054a

Jeremy Lounder (jldev)
Changed in charm-elasticsearch:
importance: Medium → High
Revision history for this message
Paul Goins (vultaire) wrote :

It'd be very useful to get the juju log output of the failed unit. Can this be provided?

I am trying to reproduce this, and while I did hit a separate bug with Focal (https://bugs.launchpad.net/charm-elasticsearch/+bug/1882824), on Bionic I'm not yet reproducing this issue. This is with the promulgated charm, not with Diko's version.

Revision history for this message
Michael Skalka (mskalka) wrote :

Paul,

If you follow the links to the test run above there is a link to the full juju crash dump for that run at the bottom of the page.

Revision history for this message
Paul Goins (vultaire) wrote :

Additionally, if you can stand up an environment where this issue occurs outside of CI, could you SSH into one of the ES units and run the following? It'd be really helpful to get a pastebin of the output.

$ sudo su -
# cd /var/lib/juju/agents/unit-elasticsearch-*/charm
# ansible-playbook -vvv -c local playbook.yaml --tags peer-relation-joined

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Paul, here's the output from the command you requested in #9:

http://paste.ubuntu.com/p/9nqB8nrRbj/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

In elasticsearch.log, it looks like it's trying to use fan IP addresses:
[2020-06-12T18:43:37,442][WARN ][o.e.d.z.ZenDiscovery ] [lnCuF_C] failed to connect to master [{2KHB_Si}{2KHB_SigQQa3eqKxqaA_lw}{xcl7FFdeTguXT2ONNMHcXw}{252.140.0.1}{252.140.0.1:9300}], retrying...

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

These two addresses are the native OpenStack address and the fan address for the same node. I can reach Elasticsearch on the native OpenStack one, but not on the fan one.

ubuntu@juju-59bcbe-kubernetes-4:~$ telnet 172.16.0.143 9300
Trying 172.16.0.143...
Connected to 172.16.0.143.
Escape character is '^]'.
^]
telnet> Connection closed.
ubuntu@juju-59bcbe-kubernetes-4:~$ telnet 252.143.0.1 9300
Trying 252.143.0.1...

I can ping the fan one though:
ubuntu@juju-59bcbe-kubernetes-4:~$ ping 252.143.0.1
PING 252.143.0.1 (252.143.0.1) 56(84) bytes of data.
64 bytes from 252.143.0.1: icmp_seq=1 ttl=64 time=0.840 ms

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

So the problem is that the charm/Elasticsearch is trying to connect using the fan addresses, but the firewall rules are only set up to allow access via the OpenStack addresses:

ubuntu@juju-59bcbe-kubernetes-3:~$ sudo iptables -L -n
Chain INPUT (policy ACCEPT)
target prot opt source destination
ufw-before-logging-input all -- 0.0.0.0/0 0.0.0.0/0
ufw-before-input all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-input all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-logging-input all -- 0.0.0.0/0 0.0.0.0/0
ufw-reject-input all -- 0.0.0.0/0 0.0.0.0/0
ufw-track-input all -- 0.0.0.0/0 0.0.0.0/0

Chain FORWARD (policy DROP)
target prot opt source destination
ufw-before-logging-forward all -- 0.0.0.0/0 0.0.0.0/0
ufw-before-forward all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-forward all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-logging-forward all -- 0.0.0.0/0 0.0.0.0/0
ufw-reject-forward all -- 0.0.0.0/0 0.0.0.0/0
ufw-track-forward all -- 0.0.0.0/0 0.0.0.0/0

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
ufw-before-logging-output all -- 0.0.0.0/0 0.0.0.0/0
ufw-before-output all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-output all -- 0.0.0.0/0 0.0.0.0/0
ufw-after-logging-output all -- 0.0.0.0/0 0.0.0.0/0
ufw-reject-output all -- 0.0.0.0/0 0.0.0.0/0
ufw-track-output all -- 0.0.0.0/0 0.0.0.0/0

Chain ufw-after-forward (1 references)
target prot opt source destination

Chain ufw-after-input (1 references)
target prot opt source destination
ufw-skip-to-policy-input udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:137
ufw-skip-to-policy-input udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:138
ufw-skip-to-policy-input tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:139
ufw-skip-to-policy-input tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:445
ufw-skip-to-policy-input udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:67
ufw-skip-to-policy-input udp -- 0.0.0.0/0 0.0.0.0/0 udp dpt:68
ufw-skip-to-policy-input all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type BROADCAST

Chain ufw-after-logging-forward (1 references)
target prot opt source destination
LOG all -- 0.0.0.0/0 0.0.0.0/0 limit: avg 3/min burst 10 LOG flags 0 level 4 prefix "[UFW BLOCK] "

Chain ufw-after-logging-input (1 references)
target prot opt source destination

Chain ufw-after-logging-output (1 references)
target prot opt source destination

Chain ufw-after-output (1 references)
target prot opt source destination

Chain ufw-bef...


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Without fan networking we can still get a hook failure, but I feel like that's a different underlying issue from what is reported here. This issue - with the fan networking IP address being used - doesn't clear on retry. The hook failure with fan disabled does.

Revision history for this message
Paul Goins (vultaire) wrote :

I'm still taking a look at this, but based on what I've been able to gather so far - from unit logs and /var/log/elasticsearch/elasticsearch.log - this appears to be some sort of clustering issue, and may be a race condition.

If we have an acceptable workaround here (don't use fan networking, and retry hooks on initial failure), do we still think this is a field-high issue, or does the priority lower somewhat?
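
For reference, the retry part of that workaround refers to Juju's automatically-retry-hooks model setting (enabled by default); a sketch:

$ juju model-config automatically-retry-hooks=true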

Clearly there is an issue re: fan networking, and I think this issue should be retitled for that specifically. And there may be an issue regarding initial clustering race conditions - I can make a bug for that.

Revision history for this message
Paul Goins (vultaire) wrote :

Actually, I did find a bug for the latter: https://bugs.launchpad.net/charm-elasticsearch/+bug/1835410

Revision history for this message
Michael Skalka (mskalka) wrote :

Paul,

We explicitly do not enable retries in testing so as to catch issues like this. That has been a policy in SQA for at least two years now. We cannot accept a workaround that involves enabling hook retries.

Revision history for this message
Paul Goins (vultaire) wrote :

@Michael: Thanks for your reply, and understood.

@All: With regards to 1835410 (which is the issue which remains when fan networking is disabled, and is what the retry workaround addresses): I'll comment further on that ticket, so as to keep these issues separate. I'd like to keep this issue focused on fan networking.

summary: - Elasticsaerch fails on peer-relation-joined due to health check
+ Elasticsearch fails on peer-relation-joined due to health check
summary: - Elasticsearch fails on peer-relation-joined due to health check
+ Elasticsearch fails on peer-relation-joined due to health check due to ES
+ trying to use FAN IPs, which aren't open on the firewall
Xav Paice (xavpaice)
Changed in charm-elasticsearch:
assignee: nobody → Xav Paice (xavpaice)
status: Triaged → In Progress
Revision history for this message
Xav Paice (xavpaice) wrote :

The ufw setup is activated by `ufw: rule=allow src={{ lookup('dns', item['private-address']) }} port=9200 proto=tcp` for each client relation. If the client has several addresses and private-address returns the "wrong" one, then this will not open the firewall for a connection from the address the client actually uses. The rule for port 9300 is very similar.
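
For reference, each rendered rule is roughly equivalent to the following ufw invocation (a sketch; 10.0.8.41 stands in for a resolved client address):

$ sudo ufw allow proto tcp from 10.0.8.41 to any port 9200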

However, I've tried to reproduce this:

Elastic x 2 on bare machines, with fan network enabled, and client located in the fan network
Elastic x 1 on fan network, client on fan network

I've been unable to reproduce, using cs:elasticsearch-44 and Bionic.

Can you share the layout of the test runs that are failing so that I might be able to reproduce? Also, a current fail log would be good, as the link in earlier comments has expired.

Changed in charm-elasticsearch:
status: In Progress → Incomplete
Revision history for this message
Jose Guedez (jfguedez) wrote :

I was working on LP1881633, which I believe is a duplicate of this bug (I was working on it and only later discovered this one).

As some comments above mention, it is related to the handling of fan networking (or peer-to-peer connectivity over multiple interfaces in general). It seems to be caused by a mismatch between the Elasticsearch configuration (bind to all interfaces, with no peer IPs set in the discovery config until the relation propagates the data) and the firewall rules being set up by the charm.

The race occurs because Elasticsearch is restarted several times while the firewall rules are being added. If Elasticsearch is restarted after the firewall rules are in place, peer traffic will be blocked, leading to the timeouts.
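
On an affected unit, the mismatch can be checked by comparing the addresses Elasticsearch is bound to against what the firewall permits (a sketch using standard tools):

$ sudo ss -tlnp | grep -E '9200|9300'
$ sudo ufw status numbered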

This issue can be reproduced reliably by setting up a cluster with multiple nodes (in a model with a fan network configured) and restarting Elasticsearch after the model settles:

1- Create the model, configure fan networking, and deploy ES:

juju add-model elasticsearch
juju model-config container-networking-method=fan
juju model-config fan-config=10.0.8.0/24=252.0.0.0/8 # adjust based on environment
juju deploy cs:elasticsearch-45 -n 3

2- Wait for the model to fully settle, then restart Elasticsearch:

juju run --app elasticsearch -- systemctl restart elasticsearch

3- After the update hooks run, multiple units will block (or form single-unit clusters), especially if there is an index with multiple shards:

juju status | grep blocked
elasticsearch/0* blocked idle 0 10.0.8.93 9200/tcp elasticsearch is reporting problems with local host - please check health
elasticsearch/1 blocked idle 1 10.0.8.41 9200/tcp elasticsearch is reporting problems with local host - please check health
elasticsearch/2 blocked idle 2 10.0.8.2 9200/tcp elasticsearch is reporting problems with local host - please check health

Configuring ES to bind only to the interface whitelisted in the firewall would solve this issue. However, as far as I can tell it was set up this way to deal with LP1714126, so that would be considered a regression. I will discuss the options with the team (we probably need to properly support spaces, which might not be straightforward). I will also raise the question of severity, given the sensitivity to service restarts.
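
For illustration only (as noted, this would regress LP1714126), the charm-rendered /etc/elasticsearch/elasticsearch.yml would need to pin the bind address to the whitelisted interface, along these lines (10.0.8.93 is a placeholder for the unit's own private address):

network.host: 10.0.8.93
transport.host: 10.0.8.93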

Changed in charm-elasticsearch:
status: Incomplete → Confirmed
Jose Guedez (jfguedez)
Changed in charm-elasticsearch:
assignee: Xav Paice (xavpaice) → Jose Guedez (jfguedez)
Jose Guedez (jfguedez)
Changed in charm-elasticsearch:
status: Confirmed → In Progress
Revision history for this message
Jose Guedez (jfguedez) wrote :

After some discussions with other members of the team, the consensus was that the Elasticsearch charm should not really be adding and updating firewall rules.

If the user needs this functionality (due to e.g. security concerns), it should be managed outside of the charm. In any case, the current firewall rule logic is not very comprehensive and doesn't handle multiple interfaces properly, causing issues like this bug.

The ideal solution would be to use spaces, but that is not supported by the charm (see LP1714126).

Revision history for this message
Michael Skalka (mskalka) wrote :
Jose Guedez (jfguedez)
Changed in charm-elasticsearch:
assignee: Jose Guedez (jfguedez) → nobody
Changed in charm-elasticsearch:
status: In Progress → Fix Committed
Celia Wang (ziyiwang)
Changed in charm-elasticsearch:
status: Fix Committed → Fix Released