inadequate HAproxy health check for Galera

Bug #1796930 reported by Florian Engelmann
This bug affects 2 people
Affects: kolla-ansible
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: (not set)

Bug Description

Hi,

We are using kolla-ansible 7.0.0rc1 to deploy Rocky on Ubuntu 18.04. All Kolla containers are built with base ubuntu and type source.

The haproxy configuration for Galera looks like:

listen mariadb
  mode tcp
  timeout client 3600s
  timeout server 3600s
  option tcplog
  option tcpka
  option mysql-check user haproxy post-41
  bind 10.10.10.5:3306
  server hosta 10.10.10.11:3306 check inter 2000 rise 2 fall 5
  server hostb 10.10.10.12:3306 check inter 2000 rise 2 fall 5 backup
  server hostc 10.10.10.13:3306 check inter 2000 rise 2 fall 5 backup
  server hostd 10.10.10.14:3306 check inter 2000 rise 2 fall 5 backup
  server hoste 10.10.10.15:3306 check inter 2000 rise 2 fall 5 backup

But "mysql-check" is not a good option for health-checking Galera nodes. The check succeeds even if a Galera node is, e.g., in state "Joining" or "Waiting on SST", because it only performs a MySQL login test.

wsrep status can be: Joining, Waiting on SST, Joined, Synced or Donor

Only:
"wsrep_local_state = 4" ("wsrep_local_state_comment = Synced")

means a node is healthy.

With HAProxy there is no "clean" method to check Galera. We use a Perl script/daemon that checks the node and returns HTTP 200 as long as the node is "Synced". This daemon listens on a dedicated port, which HAProxy then uses for its "http-check".
Another option is to use xinetd and some shell script.
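
To illustrate the daemon approach described above, here is a minimal sketch in Python (not the Perl daemon we actually run). It answers HTTP 200 only while the node reports wsrep_local_state = 4 and 503 otherwise. The names `status_code_for`, `make_handler`, `build_server`, the `fetch_state` callable and the port 9200 are all hypothetical; a real `fetch_state` would have to query the local node (for example with SHOW STATUS LIKE 'wsrep_local_state') through whatever MySQL client library is available.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Optional

SYNCED = 4  # wsrep_local_state value that corresponds to "Synced"


def status_code_for(state: Optional[int]) -> int:
    """Map a wsrep_local_state value to the HTTP status HAProxy will see."""
    return 200 if state == SYNCED else 503


def make_handler(fetch_state: Callable[[], int]):
    """Build a handler class around a (hypothetical) state-fetching callable."""

    class GaleraCheckHandler(BaseHTTPRequestHandler):
        def _respond(self):
            try:
                state = fetch_state()
            except Exception:
                state = None  # an unreachable database counts as unhealthy
            body = f"wsrep_local_state={state}\n".encode()
            self.send_response(status_code_for(state))
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        # Answer both GET and OPTIONS so either check method works.
        do_GET = _respond
        do_OPTIONS = _respond

        def log_message(self, format, *args):
            pass  # stay quiet; health checks arrive every few seconds

    return GaleraCheckHandler


def build_server(fetch_state: Callable[[], int],
                 host: str = "127.0.0.1", port: int = 9200) -> HTTPServer:
    """Create the check server; the caller decides when to serve_forever()."""
    return HTTPServer((host, port), make_handler(fetch_state))
```

On the HAProxy side this would presumably replace "option mysql-check" with "option httpchk", and each server line would get "check port 9200" (or whatever port the daemon listens on), so that traffic still goes to 3306 while the health check goes to the daemon.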

I don't like any of those options and would recommend using ProxySQL.

All the best,
Flo

Tags: rocky
Radosław Piliszek (yoctozepto) wrote :

Interesting idea with that ProxySQL.

We actually do more than a bare mysql-check: the user it logs in as is dynamically enabled or disabled via wsrep-notify.sh whenever the node state changes. It is broken, though: during the initial JOINER transition the script cannot connect to the local mariadb, so the haproxy user usually never gets disabled when it needs to be. I guess someone tested it only on a working cluster and not on boot. Race conditions were not considered either.

That said, we might go with either the other check method or even ProxySQL. I would be glad to move mysql away from haproxy because I suspect it of causing issues with haproxy reporting to keepalived (which also has a nasty bug that causes extra issues across the board due to the VIP moving).

Changed in kolla-ansible:
status: New → Triaged
importance: Undecided → Medium
Tom Fifield (fifieldt)
tags: added: rocky