inadequate HAproxy health check for Galera
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
kolla-ansible | Triaged | Medium | Unassigned |
Bug Description
Hi,
we are using kolla-ansible 7.0.0rc1 to deploy Rocky on Ubuntu 18.04. All Kolla containers are built with base "ubuntu" and type "source".
The HAProxy configuration for Galera looks like this:
listen mariadb
mode tcp
timeout client 3600s
timeout server 3600s
option tcplog
option tcpka
option mysql-check user haproxy post-41
bind 10.10.10.5:3306
server hosta 10.10.10.11:3306 check inter 2000 rise 2 fall 5
server hostb 10.10.10.12:3306 check inter 2000 rise 2 fall 5 backup
server hostc 10.10.10.13:3306 check inter 2000 rise 2 fall 5 backup
server hostd 10.10.10.14:3306 check inter 2000 rise 2 fall 5 backup
server hoste 10.10.10.15:3306 check inter 2000 rise 2 fall 5 backup
But "mysql-check" is not a good way to health-check Galera nodes. The check succeeds even if a Galera node is in a state such as "Joining" or "Waiting on SST", because it only performs a MySQL login test.
The wsrep status can be: Joining, Waiting on SST, Joined, Synced or Donor.
Only:
"wsrep_local_state = 4" ("wsrep_
means a node is healthy.
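To make the rule concrete, here is a minimal Python sketch of the health decision. The numeric codes are the ones the Galera documentation lists for wsrep_local_state; the names WSREP_STATES, describe_state and is_node_healthy are hypothetical and only serve the illustration.

```python
# Sketch only: maps wsrep_local_state codes to names and applies the
# "only Synced is healthy" rule from this report.
# All names below are hypothetical, chosen for illustration.
WSREP_STATES = {
    1: "Joining",
    2: "Donor/Desynced",
    3: "Joined",
    4: "Synced",
}


def describe_state(wsrep_local_state: int) -> str:
    """Return a readable name for a wsrep_local_state code."""
    return WSREP_STATES.get(wsrep_local_state, "Unknown")


def is_node_healthy(wsrep_local_state: int) -> bool:
    """A node should receive traffic only when it is Synced (state 4)."""
    return wsrep_local_state == 4
```

A plain login test cannot tell these states apart, which is why a node in state 1 or 2 still passes "mysql-check".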
With HAProxy there is no "clean" method to check Galera. We use a Perl script/daemon that checks the node and returns HTTP 200 as long as the node is "Synced". This daemon listens on a dedicated port, which HAProxy then uses for an "http-check".
Another option is to use xinetd and a shell script.
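For illustration, such a check daemon could look roughly like the Python sketch below. This is not our Perl script, just a minimal stand-in built on the standard library. The get_state callable is an assumption: in a real deployment it would query the local node (for example with SHOW STATUS LIKE 'wsrep_local_state') and return the integer value. Port 9200 and all function names are likewise hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

SYNCED = 4  # wsrep_local_state value for a Synced node


def make_handler(get_state):
    """Build a request handler that reports health based on get_state().

    get_state is a hypothetical callable returning the node's current
    wsrep_local_state as an int; how it queries MariaDB is left open.
    """

    class GaleraCheckHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self._respond()

        def do_OPTIONS(self):
            # HAProxy's plain "option httpchk" sends OPTIONS by default.
            self._respond()

        def _respond(self):
            try:
                state = get_state()
            except Exception:
                state = None  # treat any query failure as unhealthy
            healthy = state == SYNCED
            body = b"Galera node is synced\n" if healthy else b"Galera node is not synced\n"
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep the frequent health checks out of the log

    return GaleraCheckHandler


def serve(get_state, host="0.0.0.0", port=9200):
    """Run the check daemon until interrupted (port 9200 is an assumption)."""
    HTTPServer((host, port), make_handler(get_state)).serve_forever()
```

On the HAProxy side, the idea would be to replace "option mysql-check" with "option httpchk" and to add "port 9200" to the "check" settings on each server line, so that the health check goes to the daemon while traffic still goes to 3306.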
I don't like either of those options and would recommend using ProxySQL.
All the best,
Flo
Interesting idea with that ProxySQL.
We actually do more than a bare mysql-check: the haproxy user is dynamically enabled and disabled via wsrep-notify.sh whenever the node state changes. It is broken, though: during the initial JOINER transition the script cannot connect to the local MariaDB, so we usually never disable the haproxy user when we need to. I guess someone tested it only on a working cluster and not on boot. Race conditions were not considered either.
That said, we might switch either to the other check method or even to ProxySQL. I would be glad to move MySQL away from HAProxy, because I suspect it is causing issues with HAProxy reporting to keepalived (which also has a nasty bug that causes further issues all over the board due to the VIP moving).