Activity log for bug #1578752

Date Who What changed Old value New value Message
2016-05-05 17:45:34 Mark Casey bug added bug
2016-05-05 20:03:49 Mark Casey description changed. New value:

"If/when a MariaDB node/container ("node" from here on, though the container is meant) fails and is restarted, it runs a state transfer from a donor node to get back in sync with the cluster. The sync may be incremental (if the container has persistence on the datadir, /var/lib/mysql or similar) or, if too much data has changed in the interim, a complete retransfer of all data. Depending on the configuration, this may cause both the recovering node and the donor node to fail any queries sent to them. This is a problem because, although these nodes will not execute queries, they still allow incoming DB connections, which means HAProxy's mysql-check will not fail and the nodes will not be removed from the pool of healthy nodes. One solution is to run a separate HTTP daemon that has access, via a dedicated user, to check the MySQL status variable "wsrep_local_state" on each node. This daemon returns 200 OK if all is well and 503 Service Unavailable if the node in question is currently receiving from, or acting as a donor in, a Galera state transfer. That way, the only queries that fail are those HAProxy routes to the failed node between the time it fails and HAProxy's next call to this "checking daemon." This repo (https://github.com/olafz/percona-clustercheck) shows one such setup using xinetd, though it could obviously be done behind any HTTP daemon as a "sidecar container" for each DB node."

The old value differed only in a closing note, removed by this edit: "While I intend to work on a POC of this capability, I wanted to go ahead and document it, as it may be several weeks before I can do so. I have implemented this xinetd method in the past and would be happy to assist anyone else who wants to implement it."

(A sketch of this checking daemon and the matching HAProxy configuration follows the log below.)
2016-05-05 20:03:53 Mark Casey kolla: assignee → Mark Casey (mark-casey)
2016-05-09 19:56:55 Mark Casey description changed. The only edit: the statement that the affected nodes "will allow incoming DB connections" now reads "will (*LAST I CHECKED) allow incoming DB connections"; the description is otherwise unchanged from the 2016-05-05 version above.
2016-05-27 13:42:44 Ettore Simone bug added subscriber Ettore Simone
2016-06-03 16:06:31 OpenStack Infra kolla: status New → Fix Released
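
The checking daemon described in the bug is small enough to sketch. Below is a minimal, hypothetical Python version of the 200 OK / 503 check; the linked percona-clustercheck repo implements the same idea as a shell script behind xinetd. The port 9200 and the "clustercheck" user follow that repo's conventions but are assumptions here, as is the choice of PyMySQL as the client library. wsrep_local_state value 4 means Synced.

from http.server import BaseHTTPRequestHandler, HTTPServer

import pymysql  # assumption: any MySQL client library would do

WSREP_SYNCED = 4  # wsrep_local_state: 4 = Synced, 2 = Donor/Desynced

def node_is_synced():
    """Return True only if the local node reports wsrep_local_state = 4."""
    try:
        # "clustercheck" is a hypothetical dedicated read-only user.
        conn = pymysql.connect(host="127.0.0.1", user="clustercheck",
                               password="clustercheck-password")
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW STATUS LIKE 'wsrep_local_state'")
                row = cur.fetchone()  # (variable_name, value) or None
                return row is not None and int(row[1]) == WSREP_SYNCED
        finally:
            conn.close()
    except pymysql.Error:
        # If the node cannot even be queried, report it as unhealthy.
        return False

class ClusterCheckHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if node_is_synced():
            status, body = 200, b"Galera node is synced.\n"
        else:
            status, body = 503, b"Galera node is not synced.\n"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Port 9200 follows the percona-clustercheck convention.
    HTTPServer(("0.0.0.0", 9200), ClusterCheckHandler).serve_forever()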
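
On the HAProxy side, the change the bug implies is replacing "option mysql-check" with an HTTP check against the daemon's port, so that a 503 pulls the node from the pool. A hypothetical backend, with made-up names and addresses:

listen mariadb_cluster
    bind 10.0.0.10:3306
    balance leastconn
    option httpchk GET /
    server node1 10.0.0.11:3306 check port 9200 inter 2000 rise 2 fall 3
    server node2 10.0.0.12:3306 check port 9200 inter 2000 rise 2 fall 3
    server node3 10.0.0.13:3306 check port 9200 inter 2000 rise 2 fall 3

With this, a node that is receiving a state transfer or acting as a donor answers 503 on port 9200, fails the health check after "fall" consecutive misses, and stops receiving queries until it reports 200 OK again.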