During a VIP failover, services colocated with the VIP are slow to recover
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Undecided | Damien Ciabrini |
Bug Description
In HA OpenStack deployments, OpenStack services on controller nodes access the database through a virtual IP (VIP). HAProxy listens on the VIP and forwards traffic to the Galera nodes running on the controllers.
When an OpenStack service runs on the same node that currently holds the VIP, a connection to the VIP produces a TCP socket whose source IP and destination IP are both the VIP. This becomes a problem when the VIP fails over to another controller node _while_ there are packets in the socket's Send-Q at kernel level. TCP keepalive does not apply in that state; instead the persist timer kicks in. The kernel eventually returns a "connection timed out" error to the OpenStack service, but only after a very long time (more than 10 minutes with default settings). During this period, the OpenStack service does not create a new connection and is marked as "down" on the controller.
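The source-address selection described above can be reproduced without a VIP. The sketch below is illustrative only: it assumes a Linux-like host and uses the loopback address as a stand-in for the VIP, since both are addresses local to the node. A client that connects to a local address without binding its own source gets that same address chosen as its source by the kernel. The helper name `local_and_peer_addresses` is hypothetical.

```python
import socket


def local_and_peer_addresses(listen_ip: str = "127.0.0.1") -> tuple[str, str]:
    """Connect to a listener on a node-local address and return the client
    socket's (source IP, destination IP).

    `listen_ip` stands in for the VIP (assumption: loopback is used here
    because a real VIP is not available in this sketch).
    """
    # Listener playing the role of HAProxy bound to the VIP.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((listen_ip, 0))  # port 0: let the kernel pick a free port
        server.listen(1)
        port = server.getsockname()[1]

        # Client playing the role of the OpenStack service. No explicit
        # bind, so the kernel selects the source address on its own.
        with socket.create_connection((listen_ip, port)) as client:
            conn, _ = server.accept()
            with conn:
                return client.getsockname()[0], client.getpeername()[0]


if __name__ == "__main__":
    src, dst = local_and_peer_addresses()
    print(f"source={src} destination={dst}")
```

Both addresses come back identical, which is the situation the report describes: once the destination address moves to another node, the source address of the existing socket moves with it.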
To prevent such sockets from being created, TripleO should configure the database connection settings to bind the client's source address to the controller's own network interface (NIC) instead of letting the kernel pick the VIP. This is possible with the latest upstream version of PyMySQL.
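One way to express this setting is through the database connection URL. The sketch below is a hedged illustration, not the fix that was merged: it assumes an SQLAlchemy-style `mysql+pymysql://` URL, and it assumes that query parameters are forwarded to PyMySQL's `connect()` as keyword arguments, so that `bind_address` reaches the driver. The helper `db_connection_url` and all addresses (from the 192.0.2.0/24 documentation range) are hypothetical.

```python
from urllib.parse import quote, urlencode


def db_connection_url(user: str, password: str, vip: str,
                      database: str, bind_address: str) -> str:
    """Build a hypothetical mysql+pymysql URL that connects to the VIP but
    pins the client's source address to the controller's own NIC IP.

    Assumption: the URL consumer passes `bind_address` through to PyMySQL.
    """
    # Percent-encode credentials so characters like '@' or ':' stay safe.
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    query = urlencode({"bind_address": bind_address})
    return f"mysql+pymysql://{creds}@{vip}/{database}?{query}"


if __name__ == "__main__":
    # VIP 192.0.2.10, controller NIC 192.0.2.21 (both hypothetical).
    print(db_connection_url("nova", "secret", "192.0.2.10", "nova", "192.0.2.21"))
```

With the source pinned to the NIC address, a VIP failover no longer changes the socket's source IP, so the remote peer can reset the stale connection and the service can reconnect promptly. When PyMySQL is used directly, the equivalent is passing `bind_address=` to `pymysql.connect()` in versions that support that parameter.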
Changed in tripleo: assignee: nobody → Damien Ciabrini (dciabrin)
Fix proposed to branch: master
Review: https://review.openstack.org/414629