Comment 4 for bug 1917068

Michal Arbet (michalarbet) wrote : Re: [Bug 1917068] Re: Connections to DB are refusing to die after VIP is switched

Well,

I've tested a lot of keepalived configurations and nothing fixed the issue.

What I proposed is the standard way this should be fixed ... check the
subjects I've attached...

On Mon, 1 Mar 2021 at 12:20, Mark Goddard <email address hidden>
wrote:

> Thinking about the GARP. If the NIC was bounced, then it might not see
> the GARP from the new master. There do seem to be options to tune GARP
> transmission:
> https://serverfault.com/questions/821809/keepalived-send-gratuitous-arp-periodically
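>
> For example, something like this in keepalived's global_defs (a sketch
> based on the options discussed in that thread; the values are
> illustrative, not tested here):
>
>     global_defs {
>         vrrp_garp_master_repeat 5     # GARPs sent on each transition to MASTER
>         vrrp_garp_master_refresh 10   # keep re-sending GARPs every 10s while MASTER
>     }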
>
> Title:
> Connections to DB are refusing to die after VIP is switched
>
> Status in kolla-ansible:
> In Progress
> Status in kolla-ansible train series:
> New
> Status in kolla-ansible ussuri series:
> New
> Status in kolla-ansible victoria series:
> New
> Status in kolla-ansible wallaby series:
> In Progress
>
> Bug description:
> Hi,
>
>   On a production kolla-ansible environment we found strange behaviour
>   when switching the VIP between controllers (under load).
>   When the VIP is switched from the master keepalived to the backup,
>   connections to the DB are dead on the host where the VIP was before the
>   switch (the keystone wsgi workers are all busy, waiting for a DB reply).
>
>
> Test env:
>   - 2 controllers - haproxy, keepalived, OpenStack services, DB, etc.
> - 2 Computes
>
> How to reproduce:
>
>   1. Generate as much traffic as you can to replicate the issue (curl
>   token issues against keystone VIP:5000; see the example after this list)
>   2. Check the keystone logs (there will be a large number of 201s on both
>   controllers)
>   3. Restart keepalived OR restart networking OR ifdown/ifup the interface
>   on the current keepalived master
>   (the VIP will be switched to the secondary host)
>   4. Check the keystone logs again
>   5. You can see that the keystone access log is frozen (on the host where
>   the VIP was before); after a while there will be 503s and 504s
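>
>   A token-issue request for step 1 might look like this (a sketch; VIP,
>   the user name and the password are placeholders):
>
>       curl -s -o /dev/null -w "%{http_code}\n" \
>         -H "Content-Type: application/json" \
>         -d '{"auth": {"identity": {"methods": ["password"],
>              "password": {"user": {"name": "admin",
>              "domain": {"name": "Default"}, "password": "secret"}}}}}' \
>         http://VIP:5000/v3/auth/tokens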
>
>   Why is this happening?
>
>   Normally, when the master keepalived is not reachable, the secondary
>   keepalived takes over the VIP and sends a GARP to the network, and all
>   clients refresh their ARP tables, so everything should work.
>
>   The problem is that the wsgi processes have a connection pool to the DB,
>   and these connections are dead after the switch; they don't know that
>   the ARP mapping changed (probably the host ignored the GARP because
>   there is a very tiny window when the VIP was still assigned to it).
>
>   So the wsgi processes write to the file descriptor/socket of the DB
>   connection, but wait for a reply forever. Simply said, these
>   connections are totally dead, and the app layer can't fix it, because
>   the app layer (oslo.db/sqlalchemy) doesn't know they are broken.
>
>   The above problem solves itself after some time -> how long depends on
>   the kernel option net.ipv4.tcp_retries2, which controls how many
>   retransmissions are sent on a TCP connection before the kernel kills it.
>   In my case it was around 930-940 seconds every time I tried it
>   (default value of net.ipv4.tcp_retries2=15). Of course the
>   retransmissions can never succeed, as the VIP is gone and hosted by
>   another host/MAC.
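>
>   For reference, this matches what the exponential retransmission backoff
>   predicts. A rough estimate in Python, assuming the usual Linux
>   constants TCP_RTO_MIN=200ms and TCP_RTO_MAX=120s and ignoring RTT
>   adaptation:
>
>       RTO_MIN, RTO_MAX = 0.2, 120.0
>
>       def retrans_timeout(tcp_retries2):
>           # Sum the exponentially backed-off RTOs, each capped at RTO_MAX.
>           return sum(min(RTO_MIN * 2 ** k, RTO_MAX)
>                      for k in range(tcp_retries2 + 1))
>
>       print(retrans_timeout(15))  # ~924.6 s, close to the 930-940 s observed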
>
>   Decreasing tcp_retries2 to 1 fixed the issue immediately.
>
>   Here is a detailed article about TCP sockets which refuse to die ->
>   https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
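>
>   The article also describes a per-socket alternative, TCP_USER_TIMEOUT,
>   which bounds how long written data may stay unacknowledged on a single
>   connection. A minimal sketch in Python (Linux-only; the 10 s value is
>   illustrative):
>
>       import socket
>
>       s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>       # Kill the connection if written data stays unacknowledged for 10 s,
>       # instead of retransmitting until tcp_retries2 is exhausted.
>       s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)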
>
>   Red Hat also suggests tuning this kernel option for HA solutions, as
>   noted here -> https://access.redhat.com/solutions/726753
>
> "In a High Availability (HA) situation consider decreasing the setting
> to 3." << From RedHat
>
>
>   Here is also a video of the issue (left: controller0, right:
>   controller1, bottom: logs, middle: VIP switch monitor)
>
> https://download.kevko.ultimum.cloud/video_debug.mp4
>
>   I will provide a fix and push it for review.