Comment 4 for bug 1917068

Michal Arbet (michalarbet) wrote : Re: [Bug 1917068] Re: Connections to DB are refusing to die after VIP is switched

Well,

I've tested a lot of keepalived configurations and nothing fixed the issue.

What I proposed is the standard way this should be fixed ... check the
subjects I've attached...

On Mon, 1 Mar 2021 at 12:20, Mark Goddard <email address hidden>
wrote:

> Thinking about the GARP. If the NIC was bounced, then it might not see
> the GARP from the new master. There do seem to be options to tune GARP
> transmission:
> https://serverfault.com/questions/821809/keepalived-send-gratuitous-arp-periodically
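>
> For example, something like this in keepalived's global_defs (a sketch
> based on the options discussed in that thread; the values are
> illustrative, not tested here):
>
>     global_defs {
>         vrrp_garp_master_repeat 5     # GARPs sent on each transition to MASTER
>         vrrp_garp_master_refresh 10   # keep re-sending GARPs every 10s while MASTER
>     }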
>
> Title:
> Connections to DB are refusing to die after VIP is switched
>
> Status in kolla-ansible:
> In Progress
> Status in kolla-ansible train series:
> New
> Status in kolla-ansible ussuri series:
> New
> Status in kolla-ansible victoria series:
> New
> Status in kolla-ansible wallaby series:
> In Progress
>
> Bug description:
> Hi,
>
>   On a production kolla-ansible environment we found strange behaviour
>   when switching the VIP between controllers (under load).
>   When the VIP is switched from the master keepalived to the backup,
>   connections to the DB are dead on the host where the VIP was before the
>   switch (the keystone wsgi workers are all busy, waiting for a DB reply).
>
>
> Test env:
>   - 2 controllers - haproxy, keepalived, OpenStack services, DB, etc.
> - 2 Computes
>
> How to reproduce:
>
>   1. Generate as much traffic as you can to replicate the issue (curl
>   token issues against keystone VIP:5000; see the example after this list)
>   2. Check the keystone logs (there will be a large number of 201s on both
>   controllers)
>   3. Restart keepalived OR restart networking OR ifdown/ifup the interface
>   on the current keepalived master
>   (the VIP will be switched to the secondary host)
>   4. Check the keystone logs again
>   5. You can see that the keystone access log is frozen (on the host where
>   the VIP was before); after a while there will be 503s and 504s
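>
>   A token-issue request for step 1 might look like this (a sketch; VIP,
>   the user name and the password are placeholders):
>
>       curl -s -o /dev/null -w "%{http_code}\n" \
>         -H "Content-Type: application/json" \
>         -d '{"auth": {"identity": {"methods": ["password"],
>              "password": {"user": {"name": "admin",
>              "domain": {"name": "Default"}, "password": "secret"}}}}}' \
>         http://VIP:5000/v3/auth/tokens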
>
>   Why is this happening?
>
>   Normally, when the master keepalived is not reachable, the secondary
>   keepalived takes over the VIP and sends a GARP to the network, and all
>   clients refresh their ARP tables, so everything should work.
>
>   The problem is that the wsgi processes have a connection pool to the DB,
>   and these connections are dead after the switch; they don't know that
>   the ARP mapping changed (probably the host ignored the GARP because
>   there is a very tiny window when the VIP was still assigned to it).
>
>   So the wsgi processes write to the file descriptor/socket of the DB
>   connection, but wait for a reply forever. Simply said, these
>   connections are totally dead, and the app layer can't fix it, because
>   the app layer (oslo.db/sqlalchemy) doesn't know they are broken.
>
>   The above problem solves itself after some time -> how long depends on
>   the kernel option net.ipv4.tcp_retries2, which controls how many
>   retransmissions are sent on a TCP connection before the kernel kills it.
>   In my case it was around 930-940 seconds every time I tried it
>   (default value of net.ipv4.tcp_retries2=15). Of course the
>   retransmissions can never succeed, as the VIP is gone and hosted by
>   another host/MAC.
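>
>   For reference, this matches what the exponential retransmission backoff
>   predicts. A rough estimate in Python, assuming the usual Linux
>   constants TCP_RTO_MIN=200ms and TCP_RTO_MAX=120s and ignoring RTT
>   adaptation:
>
>       RTO_MIN, RTO_MAX = 0.2, 120.0
>
>       def retrans_timeout(tcp_retries2):
>           # Sum the exponentially backed-off RTOs, each capped at RTO_MAX.
>           return sum(min(RTO_MIN * 2 ** k, RTO_MAX)
>                      for k in range(tcp_retries2 + 1))
>
>       print(retrans_timeout(15))  # ~924.6 s, close to the 930-940 s observed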
>
>   Decreasing tcp_retries2 to 1 fixed the issue immediately.
>
>   Here is a detailed article about TCP sockets which refuse to die ->
>   https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
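>
>   The article also describes a per-socket alternative, TCP_USER_TIMEOUT,
>   which bounds how long written data may stay unacknowledged on a single
>   connection. A minimal sketch in Python (Linux-only; the 10 s value is
>   illustrative):
>
>       import socket
>
>       s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>       # Kill the connection if written data stays unacknowledged for 10 s,
>       # instead of retransmitting until tcp_retries2 is exhausted.
>       s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)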
>
>   Red Hat also suggests tuning this kernel option for HA solutions, as
>   noted here -> https://access.redhat.com/solutions/726753
>
> "In a High Availability (HA) situation consider decreasing the setting
> to 3." << From RedHat
>
>
>   Here is also a video of the issue (left: controller0, right:
>   controller1, bottom: logs, middle: VIP switch monitor)
>
> https://download.kevko.ultimum.cloud/video_debug.mp4
>
>   I will provide a fix and push it for review.