Suboptimal haproxy LB strategy for API services with long-running requests might produce traffic spikes and cascading failures
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
tripleo | Fix Released | High | Bogdan Dobrelya | |
Bug Description
This is a summary from the "noisy neighbors" related issue https:/
An example scenario for inbound AMQP traffic:
"... - Not a lot of traffic to node3, must not have as many active client connections?
- node2 is handling most of the client traffic
- both node1 and node2 have a traffic spike at xx:xx:30; node3 has one shortly before at xx:xx:21 although not as much throughput as the other two. ~250 KBps for node3, vs. ~9 MBps on node1 and ~2 MBps on node2.
As for the clustering traffic, there is a clear spike at xx:xx:31, and then things go quiet for about 30 seconds. They've got net_ticktime set to 30 seconds, so that makes sense. After it times out, the cluster partitions and we see the behavior noted in the rabbitmq logs 30 seconds after the stream drops out.
I don't have a real good handle, good or bad, about a ballpark ~10 MBps AMQP input into the cluster ..."
A classical "noisy neighbor" problem may have several causes, such as co-locating OVN networker roles on the controllers and a suboptimal distribution of connections among the controllers. The latter should be addressed by tuning the haproxy configuration of the API services deployed by TripleO.
---
Another related issue is https:/
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server
heat_api/
check duration: 10001ms. 2 active and 0 backup servers left. 0
sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server
heat_api/
check duration: 10001ms. 1 active and 0 backup servers left. 0
sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server
heat_api/
check duration: 10001ms. 0 active and 0 backup servers left. 0
sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: proxy heat_api has no server
available!
xx:13:55 overcloud-ctrl-0 haproxy[13]: Server
heat_api/
passed, code: 200, info: "OK", check duration: 1ms. 1 active and 0
backup servers online. 0 sessions requeued, 0 total in queue.
xx:13:55 overcloud-ctrl-0 haproxy[13]: Server
heat_api/
passed, code: 200, info: "OK", check duration: 2ms. 2 active and 0
backup servers online. 0 sessions requeued, 0 total in queue.
xx:13:56 overcloud-ctrl-0 haproxy[13]: Server
heat_api/
passed, code: 200, info: "OK", check duration: 1ms. 3 active and 0
backup servers online. 0 sessions requeued, 0 total in queue.
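In the end, the aforementioned suboptimal roundrobin distribution, compounded by the cascading failures of backends, makes the situation even worse. Roundrobin hands out new connections in turn regardless of how many long-running sessions a backend already holds, whereas leastconn sends each new connection to the backend with the fewest active ones. Leastconn should therefore alleviate the unequal distribution of client sessions in such scenarios.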
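To illustrate the difference, here is a minimal toy simulation of the two balancing strategies. It is only a sketch under stated assumptions, not haproxy's actual implementation: the workload (one request per tick, every third one long-running for 30 ticks), the three unweighted backends, and the helper names `simulate`, `roundrobin` and `leastconn` are all hypothetical and chosen to make the effect visible.

```python
def simulate(pick, durations, n_backends=3):
    """Return the peak number of concurrent sessions seen on each backend.

    One request arrives per tick; durations[i] is how many ticks request i
    stays open on the backend returned by pick(active, i).
    """
    finish_times = [[] for _ in range(n_backends)]  # end ticks of open sessions
    peak = [0] * n_backends
    for tick, duration in enumerate(durations):
        # Drop sessions that have completed by this tick.
        for b in range(n_backends):
            finish_times[b] = [t for t in finish_times[b] if t > tick]
        active = [len(f) for f in finish_times]
        chosen = pick(active, tick)
        finish_times[chosen].append(tick + duration)
        peak[chosen] = max(peak[chosen], len(finish_times[chosen]))
    return peak


def roundrobin(active, i):
    # Ignores the current load: request i simply goes to backend i mod N.
    return i % len(active)


def leastconn(active, i):
    # Picks the backend with the fewest open sessions (ties: lowest index).
    return active.index(min(active))


# Assumed workload: every third request is long-running (30 ticks),
# all others complete within one tick.
durations = [30 if i % 3 == 0 else 1 for i in range(90)]

rr_peak = simulate(roundrobin, durations)
lc_peak = simulate(leastconn, durations)

print("roundrobin peak sessions per backend:", rr_peak)  # [10, 1, 1]
print("leastconn  peak sessions per backend:", lc_peak)
```

In this deliberately unfavourable workload, roundrobin stacks every long-running request on the same backend, while leastconn spreads them across all three. Real traffic is less regular, so the numbers are illustrative only.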
tags: | added: ci promotion-blocker |
Changed in tripleo: | |
importance: | High → Critical |
importance: | Critical → High |
tags: | removed: ci promotion-blocker |
Changed in tripleo: | |
milestone: | victoria-1 → victoria-3 |
https://review.opendev.org/728779