2020-06-17 15:21:10 |
Bogdan Dobrelya |
description |
This is a summary of the "noisy neighbors" issue https://bugzilla.redhat.com/show_bug.cgi?id=1779407, whose scope is narrowed here to avoiding traffic spikes when a single controller node fails:
An example scenario for inbound AMQP traffic:
"... - Not a lot of traffic to node3, must not have as many active client connections?
- node2 is handling most of the client traffic
- both node1 and node2 have a traffic spike at xx:xx:30; node3 has one shortly before at xx:xx:21 although not as much throughput as the other two. ~250 KBps for node3, vs. ~9 MBps on node1 and ~2 MBps on node2.
As for the clustering traffic, there is a clear spike at xx:xx:31, and then things go quiet for about 30 seconds. They've got net_ticktime set to 30 seconds, so that makes sense. After it times out, the cluster partitions and we see the behavior noted in the rabbitmq logs 30 seconds after the stream drops out.
I don't have a real good handle, good or bad, about a ballpark ~10 MBps AMQP input into the cluster ..."
A classic "noisy neighbor" problem may be caused by things like co-locating OVN networker roles on the controllers, as well as by a suboptimal distribution of connections among the controllers. The latter should be tuned through the HAProxy configuration for the API services that TripleO sets up. |
---
Another related issue is https://bugzilla.redhat.com/show_bug.cgi?id=1844357, where all Heat API backends are marked as down because of long-running API requests and a missing TCP-KA (TCP keepalive) option in HAProxy:
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-1.internalapi is DOWN, reason: Layer7 timeout, check duration: 10001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-2.internalapi is DOWN, reason: Layer7 timeout, check duration: 10001ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-0.internalapi is DOWN, reason: Layer7 timeout, check duration: 10001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: proxy heat_api has no server available!
xx:13:55 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-1.internalapi is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
xx:13:55 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-2.internalapi is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 2ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
xx:13:56 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-0.internalapi is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
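The window during which heat_api had no backend available can be read off the log above. A minimal sketch of that calculation follows; the sample lines are abridged from the log, the helper name `outage_windows` is hypothetical, and it assumes all timestamps fall within the same (elided) hour and that all three backends were up before the first event:

```python
import re

# Abridged copies of the HAProxy log lines quoted above (hour is elided as "xx").
LOG = """\
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-1.internalapi is DOWN, reason: Layer7 timeout
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-2.internalapi is DOWN, reason: Layer7 timeout
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-0.internalapi is DOWN, reason: Layer7 timeout
xx:12:09 overcloud-ctrl-0 haproxy[13]: proxy heat_api has no server available!
xx:13:55 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-1.internalapi is UP, reason: Layer7 check passed
xx:13:55 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-2.internalapi is UP, reason: Layer7 check passed
xx:13:56 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-0.internalapi is UP, reason: Layer7 check passed
"""

# Assumption: these are the only backends of the heat_api proxy.
SERVERS = [
    "heat_api/overcloud-ctrl-0.internalapi",
    "heat_api/overcloud-ctrl-1.internalapi",
    "heat_api/overcloud-ctrl-2.internalapi",
]

EVENT = re.compile(r"^xx:(\d\d):(\d\d) \S+ haproxy\[\d+\]: Server (\S+) is (UP|DOWN)")


def outage_windows(lines, servers):
    """Return (start, end) pairs, in seconds past the hour, with zero backends up."""
    up = set(servers)  # assumption: everything is up before the first event
    windows = []
    start = None
    for line in lines:
        match = EVENT.match(line)
        if not match:
            continue  # e.g. the "proxy ... has no server available!" line
        minute, second, server, state = match.groups()
        now = int(minute) * 60 + int(second)
        if state == "DOWN":
            up.discard(server)
            if not up and start is None:
                start = now  # the last backend just went down
        else:
            up.add(server)
            if start is not None:
                windows.append((start, now))  # first backend is back
                start = None
    return windows


for begin, end in outage_windows(LOG.splitlines(), SERVERS):
    print(f"no backend available for {end - begin} seconds")
```

For the excerpt above this reports a single window of 106 seconds, from xx:12:09 to xx:13:55.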
In the end, the aforementioned suboptimal roundrobin distribution, compounded by cascading backend failures, makes the situation even worse. Leastconn should alleviate the unequal distribution of client sessions in such a scenario. |
|
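To illustrate why leastconn may help when some API requests are long-running, here is a toy model (not HAProxy's actual implementation). It assumes three equally weighted backends, one new request per tick, and a deliberately unfavorable pattern in which every third request is long-running; the function name `simulate` and all parameters are hypothetical:

```python
def simulate(policy, n_backends=3, n_requests=900, long_every=3, long_ticks=300):
    """Return the peak number of concurrent sessions seen on each backend."""
    active = [[] for _ in range(n_backends)]  # end ticks of open sessions
    peak = [0] * n_backends
    for tick in range(n_requests):
        # Close sessions that have finished by this tick.
        for b in range(n_backends):
            active[b] = [end for end in active[b] if end > tick]
        if policy == "roundrobin":
            # Equal weights: simply cycle through the backends.
            target = tick % n_backends
        elif policy == "leastconn":
            # Pick the backend with the fewest open sessions (lowest index on ties).
            target = min(range(n_backends), key=lambda b: len(active[b]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        # Every long_every-th request is long-running, the rest last one tick.
        duration = long_ticks if tick % long_every == 0 else 1
        active[target].append(tick + duration)
        peak[target] = max(peak[target], len(active[target]))
    return peak


print("roundrobin peaks:", simulate("roundrobin"))
print("leastconn peaks: ", simulate("leastconn"))
```

In this model roundrobin sends every long-running request to the same backend, so its session count keeps growing while the other two stay nearly idle; leastconn spreads the long-running sessions across all three. Real traffic is less regular, so the imbalance under roundrobin would normally be less extreme than in this sketch.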