Activity log for bug #1882927

Date Who What changed Old value New value Message
2020-06-10 10:34:03 Bogdan Dobrelya bug added bug
2020-06-10 10:36:35 Bogdan Dobrelya tripleo: status New → In Progress
2020-06-10 10:36:39 Bogdan Dobrelya tripleo: milestone victoria-1
2020-06-10 10:36:40 Bogdan Dobrelya tripleo: importance Undecided → High
2020-06-10 10:36:42 Bogdan Dobrelya tripleo: assignee Bogdan Dobrelya (bogdando)
2020-06-17 15:21:10 Bogdan Dobrelya description

Old description:

This is a summary from the "noisy neighbors" related issue https://bugzilla.redhat.com/show_bug.cgi?id=1779407, whose scope is reduced here to avoiding traffic spikes in the face of a single controller node failure.

An example scenario for inbound AMQP traffic:

"... - Not a lot of traffic to node3, must not have as many active client connections?
- node2 is handling most of the client traffic
- both node1 and node2 have a traffic spike at xx:xx:30; node3 has one shortly before at xx:xx:21, although not as much throughput as the other two. ~250 KBps for node3, vs. ~9 MBps on node1 and ~2 MBps on node2.
As for the clustering traffic, there is a clear spike at xx:xx:31, and then things go quiet for about 30 seconds. They've got net_ticktime set to 30 seconds, so that makes sense. After it times out, the cluster partitions and we see the behavior noted in the rabbitmq logs 30 seconds after the stream drops out. I don't have a real good handle, good or bad, about a ballpark ~10 MBps AMQP input into the cluster ..."

A classical "noisy neighbor" problem may be caused by things like co-locating OVN networker roles on the controllers, AND by a suboptimal distribution of client connections among the controllers. The latter is what should be tweaked via the haproxy configs for the API services configured in TripleO.

New description: identical to the old description, with the following section appended:

---

Another related issue is https://bugzilla.redhat.com/show_bug.cgi?id=1844357, where all Heat API backends become marked as down because of long-running API requests and the missing TCP keepalive (TCP-KA) option in HAProxy:

xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-1.internalapi is DOWN, reason: Layer7 timeout, check duration: 10001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-2.internalapi is DOWN, reason: Layer7 timeout, check duration: 10001ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-0.internalapi is DOWN, reason: Layer7 timeout, check duration: 10001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: proxy heat_api has no server available!
xx:13:55 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-1.internalapi is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
xx:13:55 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-2.internalapi is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 2ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
xx:13:56 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-0.internalapi is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

In the end, the aforementioned suboptimal round-robin distribution, multiplied by the cascading failures of backends, makes the situation even worse. Leastconn should alleviate the unequal distribution of client sessions in such a scenario (see the illustrative haproxy sketch after this log).
2020-06-17 15:21:25 Bogdan Dobrelya tags queens-backport-potential train-backport-potential ussuri-backport-potential
2020-06-17 15:23:41 Bogdan Dobrelya summary suboptimal haproxy LB strategy for API services might produce traffic spikes and cascading failures → suboptimal haproxy LB strategy for API services with longrunning requests might produce traffic spikes and cascading failures
2020-06-17 16:27:10 Emilien Macchi tags queens-backport-potential train-backport-potential ussuri-backport-potential → ci promotion-blocker queens-backport-potential train-backport-potential ussuri-backport-potential
2020-06-17 16:27:14 Emilien Macchi tripleo: importance High → Critical
2020-06-17 16:27:19 Emilien Macchi tripleo: importance Critical → High
2020-06-17 16:27:29 Emilien Macchi tags ci promotion-blocker queens-backport-potential train-backport-potential ussuri-backport-potential → queens-backport-potential train-backport-potential ussuri-backport-potential
2020-07-28 12:46:40 Emilien Macchi tripleo: milestone victoria-1 → victoria-3
2020-07-30 03:19:14 OpenStack Infra tripleo: status In Progress → Fix Released
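
For illustration only, here is a minimal sketch of the kind of haproxy backend configuration the bug description argues for: "balance leastconn" instead of round-robin, plus TCP keepalive ("option tcpka") so that long-running API requests do not leave otherwise-idle connections to be silently dropped. The listener name, addresses, ports and timeout values below are hypothetical placeholders; in a TripleO deployment the actual haproxy configuration is rendered by the deployment tooling, not written by hand.

    # Hypothetical heat_api listener; names, IPs, ports and timeouts are
    # placeholders, not the values TripleO actually generates.
    listen heat_api
      bind 172.16.2.10:8004 transparent
      mode http
      # pick the backend with the fewest active sessions instead of roundrobin
      balance leastconn
      # enable TCP keepalive so long-running requests keep their
      # otherwise-idle connections open
      option tcpka
      # Layer7 health check, matching the "Layer7 timeout / Layer7 check
      # passed" messages in the log excerpt above
      option httpchk
      timeout client 10m
      timeout server 10m
      server overcloud-ctrl-0.internalapi 172.16.2.11:8004 check fall 5 inter 2000 rise 2
      server overcloud-ctrl-1.internalapi 172.16.2.12:8004 check fall 5 inter 2000 rise 2
      server overcloud-ctrl-2.internalapi 172.16.2.13:8004 check fall 5 inter 2000 rise 2

With leastconn, new sessions are steered to the backend currently serving the fewest connections, which counters the unequal client-session distribution described in the bug; tcpka only keeps idle TCP connections alive and does not by itself raise HTTP-level timeouts for slow Heat API calls.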