tripleo

Activity log for bug #1882927

Date	Who	What changed	Old value	New value	Message
2020-06-10 10:34:03	Bogdan Dobrelya	bug			added bug
2020-06-10 10:36:35	Bogdan Dobrelya	tripleo: status	New	In Progress
2020-06-10 10:36:39	Bogdan Dobrelya	tripleo: milestone		victoria-1
2020-06-10 10:36:40	Bogdan Dobrelya	tripleo: importance	Undecided	High
2020-06-10 10:36:42	Bogdan Dobrelya	tripleo: assignee		Bogdan Dobrelya (bogdando)
2020-06-17 15:21:10	Bogdan Dobrelya	description	This is a summary from the "noisy neighbors" related issue https://bugzilla.redhat.com/show_bug.cgi?id=1779407, which scope's reduced here into avoiding traffic spikes in the face of a single controller node failures: An example scenario for inbound AMQP traffic: "... - Not a lot of traffic to node3, must not have as many active client connections? - node2 is handling most of the client traffic - both node1 and node2 have a traffic spike at xx:xx:30; node3 has one shortly before at xx:xx:21 although not as much throughput as the other two. ~250 KBps for node3, vs. ~9 MBps on node1 and ~2 MBps on node2. As for the clustering traffic, there is a clear spike at xx:xx:31, and then things go quiet for about 30 seconds. They've got net_ticktime set to 30 seconds, so that makes sense. After it times out, the cluster partitions and we see the behavior noted in the rabbitmq logs 30 seconds after the stream drops out. I don't have a real good handle, good or bad, about a ballpark ~10 MBps AMQP input into the cluster ..." A classical "noisy neighbor" problem may be caused by things, like co-locating OVN networker roles for controllers, AND suboptimal connections distribution among the controllers as well. The latter is that should be tweaked with haproxy configs for API services configured in TripleO.	This is a summary from the "noisy neighbors" related issue https://bugzilla.redhat.com/show_bug.cgi?id=1779407, which scope's reduced here into avoiding traffic spikes in the face of a single controller node failures: An example scenario for inbound AMQP traffic: "... - Not a lot of traffic to node3, must not have as many active client connections? - node2 is handling most of the client traffic - both node1 and node2 have a traffic spike at xx:xx:30; node3 has one shortly before at xx:xx:21 although not as much throughput as the other two. ~250 KBps for node3, vs. ~9 MBps on node1 and ~2 MBps on node2. 
As for the clustering traffic, there is a clear spike at xx:xx:31, and then things go quiet for about 30 seconds. They've got net_ticktime set to 30 seconds, so that makes sense. After it times out, the cluster partitions and we see the behavior noted in the rabbitmq logs 30 seconds after the stream drops out. I don't have a real good handle, good or bad, about a ballpark ~10 MBps AMQP input into the cluster ..." A classical "noisy neighbor" problem may be caused by things, like co-locating OVN networker roles for controllers, AND suboptimal connections distribution among the controllers as well. The latter is that should be tweaked with haproxy configs for API services configured in TripleO.
---
Another related issue is https://bugzilla.redhat.com/show_bug.cgi?id=1844357, where all Heat API backends become marked as down because of long running API requests and missing TCP-KA option in HAProxy:
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-1.internalapi is DOWN, reason: Layer7 timeout, check duration: 10001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-2.internalapi is DOWN, reason: Layer7 timeout, check duration: 10001ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-0.internalapi is DOWN, reason: Layer7 timeout, check duration: 10001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
xx:12:09 overcloud-ctrl-0 haproxy[13]: proxy heat_api has no server available!
xx:13:55 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-1.internalapi is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
xx:13:55 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-2.internalapi is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 2ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
xx:13:56 overcloud-ctrl-0 haproxy[13]: Server heat_api/overcloud-ctrl-0.internalapi is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
In the end, aforementioned suboptimal roundrobin distribution multiplied by the cascading failures of backends, makes the situation even worse. Leastconn should alleviate the unequal distribution of client sessions for such scenario
2020-06-17 15:21:25	Bogdan Dobrelya	tags		queens-backport-potential train-backport-potential ussuri-backport-potential
2020-06-17 15:23:41	Bogdan Dobrelya	summary	suboptimal haproxy LB strategy for API services might produce traffic spikes and cascading failures	suboptimal haproxy LB strategy for API services with longrunning requests might produce traffic spikes and cascading failures
2020-06-17 16:27:10	Emilien Macchi	tags	queens-backport-potential train-backport-potential ussuri-backport-potential	ci promotion-blocker queens-backport-potential train-backport-potential ussuri-backport-potential
2020-06-17 16:27:14	Emilien Macchi	tripleo: importance	High	Critical
2020-06-17 16:27:19	Emilien Macchi	tripleo: importance	Critical	High
2020-06-17 16:27:29	Emilien Macchi	tags	ci promotion-blocker queens-backport-potential train-backport-potential ussuri-backport-potential	queens-backport-potential train-backport-potential ussuri-backport-potential
2020-07-28 12:46:40	Emilien Macchi	tripleo: milestone	victoria-1	victoria-3
2020-07-30 03:19:14	OpenStack Infra	tripleo: status	In Progress	Fix Released
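
Illustration (not part of the recorded activity): the bug description argues that roundrobin distributes long-running API sessions unevenly and that leastconn should alleviate this. The toy simulation below is only a sketch of that reasoning under stated assumptions. It does not model HAProxy's actual schedulers, weights, or health checks, and the `simulate` function, the three-backend setup, and the request durations are all hypothetical.

```python
from itertools import cycle


def simulate(strategy, durations, backends=3):
    """Return the peak number of concurrent sessions seen on each backend.

    Toy model (an assumption, not HAProxy's real algorithm): exactly one
    request arrives per tick, and durations[i] is the number of ticks
    request i keeps its session open.
    """
    open_until = [[] for _ in range(backends)]  # end ticks of open sessions
    peak = [0] * backends
    rotation = cycle(range(backends))

    for tick, duration in enumerate(durations):
        # Close sessions that have ended by this tick.
        for b in range(backends):
            open_until[b] = [end for end in open_until[b] if end > tick]

        if strategy == "roundrobin":
            # Plain rotation, blind to how busy each backend is.
            target = next(rotation)
        elif strategy == "leastconn":
            # Backend with the fewest open sessions; ties go to the lowest index.
            target = min(range(backends), key=lambda b: len(open_until[b]))
        else:
            raise ValueError(f"unknown strategy: {strategy}")

        open_until[target].append(tick + duration)
        peak[target] = max(peak[target], len(open_until[target]))

    return peak


# Hypothetical workload: every third request is long-running (30 ticks),
# the others finish within one tick.
durations = [30, 1, 1] * 20

roundrobin_peak = simulate("roundrobin", durations)
leastconn_peak = simulate("leastconn", durations)

print("roundrobin peak sessions per backend:", roundrobin_peak)
print("leastconn peak sessions per backend: ", leastconn_peak)
```

With this particular workload the rotation happens to send every long-running request to the same backend, so its sessions pile up while the other two stay nearly idle. The leastconn variant spreads the open sessions across all three. Real traffic is less regular, so the gap seen here is an upper-end illustration rather than a measurement.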