suboptimal haproxy LB strategy for API services with longrunning requests might produce traffic spikes and cascading failures

Bug #1882927 reported by Bogdan Dobrelya
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Bogdan Dobrelya
Milestone: victoria-3

Bug Description

This is a summary of the "noisy neighbors" related issue https://bugzilla.redhat.com/show_bug.cgi?id=1779407, whose scope is reduced here to avoiding traffic spikes in the face of a single controller node failure:

An example scenario for inbound AMQP traffic:

"... - Not a lot of traffic to node3, must not have as many active client connections?

- node2 is handling most of the client traffic

- both node1 and node2 have a traffic spike at xx:xx:30; node3 has one shortly before at xx:xx:21 although not as much throughput as the other two. ~250 KBps for node3, vs. ~9 MBps on node1 and ~2 MBps on node2.

As for the clustering traffic, there is a clear spike at xx:xx:31, and then things go quiet for about 30 seconds. They've got net_ticktime set to 30 seconds, so that makes sense. After it times out, the cluster partitions and we see the behavior noted in the rabbitmq logs 30 seconds after the stream drops out.

I don't have a real good handle, good or bad, about a ballpark ~10 MBps AMQP input into the cluster ..."

A classical "noisy neighbor" problem may be caused by several things, like co-locating OVN networker roles on the controllers, AND by a suboptimal distribution of connections among the controllers as well. The latter is what should be tweaked via the HAProxy configs for API services configured in TripleO, as sketched below.
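
For illustration only (the IPs, server names, and timeouts below are assumptions, not copied from a real deployment), a TripleO-rendered HAProxy section for an API service like heat_api looks roughly like this; note the default round-robin balancing and the absence of any TCP keep-alive option:

  listen heat_api
    bind 172.16.2.10:8004 transparent
    mode http
    balance roundrobin    # HAProxy default: rotates by turn, ignoring per-server load
    option httpchk        # Layer7 (HTTP) health check, as seen in the logs below
    timeout check 10s     # assumed; consistent with the 10001ms check durations below
    server overcloud-ctrl-0.internalapi 172.16.2.11:8004 check fall 5 inter 2000 rise 2
    server overcloud-ctrl-1.internalapi 172.16.2.12:8004 check fall 5 inter 2000 rise 2
    server overcloud-ctrl-2.internalapi 172.16.2.13:8004 check fall 5 inter 2000 rise 2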

---
Another related issue is https://bugzilla.redhat.com/show_bug.cgi?id=1844357, where all Heat API backends get marked as DOWN because of long-running API requests and a missing TCP keep-alive (TCP-KA) option in HAProxy:

xx:12:09 overcloud-ctrl-0 haproxy[13]: Server
heat_api/overcloud-ctrl-1.internalapi is DOWN, reason: Layer7 timeout,
  check duration: 10001ms. 2 active and 0 backup servers left. 0
  sessions active, 0 requeued, 0 remaining in queue.

xx:12:09 overcloud-ctrl-0 haproxy[13]: Server
heat_api/overcloud-ctrl-2.internalapi is DOWN, reason: Layer7 timeout,
  check duration: 10001ms. 1 active and 0 backup servers left. 0
  sessions active, 0 requeued, 0 remaining in queue.

xx:12:09 overcloud-ctrl-0 haproxy[13]: Server
heat_api/overcloud-ctrl-0.internalapi is DOWN, reason: Layer7 timeout,
  check duration: 10001ms. 0 active and 0 backup servers left. 0
  sessions active, 0 requeued, 0 remaining in queue.

xx:12:09 overcloud-ctrl-0 haproxy[13]: proxy heat_api has no server
available!

xx:13:55 overcloud-ctrl-0 haproxy[13]: Server
heat_api/overcloud-ctrl-1.internalapi is UP, reason: Layer7 check
passed, code: 200, info: "OK", check duration: 1ms. 1 active and 0
backup servers online. 0 sessions requeued, 0 total in queue.

xx:13:55 overcloud-ctrl-0 haproxy[13]: Server
heat_api/overcloud-ctrl-2.internalapi is UP, reason: Layer7 check
passed, code: 200, info: "OK", check duration: 2ms. 2 active and 0
backup servers online. 0 sessions requeued, 0 total in queue.

xx:13:56 overcloud-ctrl-0 haproxy[13]: Server
heat_api/overcloud-ctrl-0.internalapi is UP, reason: Layer7 check
passed, code: 200, info: "OK", check duration: 1ms. 3 active and 0
backup servers online. 0 sessions requeued, 0 total in queue.

In the end, the aforementioned suboptimal round-robin distribution, multiplied by the cascading failures of backends, makes the situation even worse. Switching to leastconn should alleviate the unequal distribution of client sessions in such a scenario; see the sketch below.
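
A minimal sketch of the fix, reusing the illustrative heat_api section from above: switch the balancing algorithm to leastconn and enable socket-level TCP keep-alives (option tcpka) so that idle long-running sessions are not expired by intermediate components:

  listen heat_api
    bind 172.16.2.10:8004 transparent
    mode http
    balance leastconn     # route new sessions to the server with the fewest active connections
    option tcpka          # socket-level TCP keep-alives on both client and server sides
    option httpchk
    timeout check 10s
    server overcloud-ctrl-0.internalapi 172.16.2.11:8004 check fall 5 inter 2000 rise 2
    server overcloud-ctrl-1.internalapi 172.16.2.12:8004 check fall 5 inter 2000 rise 2
    server overcloud-ctrl-2.internalapi 172.16.2.13:8004 check fall 5 inter 2000 rise 2

With leastconn, a backend returning from a failure receives new sessions gradually, according to its real connection count, instead of an immediate equal share of all new traffic as under round-robin.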

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Changed in tripleo:
status: New → In Progress
milestone: none → victoria-1
importance: Undecided → High
assignee: nobody → Bogdan Dobrelya (bogdando)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
description: updated
tags: added: queens-backport-potential train-backport-potential ussuri-backport-potential
summary: - suboptimal haproxy LB strategy for API services might produce traffic
- spikes and cascading failures
+ suboptimal haproxy LB strategy for API services with longrunning
+ requests might produce traffic spikes and cascading failures
tags: added: ci promotion-blocker
Changed in tripleo:
importance: High → Critical
importance: Critical → High
tags: removed: ci promotion-blocker
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/735541
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5717bd79525e2eb5b14c2d365dd59b63d7a63066
Submitter: Zuul
Branch: master

commit 5717bd79525e2eb5b14c2d365dd59b63d7a63066
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Jun 15 11:05:06 2020 +0200

    Use leastconn and socket-level TCP keep-alives for Heat API

    According to the HAProxy docs, when the protocol involves very long
    sessions with long idle periods (eg: querying Heat API for large
    resources), there is a risk that one of the intermediate components
    decides to expire a session which has remained idle for too long.

    In some NFV cases with hundreds of VM/port resources, multiple API
    requests are being sent in parallel towards the Heat API service to
    retrieve the OS::Nova::Server resources from a big Heat stack, and
    this is causing the Heat API backends to be unavailable and requests
    to fail. This also ends up with all of its backends considered down
    by HAProxy, leaving the system in a cascading failure scenario:

    xx:12:09 overcloud-ctrl-0 haproxy[13]: Server
    heat_api/overcloud-ctrl-1.internalapi is DOWN, reason: Layer7 timeout,
      check duration: 10001ms. 2 active and 0 backup servers left. 0
      sessions active, 0 requeued, 0 remaining in queue.

    xx:12:09 overcloud-ctrl-0 haproxy[13]: Server
    heat_api/overcloud-ctrl-2.internalapi is DOWN, reason: Layer7 timeout,
      check duration: 10001ms. 1 active and 0 backup servers left. 0
      sessions active, 0 requeued, 0 remaining in queue.

    xx:12:09 overcloud-ctrl-0 haproxy[13]: Server
    heat_api/overcloud-ctrl-0.internalapi is DOWN, reason: Layer7 timeout,
      check duration: 10001ms. 0 active and 0 backup servers left. 0
      sessions active, 0 requeued, 0 remaining in queue.

    xx:12:09 overcloud-ctrl-0 haproxy[13]: proxy heat_api has no server
    available!

    xx:13:55 overcloud-ctrl-0 haproxy[13]: Server
    heat_api/overcloud-ctrl-1.internalapi is UP, reason: Layer7 check
    passed, code: 200, info: "OK", check duration: 1ms. 1 active and 0
    backup servers online. 0 sessions requeued, 0 total in queue.

    xx:13:55 overcloud-ctrl-0 haproxy[13]: Server
    heat_api/overcloud-ctrl-2.internalapi is UP, reason: Layer7 check
    passed, code: 200, info: "OK", check duration: 2ms. 2 active and 0
    backup servers online. 0 sessions requeued, 0 total in queue.

    xx:13:56 overcloud-ctrl-0 haproxy[13]: Server
    heat_api/overcloud-ctrl-0.internalapi is UP, reason: Layer7 check
    passed, code: 200, info: "OK", check duration: 1ms. 3 active and 0
    backup servers online. 0 sessions requeued, 0 total in queue.

    Mitigation steps proposed:

    * Enabling socket-level TCP keep-alives makes the system regularly send
      packets to the other end of the connection, leaving it active.

    * tl;dr - round-robin LB does not fit scenarios with cascading
      failures. Enabling leastconn LB makes the cascading failure less
      likely to happen, when high numbers of client connections become
      aligned by real counts instead o...


Changed in tripleo:
milestone: victoria-1 → victoria-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/728779
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=c04057b58b3a13a433f28d229a8f0908df253b57
Submitter: Zuul
Branch: master

commit c04057b58b3a13a433f28d229a8f0908df253b57
Author: Bogdan Dobrelya <email address hidden>
Date: Mon May 18 10:14:21 2020 +0200

    Tune haproxy for long running sessions to use leastconn

    For long-running sessions, the leastconn LB is preferable to the
    default roundrobin: for long connections, it picks the least recently
    used of the servers with the lowest connection count.

    The LRU strategy that leastconn uses also indirectly reduces the
    possibility of cascading failures by smoothing out the high traffic
    spikes which may otherwise be caused by an unequal round-robin
    distribution of client connections upon a fail-over to another
    backend.

    Heat API/CFN provides the Orchestration API with a notion of
    long-running tasks. Neutron server may maintain long-running RPC
    calls to its agents. Cinder BlockStorage API provides long-running
    volume and backup actions. Swift Proxy's PutObject etc. and Ceph RGW
    APIs, like https://docs.ceph.com/docs/master/radosgw/s3/bucketops/,
    have a notion of "long running" as well. Ironic Inspector API
    provides long-running operations, like introspection of bare metal
    (BM) nodes. All of those should benefit from the leastconn LB switch
    controlled by the new parameter.

    Closes-Bug: #1882927
    Change-Id: I9515af738113a3f7aa2ea07315889d4a6595d4eb
    Signed-off-by: Bogdan Dobrelya <email address hidden>
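
The commit above gates the leastconn switch behind a new parameter; the exact parameter name and rendered output are not reproduced here. As a purely hypothetical sketch of the end result in haproxy.cfg, the strategy could also be inherited from a shared defaults section, with per-service overrides for short-lived request patterns:

  defaults
    mode http
    option tcpka          # socket-level TCP keep-alives for every proxied service
    balance leastconn     # inherited by every listen section below

  listen neutron
    bind 172.16.2.10:9696 transparent
    server overcloud-ctrl-0.internalapi 172.16.2.11:9696 check
    server overcloud-ctrl-1.internalapi 172.16.2.12:9696 check

  listen keystone_public
    bind 172.16.2.10:5000 transparent
    balance roundrobin    # per-section override: short-lived requests are fine with round-robin
    server overcloud-ctrl-0.internalapi 172.16.2.11:5000 check
    server overcloud-ctrl-1.internalapi 172.16.2.12:5000 check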

Changed in tripleo:
status: In Progress → Fix Released