Rework bgp membership manager to improve scalability
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
Trunk | Fix Committed | Wishlist | Nischal Sheth |
Bug Description
Existing implementation triggers a table walk for each (peer, table) join
or leave request. It also triggers a separate walk per (peer, table) when
all paths received from a peer need to be marked stale/deleted as part
of graceful restart.
This behavior is fine when a single peer comes up or goes down, but it's
sub-optimal when a bunch of peers go down or come up at roughly the same
time. This happens if multiple vrouters encounter the same problem and
crash or when the CN crashes and comes back up. In the latter case, we run
into the so-called thundering herd problem wherein all vrouters connect
to the CN at roughly the same time and then register to a large number of
common tables.
This causes a few problems:
1. The CN performs a large number of unnecessary table walks. These could potentially be combined into a much smaller number.
2. If there's a large number of peers and a large number of tables, the CN
ends up triggering a very large number of walks at roughly the same time.
This puts an unnecessary burden on the TaskScheduler since each table walk
results in the creation of multiple Tasks (one per partition).
3. Not only does 1) above cause redundant table walks, it also results in
redundant calls to BgpExport. It would be better to call the Join/Leave
methods with a BitSet of peers to handle multiple peers at once. Note that
the Join/Leave methods already handle a BitSet.
4. Since Join/Leave processing is done for 1 peer at a time, we also end
up encoding each route update into a bgp/xmpp message for one or a few
peers at a time. It would be ideal to encode each route once and send it
to all interested peers, i.e. amortize the cost of encoding the update
over many peers.
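To put rough numbers on problems 1 and 2, here is a back-of-the-envelope sketch. The function name and the peer, table, and partition counts are hypothetical and chosen only for illustration. The batched figures are a best case that assumes every request for a table is already pending when that table's walk starts.

```python
def walk_counts(peers, tables, partitions):
    """Compare one walk per (peer, table) with one walk per table.

    Each walk is assumed to create one Task per partition, as described
    in problem 2 above. All inputs are illustrative, not measured.
    """
    per_request_walks = peers * tables   # existing behavior
    batched_walks = tables               # best case with batching
    return {
        "per_request_walks": per_request_walks,
        "per_request_tasks": per_request_walks * partitions,
        "batched_walks": batched_walks,
        "batched_tasks": batched_walks * partitions,
    }


# Hypothetical scale: 100 vrouters, 50 common tables, 8 partitions.
counts = walk_counts(peers=100, tables=50, partitions=8)
```

With these made-up inputs, the walk count drops by a factor equal to the number of peers, and the Task count drops by the same factor.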
Proposal is to rework implementation of bgp membership manager to address
all the above issues. The membership manager can keep track of all pending
(peer, table) requests and trigger a table walk for one table at a time.
It can perform join/leave and receive path manipulation operations for all
requesting peers for the table in question. Since each table is sharded
across all partitions, triggering a single table walk still allows the
Task infra to utilize all available threads/cores. Triggering one table
walk at a time also allows the membership manager to accumulate multiple
peer requests for all other tables.
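The batching idea in the proposal can be sketched roughly as follows. This is a simplified, synchronous Python illustration and not the control node's actual implementation. The class name, the `walk_table` callback, and the "latest pending request wins" rule are all assumptions made for the example. The real manager would run walks through the Task infra and would also handle receive path manipulation.

```python
from collections import OrderedDict


class MembershipManagerSketch:
    """Accumulates (peer, table) requests and walks one table at a time."""

    def __init__(self, walk_table):
        # walk_table(table, joins, leaves) is a hypothetical stand-in for
        # a single table walk that serves a whole set of peers at once.
        self._walk_table = walk_table
        self._pending = OrderedDict()  # table -> {"join": set, "leave": set}
        self.walks = 0

    def request(self, peer, table, op):
        if op not in ("join", "leave"):
            raise ValueError("unknown operation: %r" % (op,))
        entry = self._pending.setdefault(table, {"join": set(), "leave": set()})
        # Simplifying assumption: the latest pending request from a peer
        # replaces any earlier pending request for the same table.
        other = "leave" if op == "join" else "join"
        entry[other].discard(peer)
        entry[op].add(peer)

    def walk_next(self):
        """Walk the oldest table with pending requests; False when idle."""
        if not self._pending:
            return False
        table, entry = self._pending.popitem(last=False)
        self._walk_table(table, frozenset(entry["join"]), frozenset(entry["leave"]))
        self.walks += 1
        return True

    def drain(self):
        while self.walk_next():
            pass


# Three hypothetical vrouters join the same two tables at about the same time.
log = []
mgr = MembershipManagerSketch(
    lambda table, joins, leaves: log.append((table, sorted(joins), sorted(leaves)))
)
for peer in ("vr1", "vr2", "vr3"):
    for table in ("red", "blue"):
        mgr.request(peer, table, "join")
mgr.drain()
```

In this sketch the six requests are served by two walks, one per table, and each walk receives the full set of joining peers. That set plays the role of the BitSet that the Join/Leave methods already accept.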
description: updated
summary:
- Rework bgp membership manager to improve efficiency
+ Rework bgp membership manager to improve scalability
description: updated
Review in progress for https://review.opencontrail.org/20035
Submitter: Nischal Sheth (<email address hidden>)