BGPaaS: flow setup issues at scale
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
R3.0 | In Progress | High | Manish Singh |
R3.0.3.x | Fix Committed | High | Manish Singh |
R3.1 | Fix Committed | High | Manish Singh |
R3.2 | Fix Committed | High | Manish Singh |
Trunk | Fix Committed | High | Manish Singh |
Bug Description
Seeing a few flow setup issues when BGPaaS sessions were scaled to 2k. They might be related and have the same root cause, so all of them are tracked under this bug for now:
Symptom 1:
1. Let's assume BGPaaS sessions are established to control-nodes CN1 and CN2 (CN3 is down).
2. If CN1 is restarted, then some flows to CN2 are also reset (RST). Further, it takes anywhere between 2 and 10 minutes for all the 2k sessions to come back up.
Symptom 2:
If the BGP process inside the VM that initiates the BGPaaS sessions (1k such sessions) is restarted multiple times, then a few hundred TCP flows get 'stuck' in vRouter. These flows are not evicted even after being idle for longer than the configured idle timeout.
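For triage, the "idle for longer than the configured idle timeout" condition can be checked with a small script like the sketch below. It is only an illustration: the `FlowRecord` type, its field names, and the sample timestamps and timeout are hypothetical and do not reflect vRouter's actual flow table schema or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    # Hypothetical record; field names are illustrative, not vRouter's schema.
    index: int
    last_activity: float  # seconds, on the same clock as `now`


def overdue_flows(flows, now, idle_timeout):
    """Return the flows that have been idle for longer than idle_timeout seconds."""
    return [f for f in flows if now - f.last_activity > idle_timeout]


# Made-up sample data: flows 1 and 3 have been idle far longer than the timeout.
flows = [FlowRecord(1, 100.0), FlowRecord(2, 950.0), FlowRecord(3, 10.0)]
stuck = overdue_flows(flows, now=1000.0, idle_timeout=180.0)
print([f.index for f in stuck])
```

Any flow this kind of check reports while it is still present in the flow table would be a candidate 'stuck' flow in the sense of this symptom.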
Symptom 3:
Let's say there are some BGPaaS flows stuck in vRouter (as seen in symptom 2). If the guest now attempts to open a new TCP connection, the initial three-way handshake completes successfully with the control-node and some data is exchanged as well, but soon after the flow is removed from the vRouter. The next TCP segment sent by the control-node thus causes the compute node to send a RST towards the control-node, and the flow has to be set up again by the guest.
Symptom 4 (not scale related):
Let's say the guest has established 2 BGP connections: one each to the .1 and .2 IPs. Both sessions are up. Now, if a new SYN is sent from the guest to the .1 IP, the .2 flow is removed and that connection gets RST.
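The expected behaviour for this symptom can be stated as a small model, assuming flows are tracked independently per 5-tuple. The sketch below is not vRouter code: `FlowKey`, `on_new_syn`, the addresses, and the ports are all hypothetical, and it says nothing about the root cause. It only shows that a new SYN towards the .1 IP should leave the existing .2 flow untouched.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    # Hypothetical 5-tuple key, used for illustration only.
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int
    dst_port: int


def on_new_syn(table, key):
    """Create or refresh only the entry for `key`; leave every other flow alone."""
    table[key] = "SYN_SEEN"
    return table


# Made-up guest address 10.0.0.3 with sessions to the .1 and .2 IPs on TCP port 179.
flow_to_dot1 = FlowKey("10.0.0.3", "10.0.0.1", 6, 40001, 179)
flow_to_dot2 = FlowKey("10.0.0.3", "10.0.0.2", 6, 40002, 179)
table = {flow_to_dot1: "ESTABLISHED", flow_to_dot2: "ESTABLISHED"}

# A new SYN from the guest to the .1 IP, using a new source port.
new_syn = FlowKey("10.0.0.3", "10.0.0.1", 6, 40003, 179)
on_new_syn(table, new_syn)

# Expected: the .2 flow is still present and established.
print(table[flow_to_dot2])
```

In the behaviour reported here, the equivalent of `flow_to_dot2` disappears after the new SYN, which is what this model treats as incorrect.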
Review in progress for https://review.opencontrail.org/28955
Submitter: Manish Singh (<email address hidden>)