Agent should implement a backoff algorithm in case of allowed-address pair
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
R2.20 | Fix Committed | Undecided | Naveen N |
Trunk | Fix Committed | Undecided | Naveen N |
Bug Description
When an allowed-address pair is in Active-Backup mode and traffic is received from both the active and the backup node because of some other error, the agent performs active-backup transitions at a high rate. These transitions add significant overhead in the agent because each one triggers flow re-evaluation.
The allowed-address pair handling should implement a backoff algorithm when traffic is seen from both the active and the backup node.
The following mail captures the behavior seen in one customer setup.
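One possible shape for such a backoff is sketched below in Python. It is only an illustration of the idea, not the agent's code: the class name `TransitionBackoff`, its parameters (`base_delay`, `max_delay`, `quiet_period`) and their default values are all assumptions made for this example. Each permitted transition doubles the hold-down that the next one has to wait for, and a long quiet interval resets it.

```python
class TransitionBackoff:
    """Hypothetical rate limiter for active/backup transitions.

    Every permitted transition doubles the hold-down applied to the next
    one (capped at max_delay). If no transition was permitted for at least
    quiet_period seconds, the hold-down starts again from base_delay.
    """

    def __init__(self, base_delay=1.0, max_delay=64.0, quiet_period=120.0):
        self.base_delay = base_delay      # assumed initial hold-down, seconds
        self.max_delay = max_delay        # assumed cap on the hold-down
        self.quiet_period = quiet_period  # assumed reset interval
        self.current_delay = 0.0          # hold-down for the next transition
        self.last_transition = None       # time of the last permitted one

    def allow_transition(self, now):
        """Return True if a transition requested at time `now` may proceed."""
        if self.last_transition is None:
            # First transition ever: always allowed.
            self.last_transition = now
            self.current_delay = self.base_delay
            return True

        elapsed = now - self.last_transition
        if elapsed >= self.quiet_period:
            # The address was stable for a long time: start over.
            self.last_transition = now
            self.current_delay = self.base_delay
            return True
        if elapsed < self.current_delay:
            # Still inside the hold-down window: suppress this transition.
            return False

        # Hold-down expired: allow, and double the next hold-down.
        self.last_transition = now
        self.current_delay = min(self.current_delay * 2, self.max_delay)
        return True


# A transition is requested once per second for 32 seconds.
backoff = TransitionBackoff()
allowed = [t for t in range(32) if backoff.allow_transition(float(t))]
print(allowed)  # [0, 1, 3, 7, 15, 31]
```

With these assumed defaults, 32 transition requests result in only 6 actual transitions, so the number of flow re-evaluations grows logarithmically with the duration of the conflict instead of linearly with the traffic.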
----
nova is running the same VM in both bka-001-06 and bka-001-07:
in bka-001-06 pid 8791 is executing instance -uuid 6313840c-
in bka-001-07 pid 9126 is executing instance -uuid 6313840c-
root@bka-001-02:~# (source openrc; nova show 6313840c-
+------
| Property | Value |
+------
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-STS:vm_state | active |
[…]
| example-net network | 10.0.0.10, 37.44.0.127 |
This means that both the private IP address (10.0.0.10) and the floating-ip (34.44.0.127) are being advertised by both compute-nodes.
Contrail is then triggering the code that manages the allowed-address pair feature (and VM migration). This feature works by letting us detect the system that most recently sent traffic claiming a specific address: a system increments the sequence number on the route when it sees that it is not the preferred system.
Both compute nodes are increasing the sequence number on the route and claiming the route. They are able to do this quite fast.
The floating-ip address is then re-originated into 800+ stale snatdebug instances and 30+ functional snat instances. These updates are then pushed to all compute nodes… when the compute nodes receive the route updates they re-examine the flows… The flows for this specific VM are constantly being re-examined. That is why you see the agent being a bit slow.
BGP as the message bus seems to be pushing several thousand updates per second… the agent is able to process them and continuously re-examine flows. But that affects its response time to new flows.
I’d recommend that you terminate that instance. I believe that will stop the route update storm…
Praveen, can you please file a bug regarding the sequence number behavior… ? There are two independent issues: this VM didn’t enable allowed-address pair so we shouldn’t really be increasing the sequence number. And when we increase the sequence number there should be an exponential back-off on how fast we update it.
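The route-claiming behavior described in the mail can be illustrated with a small Python model. This is a deliberately simplified sketch, not the agent's implementation: the `Route` class and the `on_traffic` function are hypothetical names, and the only rule modeled is "a node that sees local traffic and is not the preferred system increments the sequence number and claims the route".

```python
class Route:
    """Hypothetical stand-in for a route with an owner and a sequence number."""

    def __init__(self):
        self.owner = None   # node currently preferred for the address
        self.sequence = 0   # incremented on every claim
        self.updates = 0    # route updates pushed to the other compute nodes


def on_traffic(route, node):
    """Model of a node seeing local traffic that claims the address.

    Returns True if the node had to claim the route (one route update).
    """
    if route.owner == node:
        return False        # already the preferred system, nothing to do
    route.sequence += 1     # out-bid the current owner
    route.owner = node
    route.updates += 1
    return True


# Same VM running on two compute nodes: traffic alternates between them.
route = Route()
for i in range(1000):
    on_traffic(route, "bka-001-06" if i % 2 == 0 else "bka-001-07")
print(route.updates)  # 1000

# Normal case: only one node sends traffic for the address.
single = Route()
for _ in range(1000):
    on_traffic(single, "bka-001-06")
print(single.updates)  # 1
```

In this model, when both nodes send traffic every packet that arrives on the non-preferred node produces a route update, whereas a single sender produces exactly one. That unbounded growth is what a back-off on sequence-number increments is meant to cap.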
Review in progress for https://review.opencontrail.org/11413
Submitter: Naveen N (<email address hidden>)