Agent should implement a backoff algorithm in case of allowed-address pair

Bug #1461774 reported by Praveen
Affects: Juniper Openstack (status tracked in Trunk)

Series  Status         Importance  Assigned to  Milestone
R2.20   Fix Committed  Undecided   Naveen N
Trunk   Fix Committed  Undecided   Naveen N

Bug Description

When an allowed-address pair is configured in Active-Backup mode and traffic is received from both the active and the backup node because of some other error, the agent performs active-backup transitions at a high rate. These transitions introduce significant overhead in the agent because flows must be re-evaluated on every transition.

The allowed-address pair handling should implement a backoff algorithm when traffic is seen from both the active and the backup nodes.

The following mail captures the behavior seen in one customer setup.

----

nova is running the same VM on both bka-001-06 and bka-001-07:

in bka-001-06 pid 8791 is executing instance -uuid 6313840c-9068-4b13-8e84-f69925ffd0be
in bka-001-07 pid 9126 is executing instance -uuid 6313840c-9068-4b13-8e84-f69925ffd0be

root@bka-001-02:~# (source openrc; nova show 6313840c-9068-4b13-8e84-f69925ffd0be)
+--------------------------------------+--------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+--------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | bka-001-07 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | bka-001-07.bka |
| OS-EXT-SRV-ATTR:instance_name | instance-0000b975 |
| OS-EXT-STS:power_state | 1 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | active |

[…]
| example-net network | 10.0.0.10, 37.44.0.127 |

This means that both the private IP address (10.0.0.10) and the floating IP (37.44.0.127) are being advertised by both compute nodes.
Contrail then triggers the code that manages the allowed-address pair feature (and VM migration). This feature lets us detect the system that most recently sent traffic claiming a specific address: a compute node increments a sequence number on the route when it sees that it is not the preferred system.

Both compute nodes are increasing the sequence number on the route and claiming the route. They are able to do this quite fast.

The floating-ip address is then re-originated into 800+ stale snatdebug instances and 30+ functional snat instances. These updates are then pushed to all compute nodes… when the compute nodes receive the route updates they re-examine the flows… The flows for this specific VM are constantly being re-examined. That is why you see the agent being a bit slow.

BGP as the message bus seems to be pushing several thousand updates per second… the agent is able to process them and continuously re-examine flows. But that affects its response time to new flows.

I’d recommend that you terminate that instance. I believe that will stop the route update storm…

Praveen, can you please file a bug regarding the sequence number behavior… ? There are two independent issues: this VM didn’t enable allowed-address pair so we shouldn’t really be increasing the sequence number. And when we increase the sequence number there should be an exponential back-off on how fast we update it.
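The sequence-number race described in the mail can be illustrated with a small simulation. This is only a sketch of the behavior as described above, not the agent's code: the `Route` class, `on_traffic_seen`, and `simulate` are hypothetical names, and the shared `Route` object stands in for state that is really distributed over BGP.

```python
from dataclasses import dataclass


@dataclass
class Route:
    """Hypothetical stand-in for the route state that BGP distributes."""
    owner: str = ""
    sequence: int = 0


def on_traffic_seen(route: Route, node: str) -> bool:
    """Called when `node` sees local traffic for the route's address.

    If the node is not the preferred owner, it bumps the sequence number
    and claims the route. Returns True when a route update is published.
    """
    if route.owner == node:
        return False  # already preferred, nothing to publish
    route.sequence += 1
    route.owner = node
    return True


def simulate(rounds: int, nodes: tuple) -> int:
    """Every node sees traffic for the same address once per round."""
    route = Route()
    updates = 0
    for _ in range(rounds):
        for node in nodes:
            if on_traffic_seen(route, node):
                updates += 1
    return updates
```

With a single node the route is claimed once and then stays quiet. With two nodes running the same VM, each one sees the other as preferred in turn, so every observation of traffic produces another update, and nothing in this scheme limits the rate.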

Tags: vrouter
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/11413
Submitter: Naveen N (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11413
Committed: http://github.org/Juniper/contrail-controller/commit/d70aa21ee4b7c377a6f56677e836856f8a565e6e
Submitter: Zuul
Branch: R2.20

commit d70aa21ee4b7c377a6f56677e836856f8a565e6e
Author: Naveen N <email address hidden>
Date: Tue Jun 9 16:40:31 2015 +0530

* Back off publishing of preference for a path if it is flapping

If a given path flaps more than 5 times in a 5-second period,
the agent backs off for 5 seconds and retries the path update
after that period. If the path flaps again, the agent backs off
exponentially, up to a maximum of 100 seconds.
Test case for the same.
Closes-bug:#1461774,#1373135

Change-Id: I7429dc01354784baf0090c13b0e94e9ae990bcf0
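The backoff policy in the commit message can be sketched roughly as follows. This is not the code from the change: the `FlapBackoff` class and its method names are hypothetical, and two details are assumptions the commit message does not state, namely that "exponentially" means doubling, and that the backoff resets once a full 5-second window passes without exceeding the flap limit. The thresholds (5 flaps, 5 seconds, 100-second cap) come from the commit message.

```python
class FlapBackoff:
    """Hypothetical flap damper for publishing a path's preference."""

    MAX_FLAPS = 5           # flaps tolerated per window (commit message)
    WINDOW = 5.0            # window length in seconds (commit message)
    INITIAL_BACKOFF = 5.0   # first backoff in seconds (commit message)
    MAX_BACKOFF = 100.0     # backoff cap in seconds (commit message)

    def __init__(self) -> None:
        self.window_start = 0.0
        self.flaps = 0
        self.backoff = 0.0        # 0 means no backoff is in effect
        self.blocked_until = 0.0

    def allow_update(self, now: float) -> bool:
        """Return True if a preference change may be published at `now`."""
        if now < self.blocked_until:
            return False  # still backing off
        if now - self.window_start >= self.WINDOW:
            # Assumption: a window that ends without tripping the limit
            # means the path is stable again, so the backoff resets.
            self.window_start = now
            self.flaps = 0
            self.backoff = 0.0
        self.flaps += 1
        if self.flaps > self.MAX_FLAPS:
            # Assumption: "exponential" is modeled as doubling.
            if self.backoff == 0.0:
                self.backoff = self.INITIAL_BACKOFF
            else:
                self.backoff = min(self.backoff * 2, self.MAX_BACKOFF)
            self.blocked_until = now + self.backoff
            self.window_start = self.blocked_until
            self.flaps = 0
            return False
        return True
```

Under these assumptions, a path that flaps continuously is suppressed for 5, 10, 20, 40, 80 and then 100 seconds at a time, while a path that changes only occasionally is never delayed.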

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/11720
Submitter: Naveen N (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11720
Committed: http://github.org/Juniper/contrail-controller/commit/88edb11074e496704da868bc4875a6c15daf1483
Submitter: Zuul
Branch: master

commit 88edb11074e496704da868bc4875a6c15daf1483
Author: Naveen N <email address hidden>
Date: Tue Jun 9 16:40:31 2015 +0530

* Back off publishing of preference for a path if it is flapping

If a given path flaps more than 5 times in a 5-second period,
the agent backs off for 5 seconds and retries the path update
after that period. If the path flaps again, the agent backs off
exponentially, up to a maximum of 100 seconds.
Test case for the same.
Closes-bug:#1461774,#1373135

Change-Id: I7429dc01354784baf0090c13b0e94e9ae990bcf0
(cherry picked from commit d70aa21ee4b7c377a6f56677e836856f8a565e6e)
