[R2.20.21] TSN HA: 10 min of ARP traffic loss during TSN switch over

Bug #1456284 reported by chhandak
This bug affects 1 person
Affects: Juniper Openstack (status tracked in Trunk)

Series   Status          Importance   Assigned to         Milestone
R2.20    Fix Committed   High         Hari Prasad Killi
Trunk    Fix Committed   High         Hari Prasad Killi

Bug Description

Problem
-------------
Observed about 10 minutes (604 seconds) of ARP traffic loss during TSN switchover.

Description
------------------
ARP traffic is sent from the BMS at 1000 PPS, and the TSN responds to the ARP requests.
The supervisor-vrouter service was then stopped on the active TSN/TOR agent, and the TOR connected to the backup TSN/TOR agent.

Traffic does not resume for more than 10 minutes.

Two different issues were observed.

On the QFX side, a new TSN route needs to be programmed for broadcast packets, which takes around a minute. After that, ARP packets reach the active TSN but are dropped there.

On the TSN, the TOR tunnel IP is not added to the broadcast route. Traffic resumes only after 10 minutes. During this time, high CPU usage by contrail-vrouter and the TOR agent was observed.

top - 19:34:11 up 2 days, 3:05, 1 user, load average: 19.93, 8.12, 3.30
Tasks: 339 total, 1 running, 338 sleeping, 0 stopped, 0 zombie
%Cpu(s): 25.3 us, 4.7 sy, 0.0 ni, 69.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem: 19779974+total, 12261744 used, 18553800+free, 179420 buffers
KiB Swap: 20128972+total, 0 used, 20128972+free. 7524344 cached Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25554 root 20 0 2451348 729840 20884 S 570.8 0.4 9:11.46 contrail-vroute
25555 root 20 0 2413780 331968 15300 S 283.6 0.2 4:06.66 contrail-tor-ag

Setup
----------
env.roledefs = {
    'all': [host1, host2, host3, host4, host5, host6],
    'cfgm': [host1, host2, host3],
    'openstack': [host1, host2, host3],
    'webui': [host2],
    'control': [host1, host3],
    'compute': [host4, host5, host6],
    'tsn': [host4, host5],
    'toragent': [host4, host5],
    'collector': [host1, host3],
    'database': [host1, host2, host3],
    'build': [host_build],
}

env.hostnames = {
    'all': ['nodei6', 'nodei7', 'nodei8', 'nodei9', 'nodei10', 'nodei19']
}

information type: Proprietary → Public
Revision history for this message
Hari Prasad Killi (haripk) wrote :

Following are the current observations.

1. The connection between the agent and the QFX is flapping. The OVSDB server closes the connection because it received a malformed packet; this is seen with larger amounts of data. Possibly an issue with the SSL code.

2. It takes a long time (on the order of tens of seconds) before walks are initiated.

3. There is a delay in the subscribe message reaching the control node.

tags: added: blocker
Revision history for this message
Prabhjot Singh Sethi (prabhjot) wrote :

Under scaled configuration, the initial read of the OVSDB database results in multiple partial updates on the physical port entry, which then keeps being updated with the change in each VLAN port binding.

In the current test environment there were 4K VLAN port bindings on one physical port. The OVSDB initial read caused 8K updates to the physical port entry, and each update resulted in almost 4K finds for the logical switch, which causes the performance issues.
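
As a rough back-of-the-envelope illustration of the scale described above, the following sketch multiplies the approximate figures from this comment. The cost model (one logical switch find per binding on every physical port update) is an assumption made for illustration, not something measured from the agent.

```python
# Rough cost model for the initial OVSDB read described above.
# The counts are the approximate figures quoted in the comment; the
# model itself (finds per update == number of bindings) is an assumption.
VLAN_PORT_BINDINGS = 4_000      # bindings on one physical port
PHYSICAL_PORT_UPDATES = 8_000   # updates seen during the initial read
FINDS_PER_UPDATE = VLAN_PORT_BINDINGS  # "almost 4K finds" per update


def total_finds(updates: int, finds_per_update: int) -> int:
    """Total logical switch lookups triggered by the initial read."""
    return updates * finds_per_update


estimated = total_finds(PHYSICAL_PORT_UPDATES, FINDS_PER_UPDATE)
print(f"{estimated:,} logical switch finds")
```

Under these assumptions the initial read triggers on the order of tens of millions of lookups for a single physical port, which is consistent with the CPU load reported above.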

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : master

Review in progress for https://review.opencontrail.org/11101
Submitter: Prabhjot Singh Sethi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : R2.20

Review in progress for https://review.opencontrail.org/11102
Submitter: Prabhjot Singh Sethi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11101
Committed: http://github.org/Juniper/contrail-controller/commit/06b82cb91f91e749ce8338ca08a64b7e6dfbf690
Submitter: Zuul
Branch: master

commit 06b82cb91f91e749ce8338ca08a64b7e6dfbf690
Author: Prabhjot Singh Sethi <email address hidden>
Date: Mon Jun 1 12:44:51 2015 +0530

ToR Agent Performance - monitor request

Issue:
------
During the initial response to a monitor request, we end up
generating too many partial updates for Physical Port
entries, causing lots of unnecessary processing on
half-baked data and resulting in bad performance.

Fix:
----
Delay creation of Physical Port entries until processing
of the initial response to the monitor request is complete.
This avoids processing triggers for partial updates on
physical port entries. Once physical port creation is
enabled, it creates the physical port and stale entries
for VLAN port bindings with complete information.

Partial-Bug: #1456284
Change-Id: I03984ac9749cea6197feb1f8c43c87cdfe48833d
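
The idea in this fix can be sketched as a toy model: buffer updates while the initial monitor response is still being read, and process each physical port once when the read completes. This is only an illustration under stated assumptions; the class and method names (`PhysicalPortTable`, `on_update`, `on_initial_read_complete`) and the port name are hypothetical and are not taken from the contrail-controller code.

```python
class PhysicalPortTable:
    """Toy model: defer port processing until the initial read is done."""

    def __init__(self):
        self.initial_read_done = False
        self.pending = {}    # port name -> {vlan: logical switch name}
        self.processed = []  # (port, bindings snapshot) per processing trigger

    def on_update(self, port, vlan, logical_switch):
        # Always record the latest binding for the port.
        self.pending.setdefault(port, {})[vlan] = logical_switch
        # Only trigger processing once the initial read has completed.
        if self.initial_read_done:
            self._process(port)

    def on_initial_read_complete(self):
        # Enable processing and handle every buffered port exactly once.
        self.initial_read_done = True
        for port in list(self.pending):
            self._process(port)

    def _process(self, port):
        self.processed.append((port, dict(self.pending[port])))


table = PhysicalPortTable()
for vlan in range(4):
    # Partial updates arriving during the initial monitor response.
    table.on_update("port-1", vlan, f"ls-{vlan}")
table.on_initial_read_complete()
print(len(table.processed))  # one trigger instead of four
```

In this model the four partial updates cause a single processing pass with complete data, rather than one pass per partial update.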

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/11102
Committed: http://github.org/Juniper/contrail-controller/commit/df3d055158217de7e7922ab78d6fb8f93ab5cc13
Submitter: Zuul
Branch: R2.20

commit df3d055158217de7e7922ab78d6fb8f93ab5cc13
Author: Prabhjot Singh Sethi <email address hidden>
Date: Mon Jun 1 12:44:51 2015 +0530

ToR Agent Performance - monitor request

Issue:
------
During the initial response to a monitor request, we end up
generating too many partial updates for Physical Port
entries, causing lots of unnecessary processing on
half-baked data and resulting in bad performance.

Fix:
----
Delay creation of Physical Port entries until processing
of the initial response to the monitor request is complete.
This avoids processing triggers for partial updates on
physical port entries. Once physical port creation is
enabled, it creates the physical port and stale entries
for VLAN port bindings with complete information.

Partial-Bug: #1456284
Change-Id: I03984ac9749cea6197feb1f8c43c87cdfe48833d
(cherry picked from commit 06b82cb91f91e749ce8338ca08a64b7e6dfbf690)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : R2.20

Review in progress for https://review.opencontrail.org/11150
Submitter: Nischal Sheth (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : master

Review in progress for https://review.opencontrail.org/11151
Submitter: Nischal Sheth (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/11162
Submitter: Nischal Sheth (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : R2.20

Review in progress for https://review.opencontrail.org/11163
Submitter: Nischal Sheth (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11150
Committed: http://github.org/Juniper/contrail-controller/commit/4d64eac1c3d143e99941bd7272e96e845a00a1b8
Submitter: Zuul
Branch: R2.20

commit 4d64eac1c3d143e99941bd7272e96e845a00a1b8
Author: Nischal Sheth <email address hidden>
Date: Mon Jun 1 12:50:28 2015 -0700

Optimizations to reduce table walks from RoutePathReplicator

Don't walk table when joining/leaving if the table is empty.
This is handy when tables are being created upon receiving
configuration for routing-instances - the tables are always
empty at this point.

Change-Id: I050dda29ad1311915d0f62d3fe098d9c6e245b5f
Partial-Bug: 1456284
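
A minimal sketch of the optimization described in this commit, assuming a simplified table model: skip the walk on join when the table holds no routes. The `Table` class, the `join` function, and the table names are hypothetical illustrations, not the actual RoutePathReplicator interfaces.

```python
class Table:
    """Toy routing table that counts how many times it is walked."""

    def __init__(self, name, routes=None):
        self.name = name
        self.routes = list(routes or [])
        self.walk_count = 0

    def walk(self, visit):
        self.walk_count += 1
        for route in self.routes:
            visit(route)


def join(table, visit):
    """Replicate existing routes on join; skip the walk for an empty table."""
    if not table.routes:
        return False  # nothing to replicate, so no walk is scheduled
    table.walk(visit)
    return True


replicated = []
empty = Table("new-instance.inet.0")  # freshly created, no routes yet
full = Table("existing.inet.0", ["10.0.0.0/24", "10.0.1.0/24"])
join(empty, replicated.append)
join(full, replicated.append)
print(empty.walk_count, full.walk_count, replicated)
```

The empty table is never walked, which matters when many routing-instance tables are created at once and are all empty at that point.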

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/11151
Committed: http://github.org/Juniper/contrail-controller/commit/2e42cc2d49fbe704781cf6a729d75783b7116bd0
Submitter: Zuul
Branch: master

commit 2e42cc2d49fbe704781cf6a729d75783b7116bd0
Author: Nischal Sheth <email address hidden>
Date: Mon Jun 1 12:50:28 2015 -0700

Optimizations to reduce table walks from RoutePathReplicator

Don't walk table when joining/leaving if the table is empty.
This is handy when tables are being created upon receiving
configuration for routing-instances - the tables are always
empty at this point.

Change-Id: I050dda29ad1311915d0f62d3fe098d9c6e245b5f
Partial-Bug: 1456284

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/11162
Committed: http://github.org/Juniper/contrail-controller/commit/7bb83875e2920ae218113295547cdcc9496141ae
Submitter: Zuul
Branch: master

commit 7bb83875e2920ae218113295547cdcc9496141ae
Author: Nischal Sheth <email address hidden>
Date: Mon Jun 1 21:41:29 2015 -0700

Don't send XmppPeerInfo UVE on table subscribe/unsubscribe

The UVE has a list of tables to which the xmpp peer is registered.
Building and sending the entire list of tables each time the peer
subscribes and unsubscribes to a table is very expensive.

Do not send the UVE for the time being. Will resume sending the UVE
after new infra to make incremental updates to a list is available.

Change-Id: I1905755f665432e5cfc0229099e88bcc3bbf6b4c
Partial-Bug: 1456284
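
To illustrate why rebuilding the whole list on every subscribe is expensive, the sketch below counts list elements sent when a peer subscribes to tables one at a time. The table count of 4000 is a hypothetical figure chosen for illustration, and the functions are not part of the control node code.

```python
def elements_sent_full_list(num_tables: int) -> int:
    """Elements sent if the entire table list is resent on each subscribe."""
    total = 0
    subscribed = []
    for table_id in range(num_tables):
        subscribed.append(table_id)
        total += len(subscribed)  # the whole list goes out every time
    return total


def elements_sent_incremental(num_tables: int) -> int:
    """Elements sent if only the newly added table is reported."""
    return num_tables


NUM_TABLES = 4_000  # hypothetical number of table subscriptions
full = elements_sent_full_list(NUM_TABLES)
incremental = elements_sent_incremental(NUM_TABLES)
print(full, incremental)
```

The full-list approach grows quadratically with the number of subscriptions, while incremental updates would grow linearly.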

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/11163
Committed: http://github.org/Juniper/contrail-controller/commit/6d20b1948187380df3cb30e05e32c25077a3f560
Submitter: Zuul
Branch: R2.20

commit 6d20b1948187380df3cb30e05e32c25077a3f560
Author: Nischal Sheth <email address hidden>
Date: Mon Jun 1 21:41:29 2015 -0700

Don't send XmppPeerInfo UVE on table subscribe/unsubscribe

The UVE has a list of tables to which the xmpp peer is registered.
Building and sending the entire list of tables each time the peer
subscribes and unsubscribes to a table is very expensive.

Do not send the UVE for the time being. Will resume sending the UVE
after new infra to make incremental updates to a list is available.

Change-Id: I1905755f665432e5cfc0229099e88bcc3bbf6b4c
Partial-Bug: 1456284

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/11303
Submitter: Prabhjot Singh Sethi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/11304
Submitter: Prabhjot Singh Sethi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/11303
Submitter: Prabhjot Singh Sethi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/11304
Submitter: Prabhjot Singh Sethi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11303
Committed: http://github.org/Juniper/contrail-controller/commit/363ef441a60121f55a3d3cccae3efac6899a81cd
Submitter: Zuul
Branch: master

commit 363ef441a60121f55a3d3cccae3efac6899a81cd
Author: Prabhjot Singh Sethi <email address hidden>
Date: Fri Jun 5 12:31:54 2015 +0530

ToR Agent OVSDB - performance on HA

Issue:
------
After the initial read from the OVSDB server, all the previous
VlanPortBindingsEntries are created as stale entries.
Later, on subscribing to operDB, the entries are converted
from stale to non-stale, but the data remains the same.
The OVSDB client does not identify this as a no-op and keeps
triggering updates on the Physical Port entry, resulting in
performance degradation.

Fix:
----
Keep track of old_logical_switch_name in the VLAN port binding
entry and, if a change is a no-op, avoid triggering
updates to the physical port entry.

Partial-Bug: #1456284
Change-Id: I4a61904056ca374a34d40823f623fc14dfbcff67
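
The no-op check described in this fix can be sketched as follows. Only the field name `old_logical_switch_name` comes from the commit message; the class, its methods, and the port and switch names are hypothetical and simplified for illustration.

```python
from typing import Callable, Optional


class VlanPortBindingEntry:
    """Toy entry that notifies the physical port only on real changes."""

    def __init__(self, port: str, vlan: int, logical_switch_name: str):
        self.port = port
        self.vlan = vlan
        self.logical_switch_name = logical_switch_name
        # Last value that was pushed to the physical port entry.
        self.old_logical_switch_name: Optional[str] = None
        self.stale = True  # entries from the initial read start as stale

    def apply(self, notify_port: Callable[[str], None]) -> bool:
        """Trigger a physical port update unless the change is a no-op."""
        changed = self.logical_switch_name != self.old_logical_switch_name
        if changed:
            self.old_logical_switch_name = self.logical_switch_name
            notify_port(self.port)
        return changed

    def on_oper_db_change(self, logical_switch_name: str,
                          notify_port: Callable[[str], None]) -> bool:
        """Handle the stale to non-stale conversion driven by operDB."""
        self.stale = False
        self.logical_switch_name = logical_switch_name
        return self.apply(notify_port)


updates = []
entry = VlanPortBindingEntry("port-1", 100, "ls-a")
entry.apply(updates.append)                      # initial add: one update
entry.on_oper_db_change("ls-a", updates.append)  # same data: no update
entry.on_oper_db_change("ls-b", updates.append)  # real change: one update
print(len(updates))
```

Only the initial add and the real change reach the physical port entry; the stale to non-stale conversion with unchanged data is skipped.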

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/11304
Committed: http://github.org/Juniper/contrail-controller/commit/819b18b97c3d3fb2c3e42826d57009592bf6d252
Submitter: Zuul
Branch: R2.20

commit 819b18b97c3d3fb2c3e42826d57009592bf6d252
Author: Prabhjot Singh Sethi <email address hidden>
Date: Fri Jun 5 12:31:54 2015 +0530

ToR Agent OVSDB - performance on HA

Issue:
------
After the initial read from the OVSDB server, all the previous
VlanPortBindingsEntries are created as stale entries.
Later, on subscribing to operDB, the entries are converted
from stale to non-stale, but the data remains the same.
The OVSDB client does not identify this as a no-op and keeps
triggering updates on the Physical Port entry, resulting in
performance degradation.

Fix:
----
Keep track of old_logical_switch_name in the VLAN port binding
entry and, if a change is a no-op, avoid triggering
updates to the physical port entry.

Partial-Bug: #1456284
Change-Id: I4a61904056ca374a34d40823f623fc14dfbcff67
(cherry picked from commit 363ef441a60121f55a3d3cccae3efac6899a81cd)
