3.1.3.0-72: BUM Tree corrupted after clean installation

Bug #1692795 reported by Sandeep Sridhar
This bug affects 4 people
Affects            Status         Importance  Assigned to   Milestone
Juniper Openstack  (status tracked in Trunk)
  R3.1             Fix Committed  Critical    Manish Singh
  R3.2             Fix Committed  Critical    Manish Singh
  R4.0             Fix Committed  Critical    Manish Singh
  Trunk            Fix Committed  Critical    Manish Singh

Bug Description

QFX devices missing from the BUM tree have been reported before. Manish investigated and pointed to a corrupted pointer in the tor-agent core collected from the problematic setup. Earlier, we suspected the ISSU upgrade (the ISSU procedure was used to upgrade Contrail from 2.21.2 to 3.1.3.0-72).

The new occurrence is reported on a clean installation of the 3.1.3.0-72 build. Here is JTAC's analysis:

VNIs being tested.
1. 4338
2. 3559

IP addresses as follows:
(TSN) openc-34 172.23.10.201
(TSN) openc-35 172.23.10.202
(QFX6) 172.23.11.48
(QFX23) 172.23.11.49

QFX6 is being served by contrail-tor-agent-6 on openc-34
QFX23 is being served by contrail-tor-agent-23 on openc-35.

Please see below:

root@openc-34:~# contrail-status | grep QFX
contrail-tor-agent-1 active (ToR:QFX1 connection up)
contrail-tor-agent-11 active (ToR:QFX11 connection down)
contrail-tor-agent-23 active (ToR:QFX23 connection down)
contrail-tor-agent-6 active (ToR:QFX6 connection up) <<<<<<<<<<<<<<<<<<<<<
root@openc-34:~#

root@openc-35:~# contrail-status | grep QFX
contrail-tor-agent-1 active (ToR:QFX1 connection down)
contrail-tor-agent-11 active (ToR:QFX11 connection up)
contrail-tor-agent-23 active (ToR:QFX23 connection up) <<<<<<<<<<<<<<<<<<<<<
contrail-tor-agent-6 active (ToR:QFX6 connection down)
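
The ownership shown above (which TSN holds the live connection to each QFX) can be read out mechanically. This is a minimal sketch, assuming the `contrail-status | grep QFX` output keeps the line format pasted in this report; the function name `tor_connections` and the regular expression are illustrative, not part of any Contrail tool.

```python
import re

# Matches lines such as:
#   contrail-tor-agent-6 active (ToR:QFX6 connection up) <<<<<
# The pattern is an assumption derived from the output pasted in this report.
LINE_RE = re.compile(
    r"^(contrail-tor-agent-\d+)\s+(\S+)\s+\(ToR:(\S+) connection (up|down)\)"
)


def tor_connections(status_output):
    """Map ToR name -> (tor-agent name, True if the connection is up)."""
    result = {}
    for line in status_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            agent, _state, tor, conn = match.groups()
            result[tor] = (agent, conn == "up")
    return result


# Output copied from openc-34 above.
OPENC_34 = """\
contrail-tor-agent-1 active (ToR:QFX1 connection up)
contrail-tor-agent-11 active (ToR:QFX11 connection down)
contrail-tor-agent-23 active (ToR:QFX23 connection down)
contrail-tor-agent-6 active (ToR:QFX6 connection up) <<<<<<<<<<<<<<<<<<<<<
"""

up_on_34 = sorted(t for t, (_, up) in tor_connections(OPENC_34).items() if up)
print(up_on_34)  # ['QFX1', 'QFX6']
```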

Test Case 1: (VNI 4338)
============
(30:06:23:00:03:42) => [QFX6 ae7.2834] => [openc-34] => [openc-35] => [QFX23 ae1.3834] => (30:23:06:00:03:42)

Result: On openc-34, QFX6 is missing && openc-35 is present.
        On openc-35, QFX23 is missing && openc-34 is present.

<<< BUM traffic broken completely >>>

Test Case 2: (VNI 3559)
============
(30:06:23:00:00:37) => [QFX6 ae7.2055] => [openc-34] => [openc-35] => [QFX23 ae1.3055] => (30:23:06:00:00:37)

Result: On openc-34, QFX6 is missing && openc-35 present.
        On openc-35, QFX23 is present && openc-34 is also present.

<<< BUM traffic is blocked in one direction: openc-35 ==> openc-34 >>>
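
The two results can be reproduced with a simplified model. It assumes that BUM traffic from one TSN to the other is delivered only if the sending TSN's tree contains the peer TSN and the receiving TSN's tree contains its locally served QFX. That rule is an assumption made for illustration, and `blocked_directions` is a hypothetical helper, not Contrail code.

```python
def blocked_directions(trees, local_tor):
    """Return the TSN-to-TSN directions in which BUM traffic is not delivered.

    trees:     TSN name -> set of members seen in that TSN's BUM tree
    local_tor: TSN name -> ToR served by that TSN
    """
    blocked = []
    for src in trees:
        for dst in trees:
            if src == dst:
                continue
            reaches_peer = dst in trees[src]            # src replicates to dst
            reaches_tor = local_tor[dst] in trees[dst]  # dst replicates to its QFX
            if not (reaches_peer and reaches_tor):
                blocked.append((src, dst))
    return sorted(blocked)


LOCAL_TOR = {"openc-34": "QFX6", "openc-35": "QFX23"}

# Tree membership as reported in Test Case 1 and Test Case 2.
case1 = {"openc-34": {"openc-35"}, "openc-35": {"openc-34"}}
case2 = {"openc-34": {"openc-35"}, "openc-35": {"QFX23", "openc-34"}}

print(blocked_directions(case1, LOCAL_TOR))  # both directions
print(blocked_directions(case2, LOCAL_TOR))  # [('openc-35', 'openc-34')]
```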

All cores can be found here:

/home/ssandeep/2017-0424-0113/NewLogsMay23/

Greetings,
Sandeep.

Tags: bms vrouter nttc
information type: Proprietary → Public
tags: added: bms vrouter
Changed in juniperopenstack:
assignee: nobody → Manish Singh (manishs)
importance: Undecided → Critical
milestone: none → r3.1.3.0
tags: added: nttc
Sandeep Sridhar (ssandeep) wrote :

Hi Manish,

  Their test procedure is as below:

The environment is installed clean (3 Control Nodes, 4 TSNs, 128 ToR-Agents, 4 Compute Nodes, 1 Openstack Node) as follows.
After that, they provision LIFs and pass traffic.

===================================================================================================
A1-1 Setup Procedure
cd /opt/contrail/utils/
fab install_pkg_all:/tmp/contrail-install-packages_3.1.3.0-73~mitaka_all.deb
fab upgrade_kernel_all
fab install_contrail
fab setup_all

A1-2
They encountered an issue with the TSN being unstable after reboot (caused by the bond settings, which you worked on with Mehul and resolved).

A1-3
Execute "service supervisor-vrouter restart" on all 4 TSN nodes.

A1-5
Modify following Params
(TSN&Compute)
/etc/contrail/contrail-*-agent*.conf
headless_mode = true

/etc/contrail/supervisord_vrouter.conf
environment=TBB_THREAD_COUNT = 8

(TSN)
/etc/modprobe.d/vrouter.conf
options vrouter vr_mpls_labels=256000 vr_nexthops=521000 vr_vrfs=65536 vr_bridge_entries=1000000

(Compute)
/etc/modprobe.d/vrouter.conf
options vrouter vr_mpls_labels=11520 vr_flow_entries=2097152

/etc/contrail/contrail-vrouter-agent.conf
[DEFAULT]
flow_cache_timeout = 60
disable_flow_collection = True
[FLOWS]
max_vm_flows = 45

remove virbr0
virsh net-destroy default
virsh net-autostart default --disable

A1-6
All Contrail servers were rebooted (executed "shutdown -r now").
===================================================================================================
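
When comparing the TSN and Compute settings from step A1-5, it can help to turn each `options vrouter ...` line into a dictionary. A minimal sketch, assuming every parameter has the `name=integer` form used in this report; `parse_modprobe_options` is a hypothetical helper.

```python
def parse_modprobe_options(line):
    """Parse 'options <module> k=v ...' into (module, {k: int(v)})."""
    parts = line.split()
    if len(parts) < 2 or parts[0] != "options":
        raise ValueError("not an 'options' line: %r" % line)
    module = parts[1]
    params = {}
    for item in parts[2:]:
        key, _, value = item.partition("=")
        params[key] = int(value)  # assumes integer values, as in this report
    return module, params


# Lines copied from the TSN and Compute settings above.
TSN_LINE = ("options vrouter vr_mpls_labels=256000 vr_nexthops=521000 "
            "vr_vrfs=65536 vr_bridge_entries=1000000")
COMPUTE_LINE = "options vrouter vr_mpls_labels=11520 vr_flow_entries=2097152"

tsn_module, tsn_params = parse_modprobe_options(TSN_LINE)
_, compute_params = parse_modprobe_options(COMPUTE_LINE)

# Parameters set on the TSN but not on the Compute node.
print(sorted(set(tsn_params) - set(compute_params)))
```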
They believe this issue is triggered by the POST requests they issue (around 40 POSTs per second, with 10 sessions). The POST requests mostly create virtual-networks, virtual-machines, etc.

The logs below should help:

ssh root@10.219.48.123
password:Jtaclab123

[root@LocalStorage coreCollectedMay26]# pwd
/home/ssandeep/2017-0424-0113/coreCollectedMay26
[root@LocalStorage coreCollectedMay26]# ls -lrt
total 382604
-rw-rw-r--. 1 1001 1001 1181918 May 23 08:05 20170523-pt008.log
-rw-rw-r--. 1 1001 1001 1181918 May 23 08:05 20170523-pt009.log
-rw-rw-r--. 1 1001 1001 3549981 May 23 08:13 20170523-pt002.log
-rw-rw-r--. 1 1001 1001 4729948 May 23 08:18 20170523-pt004.log
-rw-rw-r--. 1 1001 1001 4805984 May 23 08:19 20170523-pt001.log
-rw-rw-r--. 1 1001 1001 4805982 May 23 08:20 20170523-pt003.log
-rw-rw-r--. 1 1001 1001 7824423 May 23 08:34 20170523-pt007.log
-rw-rw-r--. 1 1001 1001 11948881 May 23 08:37 20170523-pt006.log
-rw-rw-r--. 1 1001 1001 11848002 May 23 08:40 20170523-pt011.log
-rw-rw-r--. 1 1001 1001 11850306 May 23 08:40 20170523-pt010.log
-rw-r--r--. 1 root root 264286785 May 26 21:07 20170526_JN-323_tor-agent-21-core.zip
-rw-r--r--. 1 root root 63744000 May 26 21:13 20170523_JN-323_post.tar

All the *-pt*.log files record the POST requests they were issuing when this issue occurred. This might give you some hint as to what is causing the problem.

The zip file 20170526_JN-323_tor-agent-21-core.zip contains the tor-agent and TSN cores we collected on openc-36. I will send you my notes, which include the VNIs, for your reference.

Please let me know when the binary is ready.

Greetings,
Sandeep.

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/32264
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/32264
Committed: http://github.com/Juniper/contrail-controller/commit/b8bb87016fea60a4aa0e4e366d06c30c3285a1c5
Submitter: Zuul (<email address hidden>)
Branch: R3.1

commit b8bb87016fea60a4aa0e4e366d06c30c3285a1c5
Author: Manish <email address hidden>
Date: Tue May 30 14:36:34 2017 +0530

Few route's export not done from agent to CN.

This was happening in headless mode, where stale walk was fired after few
seconds and because of scale notify walk used to take longer than same.
Stale and notify walk were using same walker. When stale walk was started it
used to cancel notify walker in turn leaving few route entries in not exported
state.
Because of this issue like fmg-bum-tree not built, OVS mac not programmed in
vrouter(resulting in unknown unicast flood), QFX missing from BUM .

Solution:
Separate both the walks.

Change-Id: I47ca07b0d22aa361a36a01ea06df2eb5c6f628a6
Closes-bug: #1692795
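
The failure described in this commit message can be illustrated with a toy model. This is not the agent's C++ code: the `Walker` class, the `run` function, and the five-route table are hypothetical, and the only behaviour taken from the commit message is that starting the stale walk on a shared walker cancels the unfinished notify walk.

```python
class Walker:
    """Toy table walker: starting a new walk cancels the one in flight."""

    def __init__(self):
        self._pending = []    # entries still to visit
        self._callback = None

    def start(self, entries, callback):
        # Any unfinished walk is dropped here; this is the hazard.
        self._pending = list(entries)
        self._callback = callback

    def step(self):
        """Visit one entry; return False when nothing is left."""
        if not self._pending:
            return False
        self._callback(self._pending.pop(0))
        return True


def run(shared):
    routes = ["r%d" % i for i in range(5)]
    exported, stale_checked = set(), set()
    notify_walker = Walker()
    stale_walker = notify_walker if shared else Walker()

    notify_walker.start(routes, exported.add)
    notify_walker.step()
    notify_walker.step()  # slow notify walk: only 2 of 5 routes exported so far
    stale_walker.start(routes, stale_checked.add)  # stale timer fires
    while notify_walker.step() or stale_walker.step():
        pass
    return exported, stale_checked


print(sorted(run(shared=True)[0]))   # ['r0', 'r1']: r2..r4 never exported
print(sorted(run(shared=False)[0]))  # all five routes exported
```

With separate walkers, the stale walk no longer interferes with the notify walk, which matches the "Solution" line of the commit message.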

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/32307
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/32307
Committed: http://github.com/Juniper/contrail-controller/commit/1b2227d803912a05018ee6fc1ee6672c266fe261
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit 1b2227d803912a05018ee6fc1ee6672c266fe261
Author: Manish <email address hidden>
Date: Tue May 30 14:36:34 2017 +0530

Few route's export not done from agent to CN.

This was happening in headless mode, where stale walk was fired after few
seconds and because of scale notify walk used to take longer than same.
Stale and notify walk were using same walker. When stale walk was started it
used to cancel notify walker in turn leaving few route entries in not exported
state.
Because of this issue like fmg-bum-tree not built, OVS mac not programmed in
vrouter(resulting in unknown unicast flood), QFX missing from BUM .

Solution:
Separate both the walks.

Change-Id: I47ca07b0d22aa361a36a01ea06df2eb5c6f628a6
Closes-bug: #1692795
(cherry picked from commit b8bb87016fea60a4aa0e4e366d06c30c3285a1c5)

Manish Singh (manishs) wrote :

Provided a probable fix in a custom binary. Sandeep is sharing it and verifying it in the customer setup.

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/32636
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/32636
Committed: http://github.com/Juniper/contrail-controller/commit/8a922e38bfd1e6b2a720d4293fcdf87002cb5a26
Submitter: Zuul (<email address hidden>)
Branch: R3.1

commit 8a922e38bfd1e6b2a720d4293fcdf87002cb5a26
Author: Manish <email address hidden>
Date: Thu Jun 8 01:17:30 2017 +0530

BUM tree subscription skipped.

With the introduction of same RD for Tor agent for vrf, change of vnid was not
handled. In cases where a vn was added with vnid as not set or 0 and then
updated later with non zero value, vrf was not notified.
This used to result in skipping of notify registration for the problematic vrf.
In turn subscriptions were missed.

Solution:
Handle vnid change.

Change-Id: I7ea7332eb1e64cb0e534f992ef5c8680a54b0313
Closes-bug: #1692795
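
The second fix can be pictured with a small model of the add-then-update sequence. It is a sketch under stated assumptions: the `Vrf` and `ExportState` classes, the `handle_vnid_change` flag, and the VRF name are hypothetical, and the real change lives in the agent's C++ sources.

```python
class Vrf:
    def __init__(self, name, vnid=0):
        self.name = name
        self.vnid = vnid  # 0 means "not set"


class ExportState:
    """Toy listener that subscribes a VRF once it has a non-zero VNID."""

    def __init__(self, handle_vnid_change):
        self.handle_vnid_change = handle_vnid_change
        self.subscribed = set()

    def notify(self, vrf):
        # A VRF without a VNID is skipped.
        if vrf.vnid:
            self.subscribed.add(vrf.name)

    def add_vrf(self, vrf):
        self.notify(vrf)

    def set_vnid(self, vrf, vnid):
        changed = vrf.vnid != vnid
        vrf.vnid = vnid
        if changed and self.handle_vnid_change:
            self.notify(vrf)  # the step that was missing before the fix


def subscribed_after_update(handle_vnid_change):
    state = ExportState(handle_vnid_change)
    vrf = Vrf("vn-4338")          # VN added with VNID not set
    state.add_vrf(vrf)
    state.set_vnid(vrf, 4338)     # VNID updated later to a non-zero value
    return sorted(state.subscribed)


print(subscribed_after_update(False))  # []
print(subscribed_after_update(True))   # ['vn-4338']
```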

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/32700
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/32700
Committed: http://github.com/Juniper/contrail-controller/commit/bb5a441987b02c1356a7f5b9c0a944df73fcd2e3
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit bb5a441987b02c1356a7f5b9c0a944df73fcd2e3
Author: Manish <email address hidden>
Date: Thu Jun 8 01:17:30 2017 +0530

BUM tree subscription skipped.

With the introduction of same RD for Tor agent for vrf, change of vnid was not
handled. In cases where a vn was added with vnid as not set or 0 and then
updated later with non zero value, vrf was not notified.
This used to result in skipping of notify registration for the problematic vrf.
In turn subscriptions were missed.

Solution:
Handle vnid change.

Change-Id: I7ea7332eb1e64cb0e534f992ef5c8680a54b0313
Closes-bug: #1692795
(cherry picked from commit 8a922e38bfd1e6b2a720d4293fcdf87002cb5a26)

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/33684
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.0

Review in progress for https://review.opencontrail.org/33685
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/33685
Committed: http://github.com/Juniper/contrail-controller/commit/72efe8e70cc65b6029d3e7f322003daadd36ec64
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit 72efe8e70cc65b6029d3e7f322003daadd36ec64
Author: Manish <email address hidden>
Date: Thu Jun 8 01:17:30 2017 +0530

BUM tree subscription skipped.

With the introduction of same RD for Tor agent for vrf, change of vnid was not
handled. In cases where a vn was added with vnid as not set or 0 and then
updated later with non zero value, vrf was not notified.
This used to result in skipping of notify registration for the problematic vrf.
In turn subscriptions were missed.

Solution:
Handle vnid change.

Closes-bug: #1692795

Conflicts:
 src/vnsw/agent/controller/controller_vrf_export.cc
 src/vnsw/agent/oper/agent_route_walker.cc
 src/vnsw/agent/oper/vrf.cc
 src/vnsw/agent/oper/vrf.h

Conflicts:
 src/vnsw/agent/oper/vrf.h
Change-Id: I7ea7332eb1e64cb0e534f992ef5c8680a54b0313

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/33684
Committed: http://github.com/Juniper/contrail-controller/commit/c0fbc7229e2ee00935e19e02b869da73ebd11da2
Submitter: Zuul (<email address hidden>)
Branch: master

commit c0fbc7229e2ee00935e19e02b869da73ebd11da2
Author: Manish <email address hidden>
Date: Thu Jun 8 01:17:30 2017 +0530

BUM tree subscription skipped.

With the introduction of same RD for Tor agent for vrf, change of vnid was not
handled. In cases where a vn was added with vnid as not set or 0 and then
updated later with non zero value, vrf was not notified.
This used to result in skipping of notify registration for the problematic vrf.
In turn subscriptions were missed.

Solution:
Handle vnid change.

Closes-bug: #1692795

Conflicts:
 src/vnsw/agent/controller/controller_vrf_export.cc
 src/vnsw/agent/oper/agent_route_walker.cc
 src/vnsw/agent/oper/vrf.cc
 src/vnsw/agent/oper/vrf.h
Change-Id: I7ea7332eb1e64cb0e534f992ef5c8680a54b0313

Jeba Paulaiyan (jebap)
description: updated