[R2.20 Build 10 Juno] TOR Scale: Vrouter Agent crash during configuration of logical interface

Bug #1452362 reported by chhandak
This bug affects 4 people
Affects             Status          Importance   Assigned to    Milestone
Juniper Openstack   Status tracked in Trunk
  R2.20             Fix Committed   Medium       Manish Singh
  Trunk             Fix Committed   Medium       Manish Singh

Bug Description

Observed this crash while the scale script was configuring logical interfaces.

Log
-----
m-config:bng-contrail-qfx51-6:xe-0/0/48__1:xe-0/0/48__1.249> Table <__ifmap__.logical_interface.0>. Converting to DELETE
2015-05-06 Wed 21:37:19:265.378 IST nodei10 [Thread 140045748549376, Pid 15637]: ID-PERM not set for object <default-global-system-config:bng-contrail-qfx51-6:xe-0/0/48__1:xe-0/0/48__1.250> Table <__ifmap__.logical_interface.0>. Converting to DELETE
2015-05-06 Wed 21:37:22:850.011 IST nodei10 [Thread 140045512062720, Pid 15637]: ID-PERM not set for object <default-global-system-config:bng-contrail-qfx51-6:xe-0/0/48__1:xe-0/0/48__1.251> Table <__ifmap__.logical_interface.0>. Converting to DELETE
2015-05-06 Wed 21:37:26:319.450 IST nodei10 [Thread 140045491070720, Pid 15637]: ID-PERM not set for object <default-global-system-config:bng-contrail-qfx51-6:xe-0/0/48__1:xe-0/0/48__1.252> Table <__ifmap__.logical_interface.0>. Converting to DELETE

BT
----
#0 0x00007ffcd2b7bcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffcd2b7f0d8 in __GI_abort () at abort.c:89
#2 0x00007ffcd2b74b86 in __assert_fail_base (fmt=0x7ffcd2cc5830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0xfcd6b5 "0", file=file@entry=0xfdac40 "controller/src/vnsw/agent/oper/vrf.cc",
    line=line@entry=314, function=function@entry=0xfdae00 "bool VrfEntry::DeleteTimeout()") at assert.c:92
#3 0x00007ffcd2b74c32 in __GI___assert_fail (assertion=0xfcd6b5 "0", file=0xfdac40 "controller/src/vnsw/agent/oper/vrf.cc",
    line=314, function=0xfdae00 "bool VrfEntry::DeleteTimeout()") at assert.c:101
#4 0x00000000009c160d in VrfEntry::DeleteTimeout() ()
#5 0x0000000000f98ca9 in Timer::TimerTask::Run() ()
#6 0x0000000000f90580 in TaskImpl::execute() ()
#7 0x00007ffcd374ab3a in ?? () from /usr/lib/libtbb.so.2
#8 0x00007ffcd3746816 in ?? () from /usr/lib/libtbb.so.2
#9 0x00007ffcd3745f4b in ?? () from /usr/lib/libtbb.so.2
#10 0x00007ffcd37420ff in ?? () from /usr/lib/libtbb.so.2
#11 0x00007ffcd37422f9 in ?? () from /usr/lib/libtbb.so.2
#12 0x00007ffcd3966182 in start_thread (arg=0x7ffcac6f1700) at pthread_create.c:312
#13 0x00007ffcd2c3f47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

Setup
---------
env.roledefs = {
    'all': [host1, host2, host3, host4, host5, host6],
    'cfgm': [host1, host2, host3],
    'openstack': [host1, host2, host3],
    'webui': [host2],
    'control': [host1, host3],
    'compute': [host4, host5, host6],
    'tsn': [host4, host5],
    'toragent': [host4, host5],
    'collector': [host1, host3],
    'database': [host1, host2, host3],
    'build': [host_build],
}

env.hostnames = {
    'all': ['nodei6', 'nodei7', 'nodei8', 'nodei9', 'nodei10', 'nodei19']
}

Tags: bms scale vrouter
Hari Prasad Killi (haripk) wrote :

The following composite NHs are not deleted and still refer to the deleted VRFs:

0x7ffca4023f20 default-domain:Test1:vn-test1-1000:vn-test1-1000 idx=1 ref_count=5 flags=2
p *(NextHop *) 0x7ffc9401f420 (Composite::L2COMP)
p *(NextHop *) 0x7ffc600867c0 (Composite::L2COMP)
p *(NextHop *) 0x7ffc70005070 (Composite::EVPN)
p *(NextHop *) 0x7ffc4c005660 (Composite::TOR)
p *(NextHop *) 0x7ffc70015890 (Composite::TOR)

0x7ffc88024010 default-domain:Test1:vn-test1-1002:vn-test1-1002 idx=3 ref_count=3 flags=2
The following composite NHs are present:
0x7ffc48001bb0 (Composite::L2COMP)
0x7ffcb00078b0 (Composite::TOR)
0x7ffcb001aff0 (Composite::EVPN)
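
One plausible reading of the gdb output above, sketched as a toy C++ model. It assumes that each composite NH holds a reference on its VRF (the ref_count values shown, 5 and 3, match the number of NHs listed for each VRF) and that the VRF delete timer asserts if the VRF is still referenced when it fires, which is consistent with the assertion in VrfEntry::DeleteTimeout() in the backtrace. VrfModel, CompositeNhModel and ReleasedBeforeTimeout are hypothetical names, not the agent's real classes or functions.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a VRF entry; not the agent's real class.
struct VrfModel {
    std::string name;
    int ref_count = 0;           // number of objects still pointing at this VRF
    bool delete_marked = false;  // set once the VRF delete has been requested
};

// Hypothetical composite next hop: it holds a reference on its VRF for as
// long as it exists (assumption made for this sketch).
class CompositeNhModel {
public:
    CompositeNhModel(std::string type, VrfModel *vrf)
        : type_(std::move(type)), vrf_(vrf) {
        ++vrf_->ref_count;
    }
    ~CompositeNhModel() { --vrf_->ref_count; }
    CompositeNhModel(const CompositeNhModel &) = delete;
    CompositeNhModel &operator=(const CompositeNhModel &) = delete;

private:
    std::string type_;
    VrfModel *vrf_;
};

// The check a delete timer could make when it fires: the VRF counts as
// released only if it was marked for delete and nothing references it.
// In this model, a false result corresponds to the assert in the backtrace.
bool ReleasedBeforeTimeout(const VrfModel &vrf) {
    return vrf.delete_marked && vrf.ref_count == 0;
}
```

In this model, a VRF marked for delete while three composite NHs (L2COMP, TOR, EVPN) are still alive fails the check; once those NHs are destroyed, ref_count drops to 0 and the check passes.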

Changed in juniperopenstack:
assignee: nobody → Manish Singh (manishs)
Changed in juniperopenstack:
importance: Undecided → High
information type: Proprietary → Public
tags: added: scale
Hari Prasad Killi (haripk) wrote :

Trying to reproduce the issue in order to debug the scenario.

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/12127
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/12297
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/12297
Committed: http://github.org/Juniper/contrail-controller/commit/674b7ed3f366cd1aa0e2d9e23091b327af9f5fcb
Submitter: Zuul
Branch: master

commit 674b7ed3f366cd1aa0e2d9e23091b327af9f5fcb
Author: Manish <email address hidden>
Date: Wed Jul 1 15:27:58 2015 +0530

Dangling EVPN allocated MPLS label.

Problem:
Multicast allocates an MPLS label for use in EVPN. This label is deleted once
all paths in the flood route go away. The label is stored in the multicast
object created for the flood route. In the problematic scenario, the multicast
object was deleted, either because all VMs went away (in a regular agent) or
via the VN (in a TSN). As a result, an unsubscribe is sent to the control node
(CN), which in turn issues a withdraw of the route with the CN as peer.
However, when this request reaches the multicast handler, the handler searches
for the multicast object and, if it is not found, ignores the request. Hence
the path stays pending in the route and the label remains active.
Now the multicast object is added again, because a VM comes up or a VN
subscription is received in the TSN. This results in a new EVPN label
allocation, and the route is exported to the CN with the new label. The CN
sends the EVPN olist with the new label. On receiving it, multicast searches
for the object and finds the new one. It picks the new label and sends it for
path addition in the route. This updates the old path, which was never deleted
for the peer CN. The update also changes the label in the EVPN path, so the
last context of the old MPLS label is lost. This leaves a dangling MPLS DB
entry.

Solution:
If the multicast object is null and a delete request is received, pass 0 as the
label, since the label value does not matter for a delete.

Change-Id: I75f5a3e12d312500259532fe95c543fa567eaeed
Closes-bug: 1452362
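
The sequence described in the commit message can be replayed with a small C++ model. This is only an illustrative sketch, not the code from the change: FloodRouteModel, Replay, the "local" and "CN" peer keys, and the starting label value 16 are all hypothetical, and the model assumes a single flood route whose label is freed when its last path is removed.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Toy model of one flood route; every name here is hypothetical.
class FloodRouteModel {
public:
    explicit FloodRouteModel(bool apply_fix) : apply_fix_(apply_fix) {}

    // A VM comes up (regular agent) or a VN subscription arrives (TSN):
    // the multicast object is created and a fresh EVPN label is allocated.
    void AddObject() {
        has_object_ = true;
        object_label_ = next_label_++;
        allocated_.insert(object_label_);
        paths_["local"] = object_label_;
    }

    // The multicast object goes away and its local path is removed.
    void DeleteObject() {
        has_object_ = false;
        DeletePath("local", 0);
    }

    // The CN sends an EVPN olist: add or update the CN path using the label
    // of whichever multicast object exists right now.
    void OnCnOlist() {
        if (has_object_) paths_["CN"] = object_label_;
    }

    // The CN withdraws the route.
    void OnCnWithdraw() {
        if (!has_object_ && !apply_fix_) return;  // old behaviour: ignored
        DeletePath("CN", 0);  // the label is irrelevant for a delete, pass 0
    }

    // Labels that are still allocated but no longer carried by any path.
    std::set<std::uint32_t> DanglingLabels() const {
        std::set<std::uint32_t> dangling = allocated_;
        for (const auto &entry : paths_) dangling.erase(entry.second);
        return dangling;
    }

private:
    // Remove a peer's path; free the label once the last path is gone.
    void DeletePath(const std::string &peer, std::uint32_t /*label, unused*/) {
        auto it = paths_.find(peer);
        if (it == paths_.end()) return;
        const std::uint32_t label = it->second;
        paths_.erase(it);
        if (paths_.empty()) {
            assert(allocated_.count(label) == 1);
            allocated_.erase(label);
        }
    }

    bool apply_fix_;
    bool has_object_ = false;
    std::uint32_t object_label_ = 0;
    std::uint32_t next_label_ = 16;  // arbitrary starting value for the sketch
    std::map<std::string, std::uint32_t> paths_;
    std::set<std::uint32_t> allocated_;
};

// Replays: object added, CN olist, object deleted, CN withdraw,
// object added again, CN olist with the new label.
std::set<std::uint32_t> Replay(bool apply_fix) {
    FloodRouteModel route(apply_fix);
    route.AddObject();
    route.OnCnOlist();
    route.DeleteObject();
    route.OnCnWithdraw();
    route.AddObject();
    route.OnCnOlist();
    return route.DanglingLabels();
}
```

With the withdraw ignored, Replay(false) leaves the first label (16 in this model) allocated but unreferenced. With the withdraw honoured, Replay(true) frees the first label before the second one is allocated and returns an empty set.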

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/12127
Committed: http://github.org/Juniper/contrail-controller/commit/b88973dd9852cf8bc975a9fb002d0e64f7d6d7d5
Submitter: Zuul
Branch: R2.20

commit b88973dd9852cf8bc975a9fb002d0e64f7d6d7d5
Author: Manish <email address hidden>
Date: Wed Jul 1 15:27:58 2015 +0530

Dangling EVPN allocated MPLS label.

Problem:
Multicast allocates an MPLS label for use in EVPN. This label is deleted once
all paths in the flood route go away. The label is stored in the multicast
object created for the flood route. In the problematic scenario, the multicast
object was deleted, either because all VMs went away (in a regular agent) or
via the VN (in a TSN). As a result, an unsubscribe is sent to the control node
(CN), which in turn issues a withdraw of the route with the CN as peer.
However, when this request reaches the multicast handler, the handler searches
for the multicast object and, if it is not found, ignores the request. Hence
the path stays pending in the route and the label remains active.
Now the multicast object is added again, because a VM comes up or a VN
subscription is received in the TSN. This results in a new EVPN label
allocation, and the route is exported to the CN with the new label. The CN
sends the EVPN olist with the new label. On receiving it, multicast searches
for the object and finds the new one. It picks the new label and sends it for
path addition in the route. This updates the old path, which was never deleted
for the peer CN. The update also changes the label in the EVPN path, so the
last context of the old MPLS label is lost. This leaves a dangling MPLS DB
entry.

Solution:
If the multicast object is null and a delete request is received, pass 0 as the
label, since the label value does not matter for a delete.

Change-Id: I75f5a3e12d312500259532fe95c543fa567eaeed
Closes-bug: 1452362

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22-dev

Review in progress for https://review.opencontrail.org/13927
Submitter: Vinay Vithal Mahuli (<email address hidden>)

dshimo (dshimo)
information type: Public → Private
information type: Private → Public