contrail-control crash in virtual BgpPeer::~BgpPeer

Bug #1547178 reported by vageesan
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
R2.20
Fix Committed
High
Prakash Bailkeri
R2.21.x
Fix Committed
High
Prakash Bailkeri
R2.22.x
Fix Committed
High
Prakash Bailkeri
R3.0
Fix Committed
High
Prakash Bailkeri
Trunk
Fix Committed
High
Prakash Bailkeri

Bug Description

contrail-control crash with following backtrace in solution test run.

mainline 3.0 , 2713 kilo

core is in 10.84.5.112:/cs-shared/bugs/<bug-id>/

#0 0x00007f71a71cbcc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f71a71cf0d8 in __GI_abort () at abort.c:89
#2 0x00007f71a71c4b86 in __assert_fail_base (
    fmt=0x7f71a7315830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0xcafc0a "GetRefCount() == 0",
    file=file@entry=0xcafd40 "controller/src/bgp/bgp_peer.cc", line=line@entry=421,
    function=function@entry=0xcb0cf0 <BgpPeer::~BgpPeer()::__PRETTY_FUNCTION__> "virtual BgpPeer::~BgpPeer()") at assert.c:92
#3 0x00007f71a71c4c32 in __GI___assert_fail (assertion=0xcafc0a "GetRefCount() == 0",
    file=0xcafd40 "controller/src/bgp/bgp_peer.cc", line=421,
    function=0xcb0cf0 <BgpPeer::~BgpPeer()::__PRETTY_FUNCTION__> "virtual BgpPeer::~BgpPeer()")
    at assert.c:101
#4 0x00000000007da8ce in BgpPeer::~BgpPeer (this=0x7f712004ae30, __in_chrg=<optimized out>)
    at controller/src/bgp/bgp_peer.cc:421
#5 0x00000000007da8e9 in BgpPeer::~BgpPeer (this=0x7f712004ae30, __in_chrg=<optimized out>)
    at controller/src/bgp/bgp_peer.cc:430
#6 0x00000000005d0fa0 in PeerManager::DestroyIPeer (this=0x1e4d5a0, ipeer=0x7f712004ae30)
    at controller/src/bgp/routing-instance/peer_manager.cc:113
#7 0x0000000000c27757 in LifetimeManager::DeleteExecutor (this=<optimized out>, actor_ref=...)
    at controller/src/base/lifetime.cc:229
#8 0x0000000000c27ad0 in operator() (a1=..., p=<optimized out>, this=<optimized out>)
    at /usr/include/boost/bind/mem_fn_template.hpp:165
#9 operator()<bool, boost::_mfi::mf1<bool, LifetimeManager, LifetimeManager::LifetimeActorRef>, boost::_bi::list1<LifetimeManager::LifetimeActorRef&> > (a=<synthetic pointer>, f=...,
    this=<optimized out>) at /usr/include/boost/bind/bind.hpp:303
#10 operator()<LifetimeManager::LifetimeActorRef> (a1=<synthetic pointer>, this=<optimized out>)
---Type <return> to continue, or q <return> to quit---

Revision history for this message
Ashish Ranjan (aranjan-n) wrote :

Happens once in a while for some time.. Need to narrow down.

Nischal Sheth (nsheth)
information type: Proprietary → Public
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/17962
Submitter: Prakash Bailkeri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.0

Review in progress for https://review.opencontrail.org/17963
Submitter: Prakash Bailkeri (<email address hidden>)

Revision history for this message
Prakash Bailkeri (prakashmb) wrote :

From the core file it is clear that DBTablePartition change list for bgp.l3vpn.0 table has 285 entries but the DBPartition change_list_ has 0 entries for partition 0.
This would mean that DB notification to notify the delete of bgp.l3vpn.0 routes will not be kicked in. This causes the secondary paths/routes in the inet table to stay forever. These secondary paths are holding the reference to BgpPeer causing the crash.

Possible root cause is parallel access of DBPartition from multiple static route task as it can run in parallel and notify/change/delete/add routes in same/different DBPartition.

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/17962
Committed: http://github.org/Juniper/contrail-controller/commit/d2df260c66908d76b1543a2ebcd66e5c99317338
Submitter: Zuul
Branch: master

commit d2df260c66908d76b1543a2ebcd66e5c99317338
Author: Prakash Bailkeri <email address hidden>
Date: Sat Feb 27 23:18:27 2016 +0530

Static route task instance

Root cause of the crash:
The BgpPeer is attempted to be deleted with active secondary path referring to it.
The secondary paths are not deleted as deleted bgp.l3vpn.0 routes are not notified.
Note. These routes are present in the changes list of DBTablePartition but the
DBPartition Change list is empty. So no DBTable task was scheduled to notify them.

This issue can be caused by static route task. It runs with instance ID of rtinstance->index()
There are two problems:
1. Routing instance Index is not allocated when static route mgr is constructed.
So all tasks run with instance ID -1. So they can all run in parallel. Since these tasks
add/delete/notify entries in DBTabel, They can edit same dbpartition change list.

2. Even if we fix it with correct index, they will have different instance ID and still can run in parallel.

Proposed Fix:
Run static route task with instance of 0. Also, note that task policy of control-node
ensures that DBTable task and staticRoute task doesn't run in parallel. This will
ensure required mutual exclusion in editing the change list of DBPartition

Added a test to add mutliple l3vpn routes and two static route config on different
routing instance.

Change-Id: I47166c2a7c98052c76cf2044d13a646fbe588baa
Closes-Bug: 1547178

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/17963
Committed: http://github.org/Juniper/contrail-controller/commit/90c01cb40911991b5298f21cbd3262193eaafd2d
Submitter: Zuul
Branch: R3.0

commit 90c01cb40911991b5298f21cbd3262193eaafd2d
Author: Prakash Bailkeri <email address hidden>
Date: Sat Feb 27 23:18:27 2016 +0530

Static route task instance

Root cause of the crash:
The BgpPeer is attempted to be deleted with active secondary path referring to it.
The secondary paths are not deleted as deleted bgp.l3vpn.0 routes are not notified.
Note. These routes are present in the changes list of DBTablePartition but the
DBPartition Change list is empty. So no DBTable task was scheduled to notify them.

This issue can be caused by static route task. It runs with instance ID of rtinstance->index()
There are two problems:
1. Routing instance Index is not allocated when static route mgr is constructed.
So all tasks run with instance ID -1. So they can all run in parallel. Since these tasks
add/delete/notify entries in DBTabel, They can edit same dbpartition change list.

2. Even if we fix it with correct index, they will have different instance ID and still can run in parallel.

Proposed Fix:
Run static route task with instance of 0. Also, note that task policy of control-node
ensures that DBTable task and staticRoute task doesn't run in parallel. This will
ensure required mutual exclusion in editing the change list of DBPartition

Added a test to add mutliple l3vpn routes and two static route config on different
routing instance.

Change-Id: I47166c2a7c98052c76cf2044d13a646fbe588baa
Closes-Bug: 1547178
(cherry picked from commit d2df260c66908d76b1543a2ebcd66e5c99317338)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/18099
Submitter: Prakash Bailkeri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22.x

Review in progress for https://review.opencontrail.org/18100
Submitter: Prakash Bailkeri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/18100
Committed: http://github.org/Juniper/contrail-controller/commit/06afb1b3bd4c5223cfa76d06eaf0b9900ef8eb8d
Submitter: Zuul
Branch: R2.22.x

commit 06afb1b3bd4c5223cfa76d06eaf0b9900ef8eb8d
Author: Prakash Bailkeri <email address hidden>
Date: Wed Mar 2 16:07:35 2016 +0530

Static route task instance

Root cause:

StaticRoute task runs with instance ID of rtinstance->index() which allows
multiple static route tasks to run in parallel.

There are 3 major issues related to this concurrency
1. StaticRoute task adds/deletes/updates routes to DBtable. So concurrent tasks
could potentially operate on same DBTablePartition. Possibly corrupt the chagne_list

2. RemoveStaticRouteMgr called on BgpServer object corrupts the srt_manager_list_

3. Concurrent call to BgpConditionListener::UnregisterMatchCondition can corrupt
the condition listener datastructure.(TableMap map_).

Proposed Fix:

Run static route task with instance of 0.
Also, note that task policy of control-node ensures that
1. DBTable task and staticRoute task doesn't run in parallel.
2. bgp:Config task and StaticRoute tasks are mutually exclusive

Change-Id: Ib5ee4060ea90947a8261f33fa194219205702163
Closes-Bug: 1547178
(cherry picked from commit 2dc582a813a7f9ca0dbc3ee0940f449b3c0a9b1f)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.21.x

Review in progress for https://review.opencontrail.org/18241
Submitter: Prakash Bailkeri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/18241
Committed: http://github.org/Juniper/contrail-controller/commit/c33f23845d7921c69e58ba4bea3d12ad95b71709
Submitter: Zuul
Branch: R2.21.x

commit c33f23845d7921c69e58ba4bea3d12ad95b71709
Author: Prakash Bailkeri <email address hidden>
Date: Wed Mar 2 16:07:35 2016 +0530

Static route task instance

Root cause:

StaticRoute task runs with instance ID of rtinstance->index() which allows
multiple static route tasks to run in parallel.

There are 3 major issues related to this concurrency
1. StaticRoute task adds/deletes/updates routes to DBtable. So concurrent tasks
could potentially operate on same DBTablePartition. Possibly corrupt the chagne_list

2. RemoveStaticRouteMgr called on BgpServer object corrupts the srt_manager_list_

3. Concurrent call to BgpConditionListener::UnregisterMatchCondition can corrupt
the condition listener datastructure.(TableMap map_).

Proposed Fix:

Run static route task with instance of 0.
Also, note that task policy of control-node ensures that
1. DBTable task and staticRoute task doesn't run in parallel.
2. bgp:Config task and StaticRoute tasks are mutually exclusive

Change-Id: Ib5ee4060ea90947a8261f33fa194219205702163
Closes-Bug: 1547178
(cherry picked from commit 2dc582a813a7f9ca0dbc3ee0940f449b3c0a9b1f)
(cherry picked from commit 06afb1b3bd4c5223cfa76d06eaf0b9900ef8eb8d)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/18099
Committed: http://github.org/Juniper/contrail-controller/commit/2dc582a813a7f9ca0dbc3ee0940f449b3c0a9b1f
Submitter: Zuul
Branch: R2.20

commit 2dc582a813a7f9ca0dbc3ee0940f449b3c0a9b1f
Author: Prakash Bailkeri <email address hidden>
Date: Wed Mar 2 16:07:35 2016 +0530

Static route task instance

Root cause:

StaticRoute task runs with instance ID of rtinstance->index() which allows
multiple static route tasks to run in parallel.

There are 3 major issues related to this concurrency
1. StaticRoute task adds/deletes/updates routes to DBtable. So concurrent tasks
could potentially operate on same DBTablePartition. Possibly corrupt the chagne_list

2. RemoveStaticRouteMgr called on BgpServer object corrupts the srt_manager_list_

3. Concurrent call to BgpConditionListener::UnregisterMatchCondition can corrupt
the condition listener datastructure.(TableMap map_).

Proposed Fix:

Run static route task with instance of 0.
Also, note that task policy of control-node ensures that
1. DBTable task and staticRoute task doesn't run in parallel.
2. bgp:Config task and StaticRoute tasks are mutually exclusive

Change-Id: Ib5ee4060ea90947a8261f33fa194219205702163
Closes-Bug: 1547178

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.