[R2.20 Build 21] Tor Agent crash @ ListenerInfo::Unregister(int)::__PRETTY_FUNCTION__>

Bug #1455862 reported by chhandak
This bug affects 2 people
Affects            Status                   Importance  Assigned to   Milestone
Juniper Openstack  Status tracked in Trunk
  R2.20            Fix Committed            High        Manish Singh
  Trunk            Fix Committed            High        Manish Singh

Bug Description

Trigger
-----------
Restart of the control node when the system has the following profile:
VN 8K
LIF 16K
VMI 32K

Backtrace
-----------------
(gdb) bt
#0 0x00007f45ffce1cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f45ffce50d8 in __GI_abort () at abort.c:89
#2 0x00007f45ffcdab86 in __assert_fail_base (fmt=0x7f45ffe2b830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x1a299b3 "state_count_[listener] == 0", file=file@entry=0x1a29961 "controller/src/db/db_table.cc", line=line@entry=89,
    function=function@entry=0x1a29f00 <DBTableBase::ListenerInfo::Unregister(int)::__PRETTY_FUNCTION__> "void DBTableBase::ListenerInfo::Unregister(DBTableBase::ListenerId)") at assert.c:92
#3 0x00007f45ffcdac32 in __GI___assert_fail (assertion=0x1a299b3 "state_count_[listener] == 0", file=0x1a29961 "controller/src/db/db_table.cc", line=89,
    function=0x1a29f00 <DBTableBase::ListenerInfo::Unregister(int)::__PRETTY_FUNCTION__> "void DBTableBase::ListenerInfo::Unregister(DBTableBase::ListenerId)")
    at assert.c:101
#4 0x00000000017c7584 in DBTableBase::ListenerInfo::Unregister (this=0x7f45f401bc30, listener=2) at controller/src/db/db_table.cc:89
#5 0x00000000017c6348 in DBTableBase::Unregister (this=0x7f45f401b9f0, listener=2) at controller/src/db/db_table.cc:173
#6 0x0000000001254d52 in BgpPeer::~BgpPeer (this=0x7f458c00ae70, __in_chrg=<optimized out>) at controller/src/vnsw/agent/oper/peer.cc:42
#7 0x0000000001254dd4 in BgpPeer::~BgpPeer (this=0x7f458c00ae70, __in_chrg=<optimized out>) at controller/src/vnsw/agent/oper/peer.cc:44
#8 0x00000000013ee70d in boost::checked_delete<BgpPeer> (x=0x7f458c00ae70) at /usr/include/boost/checked_delete.hpp:34
#9 0x00000000013ef9c0 in boost::detail::sp_counted_impl_p<BgpPeer>::dispose (this=0x7f458c000d20)
    at /usr/include/boost/smart_ptr/detail/sp_counted_impl.hpp:78
#10 0x0000000000fe446c in boost::detail::sp_counted_base::release (this=0x7f458c000d20) at /usr/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:146
#11 0x0000000000fe44e5 in boost::detail::shared_count::~shared_count (this=0x7f45f471dde8, __in_chrg=<optimized out>)
    at /usr/include/boost/smart_ptr/detail/shared_count.hpp:371
#12 0x00000000013c1c16 in boost::shared_ptr<BgpPeer>::~shared_ptr (this=0x7f45f471dde0, __in_chrg=<optimized out>)
    at /usr/include/boost/smart_ptr/shared_ptr.hpp:328
#13 0x00000000013c1c34 in __gnu_cxx::new_allocator<boost::shared_ptr<BgpPeer> >::destroy (this=0x7f45f0ff2787, __p=0x7f45f471dde0)
    at /usr/include/c++/4.8/ext/new_allocator.h:133
#14 0x00000000013c1902 in std::list<boost::shared_ptr<BgpPeer>, std::allocator<boost::shared_ptr<BgpPeer> > >::_M_erase (this=0x7f45f4003448, __position=...)
    at /usr/include/c++/4.8/bits/stl_list.h:1575
#15 0x00000000013c0952 in std::list<boost::shared_ptr<BgpPeer>, std::allocator<boost::shared_ptr<BgpPeer> > >::remove (this=0x7f45f4003448, __value=...)
    at /usr/include/c++/4.8/bits/list.tcc:261
#16 0x00000000013be11a in VNController::ControllerPeerHeadlessAgentDelDone (this=0x7f45f4003430, bgp_peer=0x7f458c00ae70)
    at controller/src/vnsw/agent/controller/controller_init.cc:586
#17 0x00000000013be543 in VNController::ControllerWorkQueueProcess (this=0x7f45f4003430, data=...)
    at controller/src/vnsw/agent/controller/controller_init.cc:680
#18 0x00000000013c4b52 in boost::_mfi::mf1<bool, VNController, boost::shared_ptr<ControllerWorkQueueData> >::operator() (this=0x7f45f0ff2a78,
    p=0x7f45f4003430, a1=...) at /usr/include/boost/bind/mem_fn_template.hpp:165
#19 0x00000000013c4042 in boost::_bi::list2<boost::_bi::value<VNController*>, boost::arg<1> >::operator()<bool, boost::_mfi::mf1<bool, VNController, boost::shared_ptr<ControllerWorkQueueData> >, boost::_bi::list1<boost::shared_ptr<ControllerWorkQueueData>&> > (this=0x7f45f0ff2a88, f=..., a=...)
    at /usr/include/boost/bind/bind.hpp:303
#20 0x00000000013c35e4 in boost::_bi::bind_t<bool, boost::_mfi::mf1<bool, VNController, boost::shared_ptr<ControllerWorkQueueData> >, boost::_bi::list2<boost::_bi::value<VNController*>, boost::arg<1> > >::operator()<boost::shared_ptr<ControllerWorkQueueData> > (this=0x7f45f0ff2a78, a1=...)
    at /usr/include/boost/bind/bind_template.hpp:32
#21 0x00000000013c2d12 in boost::detail::function::function_obj_invoker1<boost::_bi::bind_t<bool, boost::_mfi::mf1<bool, VNController, boost::shared_ptr<ControllerWorkQueueData> >, boost::_bi::list2<boost::_bi::value<VNController*>, boost::arg<1> > >, bool, boost::shared_ptr<ControllerWorkQueueData> >::invoke (
    function_obj_ptr=..., a0=...) at /usr/include/boost/function/function_template.hpp:132
#22 0x00000000013c57f4 in boost::function1<bool, boost::shared_ptr<ControllerWorkQueueData> >::operator() (this=0x7f45f0ff2a70, a0=...)
    at /usr/include/boost/function/function_template.hpp:767
#23 0x00000000013c54a0 in QueueTaskRunner<boost::shared_ptr<ControllerWorkQueueData>, WorkQueue<boost::shared_ptr<ControllerWorkQueueData> > >::RunQueue (
    this=0x7f45ec1b6af0) at controller/src/base/queue_task.h:74
#24 0x00000000013c52d4 in QueueTaskRunner<boost::shared_ptr<ControllerWorkQueueData>, WorkQueue<boost::shared_ptr<ControllerWorkQueueData> > >::Run (
    this=0x7f45ec1b6af0) at controller/src/base/queue_task.h:57
#25 0x000000000191f872 in TaskImpl::execute (this=0x7f45f9446c40) at controller/src/base/task.cc:232
#26 0x00007f46008b0b3a in ?? () from /usr/lib/libtbb.so.2
#27 0x00007f46008ac816 in ?? () from /usr/lib/libtbb.so.2
#28 0x00007f46008abf4b in ?? () from /usr/lib/libtbb.so.2
#29 0x00007f46008a80ff in ?? () from /usr/lib/libtbb.so.2
#30 0x00007f46008a82f9 in ?? () from /usr/lib/libtbb.so.2
#31 0x00007f4600acc182 in start_thread (arg=0x7f45f0ff3700) at pthread_create.c:312
#32 0x00007f45ffda547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
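
The failing assertion in frame #4, `state_count_[listener] == 0`, says that a listener may only be unregistered once it has no remaining per-entry state on the table. A minimal sketch of that invariant follows; `ListenerRegistry` and its methods are hypothetical names for illustration, not the contrail-controller classes, and the sketch returns false where the real code asserts.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of per-listener state bookkeeping.
// Not the actual DBTableBase::ListenerInfo implementation.
class ListenerRegistry {
public:
    // Registers a new listener and returns its id.
    int Register() {
        state_count_.push_back(0);
        active_.push_back(true);
        return static_cast<int>(state_count_.size()) - 1;
    }

    // A listener attached state to one table entry.
    void AddState(int id) { ++state_count_[static_cast<std::size_t>(id)]; }

    // A listener released state from one table entry.
    void ClearState(int id) {
        std::size_t i = static_cast<std::size_t>(id);
        if (state_count_[i] > 0) --state_count_[i];
    }

    // Unregistering is only safe when no state is left behind.
    bool CanUnregister(int id) const {
        std::size_t i = static_cast<std::size_t>(id);
        return active_[i] && state_count_[i] == 0;
    }

    // Returns false (instead of aborting) if state is still outstanding.
    bool Unregister(int id) {
        if (!CanUnregister(id)) return false;
        active_[static_cast<std::size_t>(id)] = false;
        return true;
    }

private:
    std::vector<int> state_count_;  // outstanding state per listener id
    std::vector<bool> active_;      // whether the listener id is registered
};
```

In this model, the crash corresponds to calling `Unregister` while a walk that should have cleared the listener's state had not actually finished clearing it.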

Tags: bms scale vrouter
chhandak (chhandak) wrote :
information type: Proprietary → Public
OpenContrail Admin (ci-admin-f) wrote : R2.20

Review in progress for https://review.opencontrail.org/10610
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/10610
Committed: http://github.org/Juniper/contrail-controller/commit/705165854bbdaff9c47a2a9443410eec53e4fb37
Submitter: Zuul
Branch: R2.20

commit 705165854bbdaff9c47a2a9443410eec53e4fb37
Author: Manish <email address hidden>
Date: Wed May 20 16:42:03 2015 +0530

VRF state not deleted in DelPeer walk.

The problem statement remains the same as in this commit:
https://github.com/Juniper/contrail-controller/commit/8e302fcb991c8f5d8f5defb85b9851f8cde5f479

However, the above commit does not solve the issue.
The walk count was incremented when the walk was enqueued, but when the walk
is processed it calls Cancel for any previously started walk. This Cancel
decrements the walk count, which defeats the purpose of moving the walk count
increment to enqueue in the above commit.
Also, consider a walk for a VRF that has four route tables. This should
result in a walk count of 5 (1 for the VRF table + 4 for the route tables). With
the above fix it will be 2 (1 for the VRF table + 1 for a route table). The fix
did not take into consideration that the route walk count needs to be
incremented for each route table.

Solution:
Use a separate enqueue walk count and restore walk_count as it was before
the above commit. Use both of them to check for walk done.

Closes-bug: 1455862
Change-Id: I8d96732375f649e70d6754cb6c5b34c24867ce0c
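
The two-counter idea in the commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the change: `WalkTracker` and its method names are hypothetical, and the point at which the enqueue count is decremented (when the VRF table walk begins) is an assumption made for this sketch.

```cpp
// Hypothetical model of "enqueue walk count" plus "walk_count".
// A walk is done only when nothing is queued and no table walk is in flight.
class WalkTracker {
public:
    // A walk request was placed on the work queue.
    void Enqueue() { ++enqueued_; }

    // A queued request starts: count the VRF table walk first, then drop
    // the enqueue count, so Done() never sees both counters at zero mid-start.
    void Begin() {
        ++walk_count_;
        if (enqueued_ > 0) --enqueued_;
    }

    // One more route table walk was started for the VRF being walked.
    void StartTable() { ++walk_count_; }

    // One table walk completed or was cancelled.
    void FinishTable() {
        if (walk_count_ > 0) --walk_count_;
    }

    int walk_count() const { return walk_count_; }
    int enqueued() const { return enqueued_; }

    // Both counters are consulted, as the commit message describes.
    bool Done() const { return enqueued_ == 0 && walk_count_ == 0; }

private:
    int enqueued_ = 0;    // requests queued but not yet started
    int walk_count_ = 0;  // table walks currently in flight
};
```

With one VRF and four route tables, `Begin()` followed by four `StartTable()` calls gives a walk count of 5, matching the figure in the commit message. If a second request is enqueued and the first walk is then cancelled, `walk_count` can drop to 0 while `Done()` still reports false, because the enqueue count is non-zero.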

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/11490
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/11496
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11496
Committed: http://github.org/Juniper/contrail-controller/commit/bf9e76793aab5870da98bf49b5fb6c6eed82625f
Submitter: Zuul
Branch: master

commit bf9e76793aab5870da98bf49b5fb6c6eed82625f
Author: Manish <email address hidden>
Date: Thu May 14 17:23:05 2015 +0530

Handle error from XMPP session send.

Problem:
If an XMPP channel send fails, which internally (at the TCP send) translates
into a deferred send operation, the agent controller module was treating this as
a failure. For a route update, as seen in the bug, any such deferral results in
the exported flag for the route being set to false. Because the flag is false,
the unsubscribe for this route is not sent when the route is later deleted.
However, the route update was only deferred and would have gone out after some
time, so the control node never gets the delete, which results in the issue
described in the bug.

Solution:
Ideally, every failed send should cause further requests to the control node to
be enqueued and replayed when the callback (writereadycb) is called.
That way the agent will not overload the socket.
However, as a quick fix, the error is not used to decide further operations
after the send is done. This assumes that the send will always be successful.
Currently the following messages are sent:
1) VM config sub/unsub
2) Config subscribe for agent
3) VRF sub/unsub
4) Route sub/unsub
A missing connection is taken care of by channel flap handling.

Change-Id: Ib6e0856b5c689b51209add4ab459b8bd2e952143
Closes-bug: 1453483
(cherry picked from commit 0a70915fd3bc1954154e657f59123d1a4597f2a4)
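
The "ideal" behaviour described in the commit above (queue further requests after a deferred send and replay them from the write-ready callback) was not what the quick fix implemented. A minimal sketch of how it might look, assuming a transport whose send returns false when the message was accepted but deferred; `DeferredSender`, `SendFn`, and `OnWriteReady` are hypothetical names, not the agent's XMPP channel API.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical wrapper around a transport send function.
// Assumption: send returning false means "accepted but deferred by the
// socket layer", so the message itself is not lost and is not re-sent.
class DeferredSender {
public:
    using SendFn = std::function<bool(const std::string&)>;

    explicit DeferredSender(SendFn send) : send_(std::move(send)) {}

    // Sends immediately unless the transport is blocked, in which case the
    // request is held back so the socket is not overloaded.
    void Send(const std::string& msg) {
        if (blocked_) {
            pending_.push_back(msg);
            return;
        }
        if (!send_(msg)) blocked_ = true;
    }

    // Called from the transport's write-ready callback: replay queued
    // requests in order until the transport defers again.
    void OnWriteReady() {
        blocked_ = false;
        while (!blocked_ && !pending_.empty()) {
            std::string msg = pending_.front();
            pending_.pop_front();
            if (!send_(msg)) blocked_ = true;
        }
    }

    std::size_t pending() const { return pending_.size(); }

private:
    SendFn send_;
    std::deque<std::string> pending_;  // requests held while blocked
    bool blocked_ = false;             // true after a deferred send
};
```

In this sketch a deferred send never marks the original request as failed, which is the property the bug hinges on: the route update still counts as exported, so the later unsubscribe is sent.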

VRF state not deleted in DelPeer walk.

The problem statement remains the same as in this commit:
https://github.com/Juniper/contrail-controller/commit/8e302fcb991c8f5d8f5defb85b9851f8cde5f479

However, the above commit does not solve the issue.
The walk count was incremented when the walk was enqueued, but when the walk
is processed it calls Cancel for any previously started walk. This Cancel
decrements the walk count, which defeats the purpose of moving the walk count
increment to enqueue in the above commit.
Also, consider a walk for a VRF that has four route tables. This should
result in a walk count of 5 (1 for the VRF table + 4 for the route tables). With
the above fix it will be 2 (1 for the VRF table + 1 for a route table). The fix
did not take into consideration that the route walk count needs to be
incremented for each route table.

Solution:
Use a separate enqueue walk count and restore walk_count as it was before
the above commit. Use both of them to check for walk done.

Closes-bug: 1455862
Change-Id: I8d96732375f649e70d6754cb6c5b34c24867ce0c
(cherry picked from commit 705165854bbdaff9c47a2a9443410eec53e4fb37)

Multicast route gets deleted when vxlan id is changed in configured mode

Problem:
In oper multicast, if the local peer vxlan-id is changed, an add was issued
for the route with the new vxlan and a delete was issued for the same route with
the old vxlan. Since the peer is local, the path search only compares the peer
and not the vxlan. This results in deletion of the local path and eventually of
the multicast route.

Solution:
There is no need to withdraw the path from the local peer on a vxlan id change.
Just trigger an update of the same. This will result in controller route _ex...
