[R2.20.72] Agent crash @ state_count_[listener] == 0

Bug #1480375 reported by chhandak
This bug affects 5 people
Affects              Status          Importance   Assigned to    Milestone
Juniper Openstack    Status tracked in Trunk
  R2.20              Fix Committed   High         Manish Singh
  Trunk              Fix Committed   High         Manish Singh

Bug Description

Observed this agent crash while rebooting a compute node.

The crash occurred in a scale setup with around 52k VNs and 104k VMIs. The original router module was replaced by a private router module provided by Divakar.

Backtrace
----------------
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-vrouter-agent'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f4635901cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007f4635901cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f46359050d8 in __GI_abort () at abort.c:89
#2 0x00007f46358fab86 in __assert_fail_base (fmt=0x7f4635a4b830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x109a417 "state_count_[listener] == 0", file=file@entry=0x109a3f9 "controller/src/db/db_table.cc",
    line=line@entry=89, function=function@entry=0x109a760 "void DBTableBase::ListenerInfo::Unregister(DBTableBase::ListenerId)") at assert.c:92
#3 0x00007f46358fac32 in __GI___assert_fail (assertion=0x109a417 "state_count_[listener] == 0",
    file=0x109a3f9 "controller/src/db/db_table.cc", line=89,
    function=0x109a760 "void DBTableBase::ListenerInfo::Unregister(DBTableBase::ListenerId)") at assert.c:101
#4 0x0000000000ec562f in DBTableBase::Unregister(int) ()
#5 0x0000000000997196 in BgpPeer::~BgpPeer() ()
#6 0x0000000000997219 in BgpPeer::~BgpPeer() ()
#7 0x00000000007ac4de in boost::detail::sp_counted_base::release() ()
#8 0x0000000000bc7bdb in VNController::ControllerPeerHeadlessAgentDelDone(BgpPeer*) ()
#9 0x0000000000bc9271 in VNController::ControllerWorkQueueProcess(boost::shared_ptr<ControllerWorkQueueData>) ()
#10 0x0000000000bcbb2c in boost::detail::function::function_obj_invoker1<boost::_bi::bind_t<bool, boost::_mfi::mf1<bool, VNController, boost::shared_ptr<ControllerWorkQueueData> >, boost::_bi::list2<boost::_bi::value<VNController*>, boost::arg<1> > >, bool, boost::shared_ptr<ControllerWorkQueueData> >::invoke(boost::detail::function::function_buffer&, boost::shared_ptr<ControllerWorkQueueData>) ()
#11 0x0000000000bce0f4 in QueueTaskRunner<boost::shared_ptr<ControllerWorkQueueData>, WorkQueue<boost::shared_ptr<ControllerWorkQueueData> > >::RunQueue() ()
#12 0x0000000000fc2270 in TaskImpl::execute() ()
#13 0x00007f46364d0b3a in ?? () from /usr/lib/libtbb.so.2
#14 0x00007f46364cc816 in ?? () from /usr/lib/libtbb.so.2
#15 0x00007f46364cbf4b in ?? () from /usr/lib/libtbb.so.2
#16 0x00007f46364c80ff in ?? () from /usr/lib/libtbb.so.2
#17 0x00007f46364c82f9 in ?? () from /usr/lib/libtbb.so.2
#18 0x00007f46366ec182 in start_thread (arg=0x7f462e385700) at pthread_create.c:312
#19 0x00007f46359c547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)

root@nodei9:~# vrouter --info
vRouter module version 2.20 (Built by ddivakar@nodeb6 on 2015-07-31 09:02:25.871521)
Interfaces limit 4352
VRF tables limit 65536
NextHops limit 512000
MPLS Labels limit 200000
Bridge Table limit 1000000
Bridge Table Overflow limit 4096
Flow Table limit 524288
Flow Table overflow limit 8192
Mirror entries limit 255

root@nodei9:~# contrail-status
== Contrail vRouter ==
supervisor-vrouter: active
contrail-tor-agent-1 active
contrail-tor-agent-10 initializing (ToR:dummy-10 connection down)
contrail-tor-agent-11 initializing (ToR:dummy-11 connection down)
contrail-tor-agent-12 initializing (ToR:dummy-12 connection down)
contrail-tor-agent-13 initializing (ToR:dummy-13 connection down)
contrail-tor-agent-14 initializing (ToR:dummy-14 connection down)
contrail-tor-agent-15 initializing (ToR:dummy-15 connection down)
contrail-tor-agent-16 initializing (ToR:dummy-16 connection down)
contrail-tor-agent-17 initializing (ToR:dummy-17 connection down)
contrail-tor-agent-18 initializing (ToR:dummy-18 connection down)
contrail-tor-agent-2 initializing (ToR:bng-contrail-qfx51-3 connection down)
contrail-tor-agent-5 initializing (ToR:dummy-5 connection down)
contrail-tor-agent-6 initializing (ToR:dummy-6 connection down)
contrail-tor-agent-7 initializing (ToR:dummy-7 connection down)
contrail-tor-agent-8 initializing (ToR:dummy-8 connection down)
contrail-tor-agent-9 initializing (ToR:dummy-9 connection down)
contrail-vrouter-agent active
contrail-vrouter-nodemgr active

========Run time service failures=============
/var/crashes/core.contrail-vroute.2592.nodei9.1438350737
root@nodei9:~#

Tags: quench vrouter
Changed in juniperopenstack:
assignee: nobody → Manish Singh (manishs)
milestone: none → r2.30-fcs
Revision history for this message
Manish Singh (manishs) wrote :

This could potentially happen when a VRF is notified of a change before walk done has executed for a walk that was started to delete the state of a decommissioned peer. Because walk done has not yet run, the listener has not been removed from the VRF table. A notification for this VRF is delivered to this listener, which in turn adds state. Walk done is then called and the peer is deleted, which results in this crash because state is still pending for this peer.
Trying a UT for this.
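
The ordering described above can be replayed in a small model. This is only a sketch under stated assumptions: `ToyTable` and `ReplayRace` are hypothetical names, not the agent's real `DBTableBase` code, and the check throws instead of asserting so the failure can be observed without aborting. The only detail taken from the backtrace is that `Unregister` requires `state_count_[listener] == 0`.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>

// Hypothetical, simplified stand-in for a DB table that tracks how many
// entries still carry state for each registered listener.
class ToyTable {
public:
    using ListenerId = int;

    ListenerId Register() {
        ListenerId id = next_id_++;
        state_count_[id] = 0;
        return id;
    }

    // A listener attaches state to one entry.
    void AddState(ListenerId id) { ++state_count_.at(id); }

    // A listener clears state from one entry.
    void ClearState(ListenerId id) { --state_count_.at(id); }

    // The backtrace shows an assert on this condition; the sketch throws
    // instead so that callers can observe the failure.
    void Unregister(ListenerId id) {
        if (state_count_.at(id) != 0) {
            throw std::logic_error("state_count_[listener] != 0");
        }
        state_count_.erase(id);
    }

private:
    ListenerId next_id_ = 0;
    std::map<ListenerId, int> state_count_;
};

// Replays the sequence from the comment above.
// Returns true if Unregister fails because state is still pending.
bool ReplayRace(bool notify_before_walk_done) {
    ToyTable vrf_table;
    ToyTable::ListenerId peer = vrf_table.Register();
    vrf_table.AddState(peer);    // state added while the peer was active
    vrf_table.ClearState(peer);  // walk removes state for the decommissioned peer
    if (notify_before_walk_done) {
        vrf_table.AddState(peer);  // late notification adds state again
    }
    try {
        vrf_table.Unregister(peer);  // walk done: listener removed, peer released
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}
```

In this model, `ReplayRace(true)` reports the failure and `ReplayRace(false)` does not, which matches the suspected window between the end of the walk and the execution of walk done.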

Revision history for this message
Ganesha HV (ganeshahv) wrote :
Saw the following bt during a sanity run:

(gdb) bt
#0 0x00007f36a86e5cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f36a86e90d8 in __GI_abort () at abort.c:89
#2 0x00007f36a86deb86 in __assert_fail_base (fmt=0x7f36a882f830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x10a1b09 "!IsDeleted()", file=file@entry=0x10a1aeb "controller/src/db/db_entry.cc", line=line@entry=31,
    function=function@entry=0x10a1da0 <DBEntryBase::SetState(DBTableBase*, int, DBState*)::__PRETTY_FUNCTION__> "void DBEntryBase::SetState(DBTableBase*, DBEntryBase::ListenerId, DBState*)") at assert.c:92
#3 0x00007f36a86dec32 in __GI___assert_fail (assertion=0x10a1b09 "!IsDeleted()", file=0x10a1aeb "controller/src/db/db_entry.cc", line=31,
    function=0x10a1da0 <DBEntryBase::SetState(DBTableBase*, int, DBState*)::__PRETTY_FUNCTION__> "void DBEntryBase::SetState(DBTableBase*, DBEntryBase::ListenerId, DBState*)") at assert.c:101
#4 0x0000000000ec1650 in DBEntryBase::SetState (this=0x7f36740c3510, tbl_base=0x7f3640014a30, listener=2, state=0x7f3638037090) at controller/src/db/db_entry.cc:31
#5 0x0000000000bd4d58 in RouteExport::MulticastNotify (this=0x7f364001b060, bgp_xmpp_peer=0x7f3630004990, associate=<optimized out>, partition=0x7f3640014bf0, e=0x7f36740c3510) at controller/src/vnsw/agent/controller/controller_export.cc:311
#6 0x0000000000bd51dd in operator() (a6=<optimized out>, a5=<optimized out>, a4=<optimized out>, a3=<optimized out>, a2=<optimized out>, a1=<optimized out>, p=<optimized out>, this=<optimized out>) at /usr/include/boost/bind/mem_fn_template.hpp:732
#7 operator()<boost::_mfi::mf6<void, RouteExport, const Agent*, AgentXmppChannel*, bool, Agent::RouteTableType, DBTablePartBase*, DBEntryBase*>, boost::_bi::list2<DBTablePartBase*&, DBEntryBase*&> > (a=<synthetic pointer>, f=..., this=<optimized out>)
    at /usr/include/boost/bind/bind.hpp:670
#8 operator()<DBTablePartBase*, DBEntryBase*> (a2=<synthetic pointer>, a1=<synthetic pointer>, this=<optimized out>) at /usr/include/boost/bind/bind_template.hpp:61
#9 boost::detail::function::void_function_obj_invoker2<boost::_bi::bind_t<void, boost::_mfi::mf6<void, RouteExport, Agent const*, AgentXmppChannel*, bool, Agent::RouteTableType, DBTablePartBase*, DBEntryBase*>, boost::_bi::list7<boost::_bi::value<RouteExport*>, boost::_bi::value<Agent*>, boost::_bi::value<AgentXmppChannel*>, boost::_bi::value<bool>, boost::_bi::value<Agent::RouteTableType>, boost::arg<1>, boost::arg<2> > >, void, DBTablePartBase*, DBEntryBase*>::invoke (function_obj_ptr=..., a0=<optimized out>, a1=<optimized out>)
    at /usr/include/boost/function/function_template.hpp:153
#10 0x0000000000ecc83a in operator() (a1=0x7f36740c3510, a0=0x7f3640014bf0, this=0x7f36907f0ac0) at /usr/include/boost/function/function_template.hpp:767
#11 RunNotify (entry=0x7f36740c3510, tpart=0x7f3640014bf0, this=0x7f3640014b70) at controller/src/db/db_table.cc:114
#12 DBTableBase::RunNotify (this=<optimized out>, tpart=tpart@entry=0x7f3640014bf0, entry=entry@entry=0x7f36740c3510) at controller/src/db/db_table.cc:199
#13 0x0000000000ecf338 in DBTablePartBase::RunNotify (this=this@entry...


Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/13024
Submitter: Manish Singh (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/13024
Committed: http://github.org/Juniper/contrail-controller/commit/116729fc2788609d247676dcccbd9ad8e7c63efd
Submitter: Zuul
Branch: R2.20

commit 116729fc2788609d247676dcccbd9ad8e7c63efd
Author: Manish <email address hidden>
Date: Thu Aug 13 09:55:15 2015 +0530

Agent crashes in peer handling.

Problem and fixes:
There were three different scenarios in which the agent was crashing.
1) BGP peer state on VRF: The channel has gone down and the peer has been
decommissioned. This starts a walk. Before the walk, one of the VRFs is deleted.
The walk reaches this deleted VRF and ignores it because it is delete-marked,
assuming that the notification will take care of removing the state. However,
before the notification happens, the walk is done and the peer is released.
This results in a failure because the state has not yet been removed.

Fix: Don't ignore a deleted VRF in the walk for deleting a peer.

2) Deleted route state addition: Again the same case of a deleted peer, i.e.
the channel has gone down, resulting in a walk to delete state for this peer.
Every VRF walk is accompanied by a route table walk. A state is maintained on
each route, and its listener (on the route table) is stored in the state on the
VRF for that table. This listener on the route table is deleted when route walk
done is called. The walk itself would have deleted all states on routes in this
table. Note that walk done is enqueued for processing. So it may happen that
one of the routes changes while the channel comes up with some other peer.
RouteExport::Notify gets called for that route and checks whether the channel
is active. Since the new peer came up, the channel is seen as active, but this
peer has not yet added its VRF state, so the VRF state is null. This is where
the issue occurs: instead of returning after seeing no VRF state, processing
continues and adds state back on the route with the old ID (from the old peer).

Fix: Return when no VRF state is found.

3) A multicast route was adding state even though it is delete-marked.
This was happening because in MulticastExport::Notify the non-null state check
was and'd with the route delete check for delete processing. So if the state
was null, a route delete notification skipped delete handling and instead
landed in state addition.

Fix: Decouple the state check. Ignore the delete notification and return when
state is null.

Change-Id: Ib5156a2ee6b72902a6ff75f05740549f19874241
Closes-bug: 1480375
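
The guards described in fixes 2 and 3 can be sketched as follows. This is an illustration only, not the code from the commit: `ToyRoute`, `Action`, `NotifyCoupled` and `NotifyDecoupled` are hypothetical names, per-route state is reduced to a boolean, and the order in which the two guards are checked is an assumption.

```cpp
#include <cassert>

// Hypothetical, simplified route: only the delete mark and whether the
// exporting listener currently holds state on it.
struct ToyRoute {
    bool deleted = false;
    bool has_state = false;
};

enum class Action { kIgnored, kStateRemoved, kStateAdded };

// Coupled check as described in scenario 3: a deleted route with no state
// fails the combined condition and falls through to state addition.
Action NotifyCoupled(ToyRoute& route) {
    if (route.deleted && route.has_state) {
        route.has_state = false;
        return Action::kStateRemoved;
    }
    route.has_state = true;  // reached even for a delete-marked route
    return Action::kStateAdded;
}

// Decoupled checks, plus the early return from scenario 2 when the peer has
// no VRF state yet. The ordering of the guards is an assumption of this sketch.
Action NotifyDecoupled(ToyRoute& route, bool vrf_state_present) {
    if (!vrf_state_present) {
        return Action::kIgnored;  // fix 2: return when no VRF state is found
    }
    if (route.deleted) {
        if (!route.has_state) {
            return Action::kIgnored;  // fix 3: nothing to clean up
        }
        route.has_state = false;
        return Action::kStateRemoved;
    }
    route.has_state = true;
    return Action::kStateAdded;
}
```

With a delete-marked route that carries no state, `NotifyCoupled` adds state, which corresponds to the `!IsDeleted()` assertion in the second backtrace, while `NotifyDecoupled` leaves the route untouched.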

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/13099
Submitter: Manish Singh (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/13099
Committed: http://github.org/Juniper/contrail-controller/commit/350a8eedc3aa4a8e4b13b5224ee878c5c10f5be3
Submitter: Zuul
Branch: master

commit 350a8eedc3aa4a8e4b13b5224ee878c5c10f5be3
Author: Manish <email address hidden>
Date: Thu Aug 13 09:55:15 2015 +0530

Agent crashes in peer handling.

Change-Id: Ib5156a2ee6b72902a6ff75f05740549f19874241
Closes-bug: 1480375
(cherry picked from commit 116729fc2788609d247676dcccbd9ad8e7c63efd)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22-dev

Review in progress for https://review.opencontrail.org/13927
Submitter: Vinay Vithal Mahuli (<email address hidden>)

information type: Proprietary → Public