Tor Agent crash @ KSyncObject::NotifyEvent

Bug #1572287 reported by chhandak
24
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
R3.0
Fix Committed
High
Prabhjot Singh Sethi
Trunk
Fix Committed
High
Prabhjot Singh Sethi

Bug Description

While upgrading to scale setup from Build 26 to 28 observed this crash. Setup has following scale config

8K VN
16K LIF
16K VMI

Backtrace (With Build 26 Binary)
-----------------------------------
gdb) bt
#0 0x00007f10fc055630 in ?? ()
#1 0x0000000000b511cb in KSyncObject::NotifyEvent (this=0x7f10fa69a330, entry=0x7f10fc055640, event=KSyncEntry::DEL_REQ)
    at controller/src/ksync/ksync_object.cc:1337
#2 0x0000000000b5149d in KSyncObject::SafeNotifyEvent (this=0x7f10fa69a330, entry=0x7f10fc055640, event=KSyncEntry::DEL_REQ)
    at controller/src/ksync/ksync_object.cc:277
#3 0x0000000000b52c90 in Delete (entry=0x7f10fc055640, this=0x7f10fa69a330) at controller/src/ksync/ksync_object.cc:235
#4 KSyncObjectManager::Process (this=0x7f110c0021e0, event=0x7f10fa4d8390) at controller/src/ksync/ksync_object.cc:1544
#5 0x0000000000b5875f in operator() (a0=0x7f10fa4d8390, this=0x7f111a2f8a30) at /usr/include/boost/function/function_template.hpp:767
#6 RunQueue (this=0x7f10f82a3970) at controller/src/base/queue_task.h:87
#7 QueueTaskRunner<KSyncObjectEvent*, WorkQueue<KSyncObjectEvent*> >::Run (this=0x7f10f82a3970) at controller/src/base/queue_task.h:66
#8 0x0000000000f446ec in TaskImpl::execute (this=0x7f111a41fd40) at controller/src/base/task.cc:253
#9 0x00007f1128140b3a in ?? () from /usr/lib/libtbb.so.2
#10 0x00007f112813c816 in ?? () from /usr/lib/libtbb.so.2
#11 0x00007f112813bf4b in ?? () from /usr/lib/libtbb.so.2
#12 0x00007f11281380ff in ?? () from /usr/lib/libtbb.so.2
#13 0x00007f11281382f9 in ?? () from /usr/lib/libtbb.so.2
#14 0x00007f112835c182 in start_thread (arg=0x7f111a2f9700) at pthread_create.c:312
#15 0x00007f112763547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Backtrace (With Build 28 Binary)
-----------------------------------
Core was generated by `/usr/bin/contrail-tor-agent --config_file /etc/contrail/contrail-tor-agent-2.co'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f10fc055630 in ?? ()
(gdb) bt
#0 0x00007f10fc055630 in ?? ()
#1 0x0000000000b511cb in ?? () at controller/src/ksync/ksync_object.cc:789
#2 0x00007f10f99157d0 in ?? ()
#3 0x00007f10f99157d0 in ?? ()
#4 0x00007f10fc2c5388 in ?? ()
#5 0x00007f111a2f8990 in ?? ()
#6 0x00007f10fc2a1b78 in ?? ()
#7 0x00000000008e7e55 in AgentRouteTable::Input (this=0x3, part=0x7f1121011d60, client=<optimized out>, req=0x7f10fa69a330)
    at controller/src/vnsw/agent/oper/agent_route.cc:427
#8 0x00007f1120ddf500 in ?? ()
#9 0x0000000000000001 in ?? ()
#10 0x0000000000000000 in ?? ()

chhandak (chhandak)
Changed in juniperopenstack:
importance: Undecided → High
Revision history for this message
chhandak (chhandak) wrote :

Core Copied to

chhandak@ubuntu-build04:/cs-shared/bugs/1572287$ ls -lrt
total 897904
-rwxrwxrwx 1 chhandak epbg 296065973 Apr 19 12:10 contrail-tor-agent
-rwxrwxrwx 1 chhandak epbg 296183330 Apr 19 12:10 contrail-tor-agent-26
-rwxrwxrwx 1 chhandak epbg 323579904 Apr 19 12:10 core.contrail-tor-ag.1181.5b7s4.1461088882

Changed in juniperopenstack:
assignee: nobody → Hari Prasad Killi (haripk)
Revision history for this message
Hari Prasad Killi (haripk) wrote :

(gdb) set print object on
(gdb) fr 1
#1 0x0000000000b511cb in KSyncObject::NotifyEvent (this=0x7f57589c4760, entry=0x7f5758a52250, event=KSyncEntry::DEL_REQ) at controller/src/ksync/ksync_object.cc:1337

(gdb) p entry
$1 = (KSyncEntry *) 0x7f5758a52250
(gdb) p *entry
$2 = {
  _vptr.KSyncEntry = 0x7f5758926420,
  static kInvalidIndex = 4294967295,
  static kDefaultMsgSize = 512,
  node_ = <boost::intrusive_hook> parent = 0x7f5758a52570 left = 0x0 right = 0x0,
  index_ = 4294967295,
  state_ = KSyncEntry::FREE_WAIT,
  refcount_ = (tbb::atomic) 0,
  seen_ = true,
  stale_ = false,
  del_add_pending_ = false
}
(gdb) p *this
$3 = (OVSDB::UnicastMacRemoteEntry) {
...

Changed in juniperopenstack:
assignee: Hari Prasad Killi (haripk) → Prabhjot Singh Sethi (prabhjot)
milestone: none → r3.1.0.0-fcs
Revision history for this message
chhandak (chhandak) wrote :

Observed in build 32 also.

(gdb) bt
#0 0x00007f48b86ab010 in ?? ()
#1 0x0000000000b5396b in KSyncObject::NotifyEvent(KSyncEntry*, KSyncEntry::KSyncEvent) ()
#2 0x0000000000b53c3d in KSyncObject::SafeNotifyEvent(KSyncEntry*, KSyncEntry::KSyncEvent) ()
#3 0x0000000000b55430 in KSyncObjectManager::Process(KSyncObjectEvent*) ()
#4 0x0000000000b5aeff in QueueTaskRunner<KSyncObjectEvent*, WorkQueue<KSyncObjectEvent*> >::Run() ()
#5 0x0000000000f47afc in TaskImpl::execute() ()
#6 0x00007f48ce45bb3a in ?? () from /usr/lib/libtbb.so.2
#7 0x00007f48ce457816 in ?? () from /usr/lib/libtbb.so.2
#8 0x00007f48ce456f4b in ?? () from /usr/lib/libtbb.so.2
#9 0x00007f48ce4530ff in ?? () from /usr/lib/libtbb.so.2
#10 0x00007f48ce4532f9 in ?? () from /usr/lib/libtbb.so.2
#11 0x00007f48ce677182 in start_thread (arg=0x7f48c6b12700) at pthread_create.c:312
#12 0x00007f48cd95047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

amit surana (asurana-t)
tags: added: blocker
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/20001
Submitter: Prabhjot Singh Sethi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.0

Review in progress for https://review.opencontrail.org/20002
Submitter: Prabhjot Singh Sethi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/20002
Committed: http://github.org/Juniper/contrail-controller/commit/87afc0a5189f8109477c31e226033b2406a89ade
Submitter: Zuul
Branch: R3.0

commit 87afc0a5189f8109477c31e226033b2406a89ade
Author: Prabhjot Singh Sethi <email address hidden>
Date: Mon May 9 16:04:23 2016 +0530

Fix ToR Agent crash in KSyncObject::NotifyEvent

Issue:
------
while KSyncObject is processing delete table request, it
expects KSync Entry tree to be manipulated by itself only
however it possibly seems like there is a parallel access
to KSync Entry Tree causing next pointer to be free'd
before processing delete event on it.
one such event was identified as intrusive pointer release
on Task destructor which can be called in task scheduler
context after completing the Run and can cause parallel
access to KSync Entry Tree

Fix:
----
on task complete reset intrusive pointer to ensure the
reference management happens only in context of KSync task

Also added Concurrency Check to assure client IDL sanity

Closes-Bug: 1572287
Change-Id: I85594163231c2ba92afae1a833698e4c905c9b5d
(cherry picked from commit e8035ffd935af2e49489111f97a0b56d84c5eba2)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/20001
Committed: http://github.org/Juniper/contrail-controller/commit/e8035ffd935af2e49489111f97a0b56d84c5eba2
Submitter: Zuul
Branch: master

commit e8035ffd935af2e49489111f97a0b56d84c5eba2
Author: Prabhjot Singh Sethi <email address hidden>
Date: Mon May 9 16:04:23 2016 +0530

Fix ToR Agent crash in KSyncObject::NotifyEvent

Issue:
------
while KSyncObject is processing delete table request, it
expects KSync Entry tree to be manipulated by itself only
however it possibly seems like there is a parallel access
to KSync Entry Tree causing next pointer to be free'd
before processing delete event on it.
one such event was identified as intrusive pointer release
on Task destructor which can be called in task scheduler
context after completing the Run and can cause parallel
access to KSync Entry Tree

Fix:
----
on task complete reset intrusive pointer to ensure the
reference management happens only in context of KSync task

Also added Concurrency Check to assure client IDL sanity

Closes-Bug: 1572287
Change-Id: I85594163231c2ba92afae1a833698e4c905c9b5d

Revision history for this message
chhandak (chhandak) wrote : Re: [Bug 1572287] A change has been merged
Download full text (6.5 KiB)

Hi Prabhjot,

Still seeing the issue with Build 36.

Core is present in 10.87.121.73 (root/c0ntrail123)

Thanks and Regards,
Chhandak

root@5b7s6:~# contrail-version
Package Version Build-ID | Repo | Package Name
-------------------------------------- ------------------------------ ----------------------------------
contrail-fabric-utils 3.0.2.0-36 36
contrail-install-packages 3.0.2.0-36~liberty 36
contrail-lib 3.0.2.0-36 36
contrail-nodemgr 3.0.2.0-36 36
contrail-nova-vif 3.0.2.0-36 36
contrail-setup 3.0.2.0-36 36
contrail-utils 3.0.2.0-36 36
contrail-vrouter-3.13.0-85-generic 3.0.2.0-36 36
contrail-vrouter-agent 3.0.2.0-36 36
contrail-vrouter-common 3.0.2.0-36 36
contrail-vrouter-init 3.0.2.0-36 36
contrail-vrouter-utils 3.0.2.0-36 36
python-contrail 3.0.2.0-36 36
python-contrail-vrouter-api 3.0.2.0-36 36
python-neutronclient 1:3.1.0-0ubuntu1~cloud0.2contrail36
python-nova 2:12.0.1-0ubuntu1~cloud0.1contrail36
python-opencontrail-vrouter-netns 3.0.2.0-36 36

On 5/10/16, 10:41 PM, "<email address hidden> on behalf of OpenContrail Admin" <<email address hidden> on behalf of <email address hidden>> wrote:

>Reviewed: https://review.opencontrail.org/20001
>Committed: http://github.org/Juniper/contrail-controller/commit/e8035ffd935af2e49489111f97a0b56d84c5eba2
>Submitter: Zuul
>Branch: master
>
>commit e8035ffd935af2e49489111f97a0b56d84c5eba2
>Author: Prabhjot Singh Sethi <email address hidden>
>Date: Mon May 9 16:04:23 2016 +0530
>
>Fix ToR Agent crash in KSyncObject::NotifyEvent
>
>Issue:
>------
>while KSyncObject is processing delete table request, it
>expects KSync Entry tree to be manipulated by itself only
>however it possibly seems like there is a parallel access
>to KSync Entry Tree causing next pointer to be free'd
>before processing delete event on it.
>one such event was identified as intrusive pointer release
>on Task destructor which can be called in task scheduler
>context after completing the Run and can cause parallel
>access to KSync Entry Tree
>
>Fix:
>----
>on task complete reset intrusive pointer to ensure the
>reference management happens only in context of KSync task
>
>Also added Concurrency Check to assure client IDL sanity
...

Read more...

Revision history for this message
chhandak (chhandak) wrote :
Download full text (10.4 KiB)

Hi Prabhjot,

Recreated the issue with your private build also.

Leaving the setup in issue recreated state. Setup (10.87.121.72, root/c0ntrail123)

Thanks and Regards,
Chhandak

root@5b7s5:/opt/contrail/utils# contrail-status
== Contrail vRouter ==
supervisor-vrouter: active
contrail-tor-agent-1 active
contrail-tor-agent-2 active
contrail-vrouter-agent active
contrail-vrouter-nodemgr active

========Run time service failures=============
/var/crashes/core.contrail-tor-ag.29031.5b7s5.1463594410
/var/crashes/core.contrail-tor-ag.25313.5b7s5.1463594288
/var/crashes/core.contrail-tor-ag.25312.5b7s5.1463594465

Backtrace
----------
#0 0x00007fece1724760 in ?? ()
#1 0x00000000016ceb55 in KSyncObject::NotifyEvent (this=0x7fecda8eecb0, entry=0x7fece1724770, event=KSyncEntry::DEL_REQ) at controller/src/ksync/ksync_object.cc:1363
#2 0x00000000016cbbf3 in KSyncObject::SafeNotifyEvent (this=0x7fecda8eecb0, entry=0x7fece1724770, event=KSyncEntry::DEL_REQ) at controller/src/ksync/ksync_object.cc:301
#3 0x00000000016cb88f in KSyncObject::Delete (this=0x7fecda8eecb0, entry=0x7fece1724770) at controller/src/ksync/ksync_object.cc:257
#4 0x00000000016cf7e3 in KSyncObjectManager::Process (this=0x7fecd00021c0, event=0x7fecbda251e0) at controller/src/ksync/ksync_object.cc:1570
#5 0x00000000016dfcb2 in boost::_mfi::mf1<bool, KSyncObjectManager, KSyncObjectEvent*>::operator() (this=0x7fecc7e5fa58, p=0x7fecd00021c0, a1=0x7fecbda251e0)
    at /usr/include/boost/bind/mem_fn_template.hpp:165
#6 0x00000000016de7ae in boost::_bi::list2<boost::_bi::value<KSyncObjectManager*>, boost::arg<1> >::operator()<bool, boost::_mfi::mf1<bool, KSyncObjectManager, KSyncObjectEvent*>, boost::_bi::list1<KSyncObjectEvent*&> > (this=0x7fecc7e5fa68, f=..., a=...) at /usr/include/boost/bind/bind.hpp:303
#7 0x00000000016dce36 in boost::_bi::bind_t<bool, boost::_mfi::mf1<bool, KSyncObjectManager, KSyncObjectEvent*>, boost::_bi::list2<boost::_bi::value<KSyncObjectManager*>, boost::arg<1> > >::operator()<KSyncObjectEvent*> (this=0x7fecc7e5fa58, a1=@0x7fecc7e5f9c0: 0x7fecbda251e0) at /usr/include/boost/bind/bind_template.hpp:32
#8 0x00000000016dac2f in boost::detail::function::function_obj_invoker1<boost::_bi::bind_t<bool, boost::_mfi::mf1<bool, KSyncObjectManager, KSyncObjectEvent*>, boost::_bi::list2<boost::_bi::value<KSyncObjectManager*>, boost::arg<1> > >, bool, KSyncObjectEvent*>::invoke (function_obj_ptr=..., a0=0x7fecbda251e0)
    at /usr/include/boost/function/function_template.hpp:132
#9 0x00000000016e0dff in boost::function1<bool, KSyncObjectEvent*>::operator() (this=0x7fecc7e5fa50, a0=0x7fecbda251e0)
    at /usr/include/boost/function/function_template.hpp:767
#10 0x00000000016e0a2c in QueueTaskRunner<KSyncObjectEvent*, WorkQueue<KSyncObjectEvent*> >::RunQueue (this=0x7fecbd68eaa0) at controller/src/base/queue_task.h:87
#11 0x00000000016e0816 in QueueTaskRunner<KSyncObjectEvent*, WorkQueue<KSyncObjectEvent*> >::Run (this=0x7fecbd68eaa0) at controller/src/base/queue_task.h:66
#12 0x0000000001c32c58 in TaskImpl::execute (this=0x7fecc7f...

information type: Proprietary → Public
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/20409
Submitter: Prabhjot Singh Sethi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.0

Review in progress for https://review.opencontrail.org/20410
Submitter: Prabhjot Singh Sethi (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/20410
Committed: http://github.org/Juniper/contrail-controller/commit/4f04f8719dcd1e5c56b6978edbbe9afe3d0de27b
Submitter: Zuul
Branch: R3.0

commit 4f04f8719dcd1e5c56b6978edbbe9afe3d0de27b
Author: Prabhjot Singh Sethi <email address hidden>
Date: Thu May 19 14:23:00 2016 +0530

Fix ToR agent crash in KSyncObjectManager::Process

Issue:
------
ToR agent relies on commiting bulk transactions to
OVSDB server, which on completion can end up as a
no-op to OVSDB server, which internally triggers ack
to complete KSync states on the associated entries
in current bulk context. while delete table processing
being happening it can so happen that for the next
entry pointer obtained there is already a delete op
being triggered and added to Bulk context, while
completion of bulk context it can end up completing
the KSync state for the next entry pointer and end
up deleting it while KSyncObjectManager::Process
still assuming to have a valid pointer without holding
reference to it.

Fix:
----
KSyncObjectManager::Process should take reference to
the KSync Entry pointers it is dealing with to assure
sanity.
Replaced usage of unsafe NotifyEvent with SafeNotifyEvent
for OVSDB based triggers.

Closes-Bug: 1572287
Change-Id: I10fd79d176f0cfc22315a20c47f4db40fe956800
(cherry picked from commit 8b97ed23a09e5af92210119a5f954b4a6923b35e)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/20409
Committed: http://github.org/Juniper/contrail-controller/commit/8b97ed23a09e5af92210119a5f954b4a6923b35e
Submitter: Zuul
Branch: master

commit 8b97ed23a09e5af92210119a5f954b4a6923b35e
Author: Prabhjot Singh Sethi <email address hidden>
Date: Thu May 19 14:23:00 2016 +0530

Fix ToR agent crash in KSyncObjectManager::Process

Issue:
------
ToR agent relies on commiting bulk transactions to
OVSDB server, which on completion can end up as a
no-op to OVSDB server, which internally triggers ack
to complete KSync states on the associated entries
in current bulk context. while delete table processing
being happening it can so happen that for the next
entry pointer obtained there is already a delete op
being triggered and added to Bulk context, while
completion of bulk context it can end up completing
the KSync state for the next entry pointer and end
up deleting it while KSyncObjectManager::Process
still assuming to have a valid pointer without holding
reference to it.

Fix:
----
KSyncObjectManager::Process should take reference to
the KSync Entry pointers it is dealing with to assure
sanity.
Replaced usage of unsafe NotifyEvent with SafeNotifyEvent
for OVSDB based triggers.

Closes-Bug: 1572287
Change-Id: I10fd79d176f0cfc22315a20c47f4db40fe956800

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.