[3.0.2.0-28 ] contrail-control crash @ BgpXmppChannel::XmppPeer::~XmppPeer()

Bug #1574169 reported by chhandak
Affects            Status         Importance  Assigned to       Milestone
Juniper Openstack  Status tracked in Trunk
  R2.20            Fix Committed  High        Prakash Bailkeri
  R2.21.x          Fix Committed  High        Prakash Bailkeri
  R2.22.x          Fix Committed  High        Prakash Bailkeri
  R3.0             Fix Committed  High        Prakash Bailkeri
  Trunk            Fix Committed  High        Prakash Bailkeri

Bug Description

Observed this control-node crash while deleting logical interfaces (LIFs) and virtual machine interfaces (VMIs) in a scale setup.

Backtrace
----------------
(gdb) bt
#0 0x00007f7b6170ccc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f7b617100d8 in __GI_abort () at abort.c:89
#2 0x00007f7b61705b86 in __assert_fail_base (fmt=0x7f7b61856830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0xcdba4a "GetRefCount() == 0",
    file=file@entry=0xd0a1b0 "controller/src/bgp/bgp_xmpp_channel.cc", line=line@entry=345,
    function=function@entry=0xd0c020 "virtual BgpXmppChannel::XmppPeer::~XmppPeer()") at assert.c:92
#3 0x00007f7b61705c32 in __GI___assert_fail (assertion=0xcdba4a "GetRefCount() == 0",
    file=0xd0a1b0 "controller/src/bgp/bgp_xmpp_channel.cc", line=345,
    function=0xd0c020 "virtual BgpXmppChannel::XmppPeer::~XmppPeer()") at assert.c:101
#4 0x000000000041b580 in ?? ()
#5 0x000000000097efa0 in ?? ()
#6 0x00000000009555f7 in ?? ()
#7 0x0000000000955fd9 in ?? ()
#8 0x000000000094d7ad in ?? ()
#9 0x00000000009876e3 in ?? ()
#10 0x0000000000687cac in ?? ()
#11 0x00007f7b624e3b3a in ?? () from /usr/lib/libtbb.so.2
#12 0x00007f7b624df816 in ?? () from /usr/lib/libtbb.so.2
#13 0x00007f7b624def4b in ?? () from /usr/lib/libtbb.so.2
#14 0x00007f7b624db0ff in ?? () from /usr/lib/libtbb.so.2
#15 0x00007f7b624db2f9 in ?? () from /usr/lib/libtbb.so.2
#16 0x00007f7b626ff182 in start_thread (arg=0x7f7b58b7f700) at pthread_create.c:312
#17 0x00007f7b617d047d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

root@5b7s2:~# contrail-version | grep control
contrail-control 3.0.2.0-28 28
contrail-openstack-control 3.0.2.0-28 28

root@5b7s2:~# contrail-status | grep control
supervisor-control: active
contrail-control initializing (IFMap Server End-Of-RIB not computed)
contrail-control-nodemgr active

Contrail-control log
-----------------------
Oper: LinkRemove instance-ip:7f215674-77ef-46e6-83be-a23647515135 - virtual-machine-interface:default-domain:admin:fdce4f0c-a7f7-4899-a1ee-2b3f699d3515 , lhs: , rhs: controller/src/ifmap/ifmap_exporter.cc 523
2016-04-23 Sat 23:12:33:150.535 PDT 5b7s2 [Thread 139788032145152, Pid 25240]: SANDESH: Queue Drop: IFMap [SYS_DEBUG]: LinkOper: LinkRemove instance-ip:7f215674-77ef-46e6-83be-a23647515135 - virtual-network:default-domain:test1:vn-test-13217 , lhs: , rhs: controller/src/ifmap/ifmap_exporter.cc 523
2016-04-23 Sat 23:12:33:184.153 PDT 5b7s2 [Thread 139786421532416, Pid 25240]: XMPP [SYS_NOTICE]: XmppEventLog: Mode Server: PassiveOpen in state: Idle peer ip: 172.17.90.7 ( ) controller/src/xmpp/xmpp_state_machine.cc 1342

chhandak (chhandak)
Changed in juniperopenstack:
importance: Undecided → High
importance: High → Critical
description: updated
chhandak (chhandak)
summary: - contrail-control crash @ BgpXmppChannel::XmppPeer::~XmppPeer()
+ [3.0.2.0-28 28] contrail-control crash @
+ BgpXmppChannel::XmppPeer::~XmppPeer()
summary: - [3.0.2.0-28 28] contrail-control crash @
+ [3.0.2.0-28 ] contrail-control crash @
BgpXmppChannel::XmppPeer::~XmppPeer()
Revision history for this message
Nischal Sheth (nsheth) wrote :

I think we missed one scenario in the previous fix:

1. An add operation is enqueued to the DB.
2. An unsubscribe for the instance is received.
3. The membership manager API to unregister the table is called. The
membership manager starts the leave and part of the table is walked,
but the membership manager has not yet removed the IPeerRib.
4. The add operation is processed and the path gets added.

A fix could be to invalidate the subscribe gen id before or when the
leave starts.
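
The gen-id idea can be sketched roughly as follows. This is a simplified, self-contained model and not the contrail-controller code: PeerSubscriptions, RouteAddRequest, StartLeave, ProcessAdd and the table name "vrf-blue" are hypothetical names made up for illustration. The assumption is that each queued add carries the gen id of the subscription it was issued under, and that the id is reset as soon as the unregister is handled, before the table walk begins.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, simplified model. None of these names come from
// contrail-controller; they only illustrate the gen-id check.
constexpr std::uint64_t kInvalidGenId = 0;

struct RouteAddRequest {
    std::string table;
    std::string prefix;
    std::uint64_t subscription_gen_id;  // gen id captured at enqueue time
};

class PeerSubscriptions {
public:
    // Subscribe to a table and return the gen id of this subscription.
    std::uint64_t Subscribe(const std::string &table) {
        std::uint64_t id = ++next_gen_id_;
        gen_ids_[table] = id;
        return id;
    }

    // Called when the unsubscribe is handled, before the table walk starts.
    // Resetting the gen id here makes every already-enqueued add stale.
    void StartLeave(const std::string &table) {
        gen_ids_[table] = kInvalidGenId;
    }

    // Process a queued add. Returns true only if a path was added.
    bool ProcessAdd(const RouteAddRequest &req) {
        auto it = gen_ids_.find(req.table);
        if (it == gen_ids_.end() || it->second == kInvalidGenId ||
            it->second != req.subscription_gen_id) {
            return false;  // stale request: reject, no path is created
        }
        ++path_count_;
        return true;
    }

    int path_count() const { return path_count_; }

private:
    std::uint64_t next_gen_id_ = 0;
    std::map<std::string, std::uint64_t> gen_ids_;
    int path_count_ = 0;
};

// Replays the sequence described above and returns the number of paths
// left behind once the delayed add has been processed.
int ReplayDelayedAdd() {
    PeerSubscriptions peer;
    std::uint64_t gen = peer.Subscribe("vrf-blue");
    RouteAddRequest delayed{"vrf-blue", "10.0.0.1/32", gen};  // add enqueued
    peer.StartLeave("vrf-blue");              // unsubscribe handled, leave starts
    bool added = peer.ProcessAdd(delayed);    // delayed add finally processed
    assert(!added);
    return peer.path_count();
}
```

In this model the delayed add is dropped because its gen id no longer matches, so no path is left pointing at the peer when the leave completes. A later re-subscribe gets a fresh gen id, so requests queued under the old subscription stay rejected.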

Nischal Sheth (nsheth)
information type: Proprietary → Public Security
information type: Public Security → Public
Revision history for this message
Nischal Sheth (nsheth) wrote :

Lowering importance based on condition
when it happens.

Revision history for this message
chhandak (chhandak) wrote :

Core file copied to:

ubuntu-build04:/cs-shared/bugs/1574169> pwd
/auto/cs-shared/bugs/1574169
ubuntu-build04:/cs-shared/bugs/1574169> ls -lrt
total 20521412
-rwxrwxrwx 1 chhandak epbg 20931510272 Apr 23 23:18 core.contrail-contro.2047.5b7s2.1461477461

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/19631
Submitter: Prakash Bailkeri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.21.x

Review in progress for https://review.opencontrail.org/19717
Submitter: Prakash Bailkeri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/19718
Submitter: Prakash Bailkeri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22.x

Review in progress for https://review.opencontrail.org/19720
Submitter: Prakash Bailkeri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.0

Review in progress for https://review.opencontrail.org/19721
Submitter: Prakash Bailkeri (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/19631
Committed: http://github.org/Juniper/contrail-controller/commit/315c60bca45710d8ef9de2bc0ff123316d9ea3db
Submitter: Zuul
Branch: master

commit 315c60bca45710d8ef9de2bc0ff123316d9ea3db
Author: Prakash Bailkeri <email address hidden>
Date: Tue Apr 26 16:33:41 2016 +0530

Race condition between route handling and VRF subscription

The fix for 1563550 missed one corner case.

Consider the case where:
1. A route add operation is enqueued to the DB.
2. An unsubscribe for the instance is received.
3. The membership manager API to unregister the table is called. The membership
manager starts the leave and part of the table is walked, but the membership
manager has not yet removed the IPeerRib.
4. The route add request is processed and the path gets added.

Fix:
1. Reset the subscription gen id when the unregister request is handled, i.e.
before starting the walk. This ensures that a delayed route add is rejected.

2. Add UT cases to validate the scenario described above.

Change-Id: Ic12259b852227c1f199a67611da49a3563010026
Closes-Bug: #1574169

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/19721
Committed: http://github.org/Juniper/contrail-controller/commit/0892660c80504097cff3ccfbe7e87871e836ed82
Submitter: Zuul
Branch: R3.0

commit 0892660c80504097cff3ccfbe7e87871e836ed82
Author: Prakash Bailkeri <email address hidden>
Date: Tue Apr 26 16:33:41 2016 +0530

Change-Id: Ic12259b852227c1f199a67611da49a3563010026
Closes-Bug: #1574169
(cherry picked from commit 315c60bca45710d8ef9de2bc0ff123316d9ea3db)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/19717
Committed: http://github.org/Juniper/contrail-controller/commit/e62159cce888819c0d67803466fc1a5fead3db29
Submitter: Zuul
Branch: R2.21.x

commit e62159cce888819c0d67803466fc1a5fead3db29
Author: Prakash Bailkeri <email address hidden>
Date: Tue Apr 26 16:33:41 2016 +0530

Change-Id: Ic12259b852227c1f199a67611da49a3563010026
Closes-Bug: #1574169
(cherry picked from commit 315c60bca45710d8ef9de2bc0ff123316d9ea3db)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/19718
Committed: http://github.org/Juniper/contrail-controller/commit/242296f5161a7567869d6b1628e2c8ede9cfe1c6
Submitter: Zuul
Branch: R2.20

commit 242296f5161a7567869d6b1628e2c8ede9cfe1c6
Author: Prakash Bailkeri <email address hidden>
Date: Tue Apr 26 16:33:41 2016 +0530

Change-Id: Ic12259b852227c1f199a67611da49a3563010026
Closes-Bug: #1574169
(cherry picked from commit 315c60bca45710d8ef9de2bc0ff123316d9ea3db)

Revision history for this message
Ashish Ranjan (aranjan-n) wrote :

R2.22.x has the code https://review.opencontrail.org/#/c/19720/

I don't know why it did not automatically update this bug, but GitHub has the changes, so I am marking it fixed.
