control-node crash due to assert at GetRefCount on scale setup

Bug #1459505 reported by Vedamurthy Joshi
This bug affects 2 people
Affects             Status                    Importance  Assigned to    Milestone
Juniper Openstack   Status tracked in Trunk
  R2.20             Fix Committed             Medium      Nischal Sheth
  Trunk             Fix Released              Medium      Nischal Sheth

Bug Description

R2.20 Build 30 Ubuntu 14.04 Juno multi-node setup

Seen on a ToR-scale setup with 128 ToRs, 128K VMIs, and 64K LIFs.

env.roledefs = {
    'all': [host2, host3, host4, host5, host6, host7, host8, host9],
    'cfgm': [host2, host3, host4],
    'openstack': [host2, host3, host4],
    'webui': [host3],
    'control': [host2, host3, host4],
    'compute': [host5, host6, host7, host8, host9],
    'collector': [host2, host3, host4],
    'database': [host2, host3, host4],
    'toragent': [host5, host6, host7, host9],
    'tsn': [host5, host6, host7, host9],
    'build': [host_build],
}

env.hostnames = {
    'all': ['nodei34', 'nodei35', 'nodei36', 'nodei37', 'nodei38', 'nodei28', 'nodei27', 'nodei30']
}

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-control'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fe2bfd2acc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007fe2bfd2acc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fe2bfd2e0d8 in __GI_abort () at abort.c:89
#2 0x00007fe2bfd23b86 in __assert_fail_base (fmt=0x7fe2bfe74830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0xb04b53 "GetRefCount() == 0",
    file=file@entry=0xb2b810 "controller/src/bgp/bgp_xmpp_channel.cc", line=line@entry=322, function=function@entry=0xb2c9c0 "virtual BgpXmppChannel::XmppPeer::~XmppPeer()") at assert.c:92
#3 0x00007fe2bfd23c32 in __GI___assert_fail (assertion=0xb04b53 "GetRefCount() == 0", file=0xb2b810 "controller/src/bgp/bgp_xmpp_channel.cc", line=322,
    function=0xb2c9c0 "virtual BgpXmppChannel::XmppPeer::~XmppPeer()") at assert.c:101
#4 0x000000000041a014 in ?? ()
#5 0x00000000007862d0 in ?? ()
#6 0x000000000077ac27 in ?? ()
#7 0x000000000077ad69 in ?? ()
#8 0x000000000075216d in ?? ()
#9 0x0000000000790c05 in ?? ()
#10 0x0000000000ab7390 in ?? ()
#11 0x00007fe2c0b01b3a in ?? () from /usr/lib/libtbb.so.2
#12 0x00007fe2c0afd816 in ?? () from /usr/lib/libtbb.so.2
#13 0x00007fe2c0afcf4b in ?? () from /usr/lib/libtbb.so.2
#14 0x00007fe2c0af90ff in ?? () from /usr/lib/libtbb.so.2
#15 0x00007fe2c0af92f9 in ?? () from /usr/lib/libtbb.so.2
#16 0x00007fe2c0d1d182 in start_thread (arg=0x7fe2b8da4700) at pthread_create.c:312
#17 0x00007fe2bfdee47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)
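
For comparing occurrences of this crash across builds, the assertion details can be pulled out of a saved backtrace. This is only a minimal sketch, assuming the frame format shown above (`__GI___assert_fail (assertion=..., file=..., line=..., function=...)`); `parse_assert_frame` is a hypothetical helper name, not part of gdb or Contrail.

```python
import re

# Matches the glibc __GI___assert_fail frame as printed by gdb's "bt".
# \s* also spans the line break gdb inserts between arguments.
_ASSERT_FRAME = re.compile(
    r'__GI___assert_fail \(assertion=\S+ "([^"]*)",\s*'
    r'file=\S+ "([^"]*)",\s*'
    r'line=(\d+),\s*'
    r'function=\S+ "([^"]*)"\)'
)


def parse_assert_frame(backtrace_text):
    """Return the failed assertion's details, or None if no such frame exists."""
    match = _ASSERT_FRAME.search(backtrace_text)
    if match is None:
        return None
    assertion, source_file, line, function = match.groups()
    return {
        "assertion": assertion,
        "file": source_file,
        "line": int(line),
        "function": function,
    }
```

Running this over each saved backtrace makes it easy to check whether two cores hit the same assertion in the same function, even when the line number has shifted between builds.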

Vedamurthy Joshi (vedujoshi) wrote :
Nischal Sheth (nsheth)
Changed in juniperopenstack:
assignee: nobody → Nischal Sheth (nsheth)
Nischal Sheth (nsheth) wrote :

This assertion indicates a problem in cleaning up all routes added by
the XmppPeer in question.
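
A rough illustration of what this assertion guards, written as a toy Python model rather than the actual C++ in bgp_xmpp_channel.cc. It assumes that every path added on behalf of a peer takes a reference on that peer, so a non-zero count at destruction time means some path was never cleaned up. All class and method names below are made up for illustration.

```python
class RefCountedPeer:
    """Toy stand-in for a peer that is referenced by each path it added."""

    def __init__(self, name):
        self.name = name
        self._ref_count = 0
        self.destroyed = False

    def add_path(self):
        # A new path now points at this peer.
        self._ref_count += 1

    def remove_path(self):
        if self._ref_count == 0:
            raise ValueError("no path left to remove")
        self._ref_count -= 1

    def get_ref_count(self):
        return self._ref_count

    def destroy(self):
        # Same intent as the failing check: destroying the peer while
        # paths still reference it is treated as a programming error.
        if self.get_ref_count() != 0:
            raise AssertionError("GetRefCount() == 0")
        self.destroyed = True


def try_destroy(peer):
    """Return True if the peer could be destroyed, False if paths remain."""
    try:
        peer.destroy()
        return True
    except AssertionError:
        return False
```

In this model, calling `destroy()` with one path still outstanding raises, which corresponds to the SIGABRT in the backtrace; once the last path is removed the call succeeds.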

Attempting to recreate it by running bgp_stress_test repeatedly.
Haven't been successful in recreating this problem so far, but ran
into bug 1463122, bug 1462550 and bug 1462550.

Vedamurthy Joshi (vedujoshi) wrote :

Seen again on 2.20 Build 54

Core is in http://10.204.216.50/Docs/bugs/1459505/build54

Ashish Ranjan (aranjan-n) wrote :

Another instance is bug 1470838

Nischal Sheth (nsheth) wrote :

Root cause could be that the replicator has not yet cleaned
up the paths it added when the peer gets destroyed.

Nischal Sheth (nsheth) wrote :

The previous theory didn't pan out.
Added a test that disables the DB, but the peer didn't get deleted
because XMPP peers are not deleted while the DB work queue is not
empty.
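
The gating described here can be pictured with a small Python model. This is a sketch of the idea only, not the Contrail implementation: `DeferredPeerDeleter` and its methods are hypothetical names, and "disabling the DB" is modeled as simply refusing to drain the work queue.

```python
from collections import deque


class DeferredPeerDeleter:
    """Toy model: peer deletion is held back until the work queue is empty."""

    def __init__(self):
        self.work_queue = deque()
        self.enabled = True          # False models the test disabling the DB
        self.pending_delete = []
        self.deleted = []

    def enqueue(self, item):
        self.work_queue.append(item)

    def request_delete(self, peer_name):
        self.pending_delete.append(peer_name)
        self._maybe_delete()

    def process_one(self):
        # While disabled, queued work stays queued, so deletes stay pending.
        if self.enabled and self.work_queue:
            self.work_queue.popleft()
        self._maybe_delete()

    def _maybe_delete(self):
        if self.work_queue:
            return  # outstanding work may still touch the peer's routes
        self.deleted.extend(self.pending_delete)
        self.pending_delete.clear()
```

With the queue disabled and non-empty, a delete request stays pending indefinitely, which matches the observation that the peer was never deleted in that test and so the assertion could not fire by that route.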

Will commit the test against this bug.

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/12376
Submitter: Nischal Sheth (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/12379
Submitter: Nischal Sheth (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/12376
Committed: http://github.org/Juniper/contrail-controller/commit/bf1a3be7e4381731c9818556434dc451389a6057
Submitter: Zuul
Branch: master

commit bf1a3be7e4381731c9818556434dc451389a6057
Author: Nischal Sheth <email address hidden>
Date: Mon Jul 13 09:30:17 2015 -0700

Verify that xmpp peer is not deleted till replicated routes are gone

Change-Id: I6decc6d0c78ba433ac52565599806495b5a891c2
Partial-Bug: 1459505

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/12379
Committed: http://github.org/Juniper/contrail-controller/commit/07fa383b0012d04b3b36f274704e83a6b8a40593
Submitter: Zuul
Branch: R2.20

commit 07fa383b0012d04b3b36f274704e83a6b8a40593
Author: Nischal Sheth <email address hidden>
Date: Mon Jul 13 09:30:17 2015 -0700

Verify that xmpp peer is not deleted till replicated routes are gone

Change-Id: I6decc6d0c78ba433ac52565599806495b5a891c2
Partial-Bug: 1459505

Vedamurthy Joshi (vedujoshi) wrote :

Crash seen again on 2.20 Build 91, Ubuntu 14.04 Juno setup.

Core will be in http://10.204.216.50/Docs/bugs/#/2.20-91/

tags: added: releasenote
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22-dev

Review in progress for https://review.opencontrail.org/13927
Submitter: Vinay Vithal Mahuli (<email address hidden>)

Anoop Kumar Sahu (anoops) wrote :

Seeing this on R2.21 <95>

root@Host1-CN1:/opt/contrail/utils/fabfile/testbeds# gdb contrail-control /var/crashes/core.contrail-contro.12159.Host1-CN1.1442873519
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from contrail-control...(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 12222]
[New LWP 12226]
[New LWP 12223]
[New LWP 12229]
[New LWP 12224]
[New LWP 12225]
[New LWP 12235]
[New LWP 12233]
[New LWP 12159]
[New LWP 12236]
[New LWP 12238]
[New LWP 12209]
[New LWP 12241]
[New LWP 12208]
[New LWP 12210]
[New LWP 12212]
[New LWP 12221]
[New LWP 12218]
[New LWP 15170]
[New LWP 12219]
[New LWP 12214]
[New LWP 12228]
[New LWP 12239]
[New LWP 12215]
[New LWP 12220]
[New LWP 12207]
[New LWP 12237]
[New LWP 12213]
[New LWP 12227]
[New LWP 12230]
[New LWP 12234]
[New LWP 12231]
[New LWP 12240]
[New LWP 12232]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-control'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f3477e27cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007f3477e27cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f3477e2b0d8 in __GI_abort () at abort.c:89
#2 0x00007f3477e20b86 in __assert_fail_base (fmt=0x7f3477f71830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0xb3e3b7 "GetRefCount() == 0",
    file=file@entry=0xb684e0 "controller/src/bgp/bgp_xmpp_channel.cc", line=line@entry=338,
    function=function@entry=0xb695c0 "virtual BgpXmppChannel::XmppPeer::~XmppPeer()") at assert.c:92
#3 0x00007f3477e20c32 in __GI___assert_fail (assertion=0xb3e3b7 "GetRefCount() == 0",
    file=0xb684e0 "controller/src/bgp/bgp_xmpp_channel.cc", line=338,
    function=0xb695c0 "virtual BgpXmppChannel::XmppPeer::~XmppPeer()") at assert.c:101
#4 0x000000000041a174 in ?? ()
#5 0x00000000007c5f50 in ?? ()
#6 0x00000000007bbe07 in ?? ()
#7 0x00000000007bc7f9 in ?? ()
#8 0x000000000079575d in ?? ()
#9 0x00000000007d11b3 in ?? ()
#10 0x0000000000af1540 in ?? ()
#11 0x00007f3478bfeb3a in ?? () from /usr/lib/libtbb.so.2
#12 0x00007f3478bfa816 in ?? () from /usr/lib/libtbb.so.2
#13 0x00007f3478bf9f4b in ?? () from /usr/lib/libtbb.so.2
#14 0x00007f3478bf60ff in ?? () from /usr/lib/libtbb...


Anoop Kumar Sahu (anoops) wrote :

Saw the above while deleting the VNs from the web GUI. I had around 4K VNs.

tags: added: blocker qfx