[3.0 2717]: TOR Scale: Tor agent and vrouter agent crash after stopping all the control node

Bug #1550459 reported by chhandak
Affects: Juniper Openstack (status tracked in Trunk)

Series  Status         Importance  Assigned to   Milestone
R3.0    Fix Committed  Critical    Manish Singh
Trunk   Fix Committed  Critical    Manish Singh

Bug Description

Multiple ToR agent and vrouter agent crashes are observed after stopping both control nodes. All of the crashes are related to VrfEntry::DeleteTimeout.

Backtrace-1
-----------------
(gdb) bt
#0 0x00007f8d0e153cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f8d0e1570d8 in __GI_abort () at abort.c:89
#2 0x00007f8d0e14cb86 in __assert_fail_base (fmt=0x7f8d0e29d830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x11dc975 "0",
    file=file@entry=0x11ee2d0 "controller/src/vnsw/agent/oper/vrf.cc", line=line@entry=344, function=function@entry=0x11ee840 "bool VrfEntry::DeleteTimeout()")
    at assert.c:92
#3 0x00007f8d0e14cc32 in __GI___assert_fail (assertion=0x11dc975 "0", file=0x11ee2d0 "controller/src/vnsw/agent/oper/vrf.cc", line=344,
    function=0x11ee840 "bool VrfEntry::DeleteTimeout()") at assert.c:101
#4 0x0000000000ab9692 in VrfEntry::DeleteTimeout() ()
#5 0x00000000011915b9 in Timer::TimerTask::Run() ()
#6 0x000000000118a61c in TaskImpl::execute() ()
#7 0x00007f8d0ed22b3a in ?? () from /usr/lib/libtbb.so.2
#8 0x00007f8d0ed1e816 in ?? () from /usr/lib/libtbb.so.2
#9 0x00007f8d0ed1df4b in ?? () from /usr/lib/libtbb.so.2
#10 0x00007f8d0ed1a0ff in ?? () from /usr/lib/libtbb.so.2
#11 0x00007f8d0ed1a2f9 in ?? () from /usr/lib/libtbb.so.2
#12 0x00007f8d0ef3e182 in start_thread (arg=0x7f8d05bd3700) at pthread_create.c:312
#13 0x00007f8d0e21747d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Backtrace-2
-----------
(gdb) bt
#0 0x00007f378cba1cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f378cba50d8 in __GI_abort () at abort.c:89
#2 0x00007f378cb9ab86 in __assert_fail_base (fmt=0x7f378cceb830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0xfa2375 "0",
    file=file@entry=0xfb3d18 "controller/src/vnsw/agent/oper/vrf.cc", line=line@entry=344,
    function=function@entry=0xfb4280 <VrfEntry::DeleteTimeout()::__PRETTY_FUNCTION__> "bool VrfEntry::DeleteTimeout()") at assert.c:92
#3 0x00007f378cb9ac32 in __GI___assert_fail (assertion=0xfa2375 "0", file=0xfb3d18 "controller/src/vnsw/agent/oper/vrf.cc", line=344,
    function=0xfb4280 <VrfEntry::DeleteTimeout()::__PRETTY_FUNCTION__> "bool VrfEntry::DeleteTimeout()") at assert.c:101
#4 0x00000000009dd882 in VrfEntry::DeleteTimeout (this=0x7f3774562f40) at controller/src/vnsw/agent/oper/vrf.cc:344
#5 0x0000000000f5d809 in operator() (this=<optimized out>) at /usr/include/boost/function/function_template.hpp:767
#6 Timer::TimerTask::Run (this=0x33f6230) at controller/src/base/timer.cc:42
#7 0x0000000000f5686c in TaskImpl::execute (this=0x7f378640b840) at controller/src/base/task.cc:253
#8 0x00007f378d770b3a in ?? () from /usr/lib/libtbb.so.2
#9 0x00007f378d76c816 in ?? () from /usr/lib/libtbb.so.2
#10 0x00007f378d76bf4b in ?? () from /usr/lib/libtbb.so.2
#11 0x00007f378d7680ff in ?? () from /usr/lib/libtbb.so.2
#12 0x00007f378d7682f9 in ?? () from /usr/lib/libtbb.so.2
#13 0x00007f378d98c182 in start_thread (arg=0x7f3785e27700) at pthread_create.c:312
#14 0x00007f378cc6547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
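
Both backtraces end in assert(0) inside VrfEntry::DeleteTimeout(), invoked from Timer::TimerTask::Run(). A minimal sketch of that kind of check follows, assuming the timer asserts when a VRF marked for deletion is still referenced when the timer fires. The type VrfDeleteSketch and the function WouldAssertOnDeleteTimeout are hypothetical names used only for illustration; they are not taken from contrail-controller, and the real condition at vrf.cc:344 may differ.

```cpp
// Hypothetical stand-in for a VRF entry; not the real VrfEntry class.
struct VrfDeleteSketch {
    int ref_count;  // references still held (for example by interfaces)
    bool deleted;   // true once delete has been requested for the VRF
};

// Assumption: the delete timer fires some time after the VRF is marked
// deleted and treats any remaining reference as fatal. This returns true
// in the case where such a timer would hit its assert.
bool WouldAssertOnDeleteTimeout(const VrfDeleteSketch& vrf) {
    return vrf.deleted && vrf.ref_count > 0;
}
```

Under this assumption, a single leaked reference on a deleted VRF is enough to bring the agent down once the timer expires.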

Tags: blocker tor
chhandak (chhandak)
Changed in juniperopenstack:
importance: Undecided → Critical
milestone: none → future
summary: [3.0 2717]: TOR Scale: Tor agent and vrouter agent crash after stopping
- all the config node
+ all the control node
Changed in juniperopenstack:
assignee: nobody → Hari Prasad Killi (haripk)
chhandak (chhandak)
description: updated
Changed in juniperopenstack:
assignee: Hari Prasad Killi (haripk) → Manish Singh (manishs)
Jeba Paulaiyan (jebap)
information type: Proprietary → Public
chhandak (chhandak) wrote :

ssh to bhushana@10.204.216.50 Password bhu@123
Cores copied in /home/bhushana/Documents/technical/bugs/1550459

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/20663
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.0

Review in progress for https://review.opencontrail.org/20664
Submitter: Manish Singh (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/20664
Committed: http://github.org/Juniper/contrail-controller/commit/57813920707d523bec37c6a401e2099bbb2f56b7
Submitter: Zuul
Branch: R3.0

commit 57813920707d523bec37c6a401e2099bbb2f56b7
Author: Manish <email address hidden>
Date: Thu May 26 15:04:58 2016 +0530

VMI was not removed and it held on to the VRF.

Problem:
The sequence of events is as follows:
- Delete is issued on the VMI. This resets the configurer in the VMI and marks
it for deletion. However, because some references remain, the entry is still
present.
- Add is issued. As the entry is present in the DB, this is converted to a
change operation. On seeing a change, the delete flag is cleared, but the VMI
does not get its configurer back. The agent just resyncs the entry.
- Delete is issued again. Since the configurer was not set, resync for the
entry is not called to flush the VMI parameters, hence the VRF reference is
held.

This results in a VRF delete timeout.

Solution:
Resync is always done for a non-deleted entry. In the above case, on add, the
delete flag has been reset and the entry is no longer in the deleted state. So
whenever resync is done, set the configurer.
Also, OnDelete calls resync to flush the VMI parameters. The configurer should
be reset only after the resync call in OnDelete is done.

Change-Id: I431cc3dcf264a23a0e2e1491c23ee2f8a257aba4
Closes-bug: #1550459
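
The delete, add, delete sequence described in the commit message can be modeled with a small state machine. This is only a sketch under stated assumptions: VrfSketch, VmiSketch, and RunScenario are hypothetical names, not the agent's actual classes, and only the first half of the fix (restoring the configurer on resync of a non-deleted entry) is modeled. The reset of the configurer is already placed after the flush in Delete().

```cpp
// Hypothetical stand-ins; not the actual contrail-controller classes.
struct VrfSketch {
    int ref_count = 0;  // references held by interfaces
};

class VmiSketch {
public:
    VmiSketch(VrfSketch* vrf, bool with_fix) : vrf_(vrf), with_fix_(with_fix) {}

    // Config add. If the entry is already present this acts as a change.
    void Add() {
        bool is_new = !present_;
        present_ = true;
        deleted_ = false;  // a change clears the delete flag
        // Before the fix only a new entry gets a configurer; with the fix
        // every resync of a non-deleted entry sets it again.
        if (is_new || with_fix_) configurer_set_ = true;
        Resync(true);
    }

    // Config delete. The entry itself stays present (other references).
    void Delete() {
        deleted_ = true;
        if (configurer_set_) Resync(false);  // flush VMI parameters
        configurer_set_ = false;             // reset only after the flush
    }

private:
    // Resync with config takes a VRF reference; without config drops it.
    void Resync(bool has_config) {
        if (has_config && !holds_vrf_) { ++vrf_->ref_count; holds_vrf_ = true; }
        if (!has_config && holds_vrf_) { --vrf_->ref_count; holds_vrf_ = false; }
    }

    VrfSketch* vrf_;
    bool with_fix_;
    bool present_ = false;
    bool deleted_ = false;
    bool configurer_set_ = false;
    bool holds_vrf_ = false;
};

// Replays add, delete, add, delete and reports whether the VRF reference
// is still held at the end (the condition that leads to the timeout).
bool RunScenario(bool with_fix) {
    VrfSketch vrf;
    VmiSketch vmi(&vrf, with_fix);
    vmi.Add();
    vmi.Delete();
    vmi.Add();
    vmi.Delete();
    return vrf.ref_count > 0;
}
```

In this model RunScenario(false) leaves the reference held after the second delete, while RunScenario(true) releases it, which matches the behavior the commit message describes before and after the change.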

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/20663
Committed: http://github.org/Juniper/contrail-controller/commit/6edffd00cad1cecec9d37e1ba03f6f029f13e1b6
Submitter: Zuul
Branch: master

commit 6edffd00cad1cecec9d37e1ba03f6f029f13e1b6
Author: Manish <email address hidden>
Date: Thu May 26 15:04:58 2016 +0530

VMI was not removed and it held on to the VRF.

Problem:
The sequence of events is as follows:
- Delete is issued on the VMI. This resets the configurer in the VMI and marks
it for deletion. However, because some references remain, the entry is still
present.
- Add is issued. As the entry is present in the DB, this is converted to a
change operation. On seeing a change, the delete flag is cleared, but the VMI
does not get its configurer back. The agent just resyncs the entry.
- Delete is issued again. Since the configurer was not set, resync for the
entry is not called to flush the VMI parameters, hence the VRF reference is
held.

This results in a VRF delete timeout.

Solution:
Resync is always done for a non-deleted entry. In the above case, on add, the
delete flag has been reset and the entry is no longer in the deleted state. So
whenever resync is done, set the configurer.
Also, OnDelete calls resync to flush the VMI parameters. The configurer should
be reset only after the resync call in OnDelete is done.

Change-Id: I431cc3dcf264a23a0e2e1491c23ee2f8a257aba4
Closes-bug: #1550459
