[Ubuntu 12.04 Icehouse R2.1 Build 36] Tor Agent crash while deleting server details from logical interface

Bug #1425458 reported by chhandak
This bug affects 1 person
Affects             Status           Importance   Assigned to            Milestone
Juniper Openstack   (status tracked in Trunk)
  R2.1              Fix Committed    High         Prabhjot Singh Sethi
  Trunk             Fix Committed    High         Prabhjot Singh Sethi

Bug Description

Observed a Tor agent crash with Build 36 while deleting server details from a logical interface and deleting VMIs (ports) from the Web UI.

BT
---
(gdb) bt
#0 0x00007febf7761425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007febf7764b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007febf775a0ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007febf775a192 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x000000000088c183 in VrfEntry::DeleteTimeout (this=0x7febec00b020) at controller/src/vnsw/agent/oper/vrf.cc:314
#5 0x0000000000db6038 in operator() (this=<optimized out>) at build/include/boost/function/function_template.hpp:763
#6 Timer::TimerTask::Run (this=0x17ee210) at controller/src/base/timer.cc:42
#7 0x0000000000da80f5 in TaskImpl::execute (this=0x17f4e40) at controller/src/base/task.cc:232
#8 0x00007febf8954e52 in ?? () from /usr/lib/libtbb.so.2
#9 0x00007febf8950c2d in ?? () from /usr/lib/libtbb.so.2
#10 0x00007febf89500db in ?? () from /usr/lib/libtbb.so.2
#11 0x00007febf894dc1f in ?? () from /usr/lib/libtbb.so.2
#12 0x00007febf894de59 in ?? () from /usr/lib/libtbb.so.2
#13 0x00007febf8b6be9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#14 0x00007febf781f3fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x0000000000000000 in ?? ()

Prabhjot is aware of this issue.

Tags: bms vrouter
chhandak (chhandak)
summary: [Ubuntu 12.04 Icehouse R2.1 Build 36] Tor Agent crash while deleting
- server details from logical
+ server details from logical interface
Prabhjot Singh Sethi (prabhjot) wrote :

(gdb) p Agent::singleton_->agent_init_
$1 = (AgentInit *) 0x7fff6a01dd50
(gdb) set print object on
(gdb) p Agent::singleton_->agent_init_
$2 = (TorAgentInit *) 0x7fff6a01dd50
(gdb) p $2->ovsdb_client_._M_ptr
$3 = (OVSDB::OvsdbClientTcp *) 0x7febe8004760
(gdb) p $3->session_
$4 = (OVSDB::OvsdbClientTcpSession *) 0x7febe8016cb0
(gdb) p $4->client_id^CQuit
(gdb) p $4->client_idl_.px
$5 = (OVSDB::OvsdbClientIdl *) 0x7febe8036150
(gdb) p $5->unicast_mac_local_ovsdb_._M_ptr
$6 = (OVSDB::UnicastMacLocalOvsdb *) 0x7febe8038920
(gdb) source ~/new_code/controller/src/db/db.gdb
(gdb) source ~/new_code/controller/src/vnsw/agent/test/agent.gdb
(gdb) source ~/new_code/controller/src/vnsw/agent/ovs_tor_agent/ovsdb_client/test/ovsdb.gdb
(gdb) dump_uc_v4_route_entries ^CQuit
(gdb) dump_ovsdb_uc_local_entries 0x7febe8016cb0
0x7febf00ddc90 state=2 mac=00:60:2f:aa:e5:e3 logical_switch= bd19e06c-1340-4de5-9c54-2897047f3a4b dest_ip= 10.204.216.196
(gdb) p *(KSyncEntry *)0x7febf00ddc90
$7 = (OVSDB::UnicastMacLocalEntry) {<OVSDB::OvsdbEntry> = {<KSyncEntry> = {_vptr.KSyncEntry = 0xdf6470, static kInvalidIndex = 4294967295,
      node_ = {<boost::intrusive::detail::generic_hook<boost::intrusive::get_set_node_algo<void*, false>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, 0>> = {<boost::intrusive::detail::no_default_definer> = {<No data fields>}, <boost::intrusive::rbtree_node<void*>> = {parent_ = 0x7febe8038958, left_ = 0x0, right_ = 0x0, color_ = boost::intrusive::rbtree_node<void*>::black_t}, <No data fields>}, <No data fields>}, index_ = 4294967295,
      state_ = KSyncEntry::ADD_DEFER, refcount_ = {<tbb::internal::atomic_impl_with_arithmetic<int, int, char>> = {<tbb::internal::atomic_impl<int>> = {rep = {value = 2}}, <No data fields>}, <No data fields>},

While adding a unicast MAC local entry, the agent waits for the logical switch entry to be resolved
and ends up taking a reference on a deleted entry. This results in a deadlock: the
logical switch remains in the DEL_REF_PENDING state and is never removed.
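
The dependency described above can be illustrated with a small, self-contained C++ model. This is only a sketch: `LogicalSwitch`, `UnicastMacLocal`, and every member and function below are hypothetical stand-ins, not the agent's real classes. It also assumes that an entry whose add is deferred cannot be torn down until the add resolves, which is one way to model the failed cleanup.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-ins; not the agent's real classes.
struct LogicalSwitch {
    std::string name;
    bool delete_marked = false;  // a delete has been requested
    int refcount = 0;            // references held by dependent entries

    // Models DEL_REF_PENDING: deleted, but still referenced.
    bool DelRefPending() const { return delete_marked && refcount > 0; }
    // The switch can be removed only once nothing references it.
    bool Removed() const { return delete_marked && refcount == 0; }
};

struct UnicastMacLocal {
    LogicalSwitch *ls = nullptr;
    bool add_deferred = false;  // waiting for the switch to resolve

    // Add always takes a reference, even on an already deleted switch.
    void Add(LogicalSwitch *sw) {
        ls = sw;
        ++sw->refcount;
        add_deferred = sw->delete_marked;
    }

    // Assumption of this model: a deferred add cannot be torn down,
    // so its reference on the switch is never released.
    bool TryDelete() {
        if (ls == nullptr || add_deferred) return false;
        --ls->refcount;
        ls = nullptr;
        return true;
    }
};

// Returns true when the switch ends up stuck in the modeled
// DEL_REF_PENDING state after an add that arrives after the delete.
bool SwitchStuckAfterLateAdd() {
    LogicalSwitch sw;
    sw.name = "ls-1";
    sw.delete_marked = true;         // switch is deleted first

    UnicastMacLocal mac;
    mac.Add(&sw);                    // reference taken on a deleted entry

    bool cleaned = mac.TryDelete();  // cleanup attempt fails
    return !cleaned && sw.DelRefPending() && !sw.Removed();
}
```

In this model each side waits on the other: the switch waits for its refcount to drop, and the entry waits for the switch to resolve, so neither is ever released.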

Changed in juniperopenstack:
assignee: nobody → Prabhjot Singh Sethi (prabhjot)
information type: Proprietary → Public
Prabhjot Singh Sethi (prabhjot) wrote :

The reference to the VRF is not released because the logical switch is in the DEL_REF_PENDING state: a pending unicast local route still holds a reference to the logical switch.

In most cases, when VLAN port membership is withdrawn, the TOR withdraws the associated unicast local route. Otherwise, the TOR agent tries to force-delete the unicast local route along with the logical switch.

In this case, however, the KSYNC reference on the logical switch prevents the TOR agent from cleaning up properly.

Working on a fix.

tags: added: bms
OpenContrail Admin (ci-admin-f) wrote : master

Review in progress for https://review.opencontrail.org/8578
Submitter: Prabhjot Singh Sethi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/8578
Committed: http://github.org/Juniper/contrail-controller/commit/1b45ceb29352fdccb459c6ea7120207709d3726f
Submitter: Zuul
Branch: master

commit 1b45ceb29352fdccb459c6ea7120207709d3726f
Author: Prabhjot Singh Sethi <email address hidden>
Date: Tue Mar 24 16:11:05 2015 +0530

Fix for VRF delete Timeout crash

Issue:
------
Unicast Mac Local Entry used to refer logical switch to
fetch VRF/VN info for route add to oper, so while Addition
it ended up holding reference to Deleted Logical Switch
waiting for it to be resolved, since Logical Switch goes
to DEL_REF_PENDING, it fails to force clean the dependent
Unicast Mac Local Entries due to cyclic dependency,
resulting in dead lock and eventually VRF delete timeout
crash

Fix:
----
Remove Logical Switch dependency from Unicast Mac Local
Entry, to break cyclic dependency
Changes to peer to take VRF and VxLAN id to ensure
KSYNC maps to correct view of vxlan id and vrf for route
add, so that the corresponding route deletes can be
triggered accordingly.

Closes-Bug: 1425458
Closes-Bug: 1433477
Change-Id: I236d0615bb4b144558334a24495b1f23ab9c3961
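
The idea behind the fix in the commit message above can be sketched as follows. This is a hypothetical illustration, not the code from the commit: `RouteKey`, `LogicalSwitch`, `UnicastMacLocal`, and all members are invented names. It assumes the entry copies the VRF name and VxLAN id by value at add time instead of holding a reference to the logical switch, so the route delete can reuse the same values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-ins; not the agent's real classes.
struct RouteKey {
    std::string vrf;
    uint32_t vxlan_id;
    std::string mac;
};

struct LogicalSwitch {
    std::string vrf;
    uint32_t vxlan_id = 0;
    bool delete_marked = false;
    int refcount = 0;  // no longer incremented by MAC entries
    bool Removed() const { return delete_marked && refcount == 0; }
};

struct UnicastMacLocal {
    RouteKey key;             // VRF and VxLAN id copied at add time
    bool route_added = false;

    // Add records the values it was given; it takes no switch reference.
    void Add(const std::string &vrf, uint32_t vxlan_id,
             const std::string &mac) {
        key = RouteKey{vrf, vxlan_id, mac};
        route_added = true;
    }

    // Delete reports the same key that was used for the add.
    bool Delete(RouteKey *deleted) {
        if (!route_added) return false;
        *deleted = key;
        route_added = false;
        return true;
    }
};

// Returns true when the switch can be removed while the MAC entry
// still exists, and the later route delete matches the earlier add.
bool SwitchRemovedIndependently() {
    LogicalSwitch sw;
    sw.vrf = "vrf-1";
    sw.vxlan_id = 5;

    UnicastMacLocal mac;
    mac.Add(sw.vrf, sw.vxlan_id, "00:60:2f:aa:e5:e3");

    sw.delete_marked = true;
    bool switch_gone = sw.Removed();  // nothing holds the switch

    RouteKey deleted{};
    bool ok = mac.Delete(&deleted);
    return switch_gone && ok && deleted.vrf == "vrf-1" &&
           deleted.vxlan_id == 5;
}
```

Because the entry no longer references the switch, the switch delete does not depend on the entry, and the cycle from the issue description cannot form in this model.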

OpenContrail Admin (ci-admin-f) wrote : R2.1

Review in progress for https://review.opencontrail.org/8653
Submitter: Prabhjot Singh Sethi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/8653
Committed: http://github.org/Juniper/contrail-controller/commit/e594cd696204aaab4a8150176fa1fcb839dbba3e
Submitter: Zuul
Branch: R2.1

commit e594cd696204aaab4a8150176fa1fcb839dbba3e
Author: Prabhjot Singh Sethi <email address hidden>
Date: Tue Mar 24 16:11:05 2015 +0530

Fix for VRF delete Timeout crash

Issue:
------
Unicast Mac Local Entry used to refer logical switch to
fetch VRF/VN info for route add to oper, so while Addition
it ended up holding reference to Deleted Logical Switch
waiting for it to be resolved, since Logical Switch goes
to DEL_REF_PENDING, it fails to force clean the dependent
Unicast Mac Local Entries due to cyclic dependency,
resulting in dead lock and eventually VRF delete timeout
crash

Fix:
----
Remove Logical Switch dependency from Unicast Mac Local
Entry, to break cyclic dependency
Changes to peer to take VRF and VxLAN id to ensure
KSYNC maps to correct view of vxlan id and vrf for route
add, so that the corresponding route deletes can be
triggered accordingly.

Closes-Bug: 1425458
Closes-Bug: 1433477
(cherry picked from commit 1b45ceb29352fdccb459c6ea7120207709d3726f)

Change-Id: Iae86e6e12b7b6849e698d4994c79b2f95067ecbe
