tor-agent crash at OVSDB::OvsdbDBEntry::NotifyAdd on tor-scale setup

Bug #1426513 reported by Vedamurthy Joshi
This bug affects 1 person
Affects                                      Status         Importance  Assigned to           Milestone
Juniper Openstack (status tracked in Trunk)
  R2.1                                       Fix Committed  High        Prabhjot Singh Sethi
  Trunk                                      Fix Committed  High        Prabhjot Singh Sethi

Bug Description

2.1 Build 39 Ubuntu 14.04 Icehouse multinode setup

On this setup with 128 tor-agents, 110 TORs, 11K VMIs, and 1.1K real endpoints,
three crashes were seen with the same backtrace.

I am not sure what was going on in the testbed at the time.
Crash files will be in http://10.204.216.50/Docs/bugs/#

(gdb) bt
#0 0x00007fac3ad51bb9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fac3ad54fc8 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fac3ad4aa76 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007fac3ad4ab22 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00000000008cabba in OVSDB::OvsdbDBEntry::NotifyAdd (this=<optimized out>, row=<optimized out>)
    at controller/src/vnsw/agent/ovs_tor_agent/ovsdb_client/ovsdb_entry.cc:135
#5 0x00000000008d7866 in OVSDB::VrfOvsdbObject::OvsdbRouteNotify (this=0x7fac2c077a90, op=OVSDB::OvsdbClientIdl::OVSDB_ADD,
    row=0x7fabec047920) at controller/src/vnsw/agent/ovs_tor_agent/ovsdb_client/unicast_mac_remote_ovsdb.cc:376
#6 0x00000000008e7b82 in ovsdb_idl_insert_row ()
#7 0x00000000008e6dfe in ovsdb_idl_process_update ()
#8 0x00000000008e6bda in ovsdb_idl_parse_update__ ()
#9 0x00000000008e6861 in ovsdb_idl_parse_update ()
#10 0x00000000008e60ce in ovsdb_idl_msg_process ()
#11 0x00000000008c6b74 in OVSDB::OvsdbClientIdl::ProcessMessage (this=<optimized out>, msg=0x7fabc4010610)
    at controller/src/vnsw/agent/ovs_tor_agent/ovsdb_client/ovsdb_client_idl.cc:230
#12 0x00000000008ca30a in operator() (a0=0x7fabc4010610, this=0x7fac11bf5a90) at /usr/include/boost/function/function_template.hpp:767
#13 RunQueue (this=0x7fabec03a5f0) at controller/src/base/queue_task.h:53
#14 QueueTaskRunner<OVSDB::OvsdbClientIdl::OvsdbMsg*, WorkQueue<OVSDB::OvsdbClientIdl::OvsdbMsg*> >::Run (this=0x7fabec03a5f0)
    at controller/src/base/queue_task.h:36
#15 0x0000000000cfb5d0 in TaskImpl::execute (this=0x7fac3454fa40) at controller/src/base/task.cc:232
#16 0x00007fac3bf59b3a in ?? () from /usr/lib/libtbb.so.2
#17 0x00007fac3bf55816 in ?? () from /usr/lib/libtbb.so.2
#18 0x00007fac3bf54f4b in ?? () from /usr/lib/libtbb.so.2
#19 0x00007fac3bf510ff in ?? () from /usr/lib/libtbb.so.2
#20 0x00007fac3bf512f9 in ?? () from /usr/lib/libtbb.so.2
#21 0x00007fac3c175182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#22 0x00007fac3ae15fbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb)
root@nodei38:/var/crashes# ls -ltr
total 336744
-rw------- 1 root root 162619392 Feb 27 10:17 core.contrail-tor-ag.2295.nodei38.1425012472
-rw------- 1 root root 153624576 Feb 27 10:18 core.contrail-tor-ag.23796.nodei38.1425012481
-rw------- 1 root root 1808727 Feb 27 10:18 core.contrail-tor-ag.24285.nodei38.1425012487.gz
-rw------- 1 root root 158982144 Feb 27 10:25 core.contrail-tor-ag.24876.nodei38.1425012910
-rw------- 1 root root 153276416 Feb 27 10:48 core.contrail-tor-ag.2339.nodei38.1425014304
-rw------- 1 root root 155561984 Feb 27 11:32 core.contrail-tor-ag.2341.nodei38.1425016957
-rw------- 1 root root 161374208 Feb 27 11:32 core.contrail-tor-ag.21087.nodei38.1425016971
-rw------- 1 root root 161402880 Feb 27 11:33 core.contrail-tor-ag.21576.nodei38.1425016986
-rw------- 1 root root 157286400 Feb 27 11:35 core.contrail-tor-ag.28710.nodei38.1425017134
root@nodei38:/var/crashes#

Changed in juniperopenstack:
assignee: Hari Prasad Killi (haripk) → Prabhjot Singh Sethi (prabhjot)
tags: added: blocker scale
Revision history for this message
Prabhjot Singh Sethi (prabhjot) wrote :

It seems the OVS schema does not have the MAC as a key in the route tables, so we can end up with two ovs_idl rows for the same MAC in ovsdb-server.

When the TOR agent receives these two updates with the virtual network config already available, it crashes because it is unable to handle two rows with the same MAC.

Workaround: disassociate the TOR agent from the physical router and, once the connection to the TOR is up, associate the TOR agent with the TOR again to recover from this situation.
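
The failure mode described in this comment can be illustrated with a small sketch. It is not the agent's actual code: `Row`, `StrictTable`, `WouldConflict`, and this `NotifyAdd` are hypothetical stand-ins, and the only assumption taken from the report is that the server can hold two rows with the same MAC while the client expects one row per MAC.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a row reported by the OVSDB server.
// The server tells rows apart by UUID, so two rows may carry the same MAC.
struct Row {
    std::string uuid;
    std::string mac;
};

// Hypothetical client-side table that keys entries by MAC and assumes
// the server never reports more than one row per MAC.
class StrictTable {
public:
    // True if a different row is already bound to this MAC.
    bool WouldConflict(const Row &row) const {
        std::map<std::string, std::string>::const_iterator it =
            by_mac_.find(row.mac);
        return it != by_mac_.end() && it->second != row.uuid;
    }

    // Treats a second row for the same MAC as an impossible case and
    // asserts, which is the kind of abort seen in the backtrace.
    void NotifyAdd(const Row &row) {
        assert(!WouldConflict(row));
        by_mac_[row.mac] = row.uuid;
    }

private:
    std::map<std::string, std::string> by_mac_;  // MAC -> row UUID
};
```

With two rows such as `{"uuid-1", "00:11:22:33:44:55"}` and `{"uuid-2", "00:11:22:33:44:55"}`, the first `NotifyAdd` succeeds and the second would trip the assertion in a build with assertions enabled.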

OpenContrail Admin (ci-admin-f) wrote : R2.1

Review in progress for https://review.opencontrail.org/8349
Submitter: Prabhjot Singh Sethi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : master

Review in progress for https://review.opencontrail.org/8350
Submitter: Prabhjot Singh Sethi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : R2.1

Review in progress for https://review.opencontrail.org/8369
Submitter: Prabhjot Singh Sethi (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/8369
Committed: http://github.org/Juniper/contrail-controller/commit/2ac3337c80f334ff9d58bf6a076fcb7d96a44935
Submitter: Zuul
Branch: R2.1

commit 2ac3337c80f334ff9d58bf6a076fcb7d96a44935
Author: Prabhjot Singh Sethi <email address hidden>
Date: Sat Mar 14 19:21:53 2015 +0530

Fix Recursive TOR Agent crash

Issue:
------
OVSDB schema does not consider a key for unicast mac
remote entry, TOR agent on restart due to race condition
ended up having two unicast mac remote entry, later
TOR Agent on receiving duplicate entry treats it as
exception case and ends up asserting.

Fix:
----
Handle receive of duplicate entries from OVSDB-Server
and then cleanup the duplicate entries from OVSDB-Server.

Closes-Bug: 1426513
(cherry picked from commit a8d8ca79519338c2d53e10af587fbc61fe6b321a)

Change-Id: Id4e621e9b0dd64a41878227bd62e1e3390bb41ac

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/8350
Committed: http://github.org/Juniper/contrail-controller/commit/a8d8ca79519338c2d53e10af587fbc61fe6b321a
Submitter: Zuul
Branch: master

commit a8d8ca79519338c2d53e10af587fbc61fe6b321a
Author: Prabhjot Singh Sethi <email address hidden>
Date: Sat Mar 14 19:21:53 2015 +0530

Fix Recursive TOR Agent crash

Issue:
------
OVSDB schema does not consider a key for unicast mac
remote entry, TOR agent on restart due to race condition
ended up having two unicast mac remote entry, later
TOR Agent on receiving duplicate entry treats it as
exception case and ends up asserting.

Fix:
----
Handle receive of duplicate entries from OVSDB-Server
and then cleanup the duplicate entries from OVSDB-Server.

Closes-Bug: 1426513
Change-Id: I044b6d576920e97440aa0d40a0604fea3035dbf6
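
The approach named in the commit message (accept duplicate entries, then clean them up on the server) might look roughly like the sketch below. This is an illustration under stated assumptions, not the committed change: `Row`, `TolerantTable`, `NotifyAdd`, and `TakeDuplicates` are hypothetical names, and the choice to keep the first row seen as the primary one is an assumption made for the example.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a row reported by the OVSDB server.
struct Row {
    std::string uuid;
    std::string mac;
};

// Hypothetical table that tolerates several server rows per MAC.
class TolerantTable {
public:
    // Returns true if the row is (or already was) the primary row for its
    // MAC. Returns false if it is a duplicate; its UUID is then queued so
    // the caller can delete it from the server.
    bool NotifyAdd(const Row &row) {
        std::pair<std::map<std::string, std::string>::iterator, bool> res =
            primary_.insert(std::make_pair(row.mac, row.uuid));
        if (res.second || res.first->second == row.uuid) {
            return true;
        }
        pending_delete_.push_back(row.uuid);
        return false;
    }

    // Hands over the queued duplicate UUIDs and clears the queue. In a real
    // client the caller would issue delete transactions for these rows.
    std::vector<std::string> TakeDuplicates() {
        std::vector<std::string> out;
        out.swap(pending_delete_);
        return out;
    }

private:
    std::map<std::string, std::string> primary_;   // MAC -> primary row UUID
    std::vector<std::string> pending_delete_;      // duplicate row UUIDs
};
```

Here a second row for an already known MAC no longer aborts the process; it is recorded and can be removed from the server afterwards, which matches the two steps listed under "Fix" in the commit message.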

tags: added: bms