On network disconnect-reconnect, a few tor-agents stop responding on their HTTP ports

Bug #1460946 reported by Vedamurthy Joshi
Affects             Status                    Importance   Assigned to   Milestone
Juniper Openstack   Status tracked in Trunk
  R2.20             Fix Committed             High         Megh Bhatt
  Trunk              Fix Committed             High         Megh Bhatt

Bug Description

R2.20 Build 30 Ubuntu 14.04 Juno multi-node

In the tor-scale setup, there are 4 tor-agent/tsn nodes (i.e. two pairs for active/standby), each running 64 tor-agents

nodei38 and nodei28 are one such pair, serving TORs ovs-vm1 to ovs-vm64

I disconnected the control/data link on nodei38
contrail-tor-agent-1 on nodei28 became active for ovs-vm1

After about 20-30 minutes, I reconnected the link on nodei38

It was then seen that 3 tor-agents (e.g. contrail-tor-agent-49) were not responding on their HTTP ports.
contrail-status showed them as 'timeout':

root@nodei38:/var/crashes# ps aux |grep "agent-49"
root 8672 0.0 0.0 10460 936 pts/0 S+ 12:55 0:00 grep --color=auto agent-49
root 25622 2.7 0.0 2413912 139988 ? Sl 00:54 19:52 /usr/bin/contrail-tor-agent --config_file /etc/contrail/contrail-tor-agent-49.conf
root@nodei38:/var/crashes#

root@nodei38:/var/crashes# contrail-status |grep timeout
contrail-tor-agent-16 timeout
contrail-tor-agent-4 timeout
contrail-tor-agent-49 timeout
root@nodei38:/var/crashes#

root@nodei38:/var/crashes# curl http://127.0.0.1:9058

^C
root@nodei38:/var/crashes# lsof -p 25622 |grep 9058
contrail- 25622 root 10u IPv4 28219150 0t0 TCP *:9058 (LISTEN)
root@nodei38:/var/crashes#

Revision history for this message
Praveen (praveen-karadakal) wrote :

contrail-tor-agent is stuck in tbb::mutex::lock, called from TcpSession::IsClosed. Looking at the lock, it appears there is an attempt to take the lock on a freed TcpSession.

(gdb) bt
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f76c3b28657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f76c3b28480 in __GI___pthread_mutex_lock (mutex=0x7f76980e6b98) at ../nptl/pthread_mutex_lock.c:79
#3 0x000000000100c7ec in tbb::mutex::lock (this=0x7f76980e6b98) at /usr/include/tbb/mutex.h:164
#4 0x000000000100c79e in tbb::mutex::scoped_lock::acquire (this=0x7fff60b7fc40, mutex=...) at /usr/include/tbb/mutex.h:105
#5 0x000000000100c759 in tbb::mutex::scoped_lock::scoped_lock (this=0x7fff60b7fc40, mutex=...) at /usr/include/tbb/mutex.h:91
#6 0x00000000012fce92 in TcpSession::IsClosed (this=0x7f76980e6b90) at controller/src/io/tcp_session.h:119
#7 0x0000000001859d81 in SandeshClient::SendSandesh (this=0x7f76b4070f70, snh=0x3dcdc40) at tools/sandesh/library/cpp/sandesh_client.cc:117
#8 0x000000000183f7e3 in Sandesh::SendEnqueue (this=0x3dcdc40) at tools/sandesh/library/cpp/sandesh.cc:591
#9 0x000000000183fa36 in Sandesh::Dispatch (this=0x3dcdc40, sconn=0x0) at tools/sandesh/library/cpp/sandesh.cc:611
#10 0x0000000001616f2e in XmppEventLog::Send (category=..., level=SandeshLevel::SYS_NOTICE, file=..., line=1314, f1=..., f2=..., f3=...,
    f4=...) at build/debug/xmpp/sandesh/xmpp_state_machine_sandesh_types.h:504
#11 0x000000000161305d in XmppStateMachine::OnSessionEvent (this=0x7f765c00e5b0, session=0x7f7694088de0, event=TcpSession::CONNECT_FAILED)
    at controller/src/xmpp/xmpp_state_machine.cc:1311
#12 0x0000000001631886 in boost::_mfi::mf2<void, XmppStateMachine, TcpSession*, TcpSession::Event>::operator() (this=0x7f7694088e98,
    p=0x7f765c00e5b0, a1=0x7f7694088de0, a2=TcpSession::CONNECT_FAILED) at /usr/include/boost/bind/mem_fn_template.hpp:280
#13 0x00000000016300ca in boost::_bi::list3<boost::_bi::value<XmppStateMachine*>, boost::arg<1>, boost::arg<2> >::operator()<boost::_mfi::mf2<void, XmppStateMachine, TcpSession*, TcpSession::Event>, boost::_bi::list2<TcpSession*&, TcpSession::Event&> > (this=0x7f7694088ea8, f=...,
    a=...) at /usr/include/boost/bind/bind.hpp:392
#14 0x000000000162edd8 in boost::_bi::bind_t<void, boost::_mfi::mf2<void, XmppStateMachine, TcpSession*, TcpSession::Event>, boost::_bi::list3<boost::_bi::value<XmppStateMachine*>, boost::arg<1>, boost::arg<2> > >::operator()<TcpSession*, TcpSession::Event> (this=0x7f7694088e98,
    a1=@0x7fff60b80160: 0x7f7694088de0, a2=@0x7fff60b8015c: TcpSession::CONNECT_FAILED) at /usr/include/boost/bind/bind_template.hpp:61
#15 0x000000000162def1 in boost::detail::function::void_function_obj_invoker2<boost::_bi::bind_t<void, boost::_mfi::mf2<void, XmppStateMachine, TcpSession*, TcpSession::Event>, boost::_bi::list3<boost::_bi::value<XmppStateMachine*>, boost::arg<1>, boost::arg<2> > >, void, TcpSession*, TcpSession::Event>::invoke (function_obj_ptr=..., a0=0x7f7694088de0, a1=TcpSession::CONNECT_FAILED)
    at /usr/include/boost/function/function_template.hpp:153
#16 0x000000000182e424 in boost::function2<void, Tcp...
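
For context, here is a simplified sketch of the locking pattern implied by frames #2-#6 (hypothetical class and field names, with std::mutex standing in for tbb::mutex; this is not the actual contrail-sandesh source). IsClosed() takes a scoped lock on a member mutex of the session, so if the session object has already been deleted by the state-machine context, a concurrent caller locks freed memory, which here shows up as a thread parked forever in __lll_lock_wait.

#include <mutex>

// Hypothetical stand-in for TcpSession; std::mutex replaces tbb::mutex.
class Session {
public:
    bool IsClosed() const {
        // Analogous to the tbb::mutex::scoped_lock taken in
        // TcpSession::IsClosed() (frames #3-#6 above).
        std::lock_guard<std::mutex> lock(mutex_);
        return closed_;
    }
private:
    mutable std::mutex mutex_;
    bool closed_ = false;
};

int main() {
    Session* session = new Session();
    bool closed = session->IsClosed();   // safe: the object is still alive
    (void)closed;
    delete session;                      // state-machine context frees the session
    // If another thread calls session->IsClosed() after the delete, it locks a
    // destroyed mutex: undefined behaviour that, in this core, appears as a
    // permanent wait in __lll_lock_wait (frame #0).
    return 0;
}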

Changed in juniperopenstack:
assignee: Praveen (praveen-karadakal) → Raj Reddy (rajreddy)
Revision history for this message
Praveen (praveen-karadakal) wrote :

Core file and the unstripped binary are stored at bhushana@10.204.216.50:~/Documents/technical/bugs/1460946

Revision history for this message
Raj Reddy (rajreddy) wrote : Re: [Bug 1460946] Re: On network disconnect-reconnect, few tor-agents not responding to their http ports

'timeout' means the daemon is busy with another task and is not able to respond to the introspect
request within the default 2s window.

You just need to increase the timeout on the command line, e.g.:
contrail-status -t 5

-
Raj

On Jun 2, 2015, at 10:21 AM, Ashish Ranjan <email address hidden>
 wrote:

> ** Also affects: juniperopenstack/trunk
> Importance: High
> Assignee: Raj Reddy (rajreddy)
> Status: New
>
> ** Changed in: juniperopenstack/trunk
> Milestone: None => r2.30-fcs

Revision history for this message
Vedamurthy Joshi (vedujoshi) wrote :

Raj, in this case the port was not responding at all; it would have failed even with a 5-second timeout.

Raj Reddy (rajreddy)
tags: added: analytics
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/11342
Submitter: Megh Bhatt (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11342
Committed: http://github.org/Juniper/contrail-sandesh/commit/31994e0f072345f6c19e166ed2a0859cbbb80fc1
Submitter: Zuul
Branch: R2.20

commit 31994e0f072345f6c19e166ed2a0859cbbb80fc1
Author: Megh Bhatt <email address hidden>
Date: Sat Jun 6 00:50:35 2015 -0700

Sandesh session delete happens in the sandesh client state
machine context and the session is used to enqueue messages
from client context and these can be executing concurrently
causing the session to be deleted when being used from the
client context. The fix is to enqueue the messages from the
client to the state machine - EvSandeshSend and from the
state machine context enqueue on to the session send queue.

Enabled sandesh client state machine unit tests.

Change-Id: I2d33c9348f71c99108bfc5fdfc2966b42231b214
Closes-Bug: #1460946
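
The change described above can be pictured with the following minimal sketch (PostSend, Run, and Session::EnqueueSend are illustrative names, not the actual sandesh API): the client context only enqueues an EvSandeshSend-style event, while the state-machine context, which is also the only context that creates and deletes the session, dequeues the event and hands the message to the session send queue. A send can therefore no longer race with a session delete.

#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct Session {                       // owned exclusively by the state machine
    void EnqueueSend(const std::string& msg) { (void)msg; /* push onto the TCP send queue */ }
};

struct EvSandeshSend { std::string msg; };   // event carried into the SM context

class StateMachine {
public:
    void PostSend(std::string msg) {         // called from the client context
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push({std::move(msg)});
        cv_.notify_one();
    }
    void Run() {                              // state-machine context
        session_ = std::make_unique<Session>();
        for (;;) {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !events_.empty(); });
            EvSandeshSend ev = std::move(events_.front());
            events_.pop();
            lock.unlock();
            if (ev.msg == "stop") break;
            // The session is only touched here, in the same context that can
            // delete it, so no freed-session access is possible.
            session_->EnqueueSend(ev.msg);
        }
        session_.reset();                     // delete also happens in this context
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<EvSandeshSend> events_;
    std::unique_ptr<Session> session_;
};

int main() {
    StateMachine sm;
    std::thread sm_thread(&StateMachine::Run, &sm);
    sm.PostSend("hello");   // client context: enqueue only, never touch the session
    sm.PostSend("stop");
    sm_thread.join();
    return 0;
}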

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/11477
Submitter: Megh Bhatt (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11477
Committed: http://github.org/Juniper/contrail-sandesh/commit/130eabc369583883f1e820a3a2323d77f7e5d82f
Submitter: Zuul
Branch: master

commit 130eabc369583883f1e820a3a2323d77f7e5d82f
Author: Megh Bhatt <email address hidden>
Date: Sat Jun 6 00:50:35 2015 -0700

Sandesh session delete happens in the sandesh client state
machine context and the session is used to enqueue messages
from client context and these can be executing concurrently
causing the session to be deleted when being used from the
client context. The fix is to enqueue the messages from the
client to the state machine - EvSandeshSend and from the
state machine context enqueue on to the session send queue.

Enabled sandesh client state machine unit tests.

Change-Id: I2d33c9348f71c99108bfc5fdfc2966b42231b214
Closes-Bug: #1460946
(cherry picked from commit 31994e0f072345f6c19e166ed2a0859cbbb80fc1)
