tor-agent crash at operator() due to seg fault on scale setup

Bug #1433668 reported by Vedamurthy Joshi
This bug affects 1 person
Affects              Status           Importance   Assigned to        Milestone
Juniper Openstack    (status tracked in Trunk)
  R2.20              Fix Committed    Medium       Vedamurthy Joshi
  Trunk              Fix Committed    Medium       Vedamurthy Joshi

Bug Description

R2.1 42 Ubuntu 14.04 Multi-node setup

ToR scale setup with 128 tor-agents and 11K VMIs.

The crash dump will be at http://10.204.216.50/Docs/bugs/#

The tor-agent crash below was seen once:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-tor-agent --config_file /etc/contrail/contrail-tor-agent-77.c'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000c7b181 in operator() (this=0x7fdf4406e178) at /usr/include/boost/function/function_template.hpp:767
767 /usr/include/boost/function/function_template.hpp: No such file or directory.
(gdb) bt
#0 0x0000000000c7b181 in operator() (this=0x7fdf4406e178) at /usr/include/boost/function/function_template.hpp:767
#1 OnEntry (this=0x7fdf4406e100) at controller/src/base/queue_task.h:297
#2 QueueTaskRunner<SandeshClientSMImpl::EventContainer, WorkQueue<SandeshClientSMImpl::EventContainer> >::Run (this=0x7fdf1c017050)
    at controller/src/base/queue_task.h:33
#3 0x0000000000d0d290 in TaskImpl::execute (this=0x7fdf4b0a5c40) at controller/src/base/task.cc:232
#4 0x00007fdf52ac1b3a in ?? () from /usr/lib/libtbb.so.2
#5 0x00007fdf52abd816 in ?? () from /usr/lib/libtbb.so.2
#6 0x00007fdf52abcf4b in ?? () from /usr/lib/libtbb.so.2
#7 0x00007fdf52ab90ff in ?? () from /usr/lib/libtbb.so.2
#8 0x00007fdf52ab92f9 in ?? () from /usr/lib/libtbb.so.2
#9 0x00007fdf52cdd182 in start_thread (arg=0x7fdf3bbfe700) at pthread_create.c:312
#10 0x00007fdf5197dfbd in __signbitl (__x=0) at ../sysdeps/x86/fpu/bits/mathinline.h:154
#11 __qfcvt_r (value=0, ndigit=1002432256, decpt=0x7fdf3bbfe9c0, sign=0x0, buf=0x0, len=0) at efgcvt_r.c:93
#12 0x0000000000000000 in ?? ()
(gdb)

Prabhjot Singh Sethi (prabhjot) wrote :

This looks like a WorkQueue object being deleted without shutdown in the Sandesh infra.

The memory of the SandeshClientSMImpl object overlaps the WorkQueue pointer held by the task runner (that queue seems to have been deleted without shutdown):
(gdb) p (TcpServer *) 0x7fdf44071d50
$167 = (SandeshClient *) 0x7fdf44071d50
(gdb) p (SandeshClient *) 0x7fdf44071d50
$168 = (SandeshClient *) 0x7fdf44071d50
(gdb) p $168->sm_.px
$169 = (SandeshClientSMImpl *) 0x7fdf4406e080
(gdb) p &$168->sm_.px->work_queue_
$170 = (WorkQueue<SandeshClientSMImpl::EventContainer> *) 0x7fdf4406e1e8

(gdb) fr 3
#3 0x0000000000d0d290 in TaskImpl::execute (this=0x7fdf4b0a5c40) at controller/src/base/task.cc:232
(gdb) p *this->parent_
$178 = (QueueTaskRunner<SandeshClientSMImpl::EventContainer, WorkQueue<SandeshClientSMImpl::EventContainer> >) {
  <Task> = {
    _vptr.Task = 0xdd6bf0 <vtable for QueueTaskRunner<SandeshClientSMImpl::EventContainer, WorkQueue<SandeshClientSMImpl::EventContainer> >+16>,
    static kTaskInstanceAny = -1,
    task_id_ = 16,
    task_instance_ = 0,
    task_impl_ = 0x7fdf4b0a5c40,
    state_ = Task::RUN,
    seqno_ = 747966,
    task_recycle_ = false,
    task_cancel_ = false
  },
  members of QueueTaskRunner<SandeshClientSMImpl::EventContainer, WorkQueue<SandeshClientSMImpl::EventContainer> >:
  queue_ = 0x7fdf4406e100
}
(gdb)

Task context from the task scheduler:
elem[14].left: $210 = {
  static npos = <optimized out>,
  _M_dataplus = {
    <std::allocator<char>> = {
      <__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
    members of std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider:
    _M_p = 0x1687f18 "sandesh::SandeshClientSM"
  }
}
elem[14].right: $211 = 16

Looking at the code, I can find one potential instance of this, where the WorkQueue in the Sandesh session object is deleted without shutdown.
We still need to figure out whether there are other instances of the same problem.

Changed in juniperopenstack:
assignee: Prabhjot Singh Sethi (prabhjot) → Raj Reddy (rajreddy)
Raj Reddy (rajreddy)
Changed in juniperopenstack:
assignee: Raj Reddy (rajreddy) → Megh Bhatt (meghb)
tags: added: analytics
OpenContrail Admin (ci-admin-f) wrote :

Not seen recently

Changed in juniperopenstack:
importance: High → Medium
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22-dev

Review in progress for https://review.opencontrail.org/13872
Submitter: Megh Bhatt (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/13872
Committed: http://github.org/Juniper/contrail-sandesh/commit/1c2c85e8a37f9e02c5a08eff3ac64281fae05ef0
Submitter: Zuul
Branch: R2.22-dev

commit 1c2c85e8a37f9e02c5a08eff3ac64281fae05ef0
Author: Megh Bhatt <email address hidden>
Date: Wed Sep 16 10:40:42 2015 -0700

Return failure from InitGenerator instead of asserting if sandesh
HTTP initialization fails.
Partial-Bug: #1433668

Change-Id: I86acb9d4c60b6c3747018588f88a0d43188929a9

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/13892
Submitter: Megh Bhatt (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/13892
Committed: http://github.org/Juniper/contrail-sandesh/commit/56981efdb83120cdff0f668d323113cbdb804b6b
Submitter: Zuul
Branch: master

commit 56981efdb83120cdff0f668d323113cbdb804b6b
Author: Megh Bhatt <email address hidden>
Date: Wed Sep 16 10:40:42 2015 -0700

Return failure from InitGenerator instead of asserting if sandesh
HTTP initialization fails.
Partial-Bug: #1433668

Change-Id: I86acb9d4c60b6c3747018588f88a0d43188929a9
(cherry picked from commit 1c2c85e8a37f9e02c5a08eff3ac64281fae05ef0)

Megh Bhatt (meghb) wrote :

The commits above were tagged with the wrong bug ID. They are for bug #1474258.

Megh Bhatt (meghb) wrote :

The core above is likely fixed as part of the fix for bug #1460946. Please open a new bug if it is seen again.
