contrail-collector core continuously in ha setup

Bug #1613896 reported by wenqing liang
30
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
R3.1
Fix Committed
High
ted ghose
Trunk
Fix Committed
High
ted ghose

Bug Description

In a sm provisioned r3.1.0.0-23 mitaka ha setup, contrail-collector was seen core continuously. cores uploaded to /cs-shared/bugs/1613896.

From: Ted Ghose
Sent: Monday, August 15, 2016 7:26 PM
To: Megh Bhatt <email address hidden>; Wenqing Liang <email address hidden>
Cc: Raj Reddy <email address hidden>; Ignatious Johnson <email address hidden>; Abhay Joshi <email address hidden>; dl-contrail-server-manager <email address hidden>
Subject: Re: service issue

its a scoped pointer, if the callback is recorded, it wont be deleted, right?

________________________________________
From: Megh Bhatt
Sent: Monday, August 15, 2016 7:20 PM
To: Wenqing Liang; Ted Ghose
Cc: Raj Reddy; Ignatious Johnson; Abhay Joshi; dl-contrail-server-manager
Subject: Re: service issue

Sorry I mean VncApi client can get deleted and we can still access it from the RespHandler callback ?? and then that can end up enqueueing a callback on a deleted object/empty boost function.

Thanks

Megh

On Aug 15, 2016, at 7:13 PM, Megh Bhatt <email address hidden> wrote:

Added Ted

The collector did core on the first controller

/var/log/contrail/contrail-collector.log.4:2016-08-15 Mon 15:53:26:298.401 PDT a5d01e03 [Thread 140083858196224, Pid 21908]: !!!! ERROR !!!! Task caught fatal exception: call to empty boost::function TaskImpl: 0x7f67d3bc3c40

Core was generated by `/usr/bin/contrail-collector'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f67dabb6cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f67dabb6cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f67dabba0d8 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f67dabafb86 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f67dabafc32 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x000000000045c3d7 in TaskImpl::execute (this=0x7f67d3bc3c40) at controller/src/base/task.cc:291
#5 0x00007f67dc13eb3a in ?? () from /usr/lib/libtbb.so.2
#6 0x00007f67dc13a816 in ?? () from /usr/lib/libtbb.so.2
#7 0x00007f67dc139f4b in ?? () from /usr/lib/libtbb.so.2
#8 0x00007f67dc1360ff in ?? () from /usr/lib/libtbb.so.2
#9 0x00007f67dc1362f9 in ?? () from /usr/lib/libtbb.so.2
#10 0x00007f67dc35a182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007f67dac7a47d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) f 4
#4 0x000000000045c3d7 in TaskImpl::execute (this=0x7f67d3bc3c40) at controller/src/base/task.cc:291
291 controller/src/base/task.cc: No such file or directory.
(gdb) info locals
what = "call to empty boost::function"
e = @0x7f67a0007cb0: <incomplete type>
running = @0x7f67d3b03c80: 0x2a857b0
__PRETTY_FUNCTION__ = "virtual tbb::task* TaskImpl::execute()"
(gdb) p *this
$1 = {<tbb::task> = {<No data fields>}, parent_ = 0x2a857b0}
(gdb) set print object on
(gdb) p parent_
$2 = (QueueTaskRunner<boost::function<void()>, WorkQueue<boost::function<void()> > > *) 0x2a857b0
(gdb) p *parent_
$3 = (QueueTaskRunner<boost::function<void()>, WorkQueue<boost::function<void()> > >) {<Task> = {_vptr.Task = 0x9b6bf0 <vtable for QueueTaskRunner<boost::function<void ()>, WorkQueue<boost::function<void ()> > >+16>, static kTaskInstanceAny = -1, task_id_ = 15, task_instance_ = 0, task_impl_ = 0x7f67d3bc3c40, state_ = Task::RUN, seqno_ = 518,
    task_recycle_ = false, task_cancel_ = false, enqueue_time_ = 0, schedule_time_ = 0, execute_delay_ = 0, schedule_delay_ = 0,
    waitq_hook_ = {<boost::intrusive::detail::generic_hook<boost::intrusive::get_list_node_algo<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, 0>> = {<boost::intrusive::detail::no_default_definer> = {<No data fields>}, <boost::intrusive::list_node<void*>> = {next_ = 0x0, prev_ = 0x0}, <No data fields>}, <No data fields>}},
  queue_ = 0x7f677c00a9d0}
(gdb) p TaskScheduler::singleton_.px_
There is no member or method named px_.
(gdb) p TaskScheduler::singleton_.px
$4 = (TaskScheduler *) 0x2a044a0
(gdb) p *TaskScheduler::singleton_.px
$5 = {static kVectorGrowSize = 16, static singleton_ = {px = 0x2a044a0}, stop_entry_ = 0x2a46530, task_scheduler_ = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>}, my_scheduler = 0x7f67d3bcf800, static automatic = -1, static deferred = -2}, mutex_ = {static is_rw_mutex = false,
    static is_recursive_mutex = false, static is_fair_mutex = false, impl = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, running_ = true, seqno_ = 519, task_group_db_ = std::vector of length 33, capacity 33 = {
    0x0, 0x2ad3450, 0x0, 0x0, 0x2ad17f0, 0x2a58470, 0x2a58660, 0x2a49f40, 0x2a592e0, 0x0, 0x0, 0x0, 0x2a97a10, 0x2a939f0, 0x2a97230, 0x2ad53d0, 0x0, 0x2ad4a00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, id_map_mutex_ = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>},
    reader_head = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock_read*, long, tbb::interface5::reader_writer_lock::scoped_lock_read>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock_read*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>},
    writer_head = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock*, long, tbb::interface5::reader_writer_lock::scoped_lock>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>},
    writer_tail = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock*, long, tbb::interface5::reader_writer_lock::scoped_lock>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>}, my_current_writer = {my_id = 0},
    rdr_count_and_flags = {<tbb::internal::atomic_impl_with_arithmetic<unsigned long, unsigned long, char>> = {<tbb::internal::atomic_impl<unsigned long>> = {my_storage = {my_value = 0}}, <No data fields>}, <No data fields>}}, id_map_ = std::map with 17 elements = {["CqlIfImpl::Task"] = 2, ["Kafka Timer"] = 4, ["UDC_cfg_poll_timer"] = 1,
    ["analytics::DbHandler"] = 8, ["bgp::Config"] = 16, ["collector:DbIf"] = 3, ["http client"] = 15, ["io::ReaderTask"] = 6, ["io::udp::ReaderTask"] = 10, ["sandesh::LifetimeMgr"] = 7, ["sandesh::RecvQueue"] = 11, ["sandesh::SandeshClientReader"] = 14, ["sandesh::SandeshClientSM"] = 12, ["sandesh::SandeshClientSession"] = 13,
    ["sandesh::SandeshStateMachine"] = 5, ["vizd::Stats"] = 17, ["vizd::syslog"] = 9}, id_max_ = 17, log_fn_ = {<boost::function5<void, char const*, unsigned int, Task const*, char const*, unsigned int>> = {<boost::function_base> = {vtable = 0x0, functor = {obj_ptr = 0x29ff688, type = {type = 0x29ff688, const_qualified = 88, volatile_qualified = 135},
          func_ptr = 0x29ff688, bound_memfunc_ptr = {memfunc_ptr = (void (boost::detail::function::X::*)(boost::detail::function::X * const, int)) 0x29ff688, this adjustment 43943768, obj_ptr = 0x2a35598}, obj_ref = {obj_ptr = 0x29ff688, is_const_qualified = 88, is_volatile_qualified = 135}, data = -120 '\210'}}, static args = <optimized out>,
      static arity = <optimized out>}, <No data fields>}, hw_thread_count_ = 16, track_run_time_ = false, measure_delay_ = false, schedule_delay_ = 0, execute_delay_ = 0, enqueue_count_ = 519, done_count_ = 506, cancel_count_ = 12, static ThreadAmpFactor_ = 1}
(gdb) p *(TaskImpl *)0x7f67d3bc3c40
$6 = {void (TaskImpl * const, Task *)} 0x7f67d3bc3c40
(gdb) p parent_->task_impl_
$7 = (tbb::task *) 0x7f67d3bc3c40
(gdb) p *parent_->task_impl_
$8 = <incomplete type>
(gdb)

It looks like the http client task Workqueue DequeueEvent function is causing this. Ted some of the vncapi.cc RemoveConnection calls will delete the connection right after the call ??? That does not seem correct at first glance.

Can you please confirm?

Thanks

Megh

On Aug 15, 2016, at 4:18 PM, Wenqing Liang <email address hidden> wrote:

Hi Megh,
I made the same changes to the other two controllers 10.87.141.118 and 10.87.141.126. root/n1keenA. Restart of contrail-collector exited with ERROR however.

root@a5d01e04:~# scp root@10.87.141.120:/var/tmp/vizd.debug /var/tmp/
vizd.debug 100% 83MB 83.4MB/s 00:01
root@a5d01e04:~# mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
root@a5d01e04:~# ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
root@a5d01e04:~# service contrail-collector restart
contrail-collector: stopped
contrail-collector: ERROR (abnormal termination)
root@a5d01e04:~# ssh -l root 10.87.141.126
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-85-generic x86_64)

* Documentation: https://help.ubuntu.com/
New release '16.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

Last login: Mon Aug 15 16:06:40 2016 from 10.87.141.118
root@a5d01e05:~# scp root@10.87.141.120:/var/tmp/vizd.debug /var/tmp/
vizd.debug 100% 83MB 83.4MB/s 00:00
root@a5d01e05:~# mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
root@a5d01e05:~# ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
root@a5d01e05:~# service contrail-collector restart
contrail-collector: stopped
contrail-collector: ERROR (abnormal termination)
root@a5d01e05:~#

No, I don’t think I have any user defined counters configured.

Thanks,
Wenqing

From: Megh Bhatt
Sent: Monday, August 15, 2016 3:54 PM
To: Raj Reddy <email address hidden>; Wenqing Liang <email address hidden>
Cc: Ignatious Johnson <email address hidden>; Abhay Joshi <email address hidden>; dl-contrail-server-manager <email address hidden>
Subject: Re: service issue

Hi Wenqing,
The production contrail-collector binary has optimized out bunch of variables in the core and hence we cannot determine which task caused the core. The reason is a call to empty boost function. To debug further, can you please do the following:

1. Change log_level to SYS_DEBUG in all the controller’s /etc/contrail/contrail-collector.conf

2. Copy over the debug vizd binary from /github-build/R3.1/23/ubuntu-14-04/mitaka/store/sandbox/build/debug//analytics/vizd to all the controllers and run it instead of the production binary.

scp /github-build/R3.1/23/ubuntu-14-04/mitaka/store/sandbox/build/debug//analytics/vizd root@<IP-address-of-controller-nodes>:/var/tmp/vizd.debug
mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
service contrail-collector restart

I have already done the above on 10.87.141.120

I will also look at the current checkins to find out. Also can you please confirm if you have any user defined counters configured, that is a new contrail-collector feature that was added in 3.1.

Thanks

Megh

On Aug 15, 2016, at 8:03 AM, Raj Reddy <email address hidden> wrote:

+Megh,

Megh, can you take a look and see why collector is crashing..

thanks,

On Aug 14, 2016, at 9:57 PM, Ignatious Johnson <email address hidden> wrote:

Hi Raj,

I see contrail-collector core continuously. Can you get the team to take a look.

Hi Wenqing,
Device-manager/schema/svc-monitor will be active in one of the config node and backup in other nodes. This is expected.

Thanks,
Ignatious

From: Wenqing Liang <email address hidden>
Date: Sunday, August 14, 2016 at 9:03 AM
To: Abhay Joshi <email address hidden>, Ignatious Johnson <email address hidden>
Cc: Raj Reddy <email address hidden>, dl-contrail-server-manager <email address hidden>
Subject: RE: service issue

Any news? Can I take back the setup?
Thanks,
Wenqing

From: Abhay Joshi
Sent: Friday, August 12, 2016 7:26 PM
To: Ignatious Johnson <email address hidden>
Cc: Wenqing Liang <email address hidden>; Raj Reddy <email address hidden>; dl-contrail-server-manager <email address hidden>
Subject: Re: service issue

You looking into this right, Ignatious?
On Aug 12, 2016, at 6:18 PM, Ignatious Johnson <email address hidden> wrote:

+ Abhay
On Aug 12, 2016 5:44 PM, Wenqing Liang <email address hidden> wrote:
Hi,
On my r3.1.0.0-23 mitaka sm provisioned ha cluster, I see the following. Could u pls have a look? 10.87.141.120 is the cfg node and the sm is 10.87.141.124. root/n1keenA.

root@a5d01e03:~# contrail-status | egrep -v "active|="

contrail-snmp-collector initializing (ApiServer:SNMP[Not connected] connection down)

contrail-device-manager backup
contrail-schema backup
contrail-svc-monitor backup

/var/crashes/core.contrail-collec.10616.a5d01e03.1471037078
/var/crashes/core.contrail-collec.7173.a5d01e03.1471038160
/var/crashes/core.contrail-collec.18920.a5d01e03.1471030477
/var/crashes/core.contrail-collec.1853.a5d01e03.1471037920
root@a5d01e03:~#

Thanks,
Wenqing

Tags: analytics
wenqing liang (wliang)
information type: Proprietary → Public
wenqing liang (wliang)
description: updated
Raj Reddy (rajreddy)
tags: added: analytics
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/23379
Submitter: ted ghose (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/23380
Submitter: ted ghose (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/23379
Submitter: ted ghose (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/23379
Committed: http://github.org/Juniper/contrail-controller/commit/e887cca79c4f53af22c757d937f43f471e86bb80
Submitter: Zuul
Branch: master

commit e887cca79c4f53af22c757d937f43f471e86bb80
Author: Ted Ghose <email address hidden>
Date: Thu Aug 18 14:55:32 2016 -0700

contrail-collector core continuously in ha setup

- http client gets deleted on discovery refresh

Change-Id: I7809a1f139cfc03b748ea5d1fcbe68e9c8b96abc
Closes-Bug: 1613896

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/23380
Committed: http://github.org/Juniper/contrail-controller/commit/68bb82377f760f0dad3a3243e2086a8701d33805
Submitter: Zuul
Branch: R3.1

commit 68bb82377f760f0dad3a3243e2086a8701d33805
Author: Ted Ghose <email address hidden>
Date: Thu Aug 18 14:55:32 2016 -0700

contrail-collector core continuously in ha setup

- http client gets deleted on discovery refresh

Change-Id: I3187f898a440d5c41c8fca34bcda34ff2ad8cf0e
Closes-Bug: 1613896

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.