Activity log for bug #1613896

Date Who What changed Old value New value Message
2016-08-16 23:03:48 wenqing liang bug added bug
2016-08-16 23:04:27 wenqing liang nominated for series juniperopenstack/trunk
2016-08-16 23:04:27 wenqing liang bug task added juniperopenstack/trunk
2016-08-16 23:04:27 wenqing liang nominated for series juniperopenstack/r3.1
2016-08-16 23:04:27 wenqing liang bug task added juniperopenstack/r3.1
2016-08-16 23:04:34 wenqing liang juniperopenstack/r3.1: importance Undecided Medium
2016-08-16 23:04:50 wenqing liang juniperopenstack/r3.1: assignee ted ghose (rintu)
2016-08-16 23:05:02 wenqing liang juniperopenstack/r3.1: milestone r3.1.1.0
2016-08-16 23:05:11 wenqing liang juniperopenstack/trunk: milestone r3.2.0.0-fcs
2016-08-16 23:05:17 wenqing liang information type Proprietary Public
2016-08-16 23:07:23 wenqing liang juniperopenstack/r3.1: importance Medium High
2016-08-16 23:07:25 wenqing liang juniperopenstack/trunk: importance Medium High
2016-08-16 23:11:10 wenqing liang description

In an SM-provisioned r3.1.0.0-23 mitaka HA setup, contrail-collector was seen coring continuously.

From: Ted Ghose
Sent: Monday, August 15, 2016 7:26 PM
To: Megh Bhatt <meghb@juniper.net>; Wenqing Liang <wliang@juniper.net>
Cc: Raj Reddy <rajreddy@juniper.net>; Ignatious Johnson <ijohnson@juniper.net>; Abhay Joshi <abhayj@juniper.net>; dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: Re: service issue

It's a scoped pointer; if the callback is recorded, it won't be deleted, right?
________________________________________
From: Megh Bhatt
Sent: Monday, August 15, 2016 7:20 PM
To: Wenqing Liang; Ted Ghose
Cc: Raj Reddy; Ignatious Johnson; Abhay Joshi; dl-contrail-server-manager
Subject: Re: service issue

Sorry, I mean: can the VncApi client get deleted while we can still access it from the RespHandler callback? That can end up enqueueing a callback on a deleted object/empty boost function.

Thanks
Megh

On Aug 15, 2016, at 7:13 PM, Megh Bhatt <meghb@juniper.net> wrote:

Added Ted. The collector did core on the first controller:

/var/log/contrail/contrail-collector.log.4:2016-08-15 Mon 15:53:26:298.401 PDT a5d01e03 [Thread 140083858196224, Pid 21908]: !!!! ERROR !!!! Task caught fatal exception: call to empty boost::function TaskImpl: 0x7f67d3bc3c40

Core was generated by `/usr/bin/contrail-collector'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f67dabb6cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f67dabb6cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f67dabba0d8 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f67dabafb86 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f67dabafc32 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x000000000045c3d7 in TaskImpl::execute (this=0x7f67d3bc3c40) at controller/src/base/task.cc:291
#5  0x00007f67dc13eb3a in ?? () from /usr/lib/libtbb.so.2
#6  0x00007f67dc13a816 in ?? () from /usr/lib/libtbb.so.2
#7  0x00007f67dc139f4b in ?? () from /usr/lib/libtbb.so.2
#8  0x00007f67dc1360ff in ?? () from /usr/lib/libtbb.so.2
#9  0x00007f67dc1362f9 in ?? () from /usr/lib/libtbb.so.2
#10 0x00007f67dc35a182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007f67dac7a47d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) f 4
#4  0x000000000045c3d7 in TaskImpl::execute (this=0x7f67d3bc3c40) at controller/src/base/task.cc:291
291     controller/src/base/task.cc: No such file or directory.
(gdb) info locals
what = "call to empty boost::function"
e = @0x7f67a0007cb0: <incomplete type>
running = @0x7f67d3b03c80: 0x2a857b0
__PRETTY_FUNCTION__ = "virtual tbb::task* TaskImpl::execute()"
(gdb) p *this
$1 = {<tbb::task> = {<No data fields>}, parent_ = 0x2a857b0}
(gdb) set print object on
(gdb) p parent_
$2 = (QueueTaskRunner<boost::function<void()>, WorkQueue<boost::function<void()> > > *) 0x2a857b0
(gdb) p *parent_
$3 = (QueueTaskRunner<boost::function<void()>, WorkQueue<boost::function<void()> > >) {<Task> = {_vptr.Task = 0x9b6bf0 <vtable for QueueTaskRunner<boost::function<void ()>, WorkQueue<boost::function<void ()> > >+16>, static kTaskInstanceAny = -1, task_id_ = 15, task_instance_ = 0, task_impl_ = 0x7f67d3bc3c40, state_ = Task::RUN, seqno_ = 518,
    task_recycle_ = false, task_cancel_ = false, enqueue_time_ = 0, schedule_time_ = 0, execute_delay_ = 0, schedule_delay_ = 0,
    waitq_hook_ = {<boost::intrusive::detail::generic_hook<boost::intrusive::get_list_node_algo<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, 0>> = {<boost::intrusive::detail::no_default_definer> = {<No data fields>}, <boost::intrusive::list_node<void*>> = {next_ = 0x0, prev_ = 0x0}, <No data fields>}, <No data fields>}},
  queue_ = 0x7f677c00a9d0}
(gdb) p TaskScheduler::singleton_.px_
There is no member or method named px_.
(gdb) p TaskScheduler::singleton_.px
$4 = (TaskScheduler *) 0x2a044a0
(gdb) p *TaskScheduler::singleton_.px
$5 = {static kVectorGrowSize = 16, static singleton_ = {px = 0x2a044a0}, stop_entry_ = 0x2a46530,
  task_scheduler_ = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>}, my_scheduler = 0x7f67d3bcf800, static automatic = -1, static deferred = -2},
  mutex_ = {static is_rw_mutex = false, static is_recursive_mutex = false, static is_fair_mutex = false, impl = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}},
  running_ = true, seqno_ = 519,
  task_group_db_ = std::vector of length 33, capacity 33 = {0x0, 0x2ad3450, 0x0, 0x0, 0x2ad17f0, 0x2a58470, 0x2a58660, 0x2a49f40, 0x2a592e0, 0x0, 0x0, 0x0, 0x2a97a10, 0x2a939f0, 0x2a97230, 0x2ad53d0, 0x0, 0x2ad4a00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  id_map_mutex_ = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>},
    reader_head = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock_read*, long, tbb::interface5::reader_writer_lock::scoped_lock_read>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock_read*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>},
    writer_head = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock*, long, tbb::interface5::reader_writer_lock::scoped_lock>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>},
    writer_tail = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock*, long, tbb::interface5::reader_writer_lock::scoped_lock>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>},
    my_current_writer = {my_id = 0},
    rdr_count_and_flags = {<tbb::internal::atomic_impl_with_arithmetic<unsigned long, unsigned long, char>> = {<tbb::internal::atomic_impl<unsigned long>> = {my_storage = {my_value = 0}}, <No data fields>}, <No data fields>}},
  id_map_ = std::map with 17 elements = {["CqlIfImpl::Task"] = 2, ["Kafka Timer"] = 4, ["UDC_cfg_poll_timer"] = 1, ["analytics::DbHandler"] = 8, ["bgp::Config"] = 16, ["collector:DbIf"] = 3, ["http client"] = 15, ["io::ReaderTask"] = 6, ["io::udp::ReaderTask"] = 10, ["sandesh::LifetimeMgr"] = 7, ["sandesh::RecvQueue"] = 11, ["sandesh::SandeshClientReader"] = 14, ["sandesh::SandeshClientSM"] = 12, ["sandesh::SandeshClientSession"] = 13, ["sandesh::SandeshStateMachine"] = 5, ["vizd::Stats"] = 17, ["vizd::syslog"] = 9},
  id_max_ = 17,
  log_fn_ = {<boost::function5<void, char const*, unsigned int, Task const*, char const*, unsigned int>> = {<boost::function_base> = {vtable = 0x0, functor = {obj_ptr = 0x29ff688, type = {type = 0x29ff688, const_qualified = 88, volatile_qualified = 135}, func_ptr = 0x29ff688, bound_memfunc_ptr = {memfunc_ptr = (void (boost::detail::function::X::*)(boost::detail::function::X * const, int)) 0x29ff688, this adjustment 43943768, obj_ptr = 0x2a35598}, obj_ref = {obj_ptr = 0x29ff688, is_const_qualified = 88, is_volatile_qualified = 135}, data = -120 '\210'}}, static args = <optimized out>, static arity = <optimized out>}, <No data fields>},
  hw_thread_count_ = 16, track_run_time_ = false, measure_delay_ = false, schedule_delay_ = 0, execute_delay_ = 0, enqueue_count_ = 519, done_count_ = 506, cancel_count_ = 12, static ThreadAmpFactor_ = 1}
(gdb) p *(TaskImpl *)0x7f67d3bc3c40
$6 = {void (TaskImpl * const, Task *)} 0x7f67d3bc3c40
(gdb) p parent_->task_impl_
$7 = (tbb::task *) 0x7f67d3bc3c40
(gdb) p *parent_->task_impl_
$8 = <incomplete type>
(gdb)
It looks like the http client task's WorkQueue DequeueEvent function is causing this (task_id_ 15 in the dump above maps to "http client" in id_map_). Ted, some of the vncapi.cc RemoveConnection calls will delete the connection right after the call? That does not seem correct at first glance. Can you please confirm?

Thanks
Megh

On Aug 15, 2016, at 4:18 PM, Wenqing Liang <wliang@juniper.net> wrote:

Hi Megh,

I made the same changes to the other two controllers, 10.87.141.118 and 10.87.141.126 (root/n1keenA). Restart of contrail-collector exited with ERROR, however.

root@a5d01e04:~# scp root@10.87.141.120:/var/tmp/vizd.debug /var/tmp/
vizd.debug                                         100%   83MB  83.4MB/s   00:01
root@a5d01e04:~# mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
root@a5d01e04:~# ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
root@a5d01e04:~# service contrail-collector restart
contrail-collector: stopped
contrail-collector: ERROR (abnormal termination)
root@a5d01e04:~# ssh -l root 10.87.141.126
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-85-generic x86_64)
 * Documentation:  https://help.ubuntu.com/
New release '16.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.
Last login: Mon Aug 15 16:06:40 2016 from 10.87.141.118
root@a5d01e05:~# scp root@10.87.141.120:/var/tmp/vizd.debug /var/tmp/
vizd.debug                                         100%   83MB  83.4MB/s   00:00
root@a5d01e05:~# mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
root@a5d01e05:~# ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
root@a5d01e05:~# service contrail-collector restart
contrail-collector: stopped
contrail-collector: ERROR (abnormal termination)
root@a5d01e05:~#

No, I don't think I have any user defined counters configured.

Thanks,
Wenqing

From: Megh Bhatt
Sent: Monday, August 15, 2016 3:54 PM
To: Raj Reddy <rajreddy@juniper.net>; Wenqing Liang <wliang@juniper.net>
Cc: Ignatious Johnson <ijohnson@juniper.net>; Abhay Joshi <abhayj@juniper.net>; dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: Re: service issue

Hi Wenqing,

The production contrail-collector binary has optimized out a bunch of variables in the core, and hence we cannot determine which task caused it. The cause is a call to an empty boost function. To debug further, can you please do the following:

1. Change log_level to SYS_DEBUG in every controller's /etc/contrail/contrail-collector.conf.
2. Copy over the debug vizd binary from /github-build/R3.1/23/ubuntu-14-04/mitaka/store/sandbox/build/debug//analytics/vizd to all the controllers and run it instead of the production binary:

scp /github-build/R3.1/23/ubuntu-14-04/mitaka/store/sandbox/build/debug//analytics/vizd root@<IP-address-of-controller-nodes>:/var/tmp/vizd.debug
mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
service contrail-collector restart

I have already done the above on 10.87.141.120. I will also look at the recent check-ins to find out. Also, can you please confirm whether you have any user defined counters configured? That is a new contrail-collector feature that was added in 3.1.

Thanks
Megh

On Aug 15, 2016, at 8:03 AM, Raj Reddy <rajreddy@juniper.net> wrote:

+Megh. Megh, can you take a look and see why the collector is crashing?

Thanks,

On Aug 14, 2016, at 9:57 PM, Ignatious Johnson <ijohnson@juniper.net> wrote:

Hi Raj,

I see contrail-collector coring continuously. Can you get the team to take a look?

Hi Wenqing,

Device-manager/schema/svc-monitor will be active on one of the config nodes and backup on the others. This is expected.

Thanks,
Ignatious

From: Wenqing Liang <wliang@juniper.net>
Date: Sunday, August 14, 2016 at 9:03 AM
To: Abhay Joshi <abhayj@juniper.net>, Ignatious Johnson <ijohnson@juniper.net>
Cc: Raj Reddy <rajreddy@juniper.net>, dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: RE: service issue

Any news? Can I take back the setup?

Thanks,
Wenqing

From: Abhay Joshi
Sent: Friday, August 12, 2016 7:26 PM
To: Ignatious Johnson <ijohnson@juniper.net>
Cc: Wenqing Liang <wliang@juniper.net>; Raj Reddy <rajreddy@juniper.net>; dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: Re: service issue

You are looking into this, right, Ignatious?

On Aug 12, 2016, at 6:18 PM, Ignatious Johnson <ijohnson@juniper.net> wrote:

+ Abhay

On Aug 12, 2016 5:44 PM, Wenqing Liang <wliang@juniper.net> wrote:

Hi,

On my r3.1.0.0-23 mitaka SM-provisioned HA cluster, I see the following. Could you please have a look? 10.87.141.120 is the cfg node and the SM is 10.87.141.124 (root/n1keenA).

root@a5d01e03:~# contrail-status | egrep -v "active|="
contrail-snmp-collector      initializing (ApiServer:SNMP[Not connected] connection down)
contrail-device-manager      backup
contrail-schema              backup
contrail-svc-monitor         backup
/var/crashes/core.contrail-collec.10616.a5d01e03.1471037078
/var/crashes/core.contrail-collec.7173.a5d01e03.1471038160
/var/crashes/core.contrail-collec.18920.a5d01e03.1471030477
/var/crashes/core.contrail-collec.1853.a5d01e03.1471037920
root@a5d01e03:~#

Thanks,
Wenqing

In an SM-provisioned r3.1.0.0-23 mitaka HA setup, contrail-collector was seen coring continuously. Cores uploaded to /cs-shared/bugs/1613896.
2016-08-17 00:59:09 Raj Reddy tags analytics
2016-08-17 16:09:29 OpenContrail Admin juniperopenstack/trunk: status New In Progress
2016-08-17 16:12:31 OpenContrail Admin juniperopenstack/r3.1: status New In Progress
2016-08-19 07:40:11 OpenContrail Admin juniperopenstack/trunk: status In Progress Fix Committed
2016-08-22 19:52:56 OpenContrail Admin juniperopenstack/r3.1: status In Progress Fix Committed