2016-08-16 23:11:10 |
wenqing liang |
description |
In a sm provisioned r3.1.0.0-23 mitaka ha setup, contrail-collector was seen core continuously.
From: Ted Ghose
Sent: Monday, August 15, 2016 7:26 PM
To: Megh Bhatt <meghb@juniper.net>; Wenqing Liang <wliang@juniper.net>
Cc: Raj Reddy <rajreddy@juniper.net>; Ignatious Johnson <ijohnson@juniper.net>; Abhay Joshi <abhayj@juniper.net>; dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: Re: service issue
its a scoped pointer, if the callback is recorded, it wont be deleted, right?
________________________________________
From: Megh Bhatt
Sent: Monday, August 15, 2016 7:20 PM
To: Wenqing Liang; Ted Ghose
Cc: Raj Reddy; Ignatious Johnson; Abhay Joshi; dl-contrail-server-manager
Subject: Re: service issue
Sorry I mean VncApi client can get deleted and we can still access it from the RespHandler callback ?? and then that can end up enqueueing a callback on a deleted object/empty boost function.
Thanks
Megh
On Aug 15, 2016, at 7:13 PM, Megh Bhatt <meghb@juniper.net> wrote:
Added Ted
The collector did core on the first controller
/var/log/contrail/contrail-collector.log.4:2016-08-15 Mon 15:53:26:298.401 PDT a5d01e03 [Thread 140083858196224, Pid 21908]: !!!! ERROR !!!! Task caught fatal exception: call to empty boost::function TaskImpl: 0x7f67d3bc3c40
Core was generated by `/usr/bin/contrail-collector'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f67dabb6cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f67dabb6cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f67dabba0d8 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f67dabafb86 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f67dabafc32 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x000000000045c3d7 in TaskImpl::execute (this=0x7f67d3bc3c40) at controller/src/base/task.cc:291
#5 0x00007f67dc13eb3a in ?? () from /usr/lib/libtbb.so.2
#6 0x00007f67dc13a816 in ?? () from /usr/lib/libtbb.so.2
#7 0x00007f67dc139f4b in ?? () from /usr/lib/libtbb.so.2
#8 0x00007f67dc1360ff in ?? () from /usr/lib/libtbb.so.2
#9 0x00007f67dc1362f9 in ?? () from /usr/lib/libtbb.so.2
#10 0x00007f67dc35a182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007f67dac7a47d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) f 4
#4 0x000000000045c3d7 in TaskImpl::execute (this=0x7f67d3bc3c40) at controller/src/base/task.cc:291
291 controller/src/base/task.cc: No such file or directory.
(gdb) info locals
what = "call to empty boost::function"
e = @0x7f67a0007cb0: <incomplete type>
running = @0x7f67d3b03c80: 0x2a857b0
__PRETTY_FUNCTION__ = "virtual tbb::task* TaskImpl::execute()"
(gdb) p *this
$1 = {<tbb::task> = {<No data fields>}, parent_ = 0x2a857b0}
(gdb) set print object on
(gdb) p parent_
$2 = (QueueTaskRunner<boost::function<void()>, WorkQueue<boost::function<void()> > > *) 0x2a857b0
(gdb) p *parent_
$3 = (QueueTaskRunner<boost::function<void()>, WorkQueue<boost::function<void()> > >) {<Task> = {_vptr.Task = 0x9b6bf0 <vtable for QueueTaskRunner<boost::function<void ()>, WorkQueue<boost::function<void ()> > >+16>, static kTaskInstanceAny = -1, task_id_ = 15, task_instance_ = 0, task_impl_ = 0x7f67d3bc3c40, state_ = Task::RUN, seqno_ = 518,
task_recycle_ = false, task_cancel_ = false, enqueue_time_ = 0, schedule_time_ = 0, execute_delay_ = 0, schedule_delay_ = 0,
waitq_hook_ = {<boost::intrusive::detail::generic_hook<boost::intrusive::get_list_node_algo<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, 0>> = {<boost::intrusive::detail::no_default_definer> = {<No data fields>}, <boost::intrusive::list_node<void*>> = {next_ = 0x0, prev_ = 0x0}, <No data fields>}, <No data fields>}},
queue_ = 0x7f677c00a9d0}
(gdb) p TaskScheduler::singleton_.px_
There is no member or method named px_.
(gdb) p TaskScheduler::singleton_.px
$4 = (TaskScheduler *) 0x2a044a0
(gdb) p *TaskScheduler::singleton_.px
$5 = {static kVectorGrowSize = 16, static singleton_ = {px = 0x2a044a0}, stop_entry_ = 0x2a46530, task_scheduler_ = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>}, my_scheduler = 0x7f67d3bcf800, static automatic = -1, static deferred = -2}, mutex_ = {static is_rw_mutex = false,
static is_recursive_mutex = false, static is_fair_mutex = false, impl = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, running_ = true, seqno_ = 519, task_group_db_ = std::vector of length 33, capacity 33 = {
0x0, 0x2ad3450, 0x0, 0x0, 0x2ad17f0, 0x2a58470, 0x2a58660, 0x2a49f40, 0x2a592e0, 0x0, 0x0, 0x0, 0x2a97a10, 0x2a939f0, 0x2a97230, 0x2ad53d0, 0x0, 0x2ad4a00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, id_map_mutex_ = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>},
reader_head = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock_read*, long, tbb::interface5::reader_writer_lock::scoped_lock_read>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock_read*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>},
writer_head = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock*, long, tbb::interface5::reader_writer_lock::scoped_lock>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>},
writer_tail = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock*, long, tbb::interface5::reader_writer_lock::scoped_lock>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>}, my_current_writer = {my_id = 0},
rdr_count_and_flags = {<tbb::internal::atomic_impl_with_arithmetic<unsigned long, unsigned long, char>> = {<tbb::internal::atomic_impl<unsigned long>> = {my_storage = {my_value = 0}}, <No data fields>}, <No data fields>}}, id_map_ = std::map with 17 elements = {["CqlIfImpl::Task"] = 2, ["Kafka Timer"] = 4, ["UDC_cfg_poll_timer"] = 1,
["analytics::DbHandler"] = 8, ["bgp::Config"] = 16, ["collector:DbIf"] = 3, ["http client"] = 15, ["io::ReaderTask"] = 6, ["io::udp::ReaderTask"] = 10, ["sandesh::LifetimeMgr"] = 7, ["sandesh::RecvQueue"] = 11, ["sandesh::SandeshClientReader"] = 14, ["sandesh::SandeshClientSM"] = 12, ["sandesh::SandeshClientSession"] = 13,
["sandesh::SandeshStateMachine"] = 5, ["vizd::Stats"] = 17, ["vizd::syslog"] = 9}, id_max_ = 17, log_fn_ = {<boost::function5<void, char const*, unsigned int, Task const*, char const*, unsigned int>> = {<boost::function_base> = {vtable = 0x0, functor = {obj_ptr = 0x29ff688, type = {type = 0x29ff688, const_qualified = 88, volatile_qualified = 135},
func_ptr = 0x29ff688, bound_memfunc_ptr = {memfunc_ptr = (void (boost::detail::function::X::*)(boost::detail::function::X * const, int)) 0x29ff688, this adjustment 43943768, obj_ptr = 0x2a35598}, obj_ref = {obj_ptr = 0x29ff688, is_const_qualified = 88, is_volatile_qualified = 135}, data = -120 '\210'}}, static args = <optimized out>,
static arity = <optimized out>}, <No data fields>}, hw_thread_count_ = 16, track_run_time_ = false, measure_delay_ = false, schedule_delay_ = 0, execute_delay_ = 0, enqueue_count_ = 519, done_count_ = 506, cancel_count_ = 12, static ThreadAmpFactor_ = 1}
(gdb) p *(TaskImpl *)0x7f67d3bc3c40
$6 = {void (TaskImpl * const, Task *)} 0x7f67d3bc3c40
(gdb) p parent_->task_impl_
$7 = (tbb::task *) 0x7f67d3bc3c40
(gdb) p *parent_->task_impl_
$8 = <incomplete type>
(gdb)
It looks like the http client task Workqueue DequeueEvent function is causing this. Ted some of the vncapi.cc RemoveConnection calls will delete the connection right after the call ??? That does not seem correct at first glance.
Can you please confirm?
Thanks
Megh
On Aug 15, 2016, at 4:18 PM, Wenqing Liang <wliang@juniper.net> wrote:
Hi Megh,
I made the same changes to the other two controllers 10.87.141.118 and 10.87.141.126. root/n1keenA. Restart of contrail-collector exited with ERROR however.
root@a5d01e04:~# scp root@10.87.141.120:/var/tmp/vizd.debug /var/tmp/
vizd.debug 100% 83MB 83.4MB/s 00:01
root@a5d01e04:~# mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
root@a5d01e04:~# ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
root@a5d01e04:~# service contrail-collector restart
contrail-collector: stopped
contrail-collector: ERROR (abnormal termination)
root@a5d01e04:~# ssh -l root 10.87.141.126
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-85-generic x86_64)
* Documentation: https://help.ubuntu.com/
New release '16.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.
Last login: Mon Aug 15 16:06:40 2016 from 10.87.141.118
root@a5d01e05:~# scp root@10.87.141.120:/var/tmp/vizd.debug /var/tmp/
vizd.debug 100% 83MB 83.4MB/s 00:00
root@a5d01e05:~# mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
root@a5d01e05:~# ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
root@a5d01e05:~# service contrail-collector restart
contrail-collector: stopped
contrail-collector: ERROR (abnormal termination)
root@a5d01e05:~#
No, I don’t think I have any user defined counters configured.
Thanks,
Wenqing
From: Megh Bhatt
Sent: Monday, August 15, 2016 3:54 PM
To: Raj Reddy <rajreddy@juniper.net>; Wenqing Liang <wliang@juniper.net>
Cc: Ignatious Johnson <ijohnson@juniper.net>; Abhay Joshi <abhayj@juniper.net>; dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: Re: service issue
Hi Wenqing,
The production contrail-collector binary has optimized out bunch of variables in the core and hence we cannot determine which task caused the core. The reason is a call to empty boost function. To debug further, can you please do the following:
1. Change log_level to SYS_DEBUG in all the controller’s /etc/contrail/contrail-collector.conf
2. Copy over the debug vizd binary from /github-build/R3.1/23/ubuntu-14-04/mitaka/store/sandbox/build/debug//analytics/vizd to all the controllers and run it instead of the production binary.
scp /github-build/R3.1/23/ubuntu-14-04/mitaka/store/sandbox/build/debug//analytics/vizd root@<IP-address-of-controller-nodes>:/var/tmp/vizd.debug
mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
service contrail-collector restart
I have already done the above on 10.87.141.120
I will also look at the current checkins to find out. Also can you please confirm if you have any user defined counters configured, that is a new contrail-collector feature that was added in 3.1.
Thanks
Megh
On Aug 15, 2016, at 8:03 AM, Raj Reddy <rajreddy@juniper.net> wrote:
+Megh,
Megh, can you take a look and see why collector is crashing..
thanks,
On Aug 14, 2016, at 9:57 PM, Ignatious Johnson <ijohnson@juniper.net> wrote:
Hi Raj,
I see contrail-collector core continuously. Can you get the team to take a look.
Hi Wenqing,
Device-manager/schema/svc-monitor will be active in one of the config node and backup in other nodes. This is expected.
Thanks,
Ignatious
From: Wenqing Liang <wliang@juniper.net>
Date: Sunday, August 14, 2016 at 9:03 AM
To: Abhay Joshi <abhayj@juniper.net>, Ignatious Johnson <ijohnson@juniper.net>
Cc: Raj Reddy <rajreddy@juniper.net>, dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: RE: service issue
Any news? Can I take back the setup?
Thanks,
Wenqing
From: Abhay Joshi
Sent: Friday, August 12, 2016 7:26 PM
To: Ignatious Johnson <ijohnson@juniper.net>
Cc: Wenqing Liang <wliang@juniper.net>; Raj Reddy <rajreddy@juniper.net>; dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: Re: service issue
You looking into this right, Ignatious?
On Aug 12, 2016, at 6:18 PM, Ignatious Johnson <ijohnson@juniper.net> wrote:
+ Abhay
On Aug 12, 2016 5:44 PM, Wenqing Liang <wliang@juniper.net> wrote:
Hi,
On my r3.1.0.0-23 mitaka sm provisioned ha cluster, I see the following. Could u pls have a look? 10.87.141.120 is the cfg node and the sm is 10.87.141.124. root/n1keenA.
root@a5d01e03:~# contrail-status | egrep -v "active|="
contrail-snmp-collector initializing (ApiServer:SNMP[Not connected] connection down)
contrail-device-manager backup
contrail-schema backup
contrail-svc-monitor backup
/var/crashes/core.contrail-collec.10616.a5d01e03.1471037078
/var/crashes/core.contrail-collec.7173.a5d01e03.1471038160
/var/crashes/core.contrail-collec.18920.a5d01e03.1471030477
/var/crashes/core.contrail-collec.1853.a5d01e03.1471037920
root@a5d01e03:~#
Thanks,
Wenqing |
In a sm provisioned r3.1.0.0-23 mitaka ha setup, contrail-collector was seen core continuously. cores uploaded to /cs-shared/bugs/1613896.
From: Ted Ghose
Sent: Monday, August 15, 2016 7:26 PM
To: Megh Bhatt <meghb@juniper.net>; Wenqing Liang <wliang@juniper.net>
Cc: Raj Reddy <rajreddy@juniper.net>; Ignatious Johnson <ijohnson@juniper.net>; Abhay Joshi <abhayj@juniper.net>; dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: Re: service issue
its a scoped pointer, if the callback is recorded, it wont be deleted, right?
________________________________________
From: Megh Bhatt
Sent: Monday, August 15, 2016 7:20 PM
To: Wenqing Liang; Ted Ghose
Cc: Raj Reddy; Ignatious Johnson; Abhay Joshi; dl-contrail-server-manager
Subject: Re: service issue
Sorry I mean VncApi client can get deleted and we can still access it from the RespHandler callback ?? and then that can end up enqueueing a callback on a deleted object/empty boost function.
Thanks
Megh
On Aug 15, 2016, at 7:13 PM, Megh Bhatt <meghb@juniper.net> wrote:
Added Ted
The collector did core on the first controller
/var/log/contrail/contrail-collector.log.4:2016-08-15 Mon 15:53:26:298.401 PDT a5d01e03 [Thread 140083858196224, Pid 21908]: !!!! ERROR !!!! Task caught fatal exception: call to empty boost::function TaskImpl: 0x7f67d3bc3c40
Core was generated by `/usr/bin/contrail-collector'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f67dabb6cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0 0x00007f67dabb6cc9 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f67dabba0d8 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f67dabafb86 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f67dabafc32 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x000000000045c3d7 in TaskImpl::execute (this=0x7f67d3bc3c40) at controller/src/base/task.cc:291
#5 0x00007f67dc13eb3a in ?? () from /usr/lib/libtbb.so.2
#6 0x00007f67dc13a816 in ?? () from /usr/lib/libtbb.so.2
#7 0x00007f67dc139f4b in ?? () from /usr/lib/libtbb.so.2
#8 0x00007f67dc1360ff in ?? () from /usr/lib/libtbb.so.2
#9 0x00007f67dc1362f9 in ?? () from /usr/lib/libtbb.so.2
#10 0x00007f67dc35a182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007f67dac7a47d in clone () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) f 4
#4 0x000000000045c3d7 in TaskImpl::execute (this=0x7f67d3bc3c40) at controller/src/base/task.cc:291
291 controller/src/base/task.cc: No such file or directory.
(gdb) info locals
what = "call to empty boost::function"
e = @0x7f67a0007cb0: <incomplete type>
running = @0x7f67d3b03c80: 0x2a857b0
__PRETTY_FUNCTION__ = "virtual tbb::task* TaskImpl::execute()"
(gdb) p *this
$1 = {<tbb::task> = {<No data fields>}, parent_ = 0x2a857b0}
(gdb) set print object on
(gdb) p parent_
$2 = (QueueTaskRunner<boost::function<void()>, WorkQueue<boost::function<void()> > > *) 0x2a857b0
(gdb) p *parent_
$3 = (QueueTaskRunner<boost::function<void()>, WorkQueue<boost::function<void()> > >) {<Task> = {_vptr.Task = 0x9b6bf0 <vtable for QueueTaskRunner<boost::function<void ()>, WorkQueue<boost::function<void ()> > >+16>, static kTaskInstanceAny = -1, task_id_ = 15, task_instance_ = 0, task_impl_ = 0x7f67d3bc3c40, state_ = Task::RUN, seqno_ = 518,
task_recycle_ = false, task_cancel_ = false, enqueue_time_ = 0, schedule_time_ = 0, execute_delay_ = 0, schedule_delay_ = 0,
waitq_hook_ = {<boost::intrusive::detail::generic_hook<boost::intrusive::get_list_node_algo<void*>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, 0>> = {<boost::intrusive::detail::no_default_definer> = {<No data fields>}, <boost::intrusive::list_node<void*>> = {next_ = 0x0, prev_ = 0x0}, <No data fields>}, <No data fields>}},
queue_ = 0x7f677c00a9d0}
(gdb) p TaskScheduler::singleton_.px_
There is no member or method named px_.
(gdb) p TaskScheduler::singleton_.px
$4 = (TaskScheduler *) 0x2a044a0
(gdb) p *TaskScheduler::singleton_.px
$5 = {static kVectorGrowSize = 16, static singleton_ = {px = 0x2a044a0}, stop_entry_ = 0x2a46530, task_scheduler_ = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>}, my_scheduler = 0x7f67d3bcf800, static automatic = -1, static deferred = -2}, mutex_ = {static is_rw_mutex = false,
static is_recursive_mutex = false, static is_fair_mutex = false, impl = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, running_ = true, seqno_ = 519, task_group_db_ = std::vector of length 33, capacity 33 = {
0x0, 0x2ad3450, 0x0, 0x0, 0x2ad17f0, 0x2a58470, 0x2a58660, 0x2a49f40, 0x2a592e0, 0x0, 0x0, 0x0, 0x2a97a10, 0x2a939f0, 0x2a97230, 0x2ad53d0, 0x0, 0x2ad4a00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, id_map_mutex_ = {<tbb::internal::no_copy> = {<tbb::internal::no_assign> = {<No data fields>}, <No data fields>},
reader_head = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock_read*, long, tbb::interface5::reader_writer_lock::scoped_lock_read>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock_read*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>},
writer_head = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock*, long, tbb::interface5::reader_writer_lock::scoped_lock>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>},
writer_tail = {<tbb::internal::atomic_impl_with_arithmetic<tbb::interface5::reader_writer_lock::scoped_lock*, long, tbb::interface5::reader_writer_lock::scoped_lock>> = {<tbb::internal::atomic_impl<tbb::interface5::reader_writer_lock::scoped_lock*>> = {my_storage = {my_value = 0x0}}, <No data fields>}, <No data fields>}, my_current_writer = {my_id = 0},
rdr_count_and_flags = {<tbb::internal::atomic_impl_with_arithmetic<unsigned long, unsigned long, char>> = {<tbb::internal::atomic_impl<unsigned long>> = {my_storage = {my_value = 0}}, <No data fields>}, <No data fields>}}, id_map_ = std::map with 17 elements = {["CqlIfImpl::Task"] = 2, ["Kafka Timer"] = 4, ["UDC_cfg_poll_timer"] = 1,
["analytics::DbHandler"] = 8, ["bgp::Config"] = 16, ["collector:DbIf"] = 3, ["http client"] = 15, ["io::ReaderTask"] = 6, ["io::udp::ReaderTask"] = 10, ["sandesh::LifetimeMgr"] = 7, ["sandesh::RecvQueue"] = 11, ["sandesh::SandeshClientReader"] = 14, ["sandesh::SandeshClientSM"] = 12, ["sandesh::SandeshClientSession"] = 13,
["sandesh::SandeshStateMachine"] = 5, ["vizd::Stats"] = 17, ["vizd::syslog"] = 9}, id_max_ = 17, log_fn_ = {<boost::function5<void, char const*, unsigned int, Task const*, char const*, unsigned int>> = {<boost::function_base> = {vtable = 0x0, functor = {obj_ptr = 0x29ff688, type = {type = 0x29ff688, const_qualified = 88, volatile_qualified = 135},
func_ptr = 0x29ff688, bound_memfunc_ptr = {memfunc_ptr = (void (boost::detail::function::X::*)(boost::detail::function::X * const, int)) 0x29ff688, this adjustment 43943768, obj_ptr = 0x2a35598}, obj_ref = {obj_ptr = 0x29ff688, is_const_qualified = 88, is_volatile_qualified = 135}, data = -120 '\210'}}, static args = <optimized out>,
static arity = <optimized out>}, <No data fields>}, hw_thread_count_ = 16, track_run_time_ = false, measure_delay_ = false, schedule_delay_ = 0, execute_delay_ = 0, enqueue_count_ = 519, done_count_ = 506, cancel_count_ = 12, static ThreadAmpFactor_ = 1}
(gdb) p *(TaskImpl *)0x7f67d3bc3c40
$6 = {void (TaskImpl * const, Task *)} 0x7f67d3bc3c40
(gdb) p parent_->task_impl_
$7 = (tbb::task *) 0x7f67d3bc3c40
(gdb) p *parent_->task_impl_
$8 = <incomplete type>
(gdb)
It looks like the http client task Workqueue DequeueEvent function is causing this. Ted some of the vncapi.cc RemoveConnection calls will delete the connection right after the call ??? That does not seem correct at first glance.
Can you please confirm?
Thanks
Megh
On Aug 15, 2016, at 4:18 PM, Wenqing Liang <wliang@juniper.net> wrote:
Hi Megh,
I made the same changes to the other two controllers 10.87.141.118 and 10.87.141.126. root/n1keenA. Restart of contrail-collector exited with ERROR however.
root@a5d01e04:~# scp root@10.87.141.120:/var/tmp/vizd.debug /var/tmp/
vizd.debug 100% 83MB 83.4MB/s 00:01
root@a5d01e04:~# mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
root@a5d01e04:~# ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
root@a5d01e04:~# service contrail-collector restart
contrail-collector: stopped
contrail-collector: ERROR (abnormal termination)
root@a5d01e04:~# ssh -l root 10.87.141.126
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-85-generic x86_64)
* Documentation: https://help.ubuntu.com/
New release '16.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.
Last login: Mon Aug 15 16:06:40 2016 from 10.87.141.118
root@a5d01e05:~# scp root@10.87.141.120:/var/tmp/vizd.debug /var/tmp/
vizd.debug 100% 83MB 83.4MB/s 00:00
root@a5d01e05:~# mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
root@a5d01e05:~# ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
root@a5d01e05:~# service contrail-collector restart
contrail-collector: stopped
contrail-collector: ERROR (abnormal termination)
root@a5d01e05:~#
No, I don’t think I have any user defined counters configured.
Thanks,
Wenqing
From: Megh Bhatt
Sent: Monday, August 15, 2016 3:54 PM
To: Raj Reddy <rajreddy@juniper.net>; Wenqing Liang <wliang@juniper.net>
Cc: Ignatious Johnson <ijohnson@juniper.net>; Abhay Joshi <abhayj@juniper.net>; dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: Re: service issue
Hi Wenqing,
The production contrail-collector binary has optimized out bunch of variables in the core and hence we cannot determine which task caused the core. The reason is a call to empty boost function. To debug further, can you please do the following:
1. Change log_level to SYS_DEBUG in all the controller’s /etc/contrail/contrail-collector.conf
2. Copy over the debug vizd binary from /github-build/R3.1/23/ubuntu-14-04/mitaka/store/sandbox/build/debug//analytics/vizd to all the controllers and run it instead of the production binary.
scp /github-build/R3.1/23/ubuntu-14-04/mitaka/store/sandbox/build/debug//analytics/vizd root@<IP-address-of-controller-nodes>:/var/tmp/vizd.debug
mv /usr/bin/contrail-collector /usr/bin/contrail-collector.orig
ln -s /var/tmp/vizd.debug /usr/bin/contrail-collector
service contrail-collector restart
I have already done the above on 10.87.141.120
I will also look at the current checkins to find out. Also can you please confirm if you have any user defined counters configured, that is a new contrail-collector feature that was added in 3.1.
Thanks
Megh
On Aug 15, 2016, at 8:03 AM, Raj Reddy <rajreddy@juniper.net> wrote:
+Megh,
Megh, can you take a look and see why collector is crashing..
thanks,
On Aug 14, 2016, at 9:57 PM, Ignatious Johnson <ijohnson@juniper.net> wrote:
Hi Raj,
I see contrail-collector core continuously. Can you get the team to take a look.
Hi Wenqing,
Device-manager/schema/svc-monitor will be active in one of the config node and backup in other nodes. This is expected.
Thanks,
Ignatious
From: Wenqing Liang <wliang@juniper.net>
Date: Sunday, August 14, 2016 at 9:03 AM
To: Abhay Joshi <abhayj@juniper.net>, Ignatious Johnson <ijohnson@juniper.net>
Cc: Raj Reddy <rajreddy@juniper.net>, dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: RE: service issue
Any news? Can I take back the setup?
Thanks,
Wenqing
From: Abhay Joshi
Sent: Friday, August 12, 2016 7:26 PM
To: Ignatious Johnson <ijohnson@juniper.net>
Cc: Wenqing Liang <wliang@juniper.net>; Raj Reddy <rajreddy@juniper.net>; dl-contrail-server-manager <dl-contrail-server-manager@juniper.net>
Subject: Re: service issue
You looking into this right, Ignatious?
On Aug 12, 2016, at 6:18 PM, Ignatious Johnson <ijohnson@juniper.net> wrote:
+ Abhay
On Aug 12, 2016 5:44 PM, Wenqing Liang <wliang@juniper.net> wrote:
Hi,
On my r3.1.0.0-23 mitaka sm provisioned ha cluster, I see the following. Could u pls have a look? 10.87.141.120 is the cfg node and the sm is 10.87.141.124. root/n1keenA.
root@a5d01e03:~# contrail-status | egrep -v "active|="
contrail-snmp-collector initializing (ApiServer:SNMP[Not connected] connection down)
contrail-device-manager backup
contrail-schema backup
contrail-svc-monitor backup
/var/crashes/core.contrail-collec.10616.a5d01e03.1471037078
/var/crashes/core.contrail-collec.7173.a5d01e03.1471038160
/var/crashes/core.contrail-collec.18920.a5d01e03.1471030477
/var/crashes/core.contrail-collec.1853.a5d01e03.1471037920
root@a5d01e03:~#
Thanks,
Wenqing |
|