contrail-collector cores continuously in HA setup
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
R3.1 | Fix Committed | High | ted ghose |
Trunk | Fix Committed | High | ted ghose |
Bug Description
In an SM-provisioned r3.1.0.0-23 Mitaka HA setup, contrail-collector was seen to core continuously. Cores were uploaded to /cs-shared/
From: Ted Ghose
Sent: Monday, August 15, 2016 7:26 PM
To: Megh Bhatt <email address hidden>; Wenqing Liang <email address hidden>
Cc: Raj Reddy <email address hidden>; Ignatious Johnson <email address hidden>; Abhay Joshi <email address hidden>; dl-contrail-
Subject: Re: service issue
It's a scoped pointer; if the callback is recorded, it won't be deleted, right?
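The question above turns on whether storing a callback extends the lifetime of the object it refers to. The following is a minimal sketch, not code from vncapi.cc: `FakeClient` and both functions are hypothetical, `std::unique_ptr` stands in for a scoped pointer, and `std::function` stands in for `boost::function`. It illustrates that a callback capturing a raw pointer does not keep the object alive, whereas a callback capturing a shared pointer does.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for a client object; not the real VncApi class.
struct FakeClient {
    explicit FakeClient(bool* destroyed) : destroyed_(destroyed) {}
    ~FakeClient() { *destroyed_ = true; }
    int Handle() const { return 42; }
    bool* destroyed_;
};

// A callback that captures a raw pointer does not keep the object alive:
// resetting the owning pointer destroys the object immediately.
bool RawCaptureKeepsAlive() {
    bool destroyed = false;
    std::unique_ptr<FakeClient> owner(new FakeClient(&destroyed));
    FakeClient* raw = owner.get();
    std::function<int()> cb = [raw]() { return raw->Handle(); };
    owner.reset();   // object destroyed; cb now dangles and must not run
    (void)cb;
    return !destroyed;
}

// A callback that captures a shared_ptr copy shares ownership, so the
// object survives until the callback itself is destroyed.
bool SharedCaptureKeepsAlive() {
    bool destroyed = false;
    std::shared_ptr<FakeClient> owner = std::make_shared<FakeClient>(&destroyed);
    std::function<int()> cb = [owner]() { return owner->Handle(); };
    owner.reset();   // drops one reference; cb still holds the other
    assert(!destroyed);
    return cb() == 42;
}
```

Under these assumptions, `RawCaptureKeepsAlive()` returns false and `SharedCaptureKeepsAlive()` returns true. Whether the real callback holds a raw or an owning reference is not stated in the thread.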
_______
From: Megh Bhatt
Sent: Monday, August 15, 2016 7:20 PM
To: Wenqing Liang; Ted Ghose
Cc: Raj Reddy; Ignatious Johnson; Abhay Joshi; dl-contrail-
Subject: Re: service issue
Sorry, I mean the VncApi client can get deleted and we can still access it from the RespHandler callback? That can then end up enqueueing a callback on a deleted object or an empty boost function.
Thanks
Megh
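One common way to guard against a response handler touching a deleted client is to have the handler hold only a weak reference and check it before use. The sketch below is an illustration of that general technique, not the fix that was committed: `GuardedClient`, `MakeRespHandler`, and `DeliverBeforeAndAfterDelete` are hypothetical names, and standard-library types are used in place of their Boost equivalents.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical client that records responses; not the real VncApi class.
class GuardedClient : public std::enable_shared_from_this<GuardedClient> {
public:
    // Builds a handler that holds only a weak reference, so it neither
    // extends the client's lifetime nor touches it after deletion.
    std::function<bool(const std::string&)> MakeRespHandler() {
        std::weak_ptr<GuardedClient> weak = shared_from_this();
        return [weak](const std::string& body) -> bool {
            std::shared_ptr<GuardedClient> self = weak.lock();
            if (!self) return false;   // client already deleted: drop it
            self->responses_.push_back(body);
            return true;
        };
    }
    std::size_t response_count() const { return responses_.size(); }

private:
    std::vector<std::string> responses_;
};

// Delivers one response while the client is alive and one after it has
// been deleted. Returns {handled_while_alive, handled_after_delete}.
std::pair<bool, bool> DeliverBeforeAndAfterDelete() {
    std::shared_ptr<GuardedClient> client = std::make_shared<GuardedClient>();
    std::function<bool(const std::string&)> handler = client->MakeRespHandler();
    bool first = handler("ok");
    assert(client->response_count() == 1);
    client.reset();                    // client is destroyed here
    bool second = handler("late");     // safely ignored
    return {first, second};
}
```

With this pattern the late response is dropped instead of being enqueued against a destroyed object. The trade-off is that the client must be owned by a `shared_ptr`, which may not match how the real code manages it.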
On Aug 15, 2016, at 7:13 PM, Megh Bhatt <email address hidden> wrote:
Added Ted
The collector did core on the first controller
/var/log/
Core was generated by `/usr/bin/
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f67dabb6cc9 in raise () from /lib/x86_
(gdb) bt
#0 0x00007f67dabb6cc9 in raise () from /lib/x86_
#1 0x00007f67dabba0d8 in abort () from /lib/x86_
#2 0x00007f67dabafb86 in ?? () from /lib/x86_
#3 0x00007f67dabafc32 in __assert_fail () from /lib/x86_
#4 0x000000000045c3d7 in TaskImpl::execute (this=0x7f67d3b
#5 0x00007f67dc13eb3a in ?? () from /usr/lib/
#6 0x00007f67dc13a816 in ?? () from /usr/lib/
#7 0x00007f67dc139f4b in ?? () from /usr/lib/
#8 0x00007f67dc1360ff in ?? () from /usr/lib/
#9 0x00007f67dc1362f9 in ?? () from /usr/lib/
#10 0x00007f67dc35a182 in start_thread () from /lib/x86_
#11 0x00007f67dac7a47d in clone () from /lib/x86_
(gdb) f 4
#4 0x000000000045c3d7 in TaskImpl::execute (this=0x7f67d3b
291 controller/
(gdb) info locals
what = "call to empty boost::function"
e = @0x7f67a0007cb0: <incomplete type>
running = @0x7f67d3b03c80: 0x2a857b0
__PRETTY_FUNCTION__ = "virtual tbb::task* TaskImpl:
(gdb) p *this
$1 = {<tbb::task> = {<No data fields>}, parent_ = 0x2a857b0}
(gdb) set print object on
(gdb) p parent_
$2 = (QueueTaskRunne
(gdb) p *parent_
$3 = (QueueTaskRunne
task_recycle_ = false, task_cancel_ = false, enqueue_time_ = 0, schedule_time_ = 0, execute_delay_ = 0, schedule_delay_ = 0,
waitq_hook_ = {<boost:
queue_ = 0x7f677c00a9d0}
(gdb) p TaskScheduler:
There is no member or method named px_.
(gdb) p TaskScheduler:
$4 = (TaskScheduler *) 0x2a044a0
(gdb) p *TaskScheduler:
$5 = {static kVectorGrowSize = 16, static singleton_ = {px = 0x2a044a0}, stop_entry_ = 0x2a46530, task_scheduler_ = {<tbb::
static is_recursive_mutex = false, static is_fair_mutex = false, impl = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0}}, running_ = true, seqno_ = 519, task_group_db_ = std::vector of length 33, capacity 33 = {
0x0, 0x2ad3450, 0x0, 0x0, 0x2ad17f0, 0x2a58470, 0x2a58660, 0x2a49f40, 0x2a592e0, 0x0, 0x0, 0x0, 0x2a97a10, 0x2a939f0, 0x2a97230, 0x2ad53d0, 0x0, 0x2ad4a00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, id_map_mutex_ = {<tbb::
reader_head = {<tbb::
writer_head = {<tbb::
writer_tail = {<tbb::
rdr_
["analytics
["sandesh:
func_ptr = 0x29ff688, bound_memfunc_ptr = {memfunc_ptr = (void (boost:
static arity = <optimized out>}, <No data fields>}, hw_thread_count_ = 16, track_run_time_ = false, measure_delay_ = false, schedule_delay_ = 0, execute_delay_ = 0, enqueue_count_ = 519, done_count_ = 506, cancel_count_ = 12, static ThreadAmpFactor_ = 1}
(gdb) p *(TaskImpl *)0x7f67d3bc3c40
$6 = {void (TaskImpl * const, Task *)} 0x7f67d3bc3c40
(gdb) p parent_->task_impl_
$7 = (tbb::task *) 0x7f67d3bc3c40
(gdb) p *parent_
$8 = <incomplete type>
(gdb)
It looks like the HTTP client task's WorkQueue DequeueEvent function is causing this. Ted, some of the vncapi.cc RemoveConnection calls will delete the connection right after the call? That does not seem correct at first glance.
Can you please confirm?
Thanks
Megh
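The `what = "call to empty boost::function"` local in frame 4 suggests that a task invoked a function object with no target, and that the resulting exception led to the assert. The sketch below reproduces that failure mode in isolation, as an illustration only: `RunTask` and `RunTaskChecked` are hypothetical and are not the real `TaskImpl::execute`, and `std::function` with `std::bad_function_call` stands in for the `boost::function` equivalent.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical task wrapper: runs the callback and reports a failure
// description instead of aborting. Returns an empty string on success.
std::string RunTask(const std::function<void()>& callback) {
    try {
        callback();   // throws std::bad_function_call if there is no target
        return "";
    } catch (const std::bad_function_call&) {
        return "call to empty function";
    }
}

// Variant that checks for a target up front and skips empty callbacks.
std::string RunTaskChecked(const std::function<void()>& callback) {
    if (!callback) return "skipped: no callback set";
    return RunTask(callback);
}
```

Passing a default-constructed `std::function<void()>` to `RunTask` yields the failure string, while `RunTaskChecked` skips the call entirely. Checking for emptiness only hides the symptom if the callback was cleared because its owner had already been deleted.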
On Aug 15, 2016, at 4:18 PM, Wenqing Liang <email address hidden> wrote:
Hi Megh,
I made the same changes to the other two controllers, 10.87.141.118 and 10.87.141.126 (root/n1keenA). However, the restart of contrail-collector exited with ERROR.
root@a5d01e04:~# scp root@10.
vizd.debug 100% 83MB 83.4MB/s 00:01
root@a5d01e04:~# mv /usr/bin/
root@a5d01e04:~# ln -s /var/tmp/vizd.debug /usr/bin/
root@a5d01e04:~# service contrail-collector restart
contrail-collector: stopped
contrail-collector: ERROR (abnormal termination)
root@a5d01e04:~# ssh -l root 10.87.141.126
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-85-generic x86_64)
* Documentation: https:/
New release '16.04.1 LTS' available.
Run 'do-release-
Last login: Mon Aug 15 16:06:40 2016 from 10.87.141.118
root@a5d01e05:~# scp root@10.
vizd.debug 100% 83MB 83.4MB/s 00:00
root@a5d01e05:~# mv /usr/bin/
root@a5d01e05:~# ln -s /var/tmp/vizd.debug /usr/bin/
root@a5d01e05:~# service contrail-collector restart
contrail-collector: stopped
contrail-collector: ERROR (abnormal termination)
root@a5d01e05:~#
No, I don’t think I have any user defined counters configured.
Thanks,
Wenqing
From: Megh Bhatt
Sent: Monday, August 15, 2016 3:54 PM
To: Raj Reddy <email address hidden>; Wenqing Liang <email address hidden>
Cc: Ignatious Johnson <email address hidden>; Abhay Joshi <email address hidden>; dl-contrail-
Subject: Re: service issue
Hi Wenqing,
The production contrail-collector binary has optimized out a bunch of variables in the core, and hence we cannot determine which task caused the core. The reason for the core is a call to an empty boost function. To debug further, can you please do the following:
1. Change log_level to SYS_DEBUG in all the controllers' /etc/contrail/
2. Copy over the debug vizd binary from /github-
scp /github-
mv /usr/bin/
ln -s /var/tmp/vizd.debug /usr/bin/
service contrail-collector restart
I have already done the above on 10.87.141.120.
I will also look at the current check-ins to find out. Also, can you please confirm whether you have any user-defined counters configured? That is a new contrail-collector feature that was added in 3.1.
Thanks
Megh
On Aug 15, 2016, at 8:03 AM, Raj Reddy <email address hidden> wrote:
+Megh,
Megh, can you take a look and see why the collector is crashing?
thanks,
On Aug 14, 2016, at 9:57 PM, Ignatious Johnson <email address hidden> wrote:
Hi Raj,
I see contrail-collector core continuously. Can you get the team to take a look?
Hi Wenqing,
Device-
Thanks,
Ignatious
From: Wenqing Liang <email address hidden>
Date: Sunday, August 14, 2016 at 9:03 AM
To: Abhay Joshi <email address hidden>, Ignatious Johnson <email address hidden>
Cc: Raj Reddy <email address hidden>, dl-contrail-
Subject: RE: service issue
Any news? Can I take back the setup?
Thanks,
Wenqing
From: Abhay Joshi
Sent: Friday, August 12, 2016 7:26 PM
To: Ignatious Johnson <email address hidden>
Cc: Wenqing Liang <email address hidden>; Raj Reddy <email address hidden>; dl-contrail-
Subject: Re: service issue
You looking into this right, Ignatious?
On Aug 12, 2016, at 6:18 PM, Ignatious Johnson <email address hidden> wrote:
+ Abhay
On Aug 12, 2016 5:44 PM, Wenqing Liang <email address hidden> wrote:
Hi,
On my r3.1.0.0-23 Mitaka SM-provisioned HA cluster, I see the following. Could you please have a look? 10.87.141.120 is the cfg node and the SM is 10.87.141.124. root/n1keenA.
root@a5d01e03:~# contrail-status | egrep -v "active|="
contrail-
contrail-
contrail-schema backup
contrail-
/var/crashes/
/var/crashes/
/var/crashes/
/var/crashes/
root@a5d01e03:~#
Thanks,
Wenqing
information type: | Proprietary → Public |
description: | updated |
tags: | added: analytics |
Review in progress for https://review.opencontrail.org/23379
Submitter: ted ghose (<email address hidden>)