I have come across a memory leak in contrail-collector, in kafka_processor.cc (https://github.com/Juniper/contrail-analytics/blob/master/contrail-collector/kafka_processor.cc#L186).
It is observed only after a restart of kafka in analyticsdb. I have attached the report generated by the valgrind memcheck tool showing the two leaks below. At the end of this report are the steps I used to reproduce the issue on two setups running CAN version 4.1.1.0-10.
Can you please look into this?
From valgrind_memcheck_contrailcollector_orig_without_structured_syslog_kafka.log (attachment):

First leak:
==10081==
==10081== 142 (72 direct, 70 indirect) bytes in 1 blocks are definitely lost in loss record 8,005 of 9,287
==10081== at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10081== by 0x549589: RdKafka::HandleImpl::metadata(bool, RdKafka::Topic const*, RdKafka::Metadata**, int) (in /usr/bin/contrail-collector)
==10081== by 0x7EFD34: KafkaProcessor::KafkaTimer() (kafka_processor.cc:406)
==10081== by 0x8003C0: boost::_mfi::mf0<bool, KafkaProcessor>::operator()(KafkaProcessor*) const (mem_fn_template.hpp:49)
==10081== by 0x7FFD7E: bool boost::_bi::list1<boost::_bi::value<KafkaProcessor*> >::operator()<bool, boost::_mfi::mf0<bool, KafkaProcessor>, boost::_bi::list0>(boost::_bi::type<bool>, boost::_mfi::mf0<bool, KafkaProcessor>&, boost::_bi::list0&, long) (bind.hpp:243)
==10081== by 0x7FF46A: boost::_bi::bind_t<bool, boost::_mfi::mf0<bool, KafkaProcessor>, boost::_bi::list1<boost::_bi::value<KafkaProcessor*> > >::operator()() (bind_template.hpp:20)
==10081== by 0x7FE59A: boost::detail::function::function_obj_invoker0<boost::_bi::bind_t<bool, boost::_mfi::mf0<bool, KafkaProcessor>, boost::_bi::list1<boost::_bi::value<KafkaProcessor*> > >, bool>::invoke(boost::detail::function::function_buffer&) (function_template.hpp:132)
==10081== by 0x450387: boost::function0<bool>::operator()() const (function_template.hpp:767)
==10081== by 0x491C84: Timer::TimerTask::Run() (timer.cc:44)
==10081== by 0x46841B: TaskImpl::execute() (task.cc:277)
==10081== by 0x6F95B39: ??? (in /usr/lib/libtbb.so.2)
==10081== by 0x6F91815: ??? (in /usr/lib/libtbb.so.2)
Second leak:
==10081==
==10081== 5,184 bytes in 39 blocks are definitely lost in loss record 9,076 of 9,287
==10081== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10081== by 0x4C2CF1F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10081== by 0x53FF61: rd_list_grow (in /usr/bin/contrail-collector)
==10081== by 0x53FFE6: rd_list_init (in /usr/bin/contrail-collector)
==10081== by 0x545189: rd_kafka_metadata_leader_query_tmr_cb (in /usr/bin/contrail-collector)
==10081== by 0x503467: rd_kafka_timers_run (in /usr/bin/contrail-collector)
==10081== by 0x4E45A6: rd_kafka_thread_main (in /usr/bin/contrail-collector)
==10081== by 0x5405E6: _thrd_wrapper_function (in /usr/bin/contrail-collector)
==10081== by 0x6D5F183: start_thread (pthread_create.c:312)
==10081== by 0x848003C: clone (clone.S:111)
==10081==
I could remove the first leak by deleting the metadata pointer object at the end of KafkaTimer(); the patch is included at the end of this report. I couldn't make much sense of the second leak, though it also involves an rdkafka timer.
Steps used to reproduce the issue:
1. root@csp-ucpe-bglr153(analytics):/usr/bin# service contrail-collector stop
2. root@csp-ucpe-bglr153(analytics):/usr/bin# valgrind --leak-check=full --track-origins=yes --show-leak-kinds=all ./contrail-collector &> valgrind_memcheck_contrailcollector_orig_without_structured_syslog_kafka.log
3. Check contrail-status:
root@csp-ucpe-bglr152(analytics):/# contrail-status
== Contrail Analytics ==
contrail-alarm-gen active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector inactive // this remains inactive as I have started contrail-collector using valgrind and not as a service.
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active
4. root@csp-ucpe-bglr153(analyticsdb):/# service kafka stop
kafka: stopped
root@csp-ucpe-bglr153(analyticsdb):/# service kafka start
kafka: started
5. Wait for some time, then stop the valgrind process; the report will be generated.
Patch for the first leak:

diff --git a/src/analytics/kafka_processor.cc b/src/analytics/kafka_processor.cc
index 6b23d28..e760344 100644
--- a/src/analytics/kafka_processor.cc
+++ b/src/analytics/kafka_processor.cc
@@ -406,10 +406,14 @@ KafkaProcessor::KafkaTimer() {
                                   &metadata, 5000);
         if (err != RdKafka::ERR_NO_ERROR) {
             LOG(ERROR, "Failed to acquire metadata: " << RdKafka::err2str(err));
         } else {
             LOG(ERROR, "Kafka Metadata Detected");
             LOG(ERROR, "Metadata for " << metadata->orig_broker_id() <<
                 ":" << metadata->orig_broker_name());
             if (collector_ && redis_up_) {
                 LOG(ERROR, "Kafka Restarting Redis");
                 k_event_cb.disableKafka = false;
             }
+            LOG(DEBUG, "Deleting metadata !!!");
+            delete metadata;
         }
     }