Comment 0 for bug 1770123

Revision history for this message
Parth Sarupria (psarupria) wrote :

I have come across memory leak in contrail-collector referring to kaka_processor.cc (https://github.com/Juniper/contrail-analytics/blob/master/contrail-collector/kafka_processor.cc#L186).
This is observed only after the restart of kafka in anayticsdb. I have attached report generated by valgrind memcheck tool stating the below two leaks. And at the end of this, are the steps which I used to reproduce the issue in two setups running CAN version 4.1.1.0-10.

Can you please look into this?

From valgrind_memcheck_contrailcollector_orig_without_structured_syslog_kafka.log, (attachment)

First leak ,

==10081==
==10081== 142 (72 direct, 70 indirect) bytes in 1 blocks are definitely lost in loss record 8,005 of 9,287
==10081== at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10081== by 0x549589: RdKafka::HandleImpl::metadata(bool, RdKafka::Topic const*, RdKafka::Metadata**, int) (in /usr/bin/contrail-collector)
==10081== by 0x7EFD34: KafkaProcessor::KafkaTimer() (kafka_processor.cc:406)
==10081== by 0x8003C0: boost::_mfi::mf0<bool, KafkaProcessor>::operator()(KafkaProcessor*) const (mem_fn_template.hpp:49)
==10081== by 0x7FFD7E: bool boost::_bi::list1<boost::_bi::value<KafkaProcessor*> >::operator()<bool, boost::_mfi::mf0<bool, KafkaProcessor>, boost::_bi::list0>(boost::_bi::type<bool>, boost::_mfi::mf0<bool, KafkaProcessor>&, boost::_bi::list0&, long) (bind.hpp:243)
==10081== by 0x7FF46A: boost::_bi::bind_t<bool, boost::_mfi::mf0<bool, KafkaProcessor>, boost::_bi::list1<boost::_bi::value<KafkaProcessor*> > >::operator()() (bind_template.hpp:20)
==10081== by 0x7FE59A: boost::detail::function::function_obj_invoker0<boost::_bi::bind_t<bool, boost::_mfi::mf0<bool, KafkaProcessor>, boost::_bi::list1<boost::_bi::value<KafkaProcessor*> > >, bool>::invoke(boost::detail::function::function_buffer&) (function_template.hpp:132)
==10081== by 0x450387: boost::function0<bool>::operator()() const (function_template.hpp:767)
==10081== by 0x491C84: Timer::TimerTask::Run() (timer.cc:44)
==10081== by 0x46841B: TaskImpl::execute() (task.cc:277)
==10081== by 0x6F95B39: ??? (in /usr/lib/libtbb.so.2)
==10081== by 0x6F91815: ??? (in /usr/lib/libtbb.so.2)

Second leak,
==10081==
==10081== 5,184 bytes in 39 blocks are definitely lost in loss record 9,076 of 9,287
==10081== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10081== by 0x4C2CF1F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10081== by 0x53FF61: rd_list_grow (in /usr/bin/contrail-collector)
==10081== by 0x53FFE6: rd_list_init (in /usr/bin/contrail-collector)
==10081== by 0x545189: rd_kafka_metadata_leader_query_tmr_cb (in /usr/bin/contrail-collector)
==10081== by 0x503467: rd_kafka_timers_run (in /usr/bin/contrail-collector)
==10081== by 0x4E45A6: rd_kafka_thread_main (in /usr/bin/contrail-collector)
==10081== by 0x5405E6: _thrd_wrapper_function (in /usr/bin/contrail-collector)
==10081== by 0x6D5F183: start_thread (pthread_create.c:312)
==10081== by 0x848003C: clone (clone.S:111)
==10081==

I could remove the first leak by deleting metadata pointer object at the end. I couldn’t understand much for the second leak. (though it involves rdkafka timer too)

diff --git a/src/analytics/kafka_processor.cc b/src/analytics/kafka_processor.cc
index 6b23d28..e760344 100644
--- a/src/analytics/kafka_processor.cc
+++ b/src/analytics/kafka_processor.cc
@@ -406,10 +406,14 @@ KafkaProcessor::KafkaTimer() {
                                   &metadata, 5000);
             if (err != RdKafka::ERR_NO_ERROR) {
                 LOG(ERROR, "Failed to acquire metadata: " << RdKafka::err2str(err));
             } else {
                 LOG(ERROR, "Kafka Metadata Detected");
                 LOG(ERROR, "Metadata for " << metadata->orig_broker_id() <<
                     ":" << metadata->orig_broker_name());

                 if (collector_ && redis_up_) {
                     LOG(ERROR, "Kafka Restarting Redis");
                     KafkaProcessor::KafkaTimer() {
                      k_event_cb.disableKafka = false;
                 }
             }
+ LOG(DEBUG, "Deleting metadata !!!");
+ delete metadata;
         }
     }

Steps used to reproduce the issue,

1. root@csp-ucpe-bglr153(analytics):/usr/bin# service contrail-collector stop
2. root@csp-ucpe-bglr153(analytics):/usr/bin# valgrind --leak-check=yes --leak-check=full --track-origins=yes --show-leak-kinds=all ./contrail-collector &> valgrind_memcheck_contrailcollector_orig_without_structured_syslog_kafka.log
3. check for contrail-status
                                root@csp-ucpe-bglr152(analytics):/# contrail-status
== Contrail Analytics ==
                                contrail-alarm-gen active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector inactive // this remains inactive as I have started contrail-collector using valgrind and not as a service.
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active
4. root@csp-ucpe-bglr153(analyticsdb):/# service kafka stop
kafka: stopped
root@csp-ucpe-bglr153(analyticsdb):/# service kafka start
kafka: started

wait for some time and stop the valgrind process, the report will be generated