Memory leak in contrail-collector after kafka restart

Bug #1770123 reported by Parth Sarupria on 2018-05-09
This bug affects 1 person
Affects             Status          Importance   Assigned to    Milestone
Juniper Openstack   (status tracked in Trunk)
  R4.0              Fix Committed   Medium       Zhiqiang Cui
  R4.1              Fix Committed   Medium       Zhiqiang Cui
  R5.0              Fix Committed   Medium       Zhiqiang Cui
  Trunk             Fix Committed   Medium       Zhiqiang Cui

Bug Description

I have come across a memory leak in contrail-collector, in kafka_processor.cc (https://github.com/Juniper/contrail-analytics/blob/master/contrail-collector/kafka_processor.cc#L186).
It is observed only after kafka is restarted in analyticsdb. I have attached the report generated by the valgrind memcheck tool, which shows the two leaks below. The steps I used to reproduce the issue, on two setups running CAN version 4.1.1.0-10, are at the end of this description.

From valgrind_memcheck_contrailcollector_orig_without_structured_syslog_kafka.log (attached):

First leak:

==10081==
==10081== 142 (72 direct, 70 indirect) bytes in 1 blocks are definitely lost in loss record 8,005 of 9,287
==10081== at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10081== by 0x549589: RdKafka::HandleImpl::metadata(bool, RdKafka::Topic const*, RdKafka::Metadata**, int) (in /usr/bin/contrail-collector)
==10081== by 0x7EFD34: KafkaProcessor::KafkaTimer() (kafka_processor.cc:406)
==10081== by 0x8003C0: boost::_mfi::mf0<bool, KafkaProcessor>::operator()(KafkaProcessor*) const (mem_fn_template.hpp:49)
==10081== by 0x7FFD7E: bool boost::_bi::list1<boost::_bi::value<KafkaProcessor*> >::operator()<bool, boost::_mfi::mf0<bool, KafkaProcessor>, boost::_bi::list0>(boost::_bi::type<bool>, boost::_mfi::mf0<bool, KafkaProcessor>&, boost::_bi::list0&, long) (bind.hpp:243)
==10081== by 0x7FF46A: boost::_bi::bind_t<bool, boost::_mfi::mf0<bool, KafkaProcessor>, boost::_bi::list1<boost::_bi::value<KafkaProcessor*> > >::operator()() (bind_template.hpp:20)
==10081== by 0x7FE59A: boost::detail::function::function_obj_invoker0<boost::_bi::bind_t<bool, boost::_mfi::mf0<bool, KafkaProcessor>, boost::_bi::list1<boost::_bi::value<KafkaProcessor*> > >, bool>::invoke(boost::detail::function::function_buffer&) (function_template.hpp:132)
==10081== by 0x450387: boost::function0<bool>::operator()() const (function_template.hpp:767)
==10081== by 0x491C84: Timer::TimerTask::Run() (timer.cc:44)
==10081== by 0x46841B: TaskImpl::execute() (task.cc:277)
==10081== by 0x6F95B39: ??? (in /usr/lib/libtbb.so.2)
==10081== by 0x6F91815: ??? (in /usr/lib/libtbb.so.2)

Second leak:
==10081==
==10081== 5,184 bytes in 39 blocks are definitely lost in loss record 9,076 of 9,287
==10081== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10081== by 0x4C2CF1F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10081== by 0x53FF61: rd_list_grow (in /usr/bin/contrail-collector)
==10081== by 0x53FFE6: rd_list_init (in /usr/bin/contrail-collector)
==10081== by 0x545189: rd_kafka_metadata_leader_query_tmr_cb (in /usr/bin/contrail-collector)
==10081== by 0x503467: rd_kafka_timers_run (in /usr/bin/contrail-collector)
==10081== by 0x4E45A6: rd_kafka_thread_main (in /usr/bin/contrail-collector)
==10081== by 0x5405E6: _thrd_wrapper_function (in /usr/bin/contrail-collector)
==10081== by 0x6D5F183: start_thread (pthread_create.c:312)
==10081== by 0x848003C: clone (clone.S:111)
==10081==

I could remove the first leak by deleting the metadata object once it is no longer needed, as in the diff below. I could not work out the cause of the second leak, though it also involves an rdkafka timer.

diff --git a/src/analytics/kafka_processor.cc b/src/analytics/kafka_processor.cc
index 6b23d28..e760344 100644
--- a/src/analytics/kafka_processor.cc
+++ b/src/analytics/kafka_processor.cc
@@ -406,10 +406,14 @@ KafkaProcessor::KafkaTimer() {
                                   &metadata, 5000);
             if (err != RdKafka::ERR_NO_ERROR) {
                 LOG(ERROR, "Failed to acquire metadata: " << RdKafka::err2str(err));
             } else {
                 LOG(ERROR, "Kafka Metadata Detected");
                 LOG(ERROR, "Metadata for " << metadata->orig_broker_id() <<
                     ":" << metadata->orig_broker_name());

                 if (collector_ && redis_up_) {
                     LOG(ERROR, "Kafka Restarting Redis");
                     KafkaProcessor::KafkaTimer() {
                      k_event_cb.disableKafka = false;
                 }
             }
+            LOG(DEBUG, "Deleting metadata !!!");
+            delete metadata;
         }
     }
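
The ownership rule behind the first leak can be sketched without librdkafka. The valgrind trace shows the object being created with operator new inside RdKafka::HandleImpl::metadata and handed back through an out-parameter, so the caller has to delete it on every path. The sketch below is only an illustration under stated assumptions: FakeMetadata, FakeErrorCode, FetchMetadata and PollBroker are hypothetical stand-ins, not librdkafka or contrail names, and the stand-in leaves the out-parameter untouched on failure. It wraps the raw pointer in std::unique_ptr (C++11) so that early returns cannot skip the delete.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for RdKafka::Metadata; counts live objects so a
// leak would be visible as a non-zero live_count.
class FakeMetadata {
public:
    FakeMetadata() { ++live_count; }
    ~FakeMetadata() { --live_count; }
    int orig_broker_id() const { return 1; }
    std::string orig_broker_name() const { return "broker-1"; }
    static int live_count;
};
int FakeMetadata::live_count = 0;

// Hypothetical error codes, standing in for RdKafka::ErrorCode.
enum FakeErrorCode { kNoError, kTimedOut };

// Hypothetical stand-in for the metadata call: on success it allocates the
// object with new and returns it through the out-parameter, so the caller
// owns it. On failure it leaves *out untouched (an assumption of this sketch).
FakeErrorCode FetchMetadata(bool broker_up, FakeMetadata** out) {
    if (!broker_up) {
        return kTimedOut;
    }
    *out = new FakeMetadata();
    return kNoError;
}

// Polls the broker once. Returns true if metadata was obtained.
bool PollBroker(bool broker_up) {
    FakeMetadata* raw = nullptr;  // initialised so the error path holds nullptr
    const FakeErrorCode err = FetchMetadata(broker_up, &raw);

    // Take ownership immediately; the destructor deletes the object on
    // every return path, including the early return below.
    std::unique_ptr<FakeMetadata> metadata(raw);

    if (err != kNoError) {
        return false;
    }
    return metadata->orig_broker_id() >= 0 &&
           !metadata->orig_broker_name().empty();
}

// Small usage check: no FakeMetadata objects remain after either path.
void PollBrokerExample() {
    assert(PollBroker(true));
    assert(FakeMetadata::live_count == 0);
    assert(!PollBroker(false));
    assert(FakeMetadata::live_count == 0);
}
```

A plain delete, as in the diff above, achieves the same result as long as it is reached on every path; the smart pointer just removes the need to check that by hand.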

Steps used to reproduce the issue:

1. root@csp-ucpe-bglr153(analytics):/usr/bin# service contrail-collector stop
2. root@csp-ucpe-bglr153(analytics):/usr/bin# valgrind --leak-check=yes --leak-check=full --track-origins=yes --show-leak-kinds=all ./contrail-collector &> valgrind_memcheck_contrailcollector_orig_without_structured_syslog_kafka.log
3. Check contrail-status:
root@csp-ucpe-bglr152(analytics):/# contrail-status
== Contrail Analytics ==
contrail-alarm-gen active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector inactive // this remains inactive because contrail-collector was started under valgrind and not as a service.
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active
4. root@csp-ucpe-bglr153(analyticsdb):/# service kafka stop
kafka: stopped
root@csp-ucpe-bglr153(analyticsdb):/# service kafka start
kafka: started

5. Wait for some time, then stop the valgrind process; the report will be generated.

Parth Sarupria (psarupria) wrote :
information type: Proprietary → Public
description: updated
Zhiqiang Cui (zcui) wrote :

The problem is in librdkafka; librdkafka needs to be upgraded.

Zhiqiang Cui (zcui) wrote :

https://github.com/edenhill/librdkafka/commit/15ecde1b263b32a3b61c688298621a604b0636fc#diff-f6ab82d6884be8f30145a102f5e6f535

This is the source of the fix. It was merged into librdkafka on March 9th, 2017.

I will select a suitable librdkafka and Kafka version to upgrade to.

At the same time, I will commit code to fix leak 1.
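
When choosing the librdkafka version to upgrade to, a build can check the library version against a minimum. librdkafka encodes its version as a hex integer of the form MM.mm.rr.xx (major, minor, revision, pre-release id, with 0xff marking a final release); this is the value of the RD_KAFKA_VERSION macro and of RdKafka::version(). The sketch below only decodes and compares such an integer. The helper names are hypothetical, the sample values are arbitrary, and the minimum passed to IsAtLeast is a placeholder: this thread does not say which release first contains the fix.

```cpp
#include <cassert>
#include <cstdint>

// Decoded form of a librdkafka version integer (hex MM.mm.rr.xx).
struct RdKafkaVersion {
    unsigned major_ver;
    unsigned minor_ver;
    unsigned revision;
    unsigned prerelease;  // 0xff marks a final release
};

// Splits the packed version integer into its four bytes.
RdKafkaVersion DecodeRdKafkaVersion(std::uint32_t v) {
    RdKafkaVersion out;
    out.major_ver = (v >> 24) & 0xff;
    out.minor_ver = (v >> 16) & 0xff;
    out.revision = (v >> 8) & 0xff;
    out.prerelease = v & 0xff;
    return out;
}

// Returns true if v is at least major.minor.revision, ignoring the
// pre-release byte. The minimum is supplied by the caller.
bool IsAtLeast(std::uint32_t v, unsigned major_ver, unsigned minor_ver,
               unsigned revision) {
    const std::uint32_t wanted =
        (major_ver << 16) | (minor_ver << 8) | revision;
    return (v >> 8) >= wanted;
}

// Usage with arbitrary sample values; real code would pass the value
// returned by RdKafka::version() instead of a literal.
void VersionExample() {
    const std::uint32_t sample = 0x000b05ff;  // decodes as 0.11.5, final
    const RdKafkaVersion d = DecodeRdKafkaVersion(sample);
    assert(d.major_ver == 0 && d.minor_ver == 11 && d.revision == 5);
    assert(d.prerelease == 0xff);
    assert(IsAtLeast(sample, 0, 9, 5));       // placeholder minimum
    assert(!IsAtLeast(0x000903ff, 0, 9, 5));  // 0.9.3 is below it
}
```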

Review in progress for https://review.opencontrail.org/43045
Submitter: Zhiqiang Cui (<email address hidden>)

Review in progress for https://review.opencontrail.org/43217
Submitter: Zhiqiang Cui (<email address hidden>)

Reviewed: https://review.opencontrail.org/43045
Committed: http://github.com/Juniper/contrail-analytics/commit/7d2dfc4d2fbce7b5371694f16f2dc3b5e72d3611
Submitter: Zuul v3 CI (<email address hidden>)
Branch: master

commit 7d2dfc4d2fbce7b5371694f16f2dc3b5e72d3611
Author: zcui <email address hidden>
Date: Tue May 15 10:10:46 2018 -0700

Fix memory leak in contrail-collector when kafka down

When kafka is down, contrail-collector can receive an event. To detect
whether kafka has started again, contrail-collector needs to call the
librdkafka API to fetch metadata. The metadata is returned through a
pointer; librdkafka allocates the memory for it, and the caller must free it.

Change-Id: I84698de36f57ece17ea3776b2ef99b5154ddd2ff
Closes-bug: 1770123

Review in progress for https://review.opencontrail.org/43368
Submitter: Zhiqiang Cui (<email address hidden>)

Review in progress for https://review.opencontrail.org/43369
Submitter: Zhiqiang Cui (<email address hidden>)

Reviewed: https://review.opencontrail.org/43217
Committed: http://github.com/Juniper/contrail-analytics/commit/1b51d98048c59a077ead2f73ebc0d966f782a67b
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R5.0

commit 1b51d98048c59a077ead2f73ebc0d966f782a67b
Author: zcui <email address hidden>
Date: Tue May 15 10:10:46 2018 -0700

Fix memory leak in contrail-collector when kafka down

When kafka is down, contrail-collector can receive an event. To detect
whether kafka has started again, contrail-collector needs to call the
librdkafka API to fetch metadata. The metadata is returned through a
pointer; librdkafka allocates the memory for it, and the caller must free it.

Change-Id: I84698de36f57ece17ea3776b2ef99b5154ddd2ff
Closes-bug: 1770123

Reviewed: https://review.opencontrail.org/43369
Committed: http://github.com/Juniper/contrail-controller/commit/fb8c9c7df3583ef118942c5043f15d64fdec77e3
Submitter: Zuul (<email address hidden>)
Branch: R4.1

commit fb8c9c7df3583ef118942c5043f15d64fdec77e3
Author: zcui <email address hidden>
Date: Tue May 29 16:22:51 2018 -0700

Fix memory leak in contrail-collector when kafka down

When kafka is down, contrail-collector can receive an event. To detect
whether kafka has started again, contrail-collector needs to call the
librdkafka API to fetch metadata. The metadata is returned through a
pointer; librdkafka allocates the memory for it, and the caller must free it.

Change-Id: I0f43c9aec204f983333e10fb4c0122f37becb4af
Closes-bug: 1770123

Reviewed: https://review.opencontrail.org/43368
Committed: http://github.com/Juniper/contrail-controller/commit/a90c233501709dd0a3a31117b05c1cdd88771933
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit a90c233501709dd0a3a31117b05c1cdd88771933
Author: zcui <email address hidden>
Date: Tue May 29 16:22:51 2018 -0700

Fix memory leak in contrail-collector when kafka down

When kafka is down, contrail-collector can receive an event. To detect
whether kafka has started again, contrail-collector needs to call the
librdkafka API to fetch metadata. The metadata is returned through a
pointer; librdkafka allocates the memory for it, and the caller must free it.

Change-Id: I0f43c9aec204f983333e10fb4c0122f37becb4af
Closes-bug: 1770123
