contrail-logs command sometimes does not complete in time and process hogs memory

Bug #1673865 reported by Vedamurthy Joshi
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
R3.2
Fix Committed
High
mkheni
Trunk
Fix Committed
High
mkheni

Bug Description

R3.2 Build 30

Below mail thread captures the issue details :

From: Megh Bhatt <email address hidden>
Date: Thursday, March 16, 2017 at 11:18 PM
To: Vedamurthy Ananth Joshi <email address hidden>, Sundaresan Rajangam <email address hidden>
Cc: Raj Reddy <email address hidden>, Nikhil Bansal <email address hidden>, Sudheendra Rao <email address hidden>
Subject: Re: contrail-logs hogging memory and stuck

Added Sundar,

It is not really stuck

write(1, "es = 1, dequeues = 1, enqueue_fa"..., 4096) = 4096
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=265, ...}) = 0
write(1, "tes_sent = 0, messages_received "..., 4096) = 4096
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=265, ...}) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f085361f000
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=265, ...}) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f08535df000
munmap(0x7f08535df000, 262144) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f08535df000
munmap(0x7f08535df000, 262144) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f08535df000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f085359f000
write(1, "_dropped = 0, messages_received_"..., 4096) = 4096
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=265, ...}) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f085355f000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f085351f000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f08534df000
munmap(0x7f08534df000, 262144) = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f08534df000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f085349f000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f085345f000
^CProcess 4094 detached
root@nodec7:~#

Looks like contrail-logs has taken around 13GB and is in D state for extended times which suggests uninterruptible sleep mostly due to waiting for disk/IO when writing

4094 root 20 0 25.189g 0.013t 2804 R 0.0 43.1 20:00.13 contrail-logs

4094 root 20 0 25.189g 0.013t 2804 D 0.0 43.1 20:00.33 contrail-logs

root@nodec7:~# ps aux | grep contrail-logs
root 997 0.0 0.0 10460 944 pts/12 R+ 23:11 0:00 grep --color=auto contrail-logs
root 4091 0.0 0.0 63664 1044 pts/0 Ss+ Mar15 0:00 sudo -S -p sudo password: /bin/bash -l -c /usr/bin/contrail-logs --last 266m --all --admin-user admin --admin-password contrail123 >> /var/log/cassandra_log_nodec7_2017_03_15_20_41_37.log
root 4092 0.0 0.0 11132 680 pts/0 S+ Mar15 0:00 /bin/bash -l -c /usr/bin/contrail-logs --last 266m --all --admin-user admin --admin-password contrail123 >> /var/log/cassandra_log_nodec7_2017_03_15_20_41_37.log
root 4094 1.2 44.1 26679096 14543776 pts/0 Dl+ Mar15 20:10 /usr/bin/python /usr/bin/contrail-logs --last 266m --all --admin-user admin --admin-password contrail123

 We had changed contrail-logs to do all read and then print all at once which might cause the D state, I think we will need to revert that change to avoid such issues and/or add some sort of sleep.

 Also we have an option to specify output file in contrail-logs, that might help too rather than redirecting to a file.

Thanks

Megh

On Mar 16, 2017, at 3:37 AM, Vedamurthy Ananth Joshi <email address hidden> wrote:

Megh,
It happened again on R3.2 now on the same setup and Nikhil couldn’t find out much from strace.

Can you please take a look as soon as possible ? Node is the same : nodec7.englab.juniper.net
Sudhee has temporarily blocked the setup for you

Vedu

From: Megh Bhatt <email address hidden>
Date: Friday, February 10, 2017 at 9:47 AM
To: Vedamurthy Ananth Joshi <email address hidden>
Cc: Raj Reddy <email address hidden>, Nikhil Bansal <email address hidden>
Subject: Re: contrail-logs hogging memory and stuck

Can we do a strace on the process to check what it is doing?

Thanks

Megh

On Feb 9, 2017, at 7:34 PM, Vedamurthy Ananth Joshi <email address hidden> wrote:

Hmm…possibly it got killed somehow… what can I check next time it happens ?

Including Nikhil..

From: Megh Bhatt <email address hidden>
Date: Thursday, February 9, 2017 at 7:08 PM
To: Vedamurthy Ananth Joshi <email address hidden>
Cc: Raj Reddy <email address hidden>
Subject: Re: contrail-logs hogging memory and stuck

I logged on but could not see the process. Were you redirecting to a file?

Thanks

Megh

On Feb 9, 2017, at 6:09 PM, Vedamurthy Ananth Joshi <email address hidden> wrote:

Could you please take a look.. we need the setup as soon as possible

Vedu

Sent from my iPhone

Begin forwarded message:

From: Vedamurthy Ananth Joshi <email address hidden>
Date: 10 February 2017 at 1:37:09 AM IST
To: Contrail Systems Analytics Team <email address hidden>
Cc: cf-test <email address hidden>
Subject: contrail-logs hogging memory and stuck

On this node nodec7.englab.juniper.net(running 3.2.10-22), contrail-logs cmd is stuck for over a day . It was asked to show all logs for ~4 hrs after a sanity run.
It has used up ~14G RSS already. Can you please help figure out what is happening ?

%MEM PID %CPU RSS VSZ TIME COMMAND
45.1 1498 1.5 14870128 24964224 00:19:30 /usr/bin/python /usr/bin/contrail-logs --last 234m --all --admin-user admin --admin-password contrail123

Vedu

Tags: analytics
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/30114
Submitter: mkheni (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/30149
Submitter: mkheni (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/30176
Submitter: mkheni (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/30178
Submitter: mkheni (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/30149
Submitter: mkheni (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/30149
Committed: http://github.org/Juniper/contrail-controller/commit/9ec7e8d1d2f34149556b4bf8e874445aa2ebfd68
Submitter: Zuul (<email address hidden>)
Branch: master

commit 9ec7e8d1d2f34149556b4bf8e874445aa2ebfd68
Author: mkheni <email address hidden>
Date: Tue Apr 4 09:49:47 2017 -0700

contrail-logs command does not complete in time and process hogs memory.

If run in a scale setup and for a longer time period (i.e. 5-6h or more), contrail-logs would take a lot of memory and go to D state due to waiting
for IO. This fix divides the query in multiple queries of 10 min so that contrail-logs does not take a lot of memory and have a smaller amount of
data to write to the disk. Moreover, this fix would fetch the results for the next query while writing the results for the current query to the disk
in parallel because writing to the disk may take some time.

Change-Id: Ib68c7058e970e853288d8236ea4239bdb471b313
Closes-bug: #1673865

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/30178
Submitter: mkheni (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/30178
Committed: http://github.org/Juniper/contrail-controller/commit/7f5a1a1889a7272059a632256102bc978a3e96c7
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit 7f5a1a1889a7272059a632256102bc978a3e96c7
Author: mkheni <email address hidden>
Date: Tue Apr 4 09:49:47 2017 -0700

contrail-logs command does not complete in time and process hogs memory.

If run in a scale setup and for a longer time period (i.e. 5-6h or more), contrail-logs would take a lot of memory and go to D state due to waiting
for IO. This fix divides the query in multiple queries of 10 min so that contrail-logs does not take a lot of memory and have a smaller amount of
data to write to the disk. Moreover, this fix would fetch the results for the next query while writing the results for the current query to the disk
in parallel because writing to the disk may take some time.

Change-Id: Ib68c7058e970e853288d8236ea4239bdb471b313
Closes-bug: #1673865
(cherry picked from commit 9ec7e8d1d2f34149556b4bf8e874445aa2ebfd68)

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.