RT kernel memory leak when creating/deleting pods

Bug #1836638 reported by Bart Wensley
Affects      Status         Importance   Assigned to      Milestone
StarlingX    Fix Released   High         Jim Somerville

Bug Description

Brief Description
-----------------
There is an RT kernel memory leak when pods are being created and deleted. This has been seen on AIO-DX hosts, but probably occurs on other configurations as well.

Severity
--------
Major - the memory leak will eventually cause the host to run out of memory and reboot

Steps to Reproduce
------------------
Boot an AIO-DX system and install the stx-openstack application. Run the following script and watch the slab memory usage increase:
https://bugzilla.redhat.com/show_bug.cgi?id=1507149#c44

Expected Behavior
------------------
There should not be a kernel memory leak.

Actual Behavior
----------------
There is a kernel memory leak. This results in the slab memory increasing over time. For example, an AIO-DX controller saw the slab usage growing by about 70 MB per hour while the script was running.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Seen on a two-node system, but probably happens on all systems.

Branch/Pull Time/Commit
-----------------------
Designer load built on 2019-07-11, including the fix for https://bugs.launchpad.net/starlingx/+bug/1835534

Last Pass
---------
Unknown

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer Testing

Changed in starlingx:
importance: Undecided → High
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 gating based on input from distro.other TL (Brent Rowsell). The current leak is about 70 MB/hour, which is too high.

tags: added: stx.distro.other
tags: added: stx.2.0
Changed in starlingx:
status: New → Triaged
assignee: nobody → Cindy Xie (xxie1)
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Bin Yang (byangintel)
assignee: Bin Yang (byangintel) → Yi Wang (wangyi4)
Revision history for this message
Yi Wang (wangyi4) wrote :

@Bart, does this bug only occur on the RT kernel?

Revision history for this message
Bart Wensley (bartwensley) wrote :

Yes - the 70 MB/hour leak was only seen on the RT kernel. Performing the same test in the same lab with the standard kernel showed a much smaller slab increase (about 8 MB/hour).

Revision history for this message
Yi Wang (wangyi4) wrote :

@Bart, thanks.

One more question: you mentioned a script to reproduce this issue. Could you confirm whether the script below is the one you used?

#!/bin/bash
mkdir -p pages
for x in `seq 1280000`; do
        [ $((x % 1000)) -eq 0 ] && echo $x
        mkdir /sys/fs/cgroup/memory/foo
        # echo 1M > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes
        echo 100M > /sys/fs/cgroup/memory/foo/memory.kmem.limit_in_bytes
        echo $$ >/sys/fs/cgroup/memory/foo/cgroup.procs
        memhog 4K &>/dev/null
        echo trex>pages/$x
        echo $$ >/sys/fs/cgroup/memory/cgroup.procs
        rmdir /sys/fs/cgroup/memory/foo
done

Revision history for this message
Yi Wang (wangyi4) wrote :

I tried to use it in my test, but I found I needed to comment out "echo 100M > /sys/fs/cgroup/memory/foo/memory.kmem.limit_in_bytes". Otherwise, it reports an error.

Revision history for this message
Bart Wensley (bartwensley) wrote :

The link to the script is in the "Steps to Reproduce" section above.

Revision history for this message
Yi Wang (wangyi4) wrote :

When I clicked on the link you gave, the browser just brought me to the top of that page, so I didn't know which script you used. After rechecking the link itself, I now see you mean the script in comment #44 of that page. Thanks.

Revision history for this message
Yi Wang (wangyi4) wrote :

@Bart, I need to double-check with you. In your test, you saw slab memory continuously increasing while the script was running, so you concluded there is a kernel memory leak. Is my understanding correct?

Revision history for this message
Yi Wang (wangyi4) wrote :

With my Duplex RT deployment, I can see slab continuously growing while the script runs. Slab grew 1444 MB over ~25 hours. I checked /proc/slabinfo: "dentry" and "proc_inode_cache" contributed 65% of the increase. I used the command "echo 2 > /proc/sys/vm/drop_caches" to free dentries and inodes, and got almost all of the slab memory back. After the command, slab was only 115 MB above the starting point.

I did the same test on a std deployment. I also saw slab increase, as on the RT deployment (it grew 4000 MB over 48 hours), and "dentry" and "proc_inode_cache" contributed 66% of the increase. The slab increase could also be reclaimed with "echo 2 > /proc/sys/vm/drop_caches".

So I am not sure this is a kernel memory leak. @Bart, any comments?

Revision history for this message
Jim Somerville (jsomervi) wrote :

I'm running an overnight soak with the test in #44. My system, though, is not AIO; it has 4 low-latency worker nodes. I'll see if there is a slab memory growth issue. There is a chance, though, that if there is a slab memory leak it is caused by the controller-side (not worker-side) software running on RT, in which case I won't see anything. At least it'll be a data point.

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

Can you capture the following from /proc/meminfo after running the test?

Slab:
SReclaimable:
SUnreclaim:
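
For example, one way to grab just those three fields is:

grep -e "^Slab:" -e "^SReclaimable:" -e "^SUnreclaim:" /proc/meminfo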

Revision history for this message
Yi Wang (wangyi4) wrote :

@Brent, here is the information from my test on the RT deployment:

/proc/meminfo (kB)   Start     After 25 hours   After forcing cache cleanup
Slab:                264776    1743056          382052
SReclaimable:        185752    1252176          183044
SUnreclaim:          79024     490880           199008

Revision history for this message
Yi Wang (wangyi4) wrote :

Sorry, it was not well formatted. There are three columns "Start", "after 25 hours", and "after forcing clean up cache".

Revision history for this message
Jim Somerville (jsomervi) wrote :

What I see on the controller (std not rt) is no slab growth. The computes are a different story though:

compute-0:/home/sysadmin# while (true) do slabtop -o | grep "Total Size" ; sleep 3600; done
 Active / Total Size (% used) : 1053610.34K / 1218935.47K (86.4%)
 Active / Total Size (% used) : 1100021.71K / 1250042.77K (88.0%)
 Active / Total Size (% used) : 1141335.34K / 1297244.87K (88.0%)
 Active / Total Size (% used) : 1230783.93K / 1413354.38K (87.1%)
 Active / Total Size (% used) : 1290241.55K / 1455409.27K (88.7%)
 Active / Total Size (% used) : 1307190.09K / 1473857.66K (88.7%)
 Active / Total Size (% used) : 1386282.48K / 1568656.14K (88.4%)
 Active / Total Size (% used) : 1458323.35K / 1650586.02K (88.4%)
 Active / Total Size (% used) : 1544507.22K / 1763949.11K (87.6%)
 Active / Total Size (% used) : 1603460.28K / 1818082.76K (88.2%)
 Active / Total Size (% used) : 1682133.75K / 1911119.05K (88.0%)
 Active / Total Size (% used) : 1672648.27K / 1871318.73K (89.4%)
 Active / Total Size (% used) : 1763455.32K / 1989393.49K (88.6%)
 Active / Total Size (% used) : 1828832.67K / 2065162.30K (88.6%)
 Active / Total Size (% used) : 1862427.75K / 2105375.59K (88.5%)
 Active / Total Size (% used) : 1918771.76K / 2139599.43K (89.7%)
 Active / Total Size (% used) : 1982376.23K / 2211648.76K (89.6%)
 Active / Total Size (% used) : 1997120.86K / 2225361.31K (89.7%)
 Active / Total Size (% used) : 2092067.80K / 2354015.72K (88.9%)

compute-2:/home/sysadmin# while (true) do slabtop -o | grep "Total Size" ; sleep 3600; done
 Active / Total Size (% used) : 1185455.68K / 1394472.16K (85.0%)
 Active / Total Size (% used) : 1221560.78K / 1463775.23K (83.5%)
 Active / Total Size (% used) : 1294068.54K / 1544066.90K (83.8%)
 Active / Total Size (% used) : 1360230.82K / 1615695.11K (84.2%)
 Active / Total Size (% used) : 1370656.77K / 1611325.72K (85.1%)
 Active / Total Size (% used) : 1444176.09K / 1725588.27K (83.7%)
 Active / Total Size (% used) : 1511712.16K / 1810056.80K (83.5%)
 Active / Total Size (% used) : 1584970.48K / 1903476.67K (83.3%)
 Active / Total Size (% used) : 1656908.51K / 1994978.12K (83.1%)
 Active / Total Size (% used) : 1721124.51K / 2078758.56K (82.8%)
 Active / Total Size (% used) : 1726846.77K / 2066286.02K (83.6%)
 Active / Total Size (% used) : 1810385.39K / 2196059.80K (82.4%)
 Active / Total Size (% used) : 1877687.39K / 2281186.59K (82.3%)
 Active / Total Size (% used) : 1892249.86K / 2276503.49K (83.1%)
 Active / Total Size (% used) : 1983879.84K / 2407284.12K (82.4%)
 Active / Total Size (% used) : 2053470.85K / 2491182.89K (82.4%)
 Active / Total Size (% used) : 2045246.96K / 2471911.83K (82.7%)
 Active / Total Size (% used) : 2110375.64K / 2549514.54K (82.8%)
 Active / Total Size (% used) : 2211556.00K / 2702286.45K (81.8%)

I'll check the meminfo, stay tuned.

Revision history for this message
Jim Somerville (jsomervi) wrote :

compute-0:

Slab: 2440092 kB
SReclaimable: 1497100 kB
SUnreclaim: 942992 kB

compute-2:

Slab: 2699004 kB
SReclaimable: 1543636 kB
SUnreclaim: 1155368 kB

Revision history for this message
Jim Somerville (jsomervi) wrote :

compute-2:

Original size:

Active / Total Size (% used) : 1185455.68K / 1394472.16K (85.0%)

After test:

Active / Total Size (% used) : 2211556.00K / 2702286.45K (81.8%)

Slab: 2699004 kB
SReclaimable: 1543636 kB
SUnreclaim: 1155368 kB

Run: echo 2 > /proc/sys/vm/drop_caches

After cache drop:

Active / Total Size (% used) : 462882.69K / 916830.89K (50.5%)

Slab: 915472 kB
SReclaimable: 347784 kB
SUnreclaim: 567688 kB

Revision history for this message
Jim Somerville (jsomervi) wrote :

I'm now running a soak with this on the computes:

while (true) do echo 2 > /proc/sys/vm/drop_caches ; cat /proc/meminfo | grep -e Slab -e SReclaimable -e SUnreclaim ; sleep 3600 ; done

Looking to see if the non-cache slab keeps growing.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, since there is no slab growth on your controller, can you check whether busybox containers were created on your controller or not while running the script?

My second-round test on std ended yesterday. Slab increased ~4 GB in two days. Please check the log I attached.

Revision history for this message
Yi Wang (wangyi4) wrote :

One experiment I did on my std deployment: I increased the hugepage count to consume free memory. When free memory dropped below 1 GB, I saw a few GB of memory suddenly freed by the kernel. Slab was reduced from 4592.7 MB to 854.9 MB.
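
A sketch of that kind of pressure test (the hugepage count below is only illustrative; pick a value that leaves less than 1 GB free on the node):

# reserve 2M hugepages to consume free memory, then watch free memory and slab react
echo 20000 > /proc/sys/vm/nr_hugepages
grep -e HugePages_Total -e MemFree -e "^Slab:" /proc/meminfo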

Revision history for this message
Jim Somerville (jsomervi) wrote :

"can you check whether busybox containers were created on your controller or not while running the script?"

I would be extremely surprised if they were, since controllers are not worker nodes. I will verify though.

Results from the soak yesterday:

Compute-0:

Slab: 633812 kB
Slab: 544508 kB
Slab: 539820 kB
Slab: 550332 kB
Slab: 603928 kB
Slab: 640924 kB
Slab: 580504 kB
Slab: 651064 kB
Slab: 652480 kB
Slab: 647492 kB
Slab: 588076 kB
Slab: 656500 kB
Slab: 601280 kB
Slab: 598632 kB
Slab: 596580 kB
Slab: 648440 kB
Slab: 598152 kB
Slab: 661296 kB
Slab: 602204 kB
Slab: 608260 kB

Compute-1:

Slab: 686936 kB
Slab: 592648 kB
Slab: 587924 kB
Slab: 601324 kB
Slab: 626056 kB
Slab: 619568 kB
Slab: 617952 kB
Slab: 621208 kB
Slab: 694220 kB
Slab: 621596 kB
Slab: 616248 kB
Slab: 622316 kB
Slab: 684316 kB
Slab: 679484 kB
Slab: 621780 kB
Slab: 666504 kB
Slab: 616776 kB
Slab: 656132 kB
Slab: 618440 kB
Slab: 683832 kB

Compute-2 and compute-3: similar results.

It looks to me like non-AIO low-latency doesn't exhibit a problem. Slab size bounces around a bit as expected, but doesn't appear to be slowly growing. My system has been up for 2+ days now and has endured 2 consecutive overnight soaks of the #44 test, busybox pod launches/destructions. The numbers just aren't concerning.

Yi says that when he used hugepages to put free memory under pressure, the slab went from 4.5 GB down to 854 MB. That behaviour seems correct and is consistent with the kernel releasing buffers/cache.

If there is a slab leak issue in RT, it appears that it would be specific to AIO. I will have to obtain an AIO lab to do further investigation.

Revision history for this message
Jim Somerville (jsomervi) wrote :

I'm now running a soak on lowlat AIO-DX, same hourly cache dropping as before. I'm not running the stx-openstack application as I don't see how that would be relevant anyway.

I'm also going to continue soaking the non-AIO lab, looking for any sign that the Slab is slowly creeping up.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, thanks for your update! According to your data, my understanding is that there is no obvious memory leak on your RT compute nodes.

If no containers were created on the controller nodes and the controllers didn't show slab growth, that is expected. I just wanted to double-confirm that.

Revision history for this message
Jim Somerville (jsomervi) wrote :

Yi, yes, there were no containers being created on controller nodes in the non-AIO configuration test, and the slab wasn't growing there either.

Results from my AIO-DX soak do, however, show the slab slowly growing:

Controller-0:

Slab: 268936 kB
Slab: 296940 kB
Slab: 277736 kB
Slab: 278688 kB
Slab: 300752 kB
Slab: 277628 kB
Slab: 268448 kB
Slab: 301892 kB
Slab: 297892 kB
Slab: 302120 kB
Slab: 295380 kB
Slab: 287824 kB
Slab: 299636 kB
Slab: 305488 kB
Slab: 310592 kB
Slab: 305516 kB
Slab: 318464 kB
Slab: 321188 kB
Slab: 322368 kB
Slab: 312160 kB
Slab: 323324 kB
Slab: 317580 kB
Slab: 334332 kB

Controller-1:

Slab: 263728 kB
Slab: 308420 kB
Slab: 298448 kB
Slab: 306616 kB
Slab: 300252 kB
Slab: 286104 kB
Slab: 280876 kB
Slab: 324992 kB
Slab: 304344 kB
Slab: 337228 kB
Slab: 310084 kB
Slab: 333780 kB
Slab: 317436 kB
Slab: 352048 kB
Slab: 323700 kB
Slab: 338316 kB
Slab: 359848 kB
Slab: 358048 kB
Slab: 350500 kB
Slab: 339132 kB
Slab: 358372 kB
Slab: 348456 kB
Slab: 350408 kB

This is from the usual hourly cache drop and Slab total printout. It moves up and down, but the trend is clearly upward.

We have a long weekend here in Canada, I'm hoping to be able to perform an AIO-DX soak over that entire period.

Revision history for this message
Jim Somerville (jsomervi) wrote :

The soak is launched, and I'm also grabbing the top 100 entries in /proc/slabinfo every hour.

I'm not completely convinced that there is a problem here yet.

I haven't installed the openstack application, keep that in mind.

We could also try turning on the CONFIG_DEBUG_KMEMLEAK option next week to see if that uncovers anything interesting.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, based on your data, controller-0 grew at about 2.9 MB per hour, and controller-1 at 3.8 MB per hour. In my test on the RT kernel, I flushed the cached memory once after ~25 hours; the growth rate was 4.6 MB per hour. Checking slabinfo, I saw kmalloc-2048 and buffer_head contributed almost all of the growth. Let's see if you get a similar result.

Revision history for this message
Jim Somerville (jsomervi) wrote :

After the weekend soak, I see 2.38 MB/hr growth on controller-0 (inactive) and 2.98 MB/hr on the active controller-1.

Slabinfo, like yours, shows kmalloc-2048 and buffer_head being the culprits. So we get similar results though your growth numbers are higher. You are probably using a faster platform than I am.

I tried turning on slub tracing through kmalloc-2048. I get a *lot* of trace points with tracebacks in the kernel log, but it also slows down everything by a massive amount.

Huge buffer_head usage tends to implicate I/O. Tons of 2 KB buffers in use makes me wonder about networking, with MTUs of 1500. Are we failing to free some network buffers somewhere?

A couple of things to potentially try: drive a lot of network traffic using something like netcat and see if slab leaks, and try setting the MTU of the network interfaces to 3K to see if the leakage moves to kmalloc-4096.
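
Roughly along these lines (a sketch only; the interface name and peer address are placeholders, and netcat flag syntax varies between builds):

# receiver: sink a lot of TCP traffic
nc -l -p 5001 > /dev/null
# sender: push ~10 GB through the interface, then re-check kmalloc-2048 growth
dd if=/dev/zero bs=1M count=10000 | nc <receiver-ip> 5001
# bump the MTU to ~3K and see whether the growth moves to kmalloc-4096
ip link set dev <iface> mtu 3000
grep -e "^kmalloc-2048 " -e "^kmalloc-4096 " /proc/slabinfo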

I also wonder if this could be the source of the leak https://lkml.org/lkml/2018/11/2/182

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, you can use "echo 3 > /proc/sys/vm/drop_caches" to flush the cache further. In my test, this command released most of "buffer_head" but had no effect on "kmalloc-2048". Below are the detailed results, in MB (I enabled slub_debug for all entries, so the growth is higher than before).

                            running script    after echo 2    after echo 3
                            (~13.5 hours)
proc_inode_cache increase   305               0               -1
dentry increase             893               0               0
buffer_head increase        90                83              12
kmalloc-2048 increase       60                60              60
kmalloc-128 increase        291               11              11
total slab increase         1997              165             89

After echoing 3 to drop_caches, slab had grown 89 MB in total: kmalloc-2048 grew 60 MB, buffer_head grew 12 MB, and kmalloc-128 grew 11 MB. So I am focusing on kmalloc-2048.

Revision history for this message
Jim Somerville (jsomervi) wrote :

I'm booting with the Linux boot argument "slub_debug=U,kmalloc-2048", which seems much better than tracing kmalloc-2048, which just crippled performance and flooded the kernel logs. I'll get the AIO-DX lab back tomorrow; today I'll see if AIO-SX has the issue.
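
With owner tracking enabled that way, the per-call-site allocation counts for the cache should be readable from sysfs, something like:

# sort the recorded allocation call sites by count (path per the SLUB debugging docs)
sort -rn /sys/kernel/slab/kmalloc-2048/alloc_calls | head -20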

Revision history for this message
Jim Somerville (jsomervi) wrote :

I see a lot of kmalloc-2048 allocations coming from:

1271 sget_userns+0xca/0x610 age=14530/240297/558472 pid=0-174597 cpus=0-1

Which makes me wonder if it might be this issue:

https://lkml.org/lkml/2019/5/28/187

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, I wrote a script to capture information about kmalloc-2048. The script has four steps:
1. Create a busybox container.
2. Get the kmalloc-2048 info from slabinfo.
3. Delete the container.
4. Get the kmalloc-2048 info again.

An interesting thing is that the number of kmalloc-2048 objects shows periodic characteristics, as shown in the attached picture, and has an upward trend.

Furthermore, I found the number of kmalloc-2048 objects is not equal to the number of alloc_calls, and the difference was sometimes large. I don't know whether that is expected, because for some slabs the two numbers are equal and for others they are not.
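
A rough sketch of such a capture loop (the container runtime command and output file names here are hypothetical; the actual script may differ):

for i in $(seq 100); do
    docker run -d --name leaktest busybox sleep 30 > /dev/null    # 1. create a busybox container
    grep "^kmalloc-2048 " /proc/slabinfo >> kmalloc-2048.created  # 2. record kmalloc-2048 info
    docker rm -f leaktest > /dev/null                             # 3. delete the container
    grep "^kmalloc-2048 " /proc/slabinfo >> kmalloc-2048.deleted  # 4. record it again
done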

Revision history for this message
Yi Wang (wangyi4) wrote :

I checked https://lkml.org/lkml/2019/5/28/187, but I don't find the related code in the StarlingX 3.10 kernel codebase.

Revision history for this message
Jim Somerville (jsomervi) wrote :

Not to mention that the leak in https://lkml.org/lkml/2019/5/28/187 occurs while freeing slabs in an error path, so you'd think there would be a log of the triggering error somewhere. Also, our leak is seen in the rt kernel only.

The trend of the graph in the picture does show a long slow small regular upward progression, so an off-by-one error in a free operation would sort of fit the bill here.

I'm going to look at turning on the config option for the kmemleak feature and rebuild.

I'm also going to stare at source code diffs between std and rt kernels, essentially looking at the rt patch set, to see if I can find anything that might stand out as a causing candidate.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, I ran my script on a simplex configuration with the standard kernel (Linux controller-0 3.10.0-957.12.2.el7.2.tis.x86_64) and got a similar result to the rt kernel. So I suspect this is not an issue unique to the rt kernel.

Revision history for this message
Jim Somerville (jsomervi) wrote :

Yi, that makes sense to me, it not being specific to rt. The rt patch set is often credited with finding issues in std linux, and I thought that might be the case here. But you're saying that you can see the leak even with the std kernel.

The next step I would suggest is to try to reproduce the problem in simplex with std linux but with all of our StarlingX patches to it dropped, just to rule out *our* kernel patches. Even better if we can reproduce it with out-of-the-box CentOS 7.6 (we might have to recompile it with that broken MEMCG accounting config option turned off), cutting all StarlingX software completely out of the picture. Then we can report it to CentOS/RedHat and perhaps get more eyes on it.

Revision history for this message
Jim Somerville (jsomervi) wrote :

I'm having trouble trying to use the kmemleak feature in the kernel. When I boot with it on, eventually I get systemd and a bunch of kworker tasks all hanging, and I cannot make it through the bootup.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, I am not saying std Linux has a memory leak. I just found that all the experiments I did showed a similar phenomenon on both the std and rt kernels.
Based on all the results we have so far, can we say there is a memory leak in the kernel? I am not a kernel expert, so I am not sure about this.

Revision history for this message
Jim Somerville (jsomervi) wrote :

Yi, it's not an easy call to make, but imo if slab use keeps increasing and we don't have a valid explanation for it, then there is probably a kernel leak. "Kernel" here also includes out-of-tree drivers and other kernel modules.

Another topic though is whether or not the leak is big enough to actually worry about.

I'm going to try AIO-DX std, not rt, and see if I can confirm your findings, that std and rt exhibit the same behavior.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, thanks for the explanation! Per your suggestion, I am setting up a vanilla CentOS 7.6 to verify this issue. Furthermore, I will try to compile a kernel with the MEMCG accounting config option turned off to replace the vanilla CentOS kernel.

Revision history for this message
Cindy Xie (xxie1) wrote :

Multi-node config: no issue on the compute nodes; on AIO-SX, kernel slab grows at ~3 MB/hour, with kmalloc-2048 making up most of it. Same observation from Jim and Yi. More experiments are planned on stress-testing vanilla CentOS 7.6. One more patch needs to be added, so the kernel needs a rebuild.

Revision history for this message
Jim Somerville (jsomervi) wrote :

My AIO-DX experiment, std kernel.

Echo 3 into drop_caches and report every hour, with the usual test script creating and destroying 20 busybox containers at a time.

Controller-0 (active):

SUnreclaim: 96024 kB
SUnreclaim: 95880 kB
SUnreclaim: 105164 kB
SUnreclaim: 97112 kB
SUnreclaim: 105216 kB
SUnreclaim: 112376 kB
SUnreclaim: 110028 kB
SUnreclaim: 115460 kB
SUnreclaim: 109356 kB
SUnreclaim: 108704 kB
SUnreclaim: 115244 kB
SUnreclaim: 108664 kB
SUnreclaim: 114956 kB
SUnreclaim: 110532 kB
SUnreclaim: 117748 kB
SUnreclaim: 116984 kB
SUnreclaim: 113348 kB
SUnreclaim: 109864 kB
SUnreclaim: 110364 kB
SUnreclaim: 107960 kB
SUnreclaim: 117200 kB
SUnreclaim: 116636 kB
SUnreclaim: 106624 kB
SUnreclaim: 96420 kB <- test script has stopped by this point

Controller-1:

SUnreclaim: 81288 kB
SUnreclaim: 86736 kB
SUnreclaim: 82328 kB
SUnreclaim: 84320 kB
SUnreclaim: 99876 kB
SUnreclaim: 100388 kB
SUnreclaim: 100464 kB
SUnreclaim: 95744 kB
SUnreclaim: 105292 kB
SUnreclaim: 95164 kB
SUnreclaim: 96652 kB
SUnreclaim: 98364 kB
SUnreclaim: 97896 kB
SUnreclaim: 95816 kB
SUnreclaim: 98380 kB
SUnreclaim: 96396 kB
SUnreclaim: 95580 kB
SUnreclaim: 100276 kB
SUnreclaim: 102860 kB
SUnreclaim: 98244 kB
SUnreclaim: 97508 kB
SUnreclaim: 100100 kB
SUnreclaim: 85512 kB
SUnreclaim: 81800 kB <- test script has stopped by this point if not earlier

After 24 hours, there is no clear sign of any slab leak imo. kmalloc-2048 use did not grow over the course of the run.

I will rerun based on an rt load (yet again) but this time I will be echoing 3 into the drop caches control, where I was only doing 2 before, and look specifically at SUnreclaim.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, there are three differences between your test and mine: my environment is AIO-SX, I checked Slab instead of SUnreclaim, and I used "echo 2". I don't have an AIO-DX right now, so I am running another round of testing to check SUnreclaim with "echo 3".

Revision history for this message
Jim Somerville (jsomervi) wrote :

I definitely see the leak using the RT load. Results of last night's run:

Controller-0:
SUnreclaim: 90616 kB
SUnreclaim: 102748 kB
SUnreclaim: 112456 kB
SUnreclaim: 117508 kB
SUnreclaim: 112396 kB
SUnreclaim: 122428 kB
SUnreclaim: 118684 kB
SUnreclaim: 125184 kB
SUnreclaim: 129824 kB
SUnreclaim: 133324 kB
SUnreclaim: 135588 kB
SUnreclaim: 130272 kB
SUnreclaim: 131140 kB
SUnreclaim: 142324 kB
SUnreclaim: 146132 kB
SUnreclaim: 149108 kB
SUnreclaim: 149480 kB
SUnreclaim: 153444 kB
SUnreclaim: 152984 kB
SUnreclaim: 157844 kB
SUnreclaim: 152300 kB
SUnreclaim: 144548 kB <- test script had stopped by this point
SUnreclaim: 144276 kB
SUnreclaim: 144332 kB

Controller-1:

SUnreclaim: 58976 kB
SUnreclaim: 97612 kB
SUnreclaim: 92252 kB
SUnreclaim: 90768 kB
SUnreclaim: 107684 kB
SUnreclaim: 97192 kB
SUnreclaim: 120032 kB
SUnreclaim: 113316 kB
SUnreclaim: 113792 kB
SUnreclaim: 117628 kB
SUnreclaim: 121772 kB
SUnreclaim: 138648 kB
SUnreclaim: 143272 kB
SUnreclaim: 134924 kB
SUnreclaim: 142976 kB
SUnreclaim: 145196 kB
SUnreclaim: 155584 kB
SUnreclaim: 155300 kB
SUnreclaim: 163536 kB
SUnreclaim: 170276 kB
SUnreclaim: 158492 kB
SUnreclaim: 152152 kB <- test script stopped around here
SUnreclaim: 152088 kB
SUnreclaim: 149624 kB
SUnreclaim: 149512 kB

My next step is going to be trying to use kmemleak again, but this time turning it off at bootup and manually mounting debugfs to get at its controls. It should be a simple case of turning it on, running a few create/destroy cycles of the test, then triggering a scan and hoping it reports unreachable blocks with a traceback of where they were allocated. And hoping that the scan doesn't cause the system to lock up.

I also may pick up this patch https://lkml.org/lkml/2018/10/22/621 if I have no choice but to enable it at bootup time. The autoscanning was causing some processes to hang and lock up my system last time I tried it at bootup.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, my test on the AIO-SX std kernel is aligned with your result: there is no slab growth on the std kernel. In my previous test, I only performed "echo 2" once, just after stopping the test script, so I suppose the growth in that result was just fluctuation. I have re-launched the test on an AIO-DX rt environment. Let's see if I get the same result as yours.

SUnreclaim: 198540 kB
SUnreclaim: 193500 kB
SUnreclaim: 194864 kB
SUnreclaim: 192704 kB
SUnreclaim: 197500 kB
SUnreclaim: 198544 kB
SUnreclaim: 191312 kB
SUnreclaim: 191516 kB
SUnreclaim: 189936 kB
SUnreclaim: 194952 kB
SUnreclaim: 193500 kB
SUnreclaim: 195156 kB
SUnreclaim: 192600 kB
SUnreclaim: 198220 kB
SUnreclaim: 190640 kB
SUnreclaim: 197504 kB
SUnreclaim: 189896 kB
SUnreclaim: 194724 kB
SUnreclaim: 190420 kB
SUnreclaim: 200532 kB
SUnreclaim: 192896 kB
SUnreclaim: 195264 kB
SUnreclaim: 190612 kB
SUnreclaim: 202656 kB
SUnreclaim: 190492 kB
SUnreclaim: 193272 kB
SUnreclaim: 192852 kB
SUnreclaim: 194380 kB
SUnreclaim: 191436 kB
SUnreclaim: 193264 kB
SUnreclaim: 191424 kB
SUnreclaim: 194200 kB
SUnreclaim: 194468 kB
SUnreclaim: 193752 kB
SUnreclaim: 191348 kB
SUnreclaim: 194124 kB
SUnreclaim: 191364 kB

Revision history for this message
Jim Somerville (jsomervi) wrote :

We need the kmemleak feature in the kernel to get this debugged. BUT the feature doesn't want to play ball in our RT load. This is after I even patched it to stop the auto scanning part at bootup:

[ OK ] Started TIS Patching Controller Daemon.
[ OK ] Started TIS Patching Agent.
[ OK ] Started Titanium Cloud Affine Platform.
[ OK ] Started StarlingX Affine Tasks.
[ 282.976723] INFO: rcu_preempt self-detected stall on CPU { 1} (t=60000 jiffies g=28189 c=28188 q=812)
[ 462.972697] INFO: rcu_preempt self-detected stall on CPU { 1} (t=240003 jiffies g=28189 c=28188 q=827)
[ 642.968670] INFO: rcu_preempt self-detected stall on CPU { 1} (t=420006 jiffies g=28189 c=28188 q=827)
[ 822.964646] INFO: rcu_preempt self-detected stall on CPU { 1} (t=600009 jiffies g=28189 c=28188 q=827)
[ 1002.960617] INFO: rcu_preempt self-detected stall on CPU { 1} (t=780012 jiffies g=28189 c=28188 q=827)
[ 1182.956593] INFO: rcu_preempt self-detected stall on CPU { 1} (t=960015 jiffies g=28189 c=28188 q=827)
[ 1296.079429] INFO: task systemd:1 blocked for more than 600 seconds.
[ 1296.123474] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1296.131403] INFO: task kworker/0:0:5 blocked for more than 600 seconds.
[ 1296.138006] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1296.145903] INFO: task kworker/u32:0:7 blocked for more than 600 seconds.
[ 1296.152679] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1296.160643] INFO: task kworker/9:2:203 blocked for more than 600 seconds.
[ 1296.167423] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1296.175315] INFO: task kworker/u32:3:2166 blocked for more than 600 seconds.
[ 1296.182352] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1296.190294] INFO: task kworker/u32:4:2167 blocked for more than 600 seconds.
[ 1296.197335] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1296.205228] INFO: task kworker/u32:6:2169 blocked for more than 600 seconds.
[ 1296.212266] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1296.220145] INFO: task systemd-journal:3521 blocked for more than 600 seconds.
[ 1296.227354] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1296.235229] INFO: task syslog-ng:6634 blocked for more than 600 seconds.
[ 1296.241924] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1296.249804] INFO: task irq/61-eno1-TxR:7108 blocked for more than 600 seconds.
[ 1296.257012] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

While I was looking for a possible solution to get the kmemleak feature working in RT, I stumbled across this patch in the upstream RT patch set:

bash-4.2$ cat mm-slub-close-possible-memory-leak-in-kmem_cache_all.patch
From d7c930ac33f8ced075737f9b8d7816163650381e Mon Sep 17 00:00:00 2001
Message-Id: <d7c930ac33f8ced075737f9b8d7816163650381e<email address hidden>>
From: Sebastian Andrzej Siewior <email address hidden>
Date: Wed, 13 Dec 2017 12:44:14 +0100
Subject: [PATCH 1/1] mm/slub...

Read more...

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, here is my test result on AIO-DX rt. It is aligned with yours. The test on the vanilla CentOS 7.6 rt kernel plus the kernel memory accounting disable patch is ongoing; I will update with its result later.

Controller-0
SUnreclaim: 53368 kB
SUnreclaim: 78596 kB
SUnreclaim: 74176 kB
SUnreclaim: 80908 kB
SUnreclaim: 85848 kB
SUnreclaim: 91000 kB
SUnreclaim: 95708 kB
SUnreclaim: 88612 kB
SUnreclaim: 102852 kB
SUnreclaim: 106984 kB
SUnreclaim: 101264 kB
SUnreclaim: 114172 kB
SUnreclaim: 117128 kB
SUnreclaim: 120516 kB
SUnreclaim: 111548 kB
SUnreclaim: 123260 kB
SUnreclaim: 130696 kB
SUnreclaim: 134692 kB
SUnreclaim: 136868 kB
SUnreclaim: 142244 kB
SUnreclaim: 144160 kB
SUnreclaim: 146184 kB
SUnreclaim: 149604 kB
SUnreclaim: 148192 kB
SUnreclaim: 150976 kB
SUnreclaim: 151464 kB
SUnreclaim: 151524 kB
SUnreclaim: 168572 kB
SUnreclaim: 165468 kB
SUnreclaim: 170632 kB
SUnreclaim: 169788 kB
SUnreclaim: 182772 kB
SUnreclaim: 185720 kB
SUnreclaim: 179896 kB
SUnreclaim: 186524 kB
SUnreclaim: 196428 kB
SUnreclaim: 198324 kB
SUnreclaim: 191596 kB
SUnreclaim: 206572 kB
SUnreclaim: 197976 kB
SUnreclaim: 203340 kB
SUnreclaim: 205988 kB
SUnreclaim: 207568 kB
SUnreclaim: 209176 kB
SUnreclaim: 210508 kB <- test script stopped before this point
SUnreclaim: 209536 kB
SUnreclaim: 207148 kB
SUnreclaim: 207080 kB
SUnreclaim: 207224 kB

Controller-1
SUnreclaim: 90860 kB
SUnreclaim: 96972 kB
SUnreclaim: 101532 kB
SUnreclaim: 103768 kB
SUnreclaim: 102772 kB
SUnreclaim: 108996 kB
SUnreclaim: 112424 kB
SUnreclaim: 114516 kB
SUnreclaim: 118416 kB
SUnreclaim: 115364 kB
SUnreclaim: 122408 kB
SUnreclaim: 125908 kB
SUnreclaim: 126604 kB
SUnreclaim: 128300 kB
SUnreclaim: 131504 kB
SUnreclaim: 133840 kB
SUnreclaim: 131316 kB
SUnreclaim: 135292 kB
SUnreclaim: 135952 kB
SUnreclaim: 136348 kB
SUnreclaim: 139832 kB
SUnreclaim: 141840 kB
SUnreclaim: 145524 kB
SUnreclaim: 149260 kB
SUnreclaim: 151480 kB
SUnreclaim: 148912 kB
SUnreclaim: 160500 kB
SUnreclaim: 160352 kB
SUnreclaim: 155608 kB
SUnreclaim: 160752 kB
SUnreclaim: 164120 kB
SUnreclaim: 162236 kB
SUnreclaim: 164764 kB
SUnreclaim: 169944 kB
SUnreclaim: 169700 kB
SUnreclaim: 171776 kB
SUnreclaim: 174736 kB
SUnreclaim: 180828 kB
SUnreclaim: 178884 kB
SUnreclaim: 184584 kB
SUnreclaim: 185980 kB
SUnreclaim: 189316 kB
SUnreclaim: 186700 kB
SUnreclaim: 189448 kB
SUnreclaim: 188132 kB <- test script stopped before this point
SUnreclaim: 185928 kB
SUnreclaim: 186044 kB
SUnreclaim: 185948 kB
SUnreclaim: 185884 kB

Revision history for this message
Jim Somerville (jsomervi) wrote :

Yi, the leak fix patch I tested over the weekend, namely the simple one from Sebastian Andrzej Siewior, did not help at all.

So I'm back to looking at making kmemleak work. It needs to be enabled right from bootup, but doing that causes eventual lockups and stalls during bootup as I showed earlier.

Revision history for this message
Jim Somerville (jsomervi) wrote :

OK, some good news finally. I got kmemleak to work by converting the kmemleak_lock to a raw spinlock. So no more hangs at bootup.

After bootup on controller-0 I checked for kmemleaks and see none. So nothing appeared to leak during bootup:
cd /sys/kernel/debug
echo "scan" >kmemleak
# pauses for some time while the scanning is happening
cat kmemleak
# nothing displayed

So I started the test script, and after a couple of iterations did another scan and cat'ed out the results. This time it showed something; the 2048-byte block being leaked is:

unreferenced object 0xffff8c2366115000 (size 2048):
  comm "runc:[1:CHILD]", pid 228381, jiffies 4296019725 (age 3317.421s)
  hex dump (first 32 bytes):
    01 1c 62 c0 ff ff ff ff 60 17 e1 59 23 8c ff ff ..b.....`..Y#...
    04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
  backtrace:
    [<ffffffffb47f3475>] __kmalloc_track_caller+0xf5/0x220
    [<ffffffffb47ad1a0>] kmemdup+0x20/0x50
    [<ffffffffc06147c4>] ip_vs_control_net_init+0x254/0x520 [ip_vs]
    [<ffffffffc060b9eb>] __ip_vs_init+0x8b/0x120 [ip_vs]
    [<ffffffffb4cfdf24>] ops_init+0x44/0x150
    [<ffffffffb4cfe103>] setup_net+0xd3/0x190
    [<ffffffffb4cfe8d5>] copy_net_ns+0xb5/0x180
    [<ffffffffb46ad949>] create_new_namespaces+0xf9/0x180
    [<ffffffffb46adb9a>] unshare_nsproxy_namespaces+0x5a/0xc0
    [<ffffffffb467a673>] SyS_unshare+0x173/0x2f0
    [<ffffffffb4e3c2ad>] tracesys+0xa3/0xc9
    [<ffffffffffffffff>] 0xffffffffffffffff

All 2048 byte leaks have this same backtrace.
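
For instance, the number of leaked objects reported so far can be counted with:

grep -c "unreferenced object" /sys/kernel/debug/kmemleak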

We also seem to have another leak here in logrotate:

unreferenced object 0xffff8c2265174d40 (size 64):
  comm "logrotate", pid 1156105, jiffies 4302291385 (age 192.662s)
  hex dump (first 32 bytes):
    01 00 00 00 22 8c ff ff 00 00 00 00 00 00 00 00 ...."...........
    03 00 00 00 01 00 06 00 00 00 00 00 04 00 04 00 ................
  backtrace:
    [<ffffffffb47f04da>] __kmalloc+0xfa/0x230
    [<ffffffffb4872bcc>] posix_acl_alloc+0x1c/0x30
    [<ffffffffb4873522>] posix_acl_from_xattr+0x82/0x190
    [<ffffffffc0a7f152>] ext4_xattr_set_acl+0x92/0x2d0 [ext4]
    [<ffffffffb483366b>] generic_setxattr+0x6b/0x90
    [<ffffffffb4833f75>] __vfs_setxattr_noperm+0x65/0x1b0
    [<ffffffffb4834175>] vfs_setxattr+0xb5/0xc0
    [<ffffffffb48342cc>] setxattr+0x14c/0x1e0
    [<ffffffffb48346be>] SyS_fsetxattr+0xce/0x110
    [<ffffffffb4e3c2ad>] tracesys+0xa3/0xc9
    [<ffffffffffffffff>] 0xffffffffffffffff

Revision history for this message
Jim Somerville (jsomervi) wrote :

Looks like the fix should be (from upstream Linus' tree):

commit f30bf2a5cac6c60ab366c4bc6db913597bf4d6ab
Author: Tommi Rantala <email address hidden>
Date: Thu May 7 15:12:21 2015 +0300

    ipvs: fix memory leak in ip_vs_ctl.c

    Fix memory leak introduced in commit a0840e2e165a ("IPVS: netns,
    ip_vs_ctl local vars moved to ipvs struct."):

    unreferenced object 0xffff88005785b800 (size 2048):
      comm "(-localed)", pid 1434, jiffies 4294755650 (age 1421.089s)
      hex dump (first 32 bytes):
        bb 89 0b 83 ff ff ff ff b0 78 f0 4e 00 88 ff ff .........x.N....
        04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
      backtrace:
        [<ffffffff8262ea8e>] kmemleak_alloc+0x4e/0xb0
        [<ffffffff811fba74>] __kmalloc_track_caller+0x244/0x430
        [<ffffffff811b88a0>] kmemdup+0x20/0x50
        [<ffffffff823276b7>] ip_vs_control_net_init+0x1f7/0x510
        [<ffffffff8231d630>] __ip_vs_init+0x100/0x250
        [<ffffffff822363a1>] ops_init+0x41/0x190
        [<ffffffff82236583>] setup_net+0x93/0x150
        [<ffffffff82236cc2>] copy_net_ns+0x82/0x140
        [<ffffffff810ab13d>] create_new_namespaces+0xfd/0x190
        [<ffffffff810ab49a>] unshare_nsproxy_namespaces+0x5a/0xc0
        [<ffffffff810833e3>] SyS_unshare+0x173/0x310
        [<ffffffff8265cbd7>] system_call_fastpath+0x12/0x6f
        [<ffffffffffffffff>] 0xffffffffffffffff

    Fixes: a0840e2e165a ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct.")
    Signed-off-by: Tommi Rantala <email address hidden>
    Acked-by: Julian Anastasov <email address hidden>
    Signed-off-by: Simon Horman <email address hidden>

I will proceed to test this overnight.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, that's awesome!! Looks like the backtraces match exactly. I will check that patch on my side too.

Revision history for this message
Jim Somerville (jsomervi) wrote :

I wasn't able to get my controller-1 into service, but ran the test anyway just using controller-0. Containers only launched on controller-0, so the test script was only running at about half the speed that it normally does. But the results look good.

This appears to be the logrotate leak:

https://lists.openvz.org/pipermail/devel/2018-February/071751.html

I'll get that fix in as well and make another attempt at getting an install done with both nodes up so I can run the standard test.

Revision history for this message
Jim Somerville (jsomervi) wrote :

OK, I now have both controllers up and am able to do the usual soak test on rt.

I'm testing these 3 patches:
ipvs-fix-memory-leak-in-ip_vs_ctl.c.patch
rh-ext4-release-leaked-posix-acl-in-ext4_acl_chmod.patch
rh-ext4-release-leaked-posix-acl-in-ext4_xattr_set_a.patch

The last 2 are from the openvz project, to fix the 64 byte logrotate leak. I also grabbed the acl chmod one since it was done at the same time by the same person as the xattr set one.

Stay tuned for results tomorrow.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, here is the result I got with the first patch applied. The test script is running on controller-1. Controller-0 shows no apparent growth; controller-1 appears to have some growth.

Controller-0
SUnreclaim: 66316 kB
SUnreclaim: 71264 kB
SUnreclaim: 69800 kB
SUnreclaim: 81964 kB
SUnreclaim: 71620 kB
SUnreclaim: 72144 kB
SUnreclaim: 78672 kB
SUnreclaim: 80092 kB
SUnreclaim: 81612 kB
SUnreclaim: 83260 kB
SUnreclaim: 72724 kB
SUnreclaim: 74252 kB
SUnreclaim: 72308 kB
SUnreclaim: 69212 kB
SUnreclaim: 66540 kB
SUnreclaim: 74192 kB
SUnreclaim: 74480 kB
SUnreclaim: 68396 kB

Controller-1
SUnreclaim: 53980 kB
SUnreclaim: 66052 kB
SUnreclaim: 70932 kB
SUnreclaim: 58520 kB
SUnreclaim: 65676 kB
SUnreclaim: 70644 kB
SUnreclaim: 69268 kB
SUnreclaim: 57400 kB
SUnreclaim: 57352 kB
SUnreclaim: 54480 kB
SUnreclaim: 68200 kB
SUnreclaim: 69148 kB
SUnreclaim: 78348 kB
SUnreclaim: 85056 kB
SUnreclaim: 85232 kB
SUnreclaim: 73728 kB
SUnreclaim: 72876 kB
SUnreclaim: 87264 kB

I am building a kernel to include all three patches.

Revision history for this message
Jim Somerville (jsomervi) wrote :

My results, all 3 patches:

controller-0:~$ sudo ./monitor
Password:
SUnreclaim: 87988 kB
SUnreclaim: 96640 kB <- test script started
SUnreclaim: 97604 kB
SUnreclaim: 107916 kB
SUnreclaim: 110892 kB
SUnreclaim: 109180 kB
SUnreclaim: 98960 kB
SUnreclaim: 106880 kB
SUnreclaim: 103748 kB
SUnreclaim: 109372 kB
SUnreclaim: 97964 kB
SUnreclaim: 108020 kB
SUnreclaim: 96884 kB
SUnreclaim: 101424 kB
SUnreclaim: 109536 kB
SUnreclaim: 100108 kB
SUnreclaim: 107372 kB
SUnreclaim: 100436 kB
SUnreclaim: 98532 kB
SUnreclaim: 91628 kB <- test script stopped

controller-1:~$ sudo ./monitor
Password:
SUnreclaim: 62788 kB
SUnreclaim: 91200 kB <- test script started
SUnreclaim: 92548 kB
SUnreclaim: 80656 kB
SUnreclaim: 90224 kB
SUnreclaim: 82636 kB
SUnreclaim: 89412 kB
SUnreclaim: 75256 kB
SUnreclaim: 92008 kB
SUnreclaim: 80916 kB
SUnreclaim: 84104 kB
SUnreclaim: 81492 kB
SUnreclaim: 90524 kB
SUnreclaim: 90660 kB
SUnreclaim: 85580 kB
SUnreclaim: 91420 kB
SUnreclaim: 88492 kB
SUnreclaim: 91520 kB
SUnreclaim: 90088 kB
SUnreclaim: 71116 kB <- test script stopped

Definitely no long term continuous growth like we've seen in the past, so I'm declaring this solved.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, below are my results. With the three patches applied, there is no obvious growth. Would you prefer to submit the patch for this LP yourself, or would you like me to submit it?

controller-0
SUnreclaim: 65944 kB
SUnreclaim: 70156 kB
SUnreclaim: 80100 kB
SUnreclaim: 71828 kB
SUnreclaim: 72796 kB
SUnreclaim: 72132 kB
SUnreclaim: 72196 kB
SUnreclaim: 82424 kB
SUnreclaim: 70720 kB
SUnreclaim: 81280 kB
SUnreclaim: 71536 kB
SUnreclaim: 80400 kB
SUnreclaim: 71368 kB
SUnreclaim: 82872 kB
SUnreclaim: 74668 kB
SUnreclaim: 72992 kB
SUnreclaim: 71952 kB
SUnreclaim: 72440 kB
SUnreclaim: 71212 kB
SUnreclaim: 74204 kB
SUnreclaim: 83620 kB
SUnreclaim: 73316 kB
SUnreclaim: 72144 kB

controller-1
SUnreclaim: 54504 kB
SUnreclaim: 68724 kB
SUnreclaim: 68444 kB
SUnreclaim: 69488 kB
SUnreclaim: 68328 kB
SUnreclaim: 55504 kB
SUnreclaim: 56220 kB
SUnreclaim: 67256 kB
SUnreclaim: 69100 kB
SUnreclaim: 69472 kB
SUnreclaim: 72244 kB
SUnreclaim: 69244 kB
SUnreclaim: 58472 kB
SUnreclaim: 69336 kB
SUnreclaim: 55620 kB
SUnreclaim: 66972 kB
SUnreclaim: 58468 kB
SUnreclaim: 68452 kB
SUnreclaim: 57408 kB
SUnreclaim: 69836 kB
SUnreclaim: 66720 kB
SUnreclaim: 58024 kB
SUnreclaim: 58348 kB

Yi Wang (wangyi4)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Jim Somerville (jsomervi) wrote :

Yi, you can feel free to assign the LP over to me and I'll take care of submitting it. If it is not assigned to me I'll assume that you're doing it.

Revision history for this message
Jim Somerville (jsomervi) wrote :

I noticed that I didn't actually answer your question. Yes, I'd prefer to do it.

Revision history for this message
Yi Wang (wangyi4) wrote :

Jim, okay, I will assign the LP to you.

Revision history for this message
Yi Wang (wangyi4) wrote :

Oh, Jim, I just found that I don't have permission to reassign the LP to other people. You will need to take it yourself.

Revision history for this message
Cindy Xie (xxie1) wrote :

Assigned to Jim Somerville as he will submit kernel patches to both the std and rt kernels.

Changed in starlingx:
assignee: Yi Wang (wangyi4) → Jim Somerville (jsomervi)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/678305

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/678305
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=d7784ee45221ac8cedf6d53b6179fb8e5880bc29
Submitter: Zuul
Branch: master

commit d7784ee45221ac8cedf6d53b6179fb8e5880bc29
Author: Jim Somerville <email address hidden>
Date: Fri Aug 23 16:34:48 2019 -0400

    Fix kernel memory leaks in ipvs and ext4

    These leaks were observed in the RT kernel but the fixes
    are not RT specific. We deemed it prudent to also
    include the fixes in the std kernel as well.

    See the specific patches for details.

    Change-Id: I00e6d06a82e289806e5d51008ea1597735b2ad0f
    Closes-Bug: 1836638
    Signed-off-by: Jim Somerville <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/682995

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (r/stx.2.0)

Reviewed: https://review.opendev.org/682995
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=507c2ecab3405527999bf8f1403316cc70c60c6a
Submitter: Zuul
Branch: r/stx.2.0

commit 507c2ecab3405527999bf8f1403316cc70c60c6a
Author: Jim Somerville <email address hidden>
Date: Fri Aug 23 16:34:48 2019 -0400

    Fix kernel memory leaks in ipvs and ext4

    These leaks were observed in the RT kernel but the fixes
    are not RT specific. We deemed it prudent to also
    include the fixes in the std kernel as well.

    See the specific patches for details.

    Change-Id: I00e6d06a82e289806e5d51008ea1597735b2ad0f
    Closes-Bug: 1836638
    Signed-off-by: Jim Somerville <email address hidden>
    (cherry picked from commit d7784ee45221ac8cedf6d53b6179fb8e5880bc29)

Ghada Khalil (gkhalil)
tags: added: in-r-stx20