stress-ng triggering oomkiller running brk and stack stressors on all arches.

Bug #1729878 reported by Jeff Lane 
Affects                Status     Importance  Assigned to
Stress-ng              Invalid    Undecided   Colin Ian King
linux (Ubuntu)         Won't Fix  Medium      Unassigned
linux (Ubuntu) Artful  Won't Fix  Medium      Unassigned

Bug Description

During regression testing on a z/VM instance, I noticed call traces being dumped to dmesg that seem to be related to stress-ng.

The stress-ng invocation we're using is:
stress-ng --aggressive --verify --timeout $runtime \
                                 --metrics-brief --tz --times \
                                 --af-alg 0 --bsearch 0 --context 0 --cpu 0 \
                                 --crypt 0 --hsearch 0 --longjmp 0 --lsearch 0 \
                                 --matrix 0 --qsort 0 --str 0 --stream 0 \
                                 --tsearch 0 --vecmath 0 --wcs 0

as executed by the cpu_stress script from the cpu/cpu_stress_ng test in the certification suite.

ProblemType: Bug
DistroRelease: Ubuntu 17.10
Package: linux-image-4.13.0-16-generic 4.13.0-16.19
ProcVersionSignature: Ubuntu 4.13.0-16.19-generic 4.13.4
Uname: Linux 4.13.0-16-generic s390x
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.7-0ubuntu3.1
Architecture: s390x
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Fri Nov 3 11:12:04 2017
HibernationDevice: RESUME=UUID=1ce0f037-449a-43a1-af49-e730f6f99ac4
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1:
PciMultimedia:

ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C
 SHELL=/bin/bash
ProcFB: Error: [Errno 2] No such file or directory: '/proc/fb'
ProcKernelCmdLine: root=/dev/mapper/hwe0008_vg-lv crashkernel=196M BOOT_IMAGE=0
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-16-generic N/A
 linux-backports-modules-4.13.0-16-generic N/A
 linux-firmware 1.169
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: Upgraded to artful on 2017-11-02 (0 days ago)
---
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.7-0ubuntu3.1
Architecture: s390x
ArecordDevices: Error: [Errno 2] No such file or directory
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
DistroRelease: Ubuntu 17.10
HibernationDevice: RESUME=UUID=1ce0f037-449a-43a1-af49-e730f6f99ac4
IwConfig: Error: [Errno 2] No such file or directory
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1:
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C
 SHELL=/bin/bash
ProcFB: Error: [Errno 2] No such file or directory: '/proc/fb'
ProcKernelCmdLine: root=/dev/mapper/hwe0008_vg-lv crashkernel=196M BOOT_IMAGE=0
ProcVersionSignature: Ubuntu 4.13.0-16.19-generic 4.13.4
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-16-generic N/A
 linux-backports-modules-4.13.0-16-generic N/A
 linux-firmware 1.169
RfKill: Error: [Errno 2] No such file or directory
Tags: artful
Uname: Linux 4.13.0-16-generic s390x
UpgradeStatus: Upgraded to artful on 2017-11-02 (0 days ago)
UserGroups: adm cdrom cpacfstats dip lpadmin lxd plugdev sambashare sudo
_MarkForUpload: True

Revision history for this message
Jeff Lane  (bladernr) wrote :

It's possible these are just OOM killer related, but the first few call traces seem centered around snapd, and I don't notice any OOM killer messages until towards the end of the dmesg log.

Revision history for this message
Jeff Lane  (bladernr) wrote :

I'm also not sure why we're hitting the OOM killer. The memory configuration is light, but not so light that we'd trigger this sort of thing on other systems; perhaps it's s390x-specific:

ubuntu@hwe0008:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3807         172        3280           1         354        3584
Swap:          5707          54        5652

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1729878

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Jeff Lane  (bladernr) wrote : AudioDevicesInUse.txt

apport information

tags: added: apport-collected
description: updated
Revision history for this message
Jeff Lane  (bladernr) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Jeff Lane  (bladernr) wrote : JournalErrors.txt

apport information

Revision history for this message
Jeff Lane  (bladernr) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Jeff Lane  (bladernr) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
Jeff Lane  (bladernr) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Jeff Lane  (bladernr) wrote : ProcModules.txt

apport information

Revision history for this message
Jeff Lane  (bladernr) wrote : UdevDb.txt

apport information

Revision history for this message
Jeff Lane  (bladernr) wrote : WifiSyslog.txt

apport information

Revision history for this message
Jeff Lane  (bladernr) wrote : Re: stress-ng triggering call-traces on s390x z/VM instances

Fix your bot, guys... it tells me logs are missing and to run apport-collect which just collects the exact same logs ubuntu-bug collected the first time.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Do you know which prior kernels did not exhibit this bug? Was it 4.12 or earlier?

Could you test the mainline kernel to see if this bug is already fixed upstream:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc7

Changed in linux (Ubuntu):
importance: Undecided → Medium
Changed in linux (Ubuntu Artful):
status: New → Confirmed
importance: Undecided → Medium
tags: added: kernel-da-key
Revision history for this message
Colin Ian King (colin-king) wrote :

This seems like stress-ng was eating a load of pages, and snapd just got victimized because it was expanding its memory when there were no free pages. The OOM killer can seem arbitrary like this; I expect there is a good reason for it.

Revision history for this message
Jeff Lane  (bladernr) wrote : Re: [Bug 1729878] Re: stress-ng triggering call-traces on s390x z/VM instances

On Fri, Nov 3, 2017 at 1:13 PM, Joseph Salisbury
<email address hidden> wrote:
> Do you know which prior kernels did not exhibit this bug? Was it 4.12
> or earlier?

I "think" it happened with 4.10 as well, but I'm not positive, I ran
some tests before realizing I was still on 17.04 and upgrading the VM
to 17.10.

> Could you test the mainline kernel to see if this bug is already fixed upstream:
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc7

Just a heads up, no, I can't:

Build for s390x failed (see BUILD.LOG.s390x):
  linux-headers-4.14.0-041400rc7_4.14.0-041400rc7.201710292231_all.deb
  *_s390x.deb
\

Is there a better place (internal perhaps) to get mainline builds?

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

On 3 November 2017 at 19:27, Jeff Lane <email address hidden> wrote:
> On Fri, Nov 3, 2017 at 1:13 PM, Joseph Salisbury
> <email address hidden> wrote:
>> Do you know which prior kernels did not exhibit this bug? Was it 4.12
>> or earlier?
>
> I "think" it happened with 4.10 as well, but I'm not positive, I ran
> some tests before realizing I was still on 17.04 and upgrading the VM
> to 17.10.
>
>> Could you test the mainline kernel to see if this bug is already fixed upstream:
>> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc7
>
> Just a heads up, no, I cant:
>
> Build for s390x failed (see BUILD.LOG.s390x):
> linux-headers-4.14.0-041400rc7_4.14.0-041400rc7.201710292231_all.deb
> *_s390x.deb
> \
>
> Is there a better place (internal perhaps) to get mainline builds?
>

There are no better places, and it seems like that has been fixed up now.

--
Regards,

Dimitri.

Revision history for this message
Jeff Lane  (bladernr) wrote :

So over the weekend, I retried with 4.14.0-041400rc6-generic (since
rc7 was not successfully built when I tried this).

This too ended with call traces in the kernel log after runs of stress-ng.

I only ran our memory stress and system-wide CPU stress tests; both
tests indicated that they passed (stress-ng exited with a success
code), but I noticed three call traces in the dmesg log. I've
attached dmesg taken before and after the test run for comparison.

Just for due diligence, I will also update to rc7 and try again, now
that the packages have been fixed, per the update from xnox.

Revision history for this message
Jeff Lane  (bladernr) wrote : Re: stress-ng triggering call-traces on s390x z/VM instances

OK, so a little bit further: stress-ng seems to be triggering the OOM killer, which I'm not sure it should be, unless it's just really writing a LOT of data to memory.

My setup has been the same on all architectures and I've seen the same behaviour on Artful deployments on amd64 (Xeon Phi), Power8 and s390x:

- created swap space that is 2x the amount of RAM, per the output of free (a sketch of this step is below)
- ran the stress-ng based memory stress test from the certification suite
- stress-ng tells us that the tests all pass
- dmesg reveals invocations of the OOM killer
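
For reference, a minimal sketch of the swap setup step (this assumes a swap file; the exact commands used by the deployment are not part of this report):

#!/bin/bash
# Sketch only (assumption): create and enable a swap file sized at 2x installed RAM.
ram_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
swap_mb=$(( ram_kb * 2 / 1024 ))

sudo fallocate -l "${swap_mb}M" /swapfile2x
sudo chmod 600 /swapfile2x
sudo mkswap /swapfile2x
sudo swapon /swapfile2x

free -m                                 # confirm the new swap size
dmesg | grep -iE 'out of memory|oom'    # look for OOM killer activity afterwards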

This is all handled by the memory_stress_ng script in the cert suite. The tests we run are defined by this block:

crt_stressors=("bsearch" "context" "hsearch" "lsearch" "matrix" \
               "memcpy" "null" "pipe" "qsort" "stack" "str" "stream" \
               "tsearch" "vm-rw" "wcs" "zero" "mlock" "mmapfork" "mmapmany" \
               "mremap" "shm-sysv" "vm-splice")
if [ "$numa_nodes" -gt 1 ]; then
    crt_stressors+=("numa")
fi

and we run them individually, one test case at a time.

I've attached a tarball with the stderr and stdout from stress-ng, dmesg contents and the test script that runs all this.

Revision history for this message
Jeff Lane  (bladernr) wrote :

The call that invokes stress-ng looks like this:

timeout -s 14 $end_time stress-ng -k --aggressive --verify --timeout $runtime --$1 0

Where $end_time is the max time before killing stress-ng IF stress-ng runs longer than the --timeout value. $1 is the stressor invoked, and as I said above, we run the stressors one at a time, not in parallel.
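
Put together, a hypothetical, simplified sketch of that loop (the real memory_stress_ng script is attached above and differs in detail):

#!/bin/bash
# Hypothetical, simplified sketch of the per-stressor loop described above.
runtime=300                      # --timeout handed to stress-ng, in seconds
end_time=$(( runtime + 120 ))    # hard kill via timeout(1) if stress-ng overruns

crt_stressors=("stack" "brk")    # trimmed list; the full array is shown earlier

for stressor in "${crt_stressors[@]}"; do
    echo "Running stress-ng $stressor stressor for $runtime seconds...."
    timeout -s 14 "$end_time" stress-ng -k --aggressive --verify \
        --timeout "$runtime" "--$stressor" 0
    echo "return_code is $?"
done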

Jeff Lane  (bladernr)
summary: - stress-ng triggering call-traces on s390x z/VM instances
+ stress-ng triggering oomkiller running brk and stack stressors on all
+ arches.
Revision history for this message
Jeff Lane  (bladernr) wrote :

Ok, I've narrowed it down to the brk and stack stressors in stress-ng. I've recreated this on amd64, ppc64el and s390x across 4.4, 4.10 and 4.13 kernels.

stress-ng reports the tests as successful, so I guess the only thing remaining is to confirm that hitting the OOM killer is expected behaviour when running these stressors.

I've tried setting up swap space that is three times the amount of physical RAM, and this still invokes the OOM killer.

ubuntu@xwing:~$ free -lm
              total        used        free      shared  buff/cache   available
Mem:           3838         105        3598           0         133        3517
Low:           3838         239        3598
High:             0           0           0
Swap:         11514          68       11446

And from looking at vmstat while it's running:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 0 0 37 3618 5 138 0 2 26 2880 154 138 0 1 60 39 0
 0 9 257 82 1 114 0 22 183 23495 1284 907 0 3 61 36 0
 0 9 891 84 0 110 0 63 9 65792 3266 1865 0 2 6 92 0
 0 10 1601 87 0 108 0 71 1 73444 3466 1682 0 3 6 92 0
 1 8 2300 83 0 108 0 70 6 72316 3494 1673 0 3 6 91 0
 0 10 2926 82 1 106 0 63 165 64884 2940 1593 0 2 8 90 0
 0 9 3592 95 0 106 0 67 69 68825 2987 1672 0 3 8 90 0
 0 9 4326 82 0 104 0 73 16 76003 3376 1645 0 3 5 92 0
 0 9 4910 81 0 103 0 58 177 60429 2713 1596 0 2 8 90 0
 0 10 5581 81 0 104 0 67 122 69519 2680 1581 0 2 13 85 0
 0 10 6317 89 0 105 0 74 13 76143 3389 1613 0 3 5 92 0
 0 9 6972 83 1 104 0 66 103 67807 2964 1711 0 3 7 90 0
 0 9 7591 103 1 105 0 62 39 64041 2918 1530 0 3 8 89 0
 0 9 8248 84 1 105 0 66 17 68101 3208 1573 0 3 6 91 0
 0 10 8946 82 1 105 0 70 0 72289 2934 1615 0 3 7 90 0
 0 9 9586 82 1 104 0 64 60 66192 2953 1725 0 3 6 90 0
 0 9 10276 100 1 104 0 69 3 71434 2986 1694 0 3 9 88 0
 0 10 10958 82 1 103 0 68 11 70660 3365 1704 0 4 8 88 0
 0 9 11455 82 1 102 0 50 53 51572 2269 1533 0 2 15 82 0
 0 11 10249 87 0 106 0 26 783 27601 1323 1721 0 2 16 82 0
 0 10 10407 82 0 105 0 16 6 16883 1487 1438 0 1 39 60 0
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r b swpd free buff cache si so bi bo in cs us sy id wa st
 0 11 10543 82 0 109 0 14 558 14774 1523 2202 0 1 31 68 0
 0 12 10672 82 1 113 0 13 598 13945 1552 2592 0 1 17 82 0
 1 8 10844 84 1 115 0 17 499 18516 1434 1921 0 1 22 77 0
 0 9 11080 81 1 113 0 24 0 25004 1782 1547 0 2 32 67 0
 0 9 11338 82 1 113 ...


Revision history for this message
Jeff Lane  (bladernr) wrote :

So I did this on 4.4 on ppc64le with 32GB RAM and 96GB Swap:
ubuntu@entei:~$ free -lm
              total        used        free      shared  buff/cache   available
Mem:          32565         656       30603           2        1304       30503
Low:          32565        1961       30603
High:             0           0           0
Swap:         97695         342       97353

And this time the OOM killer was not hit.

ubuntu@entei:~$ ./memory_stress_ng
Total memory is 32 GiB
Constant run time is 300 seconds per stressor
Variable run time is 620 seconds per stressor
Estimated total run time is 20 minutes

Running stress-ng stack stressor for 300 seconds....
stress-ng: info: [15395] dispatching hogs: 128 stack
stress-ng: info: [15395] cache allocate: using built-in defaults as unable to determine cache details
stress-ng: info: [15395] cache allocate: default cache size: 2048K
stress-ng: info: [15395] successful run completed in 317.89s (5 mins, 17.89 secs)
return_code is 0
Running stress-ng numa stressor for 300 seconds....
stress-ng: info: [15663] dispatching hogs: 128 numa
stress-ng: info: [15663] cache allocate: using built-in defaults as unable to determine cache details
stress-ng: info: [15663] cache allocate: default cache size: 2048K
stress-ng: info: [15664] stress-ng-numa: system has 1 of a maximum 256 memory NUMA nodes
stress-ng: info: [15663] successful run completed in 300.08s (5 mins, 0.08 secs)
return_code is 0
Running stress-ng brk stressor for 620 seconds....
stress-ng: info: [15800] dispatching hogs: 128 brk
stress-ng: info: [15800] cache allocate: using built-in defaults as unable to determine cache details
stress-ng: info: [15800] cache allocate: default cache size: 2048K
stress-ng: info: [15800] successful run completed in 666.16s (11 mins, 6.16 secs)
return_code is 0
*******************************************************************
** stress-ng memory test passed!
*******************************************************************

The only message that appeared in dmesg during this test run is one warning (I cleared the ring buffer with dmesg -C before running the test):
ubuntu@entei:~$ dmesg
[ 7390.968438] stress-ng[15584]: bad frame in setup_rt_frame: 00003fffdd0e1040 nip

And from looking at vmstat, this is the lowest and highest consumption observed:
 0 0 340 30582 14 1299 0 0 3 385 27 25 3 3 75 19 0
 0 260 58068 196 12 1323 2 76 2571 78490 3295 5493 0 0 25 74 0

So at the least, only 340 MB of swap consumed and a full 30 GB of RAM free; at the most, 58 GB of swap consumed and only 196 MB free.

So there DOES seem to be a limit to what these tests are consuming.

Revision history for this message
Jeff Lane  (bladernr) wrote :

Also retried on amd64 and the effect was the same:

ubuntu@xwing:~$ free -lm
              total        used        free      shared  buff/cache   available
Mem:           3838         120        3601           0         115        3510
Low:           3838         236        3601
High:             0           0           0
Swap:          97695          61       97633

No OOM killer messages appeared in dmesg.

Test passed according to test output

and the low/high usage is within what is available:

0 0 62 3605 1 111 0 0 320 1231 252 511 0 0 97 3 0
1 9 19429 86 1 109 0 58 25 60607 2863 2301 0 3 11 85 0

Changed in linux (Ubuntu):
status: Confirmed → Invalid
Changed in linux (Ubuntu Artful):
status: Confirmed → Invalid
Revision history for this message
Jeff Lane  (bladernr) wrote :

Given that, I'm marking the kernel tasks as Invalid, as this appears to be stress-ng being over-aggressive.

MAAS deployments (and I presume server installs in general) only provide about 4-8 GB of swap at most, in a swap file created in the root filesystem (no swap partition anymore).

It is not reasonable, IMO, to create 100 GB swap files just to be sure we don't accidentally trip the OOM killer when running memory tests.

Also, the ratio seems to be variable. On my amd64 system, the swap consumed was 5x the amount of installed RAM. On the ppc64le system, the amount was just shy of 2x the amount of physical RAM (this is also a NUMA-capable system, so maybe that's a contributing factor, beyond the differences in how each arch handles memory, which I freely admit to not understanding well).
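
For what it's worth, those ratios follow from the peak vmstat swpd values quoted earlier (assuming the columns are in MB, which matches the free -m figures):

# Back-of-the-envelope check of the swap/RAM ratios mentioned above,
# using figures copied from the vmstat and free output in earlier comments.
awk 'BEGIN {
    printf "amd64:   %.1fx (19429 MB peak swap vs 3838 MB RAM)\n", 19429/3838
    printf "ppc64le: %.1fx (58068 MB peak swap vs 32565 MB RAM)\n", 58068/32565
}'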

Revision history for this message
Jeff Lane  (bladernr) wrote :

I'm still able to trigger problems with the "stack" stressor on the Xeon Phi...

ubuntu@lalande:~$ sudo ./memory_stress_ng
Total memory is 63 GiB
Constant run time is 300 seconds per stressor
Variable run time is 930 seconds per stressor
Estimated total run time is 20 minutes

Running stress-ng stack stressor for 300 seconds....
stress-ng: info: [4431] dispatching hogs: 256 stack
stress-ng: info: [4431] successful run completed in 742.49s (12 mins, 22.49 secs)
return_code is 124
*****************************************************************
** Error 124 reported on stressor stack!)
*****************************************************************
Running stress-ng brk stressor for 930 seconds....
stress-ng: info: [5007] dispatching hogs: 256 brk
stress-ng: info: [5007] successful run completed in 1412.12s (23 mins, 32.12 secs)
return_code is 124
*****************************************************************
** Error 124 reported on stressor brk!)
*****************************************************************
*******************************************************************
** stress-ng memory test failed; most recent error was 124
*******************************************************************

ubuntu@lalande:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:          64281        1133       62825          11         322       62530
Swap:        796134           0      796134

I'm attaching a tarball with dmesg, the output of vmstat 1 that shows the memory consumption during testing, and a copy of the test script being used.

Changed in linux (Ubuntu):
status: Invalid → Confirmed
Changed in linux (Ubuntu Artful):
status: Invalid → Confirmed
Revision history for this message
Jeff Lane  (bladernr) wrote :

Because of the above, I've re-opened the kernel tasks :(

Revision history for this message
Jeff Lane  (bladernr) wrote :

I'm not able to recreate this on a smaller generic amd64 machine:

ubuntu@xwing:~$ free
              total        used        free      shared  buff/cache   available
Mem:        3930172      133912     3319200        6124      477060     3500044
Swap:     100040060           0   100040060

ubuntu@xwing:~$ sudo ./memory_stress_ng
Total memory is 4 GiB
Constant run time is 300 seconds per stressor
Variable run time is 340 seconds per stressor
Estimated total run time is 10 minutes

Running stress-ng stack stressor for 300 seconds....
stress-ng: info: [4058] dispatching hogs: 8 stack
stress-ng: info: [4058] cache allocate: default cache size: 8192K
stress-ng: info: [4058] successful run completed in 302.48s (5 mins, 2.48 secs)
return_code is 0
Running stress-ng brk stressor for 340 seconds....
stress-ng: info: [4129] dispatching hogs: 8 brk
stress-ng: info: [4129] cache allocate: default cache size: 8192K
stress-ng: info: [4129] successful run completed in 342.44s (5 mins, 42.44 secs)
return_code is 0
*******************************************************************
** stress-ng memory test passed!
*******************************************************************

Jeff Lane  (bladernr)
tags: added: hwcert-server
tags: removed: kernel-da-key
Revision history for this message
Colin Ian King (colin-king) wrote :

Is this still an open bug or can it be closed now?

Revision history for this message
Andy Whitcroft (apw) wrote : Closing unsupported series nomination.

This bug was nominated against a series that is no longer supported, ie artful. The bug task representing the artful nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Artful):
status: Confirmed → Won't Fix
Revision history for this message
Colin Ian King (colin-king) wrote :

These stressors are expected to trigger the OOM killer. The OOM adjustment for the stressor child processes is always set to make them the first processes to be OOM'd. However, the kernel makes its choice of what gets OOM'd based on many factors, so the stressors may end up causing other hoggy processes to be OOM'd as well.

This is not a bug; it is expected behaviour. See the stress-ng manual: it clearly states when a stressor can trigger the OOM killer and how each stress-ng stressor copes with this, for example:

       --brk N
              start N workers that grow the data segment by one page at a time using
              multiple brk(2) calls. Each successfully allocated new page is touched
              to ensure it is resident in memory. If an out of memory condition
              occurs then the test will reset the data segment to the point before
              it started and repeat the data segment resizing over again. The
              process adjusts the out of memory setting so that it may be killed by
              the out of memory (OOM) killer before other processes. If it is
              killed by the OOM killer then it will be automatically re-started by a
              monitoring parent process.
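
For illustration, the "adjusts the out of memory setting" step the manual refers to boils down to something like the following (an illustrative sketch of the mechanism, not stress-ng's actual code):

#!/bin/bash
# Illustrative sketch: mark the current shell (and any children it forks,
# which inherit the value) as the preferred OOM-killer victim.
echo 1000 > /proc/self/oom_score_adj    # 1000 = most eligible for OOM kill
cat /proc/self/oom_score_adj            # verify the new setting
# Anything memory-hungry started from this shell will now be killed first
# when the system runs out of memory.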

I shall close this as it is not a bug. Stress-ng is expected to stress a kernel to trigger OOMs.

Changed in stress-ng:
status: New → Invalid
assignee: nobody → Colin Ian King (colin-king)
Brad Figg (brad-figg)
tags: added: cscc
Changed in linux (Ubuntu):
status: Confirmed → Won't Fix