Failures of stack stressor in stress-ng 0.10.07 (in Eoan)

Bug #1851316 reported by Rod Smith
This bug affects 1 person
Affects             Status        Importance  Assigned to     Milestone
Stress-ng           Fix Released  High        Colin Ian King
stress-ng (Ubuntu)  Fix Released  Undecided   Unassigned
  Eoan              Fix Released  Undecided   Unassigned
  Focal             Fix Released  Undecided   Unassigned

Bug Description

== SRU Justification Eoan ==

Doing regression testing of Ubuntu 19.10, we've been seeing frequent failures of stress-ng's stack stressor. These systems have stress-ng 0.10.07-1. For instance (running manually, not in the Checkbox test script):

ubuntu@kzanol:~$ stress-ng -k --aggressive --verify --timeout 30 --stack 0
stress-ng: info: [5051] dispatching hogs: 4 stack
stress-ng: info: [5051] unsuccessful run completed in 32.01s
ubuntu@kzanol:~$ stress-ng -k --aggressive --verify --timeout 30 --stack 0
stress-ng: info: [5064] dispatching hogs: 4 stack
stress-ng: info: [5064] successful run completed in 32.96s
ubuntu@kzanol:~$ stress-ng -k --aggressive --verify --timeout 30 --stack 0
stress-ng: info: [5077] dispatching hogs: 4 stack
stress-ng: info: [5077] unsuccessful run completed in 33.09s

This problem seems to affect some systems more than others. In my testing, some computers fail a majority of test runs of 30 seconds or longer (our default test run is 300s in length), but others haven't failed once over several such test runs. I haven't yet identified why some systems fail and others don't.
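
To put a number on how often a given machine fails, the manual runs above can be looped. This is a minimal sketch and not part of the original report: count_failures is a hypothetical helper name, and the sketch assumes that stress-ng exits with a non-zero status whenever it reports an unsuccessful run.

```shell
#!/bin/sh
# count_failures RUNS CMD [ARGS...]
# Hypothetical helper: runs CMD the given number of times and prints how
# many of those runs exited with a non-zero status.
count_failures() {
    runs=$1
    shift
    failed=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Only the exit status matters here, so discard the command's output.
        if ! "$@" >/dev/null 2>&1; then
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed"
}

# Example (assumes a non-zero exit status on an unsuccessful run):
#   count_failures 10 stress-ng -k --aggressive --verify --timeout 30 --stack 0
```

On an affected system the printed count would be expected to be well above zero; on an unaffected one it should stay at 0.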

In testing this, I installed Ubuntu 19.04, which comes with stress-ng 0.09.57-0ubuntu3, on one affected system, and encountered no problems. Upon upgrading to stress-ng 0.10.07-1 from Ubuntu 19.10, the problems returned. Thus, either this is a problem with stress-ng 0.10.07-1, or this new version of stress-ng is detecting previously undetected problems on multiple servers.

== Test case ==

Run stress-ng -k --aggressive --verify --timeout 30 --stack 0 multiple times and interrupt it with control-C (SIGINT). This can trigger a segfault. With the fix, the segfault cannot be triggered.
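
The manual control-C step can be approximated in a script. The sketch below is an illustration only: sigint_run is a hypothetical helper name, it relies on timeout from GNU coreutils to deliver SIGINT, and it assumes the stressor is run with -v so that a crash shows up as a "Segmentation fault" debug line like the ones quoted later in this report.

```shell
#!/bin/sh
# sigint_run SECONDS LOGFILE CMD [ARGS...]
# Hypothetical helper: runs CMD, sends it SIGINT after SECONDS, saves all of
# its output to LOGFILE, and returns 0 only if LOGFILE contains no
# "Segmentation fault" text.
sigint_run() {
    secs=$1
    log=$2
    shift 2
    # timeout exits non-zero when it has to interrupt CMD; that is expected.
    timeout --signal=INT "$secs" "$@" >"$log" 2>&1 || true
    ! grep -q 'Segmentation fault' "$log"
}

# Example: interrupt a 30-second run after 5 seconds and check the log.
#   sigint_run 5 /tmp/stack.log \
#       stress-ng -k --aggressive --verify --timeout 30 --stack 0 -v \
#       || echo "segfault seen, see /tmp/stack.log"
```

Repeating the call in a loop mirrors the "multiple times" part of the test case.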

== Fix ==

Upstream stress-ng commits:
    - 10ffe40579c5 stress-stack: return error code in child using
                   _exit() and not return
    - ef18c524df48 stress-stack: don't throw a fatal error when
                   sigaltstack fails
    - 6245e5f62eae stress-stack: check for ENOMEM fork failure and retry
    - 8fb67daea592 stress-stack: setup alternative stack in child only

The first 3 commits are prerequisite fixes; the final commit addresses the main issue.

== Regression Potential ==

This affects only the stress-ng stack stressor. The fixes are upstream fixes that are already tested and present in stress-ng in Ubuntu Focal. They have been regression tested on arm64, amd64, i386, s390x and ppc64el architectures, so test coverage of these fixes is good. The fixes change the stack used by the signal handler and the way a child stress process exits, so the effects of the changes are small in the context of the stress test.

Changed in stress-ng:
importance: Undecided → High
assignee: nobody → Colin Ian King (colin-king)
status: New → In Progress
Colin Ian King (colin-king) wrote :

Can you run the same command with the verbose option -v to see if that shows any useful debug messages?

stress-ng -k --aggressive --verify --timeout 30 --stack 0 -v

Rod Smith (rodsmith) wrote :

Here are unsuccessful and successful runs, respectively, with -v:

ubuntu@kzanol:~$ sudo stress-ng -k --aggressive --verify --timeout 30 --stack 0 -v
stress-ng: debug: [4498] 4 processors online, 4 processors configured
stress-ng: info: [4498] dispatching hogs: 4 stack
stress-ng: debug: [4498] cache allocate: default cache size: 3072K
stress-ng: debug: [4498] starting stressors
stress-ng: debug: [4499] stress-ng-stack: started [4499] (instance 0)
stress-ng: debug: [4500] stress-ng-stack: started [4500] (instance 1)
stress-ng: debug: [4498] 4 stressors spawned
stress-ng: debug: [4501] stress-ng-stack: started [4501] (instance 2)
stress-ng: debug: [4502] stress-ng-stack: started [4502] (instance 3)
stress-ng: debug: [4498] process [4499] (stress-ng-stack) terminated on signal: 11 (Segmentation fault)
stress-ng: debug: [4498] process [4499] terminated
stress-ng: debug: [4498] process [4500] (stress-ng-stack) terminated on signal: 11 (Segmentation fault)
stress-ng: debug: [4498] process [4500] terminated
stress-ng: debug: [4498] process [4501] (stress-ng-stack) terminated on signal: 11 (Segmentation fault)
stress-ng: debug: [4498] process [4501] terminated
stress-ng: debug: [4498] process [4502] (stress-ng-stack) terminated on signal: 11 (Segmentation fault)
stress-ng: debug: [4498] process [4502] terminated
stress-ng: info: [4498] unsuccessful run completed in 32.09s

ubuntu@kzanol:~$ sudo stress-ng -k --aggressive --verify --timeout 30 --stack 0 -v
stress-ng: debug: [4606] 4 processors online, 4 processors configured
stress-ng: info: [4606] dispatching hogs: 4 stack
stress-ng: debug: [4606] cache allocate: default cache size: 3072K
stress-ng: debug: [4606] starting stressors
stress-ng: debug: [4607] stress-ng-stack: started [4607] (instance 0)
stress-ng: debug: [4608] stress-ng-stack: started [4608] (instance 1)
stress-ng: debug: [4606] 4 stressors spawned
stress-ng: debug: [4609] stress-ng-stack: started [4609] (instance 2)
stress-ng: debug: [4610] stress-ng-stack: started [4610] (instance 3)
stress-ng: debug: [4607] stress-ng-stack: exited [4607] (instance 0)
stress-ng: debug: [4610] stress-ng-stack: exited [4610] (instance 3)
stress-ng: debug: [4608] stress-ng-stack: exited [4608] (instance 1)
stress-ng: debug: [4609] stress-ng-stack: exited [4609] (instance 2)
stress-ng: debug: [4606] process [4607] terminated
stress-ng: debug: [4606] process [4608] terminated
stress-ng: debug: [4606] process [4609] terminated
stress-ng: debug: [4606] process [4610] terminated
stress-ng: info: [4606] successful run completed in 32.27s

(I realized a bit late that our test script was being run as root, so I added "sudo" to these runs; but the pattern of success vs. failure doesn't seem to be affected by this change.)
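
For longer logs, the crashed stressor instances can be pulled out of the -v output mechanically. This is a small sketch rather than anything used in the thread: segfaulted_pids is a hypothetical helper name, and the pattern is written against the debug line format shown in the unsuccessful run above.

```shell
#!/bin/sh
# segfaulted_pids
# Hypothetical helper: reads stress-ng -v output on stdin and prints the PID
# of each stack stressor process reported as terminated on signal 11.
segfaulted_pids() {
    sed -n 's/.*process \[\([0-9]*\)\] (stress-ng-stack) terminated on signal: 11 .*/\1/p'
}

# Example: capture a verbose run (both output streams) and list crashed PIDs.
#   stress-ng -k --aggressive --verify --timeout 30 --stack 0 -v 2>&1 | segfaulted_pids
```

For the unsuccessful run quoted above this would list the four stressor PIDs; for the successful run it prints nothing.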

Colin Ian King (colin-king) wrote :

Any specific architecture?

Colin Ian King (colin-king) wrote :

The only specific change to stress-ng for this kind of issue was:

https://bugs.launchpad.net/ubuntu/+source/stress-ng/+bug/1845464

...and that should have resolved the segfaulting. Perhaps it breaks on some machines with this change.

Rod Smith (rodsmith) wrote :

So far, I've encountered this only on x86-64 systems. We haven't tested i386 in quite a while. Jeff has done some regression testing on IBM Z-series and has had no problems, but testing on that has been limited.

Colin Ian King (colin-king) wrote :

I'm having no success at reproducing this on my H/W. Can I get ssh access to a machine where this fails so I can debug it?

Colin Ian King (colin-king) wrote :

Can you run stress-ng -V to sanity check the version for me?

Colin Ian King (colin-king) wrote :

No worries, I've reproduced this error now.

Rod Smith (rodsmith) wrote :

In case you still need it:

ubuntu@kzanol:~$ stress-ng -V
stress-ng, version 0.10.07 (gcc 9.2, x86_64 Linux 5.3.0-19-generic) 💻🔥

I can give you access to an affected system in 1SS if you still need it, but I'm assuming from comment #8 that you don't need it now.

Colin Ian King (colin-king) wrote :

I believe I now have a fix and I've pushed it to the repo. Do you mind testing it on the H/W that causes the issue?

(You may have to edit /etc/apt/sources.list to enable all the deb-src sources.)

sudo apt-get build-dep stress-ng
git clone git://kernel.ubuntu.com/cking/stress-ng
cd stress-ng
make clean
make -j $(nproc)

and run the following several times to see if it no longer segfaults:

sudo ./stress-ng -k --aggressive --verify --timeout 30 --stack 0 -v

Rod Smith (rodsmith) wrote :

That seems to do the trick, Colin. I ran about five 30-second runs and three 300-second runs (to match what we do in certification), and I had no problems. Thanks for the quick fix!

Colin Ian King (colin-king) wrote :

Thanks for testing. I'll sort out a release for focal this week and then SRU this fix for eoan too.

Changed in stress-ng:
status: In Progress → Fix Committed
description: updated
Changed in stress-ng:
status: Fix Committed → Fix Released
Changed in stress-ng (Ubuntu Eoan):
status: New → In Progress
Changed in stress-ng (Ubuntu Focal):
status: New → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package stress-ng - 0.10.09-1

---------------
stress-ng (0.10.09-1) unstable; urgency=medium

  * Makefile: bump version
  * Makefile: add stress-af-alg-defconfigs.h to the dist rule
  * stress-af-alg: make buffers static, reduces stack overhead
  * stress-af-alg: add opt_set_funcs helper for non-linux builds
  * Manual: update contributors and date
  * stress-af-alg: some minor code cleanups, no functional change
  * stress-af-alg: handle bind() ETIMEDOUT failures
  * stress-af-alg: add default configs to complement /proc/crypto list
  * stress-af-alg: add defconfigs with --af-alg-dump
  * stress-af-alg: introduce the --af-alg-dump option
  * stress-af-alg: skip 'internal' crypto algorithms
  * stress-af-alg: fix close(fd) to avoid bind() EBUSY
  * stress-af-alg: use 'aead' salg_type for CRYPTO_AEAD on bind()
  * stress-af-alg: fix sockaddr algorithm type on bind()
  * stress-stack: setup alternative stack in child only (LP: #1851316)
  * stress-stack: check for ENOMEM fork failure and retry
  * stress-stack: don't throw a fatal error when sigaltstack fails
  * stress-stack: return error code in child using _exit() and not
    return
  * core-madvise: Add 5.4 MADV_COLD and MADV_PAGEOUT hints
  * stress-prctl: add PR_GET_SPECULATION_CTRL exerciser
  * Manual: update af-alg description
  * Make a couple of const strings static
  * stress-af-alg: fix build errors on undefined macros
  * stress-af-alg: add aead support
  * stress-af-alg: remove some debugging messages
  * stress-af-alg: remove old unused crypto structures
  * stress-af-alg: only add crypto algorithms that are supported by the
    stressor
  * stress-af-alg: use crypto algorithm data from /proc/crypto
  * stress-clone: Add CLONE_NEWCGROUP
  * stress-daemon: add expanding backoff timeout
  * stress-daemon: keep retrying fork if we don't have enough resources
  * stress-daemon: add minor backoff before fork retry (LP: #1849595)
  * stress-vm: print stressor name using args->name rather than literal
    string
  * stress-readahead: print stressor name in failure message
  * stress-matrix-3d: use pr_fail for short error failure messages
  * stress-matrix-3d: use pr_fail_err for short error failure messages
  * stress-iomix: fix one more pr_fail message
  * stress-iomix: use pr_fail_err for short error failure messages

 -- Colin King <email address hidden> Wed, 6 Nov 2019 01:03:05 +0000

Changed in stress-ng (Ubuntu Focal):
status: In Progress → Fix Released
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Rod, or anyone else affected,

Accepted stress-ng into eoan-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/stress-ng/0.10.07-1ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-eoan to verification-done-eoan. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-eoan. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in stress-ng (Ubuntu Eoan):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-eoan
description: updated
Colin Ian King (colin-king) wrote :

I've just run stress-ng 0.10.07-1ubuntu1 on a 24 CPU ARM64 Eoan platform and verified this is OK. I ran the stress test 50 times, interrupting each run with SIGINT, without any segfault issues. Marking this as verified.

tags: added: verification-done verification-done-eoan
removed: verification-needed verification-needed-eoan
Rod Smith (rodsmith) wrote :

If this is the same fix that Colin posted earlier and that I tested, then it's fine; I reported my test results in post #11.

Jeff Lane  (bladernr) wrote :

Also confirming this fixes the issue on the Power8 LPAR that was failing with the previous version of stress-ng.

Colin Ian King (colin-king) wrote :

@SRU team, can this be uploaded soon?

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package stress-ng - 0.10.07-1ubuntu1

---------------
stress-ng (0.10.07-1ubuntu1) eoan; urgency=medium

  * stress-stack: fix segfaults when handling signals (LP: #1851316)
    requires upstream fix and prerequisite patches:
    - 10ffe40579c5 stress-stack: return error code in child using
                   _exit() and not return
    - ef18c524df48 stress-stack: don't throw a fatal error when
                   sigaltstack fails
    - 6245e5f62eae stress-stack: check for ENOMEM fork failure and retry
    - 8fb67daea592 stress-stack: setup alternative stack in child only

 -- Colin King <email address hidden> Wed, 6 Nov 2019 09:02:01 +0000

Changed in stress-ng (Ubuntu Eoan):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for stress-ng has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
