Hardware test stress-ng-cpu-long bind failed

Bug #1815123 reported by Shane Peters on 2019-02-07
This bug affects 1 person
Affects              Status         Importance   Assigned to
MAAS                 Invalid        High         Lee Trager
Stress-ng            Fix Released   Medium       Colin Ian King
stress-ng (Ubuntu)   Fix Released   Undecided    Unassigned
  Bionic             Fix Released   Undecided    Unassigned

Bug Description

SRU Request, Bionic

[Justification]
The af-alg stressor in stress-ng reports bind failures when resources run low. These are not genuine errors and should be handled silently rather than causing the stressor to bail out and finish prematurely.

[Fix]
Upstream fix:
      a5c2cb02e8ed check for EBUSY bind failures
And prerequisites:
      13c4c58d0150 expand error message to capture more information
      39184c74f1e0 forgot to add in \n
      aed180cb7b2f make ENOKEY a non-critical failure
      7f1a617adcd6 skip over ciphers that may not exist
      88cbe87a3cc1 fix errno = ENOENT assignment, should be == comparison
      3ec28f2f5438 return EXIT_NOT_IMPLEMENTED if protocol is not supported
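
The intent of this patch series can be illustrated with a small sketch. This is not the stress-ng code (which is written in C); `classify_bind_error` and `NON_CRITICAL` are hypothetical names, and the set of tolerated errno values is an assumption: EBUSY, ENOKEY and ENOENT are taken from the commit titles above, while ETIMEDOUT is included only because errno=110 is what the test case below reports.

```python
import errno

# Assumption: errno values for which the stressor should skip the
# algorithm and carry on instead of reporting a failure.
NON_CRITICAL = {
    errno.EBUSY,
    errno.ENOENT,
    errno.ETIMEDOUT,
    getattr(errno, "ENOKEY", 126),  # 126 is the Linux value
}


def classify_bind_error(err: int) -> str:
    """Return 'skip' for tolerated bind failures, 'fail' otherwise."""
    return "skip" if err in NON_CRITICAL else "fail"
```

With this classification, a bind failure caused by resource pressure becomes a reason to move on to the next algorithm rather than a reason to end the stressor.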

[Testcase]

Without the fix, testing on systems with a large number of CPUs shows:

stress-ng --af-alg 0 -t 60

stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
stress-ng-af-alg: bind failed, errno=110 (Connection timed out)

With the fix, the "Connection timed out" error is no longer reported and the test passes.
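
When verifying the test case on many machines, it may help to count the failure lines in captured output rather than reading it by eye. The following is a minimal sketch, assuming the output has been saved as text and that failure lines look exactly like the ones quoted above; `count_bind_failures` is a hypothetical helper, not part of stress-ng.

```python
import re
from collections import Counter

# Matches the failure lines exactly as quoted in this report.
BIND_FAIL = re.compile(r"stress-ng-af-alg: bind failed, errno=(\d+) \(([^)]*)\)")


def count_bind_failures(output: str) -> Counter:
    """Count af-alg bind failures per errno in captured stress-ng output."""
    counts = Counter()
    for line in output.splitlines():
        match = BIND_FAIL.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

An empty counter after a run with the fixed package would be consistent with the expected result described above.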

[Regression Potential]
This only affects the af-alg stressor and consists of upstream stress-ng commits that are already in cosmic and disco and have been exercised on large systems. The patch set reduces false positive errors from the af-alg stressor. If there is a regression, the worst case is that error reports from genuinely failing AF-ALG kernel algorithms are ignored, rather than false positives being reported.

-----

Running the hardware test 'stress-ng-cpu-long' from MAAS on a machine with a high CPU count produces a stream of af_alg bind failures, causing the test to fail.

Here's a snippet from the logs:

Hardware: HP Dl360 Gen10 stress-ng-cpu-long
OS: Ubuntu 18.04
MAAS:

# SNIPPET OF KERN.LOG
...
request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module crypto-xor-all, throttling...
request_module: modprobe crypto-xor-all cannot be processed, kmod busy with 50 threads for more than 5 seconds now
request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module crypto-ofb(aes), throttling...
request_module: modprobe crypto-ofb(aes) cannot be processed, kmod busy with 50 threads for more than 5 seconds now
request_module: kmod_concurrent_max (0) close to 0 (max_modprobes: 50), for module crypto-ofb(aes)-all, throttling..
...
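
The kern.log lines above show module requests being throttled once 50 modprobe threads are busy. To see which modules are affected in a longer log, a sketch like the following could be used. It assumes the message format is exactly as quoted above; `stalled_modules` is a hypothetical helper name.

```python
import re

# Patterns for the two request_module message forms quoted above.
THROTTLED = re.compile(r"for module (.+?), throttling")
BLOCKED = re.compile(r"modprobe (.+?) cannot be processed")


def stalled_modules(kern_log: str) -> set:
    """Return the names of modules whose load requests were throttled or blocked."""
    modules = set()
    for line in kern_log.splitlines():
        if "request_module:" not in line:
            continue
        for pattern in (THROTTLED, BLOCKED):
            match = pattern.search(line)
            if match:
                modules.add(match.group(1))
    return modules
```

For the snippet above this yields the crypto-xor-all, crypto-ofb(aes) and crypto-ofb(aes)-all modules, all of which are crypto module aliases.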

# TEST OUTPUT
...
disabled 'cpu-online' as it may hang the machine (enable it with the --pathological option)
dispatching hogs: 72 af-alg, 72 atomic, 72 branch, 72 bsearch, 72 cache, 72 context, 72 cpu, 72 crypt, 72 fp-error, 72 funccall, 72 getrandom, 72 heapsort, 72 hsearch, 72 icache, 72 ioport, 72 lockbus, 72 longjmp, 72 lsearch, 72 malloc, 72 matrix, 72 membarrier, 72 memcpy, 72 mergesort, 72 nop, 72 numa, 72 opcode, 72 qsort, 72 radixsort, 72 rdrand, 72 str, 72 stream, 72 tree, 72 tsc, 72 tsearch, 72 vecmath, 72 wcs, 72 zlib
stress-ng-numa: system has 2 of a maximum 1024 memory NUMA nodes
stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng-stream: Using CPU cache size of 25344K
stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
...

Lee Trager (ltrager) wrote :

How many CPUs/cores does your system have? How much RAM does your system have? Does the stress-ng-cpu-short test pass?

Changed in maas:
status: New → Incomplete
Shane Peters (shaner) wrote :

Hi Lee, apologies for the delay.

This machine has 72 cores and 512GB RAM.

The short test passes fine.

It should be noted that the test works fine outside of 'Hardware Test' mode. For example, if one were to deploy the machine, login and install stress-ng then manually run the test, we don't see the error.

Is it possible this has something to do with starting too early in the boot process?

I did try modifying the cpu-long-test to preload many of the crypto modules before stress-ng runs, but that didn't seem to help.

Changed in maas:
status: Incomplete → New
Changed in maas:
importance: Undecided → High
Changed in maas:
assignee: nobody → Lee Trager (ltrager)
Björn Tillenius (bjornt) wrote :

This commit to stress-ng upstream seems related to this:

  https://git.launchpad.net/stress-ng/commit/?id=a5c2cb02e8edb4f72b3df06414bfd061a0965f1e

Lee Trager (ltrager) wrote :

This seems like a bug in stress-ng. I've backported stress-ng_0.09.56 from Disco to Bionic, could you give that a try?

$ wget http://162.213.35.187/stress-ng-disco/stress-ng-cpu-long-disco
$ maas $PROFILE node-scripts create script@=stress-ng-cpu-long-disco
$ maas $PROFILE machine test $SYSTEM_ID testing_scripts=stress-ng-cpu-long-disco

Changed in maas:
status: New → Incomplete
Colin Ian King (colin-king) wrote :
Changed in stress-ng:
importance: Undecided → Low
assignee: nobody → Colin Ian King (colin-king)
status: New → Triaged
Shane Peters (shaner) wrote :

Hi Colin,
Yes, it would be great if you'd go ahead with SRU'ing this commit!

description: updated
Changed in stress-ng:
status: Triaged → In Progress
importance: Low → Medium
Colin Ian King (colin-king) wrote :

I won't be able to SRU this for Xenial as there are way too many prerequisites to address this fix. Is that OK?

description: updated

Hello Shane, or anyone else affected,

Accepted stress-ng into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/stress-ng/0.09.25-1ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in stress-ng (Ubuntu Bionic):
status: New → Fix Committed
tags: added: verification-needed verification-needed-bionic
Changed in stress-ng (Ubuntu):
status: New → Fix Released
Colin Ian King (colin-king) wrote :

Tested with -proposed stress-ng 0.09.25-1ubuntu2 on amd64 with 192 cpus, 300 af-alg stressors and it works fine, so verification passed.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package stress-ng - 0.09.25-1ubuntu2

---------------
stress-ng (0.09.25-1ubuntu2) bionic; urgency=medium

  [Alfonso Sanchez-Beato]
  * stress-numa: fix calculation of max nodes (LP: #1823208)
    - use the count of bits in "Mems_allowed" to calculate the
      maximum number of NUMA nodes
  [Colin Ian King]
  * stress-af-alg: check for EBUSY bind failures (LP: #1815123)
    - backport of upstream prerequisites and the fix:
      13c4c58d0150 expand error message to capture more information
      39184c74f1e0 forgot to add in \n
      aed180cb7b2f make ENOKEY a non-critical failure
      7f1a617adcd6 skip over ciphers that may not exist
      88cbe87a3cc1 fix errno = ENOENT assignment, should be == comparison
      3ec28f2f5438 return EXIT_NOT_IMPLEMENTED if protocol is not supported
      a5c2cb02e8ed stress-af-alg: check for EBUSY bind failures

 -- Colin King <email address hidden> Thu, 4 Apr 2019 17:47:11 +0100

Changed in stress-ng (Ubuntu Bionic):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for stress-ng has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in stress-ng:
status: In Progress → Fix Released
Changed in maas:
status: Incomplete → Invalid