stress-ng: numa stressor may be unreliable on some systems

Bug #1542741 reported by Colin Ian King
This bug affects 1 person
Affects             Status        Importance  Assigned to     Milestone
stress-ng (Ubuntu)  Fix Released  Medium      Colin Ian King

Bug Description

User is seeing errors:

stress-ng: fail: [19811] stress-ng-numa: mbind: errno=5 (Input/output error)
stress-ng: info: [19811] 5 failures reached, aborting stress process
stress-ng: fail: [19658] stress-ng-numa: mbind: errno=5 (Input/output error)
stress-ng: fail: [19792] stress-ng-numa: mbind: errno=5 (Input/output error)
stress-ng: info: [19792] 5 failures reached, aborting stress process
stress-ng: fail: [19658] stress-ng-numa: mbind: errno=5 (Input/output error)
stress-ng: fail: [19658] stress-ng-numa: mbind: errno=5 (Input/output error)
stress-ng: info: [19658] 5 failures reached, aborting stress process

Need to investigate why this is occurring.

Changed in stress-ng (Ubuntu):
importance: Undecided → Medium
assignee: nobody → Colin Ian King (colin-king)
status: New → Incomplete

Rod Smith (rodsmith) wrote :

I'm attaching a tarball with additional information. This includes runs that generate these errors on three systems in the certification lab in 1SS: rollinia, wildorange, and hogplum. I'm providing both sample runs and CPU information ("lscpu" and "cat /proc/cpuinfo" output) for each system. I've been running via a simple script, which I'm also attaching. (We've omitted the --numa option from the final certification script, but I've added it back to this version so as to trigger the complaint from stress-ng.)

Changed in stress-ng (Ubuntu):
status: Incomplete → In Progress

Colin Ian King (colin-king) wrote :

So it seems mbind() is returning -EIO (Input/output error) for one of two reasons: either MPOL_MF_STRICT was specified and an existing page was already on a node that does not follow the policy, or MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and the kernel was unable to move all existing pages in the range.

I'll tweak the test to cater for this condition.
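
As a rough illustration of the condition described above (a sketch only, not the actual stress-ng patch), a stressor could treat EIO from mbind() as non-fatal whenever one of the strict or move flags was requested. The helper name mbind_failure_is_fatal is hypothetical, and the flag values are repeated from <linux/mempolicy.h> only so the sketch builds without kernel headers:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Flag values as defined in <linux/mempolicy.h>. They are repeated here
 * (an assumption for portability of this sketch) so it compiles even
 * where the kernel headers are not installed.
 */
#ifndef MPOL_MF_STRICT
#define MPOL_MF_STRICT   (1 << 0)
#endif
#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE     (1 << 1)
#endif
#ifndef MPOL_MF_MOVE_ALL
#define MPOL_MF_MOVE_ALL (1 << 2)
#endif

/*
 * Hypothetical helper, not taken from the stress-ng source.
 *
 * Returns false when an mbind() error is the expected EIO case: the
 * caller asked for strict placement or page migration and the kernel
 * could not satisfy it for every page. Any other error is reported
 * as fatal.
 */
static bool mbind_failure_is_fatal(int err, unsigned int flags)
{
    const unsigned int placement_flags =
        MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL;

    if (err == EIO && (flags & placement_flags))
        return false;

    return true;
}
```

A caller would check the mbind() return value and only count a failure when mbind_failure_is_fatal(errno, flags) is true, so that pages already resident on another node no longer push the stressor towards its "5 failures reached" abort.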

Colin Ian King (colin-king) wrote :
Changed in stress-ng (Ubuntu):
status: In Progress → Fix Committed

Colin Ian King (colin-king) wrote :

Hi Rod, thanks for that data, I think I've got it fixed.

Do you mind quickly checking it out on the boxes that it failed on just to check my fixes are OK? If building it from source is OK with you, I'd appreciate that:

git clone git://kernel.ubuntu.com/cking/stress-ng
cd stress-ng
make clean
make
./stress-ng --numa 0 -t 60

and let me know if that works OK. If so, I'll prepare a new release of stress-ng for this week.

Rod Smith (rodsmith) wrote :

I've re-run my original test script and your command on the three servers whose results I posted yesterday, and the error messages have disappeared, so this looks good from my end. Thanks for the quick fix!

Colin Ian King (colin-king) wrote :

Thanks, the updated stress-ng will be landing in Xenial hopefully today.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package stress-ng - 0.05.14-1

---------------
stress-ng (0.05.14-1) unstable; urgency=medium

  * Makefile: bump version
  * Fix make dist - typo in test-libpthread.c
  * stress-fiemap: initialise counter at start of loop
  * stress-fiemap: ensure counter is being updated
  * adt: exclude bind mount from list of stressors
  * Don't emit warning on --pathological if number of stressors is zero
  * ignite-cpu: add null entry at end of list, don't scan by array size
  * Only include <sys/mount.h> for linux for the bind mount stressor
  * stress-bind-mount: build stressor if STRESS_BIND_MOUNT defined
  * stress-bind-mount: stop cppcheck whining about uninitialised pid
  * stress-numa: don't fatally fail on -EIO with MPOL_MF_STRICT (LP: #1542741)
  * Enabled IGNITE_CPU with the --aggressive option
  * Add bind-mount stressor (LP: #1542010)
  * Add --ignite-cpu option to maximize CPU frequency
  * Make float decimal auto detect set -DHAVE_FLOAT_DECIMAL
  * Minor re-org of Makefile, and add more files to dist rule
  * Update README - increase number of stressors
  * adt tests: remove membarrier, it fails on older kernels in Debian
  * Add some more comments
  * stress-affinity: handle EINVAL when CPU(s) are offline
  * Set number of instances to on-line CPUs if N is -ve (LP: #1513546)
  * Remove opt_long, replace with get_int32 or get_uint64
  * Add libpthread build time checks
  * Add librt build time checks
  * Remove commented out old link line
  * Add libcrypt check
  * Makefile: remove test-libz correctly
  * stress-cpu: make source 80 column friendly
  * Add FORCE_DO_NOTHING macro to force compiler to stop optimizing out loops
  * Add zlib stressor
  * stress-stream: cater for systems without L3 cache
  * stress-stream: only emit cache size info on instance 0
  * Add libbsd-dev to README

 -- Colin King <email address hidden> Mon, 8 Feb 2016 18:29:11 +0000

Changed in stress-ng (Ubuntu):
status: Fix Committed → Fix Released