stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
stress-ng (Ubuntu) | Fix Released | Undecided | Unassigned |
Bionic | Fix Released | Undecided | Mauricio Faria de Oliveira |
Disco | Fix Released | Undecided | Mauricio Faria de Oliveira |
Eoan | Fix Released | Undecided | Mauricio Faria de Oliveira |
Bug Description
[Impact]
* Users running stress-ng's 'af-alg' stressor (which is part of the 'cpu'
and 'os' classes of stressors) with 50+ instances might get a failure exit
status and the message 'bind failed, errno=110 (Connection timed out)'.
* For MAAS users, this means the CPU hardware tests (that run the 'cpu'
class of stressors) on larger systems might report 'FAILED' status, thus
possibly misleading admins about the hardware present in the system.
* The root cause has been determined to be the kernel's threshold on
concurrent module loading requests (50), which the crypto API exercises
at the time of the bind() system call (in order to load the requested
crypto algorithm module).
* The problem is a race condition between two instances. One instance
exceeds the threshold of concurrent module loading requests (50) and then
times out while waiting for a second chance (5 seconds). Another instance
gets through and requests the module load, but the module's self-tests
do not finish within the time-out running in the first instance
(60 seconds), as all the CPUs are under stress at that point.
The time-out error is then returned to userspace by bind().
* Not all instances fail with that error: once the crypto algorithm
module is successfully loaded (i.e., requested by another concurrent
instance, with the module self-tests eventually finishing), the problem
no longer occurs.
* The fix checks for an ETIMEDOUT errno on failure of the bind() system
call and retries in a bounded loop (3 attempts), as the module may
just have been loaded successfully by another instance.
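
As an illustration of the bounded retry described above, here is a minimal sketch in Python (the actual fix is in the stress-ng af-alg stressor code and is not reproduced here). The helper names bind_with_retry and open_hash_socket, and the 'sha1' algorithm, are hypothetical and chosen only for this example; the limit of 3 attempts comes from the description above.

```python
import errno
import socket

MAX_BIND_ATTEMPTS = 3  # bound taken from the fix description above


def bind_with_retry(bind, address, attempts=MAX_BIND_ATTEMPTS):
    """Call bind(address), retrying only when it fails with ETIMEDOUT.

    Returns the number of attempts used. Any other error, or ETIMEDOUT on
    the final attempt, is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            bind(address)
            return attempt
        except OSError as err:
            # Retry only on a time-out: another instance may have finished
            # loading the module in the meantime.
            if err.errno != errno.ETIMEDOUT or attempt == attempts:
                raise
    raise ValueError("attempts must be at least 1")


def open_hash_socket(name="sha1"):
    """Hypothetical usage on Linux: bind an AF_ALG hash socket with retries.

    The algorithm name is an assumption made for this example.
    """
    sock = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0)
    bind_with_retry(sock.bind, ("hash", name))
    return sock
```

Errors other than ETIMEDOUT are deliberately not retried in this sketch, so an unrelated bind() failure is still reported on the first attempt.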
[Test Case]
* A synthetic reproducer is available: a kernel module that uses kprobes to
force the synchronization of af-alg instances to happen in the way needed
to reproduce the problem (see comments #7 and #13; test in comments #10-#12).
* With the kernel module loaded, one of the af-alg instances (not all of
them) hits the bind() connection timed out if this patch is not applied.
[Regression Potential]
* The code changes are minimal and contained within af-alg stressor code.
* A possible difference in behavior is that af-alg/cpu/os stressors now
pass/exit with successful status on larger systems.
[Other Info]
* Fix applied in stress-ng [1] in V0.10.09 and in Focal (development series).
* Backport provided for these stable releases: Bionic, Disco, Eoan.
[Original Description]
The MAAS hardware test for CPU (long/12h) fails due to stress-ng-af-alg bind() errors.
stress-ng-cpu-long <...> Failed [View log]
disabled 'cpu-online' as it may hang the machine (enable it with the --pathological option)
dispatching hogs: 72 af-alg, 72 atomic, 72 branch, 72 bsearch, 72 cache, 72 context, 72 cpu, 72 crypt, 72 fp-error, 72 funccall, 72 getrandom, 72 heapsort, 72 hsearch, 72 icache, 72 ioport, 72 lockbus, 72 longjmp, 72 lsearch, 72 malloc, 72 matrix, 72 membarrier, 72 memcpy, 72 mergesort, 72 nop, 72 numa, 72 opcode, 72 qsort, 72 radixsort, 72 rdrand, 72 str, 72 stream, 72 tree, 72 tsc, 72 tsearch, 72 vecmath, 72 wcs, 72 zlib
stress-ng-numa: system has 2 of a maximum 1024 memory NUMA nodes
stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng-stream: Using CPU cache size of 25344K
stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
stress-ng-af-alg: bind failed, errno=110 (Connection timed out)
...
process 6626 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
process 6673 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
process 6713 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
process 6751 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
process 6800 (stress-ng-af_alg) terminated with an error, exit status=1 (stress-ng core failure)
...
unsuccessful run completed in 44935.38s (12 hours, 28 mins, 55.38 secs)
...
This fix has to wait for the current SRU/upload of stress-ng in Eoan (0.10.07-1ubuntu1).