cache test in ubuntu_stress_smoke_tests failed on some cloud instances / bare-metal system

Bug #1956200 reported by Po-Hsu Lin
This bug affects 1 person
Affects              Status        Importance  Assigned to     Milestone
Stress-ng            Fix Released  Critical    Colin Ian King
ubuntu-kernel-tests  Fix Released  Undecided   Unassigned

Bug Description

Issue found on:
  * AWS i3.metal
    - Impish AWS 5.13.0-1008.9
    - Focal 5.4.0-92.103
    - Bionic 4.15.0-166.174
    - Xenial 4.4.0-218.251
  * Azure Standard_A2_v2
    - Impish 5.13.0-23.23
    - Focal 5.4.0-92.103
    - Bionic 4.15.0-166.174
    - Xenial 4.4.0-218.251
  * Azure Standard_D48_v3
    - Focal 5.4.0-92.103
  * Azure Standard_D4s_v3-gen2
    - Focal 5.4.0-92.103
    - Xenial 4.4.0-218.251
  * Azure Standard_F32s_v2
    - Bionic 4.15.0-166.174
  * Bare-metal i386 node onza
    - Bionic 4.15.0-166.174

With stress-ng V0.13.09

 cache FAILED
 stress-ng: debug: [34167] stress-ng 0.13.09 g757b66b49e4b
 stress-ng: debug: [34167] system: Linux ip-172-31-4-194 5.4.0-1059-aws #62~18.04.1-Ubuntu SMP Fri Oct 22 21:51:38 UTC 2021 x86_64
 stress-ng: debug: [34167] RAM total: 503.8G, RAM free: 501.2G, swap free: 1024.0M
 stress-ng: debug: [34167] 72 processors online, 72 processors configured
 stress-ng: info: [34167] setting to a 5 second run per stressor
 stress-ng: info: [34167] dispatching hogs: 4 cache
 stress-ng: debug: [34167] cache allocate: shared cache buffer size: 46080K
 stress-ng: debug: [34167] starting stressors
 stress-ng: debug: [34168] stress-ng-cache: started [34168] (instance 0)
 stress-ng: debug: [34167] 4 stressors started
 stress-ng: debug: [34168] stress-ng-cache: using cache buffer size of 46080K
 stress-ng: debug: [34169] stress-ng-cache: started [34169] (instance 1)
 stress-ng: debug: [34168] stress-ng-cache: exited [34168] (instance 0)
 stress-ng: debug: [34170] stress-ng-cache: started [34170] (instance 2)
 stress-ng: debug: [34171] stress-ng-cache: started [34171] (instance 3)
 stress-ng: debug: [34170] stress-ng-cache: exited [34170] (instance 2)
 stress-ng: debug: [34169] stress-ng-cache: exited [34169] (instance 1)
 stress-ng: error: [34167] process 34168 (stress-ng-cache) terminated with an error, exit status=5 (killed by signal)
 stress-ng: debug: [34167] process [34168] terminated
 stress-ng: debug: [34167] process [34169] terminated
 stress-ng: error: [34167] process 34170 (stress-ng-cache) terminated with an error, exit status=5 (killed by signal)
 stress-ng: debug: [34167] process [34170] terminated
 stress-ng: debug: [34171] stress-ng-cache: exited [34171] (instance 3)
 stress-ng: error: [34167] process 34171 (stress-ng-cache) terminated with an error, exit status=5 (killed by signal)
 stress-ng: debug: [34167] process [34171] terminated
 stress-ng: info: [34167] unsuccessful run completed in 4.98s
 stress-ng: debug: [34167] metrics-check: all stressor metrics validated and sane

Error in syslog:
stress-ng: info: [34167] dispatching hogs: 4 cache
stress-ng: error: [34167] process 34168 (stress-ng-cache) terminated with an error, exit status=5 (killed by signal)
stress-ng: error: [34167] process 34170 (stress-ng-cache) terminated with an error, exit status=5 (killed by signal)
stress-ng: error: [34167] process 34171 (stress-ng-cache) terminated with an error, exit status=5 (killed by signal)

Po-Hsu Lin (cypressyew)
tags: added: aws azure sru-20211129 ubuntu-stress-smoke-test
Po-Hsu Lin (cypressyew)
summary: - cache test in ubuntu_stress_smoke_tests failed on some cloud instances
+ cache test in ubuntu_stress_smoke_tests failed on some cloud instances /
+ bare-metal system
Po-Hsu Lin (cypressyew)
description: updated
Po-Hsu Lin (cypressyew) wrote :

Manually tested on AWS i3.metal with 5.4.0-1059-aws: this issue does not occur with stress-ng V0.13.07.

Po-Hsu Lin (cypressyew)
description: updated
Colin Ian King (colin-king) wrote :

Can you build the latest stress-ng from source (git://github.com/ColinIanKing/stress-ng) and test this version manually on the AWS instance? I've made the signal handler report more information so we can try to figure out which signal is causing this death.

Colin

Changed in stress-ng:
assignee: nobody → Colin Ian King (colin-king)
Colin Ian King (colin-king) wrote :

Messages from dmesg kernel log would also be useful. Thanks. :-)

Colin Ian King (colin-king) wrote :

I suspect it may be a SIGILL caused by some of the new cache op-codes being used. If this is the root issue, I also believe it may not occur on the development tip of stress-ng, because that should cope with detection and handling of newer cache ops.

Po-Hsu Lin (cypressyew) wrote :

Hi Colin,
I got this build error while trying to build with the latest tip (0850686) from the github repo on AWS i3.metal with 5.4.0-1060-aws:
  CC stress-efivar.c
  CC stress-enosys.c
  CC stress-env.c
  CC stress-epoll.c
  Makefile:407: recipe for target 'core-shim.o' failed
  make[1]: Leaving directory '/home/ubuntu/autotest/client/tmp/ubuntu_stress_smoke_test/src/stress-ng'
  Makefile:392: recipe for target 'all' failed
  stderr:
  core-shim.c: In function ‘shim_gettid’:
  core-shim.c:196:9: warning: implicit declaration of function ‘gettid’; did you mean ‘getgid’? [-Wimplicit-function-declaration]
    return gettid();
           ^~~~~~
           getgid
  core-shim.c: In function ‘shim_getcpu’:
  core-shim.c:216:15: warning: implicit declaration of function ‘getcpu’; did you mean ‘getcwd’? [-Wimplicit-function-declaration]
    return (long)getcpu(cpu, node);
                 ^~~~~~
                 getcwd
  core-shim.c: In function ‘shim_statx’:
  core-shim.c:865:15: error: storage size of ‘statxbuf’ isn’t known
    struct statx statxbuf;
                 ^~~~~~~~
  core-shim.c:865:15: warning: unused variable ‘statxbuf’ [-Wunused-variable]
  make[1]: *** [core-shim.o] Error 1
  make[1]: *** Waiting for unfinished jobs....
  stress-ng.c: In function ‘stress_handle_terminate’:
  stress-ng.c:1877:3: warning: ignoring return value of ‘write’, declared with attribute warn_unused_result [-Wunused-result]
     write(fd, buf, strlen(buf));
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
  make: *** [all] Error 2

Po-Hsu Lin (cypressyew) wrote :

And here is the dmesg output for the failing cache test on AWS i3.metal with 5.4.0-1060-aws

Colin Ian King (colin-king) wrote :

I've fixed the build issues on the older releases and pushed the fixes to the github repository. Do you mind re-testing?

Po-Hsu Lin (cypressyew) wrote :

Hello Colin,
With the updated github repo (Test suite HEAD SHA1: b4075793)
this cache test is still failing on AWS i3.metal with 5.4.0-1060-aws:
 cache FAILED
 stress-ng: debug: [69056] stress-ng 0.13.09 gb4075793a861
 stress-ng: debug: [69056] system: Linux ip-172-31-4-194 5.4.0-1060-aws #63~18.04.1-Ubuntu SMP Mon Nov 15 14:31:31 UTC 2021 x86_64
 stress-ng: debug: [69056] RAM total: 503.8G, RAM free: 501.9G, swap free: 1024.0M
 stress-ng: debug: [69056] 72 processors online, 72 processors configured
 stress-ng: info: [69056] setting to a 5 second run per stressor
 stress-ng: info: [69056] dispatching hogs: 4 cache
 stress-ng: debug: [69056] cache allocate: shared cache buffer size: 46080K
 stress-ng: debug: [69056] starting stressors
 stress-ng: debug: [69057] stress-ng-cache: started [69057] (instance 0)
 stress-ng: debug: [69056] 4 stressors started
 stress-ng: debug: [69057] stress-ng-cache: using cache buffer size of 46080K
 stress-ng: debug: [69058] stress-ng-cache: started [69058] (instance 1)
 stress-ng: info: [69057] stressor terminated with unexpected signal signal 4 'SIGILL'
 stress-ng: debug: [69059] stress-ng-cache: started [69059] (instance 2)
 stress-ng: debug: [69060] stress-ng-cache: started [69060] (instance 3)
 stress-ng: error: [69056] process 69057 (stress-ng-cache) terminated with an error, exit status=5 (killed by signal)
 stress-ng: debug: [69056] process [69057] terminated
 stress-ng: debug: [69058] stress-ng-cache: exited [69058] (instance 1)
 stress-ng: debug: [69059] stress-ng-cache: exited [69059] (instance 2)
 stress-ng: debug: [69056] process [69058] terminated
 stress-ng: debug: [69060] stress-ng-cache: exited [69060] (instance 3)
 stress-ng: debug: [69056] process [69059] terminated
 stress-ng: debug: [69056] process [69060] terminated
 stress-ng: info: [69056] unsuccessful run completed in 5.01s
 stress-ng: debug: [69056] metrics-check: all stressor metrics validated and sane

Please find the dmesg output attached.

Colin Ian King (colin-king) wrote :

OK - that's good, I can see SIGILL is occurring, I'll work on a fix tonight.
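
With the extra signal reporting in place, the fatal signal can be pulled out of a saved log mechanically. The following is a minimal sketch in Python, assuming the message format seen in the run above; the regular expression and the `signals_by_pid` helper are illustrative and not part of stress-ng.

```python
import re

# Matches lines such as:
#   stress-ng: info: [69057] stressor terminated with unexpected signal signal 4 'SIGILL'
# The doubled word "signal" is taken from the log above and treated as optional.
SIGNAL_RE = re.compile(
    r"\[(?P<pid>\d+)\] stressor terminated with unexpected signal "
    r"(?:signal )?(?P<num>\d+) '(?P<name>SIG[A-Z0-9]+)'"
)


def signals_by_pid(log_text):
    """Return {pid: (signal_number, signal_name)} for each stressor that died."""
    found = {}
    for line in log_text.splitlines():
        match = SIGNAL_RE.search(line)
        if match:
            found[int(match.group("pid"))] = (
                int(match.group("num")),
                match.group("name"),
            )
    return found


# Three lines copied from the failing run reported above.
sample = """\
 stress-ng: debug: [69058] stress-ng-cache: started [69058] (instance 1)
 stress-ng: info: [69057] stressor terminated with unexpected signal signal 4 'SIGILL'
 stress-ng: debug: [69059] stress-ng-cache: started [69059] (instance 2)
"""

print(signals_by_pid(sample))
```

Keying the result by PID makes it easy to line the signal up with the later "terminated with an error" message for the same process.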

Colin Ian King (colin-king) wrote :

I've added an opcode check for the cldemote instruction. Do you mind re-testing with the latest github repo again?
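
For anyone who wants to see whether a given instance advertises an instruction before running the stressor, here is a rough sketch in Python. It reads the `flags` lines of `/proc/cpuinfo`; this is an assumption made for illustration and not necessarily how stress-ng performs its own check. The helper names are hypothetical, and the flag name `cldemote` is assumed to match the instruction name.

```python
def cpu_flags(cpuinfo_text):
    """Collect the feature flags listed on 'flags' lines of /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_flag(cpuinfo_text, flag):
    """True if the exact flag name appears in the CPU feature list."""
    return flag in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        # Not on Linux, or /proc is unavailable: nothing to report.
        text = ""
    print("cldemote advertised:", has_flag(text, "cldemote"))
```

Flags are compared as whole tokens, so `clflush` does not match `clflushopt`.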

Po-Hsu Lin (cypressyew) wrote :

Hi,
it's still failing with the latest tip 6373fc58

 cache FAILED
 stress-ng: debug: [8641] stress-ng 0.13.09 g6373fc58f42c
 stress-ng: debug: [8641] system: Linux ip-172-31-4-194 5.4.0-1060-aws #63~18.04.1-Ubuntu SMP Mon Nov 15 14:31:31 UTC 2021 x86_64
 stress-ng: debug: [8641] RAM total: 503.8G, RAM free: 500.9G, swap free: 1024.0M
 stress-ng: debug: [8641] 72 processors online, 72 processors configured
 stress-ng: info: [8641] setting to a 5 second run per stressor
 stress-ng: info: [8641] dispatching hogs: 4 cache
 stress-ng: debug: [8641] cache allocate: shared cache buffer size: 46080K
 stress-ng: debug: [8641] starting stressors
 stress-ng: debug: [8642] stress-ng-cache: started [8642] (instance 0)
 stress-ng: debug: [8641] 4 stressors started
 stress-ng: debug: [8642] stress-ng-cache: using cache buffer size of 46080K
 stress-ng: debug: [8643] stress-ng-cache: started [8643] (instance 1)
 stress-ng: info: [8642] stressor terminated with unexpected signal signal 4 'SIGILL'
 stress-ng: debug: [8644] stress-ng-cache: started [8644] (instance 2)
 stress-ng: info: [8643] stressor terminated with unexpected signal signal 4 'SIGILL'
 stress-ng: debug: [8645] stress-ng-cache: started [8645] (instance 3)
 stress-ng: info: [8645] stressor terminated with unexpected signal signal 4 'SIGILL'
 stress-ng: debug: [8644] stress-ng-cache: exited [8644] (instance 2)
 stress-ng: error: [8641] process 8642 (stress-ng-cache) terminated with an error, exit status=5 (killed by signal)
 stress-ng: debug: [8641] process [8642] terminated
 stress-ng: error: [8641] process 8643 (stress-ng-cache) terminated with an error, exit status=5 (killed by signal)
 stress-ng: debug: [8641] process [8643] terminated
 stress-ng: debug: [8641] process [8644] terminated
 stress-ng: error: [8641] process 8645 (stress-ng-cache) terminated with an error, exit status=5 (killed by signal)
 stress-ng: debug: [8641] process [8645] terminated
 stress-ng: info: [8641] unsuccessful run completed in 1.44s
 stress-ng: debug: [8641] metrics-check: all stressor metrics validated and sane

Please find the dmesg output attached.

Colin Ian King (colin-king) wrote :

I've pushed another change that catches SIGILL, reports which opcode is causing the issue, and auto-deselects the opcode being exercised. Can you try this to see if it helps?
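
The idea of trying an operation, noticing SIGILL, and dropping that operation from the set can be illustrated with a small Python sketch. This is not the stress-ng implementation: stress-ng is written in C and presumably handles the signal in-process, whereas the sketch below swaps in a different technique and probes each operation in a child process. The operation names and snippets are hypothetical, and the "unsupported" one simply sends SIGILL to itself to stand in for an illegal instruction.

```python
import signal
import subprocess
import sys


def probe(snippet):
    """Run one operation in a child process.

    Returns the name of the signal that killed the child, or None if it
    exited normally. On POSIX, subprocess reports death by signal N as a
    return code of -N.
    """
    result = subprocess.run([sys.executable, "-c", snippet])
    if result.returncode < 0:
        return signal.Signals(-result.returncode).name
    return None


def deselect_faulting(ops):
    """Probe every op once; return (usable_names, {faulting_name: signal})."""
    usable, faulted = [], {}
    for name, snippet in ops.items():
        sig = probe(snippet)
        if sig == "SIGILL":
            faulted[name] = sig  # report which op faulted, then skip it
        else:
            usable.append(name)
    return usable, faulted


# Hypothetical operations, for illustration only.
OPS = {
    "plain_store": "x = bytearray(64); x[0] = 1",
    "unsupported_op": "import os, signal; os.kill(os.getpid(), signal.SIGILL)",
}

usable, faulted = deselect_faulting(OPS)
print("usable:", usable)
print("deselected:", faulted)
```

Probing in a child keeps the parent alive regardless of what the operation does, at the cost of one process per probe.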

Po-Hsu Lin (cypressyew) wrote :

Hi Colin,
it passed with the latest tip 872c170f; you will find the dmesg output here if you need it.
Thanks
Sam

Colin Ian King (colin-king) wrote :

OK, I suggest using stress-ng at that tip. I'll try to find some spare time this week to regression test stress-ng against all my other distro/arch combos and get a release out in a few days.

Po-Hsu Lin (cypressyew) wrote :

Ack, thanks for the info. I will bump our fork to 872c170f and restart tests.

Po-Hsu Lin (cypressyew) wrote :

Retested on the previously failing AWS / Azure instances with HEAD SHA1 872c170f, and the test passed.

Our fork of stress-ng has been updated to V0.13.10.
https://git.launchpad.net/~canonical-kernel-team/+git/stress-ng/commit/?id=b81116cb69a97aa671ab207a7f600aaacca091d1

Closing this bug with Fix-released.
Thanks!

Changed in stress-ng:
status: New → Fix Released
importance: Undecided → Critical
Po-Hsu Lin (cypressyew)
Changed in ubuntu-kernel-tests:
status: New → Fix Released