disk stress test failing with code 7

Bug #1999731 reported by Ike Panhc
This bug affects 2 people
Affects              Status         Importance   Assigned to      Milestone
Stress-ng            Fix Released   Medium       Colin Ian King
linux (Ubuntu)       Invalid        Undecided    Ike Panhc
stress-ng (Ubuntu)   Fix Released   High         Colin Ian King

Bug Description

Since mid-November we have seen many disk stress tests failing on multiple Ubuntu kernels, e.g. bionic-hwe, focal, focal-hwe. Most of the failures are with the lockofd stressor, and the systems are still alive after the stress test.

05 Nov 08:51: Running stress-ng lockofd stressor for 240 seconds...
** stress-ng exited with code 7

Ike Panhc (ikepanhc) wrote :

So far we are not sure whether this is caused by a kernel update or a stress-ng update. I am going to collect all the failure information and see if there are any hints in it.

The stress-ng we use is from ppa:hardware-certification/public and the kernel is from the Ubuntu archive.

Changed in stress-ng (Ubuntu):
assignee: nobody → Ike Panhc (ikepanhc)
Changed in stress-ng:
assignee: nobody → Ike Panhc (ikepanhc)
Changed in stress-ng (Ubuntu):
status: New → In Progress
Changed in linux (Ubuntu):
status: New → In Progress
Changed in stress-ng:
status: New → In Progress
Colin Ian King (colin-king) wrote :

Return value 7 is "EXIT_METRICS_UNTRUSTWORTHY". This may mean the stressor was killed manually and/or terminated abruptly, leaving it unable to reliably store the bogo-ops metrics in a shared memory segment.
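As a rough illustration of how a test wrapper might treat this exit code, here is a minimal sketch. The function name `classify_exit` and its labels are hypothetical; only the values 0 (successful run) and 7 (EXIT_METRICS_UNTRUSTWORTHY) are taken from this report, and every other code is deliberately lumped together.

```shell
# Hypothetical helper: map a stress-ng exit status to a short label.
# Only 0 and 7 are described in this report; other codes are not decoded.
classify_exit() {
    case "$1" in
        0) echo "pass" ;;
        7) echo "metrics-untrustworthy" ;;
        *) echo "other-failure($1)" ;;
    esac
}
```

It could be called right after a run, for example `./stress-ng --lockofd 0 -t 240; classify_exit $?`.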

Only two commits have changed in the lockofd stressor since Aug 2022:

commit 6274835de403ec310b20d2771ccf21a523b94cec
Author: Colin Ian King <email address hidden>
Date: Tue Nov 15 19:57:28 2022 +0000

    stress-*: rename stress_not_implemented to stress_unimplemented

    Signed-off-by: Colin Ian King <email address hidden>

commit 7ccc5fe9d18b7d467e8279c0a791df6453e5a451
Author: Colin Ian King <email address hidden>
Date: Tue Nov 15 19:47:32 2022 +0000

    stress-*: add more explanation about stressors being unimplmented

    It's useful for users to have some idea why a stressor is not implemented
    or useable on the various systems that stress-ng can build on.

    Signed-off-by: Colin Ian King <email address hidden>

These commits didn't change the functionality of stress-ng's lockofd stressor from what I can see.

Colin Ian King (colin-king) wrote :

It may be worth using stress-ng from https://launchpad.net/~colin-king/+archive/ubuntu/stress-ng or building it from source using:

git clone https://github.com/ColinIanKing/stress-ng
cd stress-ng
git checkout V0.15.00
make clean
make

and then running:

./stress-ng --lockofd 0 -t 300 --vmstat 1

and see if that causes the issue.

Then check out V0.14.06, run make clean; make again, and see if this causes the issue.

David, Tsai (tsai-david) wrote :

Test with stress-ng_0.13.12-2ubuntu1_amd64.deb -> Passed

27 Dec 03:53: Running stress-ng locka stressor for 240 seconds...
stress-ng: info: [3300189] setting to a 240 second (4 mins, 0.00 secs) run per stressor
stress-ng: info: [3300189] dispatching hogs: 176 locka
stress-ng: info: [3300189] successful run completed in 240.09s (4 mins, 0.09 secs)

27 Dec 03:57: Running stress-ng lockf stressor for 240 seconds...
stress-ng: info: [3300551] setting to a 240 second (4 mins, 0.00 secs) run per stressor
stress-ng: info: [3300551] dispatching hogs: 176 lockf
stress-ng: info: [3300551] successful run completed in 240.10s (4 mins, 0.10 secs)

27 Dec 04:01: Running stress-ng lockofd stressor for 240 seconds...
stress-ng: info: [3300914] setting to a 240 second (4 mins, 0.00 secs) run per stressor
stress-ng: info: [3300914] dispatching hogs: 176 lockofd
stress-ng: info: [3300914] successful run completed in 240.09s (4 mins, 0.09 secs)

Ike Panhc (ikepanhc) wrote :

Thanks cking and David,

I am running 0.11.07-1ubuntu2 from archive, 0.14.06-0~202210291239~ubuntu20.04.1 from ppa:hardware-certification/public and 0.15.01-1~f1 from ppa:colin-king/stress-ng with 5.4.0-135.152 kernel to see if stress-ng version matters.

If not, the next step is to test across kernels.

Ike Panhc (ikepanhc) wrote :

Running 5.4.0-135.152 kernel on

d05-2 with stress-ng 0.11.07-1ubuntu2 -> 10/10 passed
d05-2 with stress-ng 0.14.06-0~202210291239~ubuntu20.04.1 -> 10/10 passed
d05-5 with stress-ng 0.15.01-1~f1 -> 9/10 passed

and on d05-5 with stress-ng 0.14.06-0~202210291239~ubuntu20.04.1 I saw a failure yesterday.

I will run more tests on d05-5 to find out.

The disk on d05-2 is a 4T rotational disk and the one on d05-5 is an 8T rotational disk.

Ike Panhc (ikepanhc) wrote :

Here is the failure log with stress-ng 0.15.01-1~f1 on d05-5, run with `--vmstat 1`.

Colin Ian King (colin-king) wrote :

Hi Ike,

I'm having difficulty reproducing this.

1. Please attach the dmesg log to the bug report when this fails.
2. What file system is being used?
3. Does this occur with the latest version of stress-ng?
    - One can use the packages from https://launchpad.net/~colin-king/+archive/ubuntu/stress-ng to get the latest version on older releases.

Changed in stress-ng (Ubuntu):
importance: Undecided → High
assignee: Ike Panhc (ikepanhc) → Colin Ian King (colin-king)
Michael Reed (mreed8855) wrote :

Hi Colin,

I have attached a test run that reproduces this issue on Focal. The dmesg file is at the following location:

attachment_files/com.canonical.certification__dmesg_attachment

Colin Ian King (colin-king) wrote :

I managed to corner the bug: the termination signal on timeout was killing the contention process with SIGKILL, and this could kill it in the middle of a bogo-op increment, leaving the internal state of the bogo-op counter inconsistent. I've pushed a fix to the main repo:

commit efb0ad344e735986b29e0e1d68454edf9d793ee4 (HEAD -> master)
Author: Colin Ian King <email address hidden>
Date: Fri Feb 3 17:36:09 2023 +0000

    stress-lock{a|f|ofd}: terminate contention process with SIGALRM

This will be included in the next stress-ng release at the end of this month.
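
The difference between the two signals can be sketched in shell. This is only an analogy under stated assumptions, not stress-ng's C code: `worker`, `stop_worker`, and the `ready=` state file are hypothetical names. The worker can catch SIGALRM, finish its loop, and publish a consistent final state, whereas SIGKILL cannot be caught and stops it wherever it happens to be.

```shell
# Hypothetical worker: counts "bogo ops" until told to stop, then publishes
# the count together with a ready flag. This is not stress-ng code.
worker() {
    state_file=$1
    count=0
    keep_going=1
    trap 'keep_going=0' ALRM            # SIGALRM only sets a flag
    echo "ready=0" > "$state_file"      # tells the parent the trap is installed
    while [ "$keep_going" -eq 1 ]; do
        count=$((count + 1))
    done
    echo "ready=1 count=$count" > "$state_file"
}

# stop_worker SIGNAL STATE_FILE: start a worker, stop it with SIGNAL, and
# print the worker's exit status followed by the first field of its state.
stop_worker() {
    sig=$1
    file=$2
    rm -f "$file"
    worker "$file" &
    pid=$!
    # Wait until the worker has installed its trap and written its state.
    until [ -s "$file" ]; do sleep 0.05; done
    kill -s "$sig" "$pid"
    status=0
    wait "$pid" || status=$?
    echo "$status $(cut -d' ' -f1 "$file")"
}
```

Calling `stop_worker ALRM /tmp/state` should report exit status 0 with `ready=1`, while `stop_worker KILL /tmp/state` should report status 137 with the state left at `ready=0`.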

Colin Ian King (colin-king) wrote :

stress-ng V0.15.04 has been released with a fix for this issue.

Colin Ian King (colin-king) wrote :

The updated stress-ng is available from the stress-ng PPA: https://launchpad.net/~colin-king/+archive/ubuntu/stress-ng/

Ike Panhc (ikepanhc) wrote :

Thanks cking. I am having trouble finding a machine that reproduces this issue reliably.

I will run the test more than 100 times on V0.15.04 and V0.15.03. If V0.15.04 is much more stable, we can switch to V0.15.04 to avoid this issue.

Many thanks.

David, Tsai (tsai-david) wrote :

Test with stress-ng_0.15.04-1~j_amd64 + Ubuntu 22.04 (5.15.0-58-generic) + Micron_7450_MTFDKCB3T8TFR -> Passed

David, Tsai (tsai-david) wrote :

The log is attached.

Changed in stress-ng (Ubuntu):
status: In Progress → Fix Released
Ike Panhc (ikepanhc)
Changed in linux (Ubuntu):
status: In Progress → Invalid
Ike Panhc (ikepanhc) wrote :

Thanks Colin,

For your information, I ran a loop test[1] on each stress-ng tag since V0.12.00, running lockofd 100 times per tag and checking the return value. The issue is first seen in V0.12.09 (or maybe earlier, because the chance of reproducing it is very low). The full console log is attached.

--
[1]
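#!/bin/bash
# Build each tag listed in gittag.txt, run lockofd 100 times per tag,
# and print the exit code of every run.

for i in $(cat gittag.txt); do
 echo "==== $i ===="
 cd /home/ubuntu/stress-ng || exit 1
 make clean
 git reset --hard
 git checkout "$i"
 make clean
 make
 cd /home/ubuntu || exit 1
 for j in $(seq 1 100); do
  ./stress-ng/stress-ng --lockofd 0 -t 240
  echo "== $? =="
 done
done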

Ike Panhc (ikepanhc) wrote :

As further information, I tested V0.15.00 to V0.15.04 300 times each[1], and I can reproduce the issue on V0.15.00 and V0.15.01 only.

The console output is attached.
--
[1]
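#!/bin/bash
# Build each tag listed in gittag.txt, run lockofd 300 times per tag,
# and print the exit code of every run.

for i in $(cat gittag.txt); do
 echo "==== $i ===="
 cd /home/ubuntu/stress-ng || exit 1
 make clean
 git reset --hard
 git checkout "$i"
 make clean
 make
 cd /home/ubuntu || exit 1
 for j in $(seq 1 300); do
  ./stress-ng/stress-ng --lockofd 0 -t 240
  echo "== $? =="
 done
done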

$ cat gittag.txt
V0.15.00
V0.15.01
V0.15.02
V0.15.03
V0.15.04
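
To make the attached console output easier to compare across tags, the log could be tallied with something like the sketch below. `summarize_log` is a hypothetical helper, not part of the test above; it assumes the log contains the `==== <tag> ====` and `== <exit code> ==` marker lines printed by the loop script and ignores everything else.

```shell
# Hypothetical helper: read the loop script's console output on stdin and
# print "<tag> runs=<n> nonzero=<m>" for each tag, in order of appearance.
summarize_log() {
    awk '
        /^==== .* ====$/ {
            tag = $2
            if (!(tag in runs)) { order[++n] = tag; runs[tag] = 0; bad[tag] = 0 }
            next
        }
        /^== [0-9]+ ==$/ {
            runs[tag]++
            if ($2 != 0) bad[tag]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s runs=%d nonzero=%d\n", order[i], runs[order[i]], bad[order[i]]
        }
    '
}
```

For example, `summarize_log < console.log`, where `console.log` stands for a saved copy of the console output.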

Ike Panhc (ikepanhc) wrote :

I believe it is OK to say this bug is already fixed. Thanks Colin.

Changed in stress-ng:
assignee: Ike Panhc (ikepanhc) → nobody
Changed in stress-ng:
status: In Progress → Fix Released
assignee: nobody → Colin Ian King (colin-king)
importance: Undecided → Medium