some stress-ng-test-for-class-* jobs consistently fail on ARM CPUs

Bug #1986511 reported by Pierre Equoy
This bug affects 2 people
Affects: Checkbox Provider - Base
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

We have a resource job that generates jobs for the different stress-ng classes available[1]. The generated jobs are then run to stress the system:

- stress-ng-test-for-class-cpu
- stress-ng-test-for-class-cpu-cache
- stress-ng-test-for-class-memory
- stress-ng-test-for-class-os
- stress-ng-test-for-class-pipe
- stress-ng-test-for-class-scheduler
- stress-ng-test-for-class-vm

Each of these jobs basically runs the command[2]:

  stress-ng --sequential 0 --class ${CLASS} --timeout "${TIMEOUT:-30}" --skip-silent --verbose
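To make the expansion of the command above concrete for each class, here is a minimal Python sketch. It only builds and prints the command lines and does not run stress-ng. The names `STRESS_NG_CLASSES` and `build_stress_ng_command` are hypothetical and are not taken from the provider; the default timeout of 30 mirrors `${TIMEOUT:-30}`.

```python
import shlex

# Classes listed in this bug description.
STRESS_NG_CLASSES = ["cpu", "cpu-cache", "memory", "os", "pipe", "scheduler", "vm"]


def build_stress_ng_command(stress_class, timeout=30):
    """Return the argv list for one stress-ng class run (hypothetical helper)."""
    return [
        "stress-ng",
        "--sequential", "0",
        "--class", stress_class,
        "--timeout", str(timeout),
        "--skip-silent",
        "--verbose",
    ]


if __name__ == "__main__":
    # Print one command line per class; nothing is executed.
    for name in STRESS_NG_CLASSES:
        print(shlex.join(build_stress_ng_command(name)))
```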

After the QA team reported more and more issues with these jobs, I gathered data from older and more recent projects we have been running (from 2020 to 2022), and a few things stood out:

1. These tests seem to work fine on amd64 CPUs. They stress the system but complete, and the job is marked as "passed" in Checkbox.
2. Across 6 different projects using either aarch64 or armv7l CPUs, we observe the following trend:

- stress-ng-test-for-class-os failed 5 times (out of 6)
- stress-ng-test-for-class-cpu failed 3 times (out of 6)
- stress-ng-test-for-class-memory failed 2 times (out of 6)
- stress-ng-test-for-class-pipe failed 1 time (out of 6)
- the other classes always passed.
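The counts above can be turned into failure rates with a short sketch like the following. The zero counts for cpu-cache, scheduler and vm are inferred from "the other classes always passed"; the function name is hypothetical.

```python
PROJECTS = 6

# Failures per class across the 6 ARM projects, as reported above.
# Classes with 0 are the ones that "always passed".
FAILURES = {
    "os": 5,
    "cpu": 3,
    "memory": 2,
    "pipe": 1,
    "cpu-cache": 0,
    "scheduler": 0,
    "vm": 0,
}


def failure_percentages(failures, total_runs):
    """Map each class to its failure rate, rounded to a whole percent."""
    return {name: round(100 * count / total_runs) for name, count in failures.items()}


if __name__ == "__main__":
    rates = failure_percentages(FAILURES, PROJECTS)
    # List the worst classes first.
    for name, pct in sorted(rates.items(), key=lambda item: item[1], reverse=True):
        print(f"stress-ng-test-for-class-{name}: {FAILURES[name]}/{PROJECTS} failed ({pct}%)")
```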

During this period, different versions of stress-ng have been used:

- 0.13.03 g9093bce765cf
- 0.14.00 gec7f6c4731a5
- 0.14.01 g597da6154263

Unfortunately, it is very difficult to capture logs because the device generally becomes unresponsive, freezes, or crashes.

I am not sure if this is an issue with Checkbox, with stress-ng on ARM CPUs, or an actual problem with ARM CPUs...

[1] https://git.launchpad.net/plainbox-provider-checkbox/tree/units/stress/stress-ng.pxu#n2
[2] https://git.launchpad.net/plainbox-provider-checkbox/tree/units/stress/stress-ng.pxu#n33

Pierre Equoy (pieq) wrote :
description: updated

Colin Ian King (colin-king) wrote :

One failure is because of an unexpected EOPNOTSUPP error on a read from the kernel xilinx-zynqmp-rsa crypto engine.

stress-ng: fail: [1882] stress-ng-af-alg: read using xilinx-zynqmp-rsa failed: errno=95 (Operation not supported)
stress-ng: fail: [1882] stress-ng-af-alg: read using xilinx-zynqmp-rsa failed: errno=95 (Operation not supported)
stress-ng: fail: [1882] stress-ng-af-alg: read using xilinx-zynqmp-rsa failed: errno=95 (Operation not supported)
stress-ng: fail: [1882] stress-ng-af-alg: read using xilinx-zynqmp-rsa failed: errno=95 (Operation not supported)
stress-ng: fail: [1882] stress-ng-af-alg: read using xilinx-zynqmp-rsa failed: errno=95 (Operation not supported)
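When a log repeats the same failure many times, as above, it can help to collapse it into one entry per distinct failure. The sketch below does that in Python. The regular expression is derived only from the lines quoted in this comment and is an assumption about the format, not a description of every message stress-ng can emit; the function names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the "stress-ng: fail:" lines quoted above (assumption).
FAIL_LINE = re.compile(
    r"^stress-ng: fail:\s+\[(?P<pid>\d+)\]\s+(?P<stressor>[\w-]+):\s+"
    r"(?P<message>.*)\s+errno=(?P<errno>\d+)\s+\((?P<strerror>.*)\)\s*$"
)


def parse_fail_line(line):
    """Return the fields of one failure line as a dict, or None if it does not match."""
    match = FAIL_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["errno"] = int(fields["errno"])
    return fields


def summarize_failures(lines):
    """Count occurrences of each distinct (stressor, errno, strerror) triple."""
    summary = Counter()
    for line in lines:
        fields = parse_fail_line(line)
        if fields is not None:
            summary[(fields["stressor"], fields["errno"], fields["strerror"])] += 1
    return summary


if __name__ == "__main__":
    sample = [
        "stress-ng: fail: [1882] stress-ng-af-alg: read using xilinx-zynqmp-rsa "
        "failed: errno=95 (Operation not supported)"
    ] * 5
    for (stressor, err, text), count in summarize_failures(sample).items():
        print(f"{stressor}: errno={err} ({text}) x{count}")
```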

Pierre Equoy (pieq) wrote :

@colin-king Thanks for the info! This is indeed what my colleague Vic told me on Friday. I thought I had removed the logs from the issue, since they don't actually reflect the problem we see (and we don't have actual logs from when the problem occurs, because so far we ran Checkbox locally, so when the device crashes, it takes down the logs that are shown on the screen).

We have a plan to re-run some of these tests through SSH, so that we can capture the logs from another device.
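As a rough illustration of that plan, the sketch below starts a command on the device over SSH and writes every output line to a file on the controlling machine, flushing after each line so that the log survives if the device crashes. It assumes an `ssh` client is installed and key-based login is set up. The function names, host and log path are hypothetical, and this is not how Checkbox itself handles remote runs.

```python
import subprocess


def build_ssh_command(host, remote_command):
    """Return the argv list that runs remote_command on host through ssh."""
    return ["ssh", host, remote_command]


def capture_to_file(lines, log_path):
    """Append each line to log_path, flushing after every line.

    Returns the number of lines written.
    """
    count = 0
    with open(log_path, "a", encoding="utf-8") as log:
        for line in lines:
            log.write(line if line.endswith("\n") else line + "\n")
            log.flush()  # keep the local copy current if the device dies
            count += 1
    return count


def run_remote(host, remote_command, log_path):
    """Run remote_command on host and stream its output into log_path."""
    proc = subprocess.Popen(
        build_ssh_command(host, remote_command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
    )
    capture_to_file(proc.stdout, log_path)
    return proc.wait()
```

For example, `run_remote("ubuntu@device.example", "stress-ng --sequential 0 --class os --timeout 30 --skip-silent --verbose", "stress-ng-os.log")` would keep the last lines printed before a hang in `stress-ng-os.log` on the local machine.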

Rick Wu (rickwu4444) wrote :

I have some logs from a previous run on a project with an i.MX7 CPU.
I ran the stress-ng OS class with our workaround, which prevents some processes from being killed by the oom-killer by setting their oom_score to -1000. The processes are: python / bash / snapd / systemd
At the end of the test, the system was reset by the watchdog.
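A workaround of this kind could look like the sketch below. It is not the script used in that run. On Linux the writable knob is `/proc/<pid>/oom_score_adj` (`oom_score` itself is read-only), and writing -1000 exempts the process from the OOM killer; lowering the value typically requires root. The function name, the exact-match rule on process names and the `proc_root` parameter are assumptions made for illustration.

```python
import os

# Process names from the comment above. Matching is exact (assumption), so
# "python3" would need to be listed separately.
PROTECTED_NAMES = frozenset({"python", "bash", "snapd", "systemd"})
OOM_SCORE_ADJ_MIN = -1000  # lowest value accepted by oom_score_adj


def protect_from_oom_killer(names=PROTECTED_NAMES, proc_root="/proc"):
    """Write -1000 to oom_score_adj for every process whose name is in names.

    Returns the sorted list of PIDs that were updated.
    """
    protected = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        pid_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(pid_dir, "comm"), encoding="utf-8") as comm:
                name = comm.read().strip()
            if name not in names:
                continue
            with open(os.path.join(pid_dir, "oom_score_adj"), "w", encoding="utf-8") as adj:
                adj.write(str(OOM_SCORE_ADJ_MIN))
        except OSError:
            # The process exited in the meantime, or we lack privileges.
            continue
        protected.append(int(entry))
    return sorted(protected)
```

Calling `protect_from_oom_killer()` as root before starting the stress run would apply the setting to the processes that exist at that moment; processes started later are not covered unless they inherit the value from a protected parent.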

Rick Wu (rickwu4444) wrote :

This is the log after I disabled the watchdog. In the end, the system hung due to a kernel panic.
However, in my experience, this seems to happen only on UC20 and not on classic. I don't really know what the difference between those two is.

Maksim Beliaev (beliaev-maksim) wrote :

This bug was migrated to GitHub: https://github.com/canonical/checkbox/issues/242.
It is no longer monitored here.

Changed in plainbox-provider-checkbox:
status: New → Expired