stress-ng-cpu-long times out in bionic

Bug #1826789 reported by Mario Splivalo
This bug affects 3 people
Affects: MAAS
Status: Fix Released
Importance: Medium
Assigned to: Adam Collard

Bug Description

I'm running MAAS 2.3.5-6511-gf466fdb-0ubuntu1 on Ubuntu xenial.

If I set the commissioning series to bionic, the stress-ng-cpu-long test times out, but only on NUMA machines (that is, machines that have more than one CPU).

If the commissioning series is set to xenial, the tests run fine, as expected.

This is the command MAAS runs for the stress-ng-cpu-long test:

stress-ng --aggressive -a 0 --class cpu,cpu-cache --ignite-cpu --log-brief --metrics-brief --times --tz --verify --timeout 12h

I have tried running this test with a timeout of one hour, or even just 10 minutes, and found that on bionic/cosmic/disco, on NUMA machines, the tests usually take longer than the configured timeout.

For instance, on a dual Intel Xeon E5-2698 v3 machine (with a total of 32/64 cores/threads), when I run a 1h stress-ng test, these are the completion times on different series:

xenial: 1.003 hours (60m 13s)
bionic: 1.120 hours (67m 12s)
cosmic: 1.190 hours (71m 35s)
disco: 1.470 hours (88m 20s)
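
To put the figures above in perspective, here is a small Python sketch that turns the reported minute/second completion times into a percentage overrun against the 60-minute timeout. The function name and the dictionary are made up for illustration; only the timings come from this report.

```python
# Completion times reported above for a 1-hour stress-ng run, as (minutes, seconds).
REPORTED = {
    "xenial": (60, 13),
    "bionic": (67, 12),
    "cosmic": (71, 35),
    "disco": (88, 20),
}


def overrun_percent(minutes, seconds, timeout_minutes=60):
    """Return how far a run exceeded the timeout, as a percentage of the timeout."""
    actual = minutes * 60 + seconds
    timeout = timeout_minutes * 60
    return 100.0 * (actual - timeout) / timeout


if __name__ == "__main__":
    for series, (m, s) in REPORTED.items():
        print(f"{series}: {overrun_percent(m, s):.1f}% over the 60-minute timeout")
```

On these numbers xenial stays within half a percent of the timeout, while the later series overrun by roughly 12% to 47%.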

When I run those tests on non-NUMA (single CPU) machines, the tests are done within 60 minutes, same as on xenial on NUMA machines.

I am able to make tests complete without an error if I change the metadata timeout value to 14 hours, making sure that the timeout is way greater than the expected test run.
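
As a rough sanity check of the 14-hour workaround, the sketch below projects the runtime of the 12h test from the overrun ratio seen in the 1h runs. This assumes the ratio measured over one hour carries over to a 12-hour run, which nothing in this report confirms; the helper names are hypothetical.

```python
def projected_runtime_hours(requested_hours, overrun_ratio):
    """Scale the requested stress-ng timeout by an observed overrun ratio.

    Assumption: the ratio measured on a 1h run also applies to longer runs.
    """
    return requested_hours * overrun_ratio


def fits_metadata_timeout(requested_hours, overrun_ratio, metadata_timeout_hours):
    """True if the projected runtime stays below the metadata timeout."""
    return projected_runtime_hours(requested_hours, overrun_ratio) < metadata_timeout_hours


if __name__ == "__main__":
    # bionic: 67m 12s for a 60m run gives a ratio of 1.12
    print(fits_metadata_timeout(12, 67.2 / 60, 14))
    # disco: 88m 20s for a 60m run gives a ratio of about 1.47
    print(fits_metadata_timeout(12, (88 * 60 + 20) / 3600, 14))
```

Under that assumption a 14-hour metadata timeout leaves headroom for the bionic ratio but would not for the disco one, so the margin is worth rechecking per series.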

I have opened a bug against stress-ng too, as I'm not sure if this is, maybe, normal stress-ng behavior on newer Ubuntu series.
The bug number is LP: #1826791

description: updated
Lee Trager (ltrager)
Changed in maas:
status: New → Triaged
importance: Undecided → Medium
Mario Splivalo (mariosplivalo) wrote :

Based on the input from LP: #1826791, this is the way stress-ng behaves on newer series.

So, this should be fixed in MAAS, probably by running stress-ng without --aggressive and/or without -a 0.

Facundo Ciccioli (fandanbango) wrote :

We stumbled upon this issue. We were able to run some tests, and the results are:

* The --aggressive flag is irrelevant to the issue. Removing it yields the same increase in test runtime.

* The -a0 flag, on the other hand, has an impact. Setting -a1 yields perfect timing (that is, the specified timeout agrees with the actual running time). On a 56-core machine:

   - using 28 stressors increased runtime by a couple of seconds (just under 2%),
   - using 40 stressors already increased running time by 50%,
   - using -a0 (56 stressors) increased runtime by 100%.
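
For anyone wanting to repeat the comparison above, here is a minimal Python sketch that builds the same stress-ng command line with a configurable -a value. The helper name is hypothetical and this is not how MAAS itself constructs the command; the flags are the ones quoted in the bug description.

```python
def build_stress_ng_command(stressors, timeout="12h"):
    """Return the stress-ng argv from the bug description with a custom -a value.

    stressors=0 reproduces the original command (-a 0); a positive value
    pins the number of instances, e.g. 28 or 40 as in the test above.
    """
    if stressors < 0:
        raise ValueError("stressors must be zero or a positive integer")
    return [
        "stress-ng", "--aggressive", "-a", str(stressors),
        "--class", "cpu,cpu-cache", "--ignite-cpu",
        "--log-brief", "--metrics-brief", "--times", "--tz", "--verify",
        "--timeout", timeout,
    ]


if __name__ == "__main__":
    # Print the command lines for the three stressor counts compared above.
    for count in (28, 40, 0):
        print(" ".join(build_stress_ng_command(count, timeout="1h")))
```

The printed lines can be pasted into a shell and timed, which keeps every flag other than -a identical between runs.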

Hope you find this useful.

Jerzy Husakowski (jhusakowski) wrote :

MAAS should increase the timeout it uses when waiting for the results of the stress-ng long test, to account for the extended duration of that test on some machines.

Changed in maas:
milestone: none → 3.3.0
assignee: nobody → Adam Collard (adam-collard)
Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
milestone: 3.3.0 → 3.3.0-beta1
Changed in maas:
status: Fix Committed → Fix Released