MAAS

stress-ng-cpu-long times out in bionic

Bug #1826789 reported by Mario Splivalo on 2019-04-28

This bug affects 3 people

Affects	Status	Importance	Assigned to	Milestone
MAAS	Fix Released	Medium	Adam Collard	MAAS 3.3.0-beta1

Bug Description

I'm running MAAS 2.3.5-6511-gf466fdb-0ubuntu1 on Ubuntu xenial.

If I set the commissioning series to bionic, the stress-ng-cpu-long test times out, but only on NUMA machines (that is, machines that have more than one CPU).

If the commissioning series is set to xenial, the tests run fine, as expected.

This is the stress-ng command MAAS runs for the stress-ng-cpu-long test:

stress-ng --aggressive -a 0 --class cpu,cpu-cache --ignite-cpu --log-brief --metrics-brief --times --tz --verify --timeout 12h

I have tried running this test with a timeout of one hour, or even just 10 minutes, and found that on bionic/cosmic/disco, on NUMA machines, the tests usually run longer than the configured timeout.

For instance, on a dual Intel Xeon E5-2698 v3 machine (32 cores/64 threads in total), a 1h stress-ng test completes in the following times on different series:

xenial: 1.003 hours (60m 13s)
bionic: 1.120 hours (67m 12s)
cosmic: 1.190 hours (71m 35s)
disco: 1.470 hours (88m 20s)
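
As a rough illustration, the overrun relative to the requested 1h timeout can be computed from the completion times listed above. The function and variable names here are made up for this sketch; only the measured times come from the report.

```python
# Requested stress-ng timeout for the test runs above (1 hour), in seconds.
TIMEOUT_S = 60 * 60

# Completion times from the report, as (minutes, seconds) per series.
measured = {
    "xenial": (60, 13),
    "bionic": (67, 12),
    "cosmic": (71, 35),
    "disco": (88, 20),
}


def overrun_percent(minutes, seconds, timeout_s=TIMEOUT_S):
    """Return how far a run exceeded the requested timeout, in percent."""
    elapsed_s = minutes * 60 + seconds
    return (elapsed_s - timeout_s) / timeout_s * 100


for series, (minutes, seconds) in measured.items():
    print(f"{series}: {overrun_percent(minutes, seconds):.1f}% over the requested timeout")
```

On these numbers xenial stays within about half a percent of the requested time, while the overrun grows with each newer series.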

When I run those tests on non-NUMA (single CPU) machines, the tests are done within 60 minutes, same as on xenial on NUMA machines.

I am able to make tests complete without an error if I change the metadata timeout value to 14 hours, making sure that the timeout is way greater than the expected test run.
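
A small sketch of why 14 hours is enough in some cases but not necessarily all: it projects the 12h run using the slowdown seen in the 1h runs. This assumes the slowdown scales linearly with the run length, which the report does not confirm, and the helper names are hypothetical.

```python
from datetime import timedelta

REQUESTED = timedelta(hours=12)          # --timeout 12h passed to stress-ng
METADATA_TIMEOUT = timedelta(hours=14)   # raised timeout mentioned above


def projected_runtime(requested, slowdown):
    """Projected wall-clock time, ASSUMING the slowdown scales linearly."""
    return requested * slowdown


def fits(requested, slowdown, metadata_timeout):
    """True if the projected run finishes before the metadata timeout."""
    return projected_runtime(requested, slowdown) <= metadata_timeout


# Slowdown factors taken from the 1h runs in the description.
for series, slowdown in {"bionic": 1.120, "disco": 1.470}.items():
    print(series, projected_runtime(REQUESTED, slowdown),
          fits(REQUESTED, slowdown, METADATA_TIMEOUT))
```

Under that assumption a 14h timeout tolerates a slowdown of up to 14/12, roughly 1.17, so it would cover the bionic figure but not the disco one.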

I have opened a bug against stress-ng too, as I'm not sure if this is, maybe, normal stress-ng behavior on newer Ubuntu series.
The bug number is LP: #1826791


Related branches

~adam-collard/maas:stress-ng-long-timeouts

Merged into maas:master

MAAS Lander: Approve on 2022-09-01

Jack Lloyd-Walters: Approve on 2022-09-01

Mario Splivalo (mariosplivalo) on 2019-04-28

description:	updated

Lee Trager (ltrager) on 2019-09-20

Changed in maas:
status:	New → Triaged
importance:	Undecided → Medium


Mario Splivalo (mariosplivalo) wrote on 2019-10-03:

Based on the input from LP: #1826791, this is the way stress-ng behaves on newer series.

So this should be fixed in MAAS, probably by running stress-ng without --aggressive and/or without -a 0.


Facundo Ciccioli (fandanbango) wrote on 2020-08-06:

We stumbled upon this issue. We were able to do some tests and results are:

* Flag --aggressive is irrelevant to the issue. Taking it out yields the same increase in test runtime.

* Flag -a0, on the other hand, has an impact. Setting -a1 yields perfect timing (that is, the specified timeout agrees with the actual running time). On a 56-core machine:

   - using 28 stressors increased runtime by a couple of seconds (just under 2%),
   - using 40 stressors already increased running time by 50%,
   - using -a0 (56 stressors) increased runtime by 100%.

Hope you find this useful.
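
To illustrate the measurements above, here is a minimal sketch that builds the stress-ng command line from the bug description with an explicit -a value instead of -a 0. The helper name and the 0.5 fraction are assumptions made for this example (half the cores matches the 28-stressor case on the 56-core machine); it is not how MAAS builds the command, and the command is only assembled, not executed.

```python
def stress_ng_command(cpu_count, fraction=0.5, timeout="12h"):
    """Build the stress-ng argument list with a capped number of instances.

    cpu_count and fraction are illustrative inputs; all other flags are
    copied from the command quoted in the bug description.
    """
    instances = max(1, int(cpu_count * fraction))
    return [
        "stress-ng",
        "--aggressive",
        "-a", str(instances),          # explicit count instead of -a 0
        "--class", "cpu,cpu-cache",
        "--ignite-cpu",
        "--log-brief",
        "--metrics-brief",
        "--times",
        "--tz",
        "--verify",
        "--timeout", timeout,
    ]


# 56 cores at half load gives 28 instances, the case with under 2% overrun.
print(" ".join(stress_ng_command(56)))
```

Running fewer instances reduces the load the test puts on the machine, so whether that trade-off is acceptable for a stress test is a separate question from keeping the runtime within the timeout.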


Jerzy Husakowski (jhusakowski) wrote on 2022-09-01:

MAAS should raise the timeout it applies while waiting for the stress-ng long results, to account for the extended duration of that test on some machines.

Changed in maas:
milestone:	none → 3.3.0
assignee:	nobody → Adam Collard (adam-collard)

MAAS Lander (maas-lander) on 2022-09-01

Changed in maas:
status:	Triaged → Fix Committed

Alexsander de Souza (alexsander-souza) on 2022-10-20

Changed in maas:
milestone:	3.3.0 → 3.3.0-beta1

Alexsander de Souza (alexsander-souza) on 2022-10-20

Changed in maas:
status:	Fix Committed → Fix Released
