[2.3, HWTv2] Hardware Tests have a short timeout

Bug #1710092 reported by Robert Eikermann
This bug affects 1 person
Affects  Status        Importance  Assigned to  Milestone
MAAS     Fix Released  High        Lee Trager

Bug Description

Running Hardware Tests on machines aborts with a "Timed out" error. The test "stress-ng-cpu-long" timed out after 11:59:26, although it is designed to run for 12 hours.

Robert Eikermann (robert-eikermann) wrote :

Attachment for contents of /var/log/maas/*
- maas.log

Please let me know if you need more than maas.log

Robert Eikermann (robert-eikermann) wrote :

Content of dpkg -l '*maas*'|cat

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                            Version                              Architecture Description
+++-===============================-====================================-============-=================================================
ii  maas                            2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          "Metal as a Service" is a physical cloud and IPAM
ii  maas-cli                        2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          MAAS client and command-line interface
un  maas-cluster-controller         <none>                               <none>       (no description available)
ii  maas-common                     2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          MAAS server common files
ii  maas-dhcp                       2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          MAAS DHCP server
ii  maas-dns                        2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          MAAS DNS server
ii  maas-proxy                      2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          MAAS Caching Proxy
ii  maas-rack-controller            2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          Rack Controller for MAAS
ii  maas-region-api                 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          Region controller API service for MAAS
ii  maas-region-controller          2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          Region Controller for MAAS
un  maas-region-controller-min      <none>                               <none>       (no description available)
un  python-django-maas              <none>                               <none>       (no description available)
un  python-maas-client              <none>                               <none>       (no description available)
un  python-maas-provisioningserver  <none>                               <none>       (no description available)
ii  python3-django-maas             2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          MAAS server Django web framework (Python 3)
ii  python3-maas-client             2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          MAAS python API client (Python 3)
ii  python3-maas-provisioningserver 2.2.2-6099-g8751f91-0ubuntu1~16.04.1 all          MAAS server provisioning libraries (Python 3)

Robert Eikermann (robert-eikermann) wrote :

Reproducibility:

- Set up MAAS
- Commission a machine
- Run the hardware test stress-ng-cpu-long

Robert Eikermann (robert-eikermann) wrote :

Content as a file attachment: dpkg -l '*maas*'|cat

Nobuto Murata (nobuto) wrote :

Another example is memtester. It takes more than 6 minutes per GB of memory, and a bare-metal machine may have 100+ GB of memory.

====
Memory integrity (memtester) memory 0:12:36 Timed out
====
$ time memtester 1G 1
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 1024MB (1073741824 bytes)
got 1024MB (1073741824 bytes), trying mlock ...locked.
Loop 1/1:
  Stuck Address : ok
  Random Value : ok
  Compare XOR : ok
  Compare SUB : ok
  Compare MUL : ok
  Compare DIV : ok
  Compare OR : ok
  Compare AND : ok
  Sequential Increment: ok
  Solid Bits : ok
  Block Sequential : ok
  Checkerboard : ok
  Bit Spread : ok
  Bit Flip : ok
  Walking Ones : ok
  Walking Zeroes : ok
  8-bit Writes : ok
  16-bit Writes : ok

Done.

real 6m41.745s
user 6m41.164s
sys 0m0.516s
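
To put the measurement above in perspective, here is a rough sketch that extrapolates the 6m41.745s per GB figure to larger machines. It assumes that memtester runtime grows linearly with the amount of memory tested and with the number of loops, which is only an approximation; the function name is made up for illustration.

```python
from datetime import timedelta

# Measured above: one memtester loop over 1 GB took 6m41.745s of wall-clock time.
SECONDS_PER_GB = 6 * 60 + 41.745


def estimate_memtester_runtime(memory_gb: float, loops: int = 1) -> timedelta:
    """Estimate wall-clock time, assuming linear scaling with size and loops."""
    return timedelta(seconds=SECONDS_PER_GB * memory_gb * loops)


if __name__ == "__main__":
    # Print the estimate for a few illustrative memory sizes.
    for gb in (1, 16, 100, 128):
        print(f"{gb:>4} GB -> {estimate_memtester_runtime(gb)}")
```

Under that assumption a 100 GB machine would need roughly 11 hours for a single loop, far beyond the 0:12:36 at which the test was marked as timed out.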

Changed in maas:
status: New → Confirmed
Blake Rouse (blake-rouse) wrote :

I wonder whether these tests are being marked as timed out because the machine is under so much load that it is not pinging back to MAAS.

The machine must ping back to MAAS every 2 minutes. MAAS allows a few minutes to pass before deciding that the machine has completely locked up and killing it.

That could be what is happening here.
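
The ping-back rule described above can be sketched as a simple watchdog check. This is not MAAS code: the function name is hypothetical, and the 3-minute grace period is an assumption standing in for "a few minutes"; only the 2-minute ping interval comes from the comment.

```python
from datetime import datetime, timedelta

PING_INTERVAL = timedelta(minutes=2)  # stated in the comment above
GRACE_PERIOD = timedelta(minutes=3)   # assumption: stands in for "a few minutes"


def is_considered_locked_up(last_ping: datetime, now: datetime) -> bool:
    """Return True once no ping has arrived within the interval plus grace."""
    return now - last_ping > PING_INTERVAL + GRACE_PERIOD
```

With these values, a machine that stays silent for 4 minutes is still tolerated, while one that stays silent for 6 minutes would be treated as locked up. A heavily loaded machine that delays its ping past that window would be killed even though its test is still making progress.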

Andres Rodriguez (andreserl) wrote :

I've experienced a similar situation:

 Marking node failed - stress-ng-memory-short has run past it's timeout(0:05:00)

Note that before running the memory tests I also ran other hardware tests and didn't experience such issues with them.

 Memory integrity (stress-ng-memory-short) memory 0:10:16 Timed out

That said, in my particular scenario stress-ng was running the tests, and it has its own timer.
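
For comparing the runtimes and timeouts quoted in this report, a small helper along these lines may be handy. It is only an illustrative sketch: the function names are made up, and it assumes durations are always written as H:MM:SS as in the messages above.

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse an H:MM:SS string such as '0:05:00' into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def overrun(runtime: str, timeout: str) -> timedelta:
    """How far the runtime exceeds the timeout; zero if it stayed within it."""
    return max(parse_duration(runtime) - parse_duration(timeout), timedelta(0))
```

For the stress-ng-memory-short case, overrun("0:10:16", "0:05:00") gives 0:05:16, so the test ran more than twice as long as its timeout before being marked as failed. For the original report, overrun("11:59:26", "12:00:00") is zero: stress-ng-cpu-long was marked as timed out slightly before its designed 12 hours had elapsed.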

Changed in maas:
milestone: none → 2.3.0
importance: Undecided → High
Changed in maas:
milestone: 2.3.0 → 2.3.0beta2
summary: - Hardware Tests have a short timeout
+ [2.3, HWTv2] Hardware Tests have a short timeout
Changed in maas:
milestone: 2.3.0beta2 → 2.3.0beta3
Lee Trager (ltrager)
Changed in maas:
assignee: nobody → Lee Trager (ltrager)
status: Confirmed → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.3.0beta3 → 2.3.0beta2
Changed in maas:
status: Fix Committed → Fix Released