smartctl-validate test runs even when explicitly removed from commissioning step

Bug #1964024 reported by David A. Desrosiers
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
MAAS | Fix Released | High | Caleb Ellis |
maas-ui | Fix Released | Unknown | |

Bug Description

As titled: when commissioning machines, if you explicitly remove (uncheck) the smartctl-validate test, it runs anyway.

This causes a larger problem with concurrent commissioning: multiple smartctl-validate tests report their success/failure back to regiond and trigger race conditions, leading to commissioning failures.

Reducing the number of nodes commissioned in parallel reduces the impact of these failures, but it's still non-zero.

When tests are disabled, they should not run at all, including smartctl-validate.

Tags: ui
Alberto Donato (ack)
Changed in maas:
milestone: none → next
importance: Undecided → High
status: New → Triaged
Jerzy Husakowski (jhusakowski) wrote :

There seem to be two issues: 1) disabling scripts has no effect, and 2) smartctl-validate causes issues. We will investigate both problems.

Changed in maas:
milestone: next → 3.3.0
Alberto Donato (ack) wrote :

The issue with not being able to unselect testing scripts in the UI is due to the fact that the UI passes an empty list for testing scripts in that case.

In the backend, this means "pick the default scripts", which is currently the smartctl-validate one. To request that no scripts be run, ["none"] must be passed.

Ideally, I think the UI behaviour should match that of the API, where script names/tags are passed, not IDs.

Also note that the UI doesn't really need to pass the list of builtin commissioning scripts, since those are always included.
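
The empty-list behaviour described above can be illustrated with a minimal Python sketch. This is not the MAAS implementation: the function names `resolve_testing_scripts` and `ui_payload` and the constant `DEFAULT_TESTING_SCRIPTS` are hypothetical, and the only facts taken from this report are that an empty list selects the default scripts (currently smartctl-validate) and that ["none"] requests no scripts.

```python
from typing import List, Optional

# Hypothetical stand-in for the backend default; this report names
# smartctl-validate as the current default testing script.
DEFAULT_TESTING_SCRIPTS = ["smartctl-validate"]


def resolve_testing_scripts(requested: Optional[List[str]]) -> List[str]:
    """Return the testing scripts that would run for a given request.

    Assumed semantics, as described in the comment above:
    - empty or missing list -> fall back to the defaults
    - ["none"]              -> run no testing scripts
    - anything else         -> run exactly what was requested
    """
    if not requested:
        return list(DEFAULT_TESTING_SCRIPTS)
    if requested == ["none"]:
        return []
    return list(requested)


def ui_payload(selected: List[str]) -> List[str]:
    """Hypothetical client-side fix: send ["none"] when nothing is selected."""
    return list(selected) if selected else ["none"]


if __name__ == "__main__":
    # Unchecking everything and sending [] still runs the default script.
    print(resolve_testing_scripts([]))
    # Sending ["none"] instead results in no testing scripts being run.
    print(resolve_testing_scripts(ui_payload([])))
```

Under these assumptions, a client that sends an empty list when everything is unchecked still gets the default script, while mapping the empty selection to ["none"] yields no testing scripts.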

tags: added: ui
Changed in maas-ui:
importance: Undecided → Unknown
Changed in maas-ui:
status: New → Fix Released
Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
assignee: nobody → Caleb Ellis (caleb-ellis)
Changed in maas:
milestone: 3.3.0 → 3.3.0-beta1
Changed in maas:
status: Fix Committed → Fix Released
Brian McNally (bmcnally-uw) wrote :

Was this bug reintroduced in MAAS 3.4.0? I'm running into what look like pretty similar issues. First, smartctl-validate fails against multiple bare metal systems (it works fine against VMs), and removing it from the commissioning stage doesn't seem to do anything productive, as the node still fails on smartctl-validate.

From the ephemeral kernel running on a node, where can I look to see where/why the smartctl-validate test was aborted? Unhelpfully, the UI seems to assume the only failures you'd be interested in seeing logs for are ones where the script failed, not where it was aborted.
