[2.3] Exceptions while processing commissioning output cause timeouts rather than being appropriately surfaced

Bug #1718517 reported by Mike Pontillo
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Mike Pontillo

Bug Description

When a commissioning node in MAAS sends its output to the metadata service, MAAS will often process the data upon receiving it.

In one observed case, an error during that processing caused the metadata service to reply to the script output upload with "HTTP/1.1 400 BAD REQUEST" and close the connection[1].

There are a few things wrong with this situation:

(1) HTTP 400 errors indicate that the *client* sent in incorrect input. If an error was returned to the client, it should have been a 500 (internal server error), since there is nothing the client can do to correct it.

(2) MAAS should surface this error in the log file and continue, rather than passing it on to the script runner (which cannot do anything about it anyway). Commissioning should not be interrupted by an internal error while processing commissioning output. (If anything, the result of the script should be changed to a "warning" icon, rather than a failure or timeout.)

(3) In the event that a commissioning node encounters a failure (such as a timeout or HTTP error) while posting results via HTTP, it should continue running the remaining scripts and then retry the POSTs for a few minutes before giving up. This way, if a MAAS API endpoint restarts in the middle of commissioning, the commissioning will not be doomed to failure.

(4) The script runner on the commissioning node should log HTTP errors appropriately, so that an observer on the console or via rsyslog can diagnose issues that occur during script runs (such as commissioning or testing).
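
A minimal sketch of the behaviour proposed in (3) and (4), not the actual MAAS script runner: every script is run even if an earlier POST failed, each failure is logged, and undelivered results are retried for a bounded window. The names `run_and_report`, `run_script`, `post_result`, and the default retry window and interval are hypothetical and chosen only for illustration.

```python
import logging
import time
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("script_runner")  # hypothetical logger name


def run_and_report(
    scripts: Iterable[str],
    run_script: Callable[[str], bytes],
    post_result: Callable[[str, bytes], None],
    retry_window: float = 300.0,   # assumed: keep retrying for five minutes
    retry_interval: float = 10.0,  # assumed: pause between retry rounds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Run every script, then retry failed POSTs until the window closes.

    Returns the names of scripts whose results were never delivered.
    """
    pending: List[Tuple[str, bytes]] = []
    for name in scripts:
        output = run_script(name)
        try:
            post_result(name, output)
        except OSError as exc:  # urllib's URLError/HTTPError and socket timeouts
            # Log so the failure is visible on the console or via rsyslog.
            logger.error("POST of %s failed: %s; will retry later", name, exc)
            pending.append((name, output))

    deadline = clock() + retry_window
    while pending and clock() < deadline:
        sleep(retry_interval)
        still_pending: List[Tuple[str, bytes]] = []
        for name, output in pending:
            try:
                post_result(name, output)
            except OSError as exc:
                logger.error("Retry of %s failed: %s", name, exc)
                still_pending.append((name, output))
        pending = still_pending

    return [name for name, _ in pending]
```

Passing `sleep` and `clock` as parameters keeps the retry logic testable without real delays; a real runner would supply a `post_result` that performs the HTTP request to the metadata service.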

---

[1]:
http://paste.ubuntu.com/25579880/


summary: - [2.3] Errors while post-processing commissioning output should not cause
- commissioning to time out
+ [2.3] Exceptions while processing commissioning output cause timeouts
+ rather than being appropriately surfaced
Changed in maas:
milestone: none → 2.3.0
Changed in maas:
milestone: 2.3.0 → 2.3.0beta2
Changed in maas:
assignee: nobody → Mike Pontillo (mpontillo)
milestone: 2.3.0beta2 → 2.3.0beta3
Changed in maas:
assignee: Mike Pontillo (mpontillo) → nobody
Changed in maas:
assignee: nobody → Mike Pontillo (mpontillo)
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Mike Pontillo (mpontillo) wrote :

For the record, not every element described in the bug was fixed for this issue.

MAAS will now print a traceback in the logs, and raise a node event, in the following cases:

(1) Post-processing of a commissioning script fails
(2) Setting the default storage layout fails
(3) Setting the default networking configuration fails
(4) Recalculating the node tags fails

The first three issues will cause the node to transition to "Failed Commissioning".

In the case of (1), commissioning will continue, and commissioning script output will be recorded in the database so that the error can be investigated.
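
As a rough illustration of case (1), and not the actual MAAS code, the handling could look like the sketch below: the raw output is stored before the post-processing hook runs, and an exception in the hook is logged with a traceback and turned into a node event instead of propagating back to the client. The `Node` class, its fields, and `store_result` are hypothetical stand-ins for the real models.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Dict, List

logger = logging.getLogger("metadataserver")  # hypothetical logger name


@dataclass
class Node:
    """Hypothetical stand-in for a node record."""
    status: str = "Commissioning"
    events: List[str] = field(default_factory=list)
    stored_output: Dict[str, bytes] = field(default_factory=dict)


def store_result(
    node: Node,
    script_name: str,
    output: bytes,
    post_process: Callable[[Node, bytes], None],
) -> bool:
    """Record script output, then run its post-processing hook.

    Returns True if post-processing succeeded, False otherwise.
    """
    # Store the raw output first so it survives a failing hook
    # and the error can be investigated afterwards.
    node.stored_output[script_name] = output
    try:
        post_process(node, output)
    except Exception:
        # logger.exception() writes the message plus the full traceback.
        logger.exception("Post-processing of %s failed", script_name)
        node.events.append(f"Failed to process output of {script_name}")
        node.status = "Failed Commissioning"
        return False
    return True
```

Because the function returns rather than raising, the caller can keep accepting results from the remaining scripts.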

Changed in maas:
status: Fix Committed → Fix Released