[2.3] Exceptions while processing commissioning output cause timeouts rather than being appropriately surfaced
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Fix Released | Critical | Mike Pontillo |
Bug Description
When a commissioning node in MAAS sends its output to the metadata service, MAAS will often process the data upon receiving it.
For example, a bug was observed upon sending the script output to the metadata service, whereby it replied with "HTTP/1.1 400 BAD REQUEST" and closed the connection[1].
There are a few things wrong with this situation:
(1) HTTP 400 errors indicate that the *client* sent in incorrect input. If an error was returned to the client, it should have been a 500 (internal server error), since there is nothing the client can do to correct it.
(2) MAAS should surface this error in the log file and continue, rather than passing it on to the script runner (which cannot do anything about it anyway). Commissioning should not be interrupted by an internal error while processing commissioning output. (If anything, the result of the script should be changed to a "warning" icon, rather than a failure or timeout.)
(3) In the event that a commissioning node encounters a failure (such as a timeout or HTTP error) while posting results via HTTP, it should continue running the remaining scripts and then retry the POSTs for a few minutes before giving up. This way, if a MAAS API endpoint restarts in the middle of commissioning, the commissioning will not be doomed to failure.
(4) The script runner on the commissioning node should log HTTP errors appropriately, so that an observer on the console or via rsyslog can diagnose issues that occur during script runs (such as commissioning or testing).
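Points (3) and (4) could be sketched roughly as follows. This is a minimal illustration and not the actual MAAS script runner: `run_and_report`, `post_result`, and the retry parameters are hypothetical names chosen for the example, and the uploader is assumed to raise `OSError` (which covers `urllib.error.HTTPError` and socket timeouts) on failure.

```python
import logging
import time

log = logging.getLogger("script_runner")


def run_and_report(scripts, post_result, retry_window=300.0,
                   retry_interval=10.0, sleep=time.sleep,
                   clock=time.monotonic):
    """Run every script, then retry any result uploads that failed.

    `scripts` is a list of (name, callable) pairs. `post_result(name, output)`
    is a hypothetical uploader that raises OSError on an HTTP or network
    failure. Returns the names of results that could not be delivered.
    """
    pending = []
    for name, run in scripts:
        output = run()
        try:
            post_result(name, output)
        except OSError as exc:
            # Log the failure so it shows up on the console or via rsyslog,
            # then carry on with the remaining scripts.
            log.error("Posting result for %s failed: %s", name, exc)
            pending.append((name, output))

    # Keep retrying the failed uploads until the retry window has elapsed.
    deadline = clock() + retry_window
    while pending and clock() < deadline:
        sleep(retry_interval)
        still_pending = []
        for name, output in pending:
            try:
                post_result(name, output)
            except OSError as exc:
                log.warning("Retry for %s failed: %s", name, exc)
                still_pending.append((name, output))
        pending = still_pending
    return [name for name, _ in pending]
```

Deferring the retries until every script has run means that a brief restart of the API endpoint delays reporting without interrupting the scripts themselves.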
---
Related branches
- Lee Trager (community): Approve
- Diff: 273 lines (+162/-12), 4 files modified
  - src/metadataserver/api.py (+69/-9)
  - src/metadataserver/models/scriptresult.py (+9/-2)
  - src/metadataserver/models/tests/test_scriptresult.py (+25/-1)
  - src/metadataserver/tests/test_api.py (+59/-0)
summary: |
- [2.3] Errors while post-processing commissioning output should not cause
- commissioning to time out
+ [2.3] Exceptions while processing commissioning output cause timeouts
+ rather than being appropriately surfaced
Changed in maas: | |
milestone: | none → 2.3.0 |
Changed in maas: | |
milestone: | 2.3.0 → 2.3.0beta2 |
Changed in maas: | |
assignee: | nobody → Mike Pontillo (mpontillo) |
milestone: | 2.3.0beta2 → 2.3.0beta3 |
Changed in maas: | |
assignee: | Mike Pontillo (mpontillo) → nobody |
Changed in maas: | |
assignee: | nobody → Mike Pontillo (mpontillo) |
Changed in maas: | |
status: | Triaged → In Progress |
Changed in maas: | |
status: | In Progress → Fix Committed |
Changed in maas: | |
status: | Fix Committed → Fix Released |
For the record, not every element described in the bug was fixed for this issue.
MAAS will now print a traceback in the logs and raise a node event in the following cases:
(1) Post-processing of a commissioning script fails
(2) Setting the default storage layout fails
(3) Setting the default networking configuration fails
(4) Recalculating the node tags fails
The first three issues will cause the node to transition to "Failed Commissioning".
In the case of (1), commissioning will continue, and commissioning script output will be recorded in the database so that the error can be investigated.
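The behavior described for case (1) can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the merged branch: `store_result`, `hooks`, `database`, and `events` are hypothetical stand-ins for the metadata server's handler, its post-processing hooks, its result storage, and its node event log.

```python
import logging

log = logging.getLogger("metadataserver")


def store_result(name, output, hooks, database, events):
    """Save raw script output, then run an optional post-processing hook.

    All parameters are illustrative stand-ins: `hooks` maps script names to
    callables, `database` is a dict of stored output, and `events` is a list
    of node event messages. Returns True if post-processing succeeded or no
    hook was registered, False if the hook raised.
    """
    # Record the output first so it is kept even if the hook fails.
    database[name] = output
    hook = hooks.get(name)
    if hook is None:
        return True
    try:
        hook(output)
    except Exception:
        # log.exception writes the message together with the traceback.
        log.exception("Post-processing of %s failed", name)
        events.append("Failed to process commissioning output: %s" % name)
        return False
    return True
```

Because the exception is caught inside the handler, the client is never sent an error it cannot act on, and the stored output remains available for investigation.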