system node install fails if lighttpd restarts during the ostree pull
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Medium | Eric MacDonald | |
Bug Description
Brief Description
-----------------
System node installs fail if the lighttpd process is restarted on the active controller, or if a single connection failure occurs, during the ostree network pull operation.
Severity
--------
Major: Install can fail
Steps to Reproduce
------------------
start system node install
monitor kickstart logs on console output
run "sm-restart-safe service lighttpd" on the active controller after the ostree pull operation starts
Expected Behavior
------------------
install succeeds
Actual Behavior
----------------
install fails
Reproducibility
---------------
Infrequently in a real system but 100% of the time if lighttpd is restarted during install
System Configuration
-------
DX systems
Branch/Pull Time/Commit
-------
Any up to June 3, 2024
Timestamp/Logs
--------------
GPG: Verification enabled, found 1 signature:
Signature made Wed Jun 6 13:30:27 2024 using RSA key ID CFA856DFC7CB87BE
Good signature from "Wind-River-
Receiving metadata objects: 4621/(estimating) 677.4 kB/s 2.7 MB
error: Could not connect: Connection refused
Installation failed.
2024-06-06 13:30:36.865 kickstart mkfs warn: All-in-one Installation Failed: ERROR: ostree pull failed, rc=1
Test Activity
-------------
Developer Testing
Workaround
----------
Retry install
Changed in starlingx:
status: New → In Progress
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.10.0 stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Reviewed: https://review.opendev.org/c/starlingx/metal/+/921407
Committed: https://opendev.org/starlingx/metal/commit/55aa34dde7dcfd7543af1db0c4a36ccd13b05711
Submitter: "Zuul (22348)"
Branch: master
commit 55aa34dde7dcfd7543af1db0c4a36ccd13b05711
Author: Eric MacDonald <email address hidden>
Date: Wed Jun 5 18:32:53 2024 +0000
Add a ostree pull retry if the first pull attempt fails
This update adds 2 retries to the ostree pull operation for
system node or remote subcloud installs in attempt to handle
transient http server connection failure during the pull
operation.
Test Plan:
PASS: Verify successful system node install with retry code present.
PASS: Verify successful handling of a single transient connection loss
PASS: Verify handling of a persistent connection loss ; all tries fail
PASS: Verify successful subcloud install with retry code present.
PASS: Verify subcloud ostree pull retry handling ; 1,2 and max failures
Closes-Bug: 2068651
Change-Id: I538d5a179188966882c494731111d36b89c03415
Signed-off-by: Eric MacDonald <email address hidden>
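
For readers who want to see the shape of the fix, the retry behavior described in the commit message (one initial pull plus 2 retries, giving up if every try fails) can be sketched roughly as follows. This is only an illustration in Python, not the committed kickstart change; the function name pull_with_retries, the delay between tries, and the injectable run/sleep hooks are all assumptions made for the example.

```python
import subprocess
import time
from typing import Callable, Sequence


def _run_command(cmd: Sequence[str]) -> int:
    """Run cmd and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def pull_with_retries(
    cmd: Sequence[str],
    retries: int = 2,            # the commit message says 2 retries
    delay_secs: float = 5.0,     # assumption: the bug report gives no delay
    run: Callable[[Sequence[str]], int] = _run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd, retrying up to `retries` more times on a non-zero exit code.

    Returns 0 as soon as one attempt succeeds, otherwise the exit code
    of the last failed attempt.
    """
    attempts = retries + 1  # first try plus the retries
    rc = 1
    for attempt in range(1, attempts + 1):
        rc = run(cmd)
        if rc == 0:
            return 0
        print(f"pull attempt {attempt}/{attempts} failed, rc={rc}")
        if attempt < attempts:
            # Give a restarting HTTP server a moment to come back.
            sleep(delay_secs)
    return rc
```

A caller would pass the pull command as an argument list, for example ["ostree", "pull", "--repo=<repo path>", "<remote>", "<ref>"], where the repo path, remote, and ref are placeholders rather than values taken from this bug. A transient "Connection refused" like the one in the log above would then cost one failed attempt instead of failing the whole install, while a persistent outage still fails after the third try.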