system node install fails if lighttpd restarts during the ostree pull

Bug #2068651 reported by Eric MacDonald
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

Brief Description
-----------------
System node installs fail if the lighttpd process is restarted on the active controller, or if even a single connection failure occurs, during the ostree network pull operation.

Severity
--------

Major: Install can fail

Steps to Reproduce
------------------
Start a system node install.
Monitor the kickstart logs on the console output.
Run "sm-restart-safe service lighttpd" on the active controller after the ostree pull operation starts.

Expected Behavior
------------------
install succeeds

Actual Behavior
----------------
install fails

Reproducibility
---------------
Infrequent on a real system, but 100% reproducible if lighttpd is restarted during the install.

System Configuration
--------------------
DX systems

Branch/Pull Time/Commit
-----------------------
Any up to June 3, 2024

Timestamp/Logs
--------------

GPG: Verification enabled, found 1 signature:

  Signature made Wed Jun 6 13:30:27 2024 using RSA key ID CFA856DFC7CB87BE
  Good signature from "Wind-River-Linux-Sample <email address hidden>"
Receiving metadata objects: 4621/(estimating) 677.4 kB/s 2.7 MB
error: Could not connect: Connection refused

Installation failed.

2024-06-06 13:30:36.865 kickstart mkfs warn: All-in-one Installation Failed: ERROR: ostree pull failed, rc=1

Test Activity
-------------
Developer Testing

Workaround
----------
Retry install

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/921407
Committed: https://opendev.org/starlingx/metal/commit/55aa34dde7dcfd7543af1db0c4a36ccd13b05711
Submitter: "Zuul (22348)"
Branch: master

commit 55aa34dde7dcfd7543af1db0c4a36ccd13b05711
Author: Eric MacDonald <email address hidden>
Date: Wed Jun 5 18:32:53 2024 +0000

    Add a ostree pull retry if the first pull attempt fails

    This update adds 2 retries to the ostree pull operation for
    system node or remote subcloud installs in attempt to handle
    transient http server connection failure during the pull
    operation.

    Test Plan:

    PASS: Verify successful system node install with retry code present.
    PASS: Verify successful handling of a single transient connection loss
    PASS: Verify handling of a persistent connection loss ; all tries fail
    PASS: Verify successful subcloud install with retry code present.
    PASS: Verify subcloud ostree pull retry handling ; 1,2 and max failures

    Closes-Bug: 2068651
    Change-Id: I538d5a179188966882c494731111d36b89c03415
    Signed-off-by: Eric MacDonald <email address hidden>
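
The retry behavior described in the commit message can be sketched roughly as follows. This is an illustration only, not the merged change: the function names, the delay between attempts, and the injectable `run` parameter are all assumptions made for the example. The one detail taken from the commit message is the attempt count, one initial pull plus 2 retries.

```python
import subprocess
import time
from typing import Callable, Sequence

# One initial pull plus 2 retries, per the commit message.
MAX_ATTEMPTS = 3


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (hypothetical helper)."""
    return subprocess.run(list(cmd)).returncode


def pull_with_retry(
    cmd: Sequence[str],
    run: Callable[[Sequence[str]], int] = run_command,
    max_attempts: int = MAX_ATTEMPTS,
    delay_secs: float = 5.0,  # assumed pause; the source does not state one
) -> int:
    """Run cmd until it exits 0 or max_attempts is used up.

    Returns 0 on success, otherwise the exit code of the last attempt.
    """
    rc = 1
    for attempt in range(1, max_attempts + 1):
        rc = run(cmd)
        if rc == 0:
            return 0
        print(f"pull attempt {attempt}/{max_attempts} failed, rc={rc}")
        if attempt < max_attempts:
            # Give a restarting HTTP server time to come back.
            time.sleep(delay_secs)
    return rc
```

A caller would pass the real pull command, for example `pull_with_retry(["ostree", "pull", "--repo=<repo path>", "<remote>", "<ref>"])`, where the repo path, remote, and ref are placeholders. A single transient "Connection refused" then costs one retry, while a persistent outage still fails after the third attempt, matching the cases in the test plan.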

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.10.0 stx.metal
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)