controller-1 install failed due to failed redfish netboot command

Bug #1880578 reported by Eric MacDonald
This bug affects 1 person
Affects     Status         Importance   Assigned to
StarlingX   Fix Released   Medium       Eric MacDonald

Bug Description

Brief Description
-----------------
  DC-3 system install failed due to controller-1 redfish netboot command failure.

Severity
--------
  Major: system install failed

Steps to Reproduce
------------------
  System install

Expected Behavior
------------------
  System installs correctly. All bmc commands succeed.

Actual Behavior
----------------
  System install failed for controller-1 when the netboot command failed for an unknown reason.

Reproducibility
---------------
Intermittent - rare

System Configuration
--------------------
  DC-3 - IPV6

Branch/Pull Time/Commit
-----------------------
  BUILD_DATE="2020-03-26 19:41:10 -0400"

Last Pass
---------
Did this test scenario pass previously? Yes; the failure is intermittent.

Timestamp/Logs
--------------

2020-03-27T19:18:44.216 fmAPI.cpp(490): Enqueue raise alarm request: UUID (a51a12e4-2d37-4434-84e7-b599cc60ab42) alarm id (200.021) instant id (host=controller-1.command=reinstall)
2020-03-27T19:18:44.216 [123316.00215] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6327) bmc_handler : Info : controller-1 bmc credentials received
2020-03-27T19:18:44.216 [123316.00216] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstalling (seq:3)
2020-03-27T19:18:44.216 [123316.00217] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4118) reinstall_handler : Warn : controller-1 Reinstall wait for BMC access ; 600 second timeout
2020-03-27T19:18:44.216 [123316.00218] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Wait ; BMC not accessible (seq:4)
2020-03-27T19:18:44.227 [123316.00219] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6395) bmc_handler : Info : controller-1 bmc communication protocol discovery
2020-03-27T19:18:44.244 fmAlarmUtils.cpp(624): Sending FM raise alarm request: alarm_id (200.021), entity_id (host=controller-1.command=reinstall)
2020-03-27T19:18:44.285 fmAlarmUtils.cpp(658): FM Response for raise alarm: (0), alarm_id (200.021), entity_id (host=controller-1.command=reinstall)
2020-03-27T19:18:47.227 [123316.00220] controller-0 mtcAgent --- redfishUtil.cpp ( 305) redfishUtil_is_supported: Info : controller-1 bmc supports redfish version 1.7.0
2020-03-27T19:18:47.227 [123316.00221] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6465) bmc_handler : Info : controller-1 bmc control using redfishtool:2620:10a:a001:a102::125
2020-03-27T19:18:55.238 [123316.00222] controller-0 mtcAgent --- redfishUtil.cpp ( 533) redfishUtil_get_bmc_info: Info : controller-1 power is off
2020-03-27T19:18:55.238 [123316.00223] controller-0 mtcAgent --- redfishUtil.cpp ( 544) redfishUtil_get_bmc_info: Info : controller-1 manufacturer is Intel Corporation ; model:S2600WFQ part:J44810-003 serial:BQWT80101028
2020-03-27T19:18:55.238 [123316.00224] controller-0 mtcAgent --- redfishUtil.cpp ( 551) redfishUtil_get_bmc_info: Info : controller-1 BIOS fw version SE5C620.86B.02.01.0010.010620200716
2020-03-27T19:18:55.238 [123316.00225] controller-0 mtcAgent --- redfishUtil.cpp ( 153) _load_action_lists : Info : controller-1 bmc actions ; reset:ForceRestart power-on:On,ForceOn power-off:GracefulShutdown,ForceOff
2020-03-27T19:18:55.238 [123316.00226] controller-0 mtcAgent --- redfishUtil.cpp ( 602) redfishUtil_get_bmc_info: Info : controller-1 has 2 Processors ; Enabled and OK:OK
2020-03-27T19:18:55.238 [123316.00227] controller-0 mtcAgent --- redfishUtil.cpp ( 625) redfishUtil_get_bmc_info: Info : controller-1 has 192 GiB Memory ; Enabled and OK:OK
2020-03-27T19:18:55.238 [123316.00228] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6545) bmc_handler : Info : controller-1 bmc audit timer started (120 secs)
2020-03-27T19:18:55.238 [123316.00229] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6561) bmc_handler : Info : controller-1 bmc is accessible using redfish
2020-03-27T19:18:55.238 [123316.00230] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 add host sent to hwmond
2020-03-27T19:18:55.238 [123316.00231] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 start host service sent to hwmond
2020-03-27T19:18:55.238 [123316.00232] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4162) reinstall_handler : Info : controller-1 BMC access established ; starting install
2020-03-27T19:18:55.248 [123316.00233] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstalling (seq:7)
2020-03-27T19:19:00.258 [123316.00234] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4275) reinstall_handler : Info : controller-1 Reinstall power-off already
2020-03-27T19:19:00.268 [123316.00235] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4346) reinstall_handler : Info : controller-1 Reinstall netboot request sent
2020-03-27T19:19:05.269 [123316.00236] controller-0 mtcAgent --- mtcBmcUtil.cpp ( 214) bmc_command_recv :Error : controller-1 bmc redfish Netboot command failed (redfishtool) (data:) (rc:108:108:system call failed)
2020-03-27T19:19:05.269 [123316.00237] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4371) reinstall_handler :Error : controller-1 Reinstall netboot receive failed (rc:108)
2020-03-27T19:19:05.269 [123316.00238] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Failed ; netboot request (seq:8)
2020-03-27T19:19:05.279 fmAPI.cpp(490): Enqueue raise alarm request: UUID (ca566554-db16-4c0a-94a3-0a973ed85047) alarm id (200.022) instant id (host=controller-1.status=reinstall-failed)
2020-03-27T19:19:35.290 [123316.00239] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4600) reinstall_handler : Info : controller-1 Reinstall complete ; operation failure

The first few logs show that redfish bmc accesses were working, yet the netboot command then failed for no apparent reason.

Suspected cause: a race between power/reset/reinstall command handling and the bmc connectivity audit, which are not mutually exclusive. There is no connectivity audit in the ipmi protocol case. The bmc in-service audit was added as an enhancement for redfish to gauge how reliable the protocol is.

Test Activity
-------------
  Sanity, Developer Testing

Workaround
----------
  Retry failed command

Options
-------
  Fix options are still being investigated.

  1. stop the in-service bmc query audit.
     - ping failure audit still detects complete loss of bmc connectivity ; same as ipmi.
  2. add retry to the netboot command in the re-install handling sequence
     - ensure all bmc commands in the reinstall handler are retried
  3. add mutex controls to bmc power/reset/reinstall command handling to avoid collision with bmc audit.
     - hold off one bmc access service until another that is in progress completes.

An update that implements the mutual exclusion described in option 3 has been written and is being tested.
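
The mutual exclusion idea in option 3 can be sketched roughly as follows. This is only an illustration in Python, not the mtcAgent code: the class name `BmcAccess`, its `run` method, and the counters are hypothetical, and a real fix would live in the C++ maintenance handlers. The point is that the audit and the power/reset/reinstall commands share one lock per host, so one bmc access is held off until another in progress completes.

```python
import threading


class BmcAccess:
    """Hypothetical per-host gate that allows one BMC request at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.in_flight = 0      # requests currently talking to the BMC
        self.max_in_flight = 0  # high-water mark, kept only for demonstration

    def run(self, request):
        # Hold off this request until any request already in progress completes.
        with self._lock:
            self.in_flight += 1
            self.max_in_flight = max(self.max_in_flight, self.in_flight)
            try:
                return request()
            finally:
                self.in_flight -= 1
```

Both the connectivity audit and the reinstall handler would call `run()` on the same `BmcAccess` object for a given host, so an audit query could not overlap a netboot request.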

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil)
description: updated
Ghada Khalil (gkhalil) wrote :

Marking as low priority. As per the reporter, the issue is rare. Also this is not commercial hardware.

Changed in starlingx:
importance: Undecided → Low
tags: added: stx.metal
Changed in starlingx:
status: New → Triaged
Eric MacDonald (rocksolidmtce) wrote :

The netboot command in the reinstall handler needs retries to accommodate BMCs that sometimes fail the command, as seen in this example.

Retry handling has been implemented and tested, with positive results, on hardware that previously exhibited the issue.
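
A minimal sketch of this kind of retry handling, written in Python for illustration only. The function name `send_with_retries`, its parameters, and the default retry count are assumptions and do not come from the actual fix; the only detail taken from this report is the failing return code 108 seen in the logs above.

```python
import time


def send_with_retries(send_command, max_retries=3, retry_delay_secs=0.0):
    """Call send_command() until it returns 0 or the retries are exhausted.

    send_command is any callable returning an integer rc (0 means success).
    Returns a (rc, attempts) tuple. All names here are hypothetical.
    """
    attempts = 0
    rc = None
    while attempts <= max_retries:
        attempts += 1
        rc = send_command()
        if rc == 0:
            break
        if attempts <= max_retries:
            # Give a transient BMC failure time to clear before retrying.
            time.sleep(retry_delay_secs)
    return rc, attempts
```

With `max_retries=3` a command is attempted at most four times, so a single transient failure like the one logged here would be absorbed by the first retry.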

Ghada Khalil (gkhalil) wrote :

Raised the priority to medium and tagged for stx.5.0, as a similar issue was seen on commercial servers as well. A robustness fix which adds retries will avoid this issue.

tags: added: stx.5.0
Changed in starlingx:
importance: Low → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/761760

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/761760
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=11960566125e395e2556af1719778d737d4b86e5
Submitter: Zuul
Branch: master

commit 11960566125e395e2556af1719778d737d4b86e5
Author: Eric MacDonald <email address hidden>
Date: Fri Nov 6 09:21:22 2020 -0500

    Disable Redfish BMC audit and improve reinstall failure handling

    The Mtce Reinstall Handler can collide with the BMC Redfish
    audit resulting in reinstall failure. BMC handler's 2 minute
    connection audit can collide with other BMC commands.

    The reinstall handler, with 4 bmc command operations, is
    particularly susceptible.

    Two additional bmc communication improvements are implemented:

    1. Add 'retry' handling to all BMC requests in the Maintenance
       Reinstall Handler FSM to handle transient command failures.

       Note: There are already retries to all but the power status
       query and the netboot requests in that handler and retries
       in other administrative commands that involve bmc requests.

    2. Switch BMC power control command management from 'static' to
       'learned' lists. Some BMCs don't support both graceful and
       immediate power commands; Graceful Restart and Force Restart.
       To remove the possibility of using an unsupported BMC command,
       this update switches from static to learned power command lists
       with log produced if a server is missing command support.

       Power commands escalate from graceful to immediate in the
       presence of retries.

    Test Cases:

    PASS: Verify bmc handler redfish audit is disabled
    PASS: Verify reinstall soak using redfish
    PASS: Verify reinstall netboot and power status retry handling
    PASS: Verify all power control commands using redfish
    PASS: Verify graceful operations are used if available
    PASS: Verify immediate operations are used for retries

    Regression:

    PASS: Verify bmc ping audit success and failure handling

    PASS: Verify Reset Handling soak (redfish and ipmi)
    PASS: Verify Power-Off/On Handling soak (redfish and ipmi)
    PASS: Verify Reinstall Handling soak (redfish and ipmi)
    PASS: Verify Standard System Install (redfish and ipmi)
    PASS: Verify AIO DX System Install (redfish and ipmi)

    PASS: Verify this update as a patch

    Change-Id: Idb484512ccb1b16e2d0ea9aff4ab7965347b1322
    Closes-Bug: 1880578
    Signed-off-by: Eric MacDonald <email address hidden>
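
The 'learned' power command lists described in the commit message can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the committed C++ code: `learn_actions` and `pick_command` are hypothetical names, the input format mirrors the `bmc actions` log line shown earlier in this report, and the sketch assumes each list is ordered from graceful to immediate.

```python
def learn_actions(action_log):
    """Parse 'name:Cmd1,Cmd2 name2:Cmd3' into {name: [Cmd1, Cmd2], ...}."""
    learned = {}
    for field in action_log.split():
        name, _, commands = field.partition(":")
        learned[name] = commands.split(",") if commands else []
    return learned


def pick_command(learned, action, retry_count):
    """Pick the first learned command initially and the last one on retries.

    Assumes the list is ordered graceful first, immediate last.
    Returns None when the server reported no command for this action.
    """
    commands = learned.get(action, [])
    if not commands:
        return None
    return commands[0] if retry_count == 0 else commands[-1]
```

For the controller-1 log line above, the first power-off attempt would use `GracefulShutdown` and a retry would escalate to `ForceOff`, while reset has only `ForceRestart` available.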

Changed in starlingx:
status: In Progress → Fix Released