Comment 0 for bug 1880578

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Brief Description
-----------------
  DC-3 system install failed due to controller-1 redfish netboot command failure.

Severity
--------
  Major: system install failed

Steps to Reproduce
------------------
  System install

Expected Behavior
------------------
  System installs correctly. All bmc commands succeed.

Actual Behavior
----------------
  System install failed for controller-1 when the netboot command failed for unknown reason.

Reproducibility
---------------
Intermittent - rare

System Configuration
--------------------
  DC-3 - IPV6

Branch/Pull Time/Commit
-----------------------
  SW_VERSION="20.03"
  BUILD_TARGET="Host Installer"
  BUILD_TYPE="Formal"
  BUILD_ID="2020-03-26_19-39-18"
  SRC_BUILD_ID="6"
  JOB="WRCP_20.03_Build"
  BUILD_BY="jenkins"
  BUILD_NUMBER="6"
  BUILD_HOST="yow-cgts4-lx.wrs.com"
  BUILD_DATE="2020-03-26 19:41:10 -0400"

Last Pass
---------
Did this test scenario pass previously? Yes, is intermittent

Timestamp/Logs
--------------

From /folk/cgts/logs/CGTS-17697

2020-03-27T19:18:44.216 fmAPI.cpp(490): Enqueue raise alarm request: UUID (a51a12e4-2d37-4434-84e7-b599cc60ab42) alarm id (200.021) instant id (host=controller-1.command=reinstall)
2020-03-27T19:18:44.216 [123316.00215] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6327) bmc_handler : Info : controller-1 bmc credentials received
2020-03-27T19:18:44.216 [123316.00216] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstalling (seq:3)
2020-03-27T19:18:44.216 [123316.00217] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4118) reinstall_handler : Warn : controller-1 Reinstall wait for BMC access ; 600 second timeout
2020-03-27T19:18:44.216 [123316.00218] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Wait ; BMC not accessible (seq:4)
579 2020-03-27T19:18:44.227 [123316.00219] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6395) bmc_handler : Info : controller-1 bmc communication protocol discovery
2020-03-27T19:18:44.244 fmAlarmUtils.cpp(624): Sending FM raise alarm request: alarm_id (200.021), entity_id (host=controller-1.command=reinstall)
2020-03-27T19:18:44.285 fmAlarmUtils.cpp(658): FM Response for raise alarm: (0), alarm_id (200.021), entity_id (host=controller-1.command=reinstall)
582 2020-03-27T19:18:47.227 [123316.00220] controller-0 mtcAgent --- redfishUtil.cpp ( 305) redfishUtil_is_supported: Info : controller-1 bmc supports redfish version 1.7.0
583 2020-03-27T19:18:47.227 [123316.00221] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6465) bmc_handler : Info : controller-1 bmc control using redfishtool:2620:10a:a001:a102::125
584 2020-03-27T19:18:55.238 [123316.00222] controller-0 mtcAgent --- redfishUtil.cpp ( 533) redfishUtil_get_bmc_info: Info : controller-1 power is off
2020-03-27T19:18:55.238 [123316.00223] controller-0 mtcAgent --- redfishUtil.cpp ( 544) redfishUtil_get_bmc_info: Info : controller-1 manufacturer is Intel Corporation ; model:S2600WFQ part:J44810-003 serial:BQWT80101028
2020-03-27T19:18:55.238 [123316.00224] controller-0 mtcAgent --- redfishUtil.cpp ( 551) redfishUtil_get_bmc_info: Info : controller-1 BIOS fw version SE5C620.86B.02.01.0010.010620200716
2020-03-27T19:18:55.238 [123316.00225] controller-0 mtcAgent --- redfishUtil.cpp ( 153) _load_action_lists : Info : controller-1 bmc actions ; reset:ForceRestart power-on:On,ForceOn power-off:GracefulShutdown,ForceOff
2020-03-27T19:18:55.238 [123316.00226] controller-0 mtcAgent --- redfishUtil.cpp ( 602) redfishUtil_get_bmc_info: Info : controller-1 has 2 Processors ; Enabled and OK:OK
2020-03-27T19:18:55.238 [123316.00227] controller-0 mtcAgent --- redfishUtil.cpp ( 625) redfishUtil_get_bmc_info: Info : controller-1 has 192 GiB Memory ; Enabled and OK:OK
2020-03-27T19:18:55.238 [123316.00228] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6545) bmc_handler : Info : controller-1 bmc audit timer started (120 secs)
2020-03-27T19:18:55.238 [123316.00229] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6561) bmc_handler : Info : controller-1 bmc is accessible using redfish
2020-03-27T19:18:55.238 [123316.00230] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 add host sent to hwmond
2020-03-27T19:18:55.238 [123316.00231] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 start host service sent to hwmond
2020-03-27T19:18:55.238 [123316.00232] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4162) reinstall_handler : Info : controller-1 BMC access established ; starting install
2020-03-27T19:18:55.248 [123316.00233] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstalling (seq:7)
2020-03-27T19:19:00.258 [123316.00234] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4275) reinstall_handler : Info : controller-1 Reinstall power-off already
2020-03-27T19:19:00.268 [123316.00235] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4346) reinstall_handler : Info : controller-1 Reinstall netboot request sent
2020-03-27T19:19:05.269 [123316.00236] controller-0 mtcAgent --- mtcBmcUtil.cpp ( 214) bmc_command_recv :Error : controller-1 bmc redfish Netboot command failed (redfishtool) (data:) (rc:108:108:system call failed)
2020-03-27T19:19:05.269 [123316.00237] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4371) reinstall_handler :Error : controller-1 Reinstall netboot receive failed (rc:108)
2020-03-27T19:19:05.269 [123316.00238] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Failed ; netboot request (seq:8)
2020-03-27T19:19:05.279 fmAPI.cpp(490): Enqueue raise alarm request: UUID (ca566554-db16-4c0a-94a3-0a973ed85047) alarm id (200.022) instant id (host=controller-1.status=reinstall-failed)
2020-03-27T19:19:35.290 [123316.00239] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4600) reinstall_handler : Info : controller-1 Reinstall complete ; operation failure

The first few logs sow that redfish bmc accesses were fine only to see the netboot command fail for no apparent reason.

Suspect a mutual exclusion race condition of power/reset/reinstall command handling with the bmc connectivity audit. There is no connectivity audit in ipmi protocol case. The bmc in-service audit was an enhancement for redfish to get a feeling for how reliable the protocol is.

Test Activity
-------------
  Sanity, Developer Testing

Workaround
----------
  Retry failed command

Options: Fix options still being investigated
-------

  1. stop the in-service bmc query audit.
     - ping failure audit still detects complete loss of bmc connectivity ; same as ipmi.
  2. add retry to the netboot command in the re-install handling sequence
     - ensure all bmc commands in the reinstall handler are retried
  3. add mutex controls to bmc power/reset/reinstall command handling to avoid collision with bmc audit.
     - hold off one bmc access service until another that is in progress completes.

An update that implements the mutual exclusion for option 3 is implemented and being tested.