Activity log for bug #1880578

Date Who What changed Old value New value Message
2020-05-25 15:29:35 Eric MacDonald bug added bug
2020-05-25 15:29:40 Eric MacDonald starlingx: assignee Eric MacDonald (rocksolidmtce)
2020-05-26 02:35:52 Ghada Khalil description

Old value:

Brief Description
-----------------
DC-3 system install failed due to controller-1 redfish netboot command failure.

Severity
--------
Major: system install failed

Steps to Reproduce
------------------
System install

Expected Behavior
------------------
System installs correctly. All bmc commands succeed.

Actual Behavior
----------------
System install failed for controller-1 when the netboot command failed for an unknown reason.

Reproducibility
---------------
Intermittent - rare

System Configuration
--------------------
DC-3 - IPV6

Branch/Pull Time/Commit
-----------------------
SW_VERSION="20.03"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2020-03-26_19-39-18"
SRC_BUILD_ID="6"
JOB="WRCP_20.03_Build"
BUILD_BY="jenkins"
BUILD_NUMBER="6"
BUILD_HOST="yow-cgts4-lx.wrs.com"
BUILD_DATE="2020-03-26 19:41:10 -0400"

Last Pass
---------
Did this test scenario pass previously? Yes, the failure is intermittent

Timestamp/Logs
--------------
From /folk/cgts/logs/CGTS-17697

2020-03-27T19:18:44.216 fmAPI.cpp(490): Enqueue raise alarm request: UUID (a51a12e4-2d37-4434-84e7-b599cc60ab42) alarm id (200.021) instant id (host=controller-1.command=reinstall)
2020-03-27T19:18:44.216 [123316.00215] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6327) bmc_handler : Info : controller-1 bmc credentials received
2020-03-27T19:18:44.216 [123316.00216] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstalling (seq:3)
2020-03-27T19:18:44.216 [123316.00217] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4118) reinstall_handler : Warn : controller-1 Reinstall wait for BMC access ; 600 second timeout
2020-03-27T19:18:44.216 [123316.00218] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Wait ; BMC not accessible (seq:4)
2020-03-27T19:18:44.227 [123316.00219] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6395) bmc_handler : Info : controller-1 bmc communication protocol discovery
2020-03-27T19:18:44.244 fmAlarmUtils.cpp(624): Sending FM raise alarm request: alarm_id (200.021), entity_id (host=controller-1.command=reinstall)
2020-03-27T19:18:44.285 fmAlarmUtils.cpp(658): FM Response for raise alarm: (0), alarm_id (200.021), entity_id (host=controller-1.command=reinstall)
2020-03-27T19:18:47.227 [123316.00220] controller-0 mtcAgent --- redfishUtil.cpp ( 305) redfishUtil_is_supported: Info : controller-1 bmc supports redfish version 1.7.0
2020-03-27T19:18:47.227 [123316.00221] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6465) bmc_handler : Info : controller-1 bmc control using redfishtool:2620:10a:a001:a102::125
2020-03-27T19:18:55.238 [123316.00222] controller-0 mtcAgent --- redfishUtil.cpp ( 533) redfishUtil_get_bmc_info: Info : controller-1 power is off
2020-03-27T19:18:55.238 [123316.00223] controller-0 mtcAgent --- redfishUtil.cpp ( 544) redfishUtil_get_bmc_info: Info : controller-1 manufacturer is Intel Corporation ; model:S2600WFQ part:J44810-003 serial:BQWT80101028
2020-03-27T19:18:55.238 [123316.00224] controller-0 mtcAgent --- redfishUtil.cpp ( 551) redfishUtil_get_bmc_info: Info : controller-1 BIOS fw version SE5C620.86B.02.01.0010.010620200716
2020-03-27T19:18:55.238 [123316.00225] controller-0 mtcAgent --- redfishUtil.cpp ( 153) _load_action_lists : Info : controller-1 bmc actions ; reset:ForceRestart power-on:On,ForceOn power-off:GracefulShutdown,ForceOff
2020-03-27T19:18:55.238 [123316.00226] controller-0 mtcAgent --- redfishUtil.cpp ( 602) redfishUtil_get_bmc_info: Info : controller-1 has 2 Processors ; Enabled and OK:OK
2020-03-27T19:18:55.238 [123316.00227] controller-0 mtcAgent --- redfishUtil.cpp ( 625) redfishUtil_get_bmc_info: Info : controller-1 has 192 GiB Memory ; Enabled and OK:OK
2020-03-27T19:18:55.238 [123316.00228] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6545) bmc_handler : Info : controller-1 bmc audit timer started (120 secs)
2020-03-27T19:18:55.238 [123316.00229] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6561) bmc_handler : Info : controller-1 bmc is accessible using redfish
2020-03-27T19:18:55.238 [123316.00230] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 add host sent to hwmond
2020-03-27T19:18:55.238 [123316.00231] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 start host service sent to hwmond
2020-03-27T19:18:55.238 [123316.00232] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4162) reinstall_handler : Info : controller-1 BMC access established ; starting install
2020-03-27T19:18:55.248 [123316.00233] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstalling (seq:7)
2020-03-27T19:19:00.258 [123316.00234] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4275) reinstall_handler : Info : controller-1 Reinstall power-off already
2020-03-27T19:19:00.268 [123316.00235] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4346) reinstall_handler : Info : controller-1 Reinstall netboot request sent
2020-03-27T19:19:05.269 [123316.00236] controller-0 mtcAgent --- mtcBmcUtil.cpp ( 214) bmc_command_recv :Error : controller-1 bmc redfish Netboot command failed (redfishtool) (data:) (rc:108:108:system call failed)
2020-03-27T19:19:05.269 [123316.00237] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4371) reinstall_handler :Error : controller-1 Reinstall netboot receive failed (rc:108)
2020-03-27T19:19:05.269 [123316.00238] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Failed ; netboot request (seq:8)
2020-03-27T19:19:05.279 fmAPI.cpp(490): Enqueue raise alarm request: UUID (ca566554-db16-4c0a-94a3-0a973ed85047) alarm id (200.022) instant id (host=controller-1.status=reinstall-failed)
2020-03-27T19:19:35.290 [123316.00239] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4600) reinstall_handler : Info : controller-1 Reinstall complete ; operation failure

The first few logs show that redfish bmc accesses were fine, yet the netboot command then failed for no apparent reason.

Suspect a mutual exclusion race condition between power/reset/reinstall command handling and the bmc connectivity audit. There is no connectivity audit in the ipmi protocol case. The bmc in-service audit was an enhancement for redfish to get a feeling for how reliable the protocol is.

Test Activity
-------------
Sanity, Developer Testing

Workaround
----------
Retry failed command

Options: Fix options still being investigated
-------
1. stop the in-service bmc query audit.
   - ping failure audit still detects complete loss of bmc connectivity ; same as ipmi.
2. add retry to the netboot command in the re-install handling sequence
   - ensure all bmc commands in the reinstall handler are retried
3. add mutex controls to bmc power/reset/reinstall command handling to avoid collision with bmc audit.
   - hold off one bmc access service until another that is in progress completes.

An update that implements the mutual exclusion of option 3 is being tested.

New value:

Brief Description
-----------------
  DC-3 system install failed due to controller-1 redfish netboot command failure.

Severity
--------
  Major: system install failed

Steps to Reproduce
------------------
  System install

Expected Behavior
------------------
  System installs correctly. All bmc commands succeed.

Actual Behavior
----------------
  System install failed for controller-1 when the netboot command failed for an unknown reason.

Reproducibility
---------------
Intermittent - rare

System Configuration
--------------------
  DC-3 - IPV6

Branch/Pull Time/Commit
-----------------------
  BUILD_DATE="2020-03-26 19:41:10 -0400"

Last Pass
---------
Did this test scenario pass previously? Yes, the failure is intermittent

Timestamp/Logs
--------------
2020-03-27T19:18:44.216 fmAPI.cpp(490): Enqueue raise alarm request: UUID (a51a12e4-2d37-4434-84e7-b599cc60ab42) alarm id (200.021) instant id (host=controller-1.command=reinstall)
2020-03-27T19:18:44.216 [123316.00215] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6327) bmc_handler : Info : controller-1 bmc credentials received
2020-03-27T19:18:44.216 [123316.00216] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstalling (seq:3)
2020-03-27T19:18:44.216 [123316.00217] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4118) reinstall_handler : Warn : controller-1 Reinstall wait for BMC access ; 600 second timeout
2020-03-27T19:18:44.216 [123316.00218] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Wait ; BMC not accessible (seq:4)
2020-03-27T19:18:44.227 [123316.00219] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6395) bmc_handler : Info : controller-1 bmc communication protocol discovery
2020-03-27T19:18:44.244 fmAlarmUtils.cpp(624): Sending FM raise alarm request: alarm_id (200.021), entity_id (host=controller-1.command=reinstall)
2020-03-27T19:18:44.285 fmAlarmUtils.cpp(658): FM Response for raise alarm: (0), alarm_id (200.021), entity_id (host=controller-1.command=reinstall)
2020-03-27T19:18:47.227 [123316.00220] controller-0 mtcAgent --- redfishUtil.cpp ( 305) redfishUtil_is_supported: Info : controller-1 bmc supports redfish version 1.7.0
2020-03-27T19:18:47.227 [123316.00221] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6465) bmc_handler : Info : controller-1 bmc control using redfishtool:2620:10a:a001:a102::125
2020-03-27T19:18:55.238 [123316.00222] controller-0 mtcAgent --- redfishUtil.cpp ( 533) redfishUtil_get_bmc_info: Info : controller-1 power is off
2020-03-27T19:18:55.238 [123316.00223] controller-0 mtcAgent --- redfishUtil.cpp ( 544) redfishUtil_get_bmc_info: Info : controller-1 manufacturer is Intel Corporation ; model:S2600WFQ part:J44810-003 serial:BQWT80101028
2020-03-27T19:18:55.238 [123316.00224] controller-0 mtcAgent --- redfishUtil.cpp ( 551) redfishUtil_get_bmc_info: Info : controller-1 BIOS fw version SE5C620.86B.02.01.0010.010620200716
2020-03-27T19:18:55.238 [123316.00225] controller-0 mtcAgent --- redfishUtil.cpp ( 153) _load_action_lists : Info : controller-1 bmc actions ; reset:ForceRestart power-on:On,ForceOn power-off:GracefulShutdown,ForceOff
2020-03-27T19:18:55.238 [123316.00226] controller-0 mtcAgent --- redfishUtil.cpp ( 602) redfishUtil_get_bmc_info: Info : controller-1 has 2 Processors ; Enabled and OK:OK
2020-03-27T19:18:55.238 [123316.00227] controller-0 mtcAgent --- redfishUtil.cpp ( 625) redfishUtil_get_bmc_info: Info : controller-1 has 192 GiB Memory ; Enabled and OK:OK
2020-03-27T19:18:55.238 [123316.00228] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6545) bmc_handler : Info : controller-1 bmc audit timer started (120 secs)
2020-03-27T19:18:55.238 [123316.00229] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6561) bmc_handler : Info : controller-1 bmc is accessible using redfish
2020-03-27T19:18:55.238 [123316.00230] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 add host sent to hwmond
2020-03-27T19:18:55.238 [123316.00231] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 start host service sent to hwmond
2020-03-27T19:18:55.238 [123316.00232] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4162) reinstall_handler : Info : controller-1 BMC access established ; starting install
2020-03-27T19:18:55.248 [123316.00233] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstalling (seq:7)
2020-03-27T19:19:00.258 [123316.00234] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4275) reinstall_handler : Info : controller-1 Reinstall power-off already
2020-03-27T19:19:00.268 [123316.00235] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4346) reinstall_handler : Info : controller-1 Reinstall netboot request sent
2020-03-27T19:19:05.269 [123316.00236] controller-0 mtcAgent --- mtcBmcUtil.cpp ( 214) bmc_command_recv :Error : controller-1 bmc redfish Netboot command failed (redfishtool) (data:) (rc:108:108:system call failed)
2020-03-27T19:19:05.269 [123316.00237] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4371) reinstall_handler :Error : controller-1 Reinstall netboot receive failed (rc:108)
2020-03-27T19:19:05.269 [123316.00238] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Failed ; netboot request (seq:8)
2020-03-27T19:19:05.279 fmAPI.cpp(490): Enqueue raise alarm request: UUID (ca566554-db16-4c0a-94a3-0a973ed85047) alarm id (200.022) instant id (host=controller-1.status=reinstall-failed)
2020-03-27T19:19:35.290 [123316.00239] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4600) reinstall_handler : Info : controller-1 Reinstall complete ; operation failure

The first few logs show that redfish bmc accesses were fine, yet the netboot command then failed for no apparent reason.

Suspect a mutual exclusion race condition between power/reset/reinstall command handling and the bmc connectivity audit. There is no connectivity audit in the ipmi protocol case. The bmc in-service audit was an enhancement for redfish to get a feeling for how reliable the protocol is.

Test Activity
-------------
  Sanity, Developer Testing

Workaround
----------
  Retry failed command

Options: Fix options still being investigated
-------
  1. stop the in-service bmc query audit.
     - ping failure audit still detects complete loss of bmc connectivity ; same as ipmi.
  2. add retry to the netboot command in the re-install handling sequence
     - ensure all bmc commands in the reinstall handler are retried
  3. add mutex controls to bmc power/reset/reinstall command handling to avoid collision with bmc audit.
     - hold off one bmc access service until another that is in progress completes.

An update that implements the mutual exclusion of option 3 is being tested.
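
Illustrative note: fix options 2 (retry) and 3 (mutual exclusion) from the description could look roughly like the Python sketch below. This is not the mtcAgent code, which is C++, and it is not the change that was released for this bug. The names `BmcAccessGate`, `try_run`, and `send_with_retry` are hypothetical, and the sketch assumes a BMC command reports success as return code 0.

```python
import threading


class BmcAccessGate:
    """Hypothetical per-host gate that allows one BMC request at a time."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_run(self, request):
        """Run request() unless another BMC request is already in progress.

        Returns (ran, result); ran is False when the request was held off.
        """
        if not self._lock.acquire(blocking=False):
            return (False, None)
        try:
            return (True, request())
        finally:
            self._lock.release()


def send_with_retry(command, retries=3):
    """Call command() until it returns rc 0 (assumed success) or retries run out."""
    rc = None
    for _ in range(retries):
        rc = command()
        if rc == 0:
            break
    return rc
```

In this sketch the audit and the power/reset/reinstall commands would both go through `try_run` on the same gate, so an audit query that arrives while a netboot request is in flight is held off instead of colliding with it. `send_with_retry` covers the case where a command still fails once, as with the rc:108 seen in the logs.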
2020-06-03 15:03:25 Ghada Khalil starlingx: importance Undecided Low
2020-06-03 15:03:44 Ghada Khalil tags stx.metal
2020-06-03 15:04:08 Ghada Khalil starlingx: status New Triaged
2020-07-16 02:01:50 Ghada Khalil bug added subscriber Allain Legacy
2020-07-16 02:02:18 Ghada Khalil tags stx.metal stx.5.0 stx.metal
2020-07-16 02:02:23 Ghada Khalil starlingx: importance Low Medium
2020-11-06 15:38:58 OpenStack Infra starlingx: status Triaged In Progress
2020-11-18 17:11:53 OpenStack Infra starlingx: status In Progress Fix Released