Last Pass
---------
Did this test scenario pass previously? Yes, is intermittent
Timestamp/Logs
--------------
From /folk/cgts/logs/CGTS-17697
2020-03-27T19:18:44.216 fmAPI.cpp(490): Enqueue raise alarm request: UUID (a51a12e4-2d37-4434-84e7-b599cc60ab42) alarm id (200.021) instant id (host=controller-1.command=reinstall)
2020-03-27T19:18:44.216 [123316.00215] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6327) bmc_handler : Info : controller-1 bmc credentials received
2020-03-27T19:18:44.216 [123316.00216] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstalling (seq:3)
2020-03-27T19:18:44.216 [123316.00217] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4118) reinstall_handler : Warn : controller-1 Reinstall wait for BMC access ; 600 second timeout
2020-03-27T19:18:44.216 [123316.00218] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Wait ; BMC not accessible (seq:4)
579 2020-03-27T19:18:44.227 [123316.00219] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6395) bmc_handler : Info : controller-1 bmc communication protocol discovery
2020-03-27T19:18:44.244 fmAlarmUtils.cpp(624): Sending FM raise alarm request: alarm_id (200.021), entity_id (host=controller-1.command=reinstall)
2020-03-27T19:18:44.285 fmAlarmUtils.cpp(658): FM Response for raise alarm: (0), alarm_id (200.021), entity_id (host=controller-1.command=reinstall)
582 2020-03-27T19:18:47.227 [123316.00220] controller-0 mtcAgent --- redfishUtil.cpp ( 305) redfishUtil_is_supported: Info : controller-1 bmc supports redfish version 1.7.0
583 2020-03-27T19:18:47.227 [123316.00221] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6465) bmc_handler : Info : controller-1 bmc control using redfishtool:2620:10a:a001:a102::125
584 2020-03-27T19:18:55.238 [123316.00222] controller-0 mtcAgent --- redfishUtil.cpp ( 533) redfishUtil_get_bmc_info: Info : controller-1 power is off
2020-03-27T19:18:55.238 [123316.00223] controller-0 mtcAgent --- redfishUtil.cpp ( 544) redfishUtil_get_bmc_info: Info : controller-1 manufacturer is Intel Corporation ; model:S2600WFQ part:J44810-003 serial:BQWT80101028
2020-03-27T19:18:55.238 [123316.00224] controller-0 mtcAgent --- redfishUtil.cpp ( 551) redfishUtil_get_bmc_info: Info : controller-1 BIOS fw version SE5C620.86B.02.01.0010.010620200716
2020-03-27T19:18:55.238 [123316.00225] controller-0 mtcAgent --- redfishUtil.cpp ( 153) _load_action_lists : Info : controller-1 bmc actions ; reset:ForceRestart power-on:On,ForceOn power-off:GracefulShutdown,ForceOff
2020-03-27T19:18:55.238 [123316.00226] controller-0 mtcAgent --- redfishUtil.cpp ( 602) redfishUtil_get_bmc_info: Info : controller-1 has 2 Processors ; Enabled and OK:OK
2020-03-27T19:18:55.238 [123316.00227] controller-0 mtcAgent --- redfishUtil.cpp ( 625) redfishUtil_get_bmc_info: Info : controller-1 has 192 GiB Memory ; Enabled and OK:OK
2020-03-27T19:18:55.238 [123316.00228] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6545) bmc_handler : Info : controller-1 bmc audit timer started (120 secs)
2020-03-27T19:18:55.238 [123316.00229] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6561) bmc_handler : Info : controller-1 bmc is accessible using redfish
2020-03-27T19:18:55.238 [123316.00230] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 add host sent to hwmond
2020-03-27T19:18:55.238 [123316.00231] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 start host service sent to hwmond
2020-03-27T19:18:55.238 [123316.00232] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4162) reinstall_handler : Info : controller-1 BMC access established ; starting install
2020-03-27T19:18:55.248 [123316.00233] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstalling (seq:7)
2020-03-27T19:19:00.258 [123316.00234] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4275) reinstall_handler : Info : controller-1 Reinstall power-off already
2020-03-27T19:19:00.268 [123316.00235] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4346) reinstall_handler : Info : controller-1 Reinstall netboot request sent
2020-03-27T19:19:05.269 [123316.00236] controller-0 mtcAgent --- mtcBmcUtil.cpp ( 214) bmc_command_recv :Error : controller-1 bmc redfish Netboot command failed (redfishtool) (data:) (rc:108:108:system call failed)
2020-03-27T19:19:05.269 [123316.00237] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4371) reinstall_handler :Error : controller-1 Reinstall netboot receive failed (rc:108)
2020-03-27T19:19:05.269 [123316.00238] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Failed ; netboot request (seq:8)
2020-03-27T19:19:05.279 fmAPI.cpp(490): Enqueue raise alarm request: UUID (ca566554-db16-4c0a-94a3-0a973ed85047) alarm id (200.022) instant id (host=controller-1.status=reinstall-failed)
2020-03-27T19:19:35.290 [123316.00239] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4600) reinstall_handler : Info : controller-1 Reinstall complete ; operation failure
The first few logs sow that redfish bmc accesses were fine only to see the netboot command fail for no apparent reason.
Suspect a mutual exclusion race condition of power/reset/reinstall command handling with the bmc connectivity audit. There is no connectivity audit in ipmi protocol case. The bmc in-service audit was an enhancement for redfish to get a feeling for how reliable the protocol is.
Test Activity
-------------
Sanity, Developer Testing
Workaround
----------
Retry failed command
Options: Fix options still being investigated
-------
1. stop the in-service bmc query audit.
- ping failure audit still detects complete loss of bmc connectivity ; same as ipmi.
2. add retry to the netboot command in the re-install handling sequence
- ensure all bmc commands in the reinstall handler are retried
3. add mutex controls to bmc power/reset/reinstall command handling to avoid collision with bmc audit.
- hold off one bmc access service until another that is in progress completes.
An update that implements the mutual exclusion for option 3 is implemented and being tested.
Brief Description
-----------------
DC-3 system install failed due to controller-1 redfish netboot command failure.
Severity
--------
Major: system install failed
Steps to Reproduce
------------------
System install
Expected Behavior
------------------
System installs correctly. All bmc commands succeed.
Actual Behavior
----------------
System install failed for controller-1 when the netboot command failed for unknown reason.
Reproducibility
---------------
Intermittent - rare
System Configuration ------- ------
-------
DC-3 - IPV6
Branch/Pull Time/Commit ------- ------- -- "20.03" TARGET= "Host Installer" TYPE="Formal" ID="2020- 03-26_19- 39-18" WRCP_20. 03_Build" BY="jenkins" HOST="yow- cgts4-lx. wrs.com" DATE="2020- 03-26 19:41:10 -0400"
-------
SW_VERSION=
BUILD_
BUILD_
BUILD_
SRC_BUILD_ID="6"
JOB="
BUILD_
BUILD_NUMBER="6"
BUILD_
BUILD_
Last Pass
---------
Did this test scenario pass previously? Yes, is intermittent
Timestamp/Logs
--------------
From /folk/cgts/ logs/CGTS- 17697
2020-03- 27T19:18: 44.216 fmAPI.cpp(490): Enqueue raise alarm request: UUID (a51a12e4- 2d37-4434- 84e7-b599cc60ab 42) alarm id (200.021) instant id (host=controlle r-1.command= reinstall) 27T19:18: 44.216 [123316.00215] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6327) bmc_handler : Info : controller-1 bmc credentials received 27T19:18: 44.216 [123316.00216] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_ update_ task : Info : controller-1 Task: Reinstalling (seq:3) 27T19:18: 44.216 [123316.00217] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4118) reinstall_handler : Warn : controller-1 Reinstall wait for BMC access ; 600 second timeout 27T19:18: 44.216 [123316.00218] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_ update_ task : Info : controller-1 Task: Reinstall Wait ; BMC not accessible (seq:4) 27T19:18: 44.227 [123316.00219] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6395) bmc_handler : Info : controller-1 bmc communication protocol discovery 27T19:18: 44.244 fmAlarmUtils. cpp(624) : Sending FM raise alarm request: alarm_id (200.021), entity_id (host=controlle r-1.command= reinstall) 27T19:18: 44.285 fmAlarmUtils. cpp(658) : FM Response for raise alarm: (0), alarm_id (200.021), entity_id (host=controlle r-1.command= reinstall) 27T19:18: 47.227 [123316.00220] controller-0 mtcAgent --- redfishUtil.cpp ( 305) redfishUtil_ is_supported: Info : controller-1 bmc supports redfish version 1.7.0 27T19:18: 47.227 [123316.00221] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (6465) bmc_handler : Info : controller-1 bmc control using redfishtool: 2620:10a: a001:a102: :125 27T19:18: 55.238 [123316.00222] controller-0 mtcAgent --- redfishUtil.cpp ( 533) redfishUtil_ get_bmc_ info: Info : controller-1 power is off 27T19:18: 55.238 [123316.00223] controller-0 mtcAgent --- redfishUtil.cpp ( 544) redfishUtil_ get_bmc_ info: Info : controller-1 manufacturer is Intel Corporation ; model:S2600WFQ part:J44810-003 serial:BQWT80101028 27T19:18: 55.238 [123316.00224] controller-0 mtcAgent --- redfishUtil.cpp ( 551) redfishUtil_ get_bmc_ info: Info : controller-1 BIOS fw version SE5C620. 86B.02. 01.0010. 010620200716 27T19:18: 55.238 [123316.00225] controller-0 mtcAgent --- redfishUtil.cpp ( 153) _load_action_lists : Info : controller-1 bmc actions ; reset:ForceRestart power-on:On,ForceOn power-off: GracefulShutdow n,ForceOff 27T19:18: 55.238 [123316.00226] controller-0 mtcAgent --- redfishUtil.cpp ( 602) redfishUtil_ get_bmc_ info: Info : controller-1 has 2 Processors ; Enabled and OK:OK 27T19:18: 55.238 [123316.00227] controller-0 mtcAgent --- redfishUtil.cpp ( 625) redfishUtil_ get_bmc_ info: Info : controller-1 has 192 GiB Memory ; Enabled and OK:OK 27T19:18: 55.238 [123316.00228] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6545) bmc_handler : Info : controller-1 bmc audit timer started (120 secs) 27T19:18: 55.238 [123316.00229] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6561) bmc_handler : Info : controller-1 bmc is accessible using redfish 27T19:18: 55.238 [123316.00230] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 add host sent to hwmond 27T19:18: 55.238 [123316.00231] controller-0 mtcAgent msg mtcCtrlMsg.cpp (1270) send_hwmon_command : Info : controller-1 start host service sent to hwmond 27T19:18: 55.238 [123316.00232] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4162) reinstall_handler : Info : controller-1 BMC access established ; starting install 27T19:18: 55.248 [123316.00233] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_ update_ task : Info : controller-1 Task: Reinstalling (seq:7) 27T19:19: 00.258 [123316.00234] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4275) reinstall_handler : Info : controller-1 Reinstall power-off already 27T19:19: 00.268 [123316.00235] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4346) reinstall_handler : Info : controller-1 Reinstall netboot request sent 27T19:19: 05.269 [123316.00236] controller-0 mtcAgent --- mtcBmcUtil.cpp ( 214) bmc_command_recv :Error : controller-1 bmc redfish Netboot command failed (redfishtool) (data:) (rc:108:108:system call failed) 27T19:19: 05.269 [123316.00237] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4371) reinstall_handler :Error : controller-1 Reinstall netboot receive failed (rc:108) 27T19:19: 05.269 [123316.00238] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_ update_ task : Info : controller-1 Task: Reinstall Failed ; netboot request (seq:8) 27T19:19: 05.279 fmAPI.cpp(490): Enqueue raise alarm request: UUID (ca566554- db16-4c0a- 94a3-0a973ed850 47) alarm id (200.022) instant id (host=controlle r-1.status= reinstall- failed) 27T19:19: 35.290 [123316.00239] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4600) reinstall_handler : Info : controller-1 Reinstall complete ; operation failure
2020-03-
2020-03-
2020-03-
2020-03-
579 2020-03-
2020-03-
2020-03-
582 2020-03-
583 2020-03-
584 2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
2020-03-
The first few logs sow that redfish bmc accesses were fine only to see the netboot command fail for no apparent reason.
Suspect a mutual exclusion race condition of power/reset/ reinstall command handling with the bmc connectivity audit. There is no connectivity audit in ipmi protocol case. The bmc in-service audit was an enhancement for redfish to get a feeling for how reliable the protocol is.
Test Activity
-------------
Sanity, Developer Testing
Workaround
----------
Retry failed command
Options: Fix options still being investigated
-------
1. stop the in-service bmc query audit. reinstall command handling to avoid collision with bmc audit.
- ping failure audit still detects complete loss of bmc connectivity ; same as ipmi.
2. add retry to the netboot command in the re-install handling sequence
- ensure all bmc commands in the reinstall handler are retried
3. add mutex controls to bmc power/reset/
- hold off one bmc access service until another that is in progress completes.
An update that implements the mutual exclusion for option 3 is implemented and being tested.