Some of the sensor data files are missing in /var/run/bmc/redfishtool for a long time

Bug #1864906 reported by Anujeyan Manokeran
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Eric MacDonald
Milestone:

Bug Description

Brief Description
-----------------
    During an automation run it was observed that some of the worker nodes (compute-0 and compute-1) did not have sensor data files in /var/run/bmc/redfishtool for a long time with dynamic provisioning. Their files were in /var/run/bmc/ipmitool/ instead. Further investigation found a failure during dynamic protocol learning that led to defaulting to ipmi over a bm type re-provision. Sensor data was still displayed by host-sensor-list for all the nodes but only sensor

controller-1:/var/run/bmc$ ls -lrt redfishtool/
total 60
-rw-r--r-- 1 root root 4509 Feb 26 18:30 hwmond_controller-1_power_sensor_data
-rw-r--r-- 1 root root 18939 Feb 26 18:30 hwmond_compute-2_thermal_sensor_data
-rw-r--r-- 1 root root 22750 Feb 26 18:30 hwmond_controller-1_thermal_sensor_data
-rw-r--r-- 1 root root 3659 Feb 26 18:32 mtcAgent_controller-1_bmc_info
-rw-r--r-- 1 root root 3660 Feb 26 18:32 mtcAgent_compute-2_bmc_info
-rw-r--r-- 1 root root 0 Feb 26 18:32 hwmond_compute-2_power_sensor_data
controller-1:/var/run/bmc$ ls -lrt ipmitool/
total 72
-rw-r--r-- 1 root root 571 Feb 26 15:41 mtcAgent_compute-0_bmc_info
-rw-r--r-- 1 root root 571 Feb 26 15:41 mtcAgent_compute-1_bmc_info
-rw-r--r-- 1 root root 52 Feb 26 15:41 mtcAgent_compute-0_restart_cause
-rw-r--r-- 1 root root 30 Feb 26 15:41 mtcAgent_compute-1_restart_cause
-rw-r--r-- 1 root root 20 Feb 26 15:41 mtcAgent_compute-1_power_status
-rw-r--r-- 1 root root 20 Feb 26 15:41 mtcAgent_compute-0_power_status
-rw-r--r-- 1 root root 571 Feb 26 15:41 mtcAgent_controller-0_bmc_info
-rw-r--r-- 1 root root 52 Feb 26 15:41 mtcAgent_controller-0_restart_cause
-rw-r--r-- 1 root root 20 Feb 26 15:41 mtcAgent_controller-0_power_status
-rw-r--r-- 1 root root 10292 Feb 26 18:31 hwmond_compute-0_sensor_data
-rw-r--r-- 1 root root 10292 Feb 26 18:31 hwmond_compute-1_sensor_data
-rw-r--r-- 1 root root 10168 Feb 26 18:31 hwmond_controller-0_sensor_data
controller-1:/var/run/bmc$

2020-02-26T14:41:06.687 [252777.00634] controller-1 mtcAgent hdl mtcNodeHdlrs.cpp (6202) bmc_handler : Info : compute-0 bmc credentials received
2020-02-26T14:41:06.697 [252777.00635] controller-1 mtcAgent --- threadUtil.cpp ( 344) thread_launch : Warn : compute-0 bmc not in IDLE stage (in Done stage)
2020-02-26T14:41:06.697 [252777.00636] controller-1 mtcAgent --- mtcBmcUtil.cpp ( 144) bmc_command_send :Error : compute-0 failed to launch power control thread (rc:72)
2020-02-26T14:41:06.697 [252777.00637] controller-1 mtcAgent hdl mtcNodeHdlrs.cpp (6261) bmc_handler : Warn : compute-0 Query BMC Root send failed ; defaulting to ipmi

2020-02-26T14:41:13.361 [252777.00646] controller-1 mtcAgent hdl mtcNodeHdlrs.cpp (6202) bmc_handler : Info : compute-1 bmc credentials received
2020-02-26T14:41:13.372 [252777.00647] controller-1 mtcAgent --- threadUtil.cpp ( 344) thread_launch : Warn : compute-1 bmc not in IDLE stage (in Done stage)
2020-02-26T14:41:13.372 [252777.00648] controller-1 mtcAgent --- mtcBmcUtil.cpp ( 144) bmc_command_send :Error : compute-1 failed to launch power control thread (rc:72)
2020-02-26T14:41:13.372 [252777.00649] controller-1 mtcAgent hdl mtcNodeHdlrs.cpp (6261) bmc_handler : Warn : compute-1 Query BMC Root send failed ; defaulting to ipmi
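
The hosts that fell back to ipmi can be pulled out of mtcAgent logs like the ones above with a small script. This is only a sketch: the function name `hosts_defaulted_to_ipmi` is hypothetical, and the pattern assumes the warning text stays exactly as it appears in these log lines.

```python
import re

# Matches the bmc_handler warning seen in the logs above, logged when
# mtcAgent defaults a host to ipmi. The exact wording is assumed stable.
FALLBACK_RE = re.compile(
    r":\s*Warn\s*:\s*(\S+) Query BMC Root send failed ; defaulting to ipmi"
)


def hosts_defaulted_to_ipmi(log_lines):
    """Return host names, in first-seen order, that defaulted to ipmi."""
    hosts = []
    for line in log_lines:
        match = FALLBACK_RE.search(line)
        if match and match.group(1) not in hosts:
            hosts.append(match.group(1))
    return hosts


SAMPLE = [
    "2020-02-26T14:41:06.697 [252777.00635] controller-1 mtcAgent --- threadUtil.cpp ( 344) thread_launch : Warn : compute-0 bmc not in IDLE stage (in Done stage)",
    "2020-02-26T14:41:06.697 [252777.00637] controller-1 mtcAgent hdl mtcNodeHdlrs.cpp (6261) bmc_handler : Warn : compute-0 Query BMC Root send failed ; defaulting to ipmi",
    "2020-02-26T14:41:13.372 [252777.00649] controller-1 mtcAgent hdl mtcNodeHdlrs.cpp (6261) bmc_handler : Warn : compute-1 Query BMC Root send failed ; defaulting to ipmi",
]

if __name__ == "__main__":
    print(hosts_defaulted_to_ipmi(SAMPLE))
```

On a controller the same function could be fed the lines of the mtcAgent log file instead of the inline sample.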

system host-show compute-0
+-----------------------+-------------------------------------------------+
| Property | Value |
+-----------------------+-------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |
| bm_ip | 2620:10a:a001:a102::134 |
| bm_type | dynamic |
| bm_username | root |
| boot_device | /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:8:0 |
| capabilities | {} |
| clock_synchronization | ntp |
| config_applied | 732e7b0b-2d68-4823-9f6f-5ac035995799 |
| config_status | None |
| config_target | 732e7b0b-2d68-4823-9f6f-5ac035995799 |
| console | ttyS0,115200n8 |
| created_at | 2020-02-26T05:22:34.395040+00:00 |
| hostname | compute-0 |
| id | 2 |
| install_output | text |
| install_state | completed |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioned |
| location | {} |
| mgmt_ip | face::d1ad:b547:22e6:65fb |
| mgmt_mac | 3c:fd:fe:b1:26:d8 |
| operational | enabled |
| personality | worker |
| reserved | False |
| rootfs_device | /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:8:0 |
| serialid | None |
| software_load | 20.02 |
| subfunctions | worker,lowlatency |
| task | |
| tboot | false |
| ttys_dcd | None |
| updated_at | 2020-02-26T16:39:14.437014+00:00 |
| uptime | 7656 |
| uuid | 9f26bdbd-a0a9-4f6d-953e-11dab24b399a |
| vim_progress_status | services-enabled |
+-----------------------+-------------------------------------------------+
[sysadmin@controller-1 ~(keystone_admin)]$ system host-show compute-1
+-----------------------+-------------------------------------------------+
| Property | Value |
+-----------------------+-------------------------------------------------+
| action | none |
| administrative | unlocked |
| availability | available |
| bm_ip | 2620:10a:a001:a102::135 |
| bm_type | dynamic |
| bm_username | root |
| boot_device | /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:8:0 |
| capabilities | {} |
| clock_synchronization | ntp |
| config_applied | 732e7b0b-2d68-4823-9f6f-5ac035995799 |
| config_status | None |
| config_target | 732e7b0b-2d68-4823-9f6f-5ac035995799 |
| console | ttyS0,115200n8 |
| created_at | 2020-02-26T05:22:37.664627+00:00 |
| hostname | compute-1 |
| id | 3 |
| install_output | text |
| install_state | completed |
| install_state_info | None |
| inv_state | inventoried |
| invprovision | provisioned |
| location | {} |
| mgmt_ip | face::5f70:56e8:f3d4:9da6 |
| mgmt_mac | 3c:fd:fe:b5:76:e0 |
| operational | enabled |
| personality | worker |
| reserved | False |
| rootfs_device | /dev/disk/by-path/pci-0000:18:00.0-scsi-0:0:8:0 |
| serialid | None |
| software_load | 20.02 |
| subfunctions | worker,lowlatency |
| task | |
| tboot | false |
| ttys_dcd | None |
| updated_at | 2020-02-26T16:39:14.399614+00:00 |
| uptime | 37887 |
| uuid | 076f267c-3d76-4182-9a0f-dbbf79ce3722 |
| vim_progress_status | services-enabled |
+-----------------------+-------------------------------------------------+

Severity
--------
Minor

Steps to Reproduce
------------------
1. Re-provision the host from ipmi to dynamic protocol (automated stress test).
2. Check /var/run/bmc/redfishtool for sensor files. No sensor files were created for compute-0 and compute-1.
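
The check in step 2 could be scripted roughly as follows. This is a sketch under assumptions: the file naming pattern `<daemon>_<hostname>_<kind>` and the `controller-N` / `compute-N` host names are inferred from the directory listings in this report, and the function name is hypothetical. The listing is inlined so the example runs anywhere; on a controller it could be built with `os.listdir` on each directory under /var/run/bmc.

```python
import re
from collections import defaultdict

# Assumed naming pattern, inferred from the listings in this report:
# <daemon>_<hostname>_<kind>, e.g. hwmond_compute-0_sensor_data
NAME_RE = re.compile(r"^(hwmond|mtcAgent)_((?:controller|compute)-\d+)_(.+)$")


def hosts_by_protocol(listing):
    """Map each protocol directory to the hosts that have hwmond sensor data files."""
    result = defaultdict(set)
    for protocol, names in listing.items():
        for name in names:
            match = NAME_RE.match(name)
            if (match and match.group(1) == "hwmond"
                    and match.group(3).endswith("sensor_data")):
                result[protocol].add(match.group(2))
    return dict(result)


# File names copied from the ls output in the bug description.
LISTING = {
    "redfishtool": [
        "hwmond_controller-1_power_sensor_data",
        "hwmond_compute-2_thermal_sensor_data",
        "hwmond_controller-1_thermal_sensor_data",
        "mtcAgent_controller-1_bmc_info",
        "mtcAgent_compute-2_bmc_info",
        "hwmond_compute-2_power_sensor_data",
    ],
    "ipmitool": [
        "mtcAgent_compute-0_bmc_info",
        "mtcAgent_compute-1_bmc_info",
        "hwmond_compute-0_sensor_data",
        "hwmond_compute-1_sensor_data",
        "hwmond_controller-0_sensor_data",
    ],
}

if __name__ == "__main__":
    found = hosts_by_protocol(LISTING)
    # Hosts with ipmitool sensor data but none under redfishtool.
    print(sorted(found["ipmitool"] - found["redfishtool"]))
```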

System Configuration
--------------------
AIO+DX+worker Wolfpass-8-12

Expected Behavior
------------------
Sensor data files are available for all the nodes.

Actual Behavior
----------------
Sensor data files were missing for some of the nodes

Reproducibility
---------------
Seen once. Needs to be retested after reinstall.

Load
----
2020-02-25 17:09:33 -0500

Last Pass
---------

Timestamp/Logs
--------------
2020-02-26T14:41:06.687

Test Activity
-------------
Regression test

Anujeyan Manokeran (anujeyan) wrote :
description: updated
tags: added: stx.retestneeded
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

@Eric MacDonald, can you make a recommendation on the priority / gate for this issue? Do the sensor files get created eventually? What is the impact on the system? Is it possible this is specific to the wolfpass hardware?

Eric MacDonald (rocksolidmtce) wrote :

This issue is not service or customer impacting but can intermittently affect product verification BMC test execution.

In this case the BMC access method (bm_type) was changed from a fixed value of redfish or ipmi to 'dynamic'.

Dynamic mode tells maintenance to try Redfish but default to IPMI if Redfish appears to not be supported, i.e. fails.
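
As a rough illustration of the dynamic mode rule described above (not the actual maintenance code), the selection can be sketched like this. The function name and the `redfish_supported` callable are hypothetical stand-ins for whatever query maintenance performs against the BMC.

```python
def select_bmc_protocol(bm_type, redfish_supported):
    """Pick the BMC access protocol for a host.

    bm_type: 'redfish', 'ipmi' or 'dynamic' (the values named in this report).
    redfish_supported: hypothetical callable returning True if the Redfish
    query against the BMC succeeds.
    """
    if bm_type in ("redfish", "ipmi"):
        # Fixed provisioning: use exactly what was configured.
        return bm_type
    if bm_type == "dynamic":
        # Try Redfish first, default to IPMI if it appears unsupported.
        return "redfish" if redfish_supported() else "ipmi"
    raise ValueError("unknown bm_type: %r" % (bm_type,))


if __name__ == "__main__":
    print(select_bmc_protocol("dynamic", lambda: False))
```

Note that in this simple form any failure of the query, whatever its cause, ends in ipmi.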

The failure was in the handling of the bm_type reprovisioning change called for in the PV stress test.

The bm_type change caused the maintenance code to relearn the BMC access method, which failed when it found that the thread was already running. The maintenance code does not distinguish a thread launch failure from a Redfish protocol failure during BMC protocol discovery, nor does it retry as it normally would for other operations.

The fix is to improve thread launch failure handling during BMC protocol discovery.
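
One way to picture the distinction between the two failure modes is the sketch below. It is not the actual mtce change: the exception classes, the function name, the retry count, and the choice to return `None` when the launch keeps failing are all assumptions made for illustration.

```python
class ThreadLaunchError(Exception):
    """Hypothetical: the BMC thread could not be launched (e.g. not idle)."""


class RedfishQueryError(Exception):
    """Hypothetical: the thread ran but the Redfish query itself failed."""


def learn_bmc_protocol(query_redfish_root, max_launch_retries=3):
    """Learn the BMC protocol, retrying only when the launch itself fails.

    query_redfish_root: hypothetical callable that returns normally when
    Redfish answers, raises RedfishQueryError when Redfish is not
    supported, and raises ThreadLaunchError when the thread is busy.
    Returns 'redfish', 'ipmi', or None if the protocol is still unknown.
    """
    for _ in range(max_launch_retries):
        try:
            query_redfish_root()
            return "redfish"
        except RedfishQueryError:
            # A real protocol failure: defaulting to ipmi is appropriate.
            return "ipmi"
        except ThreadLaunchError:
            # Says nothing about Redfish support; try the launch again.
            continue
    # Assumed design choice: leave the protocol unlearned so the caller
    # can retry later, rather than settling on ipmi.
    return None
```

With this shape, a busy thread delays protocol learning instead of silently selecting ipmi.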

The real impact is that if we don't fix this issue then the product verification BMC stress/regression test might fail occasionally like it did here.

Eric MacDonald (rocksolidmtce) wrote :

Issue is not lab specific.
Issue should be fixed.
Recommend tracking as 'low priority', given that there is no segfault and mtce simply defaults to IPMI.

Ghada Khalil (gkhalil) wrote :

Marking as Low priority / not gating -- can be fixed time permitting, but not gating any stx release since there is no end user impact

tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
Anujeyan Manokeran (anujeyan) wrote :

This issue was reproduced in build 2020-03-20_04-10-00. Automated regression test cases failed.

Anujeyan Manokeran (anujeyan) wrote :

This issue was reproduced on the AIO-DX+worker WP_8_12 lab with load 2020-04-25_13-17-56. Automated regression test cases failed.

Anujeyan Manokeran (anujeyan) wrote :

This issue was reproduced in Automated regression WP_8_12 lab load 2020-05-08_09-49-28

Peng Peng (ppeng) wrote :

Issue was reproduced on
2020-07-07_20-00-00

log at
https://files.starlingx.kube.cengn.ca/launchpad/1864906

Difu Hu (difuhu) wrote :

Hit similar issue during upgrade. Controller-1 stayed power-off
| 4 | controller-1 | controller | locked | disabled | power-off |

log at: https://files.starlingx.kube.cengn.ca/launchpad/1864906/hp380_ALL_NODES_20201103.202315.tar

2020-11-03T17:43:01.429 [96966.00615] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4936) power_handler : Info : controller-1 Power-Off Completed
2020-11-03T17:43:01.429 fmAPI.cpp(490): Enqueue raise alarm request: UUID (13c50889-19e9-4e76-9791-ca787f4bcac3) alarm id (200.021) instant id (host=controller-1.action=power-off)
2020-11-03T17:43:01.429 [96966.00616] controller-0 mtcAgent inv mtcInvApi.cpp (1119) mtcInvApi_update_state : Info : controller-1 power-off (seq:309)
2020-11-03T17:43:01.438 fmAlarmUtils.cpp(624): Sending FM raise alarm request: alarm_id (200.021), entity_id (host=controller-1.action=power-off)
2020-11-03T17:43:01.483 fmAlarmUtils.cpp(658): FM Response for raise alarm: (0), alarm_id (200.021), entity_id (host=controller-1.action=power-off)
2020-11-03T17:43:06.434 [96966.00617] controller-0 mtcAgent --- threadUtil.cpp ( 344) thread_launch : Warn : controller-1 bmc not in IDLE stage (in Monitor stage)
2020-11-03T17:43:06.434 [96966.00618] controller-0 mtcAgent --- mtcBmcUtil.cpp ( 144) bmc_command_send :Error : controller-1 failed to launch power control thread (rc:72)
2020-11-03T17:43:06.434 [96966.00619] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (4340) reinstall_handler :Error : controller-1 Reinstall netboot request failed (rc:72)
2020-11-03T17:43:06.434 [96966.00620] controller-0 mtcAgent inv mtcInvApi.cpp ( 334) mtcInvApi_update_task : Info : controller-1 Task: Reinstall Failed ; netboot request (seq:310)
2020-11-03T17:43:06.440 [96966.00621] controller-0 mtcAgent --- threadUtil.cpp ( 763) thread_handler : Warn : controller-1 bmc thread kill req (rc:0)
2020-11-03T17:43:06.440 fmAPI.cpp(490): Enqueue raise alarm request: UUID (62769211-0f8b-4c17-8812-0e5b575c4009) alarm id (200.022) instant id (host=controller-1.status=reinstall-failed)
2020-11-03T17:43:06.442 fmAlarmUtils.cpp(624): Sending FM raise alarm request: alarm_id (200.022), entity_id (host=controller-1.status=reinstall-failed)
2020-11-03T17:43:06.495 fmAlarmUtils.cpp(658): FM Response for raise alarm: (0), alarm_id (200.022), entity_id (host=controller-1.status=reinstall-failed)
2020-11-03T17:43:08.739 [96966.00622] controller-0 mtcAgent --- threadUtil.cpp ( 805) pthread_signal_handler : Info : controller-1 bmc thread SIGKILL ; exiting ...
2020-11-03T17:43:36.445 [96966.00623] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (4600) reinstall_handler : Info : controller-1 Reinstall complete ; operation failure
2020-11-03T17:43:36.445 [96966.00624] controller-0 mtcAgent inv mtcInvApi.cpp ( 437) mtcInvApi_force_task : Info : controller-1 task clear (seq:311) (was:Reinstall Fa

Eric MacDonald (rocksolidmtce) wrote :

This issue is believed to have been fixed by the update ...

    https://review.opendev.org/c/starlingx/metal/+/761760/

.. that merged a week or so after the last update to this issue record.

Merged update is

https://opendev.org/starlingx/metal/commit/9b82bf6f65a195e040fd92ef8156368207690f65

Please retest with the above update.

Eric MacDonald (rocksolidmtce) wrote :

This issue is fixed with the following two updates:

update: Disable Redfish BMC audit and improve reinstall failure handling
review: https://review.opendev.org/c/starlingx/metal/+/761760/
commit: https://opendev.org/starlingx/metal/commit/9b82bf6f65a195e040fd92ef8156368207690f65

update: Improve mtcAgent interrupted thread cleanup
review: https://review.opendev.org/c/starlingx/metal/+/780348
commit: https://opendev.org/starlingx/metal/commit/4f5bf78f55ec8b0983262ee351183b1edd8443ad

Changed in starlingx:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/metal/+/792250

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/792250
Committed: https://opendev.org/starlingx/metal/commit/6c2905e665ceeebfa7717c9cbccc1c277d10966b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 5942a56ec6f0b265ca6d1c8c800fe84c4a22860f
Author: Eric MacDonald <email address hidden>
Date: Thu May 13 15:57:43 2021 +0000

    Revert "Align partitions created by kickstarters"

    This reverts commit 0e89acc83c616741952a068a3ff07ba91440eff8.

    Reason for revert: Review should have been abandoned rather than merged.

    Change-Id: I95f1e151183f122d93b834ab2a785736e5a8ef12
    Closes-Bug: 1928341

commit c7c341b198e79bb98f443c7c07f671c6387075af
Author: Don Penney <email address hidden>
Date: Fri May 7 08:56:06 2021 -0400

    Add /pxeboot/grubx64.efi symlink for UEFI pxeboot

    UEFI pxeboot with shim.efi looks for the grubx64.efi in the tftpboot
    root directory. This update creates a symlink to the
    /pxeboot/EFI/grubx64.efi file in /pxeboot.

    Change-Id: Iabf8ec89d0af6e6b1a62e20159ecdfa16729444e
    Partial-Bug: 1927730
    Signed-off-by: Don Penney <email address hidden>

commit ce7529964932a9fd1cc10ce18dbe11e89ee02223
Author: Eric MacDonald <email address hidden>
Date: Wed May 5 19:05:55 2021 -0400

    Fix enabling heartbeat of self from the peer controller

    This issue only occurs over an hbsAgent process restart
    where the ready event response does not include the
    heartbeat start of the peer controller.

    This update reverts a small code change that was
    introduced by the following update.

    https://review.opendev.org/c/starlingx/metal/+/788495

    Remove the my_hostname gate introduced at line 1267 of
    mtcCtrlMsg.cpp because it prevents enabling heartbeat
    of self by the peer controller.

    Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <email address hidden>

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <email address hidden>
Date: Wed Apr 28 09:39:19 2021 -0400

    Improved maintenance handling of spontaneous active controller reboot

    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The cause of the
    second reboot was due to its reported uptime in the first mtcAlive
    message, following the reboot, as greater than 10 minutes.

    Maintenance has a long standing graceful recovery threshold of
    10 minutes. Meaning that if a host loses heartbeat and enters
    Graceful Recovery, if the uptime value extracted from the first
    mtcAlive message following the recovery of that host exceeds 10
    minutes, then maintenance interprets that the host did not reboot.
    If a host goes absent for longer than this threshold then for
    reasons not limited to security, maintenance declares the host
    as 'failed' and force re-enables it through a reboot.

    With the introduction of containers and addition of new features
    over the last few releases, boot times on some servers are
    approaching the 10 minute threshold an...

tags: added: in-f-centos8
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded