After uncontrolled swact /var/run/bmc/redfishtool/ doesn't have hwmond sensor data

Bug #1853471 reported by Anujeyan Manokeran
This bug affects 1 person
Affects    Status        Importance  Assigned to     Milestone
StarlingX  Fix Released  High        Eric MacDonald

Bug Description

Brief Description
-----------------
After an uncontrolled swact, /var/run/bmc/redfishtool/ on the new active controller does not contain hwmond sensor data; collection switched to /var/run/bmc/ipmitool/. This was observed after a hard reboot of the active controller (controller-1). The listings below show the data collected on controller-0 for both ipmi and redfish. Prior to the swact, sensor data was captured in /var/run/bmc/redfishtool/ on controller-1 while it was active.

2019-11-20T20:17:19.000 controller-1 -sh: info HISTORY: PID=3506364 UID=42425 sudo reboot

2019-11-20T20:18:31.026 [3406288.00359] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6507) bmc_handler : Info : controller-0 bmc is accessible using redfish
2019-11-20T20:18:31.072 [3406288.00368] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6507) bmc_handler : Info : compute-0 bmc is accessible using redfish
2019-11-20T20:18:31.103 [3406288.00377] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6507) bmc_handler : Info : compute-2 bmc is accessible using redfish
2019-11-20T20:18:31.214 [3406288.00386] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6507) bmc_handler : Info : compute-1 bmc is accessible using redfish
2019-11-20T20:19:51.210 [3406288.00401] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (6507) bmc_handler : Info : controller-1 bmc is accessible using redfish
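
For anyone re-running this test, a small script can pull the protocol that mtcAgent selected for each host out of log lines like the ones above. This is only a sketch: the regular expression is inferred from the lines quoted in this report, and the function name `bmc_protocols` is made up for illustration, not part of any StarlingX tool.

```python
import re

# Pattern inferred from the mtcAgent lines quoted in this report (an assumption);
# other builds may word or space the message differently.
BMC_ACCESS_RE = re.compile(
    r"bmc_handler\s*:\s*Info\s*:\s*(?P<host>\S+) bmc is accessible using (?P<protocol>\w+)"
)


def bmc_protocols(log_lines):
    """Return {hostname: protocol}, keeping the last protocol logged per host."""
    protocols = {}
    for line in log_lines:
        match = BMC_ACCESS_RE.search(line)
        if match:
            protocols[match.group("host")] = match.group("protocol")
    return protocols


sample = [
    "2019-11-20T20:18:31.072 [3406288.00368] controller-0 mtcAgent |-| "
    "mtcNodeHdlrs.cpp (6507) bmc_handler : Info : compute-0 bmc is accessible using redfish",
    "2019-11-20T20:19:51.210 [3406288.00401] controller-0 mtcAgent |-| "
    "mtcNodeHdlrs.cpp (6507) bmc_handler : Info : controller-1 bmc is accessible using redfish",
]
print(bmc_protocols(sample))
```

Comparing this mapping with the directory that actually receives the hwmond files is what exposes the mismatch reported here.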

ls -lrt /var/run/bmc/redfishtool/
total 40
-rw-r--r-- 1 root root 1022 Nov 20 20:18 mtcAgent_controller-0_root_query
-rw-r--r-- 1 root root 1022 Nov 20 20:18 mtcAgent_compute-2_root_query
-rw-r--r-- 1 root root 1022 Nov 20 20:18 mtcAgent_compute-0_root_query
-rw-r--r-- 1 root root 1022 Nov 20 20:18 mtcAgent_compute-1_root_query
-rw-r--r-- 1 root root 1022 Nov 20 20:19 mtcAgent_controller-1_root_query
-rw-r--r-- 1 root root 3659 Nov 21 14:54 mtcAgent_controller-1_bmc_info
-rw-r--r-- 1 root root 3659 Nov 21 14:55 mtcAgent_controller-0_bmc_info
-rw-r--r-- 1 root root 3660 Nov 21 14:55 mtcAgent_compute-1_bmc_info
-rw-r--r-- 1 root root 3660 Nov 21 14:55 mtcAgent_compute-2_bmc_info
-rw-r--r-- 1 root root 3660 Nov 21 14:55 mtcAgent_compute-0_bmc_info
controller-0:~$ ls -lrt /var/run/bmc/ipmitool/
total 60
-rw-r--r-- 1 root root 9672 Nov 21 14:54 hwmond_controller-0_sensor_data
-rw-r--r-- 1 root root 9796 Nov 21 14:54 hwmond_compute-0_sensor_data
-rw-r--r-- 1 root root 9796 Nov 21 14:55 hwmond_compute-1_sensor_data
-rw-r--r-- 1 root root 9796 Nov 21 14:55 hwmond_compute-2_sensor_data
-rw-r--r-- 1 root root 9672 Nov 21 14:56 hwmond_controller-1_sensor_data

Severity
--------
Major

Steps to Reproduce
------------------
1. Set up the lab to use the BMC redfishtool
2. Verify BMC sensor data collection using redfishtool, e.g. check the mtcAgent log and list the files in /var/run/bmc/redfishtool
3. Reboot the active controller
4. Verify that sensor data is still collected in /var/run/bmc/redfishtool on the new active controller
TC-name: Uncontrolled swact and verify bmc sensor data collection tool
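
The directory check in steps 2 and 4 could be scripted roughly as follows. This is a sketch only: it assumes the `hwmond_<host>_sensor_data` file naming and the two directories seen in the listings in this report, and `sensor_data_by_tool` is a hypothetical helper name.

```python
from pathlib import Path


def sensor_data_by_tool(bmc_root="/var/run/bmc"):
    """Map each tool directory name to the sorted hwmond sensor data file names in it."""
    result = {}
    for tool in ("redfishtool", "ipmitool"):
        tool_dir = Path(bmc_root) / tool
        # A missing directory is reported as an empty list rather than an error.
        files = tool_dir.glob("hwmond_*_sensor_data") if tool_dir.is_dir() else []
        result[tool] = sorted(path.name for path in files)
    return result


if __name__ == "__main__":
    for tool, names in sensor_data_by_tool().items():
        print(f"{tool}: {len(names)} sensor data file(s)")
```

Run on the new active controller after the reboot in step 3; with the behavior reported here, the files would show up under ipmitool rather than redfishtool.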

Expected Behavior
------------------
All sensor data is collected under the redfish directory /var/run/bmc/redfishtool.

Actual Behavior
----------------
After the swact, sensor data was not collected in /var/run/bmc/redfishtool on the new active controller.

Reproducibility
---------------
Reproducible 100%

System Configuration
--------------------
AIO-DX + worker node IPv6 config
Lab-name: WCP-8-12

Branch/Pull Time/Commit
-----------------------
2019-11-18_20-00-00

Last Pass
---------
Never tested

Timestamp/Logs
--------------
2019-11-20T20:17:19.000

Test Activity
-------------
Feature Testing

Eric MacDonald (rocksolidmtce) wrote :

After a swact the Hardware Monitor on the new side is selecting ipmi even though mtcAgent selects redfish.

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.3.0 stx.metal
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / high priority given this is related to the redfish feature which is an stx.3.0 deliverable

Eric MacDonald (rocksolidmtce) wrote :

Issue is understood and a fix is being worked on.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/697309

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/697309
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=c4b8171ddd2670d9bcc11e6c8f7e2f2bfcdc024e
Submitter: Zuul
Branch: master

commit c4b8171ddd2670d9bcc11e6c8f7e2f2bfcdc024e
Author: Eric MacDonald <email address hidden>
Date: Wed Dec 4 10:37:01 2019 -0500

    Refactor BMC provisioning in Maintenance

    The current mechanism used to preserve the learned bmc protocol in
    the filesystem on the active controller is problematic over swact.

    This update removes the file storage method in favor of preserving
    the learned protocol in the system inventory database as a key/value
    pair at the host level in already existing mtce_info database field.

    The specified or learned bmc access protocol is then shared with the
    hardware monitor through inter-daemon maintenance messaging.

    This update refactors bmc provisioning to accommodate bmc protocol
    selection at the host rather than system level. Towards that this
    update removes system level bmc_access_method selection in favor of
    host level selection through bm_type. A bm_type of 'bmc' specifies
    that the bmc access protocol for that host be learned. This has the
    effect of making it the same as what is delivered today but without
    support for changing it at the system level.

    A system inventory update will be delivered shortly that enables bmc
    access protocol selection at the host level. That update allows the
    customer to specify the bmc access protocol at the host level to be
    either dynamic (aka learned) or to only use 'redfish' or 'ipmi'.
    That system inventory update delivers that information to maintenance
    through bm_type via bmc provisioning. Until that update is delivered
    bm_type always comes in as 'bmc' which gets interpreted as 'dynamic'
    to maintain existing configuration.

    The following additional issues were also fixed in this update.

    1. The nodeTimers module defaults the 'ring' member of timers that are
       not running to false but should be true.

    2. Added a pingUtil_restart function to facilitate quicker sensor
       monitoring following provisioning changes and bmc access failures.

    3. Enhanced the hardware monitor sensor grouping filter to accommodate
       non-standard Redfish readout labelling so that more sensors fall
       into the existing canned groups ; leads to more monitored sensors.

    4. Added a 'http security mode' to hardware monitor messaging. This
       defaults to https as that is all that is supported by the Redfish
       implementation today. This field can be used to specify non-secure
       'http' mode in the future when that gets implemented.

    5. Ensure the hardware monitor performs a bmc password re-fetch on every
       provisioning change.

    Test Plan:

    PASS: Verify bmc access protocol store/fetched from the database (mtce_info)
    PASS: Verify inventory push from mtcAgent to hwmond over mtcAgent restart
    PASS: Verify inventory push from mtcAgent to hwmond over hwmon restart
    PASS: Verify bmc provisioning of ipmi and redf...


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

@Eric, please cherrypick to the r/stx.3.0 branch by Dec 11.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/698311

OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (r/stx.3.0)

Reviewed: https://review.opendev.org/698311
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=be3cf4eeb50eef55910cf9c73ea47c168005ad64
Submitter: Zuul
Branch: r/stx.3.0

commit be3cf4eeb50eef55910cf9c73ea47c168005ad64
Author: Eric MacDonald <email address hidden>
Date: Wed Dec 4 10:37:01 2019 -0500

    Refactor BMC provisioning in Maintenance

    The current mechanism used to preserve the learned bmc protocol in
    the filesystem on the active controller is problematic over swact.

    This update removes the file storage method in favor of preserving
    the learned protocol in the system inventory database as a key/value
    pair at the host level in already existing mtce_info database field.

    The specified or learned bmc access protocol is then shared with the
    hardware monitor through inter-daemon maintenance messaging.

    This update refactors bmc provisioning to accommodate bmc protocol
    selection at the host rather than system level. Towards that this
    update removes system level bmc_access_method selection in favor of
    host level selection through bm_type. A bm_type of 'bmc' specifies
    that the bmc access protocol for that host be learned. This has the
    effect of making it the same as what is delivered today but without
    support for changing it at the system level.

    The following additional issues were also fixed in this update.

    1. The nodeTimers module defaults the 'ring' member of timers that are
       not running to false but should be true.

    2. Added a pingUtil_restart function to facilitate quicker sensor
       monitoring following provisioning changes and bmc access failures.

    3. Enhanced the hardware monitor sensor grouping filter to accommodate
       non-standard Redfish readout labelling so that more sensors fall
       into the existing canned groups ; leads to more monitored sensors.

    4. Added a 'http security mode' to hardware monitor messaging. This
       defaults to https as that is all that is supported by the Redfish
       implementation today. This field can be used to specify non-secure
       'http' mode in the future when that gets implemented.

    5. Ensure the hardware monitor performs a bmc password re-fetch on every
       provisioning change.

    Test Plan:

    PASS: Verify bmc access protocol store/fetched from the database (mtce_info)
    PASS: Verify inventory push from mtcAgent to hwmond over mtcAgent restart
    PASS: Verify inventory push from mtcAgent to hwmond over hwmon restart
    PASS: Verify bmc provisioning of ipmi and redfish servers
    PASS: Verify learned bmc protocol persists over process restart and swact
    PASS: Verify process startup with protocol already learned

    Hardware Monitor:

    PASS: Verify bmc_type=ipmi handling ; protocol forced to ipmi ; (re)prov
    PASS: Verify bmc_type=redfish handling ; protocol forced to redfish ; (re)prov
    PASS: Verify bmc_type=dynamic handling ; protocol is learned then persisted
    PASS: Verify sensor model delete and relearn over ip address change
    PASS: Verify sensor model delete and relearn o...
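
The host-level selection rules described in the commit messages and exercised by the Hardware Monitor test cases can be summarized in a few lines. This is a sketch of the described behavior, not the maintenance code itself; `select_protocol` and its arguments are hypothetical names.

```python
def select_protocol(bm_type, learned=None):
    """Return the bmc access protocol for a host, or None if it must still be learned."""
    if bm_type in ("redfish", "ipmi"):
        # An explicit type forces the protocol regardless of what was learned.
        return bm_type
    if bm_type in ("bmc", "dynamic"):
        # 'bmc' is interpreted as 'dynamic': use the learned protocol if there is one.
        return learned
    raise ValueError(f"unknown bm_type: {bm_type!r}")


print(select_protocol("ipmi", learned="redfish"))
print(select_protocol("bmc", learned="redfish"))
print(select_protocol("dynamic"))
```

The three calls correspond to the forced, learned, and not-yet-learned cases in the test plan.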


Yang Liu (yliu12)
tags: added: stx.retestneeded
Yosief Gebremariam (ygebrema) wrote :

Verified using the build dated 2019-12-10_20-00-00

Yosief Gebremariam (ygebrema) wrote :

This was verified in multi-node lab wolfpass-8_12.

Yang Liu (yliu12)
tags: removed: stx.retestneeded