system health-query no response on Ceph query

Bug #1978726 reported by John Kung
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: John Kung

Bug Description

Brief Description
-----------------
When the ceph-api is unresponsive, system health-query fails with "Unable to perform health query." instead of reporting the failed Ceph condition.

Severity
--------
Major: System/Feature is usable but degraded. Unable to see details of health-query.

Steps to Reproduce
------------------
With a failed Ceph cluster, run system health-query. The query does not provide
details of the failed Ceph condition.

Expected Behavior
------------------
# system health-query
System Health:
All hosts are provisioned: [OK]
All hosts are unlocked/enabled: [OK]
All hosts have current configurations: [OK]
All hosts are patch current: [OK]
Ceph Storage Healthy: [Fail]
No alarms: [Fail]
[5] alarms found, [3] of which are management affecting
All kubernetes nodes are ready: [OK]
All kubernetes control plane pods are ready: [OK]

Actual Behavior
----------------
[root@controller-0 common(keystone_admin)]# date;system health-query;date
Unable to perform health query.

Reproducibility
---------------
Intermittent.
Occurs when the ceph-api is unresponsive.

System Configuration
--------------------
Ceph configured

Branch/Pull Time/Commit
-----------------------
Branch and the time when code was pulled or git commit or cengn load info

Last Pass
---------
N/A

Timestamp/Logs
--------------
[root@controller-0 common(keystone_admin)]# date;system health-query;date
Tue Jun 14 13:49:05 UTC 2022
Unable to perform health query.
Tue Jun 14 13:50:06 UTC 2022

sysinv 2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health [-] Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "get_system_health" info: "<unknown>": Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "get_system_health" info: "<unknown>"
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health Traceback (most recent call last):
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/health.py", line 29, in get_all
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health pecan.request.context)
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health File "/usr/lib64/python2.7/site-packages/sysinv/conductor/rpcapi.py", line 1430, in get_system_health
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health alarm_ignore_list=alarm_ignore_list))
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/proxy.py", line 126, in call
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health exc.info, real_topic, msg.get('method'))
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "get_system_health" info: "<unknown>"
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health
sysinv 2022-06-14 13:50:06.590 106220 WARNING wsme.api [-] Client-side error: Unable to perform health query.: ClientSideError: Unable to perform health query.

Test Activity
-------------
Integration Testing: orchestrated subcloud upgrades

Workaround
----------
The system health-query depends on Ceph being in a good state in order to return a response.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/845829

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/845829
Committed: https://opendev.org/starlingx/config/commit/9c2501a720f44b4498fb0a5ebf817f0459be4175
Submitter: "Zuul (22348)"
Branch: master

commit 9c2501a720f44b4498fb0a5ebf817f0459be4175
Author: John Kung <email address hidden>
Date: Tue Jun 14 17:30:51 2022 -0400

    system health-query response on ceph query

    In order to assure a response to the system health-query, when Ceph
    storage-backend is configured and the ceph-api is unresponsive,
    a Timeout is required.

    This Timeout does not rely on the underlying ceph-api timeout as
    the ceph-api may not timeout as expected.

    Test Plan:
    PASSED Verify system health-query response when Ceph is unhealthy
    PASSED Verify system health-query response when Ceph is healthy

    Closes-Bug: 1978726
    Signed-off-by: John Kung <email address hidden>
    Change-Id: I4702c409e8ea45946ba94fab6a0989a90f2f6604
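
The commit message above describes enforcing a timeout on the caller's side so that a hung ceph-api cannot block the health query. The following is a minimal sketch of that general idea, not the code from the merged change: the names call_with_timeout, HealthCheckTimeout, ceph_health_line and simulated_hung_query are all hypothetical, and the real sysinv fix may use a different mechanism.

```python
import threading
import time


class HealthCheckTimeout(Exception):
    """Raised when a check does not finish within its deadline (hypothetical name)."""


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a daemon thread and stop waiting after timeout_s seconds.

    The deadline is enforced by the caller, so it does not depend on the
    called service timing out on its own.
    """
    result = {}

    def _worker():
        try:
            result["value"] = func(*args, **kwargs)
        except Exception as exc:  # hand the failure back to the caller
            result["error"] = exc

    worker = threading.Thread(target=_worker)
    worker.daemon = True  # a hung worker must not block process exit
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        # The worker is abandoned, not killed; it may still be blocked.
        raise HealthCheckTimeout("no response within %.1f s" % timeout_s)
    if "error" in result:
        raise result["error"]
    return result["value"]


def ceph_health_line(query_ceph, timeout_s=10.0):
    """Format the Ceph line of a health report; timeouts and errors map to [Fail]."""
    try:
        healthy = bool(call_with_timeout(query_ceph, timeout_s))
    except Exception:
        healthy = False
    return "Ceph Storage Healthy: [%s]" % ("OK" if healthy else "Fail")


def simulated_hung_query(delay_s=2.0):
    """Stand-in for an unresponsive ceph-api call (assumption, for illustration only)."""
    time.sleep(delay_s)
    return True
```

With this sketch, ceph_health_line(lambda: simulated_hung_query(2.0), timeout_s=0.2) returns the [Fail] line after roughly 0.2 seconds instead of waiting for the slow call. The trade-off is that the abandoned worker thread keeps running until the underlying call returns.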

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → John Kung (john-kung)
importance: Undecided → Medium
tags: added: stx.7.0 stx.config