StarlingX

system health-query no response on Ceph query

Bug #1978726 reported by John Kung on 2022-06-14

This bug affects 1 person

Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   Medium       John Kung

Bug Description

Brief Description
-----------------
When the ceph-api is unresponsive, system health-query fails with "Unable to perform health query."
instead of reporting the failed Ceph condition.

Severity
--------
Major: System/Feature is usable but degraded. Unable to see details of health-query.

Steps to Reproduce
------------------
With a failed Ceph cluster, run system health-query. The command does not
provide details of the failed Ceph condition.

Expected Behavior
------------------
# system health-query
System Health:
All hosts are provisioned: [OK]
All hosts are unlocked/enabled: [OK]
All hosts have current configurations: [OK]
All hosts are patch current: [OK]
Ceph Storage Healthy: [Fail]
No alarms: [Fail]
[5] alarms found, [3] of which are management affecting
All kubernetes nodes are ready: [OK]
All kubernetes control plane pods are ready: [OK]

Actual Behavior
----------------
[root@controller-0 common(keystone_admin)]# date;system health-query;date
Unable to perform health query.

Reproducibility
---------------
Intermittent
Occurs when the ceph-api is unresponsive.

System Configuration
--------------------
Ceph configured

Branch/Pull Time/Commit
-----------------------
Branch and the time when code was pulled or git commit or cengn load info

Last Pass
---------
N/A

Timestamp/Logs
--------------
[root@controller-0 common(keystone_admin)]# date;system health-query;date
Tue Jun 14 13:49:05 UTC 2022
Unable to perform health query.
Tue Jun 14 13:50:06 UTC 2022

sysinv 2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health [-] Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "get_system_health" info: "<unknown>": Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "get_system_health" info: "<unknown>"
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health Traceback (most recent call last):
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/health.py", line 29, in get_all
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health pecan.request.context)
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health File "/usr/lib64/python2.7/site-packages/sysinv/conductor/rpcapi.py", line 1430, in get_system_health
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health alarm_ignore_list=alarm_ignore_list))
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/proxy.py", line 126, in call
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health exc.info, real_topic, msg.get('method'))
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "get_system_health" info: "<unknown>"
2022-06-14 13:50:06.589 106220 ERROR sysinv.api.controllers.v1.health
sysinv 2022-06-14 13:50:06.590 106220 WARNING wsme.api [-] Client-side error: Unable to perform health query.: ClientSideError: Unable to perform health query.

Test Activity
-------------
Integration Testing: orchestrated subcloud upgrades

Workaround
----------
The system health-query command responds only when Ceph is in a good state.

Tags:

OpenStack Infra (hudson-openstack) wrote on 2022-06-14: Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/845829

Changed in starlingx:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2022-06-15: Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/845829
Committed: https://opendev.org/starlingx/config/commit/9c2501a720f44b4498fb0a5ebf817f0459be4175
Submitter: "Zuul (22348)"
Branch: master

commit 9c2501a720f44b4498fb0a5ebf817f0459be4175
Author: John Kung <email address hidden>
Date: Tue Jun 14 17:30:51 2022 -0400

    system health-query response on ceph query

    In order to assure a response to the system health-query, when Ceph
    storage-backend is configured and the ceph-api is unresponsive,
    a Timeout is required.

    This Timeout does not rely on the underlying ceph-api timeout as
    the ceph-api may not timeout as expected.

    Test Plan:
    PASSED Verify system health-query response when Ceph is unhealthy
    PASSED Verify system health-query response when Ceph is healthy

    Closes-Bug: 1978726
    Signed-off-by: John Kung <email address hidden>
    Change-Id: I4702c409e8ea45946ba94fab6a0989a90f2f6604
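
The idea in the commit message (a caller-side timeout that does not depend on the ceph-api timing out by itself) can be sketched roughly as follows. This is only an illustration using the Python standard library and is not the code from the review above; the names call_with_timeout, ceph_health_line and simulated_hung_query, as well as the 10 second default, are hypothetical and chosen for the example.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a worker thread and wait at most timeout_s seconds.

    Returns (True, result) if func finished in time, (False, None) otherwise.
    The limit is enforced by the caller, so it holds even when func itself
    never times out.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not block on a worker that may still be stuck in func.
        executor.shutdown(wait=False)


def ceph_health_line(query_ceph, timeout_s=10.0):
    """Build the Ceph line of a health report (hypothetical helper).

    query_ceph is any callable returning True when Ceph is healthy.
    A timeout is reported as a failure instead of blocking the report.
    """
    completed, healthy = call_with_timeout(query_ceph, timeout_s)
    status = "OK" if (completed and healthy) else "Fail"
    return "Ceph Storage Healthy: [%s]" % status


def simulated_hung_query(delay_s=0.5):
    """Stand-in for an unresponsive ceph-api call (assumption for the demo)."""
    time.sleep(delay_s)
    return True


if __name__ == "__main__":
    print(ceph_health_line(simulated_hung_query, timeout_s=0.05))
    print(ceph_health_line(lambda: True))
```

Note that in this sketch the worker thread is not cancelled on timeout; it keeps running in the background until the underlying call returns. The caller simply stops waiting, which is enough for the health query to answer with a [Fail] line.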

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) on 2022-06-16

Changed in starlingx:
assignee:	nobody → John Kung (john-kung)
importance:	Undecided → Medium
tags:	added: stx.7.0 stx.config
