280.001 Subcloud offline alarm not cleared after controller swact

Bug #2040204 reported by srana
This bug affects 1 person
Affects     Status         Importance   Assigned to   Milestone
StarlingX   Fix Released   Medium       srana

Bug Description

Brief Description
-----------------
Subcloud offline alarms remain raised even though the subcloud availability is online. This is observed when a subcloud comes online during a controller swact.

Severity
--------
Major: Unable to create a patch strategy because of management-affecting alarms

Steps to Reproduce
------------------
1. Power off subclouds
2. Initiate a swact while the subclouds are coming online

Expected Behavior
------------------
Offline alarm should be cleared

Actual Behavior
----------------
Offline alarm is not cleared

Reproducibility
---------------
Seen once. Reproducible with an induced FM failure while subclouds come online.

System Configuration
--------------------
DC

Branch/Pull Time/Commit
-----------------------
Master (Oct. 23rd 2023)

Last Pass
---------
Unknown

Timestamp/Logs
--------------

audit.log.1.gz:2023-09-25 13:48:29.857 217770 ERROR dcmanager.audit.alarm_aggregation [-] Failed to update alarms for subcloud678 error: Error communicating with https://[2620:10a:a001:ac12::54c2]:18003/v1/alarms/summary Request to https://[2620:10a:a001:ac12::54c2]:18003/v1/alarms/summary timed out: fmclient.common.exceptions.InvalidEndpoint: Error communicating with https://[2620:10a:a001:ac12::54c2]:18003/v1/alarms/summary Request to https://[2620:10a:a001:ac12::54c2]:18003/v1/alarms/summary timed out
audit.log.1.gz:2023-09-25 13:48:29.919 217769 ERROR dcmanager.audit.subcloud_audit_worker_manager [-] Failed to get OS Client for subcloud: subcloud929: keystoneauth1.exceptions.http.ServiceUnavailable: Service Unavailable (HTTP 503)

state.log:2023-09-25 13:48:33.099 217767 WARNING root [req-2ce74f0c-a8ef-4509-8740-86162fd0ba74 - - - - -] fm_python_extension: Failed to connect to FM manager
state.log:fmAPI.cpp(140): Failed to connect to FM Manager.

state.log:2023-09-25 13:48:35.045 217766 WARNING root [req-2ce74f0c-a8ef-4509-8740-86162fd0ba74 - - - - -] fm_python_extension: Failed to connect to FM manager
state.log:fmAPI.cpp(140): Failed to connect to FM Manager.

audit.log.1.gz:2023-09-25 13:49:00.097 217769 ERROR dccommon.drivers.openstack.fm [-] get_alarm_summary exception=Error communicating with https://[2620:10a:a001:ac12::2ec2]:18003/v1/alarms/summary Request to https://[2620:10a:a001:ac12::2ec2]:18003/v1/alarms/summary timed out: fmclient.common.exceptions.InvalidEndpoint: Error communicating with https://[2620:10a:a001:ac12::2ec2]:18003/v1/alarms/summary Request to https://[2620:10a:a001:ac12::2ec2]:18003/v1/alarms/summary timed out

state.log:2023-09-25 13:49:29.230 217767 ERROR dcmanager.state.subcloud_state_manager [req-b1f5cc10-9512-4ecc-a289-b02bef115e14 - - - - -] Problem informing dcorch of subcloud state change,subcloud: subcloud990: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 3ebeff9de5914143a1dfb9861f008aab

state.log:2023-09-25 13:49:31.169 217766 ERROR dcmanager.state.subcloud_state_manager [req-2ce74f0c-a8ef-4509-8740-86162fd0ba74 - - - - -] Problem informing dcorch of subcloud state change,subcloud: subcloud990: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID a005530a921341c7a24343fc34c46408

Test Activity
-------------
Feature Testing

Workaround
----------
Delete alarm manually

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/899175

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/899175
Committed: https://opendev.org/starlingx/distcloud/commit/91ef6f00b7790652730e3542a685b9aac205b0fa
Submitter: "Zuul (22348)"
Branch: master

commit 91ef6f00b7790652730e3542a685b9aac205b0fa
Author: srana <email address hidden>
Date: Tue Oct 24 10:20:24 2023 -0400

    Fix: Clear Stale Subcloud Offline Alarm

    Currently, subcloud status alarms are updated strictly when
    the subcloud availability changes. An audit which doesn't
    update the subcloud availability will not update the status
    alarm. Typically, this is fine; however, it can be problematic
    when alarming services are unavailable (i.e., FM failure) during
    an availability update. In this rare case, the subcloud status
    alarm will not match the availability status. Ultimately, this can
    result in an inconsistent stale alarm status, with an offline alarm
    raised indefinitely for an available subcloud. Overall, the subcloud
    state manager should not assume that the subcloud availability status
    and the subcloud status alarms are aligned. This change ensures that
    the subcloud alarm status is eventually aligned with the actual
    availability by forcing alarm updates when the availability remains
    unchanged (during audit’s update_subcloud_availability).

    Test Plan:
     1. PASS: Ensure subcloud offline (280.001) alarm is cleared for
              subcloud restarts interleaved with a host-swact.
              - Power off subcloud, confirm subcloud offline alarm raised,
                power-on subcloud and initiate host-swact
     2. PASS: Induce FM failure during an availability update and ensure
              that the subcloud offline (280.001) alarm status
              is eventually cleared:
              - Power-off subcloud
              - Wait for availability status of subcloud to show offline
                (dcmanager subcloud list)
                Subcloud offline alarm should be raised
              - unmanage FM-mgr service, ps kill FM and power-on subcloud
              - Check alarm list, subcloud offline should remain raised
                It should FAIL to CLEAR at this point
              - Manage FM-mgr (ensure FM is connected) and wait for
                next "Handling update_subcloud_availability request"
                in state.log
              - Check offline alarm has been cleared

    Closes-Bug: 2040204

    Change-Id: I8c3dd10ca0b3cdfadf7672adfb6165b3194f64aa
    Signed-off-by: Salman Rana <email address hidden>
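
The failure mode and the fix described in the commit message can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the distcloud implementation: `FakeFm`, `StateManager`, `force_alarm_sync` and `run_scenario` are hypothetical names invented for illustration, and the in-memory fault manager stands in for the real FM service. The only behavior taken from the report is that an alarm update fails while FM is unreachable, and that the fix re-applies the alarm state even when the availability is unchanged.

```python
from dataclasses import dataclass, field

ONLINE = "online"
OFFLINE = "offline"


class FmUnavailable(Exception):
    """Stand-in for a failed connection to the fault manager (FM)."""


@dataclass
class FakeFm:
    """Hypothetical in-memory fault manager; not the real FM API."""
    available: bool = True
    offline_alarms: set = field(default_factory=set)

    def set_offline_alarm(self, subcloud, raised):
        if not self.available:
            raise FmUnavailable("Failed to connect to FM manager")
        if raised:
            self.offline_alarms.add(subcloud)
        else:
            self.offline_alarms.discard(subcloud)


@dataclass
class StateManager:
    """Hypothetical model of the availability/alarm bookkeeping."""
    fm: FakeFm
    force_alarm_sync: bool  # False models the old behavior, True the fix
    availability: dict = field(default_factory=dict)

    def update_subcloud_availability(self, subcloud, new_status):
        changed = self.availability.get(subcloud) != new_status
        self.availability[subcloud] = new_status
        # Old behavior: touch the alarm only on an availability change.
        # Fixed behavior: re-apply the alarm state on every update.
        if changed or self.force_alarm_sync:
            try:
                self.fm.set_offline_alarm(subcloud, new_status == OFFLINE)
            except FmUnavailable:
                pass  # availability is already stored; the alarm is now stale


def run_scenario(force_alarm_sync):
    """Return True if the offline alarm is still raised at the end."""
    fm = FakeFm()
    mgr = StateManager(fm=fm, force_alarm_sync=force_alarm_sync)
    mgr.update_subcloud_availability("subcloud1", OFFLINE)  # alarm raised
    fm.available = False  # FM unreachable, e.g. during a swact
    mgr.update_subcloud_availability("subcloud1", ONLINE)   # clear attempt fails
    fm.available = True   # FM recovers
    mgr.update_subcloud_availability("subcloud1", ONLINE)   # next audit, no change
    return "subcloud1" in fm.offline_alarms


if __name__ == "__main__":
    print("stale alarm without forced sync:", run_scenario(False))
    print("stale alarm with forced sync:", run_scenario(True))
```

In this model the third update is the important one: the availability is already online, so the old logic never revisits the alarm, while the forced sync clears it on the next audit cycle.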

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distcloud
Changed in starlingx:
assignee: nobody → srana (salmanr)