AIO-DX/Plus controller-1 got 250.001 alarm after installation

Bug #2045704 reported by Heitor Matsui
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Heitor Matsui
Milestone: (none listed)

Bug Description

Brief Description
-----------------
After a DX/Plus installation, a 250.001 alarm is raised on controller-1.

Severity
--------
Major

Steps to Reproduce
------------------
Install DX/Plus

Expected Behavior
------------------
After a DX/Plus installation, no 250.001 alarm is raised.

Actual Behavior
----------------
After a DX/Plus installation, a 250.001 alarm is raised.

Reproducibility
---------------
This is the first time this issue has been seen.

System Configuration
--------------------
DX/Plus

Branch/Pull Time/Commit
-----------------------
2023-12-01_19-00-11

Last Pass
---------
2023-11-30_19-27-37

Timestamp/Logs
--------------
[2023-12-04 08:35:03,292] 349 DEBUG MainThread ssh.send :: Send 'fm --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid --mgmt_affecting'
[2023-12-04 08:35:03,342] 551 DEBUG MainThread ssh.exec_cmd:: Expecting [.@controller-[01] .(keystone_admin)]\$ in prompt
[2023-12-04 08:35:05,413] 471 DEBUG MainThread ssh.expect :: Output:
UUID | Alarm ID | Reason Text | Entity ID | Management Affecting | Severity | Time Stamp
------------------------------------------------------------------------------------------
2913e9f4-c009-47f4-be8c-c9cd4c536c65 | 250.001 | controller-1 Configuration is out-of-date. (applied: 09a66b3a-92c8-4d35-8e6d-10344f7c7ace target: 88cb5e12-a308-4395-8055-3f3744c258f6) | host=controller-1 | True | major | 2023-12-04T08:13:32.108305
------------------------------------------------------------------------------------------
[sysadmin@controller-0 ~(keystone_admin)]$
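The reason text of the 250.001 alarm carries two config UUIDs, "applied" and "target", and the alarm indicates that they differ. A minimal sketch of pulling both values out of the reason text for comparison follows. The regular expression and the function names are illustrative assumptions, not part of fm or sysinv, and the pattern assumes the reason text always has the shape seen in this log.

```python
import re

# Reason text copied from the alarm-list output in this report.
REASON = (
    "controller-1 Configuration is out-of-date. "
    "(applied: 09a66b3a-92c8-4d35-8e6d-10344f7c7ace "
    "target: 88cb5e12-a308-4395-8055-3f3744c258f6)"
)

# Hypothetical pattern: two 36-character UUID strings labelled applied/target.
_PATTERN = re.compile(
    r"\(applied: (?P<applied>[0-9a-f-]{36})\s+target: (?P<target>[0-9a-f-]{36})\)"
)


def parse_config_uuids(reason_text):
    """Return (applied, target) UUID strings, or None if the text does not match."""
    match = _PATTERN.search(reason_text)
    if match is None:
        return None
    return match.group("applied"), match.group("target")


def is_out_of_date(reason_text):
    """Return True when both UUIDs are present and differ from each other."""
    uuids = parse_config_uuids(reason_text)
    return uuids is not None and uuids[0] != uuids[1]
```

Calling `is_out_of_date(REASON)` on the text above reports a mismatch, which matches the alarm being raised.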

sysinv 2023-12-04 08:08:29.664 24498 ERROR sysinv.conductor.manager [-] Partition creation failed on host: 5. Details: Disk partition /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0-part1 already exists.: sysinv.common.exception.PartitionAlreadyExists: Disk partition /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0-part1 already exists.
2023-12-04 08:08:29.664 24498 ERROR sysinv.conductor.manager Traceback (most recent call last):
2023-12-04 08:08:29.664 24498 ERROR sysinv.conductor.manager File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
2023-12-04 08:08:29.664 24498 ERROR sysinv.conductor.manager self.dialect.do_execute(
2023-12-04 08:08:29.664 24498 ERROR sysinv.conductor.manager File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 609, in do_execute
2023-12-04 08:08:29.664 24498 ERROR sysinv.conductor.manager cursor.execute(statement, parameters)
2023-12-04 08:08:29.664 24498 ERROR sysinv.conductor.manager psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "partition_uuid_key"
2023-12-04 08:08:29.664 24498 ERROR sysinv.conductor.manager DETAIL: Key (uuid)=(3826df48-d961-41ed-a1e4-81b4bdc8cfc0) already exists.
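The traceback above ends in a unique-constraint violation on a uuid column. As a rough illustration of that mechanism only, the sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL and psycopg2. The table layout and the `insert_partition` helper are hypothetical and do not reflect the sysinv schema or code.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a UNIQUE constraint on uuid."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE partitions ("
        "id INTEGER PRIMARY KEY, uuid TEXT UNIQUE, device_path TEXT)"
    )
    return conn


def insert_partition(conn, uuid, device_path):
    """Insert a row; return False instead of raising if the uuid already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO partitions (uuid, device_path) VALUES (?, ?)",
                (uuid, device_path),
            )
        return True
    except sqlite3.IntegrityError:
        # sqlite3 counterpart of the UniqueViolation seen in the log.
        return False
```

Inserting the same uuid twice succeeds the first time and is rejected the second time, leaving a single row in the table.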

Test Activity
-------------
Sanity

Workaround
----------
N/A

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Heitor Matsui (heitormatsui)
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/902728
Committed: https://opendev.org/starlingx/config/commit/dade970c8a64e2ca0d9dbe297d98e428dc817cf2
Submitter: "Zuul (22348)"
Branch: master

commit dade970c8a64e2ca0d9dbe297d98e428dc817cf2
Author: Heitor Matsui <email address hidden>
Date: Tue Dec 5 18:27:59 2023 -0300

    Fix missing runtime apply parameter

    The parameter config_out_of_date_timeout was removed by
    commit [1], however a reference to it remained in the code,
    introduced by commit [2].

    This commit fixes the reference, now declared as a constant
    on sysinv/common/constants.py

    [1] https://review.opendev.org/c/starlingx/config/+/894544
    [2] https://review.opendev.org/c/starlingx/config/+/896164

    Test Plan
    PASS: force a scenario where an amount of runtime configurations
          are enqueued/deferred, verify no errors on sysinv.log and
          that the message "_ready_to_apply_runtime_config: wait %s secs"
          is shown on the log, indicating that the previous point of
          failure is not failing anymore.

    Closes-bug: 2045704

    Change-Id: I96d6cbe5087936790a0854a51a464798a87786ef
    Signed-off-by: Heitor Matsui <email address hidden>
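The failure mode described in this commit message, a lookup of a configuration parameter that no longer exists, and its fix, a module-level constant, can be sketched as follows. This is only an illustration: the constant name `CONFIG_OUT_OF_DATE_TIMEOUT_SECS`, its value, and the `SimpleNamespace` stand-ins are assumptions, not the contents of sysinv/common/constants.py or the real configuration object.

```python
import types

# Stand-in for sysinv/common/constants.py; the name and value are assumptions.
constants = types.SimpleNamespace(CONFIG_OUT_OF_DATE_TIMEOUT_SECS=600)

# Stand-in for the service configuration after the parameter was removed.
conf = types.SimpleNamespace()


def timeout_before_fix():
    """Reads the removed parameter; fails because it no longer exists."""
    return conf.config_out_of_date_timeout


def timeout_after_fix():
    """Reads the value from a constant, which cannot go missing at runtime."""
    return constants.CONFIG_OUT_OF_DATE_TIMEOUT_SECS
```

In this sketch `timeout_before_fix()` raises `AttributeError`, while `timeout_after_fix()` returns the constant's value.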

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.config
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/903508

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/903508
Committed: https://opendev.org/starlingx/config/commit/f831b7ba70037e6ee13a82fd6ffd87f6fca29132
Submitter: "Zuul (22348)"
Branch: master

commit f831b7ba70037e6ee13a82fd6ffd87f6fca29132
Author: Heitor Matsui <email address hidden>
Date: Tue Dec 12 11:25:36 2023 -0300

    Change host config_target when trying to reapply

    Commit [1] introduced the capability for retrying runtime
    manifests that have not reported success back to the conductor
    after a specified timeout. During configuration-intensive
    periods like host provisioning, the host config_target can
    change multiple times due to various manifests being sent
    to it, so in a scenario where a manifest is lost, when it
    is retried, the host will apply it and then remain out-of-date
    because the config_target wasn't updated accordingly.

    This commit updates the host config_target before the system
    attempts to reapply a runtime manifest whose config_uuid
    differs from the current host config_target.

    This commit also changes the message type of runtime_config
    duplicate to "info", since it is expected to already exist
    in the database in the reapply scenario and is harmless to
    the system.

    [1] https://review.opendev.org/c/starlingx/config/+/894544

    Test Plan
    PASS: install/bootstrap/unlock single and multinode deployments
    PASS: force the issue to be reproduced for a host and verify:
          - Manifest is reapplied on the host by the conductor
          - The configuration is really applied to the host
            (e.g. in this case filesystem "docker" was increased)
          - Host config_target field is updated on the database
          - No config out-of-date alarms are present
    PASS: execute common configuration operations and verify that
          behavior is unchanged
    PASS: run intensive runtime configuration stress test and verify
          that the host is configured and no config out-of-date alarms
          are present on the system

    Closes-bug: 2045704

    Change-Id: Id2e41f990bebedd443e8eca76fc05a2d4b910aaf
    Signed-off-by: Heitor Matsui <email address hidden>
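The behavior described in this commit message can be sketched in a few lines: before a lost manifest is retried, the host's config_target is set to the manifest's config_uuid, so that a successful apply leaves applied and target equal. The `Host` class, `reapply_runtime_config`, and `report_applied` below are hypothetical names written for illustration and are not the conductor's actual code.

```python
import logging
from dataclasses import dataclass

LOG = logging.getLogger("reapply_sketch")


@dataclass
class Host:
    hostname: str
    config_applied: str
    config_target: str


def reapply_runtime_config(host, config_uuid, send_manifest):
    """Retry a manifest, first aligning host.config_target with its config_uuid."""
    if host.config_target != config_uuid:
        LOG.info(
            "Updating %s config_target from %s to %s before reapply",
            host.hostname, host.config_target, config_uuid,
        )
        host.config_target = config_uuid
    send_manifest(host, config_uuid)


def report_applied(host, config_uuid):
    """Record a successful apply; return True if the host is still out-of-date."""
    host.config_applied = config_uuid
    return host.config_applied != host.config_target
```

Without the target update, a host whose config_target has moved on would apply the retried manifest and still compare as out-of-date, which is the condition behind the 250.001 alarm in this report.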
