Debian: cert-mon PriorityQueue regression in python3

Bug #1992680 reported by Kyle MacLeod
This bug affects 1 person
Affects    Status        Importance  Assigned to   Milestone
StarlingX  Fix Released  Medium      Kyle MacLeod

Bug Description

Brief Description
-----------------

DC Debian: a subcloud remained with dc-cert "out-of-sync" after coming online; cert-manager did not audit the subcloud.

Severity
--------

<Major: System/Feature is usable but degraded>

Steps to Reproduce

    50 HW subclouds were deployed in parallel
    16 HW subclouds were deployed successfully (went to online state)
    Ran subcloud manage command for the 16 HW subclouds
    1 HW subcloud remained in the "out-of-sync" state

        | 9 | subcloud2007 | managed | online | complete | in-sync | None | None |
        | 10 | subcloud2008 | managed | online | complete | in-sync | None | None |
        | 11 | subcloud2009 | managed | online | complete | in-sync | None | None |
        | 28 | subcloud2026 | managed | online | complete | in-sync | None | None |
        | 29 | subcloud2027 | managed | online | complete | in-sync | None | None |
        | 30 | subcloud2028 | managed | online | complete | in-sync | None | None |
        | 31 | subcloud2029 | managed | online | complete | in-sync | None | None |
        | 32 | subcloud2030 | managed | online | complete | in-sync | None | None |
        | 45 | subcloud2043 | managed | online | complete | out-of-sync | None | None |
        | 46 | subcloud2044 | managed | online | complete | in-sync | None | None |
        | 47 | subcloud2045 | managed | online | complete | in-sync | None | None |
        | 48 | subcloud2046 | managed | online | complete | in-sync | None | None |
        | 49 | subcloud2047 | managed | online | complete | in-sync | None | None |
        | 50 | subcloud2048 | managed | online | complete | in-sync | None | None |
        | 51 | subcloud2049 | managed | online | complete | in-sync | None | None |
        | 52 | subcloud2050 | managed | online | complete | in-sync | None | None |

        [sysadmin@controller-0 ~(keystone_admin)]$ dcmanager subcloud show subcloud2043
        +-----------------------------+----------------------------+
        | Field | Value |
        +-----------------------------+----------------------------+
        | id | 45 |
        | name | subcloud2043 |
        | description | None |
        | location | None |
        | software_version | 22.12 |
        | management | managed |
        | availability | online |
        | deploy_status | complete |
        | management_subnet | fdff:719a:bf60:43::/64 |
        | management_start_ip | fdff:719a:bf60:43::2 |
        | management_end_ip | fdff:719a:bf60:43::ffff |
        | management_gateway_ip | fdff:719a:bf60:43::1 |
        | systemcontroller_gateway_ip | 2620:10a:a001:d45::b4 |
        | group_id | 1 |
        | created_at | 2022-10-10 17:41:14.917949 |
        | updated_at | 2022-10-10 20:08:05.963986 |
        | backup_status | None |
        | backup_datetime | None |
        | dc-cert_sync_status | unknown |
        | firmware_sync_status | in-sync |
        | identity_sync_status | in-sync |
        | kubernetes_sync_status | in-sync |
        | kube-rootca_sync_status | in-sync |
        | load_sync_status | in-sync |
        | patching_sync_status | in-sync |
        | platform_sync_status | in-sync |
        +-----------------------------+----------------------------+

Expected Behavior
-----------------

All subclouds should go to "in-sync" state after managing them

Actual Behavior
---------------

1 subcloud remained in the "out-of-sync" state

Reproducibility
---------------

intermittent, seen in one subcloud

System Configuration
--------------------

DC

Load info (eg: 2022-03-10_20-00-07)

STX Master
----------

Last Pass
----------

New test scenario - Debian Scale testing

Timestamp/Logs

Unexpected behavior:

dcmanager.log
2022-10-10 18:38:41.651 115387 INFO dcmanager.manager.subcloud_manager [-] Successfully deployed subcloud2043

cert-mon.log (DEPLOY PHASE)
2022-10-10T17:41:18.000 controller-0 cert-mon: info 115979 INFO sysinv.cert_mon.utils [-] api_cmd http://[2620:10a:a001:d45::d00]:8119/v1.0/subclouds/subcloud2043
2022-10-10T17:41:18.000 controller-0 cert-mon: info 115979 INFO sysinv.cert_mon.watcher [-] DCIntermediateCertRenew check_filter[subcloud2043]: subcloud is not online

2022-10-10T18:47:14.000 controller-0 cert-mon: info 2022-10-10 18:47:14.156 115979 INFO sysinv.cert_mon.service [req-3d2b74fe-97bf-4945-bbfa-19d7db55d11a - - - - -] subcloud2043 is online. An online audit is queued
2022-10-10T18:47:14.000 controller-0 cert-mon: info 2022-10-10 18:47:14.156 115979 INFO sysinv.cert_mon.subcloud_audit_queue [req-3d2b74fe-97bf-4945-bbfa-19d7db55d11a - - - - -] Enqueued: SubcloudAuditData: {name: subcloud2043, audit_count: 1}

dcmanager.log (SUBCLOUD MANAGE COMMAND)
2022-10-10 20:08:05.978 115387 INFO dcmanager.manager.subcloud_manager [req-5fac322d-e6d3-4b8b-bd7a-ea15e9056d7a 35d74fdeda8e4fbabb480a374ab9e86f - - default default] Notifying dcorch, subcloud:subcloud2043 management: managed, availability:online

cert-mon.log (AFTER RUNNING SUBCLOUD MANAGE COMMAND)
2022-10-10T20:08:05.000 controller-0 cert-mon: info 2022-10-10 20:08:05.981 115979 INFO sysinv.cert_mon.service [req-5fac322d-e6d3-4b8b-bd7a-ea15e9056d7a 35d74fdeda8e4fbabb480a374ab9e86f - - default default] subcloud2043 is managed. An audit is queued

Expected behavior (subcloud2050)

cert-mon.log (deploy phase)

2022-10-10T17:41:35.000 controller-0 cert-mon: info 115979 INFO sysinv.cert_mon.utils [-] api_cmd http://[2620:10a:a001:d45::d00]:8119/v1.0/subclouds/subcloud2050

2022-10-10T17:41:35.000 controller-0 cert-mon: info 115979 INFO sysinv.cert_mon.watcher [-] DCIntermediateCertRenew check_filter[subcloud2050]: subcloud is not online

2022-10-10T18:46:13.000 controller-0 cert-mon: info 2022-10-10 18:46:13.749 115979 INFO sysinv.cert_mon.service [req-3d2b74fe-97bf-4945-bbfa-19d7db55d11a - - - - -] subcloud2050 is online. An online audit is queued
2022-10-10T18:46:13.000 controller-0 cert-mon: info 2022-10-10 18:46:13.750 115979 INFO sysinv.cert_mon.subcloud_audit_queue [req-3d2b74fe-97bf-4945-bbfa-19d7db55d11a - - - - -] Enqueued: SubcloudAuditData: {name: subcloud2050, audit_count: 1}
2022-10-10T18:46:14.000 controller-0 cert-mon: info 115979 INFO sysinv.cert_mon.certificate_mon_manager [-] Auditing subcloud subcloud2050, attempt #1 [qsize: 0]
2022-10-10T18:46:14.000 controller-0 cert-mon: info 115979 INFO sysinv.cert_mon.certificate_mon_manager [-] subcloud2050 intermediate CA cert is in-sync

2022-10-10T18:46:14.000 controller-0 cert-mon: info 115979 INFO sysinv.cert_mon.utils [-] Updated subcloud subcloud2050 status: in-sync

dcmanager.log
2022-10-10 20:08:22.745 115387 INFO dcmanager.manager.subcloud_manager [req-3ca2c473-3ced-4290-839d-74e3160b578b 35d74fdeda8e4fbabb480a374ab9e86f - - default default] Notifying dcorch, subcloud:subcloud2050 management: managed, availability:online

cert-mon.log (after running subcloud manage command)

2022-10-10T20:08:22.000 controller-0 cert-mon: info 2022-10-10 20:08:22.749 115979 INFO sysinv.cert_mon.service [req-3ca2c473-3ced-4290-839d-74e3160b578b 35d74fdeda8e4fbabb480a374ab9e86f - - default default] subcloud2050 is managed. An audit is queued
2022-10-10T20:08:22.000 controller-0 cert-mon: info 2022-10-10 20:08:22.749 115979 INFO sysinv.cert_mon.subcloud_audit_queue [req-3ca2c473-3ced-4290-839d-74e3160b578b 35d74fdeda8e4fbabb480a374ab9e86f - - default default] Enqueued: SubcloudAuditData: {name: subcloud2050, audit_count: 1}

2022-10-10T20:08:24.000 controller-0 cert-mon: info 115979 INFO sysinv.cert_mon.certificate_mon_manager [-] Auditing subcloud subcloud2050, attempt #1 [qsize: 1]

2022-10-10T20:08:24.000 controller-0 cert-mon: info 115979 INFO sysinv.cert_mon.certificate_mon_manager [-] subcloud2050 intermediate CA cert is in-sync

2022-10-10T20:08:24.000 controller-0 cert-mon: info 115979 INFO sysinv.cert_mon.utils [-] Updated subcloud subcloud2050 status: in-sync

Alarms

Test Activity

Feature Testing

Workaround

No workaround identified

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/861099

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/861099
Committed: https://opendev.org/starlingx/config/commit/da21e1f9f708b5ba173b9a2662b3e933b90c8482
Submitter: "Zuul (22348)"
Branch: master

commit da21e1f9f708b5ba173b9a2662b3e933b90c8482
Author: Kyle MacLeod <email address hidden>
Date: Wed Oct 12 12:54:18 2022 -0400

    Fix cert-mon PriorityQueue regression in python3

    In python3 the PriorityQueue raises an exception
    due to

    TypeError: '<' not supported between
            instances of 'SubcloudAuditData' and 'SubcloudAuditData'

    The fix is to include a __lt__ method in SubcloudAuditData.
    A timestamp field is added (primarily used in the tuple added to the
    queue, but easy enough to include here) in order to aid in the sorting.

    Test Plan:

    PASS: trigger cert-mon audit for subclouds. Verify that the exception
    is not raised, and that subclouds are properly enqueued for audit.

    Closes-Bug: 1992680
    Change-Id: Ibaa9a421eb809edc434793bc7e8ae92691be021f
    Signed-off-by: Kyle MacLeod <email address hidden>
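
The failure mode described in the commit message can be reproduced outside cert-mon. The following is a minimal sketch, not the merged change: the class bodies, the `timestamp` default, the `UnorderedAuditData` name, and the `drain_in_order` helper are all assumptions made for illustration. It shows that in Python 3, when two queue entries share the same leading tuple element, `queue.PriorityQueue` falls back to comparing the payload objects, which raises `TypeError` unless the class defines `__lt__`.

```python
import queue
import time


class UnorderedAuditData:
    """Hypothetical payload with no ordering methods (the failing case)."""

    def __init__(self, name):
        self.name = name


class SubcloudAuditData:
    """Illustrative stand-in for the cert-mon class; fields are assumptions."""

    def __init__(self, name, audit_count=0, timestamp=None):
        self.name = name
        self.audit_count = audit_count
        # Assumed tie-breaker field, defaulting to the creation time.
        self.timestamp = time.time() if timestamp is None else timestamp

    def __lt__(self, other):
        # Only consulted when the leading tuple elements compare equal.
        return self.timestamp < other.timestamp


def drain_in_order(entries):
    """Put (priority, payload) tuples on a PriorityQueue and return the
    payload names in the order the queue hands them back."""
    q = queue.PriorityQueue()
    for priority, payload in entries:
        q.put((priority, payload))
    names = []
    while not q.empty():
        names.append(q.get()[1].name)
    return names


if __name__ == "__main__":
    try:
        drain_in_order([(100, UnorderedAuditData("subcloud-a")),
                        (100, UnorderedAuditData("subcloud-b"))])
    except TypeError as exc:
        print("without __lt__:", exc)

    print("with __lt__:", drain_in_order([
        (100, SubcloudAuditData("subcloud-b", timestamp=2.0)),
        (100, SubcloudAuditData("subcloud-a", timestamp=1.0)),
    ]))
```

Python 2 allowed ordering comparisons between arbitrary objects, so the same code would not raise there; this is consistent with the regression appearing only under python3. An exception raised while enqueuing would also be consistent with the subcloud2043 log above, where "An audit is queued" is not followed by an "Enqueued" line.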

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.8.0 stx.debian stx.distcloud stx.security
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Kyle MacLeod (kmacleod)