AIO-DX subcloud rehoming fails intermittently when installing trusted-ca due to 250.001

Bug #2054462 reported by Reinildes Oliveira
Affects    Status        Importance  Assigned to         Milestone
StarlingX  Fix Released  Medium      Reinildes Oliveira

Bug Description

Brief Description
--------------------------------------

A few AIO-DX subclouds failed during parallel rehoming in the 'common/install-trusted-ca' playbook task, likely due to the 250.001 alarm.

Severity
--------------------------------------

Major

Steps to Reproduce
--------------------------------------

1. Install DC1000 and DC1000-2

2. Deploy and Manage 9 AIO-DX HW subclouds on DC1000

4. Unmanage all AIO-DX HW subclouds on DC1000
$ for i in {4001,4006,4022,4003,4018,4020,4007,4009,4008}; do dcmanager subcloud unmanage subcloud$i; done

5. Connect to DC1000-2 and initiate the parallel rehoming (Traffic Shaping/Latency disabled)
$ for i in {4001,4006,4022,4003,4018,4020,4007,4009,4008}; do "./dx_migrate_subcloud-range-hw.sh" $i; done

Expected Behavior
--------------------------------------

AIO-DX subclouds rehomed to the new System Controller, DC1000-2.

Actual Behavior
--------------------------------------

Some AIO-DX subclouds failed to rehome, likely because the 250.001 (configuration out-of-date) alarm did not clear in time.

Reproducibility
--------------------------------------

Run 1 (Manual Rehoming): 2 out of 9 subclouds failed - {4006, 4008}
Run 2 (GEO Redundancy Rehoming): 3 out of 21 subclouds failed - {subcloud4010, subcloud4003, subcloud4001}

System Configuration
--------------------------------------

DC1000 and DC1000-2 labs.

Last Pass
--------------------------------------

First time testing AIO-DX parallel rehoming.

Timestamp/Logs
--------------------------------------

TASK [common/install-trusted-ca : Check if an 250.001 alarm was raised and wait it to be cleared] ***
Wednesday 17 January 2024 17:52:15 +0000 (0:00:00.016) 0:01:47.967 *****
FAILED - RETRYING: Check if an 250.001 alarm was raised and wait it to be cleared (5 retries left).
FAILED - RETRYING: Check if an 250.001 alarm was raised and wait it to be cleared (4 retries left).
FAILED - RETRYING: Check if an 250.001 alarm was raised and wait it to be cleared (3 retries left).
FAILED - RETRYING: Check if an 250.001 alarm was raised and wait it to be cleared (2 retries left).
FAILED - RETRYING: Check if an 250.001 alarm was raised and wait it to be cleared (1 retries left).
changed: [subcloud4006]

TASK [common/install-trusted-ca : Fail when the alarm remains] *****************
Wednesday 17 January 2024 17:54:31 +0000 (0:02:15.236) 0:04:03.203 *****
fatal: [subcloud4006]: FAILED! => changed=false
  msg: Timed out waiting 250.001 alarm to clear out. Check sysinv and puppet logs for more information and solve any 250.001 alarms before retrying.

PLAY RECAP *********************************************************************
subcloud4006 : ok=56 changed=29 unreachable=0 failed=1 skipped=55 rescued=0 ignored=0

Wednesday 17 January 2024 17:54:31 +0000 (0:00:00.021) 0:04:03.224 *****

// fm-event.log:

/var/log/fm-event.log:2024-01-17T18:02:46.072 controller-0 fmManager: info { "event_log_id" : "250.001", "reason_text" : "controller-1 Configuration is out-of-date. (applied: ced9e00b-d538-47db-9cc9-d787900d9a2a target: 2e007cb5-bd79-4d96-991a-db6ce5823a39)", "entity_instance_id" : "region=subcloud4006.system=dc-dx-subcloud4006.host=controller-1", "severity" : "major", "state" : "clear", "timestamp" : "2024-01-17 18:02:46.071752" }
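
For reference, the clear event above was logged roughly eight minutes after the playbook gave up on subcloud4006 (17:54:31). A small sketch of that arithmetic, assuming GNU `date` and that both logs are in UTC (the playbook log shows +0000; the fm-event.log timezone is an assumption). `elapsed_seconds` is a hypothetical helper name, not part of any StarlingX tool.

```shell
#!/usr/bin/env bash
# elapsed_seconds START END
# Prints the number of seconds between two "YYYY-MM-DD HH:MM:SS" timestamps.
# Assumes GNU date (-d) and treats both timestamps as UTC.
elapsed_seconds() {
    local start end
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $((end - start))
}

# Playbook failure time vs. 250.001 clear time, both taken from the logs above.
elapsed_seconds "2024-01-17 17:54:31" "2024-01-17 18:02:46"   # prints 495
```

That is 8 minutes 15 seconds, so the alarm did clear on its own, only later than the task was willing to wait.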

// subcloud4008
// Note: The error message differs slightly, but it appears to be the same issue with the 250.001 alarm.

TASK [common/install-trusted-ca : Wait for .config_applied file stat to change] ***
Wednesday 17 January 2024 17:54:50 +0000 (0:00:00.016) 0:03:54.281 *****
FAILED - RETRYING: Wait for .config_applied file stat to change (10 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (9 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (8 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (7 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (6 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (5 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (4 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (3 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (2 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (1 retries left).
ok: [subcloud4008]

TASK [common/install-trusted-ca : Set fail control variable] *******************
Wednesday 17 January 2024 17:58:26 +0000 (0:03:35.870) 0:07:30.152 *****
ok: [subcloud4008]

TASK [common/install-trusted-ca : Fail when the manifest apply times out] ******
Wednesday 17 January 2024 17:58:26 +0000 (0:00:00.022) 0:07:30.175 *****
fatal: [subcloud4008]: FAILED! => changed=false
  msg: Timed out applying puppet runtime manifest. Check sysinv and puppet logs for more information and solve any 250.001 alarms before retrying.

PLAY RECAP *********************************************************************
subcloud4008 : ok=56 changed=28 unreachable=0 failed=1 skipped=53 rescued=0 ignored=0

Alarms
--------------------------------------

No alarms were reported before starting the rehoming.

Test Activity
--------------------------------------

Feature Testing

Workaround
--------------------------------------

Delete the subcloud.
Re-attempt the subcloud rehoming.
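
Since the failure message asks for any 250.001 alarms to be resolved before retrying, it may help to confirm on the subcloud that the alarm has cleared before the next attempt. The following is a minimal bash sketch, not the playbook's own logic. It assumes the `fm` CLI is available on the subcloud controller and that platform credentials are already loaded (for example via `source /etc/platform/openrc`). `wait_until` and `alarm_cleared` are hypothetical helper names, and the retry count and delay in the usage comment are illustrative values only.

```shell
#!/usr/bin/env bash
# wait_until RETRIES DELAY CMD [ARGS...]
# Runs CMD until it succeeds or RETRIES attempts have been made,
# sleeping DELAY seconds between attempts. Returns 0 on success, 1 on timeout.
wait_until() {
    local retries=$1 delay=$2 attempt
    shift 2
    for ((attempt = 1; attempt <= retries; attempt++)); do
        if "$@"; then
            return 0
        fi
        echo "FAILED - RETRYING ($((retries - attempt)) retries left)" >&2
        if ((attempt < retries)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical check: succeeds only when `fm alarm-list` runs successfully
# and its output does not mention alarm ID 250.001.
alarm_cleared() {
    local out
    out=$(fm alarm-list) || return 1
    ! grep -q '250\.001' <<<"$out"
}

# Example (illustrative values): poll every 10 seconds, up to 60 times.
# wait_until 60 10 alarm_cleared && echo "250.001 cleared, safe to retry"
```

Treating a failed `fm` call as "not cleared" avoids reporting success when the CLI itself is unavailable or unauthenticated.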

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Reinildes Oliveira (rjosemat)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/909609
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/6335d7d49183d716d7f4032c81f1d4138e756755
Submitter: "Zuul (22348)"
Branch: master

commit 6335d7d49183d716d7f4032c81f1d4138e756755
Author: Rei Oliveira <email address hidden>
Date: Tue Feb 20 15:57:12 2024 -0300

    Rehome: Increase timeout for certs to be installed

    This commit addresses 2 related bugs, in the sense that they
    are related to ansible tasks that wait for certificates
    to be installed.

    Task 'Check admin-ep-cert.pem updated' depends on cert-mon to
    install the certificate. Cert-mon may fail and reattempt after
    10 minutes. This change increases that timeout for this task
    to be larger than that. I'm also decreasing the delay in a
    half as it is a quick stat operation, to allow it to be
    detected quickly in most cases where the first cert-mon attempt
    works.

    Task 'Verify if there are 250.001 (config out-of-date) alarms'
    is dependent on puppet to apply a config change to install
    the certificate and sysinv to clear the alarm. When I
    reproduced this issue it took 1 minute longer for the alarm
    to clear. This change increases the timeout of the task in
    about 50%.

    Test Plan:

    PASS: Rehome a subcloud with 1200 ms latency injected and
          50% cap on CPU capacity.

    Closes-Bug: 2054462
    Closes-Bug: 2054463
    Change-Id: I017fab0ccb13629c63a7cd855470f0a777f06e22
    Signed-off-by: Rei Oliveira <email address hidden>
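
The commit message above describes two knobs on a polling task: the total time it waits and the delay between checks. As a rough illustration of how they interact, the sketch below computes the smallest retry count whose total wait exceeds a given budget. It assumes the total wait is approximately retries times delay and ignores the time each check takes. `min_retries` is a hypothetical helper, and the delay values are illustrative, not the ones used in the playbook. Only the 10-minute cert-mon reattempt interval comes from the commit message.

```shell
#!/usr/bin/env bash
# min_retries BUDGET_SECONDS DELAY_SECONDS
# Prints the smallest retry count for which retries * delay strictly
# exceeds the budget (integer arithmetic, check duration ignored).
min_retries() {
    local budget=$1 delay=$2
    echo $((budget / delay + 1))
}

# cert-mon may reattempt after 10 minutes (600 s), so the wait must exceed that.
min_retries 600 10   # prints 61 (61 * 10 s = 610 s)
min_retries 600 5    # prints 121 (121 * 5 s = 605 s)
```

Halving the delay roughly doubles the number of retries needed to cover the same window, which is why a shorter delay has to be paired with a higher retry count.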

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distcloud stx.security