AIO-DX subcloud rehoming fails intermittently when installing trusted-ca due to 250.001
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Medium | Reinildes Oliveira | |
Bug Description
Brief Description
-------
A few AIO-DX subclouds failed during parallel rehoming in the 'common/
Severity
-------
Major
Steps to Reproduce
-------
1. Install DC1000 and DC1000-2
2. Deploy and Manage 9 AIO-DX HW subclouds on DC1000
4. Unmanage all AIO-DX HW subclouds on DC1000
$ for i in {4001,4006,
5. Connect to DC1000-2 and initiate the parallel rehoming (Traffic Shaping/Latency disabled)
$ for i in {4001,4006,
Expected Behavior
-------
AIO-DX subclouds rehomed to the new System Controller, DC1000-2.
Actual Behavior
-------
Some AIO-DX subclouds failed to be rehomed, likely due to a 250.001 alarm.
Reproducibility
-------
Run 1 (Manual Rehoming): 2 out of 9 subclouds - {4006, 4008}
Run 2 (GEO Redundancy Rehoming): 3 out of 21 subclouds - {subcloud4010, subcloud4003, subcloud4001}
System Configuration
-------
DC1000 and DC1000-2 labs.
Last Pass
-------
First time testing AIO-DX parallel rehoming.
Timestamp/Logs
-------
TASK [common/
Wednesday 17 January 2024 17:52:15 +0000 (0:00:00.016) 0:01:47.967 *****
FAILED - RETRYING: Check if an 250.001 alarm was raised and wait it to be cleared (5 retries left).
FAILED - RETRYING: Check if an 250.001 alarm was raised and wait it to be cleared (4 retries left).
FAILED - RETRYING: Check if an 250.001 alarm was raised and wait it to be cleared (3 retries left).
FAILED - RETRYING: Check if an 250.001 alarm was raised and wait it to be cleared (2 retries left).
FAILED - RETRYING: Check if an 250.001 alarm was raised and wait it to be cleared (1 retries left).
changed: [subcloud4006]
TASK [common/
Wednesday 17 January 2024 17:54:31 +0000 (0:02:15.236) 0:04:03.203 *****
fatal: [subcloud4006]: FAILED! => changed=false
msg: Timed out waiting 250.001 alarm to clear out. Check sysinv and puppet logs for more information and solve any 250.001 alarms before retrying.
PLAY RECAP *******
subcloud4006 : ok=56 changed=29 unreachable=0 failed=1 skipped=55 rescued=0 ignored=0
Wednesday 17 January 2024 17:54:31 +0000 (0:00:00.021) 0:04:03.224 *****
// fm-event.log:
/var/log/
// subcloud4008
// Note: The error differs slightly, but it appears to be the same issue with the 250.001 alarm.
TASK [common/
Wednesday 17 January 2024 17:54:50 +0000 (0:00:00.016) 0:03:54.281 *****
FAILED - RETRYING: Wait for .config_applied file stat to change (10 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (9 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (8 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (7 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (6 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (5 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (4 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (3 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (2 retries left).
FAILED - RETRYING: Wait for .config_applied file stat to change (1 retries left).
ok: [subcloud4008]
TASK [common/
Wednesday 17 January 2024 17:58:26 +0000 (0:03:35.870) 0:07:30.152 *****
ok: [subcloud4008]
TASK [common/
Wednesday 17 January 2024 17:58:26 +0000 (0:00:00.022) 0:07:30.175 *****
fatal: [subcloud4008]: FAILED! => changed=false
msg: Timed out applying puppet runtime manifest. Check sysinv and puppet logs for more information and solve any 250.001 alarms before retrying.
PLAY RECAP *******
subcloud4008 : ok=56 changed=28 unreachable=0 failed=1 skipped=53 rescued=0 ignored=0
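When many subclouds are rehomed in parallel, the PLAY RECAP lines above are the quickest way to see which hosts failed. The following is a minimal sketch, assuming the recap lines keep the `host : key=value ...` shape shown in these logs; the helper names `parse_recap` and `failed_hosts` are hypothetical and not part of any StarlingX tooling.

```python
import re

# Matches Ansible PLAY RECAP lines such as:
#   subcloud4006 : ok=56 changed=29 unreachable=0 failed=1 skipped=55 rescued=0 ignored=0
RECAP_LINE = re.compile(r"^\s*(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def parse_recap(log_text):
    """Return {host: {counter: int}} for every recap line found in log_text."""
    results = {}
    for line in log_text.splitlines():
        match = RECAP_LINE.match(line)
        if match:
            pairs = (pair.split("=") for pair in match.group("counters").split())
            results[match.group("host")] = {key: int(value) for key, value in pairs}
    return results


def failed_hosts(log_text):
    """Return the sorted hosts whose recap reports failed or unreachable tasks."""
    return sorted(
        host
        for host, counters in parse_recap(log_text).items()
        if counters.get("failed", 0) > 0 or counters.get("unreachable", 0) > 0
    )
```

Passing the concatenated playbook output of a rehoming run to `failed_hosts` gives the list of subclouds that need the workaround; lines that are not recap lines (task headers, timestamps, `fatal:` messages) are ignored by the pattern.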
Alarms
-------
No alarms were reported before starting the rehoming.
Test Activity
-------
Feature Testing
Workaround
-------
Delete the subcloud.
Re-attempt the subcloud rehoming.
Changed in starlingx: | |
status: | New → In Progress |
Changed in starlingx: | |
assignee: | nobody → Reinildes Oliveira (rjosemat) |
Changed in starlingx: | |
importance: | Undecided → Medium |
tags: | added: stx.9.0 stx.distcloud stx.security |
Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/909609
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/6335d7d49183d716d7f4032c81f1d4138e756755
Submitter: "Zuul (22348)"
Branch: master
commit 6335d7d49183d716d7f4032c81f1d4138e756755
Author: Rei Oliveira <email address hidden>
Date: Tue Feb 20 15:57:12 2024 -0300
Rehome: Increase timeout for certs to be installed
This commit addresses 2 related bugs, in the sense that they
are related to ansible tasks that wait for certificates
to be installed.
Task 'Check admin-ep-cert.pem updated' depends on cert-mon to
install the certificate. Cert-mon may fail and reattempt after
10 minutes. This change increases the timeout for this task
to be longer than that. I'm also cutting the delay in
half, as it is a quick stat operation, so that the update is
detected quickly in most cases, where the first cert-mon attempt
works.
Task 'Verify if there are 250.001 (config out-of-date) alarms'
is dependent on puppet to apply a config change to install
the certificate and sysinv to clear the alarm. When I
reproduced this issue it took 1 minute longer for the alarm
to clear. This change increases the timeout of the task by
about 50%.
Test Plan:
PASS: Rehome a subcloud with 1200 ms latency injected and
50% cap on CPU capacity.
Closes-Bug: 2054462
Closes-Bug: 2054463
Change-Id: I017fab0ccb13629c63a7cd855470f0a777f06e22
Signed-off-by: Rei Oliveira <email address hidden>
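
The commit message above describes tuning two polling tasks so that their total wait exceeds the slowest expected recovery (for example, cert-mon reattempting after 10 minutes). The sketch below illustrates that reasoning only; it loosely mirrors a retries/delay polling loop and is not the playbook code. The function names and the numbers in the usage note are hypothetical, not the values used in the actual change.

```python
import time


def wait_until(check, retries, delay, sleep=time.sleep):
    """Poll check() until it returns True or the retry budget runs out.

    Makes one initial attempt plus up to `retries` re-attempts, sleeping
    `delay` seconds between attempts. Returns True on success, else False.
    """
    for attempt in range(retries + 1):
        if check():
            return True
        if attempt < retries:
            sleep(delay)
    return False


def wait_budget(retries, delay):
    """Upper bound, in seconds, on the time spent sleeping between attempts."""
    return retries * delay


def retries_needed(min_budget, delay):
    """Smallest retry count whose wait budget strictly exceeds min_budget seconds."""
    return min_budget // delay + 1
```

For instance, to outlast a 600-second reattempt interval with an assumed 10-second delay, `retries_needed(600, 10)` gives 61 retries, a budget of 610 seconds. Halving the delay makes a successful first attempt visible sooner, but roughly doubles the number of retries needed to keep the same overall budget.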