AIO-SX Subcloud Rehoming in parallel - Exception when installing system_root_ca_cert

Bug #2029510 reported by Marcelo de Castro Loebens
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Marcelo de Castro Loebens

Bug Description

Brief Description
-----------------
AIO-SX subcloud rehoming in parallel (10 subclouds simultaneously) is failing due to an exception in the playbook when trying to install system_root_ca_cert certificate as a Trusted CA certificate.

Severity
--------
Major.

Steps to Reproduce
------------------
- Deploy/Manage 10 AIO-SX subclouds
- Unmanage the subclouds, check dates, and check that the subclouds are free of alarms
- From another SystemController, migrate/rehome 10 subclouds in parallel

Expected Behavior
------------------
Subclouds migrated to the new System Controller.

Actual Behavior
----------------
Some subclouds may fail when trying to install system_root_ca_cert.

Reproducibility
---------------
Intermittent (~20% of subclouds fail when 10 are rehomed in parallel)

System Configuration
--------------------
2 DX SystemControllers + 10 SX subclouds

Branch/Pull Time/Commit
-----------------------
master

Last Pass
---------
NA

Timestamp/Logs
--------------
TASK [common/install-trusted-ca : Install system_root_ca_cert certificate as a Trusted CA certificate] ***
Thursday 13 July 2023 10:25:03 +0000 (0:00:00.017) 0:01:33.929 *********
FAILED - RETRYING: Install system_root_ca_cert certificate as a Trusted CA certificate (3 retries left).
FAILED - RETRYING: Install system_root_ca_cert certificate as a Trusted CA certificate (2 retries left).
FAILED - RETRYING: Install system_root_ca_cert certificate as a Trusted CA certificate (1 retries left).
fatal: [subcloud2001]: FAILED! => changed=true
  attempts: 3
  cmd: source /etc/platform/openrc && system certificate-install -m ssl_ca /tmp/ca_xu18g3y4.pem
  delta: '0:01:56.588025'
  end: '2023-07-13 10:41:03.562522'
  msg: non-zero return code
  rc: 1
  start: '2023-07-13 10:39:06.974497'
  stderr: 'Certificate /tmp/ca_xu18g3y4.pem not installed: Expecting value: line 1 column 1 (char 0)'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

PLAY RECAP *********************************************************************
subcloud2001 : ok=55 changed=33 unreachable=0 failed=1 skipped=43 rescued=0 ignored=0
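
A note on the stderr text: "Expecting value: line 1 column 1 (char 0)" is the message Python's json module produces when asked to decode an empty or non-JSON string. One plausible reading, offered as an assumption rather than a confirmed diagnosis, is that the client received an empty or non-JSON response while the API was under load. The sketch below only reproduces the message; parse_api_body is a hypothetical helper, not part of the StarlingX client.

```python
import json


def parse_api_body(body: str):
    """Hypothetical helper: decode a response body the way a JSON-based client might.

    Returns the decoded object, or an error string mirroring the log format above.
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # str(exc) carries the message, line, column and character offset.
        return f"not installed: {exc}"


# An empty body and a non-JSON body (for example an HTML error page)
# both fail at the very first character.
empty_result = parse_api_body("")
html_result = parse_api_body("<html>503 Service Unavailable</html>")

# A well-formed JSON body decodes normally.
ok_result = parse_api_body('{"certificates": []}')

print(empty_result)
print(html_result)
print(ok_result)
```

Both failing cases yield the same message seen in the log, so the error text alone does not distinguish an empty response from a non-JSON one.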

Test Activity
-------------
Feature Testing (https://storyboard.openstack.org/#!/story/2010815)

Workaround
----------
NA

Changed in starlingx:
status: New → In Progress
assignee: nobody → Marcelo de Castro Loebens (mdecastr)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distcloud
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/890438
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/c3239bda273fc2036ecd1fc6b11dfa9d549bc0df
Submitter: "Zuul (22348)"
Branch: master

commit c3239bda273fc2036ecd1fc6b11dfa9d549bc0df
Author: Marcelo de Castro Loebens <email address hidden>
Date: Thu Aug 3 12:11:58 2023 -0400

    Remove CA install in SystemController (rehoming)

    A redundancy inserted to fix the SystemController if the secret
    'system-local-ca' was replaced but not installed as a trusted CA in
    the SystemController was causing it to be overwhelmed in case of
    parallel rehomings (the same certificate install call was being made
    by several subclouds at the same time).

    Since this install is not actually required (the CA is expected to
    be installed as trusted CA), we are removing this step. Also added
    a check to avoid running into a 250.001 alarm in the subcloud after
    installing the certificate there.

    Test Plan:
    PASS: Rehome 10 subclouds in parallel.

    Test Plan:
    PASS: With two DX SystemControllers and 10 SX subclouds:
          1) Install and manage all the 10 subclouds in one of the
             SystemControllers;
          2) Unmanage the subclouds, check dates and if the subclouds
             are free of alarms;
          3) In the other SystemController, migrate/rehome the 10
             subclouds in parallel.

    Closes-Bug: 2029510

    Change-Id: Ief56be4bdd0af71a8075aff21e6ebce6798959af
    Signed-off-by: Marcelo de Castro Loebens <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)