Subcloud add failure due to bootstrap replay failure

Bug #2039863 reported by Rupanjan Chakraborty
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Rupanjan Chakraborty

Bug Description

Brief Description

Many subclouds in the scale lab could not be re-added because the bootstrap replay failed.

Severity

Major

Steps to Reproduce

To check whether this is reproducible in DC1000-2, we applied a change and restarted the dcmanager service. As a result, all subclouds that were in the middle of bootstrap failed.

Re-add the 250 subclouds.

Expected Behavior

Subclouds can be re-added without issues.

Actual Behavior

Some subclouds failed to bootstrap with the following error:

      Failed to provision the initial system config.
      Traceback (most recent call last):
        File "/tmp/.ansible-sysadmin/tmp/ansible-tmp-1694037682.4409835-4193870-129384925261196/populate_initial_config.py", line 1326, in <module>
          populate_service_parameter_config(client)
        File "/tmp/.ansible-sysadmin/tmp/ansible-tmp-1694037682.4409835-4193870-129384925261196/populate_initial_config.py", line 1046, in populate_service_parameter_config
          populate_docker_kube_config(client)
        File "/tmp/.ansible-sysadmin/tmp/ansible-tmp-1694037682.4409835-4193870-129384925261196/populate_initial_config.py", line 838, in populate_docker_kube_config
          client.sysinv.service_parameter.delete(parameter.uuid)
        File "/usr/lib/python3/dist-packages/cgtsclient/v1/service_parameter.py", line 45, in delete
          return self._delete(self._path(parameter_id))
        File "/usr/lib/python3/dist-packages/cgtsclient/common/base.py", line 95, in _delete
          self.api.raw_request('DELETE', url)
        File "/usr/lib/python3/dist-packages/cgtsclient/common/http.py", line 224, in raw_request
          return self._http_request(url, method, **kwargs)
        File "/usr/lib/python3/dist-packages/cgtsclient/common/http.py", line 186, in _http_request
          raise exceptions.from_response(
      cgtsclient.exc.HTTPBadRequest: Failure deleting configmap: Kubernetes is not configured. API operations will not be available.
    stdout_lines:

An example of the failed subclouds is subcloud104. Its Ansible bootstrap was terminated in the middle of the following task, which should not affect the bootstrap replay.

TASK [bootstrap/bringup-essential-services : Check controller-0 is in online state] ***
Wednesday 06 September 2023 21:31:46 +0000 (0:00:01.269) 0:36:06.602 ***
FAILED - RETRYING: Check controller-0 is in online state (15 retries left).

Reproducibility

100% reproducible

System Configuration

Distributed Cloud

Alarms

N/A

Test Activity

Developer Testing

Workaround

AWS subcloud: Delete the VM, redeploy the VM, then delete the subcloud and re-add it.

Hardware subcloud: Delete the subcloud and re-add it with the reinstall option.

Changed in starlingx:
assignee: nobody → Rupanjan Chakraborty (rchakrab)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/899303
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/948c93ae7355daf17633304a56bcc11bf4fffc86
Submitter: "Zuul (22348)"
Branch: master

commit 948c93ae7355daf17633304a56bcc11bf4fffc86
Author: Igor Soares <email address hidden>
Date: Wed Oct 25 16:06:48 2023 -0300

    Mark config as incomplete when shutting down k8s

    Delete /etc/platform/.initial_k8s_config_complete file to signal that
    configuration is incomplete when shutting down Kubernetes during
    bootstrap replays.

    The bootstrap playbook creates the
    /etc/platform/.initial_k8s_config_complete file when bringing Kubernetes
    up. This file, however, was not being removed when bringing it down
    during replays which, in turn, was preventing bootstrap from succeeding.

    Deleting service parameters during bootstrap replays relies on the
    .initial_k8s_config_complete file to tell whether configmaps should be
    deleted. If the file is not removed then the service parameter delete
    function mistakenly considers that Kubernetes is running when it is not,
    thus causing a failure when attempting to delete configmaps.

    Test Plan:
    PASS: build-pkgs -a && build-image
    PASS: AIO-SX full deployment without bootstrap replay
    PASS: AIO-SX full deployment with bootstrap replay
    PASS: Install AIO-SX load
          Add apiserver_extra_args:default-not-ready-toleration-seconds
          to localhost.yml
          Run bootstrap playbook
          Replay bootstrap playbook
          Run 'system service-parameter-list' and check if
          default-not-ready-toleration-seconds is listed
          Check if default-not-ready-toleration-seconds was added to
          kube-apiserver parameters
          Unlock node

    Closes-Bug: 2039863

    Change-Id: I29f7cc1766cb8c218908bfb1c130b0a8507a9965
    Signed-off-by: Igor Soares <email address hidden>
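
For readers following the root cause described in the commit message above, here is a minimal Python sketch of the flag-file logic. It is not the actual playbook or sysinv code: the function names (bring_up_k8s, shut_down_k8s, delete_service_parameter, replay) are hypothetical, and a temporary directory stands in for /etc/platform. Only the flag file name .initial_k8s_config_complete comes from the commit message.

```python
import os
import tempfile

# Flag file name taken from the commit message. The real file lives in
# /etc/platform; this sketch uses a temporary directory instead.
FLAG_NAME = ".initial_k8s_config_complete"


def k8s_config_complete(platform_dir):
    """Return True if the flag file says Kubernetes config is complete."""
    return os.path.isfile(os.path.join(platform_dir, FLAG_NAME))


def bring_up_k8s(platform_dir):
    """Hypothetical stand-in for the bring-up step: create the flag file."""
    open(os.path.join(platform_dir, FLAG_NAME), "w").close()


def shut_down_k8s(platform_dir, clear_flag):
    """Hypothetical stand-in for the replay shutdown step.

    clear_flag=False models the behavior before the fix (flag left behind),
    clear_flag=True models the behavior after the fix (flag removed).
    """
    flag = os.path.join(platform_dir, FLAG_NAME)
    if clear_flag and os.path.exists(flag):
        os.remove(flag)


def delete_service_parameter(platform_dir, k8s_running):
    """Hypothetical delete path: touch configmaps only if the flag exists."""
    if k8s_config_complete(platform_dir):
        if not k8s_running:
            # Mirrors the failure mode in the traceback from this report.
            raise RuntimeError(
                "Failure deleting configmap: Kubernetes is not configured."
            )
        return "parameter and configmap deleted"
    return "parameter deleted"


def replay(clear_flag):
    """Simulate bring-up, shutdown for replay, then a parameter delete."""
    with tempfile.TemporaryDirectory() as platform_dir:
        bring_up_k8s(platform_dir)
        shut_down_k8s(platform_dir, clear_flag)
        try:
            return delete_service_parameter(platform_dir, k8s_running=False)
        except RuntimeError as exc:
            return "failed: " + str(exc)
```

Calling replay(False) models the stale-flag case and returns a string starting with "failed:", while replay(True) models the fixed behavior and skips the configmap deletion.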

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ansible-playbooks (master)

Change abandoned by "Rupanjan Chakraborty <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/ansible-playbooks/+/898841

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distcloud