Backup & Restore: AIO-DX restore Failed to provision initial system configuration

Bug #1854172 reported by Senthil Mukundakumar
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Kristine Bujold

Bug Description

Brief Description
-----------------

Restoring the active controller on an AIO-DX system failed with the following error:

TASK [bootstrap/persist-config : Saving config in sysinv database] *************************************
changed: [localhost]

TASK [bootstrap/persist-config : debug] ****************************************************************
ok: [localhost] => {
    "populate_result": {
        "changed": true,
        "failed": false,
        "failed_when_result": false,
        "msg": "non-zero return code",
        "rc": 1,
        "stderr": "No handlers could be found for logger \"cgtsclient.common.http\"\nTraceback (most recent call last):\n File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1574796115.2-85009451792142/populate_initial_config.py\", line 973, in <module>\n populate_docker_config(client)\n File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1574796115.2-85009451792142/populate_initial_config.py\", line 591, in populate_docker_config\n client.sysinv.service_parameter.create(**values)\n File \"/usr/lib64/python2.7/site-packages/cgtsclient/v1/service_parameter.py\", line 42, in create\n return self._create(self._path(), body)\n File \"/usr/lib64/python2.7/site-packages/cgtsclient/common/base.py\", line 51, in _create\n _, body = self.api.json_request('POST', url, body=body)\n File \"/usr/lib64/python2.7/site-packages/cgtsclient/common/http.py\", line 243, in json_request\n method, **kwargs)\n File \"/usr/lib64/python2.7/site-packages/cgtsclient/common/http.py\", line 219, in _cs_request\n error_json.get('debuginfo'), *args)\ncgtsclient.exc.HTTPBadRequest: The service parameter value is restricted to at most 255 characters.\n",
        "stderr_lines": [
            "No handlers could be found for logger \"cgtsclient.common.http\"",
            "Traceback (most recent call last):",
            " File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1574796115.2-85009451792142/populate_initial_config.py\", line 973, in <module>",
            " populate_docker_config(client)",
            " File \"/tmp/.ansible-sysadmin/tmp/ansible-tmp-1574796115.2-85009451792142/populate_initial_config.py\", line 591, in populate_docker_config",
            " client.sysinv.service_parameter.create(**values)",
            " File \"/usr/lib64/python2.7/site-packages/cgtsclient/v1/service_parameter.py\", line 42, in create",
            " return self._create(self._path(), body)",
            " File \"/usr/lib64/python2.7/site-packages/cgtsclient/common/base.py\", line 51, in _create",
            " _, body = self.api.json_request('POST', url, body=body)",
            " File \"/usr/lib64/python2.7/site-packages/cgtsclient/common/http.py\", line 243, in json_request",
            " method, **kwargs)",
            " File \"/usr/lib64/python2.7/site-packages/cgtsclient/common/http.py\", line 219, in _cs_request",
            " error_json.get('debuginfo'), *args)",
            "cgtsclient.exc.HTTPBadRequest: The service parameter value is restricted to at most 255 characters."
        ],
        "stdout": "Populating system config...\nSystem config completed.\nPopulating load config...\nLoad config completed.\nPopulating management network...\nPopulating pxeboot network...\nPopulating oam network...\nPopulating multicast network...\nPopulating cluster host network...\nPopulating cluster pod network...\nPopulating cluster service network...\nNetwork config completed.\nPopulating/Updating DNS config...\nDNS config completed.\nPopulating/Updating docker proxy config...\nFailed to update the initial system config.\n",
        "stdout_lines": [
            "Populating system config...",
            "System config completed.",
            "Populating load config...",
            "Load config completed.",
            "Populating management network...",
            "Populating pxeboot network...",
            "Populating oam network...",
            "Populating multicast network...",
            "Populating cluster host network...",
            "Populating cluster pod network...",
            "Populating cluster service network...",
            "Network config completed.",
            "Populating/Updating DNS config...",
            "DNS config completed.",
            "Populating/Updating docker proxy config...",
            "Failed to update the initial system config."
        ]
    }
}

TASK [bootstrap/persist-config : Fail if populate config script throws an exception] *******************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to provision initial system configuration."}

PLAY RECAP *********************************************************************************************
localhost : ok=199 changed=78 unreachable=0 failed=1

Severity
--------
Critical: Unable to restore active controller in AIO-DX

Steps to Reproduce
------------------
1. Bring up the AIO-DX system
2. Back up the system locally using ansible
3. Re-install the controller with the same load
4. Restore the active controller
5. Unlock active controller

Expected Behavior
------------------
The active controller should be successfully restored and become active

Actual Behavior
----------------
Active controller failed to restore

Reproducibility
---------------
Reproducible (WCP_78_79)

System Configuration
--------------------
Regular System

Branch/Pull Time/Commit
-----------------------
 BUILD_ID="2019-11-25_20-00-00"

Test Activity
-------------
Regression Testing

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

Hello, please provide more information:
1. The configuration of the system this was taken on (setup name may also help)
2. The backup archive
3. The collect before the restore

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Senthil Mukundakumar (smukunda) wrote :

1. Configuration: WCP_78_79 (AIO-DX ipv6 system)
2. I am not able to upload the backup file; Launchpad reports a timeout error.
The backup file has been copied to /folk/cgts_logs/logs/LP-1854172.
3. Collect cannot be done before restore since the system is not active.

description: updated
Changed in starlingx:
status: Incomplete → New
Ghada Khalil (gkhalil)
tags: added: stx.update
Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

Thank you!

Re #3. "Collect cannot be done before restore since the system is not active. " what we need is a collect before starting the restore procedure. Therefore collect should be taken after the backup is done but before the re-installation of controller-0. The system should be fully functional during this time.

Re #2, yes, backup files are big; maybe we should have a large, publicly accessible storage space?

Revision history for this message
Ovidiu Poncea (ovidiuponcea) wrote :

This was introduced in: https://review.opendev.org/#/c/693213/

The code that creates localhost_override_backup.yml takes the no_proxy list from the DB:
        for docker in docker_list:
            # Get the docker no-proxy info if it exists
            if docker.name == constants.SERVICE_PARAM_NAME_DOCKER_NO_PROXY:
                # Remove the open and close parenthesis if address is IPV6
                _value = docker.value.strip("[]")
                no_proxy_list = _value.split(',')
                data.update({'docker_no_proxy': no_proxy_list})

and writes it to localhost_override_backup.yml:

docker_no_proxy:
- localhost
- 127.0.0.1
- registry.local
- '[abcd:204::1]'
- '[abcd:204::2]'
- '[2620:10a:a001:a103::1237]'
- '[2620:10a:a001:a103::1235]'
- '[abcd:204::3]'
- '[2620:10a:a001:a103::1236]'
- tis-lab-registry.cumulus.wrs.com

The problem is that the original list of custom no_proxy registries used at configuration time in localhost.yml is:
docker_no_proxy:
- registry.local
- tis-lab-registry.cumulus.wrs.com

Which ansible bootstrap writes to sysinv as:
localhost,127.0.0.1,registry.local,[abcd:204::1],[abcd:204::2],[2620:10a:a001:a103::1237],[2620:10a:a001:a103::1235],[abcd:204::3],[2620:10a:a001:a103::1236],tis-lab-registry.cumulus.wrs.com

That's because ansible appends the two custom ones to a default list in:
    - name: Add user defined no-proxy address list to default
      set_fact:
        docker_no_proxy_combined: "{{ default_no_proxy | union(docker_no_proxy) | unique | ipwrap }}"

This means that at restore time the list of default entries is doubled, which leads to a string that is too long (hence the "The service parameter value is restricted to at most 255 characters." error).
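
As a rough illustration (not the actual bootstrap code), here is a minimal Python sketch of that restore-time combination, using the values quoted above; ipwrap() below is a hand-rolled stand-in for Ansible's ipwrap filter:

    # Sketch of the restore-time no_proxy combination (values from this report).

    # Defaults that the bootstrap playbook always prepends (bare IPv6 addresses):
    default_no_proxy = [
        "localhost", "127.0.0.1", "registry.local",
        "abcd:204::1", "abcd:204::2",
        "2620:10a:a001:a103::1237", "2620:10a:a001:a103::1235",
    ]

    # docker_no_proxy as read back from the DB into localhost_override_backup.yml
    # (already combined and bracket-wrapped by the original bootstrap):
    docker_no_proxy = [
        "localhost", "127.0.0.1", "registry.local",
        "[abcd:204::1]", "[abcd:204::2]",
        "[2620:10a:a001:a103::1237]", "[2620:10a:a001:a103::1235]",
        "[abcd:204::3]", "[2620:10a:a001:a103::1236]",
        "tis-lab-registry.cumulus.wrs.com",
    ]

    def ipwrap(entries):
        """Stand-in for Ansible's ipwrap filter: bracket bare IPv6 addresses."""
        return [e if (":" not in e or e.startswith("[")) else "[%s]" % e
                for e in entries]

    # union | unique | ipwrap: the bare IPv6 defaults survive alongside their
    # bracketed copies (see the follow-up comments for why), so the list is
    # nearly doubled.
    combined = ipwrap(list(dict.fromkeys(default_no_proxy + docker_no_proxy)))
    value = ",".join(combined)
    print(len(combined), len(value))  # 14 entries; the joined value exceeds the 255-character sysinv limit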

Two possible solutions here:
A. Do not write the default values to localhost_override_backup.yml, or
B. Do not append the default values to the docker_no_proxy_combined list when mode == 'restore', with something like:
    - name: Add user defined no-proxy address list to default
      set_fact:
        docker_no_proxy_combined: "{{ default_no_proxy | union(docker_no_proxy) | unique | ipwrap }}"
      when: mode != 'restore'

    - name: Use the user defined no-proxy address list as-is on restore
      set_fact:
        docker_no_proxy_combined: "{{ docker_no_proxy | unique }}"
      when: mode == 'restore'

What I don't understand is why "unique" doesn't remove duplicates; probably the syntax of the IPv6 addresses is slightly different?
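
For what it's worth, the answer (confirmed in the comments below) appears to be plain string comparison: unique runs before ipwrap, so the bracketed and bare forms of the same IPv6 address never compare equal.

    # unique/union compare plain strings, so neither form is dropped:
    print("[abcd:204::1]" == "abcd:204::1")  # False -> both entries survive
    # Only the later ipwrap filter makes them identical, which is too late
    # for unique to collapse them.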

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.3.0 / high priority - issue introduced by recent code changes and results in breaking Backup & Restore

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Kristine Bujold (kbujold)
tags: added: stx.3.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/696969

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Kristine Bujold (kbujold) wrote :

The unique filter is not deduplicating because it runs before the ipwrap filter: the IPv6 addresses in default_no_proxy are not wrapped in [] and therefore do not match the ones in docker_no_proxy that are. The proposed fix is to apply the ipwrap filter before unique.

TASK [bootstrap/validate-config : debug] ************************************************************************************************************************************************************
ok: [localhost] => {
    "default_no_proxy": [
        "localhost",
        "127.0.0.1",
        "registry.local",
        "abcd:204::1",
        "abcd:204::2",
        "2620:10a:a001:a103::1237",
        "2620:10a:a001:a103::1235"
    ]
}

TASK [bootstrap/validate-config : debug] ************************************************************************************************************************************************************
ok: [localhost] => {
    "docker_no_proxy": [
        "localhost",
        "127.0.0.1",
        "registry.local",
        "[abcd:204::1]",
        "[abcd:204::2]",
        "[2620:10a:a001:a103::1237]",
        "[2620:10a:a001:a103::1235]",
        "[abcd:204::3]",
        "[2620:10a:a001:a103::1236]",
        "tis-lab-registry.cumulus.wrs.com"
    ]
}
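
A small Python sketch of the effect of that reordering, again using a bracket-wrapping stand-in for the ipwrap filter: normalizing the brackets before deduplicating lets the overlapping entries in the two lists above collapse.

    def ipwrap(entries):
        """Stand-in for Ansible's ipwrap filter: bracket bare IPv6 addresses."""
        return [e if (":" not in e or e.startswith("[")) else "[%s]" % e
                for e in entries]

    defaults = ["registry.local", "abcd:204::1"]
    user = ["registry.local", "[abcd:204::1]", "tis-lab-registry.cumulus.wrs.com"]

    # Old ordering (unique before ipwrap): the bare and bracketed forms both survive.
    old = ipwrap(list(dict.fromkeys(defaults + user)))
    # -> ['registry.local', '[abcd:204::1]', '[abcd:204::1]', 'tis-lab-registry.cumulus.wrs.com']

    # Fixed ordering (ipwrap before unique): brackets are normalized first,
    # so the duplicate collapses.
    new = list(dict.fromkeys(ipwrap(defaults) + ipwrap(user)))
    # -> ['registry.local', '[abcd:204::1]', 'tis-lab-registry.cumulus.wrs.com']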

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/696969
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=c1e63734b9654bca633d8380c20beaf118f67b39
Submitter: Zuul
Branch: master

commit c1e63734b9654bca633d8380c20beaf118f67b39
Author: Kristine Bujold <email address hidden>
Date: Mon Dec 2 16:41:53 2019 -0500

    Fix IPV6 restore issue with docker no-proxy

    The unique and ipwrap filter were not properly sequenced and causing
    duplicates in the docker_no_proxy_combined variable.

    Closes-Bug: 1854172
    Change-Id: I07e8031a323abd3cf1d2e765bc49b916adba0e11
    Signed-off-by: Kristine Bujold <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Next step is to cherry-pick the fix to the r/stx.3.0 branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (r/stx.3.0)

Fix proposed to branch: r/stx.3.0
Review: https://review.opendev.org/697182

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (r/stx.3.0)

Reviewed: https://review.opendev.org/697182
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=560278771f14f134e8b02295f2fd021a76c1d82f
Submitter: Zuul
Branch: r/stx.3.0

commit 560278771f14f134e8b02295f2fd021a76c1d82f
Author: Kristine Bujold <email address hidden>
Date: Mon Dec 2 16:41:53 2019 -0500

    Fix IPV6 restore issue with docker no-proxy

    The unique and ipwrap filter were not properly sequenced and causing
    duplicates in the docker_no_proxy_combined variable.

    Closes-Bug: 1854172
    Change-Id: I07e8031a323abd3cf1d2e765bc49b916adba0e11
    Signed-off-by: Kristine Bujold <email address hidden>
    (cherry picked from commit c1e63734b9654bca633d8380c20beaf118f67b39)

Ghada Khalil (gkhalil)
tags: added: in-r-stx30
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Senthil Mukundakumar (smukunda) wrote :

Verified using wcp_78_79 with load 2020-02-18_04-10-00

tags: removed: stx.retestneeded