Distributed Cloud: Delete and re-add subcloud failed at bootstrap after initial configuration failure on controller-0

Bug #1864756 reported by Yosief Gebremariam
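+-----------+--------------+------------+-------------------+-----------+
| Affects   | Status       | Importance | Assigned to       | Milestone |
+-----------+--------------+------------+-------------------+-----------+
| StarlingX | Fix Released | Medium     | Jessica Castelino |           |
+-----------+--------------+------------+-------------------+-----------+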

Bug Description

Brief Description
-----------------
Initially, the "dcmanager subcloud add subcloud4" command failed on the subcloud because the ceph-cluster backend was missing. After removing the subcloud from the DC system, I attempted to re-add it. The replay failed early in bootstrapping the subcloud with the error message below:

failed: [subcloud4] (item={'_ansible_parsed': True, 'stderr_lines': [u'RTNETLINK answers: Cannot assign requested address'], u'changed': True, u'stdout': u'', '_ansible_item_result': True, u'msg': u'non-zero return code', u'delta': u'0:00:00.002213', 'stdout_lines': [], 'failed_when_result': False, '_ansible_item_label': u'ip addr delete aefd::2/64 brd aefd::ffff:ffff:ffff:ffff dev lo:5 scope host', u'end': u'2020-02-25 18:21:44.547733', '_ansible_no_log': False, 'item': u'ip addr delete aefd::2/64 brd aefd::ffff:ffff:ffff:ffff dev lo:5 scope host', u'cmd': u'ip addr delete aefd::2/64 brd aefd::ffff:ffff:ffff:ffff dev lo:5 scope host', u'failed': False, u'stderr': u'RTNETLINK answers: Cannot assign requested address', u'rc': 2, u'invocation': {u'module_args': {u'warn': True, u'executable': None, u'_uses_shell': True, u'_raw_params': u'ip addr delete aefd::2/64 brd aefd::ffff:ffff:ffff:ffff dev lo:5 scope host', u'removes': None, u'argv': None, u'creates': None, u'chdir': None, u'stdin': None}}, u'start': u'2020-02-25 18:21:44.545520', '_ansible_ignore_errors': None}) => {"changed": false, "item": {"changed": true, "cmd": "ip addr delete aefd::2/64 brd aefd::ffff:ffff:ffff:ffff dev lo:5 scope host", "delta": "0:00:00.002213", "end": "2020-02-25 18:21:44.547733", "failed": false, "failed_when_result": false, "invocation": {"module_args": {"_raw_params": "ip addr delete aefd::2/64 brd aefd::ffff:ffff:ffff:ffff dev lo:5 scope host", "_uses_shell": true, "argv": null, "chdir": null, "creates": null, "executable": null, "removes": null, "stdin": null, "warn": true}}, "item": "ip addr delete aefd::2/64 brd aefd::ffff:ffff:ffff:ffff dev lo:5 scope host", "msg": "non-zero return code", "rc": 2, "start": "2020-02-25 18:21:44.545520", "stderr": "RTNETLINK answers: Cannot assign requested address", "stderr_lines": ["RTNETLINK answers: Cannot assign requested address"], "stdout": "", "stdout_lines": []}, "msg": "ip addr delete aefd::2/64 brd aefd::ffff:ffff:ffff:ffff dev lo:5 
scope host failed for reason: RTNETLINK answers: Cannot assign requested address."}

PLAY RECAP *********************************************************************
subcloud4 : ok=147 changed=41 unreachable=0 failed=1

Preliminary assessment from Tao Lui:
During the first deployment, the mgmt/cluster interfaces had already been reconfigured prior to unlock (no longer on lo).
The bootstrap replay failed while removing the cluster IP from the lo interface.
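
The assessment above suggests the delete step assumes the address is still on lo. A minimal sketch of a guarded delete, written in Python rather than the playbook's Ansible; the helper names and the sample `ip -6 addr show` output are hypothetical and only illustrate the idea of checking before deleting:

```python
import ipaddress


def address_on_interface(ip_addr_output, cidr):
    """Return True if `cidr` is listed as an inet/inet6 entry in `ip addr show` text."""
    wanted = ipaddress.ip_interface(cidr)
    for line in ip_addr_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("inet", "inet6"):
            try:
                if ipaddress.ip_interface(fields[1]) == wanted:
                    return True
            except ValueError:
                continue  # not an address/prefix token, skip it
    return False


def delete_command(ip_addr_output, cidr, dev):
    """Build the delete command only when the address is actually present."""
    if not address_on_interface(ip_addr_output, cidr):
        return None  # nothing to remove, so a replay would not fail here
    return ["ip", "addr", "delete", cidr, "dev", dev]


# Hypothetical output of `ip -6 addr show dev lo`, not taken from the lab.
SAMPLE_STILL_ON_LO = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
    inet6 aefd::2/64 scope host
    inet6 ::1/128 scope host
"""
SAMPLE_ALREADY_MOVED = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
    inet6 ::1/128 scope host
"""
```

With the second sample, `delete_command` returns None, which corresponds to the state of subcloud4 on replay. This only shows why the step failed; it is not a claim about how the playbook should be changed.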

Severity
--------
Major

Steps to Reproduce
------------------
1) Set up a DC System Controller
2) Boot a subcloud active controller node
3) Add the subcloud to the DC system: "dcmanager subcloud add subcloud4 ...."
4) The subcloud fails at controller-0 configuration because the ceph-cluster backend is missing
5) Delete the failed subcloud from the DC system (dcmanager subcloud delete subcloud4)
6) Re-add the subcloud with the ceph-cluster backend (dcmanager subcloud add subcloud4 ....)
7) The replay fails early in bootstrapping with the above error message

TC-name:

Expected Behavior
------------------
Subcloud added to DC system successfully on replay

Actual Behavior
----------------
Subcloud add failed early on bootstrapping

Reproducibility
---------------
Tested once

System Configuration
--------------------
DC system

Lab-name: wcp_80-91
subcloud4: wcp_85_86

Branch/Pull Time/Commit
-----------------------
2020-02-24_20-23-53

Last Pass
---------
unknown

Timestamp/Logs
--------------
2020-02-25-18-21-02

+----+-----------+------------+--------------+------------------+---------+
| id | name      | management | availability | deploy status    | sync    |
+----+-----------+------------+--------------+------------------+---------+
| 1  | subcloud1 | unmanaged  | online       | complete         | unknown |
| 2  | subcloud5 | managed    | online       | complete         | in-sync |
| 4  | subcloud4 | unmanaged  | offline      | bootstrap-failed | unknown |
+----+-----------+------------+--------------+------------------+---------+

Revision history for this message
Yosief Gebremariam (ygebrema) wrote :
Yang Liu (yliu12)
summary: - Distributed Cloud: Replay on subcloud failed after initial deployment
- failure
+ Distributed Cloud: Delete and re-add subcloud failed at bootstrap after
+ initial deployment failure
summary: Distributed Cloud: Delete and re-add subcloud failed at bootstrap after
- initial deployment failure
+ initial configuration failure on controller-0
description: updated
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - robustness / error handling

tags: added: stx.4.0 stx.distcloud
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
Dariush Eslimi (deslimi)
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
Changed in starlingx:
assignee: Tee Ngo (teewrs) → Jessica Castelino (jcasteli)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/732395

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud-client (master)

Fix proposed to branch: master
Review: https://review.opendev.org/732401

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/732402

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/732395
Committed: https://git.openstack.org/cgit/starlingx/distcloud/commit/?id=4fd65e9913d28bd097b16ad3bfb80c04ca6b5419
Submitter: Zuul
Branch: master

commit 4fd65e9913d28bd097b16ad3bfb80c04ca6b5419
Author: Jessica Castelino <email address hidden>
Date: Mon May 25 13:51:18 2020 -0400

    CLI command to deploy a subcloud

    If deployment failed, the user has no option than to delete and
    re-add it. If the user was re-adding the subcloud without
    re-installing, it would further result in a bootstrap failure.
    Thus, to simplify things, a new CLI command is provided to allow
    re-deployment. Furthermore, if the user still chooses to delete
    the subcloud and re-add it without a re-install, a better error
    message is provided asking them to re-install the host.

    CLI:
    dcmanager subcloud reconfig <id/name> --deploy-config <file>

    Test Cases:
    1) Successfully add a subcloud with or without deployment option
    2) Fail to re-add a subcloud without re-installation after a failed
       deployment
    3) Re-deploy with new CLI command after successful and unsuccessful deployment
    4) Re-deploy with new CLI command before and after the subcloud is unlocked
    5) Test new CLI command by passing wrong parameters

    Change-Id: I9fe7e3791e3887160668281048c3c12a7f40c2af
    Partial-Bug: 1864756
    Signed-off-by: Jessica Castelino <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud-client (master)

Reviewed: https://review.opendev.org/732401
Committed: https://git.openstack.org/cgit/starlingx/distcloud-client/commit/?id=1c12fada9155c5cf609022e6577f93321d844304
Submitter: Zuul
Branch: master

commit 1c12fada9155c5cf609022e6577f93321d844304
Author: Jessica Castelino <email address hidden>
Date: Thu May 21 14:49:29 2020 -0400

    CLI command to deploy a subcloud

    If deployment failed, the user has no option than to delete and
    re-add it. If the user was re-adding the subcloud without
    re-installing, it would further result in a bootstrap failure.
    Thus, to simplify things, a new CLI command is provided to allow
    re-deployment. Furthermore, if the user still chooses to delete
    the subcloud and re-add it without a re-install, a better error
    message is provided asking them to re-install the host.

    Change-Id: I1db9a172aa063e9e97141fc5e9284e8c477851bf
    Depends-On: https://review.opendev.org/#/c/732395/
    Partial-Bug: 1864756
    Signed-off-by: Jessica Castelino <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/732402
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=2ec82be1ea5e5ba2a7267c438fbd86337399b213
Submitter: Zuul
Branch: master

commit 2ec82be1ea5e5ba2a7267c438fbd86337399b213
Author: Jessica Castelino <email address hidden>
Date: Sun May 31 19:30:02 2020 -0400

    Disallow host re-bootstrap after it has been finalized

    Fail bootstrap play with an assistive error message if
    either the host has been unlocked or host configurations
    have started (i.e. bootstrap_finalized file exists).

    Test Cases:
    1) Successfully bootstrap, configure, unlock an AIOSX
    2) Fail to rebootstrap an AIOSX after unlock
    3) Successfully add a subcloud with or without deployment option
    4) Fail to re-add a subcloud without re-installation after a failed deployment

    Change-Id: I39e9794d1f4d03e61133e8c3225a6dc316407a38
    Depends-On: https://review.opendev.org/#/c/732395/
    Closes-Bug: 1864756
    Signed-off-by: Jessica Castelino <email address hidden>
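
The guard described in this commit can be sketched roughly as follows. This is an illustration in Python, not the playbook task itself; the function name, its parameters, and the wording of the message are assumptions, and the real location of the bootstrap_finalized file is not given in the commit message:

```python
import os


def rebootstrap_error(finalized_flag_path, host_unlocked):
    """Return an error message if the host must not be bootstrapped again, else None.

    `finalized_flag_path` is a hypothetical path to the bootstrap_finalized
    marker; `host_unlocked` says whether the host has already been unlocked.
    """
    if host_unlocked or os.path.exists(finalized_flag_path):
        return (
            "Host has been unlocked or host configuration has started. "
            "Re-install the host before bootstrapping it again."
        )
    return None
```

A caller would fail the play when the return value is not None, so the user sees the re-install hint instead of a low-level error such as the RTNETLINK failure in the description.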

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Yosief Gebremariam (ygebrema) wrote :

This has been tested in a distributed cloud with subclouds. The subcloud config was corrupted initially to reproduce the issue and, as expected, the subcloud deploy failed after a successful bootstrap. The replay was rejected with a message demanding re-installation of the subcloud, because the initial bootstrap had already completed. Re-adding the subcloud after re-install completed successfully.
Tested in build: 2020-06-24_22-16-59

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud-client (f/centos8)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298
Reason: Updated merge soon

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud-client (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud-client/+/792255
Reason: Updated merge coming

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud-client (f/centos8)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud-client (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud-client/+/793407

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud-client (f/centos8)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud-client (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud-client/+/793776

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud-client (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/distcloud-client/+/794794
Committed: https://opendev.org/starlingx/distcloud-client/commit/0d0781278fd07d93b65d0be666dd116f14fa5449
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 8d2b5478c68579c1a9121f71b2085dc397ce9b85
Author: albailey <email address hidden>
Date: Wed May 19 15:47:55 2021 -0500

    Specify the nodeset for zuul jobs

    The py2.7 jobs need to specify xenial
    The py3.6 jobs need to specify bionic
    The focal zuul nodes only have python 3.8 installed in them

    The copyright date was updated in order to trigger
    the zuul jobs, as a no-delta type of change

    Note: pep8 and pylint jobs are not being specified for this repo
    because they specify a generic python3 interpreter, and are
    currently passing in python 3.6 and python 3.8.

    Partial-Bug: 1928978
    Signed-off-by: albailey <email address hidden>
    Change-Id: I4de3a640419ed431619cc4154ab928eebef71280

commit d52a9080082db5fda2e77fb9e342f812ea8c17e1
Author: Rafael Jordão Jardim <email address hidden>
Date: Thu May 6 16:27:06 2021 -0400

    Specify the cacert file in the verify option when in secure mode

    When running in secure mode we want to set the "verify"
    option to the path to the cacert file

    Tests:
    1° I generate a certificate using the documentation and installed on the
    controller (activate https), I got the ca cert and I set up a remote
    cli pointing to a DC and I exported the certificate in a variable
    OS_CACERT to the client get it, I ran some commands from DC.
    2° I passed a flag insecure to execute the dcmanager client
    in a insecure mode
    3° Built an ISO to check if something broke and installed on DC using
    VDM to check if the dcmanager keeps the current behavior

    Note: I think is a good ideia to plan a standardization of the clients
    mainly the keystone (authentication) part

    This change is based on the documentation of the requests documentation
    https://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification

    Closes-bug: 1927723
    Signed-off-by: Rafael Jordão Jardim <email address hidden>
    Change-Id: I4221657b97592b319b3fbf54b5b8c6d325ec9aa3
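
For readers unfamiliar with the `verify` option mentioned in this commit: in the requests library it accepts either a boolean or a path to a CA bundle. A minimal sketch of the selection logic, assuming hypothetical parameter names `insecure` and `cacert` (the actual client code is not shown in this report):

```python
def requests_verify_option(insecure, cacert=None):
    """Pick a value for the `verify` argument of a requests call.

    insecure=True disables certificate verification; otherwise the CA
    certificate path is used when one is given (for example from OS_CACERT),
    and the default trust store is used when it is not.
    """
    if insecure:
        return False
    if cacert:
        return cacert
    return True
```

The result would be passed as `verify=` to a call such as `requests.get(url, verify=...)`.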

commit 859864c21dadf0fc1888f5df94853a3c6d5472ac
Author: Rafael Jordão Jardim <email address hidden>
Date: Wed Apr 7 13:05:53 2021 -0400

    Python 2 to Python 3 compatibility DC

    The code was adding two content-types, and when executing with
    python3 it got an error from the server, cause it was sending
    content-type application json but it was supposed to send
    a form data, so the fix was just add a verification to ensure that is
    not add 2 content-types if it already exists

    Development: When I was trying to find things to modify I followed the
    approach of build the client, get the tar file, I set up 2 environments
    one based on python2 and another python3, I installed the tar client
    in both environments and i exported the env vars that the client expect
    to get to request the controller, and doing that I could switch between
    t...

tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/796528

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/796528
Committed: https://opendev.org/starlingx/distcloud/commit/4c5344f8765b372cb84d2b1181589c16db2ae6e4
Submitter: "Zuul (22348)"
Branch: f/centos8

commit cb979811017bd193fc1f06e53bb7830fd3184859
Author: Yuxing Jiang <email address hidden>
Date: Wed Jun 9 11:11:27 2021 -0400

    Format the IP addresses in payload before adding a subcloud

    The IPv6 addresses can be represented in multiple formats. As IP
    addresses are stored as text in database, ansible inventory and
    overrides, this commit converts the IP addresses in payload to
    standard text format of IPv6 address during adding a new subcloud.

    Tested with installing and bootstrapping a new subcloud(RVMC
    configured) with the correct IPv6 address values, but with
    unrecommended upper case letters and '0'. The addresses are
    converted to standard format in database, ansible inventory and
    overrides files.

    Partial-Bug: 1931459
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I6c26e749941f1ea2597f91886ad8f7da64521f0d
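
The kind of normalization this commit describes can be illustrated with Python's standard ipaddress module. This is a sketch only; `normalize_ip` is a hypothetical name, and the report does not show how dcmanager actually does the conversion:

```python
import ipaddress


def normalize_ip(text):
    """Return the canonical text form of an IPv4 or IPv6 address."""
    return str(ipaddress.ip_address(text))


# Upper-case letters and padded zeros collapse to the compressed form.
print(normalize_ip("AEFD:0000:0000:0000:0000:0000:0000:0002"))  # aefd::2
print(normalize_ip("2001:DB8::0001"))                           # 2001:db8::1
```

Storing only the canonical form means that text comparisons in the database, inventory, and overrides files agree for equivalent addresses.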

commit 2cf5d6d5cef0808c354f7575336aec34253993b3
Author: albailey <email address hidden>
Date: Thu May 20 14:19:24 2021 -0500

    Delete existing vim strategy from subcloud during patch orch

    When dcmanager creates a patch strategy, if a subcloud has an
    existing vim patch strategy, it will attempt to re-use
    that strategy during its patching phase, which may result in an
    error.

    This commit deletes the existing vim patch strategy in
    a subcloud, if it exists, so it can be re-created.
    If the strategy cannot be deleted, orchestration fails.

    Change-Id: Id35ef26ed3ddae6d71874fc6bac11df147f72323
    Closes-Bug: 1929221
    Signed-off-by: albailey <email address hidden>

commit 9e14c83f0162549a2a94cb8bc1e73dbc4f4d4887
Author: albailey <email address hidden>
Date: Tue Jun 1 14:37:14 2021 -0500

    Adding activation retry to upgrade orchestration

    When performing an activation, the keystone endpoints may not
    be accessible in the subcloud due to the asyncronous way that
    cert-mon can trigger a restart of keystone.

    This would have occasionally resulted in the upgrade activation
    failing to be initiated, and orchestration needing to be invoked
    again to resume.

    This 'hack' adds retries and sleeps to the initial
    activation action.

    Change-Id: Ic757521dec7bdc248a51a70b5463caafe7927360
    Partial-Bug: 1927550
    Signed-off-by: albailey <email address hidden>
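
A retry-with-sleep wrapper of the sort this commit mentions might look like the sketch below. The function name, the defaults, and the broad exception handling are assumptions for illustration and are not taken from the orchestration code:

```python
import time


def call_with_retries(action, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` calls have failed.

    Sleeps `delay_seconds` between failed calls and re-raises the last
    error when every attempt fails. `sleep` is injectable for testing.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # real code would catch a narrower type
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error
```

Passing a fake `sleep` keeps a test fast while still recording how many delays occurred.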

commit bb604c0a9b872efd65fa45f1e2269995818c6262
Author: Tee Ngo <email address hidden>
Date: Thu May 27 22:17:16 2021 -0400

    Fix subcloud show --detail command related issues

    If the subcloud is offline, the command stalls and eventually returns
    the "ERROR (app)" output. If the subcloud is online, the oam_floating_ip
    info is excluded from the output when the subcloud id instead of subcloud
    name is specified.

    This commit fixes both of the above issues.

    Closes-Bug: 1929893
    Change-Id: I995591368564539b0e6af185b1adba2db73e0e46
    Sign...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud-client (f/centos8)

Change abandoned by "Bart Wensley <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud-client/+/767072
Reason: This patch has been idle for more than six months. I am abandoning it to keep the review queue sane. If you are still interested in working on this patch, please unabandon it and upload a new patchset.
