Sub clouds going offline and losing sync with the System Controller regularly

Bug #1927007 reported by Thiago Ribeiro Carvalho
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  High        Tao Liu

Bug Description

Brief Description
-----------------
Sub clouds go offline and lose sync with the System Controller because of the following error: "No auth_url provided: HTTPUnauthorized: No auth_url provided".

Severity
--------
Major.

Steps to Reproduce
------------------
1. Bootstrap and manage the sub clouds.
2. Monitor the sub clouds with the "dcmanager subcloud list" command and verify that they repeatedly go offline and lose sync with the System Controller.

Expected Behavior
------------------
Sub clouds stay online/in-sync for as long as they are up and running.

Actual Behavior
----------------
Sub clouds that are online/in-sync suddenly go offline/out-of-sync.

Reproducibility
---------------
Intermittent issue. About 10% of the sub clouds repeatedly go offline and lose sync.

System Configuration
--------------------
Distributed Cloud with sub clouds deployed on AWS.

Branch/Pull Time/Commit
-----------------------
N/A.

Last Pass
---------
N/A.

Timestamp/Logs
--------------
2021-04-19 17:12:53.135 115026 ERROR dccommon.drivers.openstack.sdk_platform [-] keystone_client region subcloud36 error: Unable to establish connection to https://[2620:10a:a001:ac01::482]:5001/v3/auth/tokens: HTTPSConnectionPool(host='2620:10a:a001:ac01::482', port=5001): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f553350cf50>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)): ConnectFailure: Unable to establish connection to https://[2620:10a:a001:ac01::482]:5001/v3/auth/tokens: HTTPSConnectionPool(host='2620:10a:a001:ac01::482', port=5001): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f553350cf50>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2021-04-19 17:12:53.138 115026 ERROR dcmanager.audit.subcloud_audit_worker_manager [-] Identity or Platform endpoint for online subcloud: subcloud36 not found.: ConnectFailure: Unable to establish connection to https://[2620:10a:a001:ac01::482]:5001/v3/auth/tokens: HTTPSConnectionPool(host='2620:10a:a001:ac01::482', port=5001): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f553350cf50>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2021-04-19 17:12:53.139 115026 INFO dcmanager.audit.subcloud_audit_worker_manager [-] Setting new availability status: offline on subcloud: subcloud36
2021-04-19 17:12:54.131 115026 INFO dcmanager.audit.subcloud_audit_worker_manager [-] Notifying dcmanager, subcloud:subcloud36, availability:offline
2021-04-19 17:13:25.050 115026 INFO dcmanager.audit.subcloud_audit_worker_manager [-] Setting new availability status: online on subcloud: subcloud36
2021-04-19 17:13:25.262 115026 INFO dcmanager.audit.subcloud_audit_worker_manager [-] Notifying dcmanager, subcloud:subcloud36, availability:online
2021-04-19 17:13:25.638 115026 INFO dcmanager.audit.patch_audit [-] Triggered patch audit for subcloud: subcloud36.
2021-04-19 17:13:25.777 115026 INFO dccommon.drivers.openstack.sdk_platform [-] Token for subcloud subcloud36 expires_at=2021-04-19T18:13:23.000000Z
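
To measure how long each sub cloud stays offline, the availability transitions in the audit log can be paired up. The following is a minimal sketch that assumes the log line layout shown in the excerpt above; the function name offline_windows and the SAMPLE list are illustrative and not part of dcmanager.

```python
import re
from datetime import datetime

# Matches the "Setting new availability status" lines from the excerpt above.
STATUS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"Setting new availability status: (?P<status>\w+) on subcloud: (?P<name>\S+)"
)


def offline_windows(lines):
    """Return (subcloud, went_offline, came_online, seconds) tuples."""
    offline_since = {}
    windows = []
    for line in lines:
        match = STATUS_RE.match(line)
        if not match:
            continue  # not an availability transition
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        name, status = match["name"], match["status"]
        if status == "offline":
            # Remember only the first offline report of an outage.
            offline_since.setdefault(name, ts)
        elif status == "online" and name in offline_since:
            start = offline_since.pop(name)
            windows.append((name, start, ts, (ts - start).total_seconds()))
    return windows


# Two lines copied from the log excerpt in this report.
SAMPLE = [
    "2021-04-19 17:12:53.139 115026 INFO dcmanager.audit.subcloud_audit_worker_manager [-] "
    "Setting new availability status: offline on subcloud: subcloud36",
    "2021-04-19 17:13:25.050 115026 INFO dcmanager.audit.subcloud_audit_worker_manager [-] "
    "Setting new availability status: online on subcloud: subcloud36",
]

for name, start, end, seconds in offline_windows(SAMPLE):
    print(f"{name} offline for {seconds:.3f}s")
```

For the excerpt in this report, subcloud36 was offline for roughly 32 seconds before the audit marked it online again.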

Test Activity
-------------
System Test.

Workaround
----------
N/A - Sub clouds are able to recover by themselves but can go offline again later.

Tao Liu (tliu88)
Changed in starlingx:
assignee: nobody → Tao Liu (tliu88)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/789572

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote :

Screening: Marking for stx.5.0 since this affects large DC configurations.

tags: added: stx.distcloud
Changed in starlingx:
importance: Undecided → High
tags: added: stx.5.0 stx.6.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/789572
Committed: https://opendev.org/starlingx/distcloud/commit/17b5505d9ea9b149cf28236be3c1b4c263a89ffb
Submitter: "Zuul (22348)"
Branch: master

commit 17b5505d9ea9b149cf28236be3c1b4c263a89ffb
Author: Tao Liu <email address hidden>
Date: Mon May 3 12:32:53 2021 -0400

    Fix Sub clouds going offline due to auth failure

    This update contains the following changes that prevent subclouds
    going offline due to authentication failure:
    1. The os region client cache is cleared when a new keystone client
    is created. The os region client will be re-created using the new
    keystone session.
    2. When the user's access info (such as role id) is changed create
    new keystone client and os region clients. This could happen after
    system controller keystone role ids were synced to subclouds
    3. Remove get_admin_backup_session that was only required when
    upgrading to stx 4.0.
    4. Increase AVAIL_FAIL_COUNT_TO_ALARM to 2 as we don't want to alarm
    first failure since there are cases where we expect a transient
    failure in the subcloud (e.g. haproxy process restart to update
    certificates)

    Tested on DC-6:
    1. Adding 50 subclouds twice
    2. Soaking the fix over the weekend

    Closes-Bug: 1927007

    Signed-off-by: Tao Liu <email address hidden>
    Change-Id: I86fdc9a2f062409e704bdfac2119dc488123f7de
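
Changes 1 and 4 in the commit message above can be illustrated with a minimal sketch. This is not the dccommon or dcmanager code: the RegionClientCache class, its factory arguments, and should_alarm are hypothetical names, and the ">=" comparison is an assumption about how the threshold is applied. Only AVAIL_FAIL_COUNT_TO_ALARM and its value of 2 come from the commit message.

```python
AVAIL_FAIL_COUNT_TO_ALARM = 2  # value named in the commit message


class RegionClientCache:
    """Hypothetical cache illustrating change 1; not the dccommon code."""

    def __init__(self, keystone_factory, region_client_factory):
        self._keystone_factory = keystone_factory
        self._region_client_factory = region_client_factory
        self._keystone = {}        # region -> keystone client
        self._region_clients = {}  # region -> {service: client}

    def keystone_client(self, region, recreate=False):
        if recreate or region not in self._keystone:
            self._keystone[region] = self._keystone_factory(region)
            # Drop clients that were built on the previous keystone session
            # so they are re-created with the new one on next use.
            self._region_clients.pop(region, None)
        return self._keystone[region]

    def region_client(self, region, service):
        keystone = self.keystone_client(region)
        clients = self._region_clients.setdefault(region, {})
        if service not in clients:
            clients[service] = self._region_client_factory(service, keystone)
        return clients[service]


def should_alarm(consecutive_failures):
    """Assumed reading of change 4: no alarm on the first failure."""
    return consecutive_failures >= AVAIL_FAIL_COUNT_TO_ALARM
```

With this shape, a stale region client cannot outlive the keystone session it was created from, and a single transient audit failure (for example during an haproxy restart) does not raise an alarm.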

Changed in starlingx:
status: In Progress → Fix Released
Bill Zvonar (billzvonar)
tags: added: stx.cherrypickneeded
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (r/stx.5.0)

Fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/distcloud/+/789810

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/789810
Committed: https://opendev.org/starlingx/distcloud/commit/d8ce118e50f1d30b9b4d6b3e0b17d45d497ab4af
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit d8ce118e50f1d30b9b4d6b3e0b17d45d497ab4af
Author: Tao Liu <email address hidden>
Date: Mon May 3 12:32:53 2021 -0400

    Fix Sub clouds going offline due to auth failure

    This update contains the following changes that prevent subclouds
    going offline due to authentication failure:
    1. The os region client cache is cleared when a new keystone client
    is created. The os region client will be re-created using the new
    keystone session.
    2. When the user's access info (such as role id) is changed create
    new keystone client and os region clients. This could happen after
    system controller keystone role ids were synced to subclouds
    3. Remove get_admin_backup_session that was only required when
    upgrading to stx 4.0.
    4. Increase AVAIL_FAIL_COUNT_TO_ALARM to 2 as we don't want to alarm
    first failure since there are cases where we expect a transient
    failure in the subcloud (e.g. haproxy process restart to update
    certificates)

    Tested on DC-6:
    1. Adding 50 subclouds twice
    2. Soaking the fix over the weekend

    Closes-Bug: 1927007

    Signed-off-by: Tao Liu <email address hidden>
    Change-Id: I86fdc9a2f062409e704bdfac2119dc488123f7de
    (cherry picked from commit 17b5505d9ea9b149cf28236be3c1b4c263a89ffb)

Ghada Khalil (gkhalil)
tags: added: in-r-stx50
removed: stx.cherrypickneeded
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/792298
Reason: Updated merge soon

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/793405

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/distcloud/+/796528

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/796528
Committed: https://opendev.org/starlingx/distcloud/commit/4c5344f8765b372cb84d2b1181589c16db2ae6e4
Submitter: "Zuul (22348)"
Branch: f/centos8

commit cb979811017bd193fc1f06e53bb7830fd3184859
Author: Yuxing Jiang <email address hidden>
Date: Wed Jun 9 11:11:27 2021 -0400

    Format the IP addresses in payload before adding a subcloud

    The IPv6 addresses can be represented in multiple formats. As IP
    addresses are stored as text in database, ansible inventory and
    overrides, this commit converts the IP addresses in payload to
    standard text format of IPv6 address during adding a new subcloud.

    Tested with installing and bootstrapping a new subcloud(RVMC
    configured) with the correct IPv6 address values, but with
    unrecommended upper case letters and '0'. The addresses are
    converted to standard format in database, ansible inventory and
    overrides files.

    Partial-Bug: 1931459
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I6c26e749941f1ea2597f91886ad8f7da64521f0d
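
The address normalization described in the commit above can be sketched with the Python standard library. This is an illustration only: the commit message does not say which library distcloud uses, and canonical_ip is a hypothetical helper name.

```python
import ipaddress


def canonical_ip(text):
    """Return the standard text form of an IPv4 or IPv6 address.

    For IPv6 this lower-cases hex digits, strips leading zeros and
    compresses the longest run of zero groups to "::".
    """
    return str(ipaddress.ip_address(text))


# The same address as in the log excerpt, written with upper-case letters
# and explicit zeros.
print(canonical_ip("2620:010A:A001:AC01:0000:0000:0000:0482"))
```

Storing only the canonical form means two spellings of the same address compare equal as text in the database, ansible inventory and overrides.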

commit 2cf5d6d5cef0808c354f7575336aec34253993b3
Author: albailey <email address hidden>
Date: Thu May 20 14:19:24 2021 -0500

    Delete existing vim strategy from subcloud during patch orch

    When dcmanager creates a patch strategy, if a subcloud has an
    existing vim patch strategy, it will attempt to re-use
    that strategy during its patching phase, which may result in an
    error.

    This commit deletes the existing vim patch strategy in
    a subcloud, if it exists, so it can be re-created.
    If the strategy cannot be deleted, orchestration fails.

    Change-Id: Id35ef26ed3ddae6d71874fc6bac11df147f72323
    Closes-Bug: 1929221
    Signed-off-by: albailey <email address hidden>

commit 9e14c83f0162549a2a94cb8bc1e73dbc4f4d4887
Author: albailey <email address hidden>
Date: Tue Jun 1 14:37:14 2021 -0500

    Adding activation retry to upgrade orchestration

    When performing an activation, the keystone endpoints may not
    be accessible in the subcloud due to the asynchronous way that
    cert-mon can trigger a restart of keystone.

    This would have occasionally resulted in the upgrade activation
    failing to be initiated, and orchestration needing to be invoked
    again to resume.

    This 'hack' adds retries and sleeps to the initial
    activation action.

    Change-Id: Ic757521dec7bdc248a51a70b5463caafe7927360
    Partial-Bug: 1927550
    Signed-off-by: albailey <email address hidden>
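
The "retries and sleeps" approach mentioned in the commit above has roughly the following shape. This is a generic sketch, not the orchestration code: call_with_retries, its default attempt count and its delay are all assumptions chosen for illustration.

```python
import time


def call_with_retries(action, attempts=3, delay_seconds=10.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts are used up.

    The last exception is re-raised if every attempt fails. The sleep
    argument exists so that callers and tests can substitute a stub.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)  # give the endpoint time to come back
    raise last_error
```

A caller would wrap the activation request in a function and pass it as action, so that a keystone restart in the subcloud delays the activation instead of failing it outright.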

commit bb604c0a9b872efd65fa45f1e2269995818c6262
Author: Tee Ngo <email address hidden>
Date: Thu May 27 22:17:16 2021 -0400

    Fix subcloud show --detail command related issues

    If the subcloud is offline, the command stalls and eventually returns
    the "ERROR (app)" output. If the subcloud is online, the oam_floating_ip
    info is excluded from the output when the subcloud id instead of subcloud
    name is specified.

    This commit fixes both of the above issues.

    Closes-Bug: 1929893
    Change-Id: I995591368564539b0e6af185b1adba2db73e0e46
    Sign...

tags: added: in-f-centos8